STAT 29000: Project 4 — Spring 2021
Motivation: In this project we will continue to hone your web scraping skills, introduce you to some "gotchas", and give you a little bit of exposure to a powerful tool called cron.
Context: We are in the second to last project focused on web scraping. This project will introduce some supplementary tools that work well with web scraping: cron, sending emails from Python, etc.
Scope: python, web scraping, selenium, cron
Questions
Question 1
Check out the following website: project4.tdm.wiki
Use selenium to scrape and print the 6 colors of pants offered.
| You may have to interact with the webpage for certain elements to render. | 
- 
Python code used to solve the problem. 
- 
Output from running your code. 
Question 2
Websites are updated frequently. You can imagine a scenario where a change in a website is a sign that there is more data available, or that something of note has happened. This is a fake website designed to help students emulate real changes to a website. Specifically, there is one part of the website that has two possible states (let’s say, state A and state B). Upon refreshing the website, or scraping the website again, there is an x% chance that the website will be in state A and a 1-x% chance the website will be in state B.
Describe the two states (the thing (element or set of elements) that changes as you refresh the page), and scrape the website enough to estimate x.
| You will need to interact with the website to "see" the change. | 
| Since we are just asking about a state, and not any specific element, you could use the  | 
| Your estimate of x does not need to be perfect. | 
- 
Python code used to solve the problem. 
- 
Output from running your code. 
- 
What state AandBrepresent.
- 
An estimate for x.
Question 3
Dig into the changing "thing" from question (2). What specifically is changing? Use selenium and xpath expressions to scrape and print the content. What are the two possible values for the content?
| Due to the changes that occur when a button is clicked, I’d highly advice you to use the  | 
| 
 | 
- 
Python code used to solve the problem. 
- 
Output from running your code. 
Question 4
The following code allows you to send an email using Python from your Purdue email account. Replace the username and password with your own information and send a test email to yourself to ensure that it works.
| Do NOT include your password in your homework submission. Any time you need to type your password in you final submission just put something like "SUPERSECRETPASSWORD" or "MYPASSWORD". | 
| To include an image (or screenshot) in RMarkdown, try  | 
| The spacing and tabs near the  | 
| Questions 4 and 5 were inspired by examples and borrowed from the code found at the Real Python website. | 
def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message):
  import smtplib, ssl
  from email.mime.text import MIMEText
  from email.mime.multipart import MIMEMultipart
  message = MIMEMultipart("alternative")
  message["Subject"] = my_subject
  message["From"] = my_purdue_email
  message["To"] = to
  # Create the plain-text and HTML version of your message
  text = f'''\
Subject: {my_subject}
To: {to}
From: {my_purdue_email}
{my_message}'''
  html = f'''\
<html>
  <body>
    {my_message}
  </body>
</html>
'''
  # Turn these into plain/html MIMEText objects
  part1 = MIMEText(text, "plain")
  part2 = MIMEText(html, "html")
  # Add HTML/plain-text parts to MIMEMultipart message
  # The email client will try to render the last part first
  message.attach(part1)
  message.attach(part2)
  context = ssl.create_default_context()
  with smtplib.SMTP("smtp.purdue.edu", 587) as server:
    server.ehlo()  # Can be omitted
    server.starttls(context=context)
    server.ehlo()  # Can be omitted
    server.login(my_purdue_email, my_password)
    server.sendmail(my_purdue_email, to, message.as_string())
# this sends an email from kamstut@purdue.edu to mdw@purdue.edu
# replace supersecretpassword with your own password
# do NOT include your password in your homework submission.
send_purdue_email("kamstut@purdue.edu", "supersecretpassword", "mdw@purdue.edu", "put subject here", "put message body here")- 
Python code used to solve the problem. 
- 
Output from running your code. 
- 
Screenshot showing your received the email. 
Question 5
The following is the content of a new Python script called is_in_stock.py:
def send_purdue_email(my_purdue_email, my_password, to, my_subject, my_message):
  import smtplib, ssl
  from email.mime.text import MIMEText
  from email.mime.multipart import MIMEMultipart
  message = MIMEMultipart("alternative")
  message["Subject"] = my_subject
  message["From"] = my_purdue_email
  message["To"] = to
  # Create the plain-text and HTML version of your message
  text = f'''\
Subject: {my_subject}
To: {to}
From: {my_purdue_email}
{my_message}'''
  html = f'''\
<html>
  <body>
    {my_message}
  </body>
</html>
'''
  # Turn these into plain/html MIMEText objects
  part1 = MIMEText(text, "plain")
  part2 = MIMEText(html, "html")
  # Add HTML/plain-text parts to MIMEMultipart message
  # The email client will try to render the last part first
  message.attach(part1)
  message.attach(part2)
  context = ssl.create_default_context()
  with smtplib.SMTP("smtp.purdue.edu", 587) as server:
    server.ehlo()  # Can be omitted
    server.starttls(context=context)
    server.ehlo()  # Can be omitted
    server.login(my_purdue_email, my_password)
    server.sendmail(my_purdue_email, to, message.as_string())
def main():
    # scrape element from question 3
    # does the text indicate it is in stock?
    # if yes, send email to yourself telling you it is in stock.
    # otherwise, gracefully end script using the "pass" Python keyword
if __name__ == "__main__":
    main()First, make a copy of the script in your $HOME directory:
cp /class/datamine/data/scraping/is_in_stock.py $HOME/is_in_stock.py
```If you now look in the "Files" tab in the lower right hand corner of RStudio, and click the refresh button, you should see the file is_in_stock.py. You can open and modify this file directly in RStudio. Before you do so, however, change the permissions of the $HOME/is_in_stock.py script so only YOU can read, write, and execute it:
chmod 700 $HOME/is_in_stock.pyThe script should now appear in RStudio, in your home directory, with the correct permissions. Open the script (in RStudio) and fill in the main function as indicated by the comments. We want the script to scrape to see whether the pants from question 3 are in stock or not.
A cron job is a task that runs at a certain interval. Create a cron job that runs your script, /class/datamine/apps/python/f2020-s2021/env/bin/python $HOME/is_in_stock.py every 5 minutes. Wait 10-15 minutes and verify that it is working properly. The long path, /class/datamine/apps/python/f2020-s2021/env/bin/python simply makes sure that our script is run with access to all of the packages in our course environment. $HOME/is_in_stock.py is the path to your script ($HOME expands or transforms to /home/<my_purdue_alias>).
| If you struggle to use the text editor used with the  | 
| Don’t forget to copy your import statements from question (3) as well. | 
| Once you are finished with the project, if you no longer wish to receive emails every so often, follow the instructions here to remove the cron job. | 
- 
Python code used to solve the problem. 
- 
Output from running your code. 
- 
The content of your cron job in a bash code chunk. 
- 
The content of your is_in_stock.pyscript.