STAT 39000: Project 15 — Fall 2020
Motivation: We’ve done a lot of work with SQL this semester. Let’s review concepts in this project and mix and match R and SQL to solve data-driven problems.
Context: In this project, we will reinforce topics you’ve already learned, with a focus on SQL.
Scope: SQL, sqlite, R
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/movies_and_tv/imdb.db
F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One …". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let’s use SQL and R to take a look at how many times each characters' names appear in the title of the episodes.
Questions
Question 1
Write a query that gets the episode_title_id, primary_title, rating, and votes, of all of the episodes of Friends (title_id is tt0108778).
| You can slightly modify the solution to question (5) in project 13. | 
- 
SQL query used to answer the question. 
- 
First 5 results of the query. 
Question 2
Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes `primary_title`s contained their name?
- 
R code in a code chunk that was used to find the solution. 
- 
The solution pasted below the code chunk. 
Question 3
Create a graphic showing our results in (2) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.
- 
The R code used to generate the graphic. 
- 
The graphic in a png or jpg/jpeg format. 
Question 4
Now we will turn our focus to other information in the database. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see type column from titles table): Romance, Comedy, Animation. In the titles table, you can find the genres in the genres column. There may be some overlap (i.e. a movie may have more than one genre), this is ok.
To query rows which have the genre Action as one of its genres:
SELECT * FROM titles WHERE genres LIKE '%action%';- 
Any code you used to solve the problem in a code chunk. 
- 
The average rating of each of the genres listed for movies. 
Question 5
Write a function called top_episode in R which accepts the path to the imdb.db database, as well as the title_id of a tv series (for example, "tt0108778" or "tt1266020"), and returns the season_number, episode_number, primary_title, and rating of the highest rated episode in the series. Test it out on some of your favorite series, and share the results.
- 
Any code you used to solve the problem in a code chunk. 
- 
The results for at least 3 of your favorite tv series.