Watching Cartoonies

Using IMDb rvest scraper to look at some animated shows

In an earlier post, I used rvest to scrape IMDb episode ratings about The Magicians. Under the section “Advanced Spellcasting”, I made a function that could scrape any show, and I gave an example grabbing some data about The Simpsons. It worked so well last time, let’s keep on using it! Let’s try grabbing some more animated shows and see what we can make.

Take a seat. It’s time to watch cartoons…

…We Came In?

We are going to start where we ended, using the same grab_imdb_ratings() function from the earlier post (with a small tweak).

library(tidyverse) # cause duh
library(rvest) # for web scraping
library(glue) # for that warm and fuzzy fstring feeling (python-like)
library(RColorBrewer) # for a nicer set of colors

theme_set(theme_light())
grab_imdb_ratings <- function(imdb_code, seasons) {
    # Grabbing Rating Data for a show on IMDb
    # 
    # - imdb_code: url code for a given show (the "tt<number_string>" in the url)
    # - seasons: list of desired seasons
  
  # empty list to store dataframes
  df_list = list()
  
  # lez go!
  for (season in seasons) {
    
    # define url, using glue for combining strings
    base_url <- "https://www.imdb.com/title/"
    season_url <- glue("{base_url}{imdb_code}/episodes?season={season}")
    
    # go get the html
    html <- read_html(season_url)
    
    # isolate the desired data
    show <- html_nodes(html, ".parent a") %>% 
      html_text(trim = TRUE)
    title <- html_nodes(html, "#episodes_content strong a") %>% 
      html_text(trim = TRUE)
    rating <- html_nodes(html, ".ipl-rating-star.small .ipl-rating-star__rating") %>% 
      html_text(trim = TRUE) %>% 
      as.numeric()
    votes <- html_nodes(html, ".ipl-rating-star__total-votes") %>% 
      html_text(trim = TRUE) %>% 
      parse_number() # this saved the day! super helpful readr function
    air_date <- html_nodes(html, ".airdate") %>% 
      html_text(trim = TRUE) %>% 
      str_remove("[.]") %>%  # remove periods (May doesn't have a period like the rest: Apr., Oct.)
      parse_date("%d %b %Y")
    
    # make a tibble for each season 
    df <- tibble(show, air_date, title, rating, votes) %>%
      mutate(season = season,
             episode = seq(1, nrow(.))) %>% 
      select(show, season:episode, everything())
    
    # add to list
    df_list[[season]] <- df
  }
  
  # smoosh the list into one tibble
  show_run <- bind_rows(df_list)
  
  return(show_run)
  
}

In this version, we want to know what show each episode belongs to, so I added a show field when parsing the html. Works like a charm!

Notes:

  • So…many…cartoons: I gave this a go with a bunch of different shows, settling on a smaller subset. This is a super fun process to explore these shows, so I’ll probably be grabbing more someday! Interestingly, the code broke for season eight of Bob’s Burger’s. Turns out The episode “The Bleakening” has a separate entry for “The Bleakening - Part Two” with no reviews or ratings, which made my code mad. Would require some code surgery to fix that season’s output.

Watch Cartoons

Here we grab all of the data for a bunch of different cartoons. Focusing on popular and/or long-running animated comedies in primetime (I guess).

# collect cartoon shows
simpsons <- grab_imdb_ratings("tt0096697", c(1:31)) 
family_guy <- grab_imdb_ratings("tt0182576", c(1:18)) 
south_park <- grab_imdb_ratings("tt0121955", c(1:23))
rick_and_morty <- grab_imdb_ratings("tt2861424", c(1:4))
futurama <- grab_imdb_ratings("tt0149460", c(1:7))
king_of_the_hill <- grab_imdb_ratings("tt0118375", c(1:13))

# all-in-one
shows <- bind_rows(simpsons, 
                   family_guy, 
                   south_park, 
                   rick_and_morty,
                   futurama,
                   king_of_the_hill)

In Hell They Make You Watch Them All in a Row

Let’s look at the ratings over time, as the episodes aired.

# Ratings Over Time
plot_title <-  glue("Oh Jeez, Rick! {dim(shows)[1]} Episodes! Oh Man!")
plot_subtitle <- "IMDb Episode Ratings Over Time"

shows %>% 
  ggplot(aes(air_date, rating, color = show)) +
    geom_point(aes(size = votes), alpha = 0.2, stroke = 0) +
    geom_smooth(se = FALSE, linetype = "solid") +
    scale_color_brewer(palette = "Set2") +
    labs(title = plot_title,
         subtitle = plot_subtitle,
         x = "Episode Air Date",
         y = "Weighted Average IMDb Rating",
         caption = "zachbogart.com\nSource: IMDb",
         size = "Total Votes",
         color = "Show")

The plot shows one dot for each episode, placed according to the date it was aired and it’s weighted average IMDB rating (explained, sort of). The size of each dot is mapped to how many users voted to create the IMDb rating. Finally, some geom_smooth() lines to see the overall trend for each show.

The Simpsons started in the 1990’s (just kissed 1989 with their inaugural Christmas special) and set the stage for the other shows that followed. There first decade has proven to have been the highest rated (as is the case with the other long-running shows pictured). Over time, they have had a slight, but steady decrease, yet they are still quite highly-rated after three decades on the air.

Other shows followed in the late 90’s/early aughts. They all start at similar levels to the yellow juggernaut, but some shows have been able to keep a higher rating for a longer stretch. South Park for example, has kept a higher average rating than The Simpsons in the early years, and they’re also comfortably above other shows that started around a similar time.

Finally, Rick and Morty, though much newer than the others, has had a very high approval rating across its run. It is also interesting that, of all the shows, it has the most user votes for each episode (bigger dots) by a big margin, with close to ten thousand reviews per episode.

# look at average episode votes for each show
shows %>% 
  count(show, 
        wt = mean(votes), 
        sort = TRUE,
        name = "avg_votes_per_episode")
## # A tibble: 6 x 2
##   show             avg_votes_per_episode
##   <chr>                            <dbl>
## 1 Rick and Morty                   9556.
## 2 South Park                       2047.
## 3 Futurama                         1491.
## 4 The Simpsons                     1344.
## 5 Family Guy                       1145.
## 6 King of the Hill                  199.

The Terrifying Lows. The Dizzying Highs. The Creamy Middles.

Now let’s look at the distribution of each show’s ratings. To spice up the plot, let’s include the names of the episodes that earn the highest and lowest rating for each show. I made them as separate tibbles and added them as geom_text() layers.

# get highest/lowest rated episodes, breaking ties
best_rated <- shows %>% 
  group_by(show) %>% 
  slice_max(rating, with_ties = FALSE)

worst_rated <- shows %>% 
  group_by(show) %>% 
  slice_min(rating, with_ties = FALSE)

Here’s a look at each show’s collection of episodes as a boxplot.

# horizontal boxplots
plot_title <-  "Atlantis, Baby!"
plot_subtitle <- "IMDb Episode Ratings with Highest/Lowest-Rated Episode Titles"

text_size = 3
text_offset = 0.7
text_wrap_limit = 17

shows %>% 
  ggplot(aes(x = fct_reorder(show, rating, "median"), y = rating, color = show)) +
    geom_boxplot() +
    geom_jitter(alpha = 0.1) +
    # use best/worst rated data
    geom_text(data = best_rated, 
              aes(y = rating + text_offset, label = str_wrap(title, text_wrap_limit)),
              size = text_size, fontface = "bold") +
    geom_text(data = worst_rated, 
              aes(y = rating - text_offset, label = str_wrap(title, text_wrap_limit)),
              size = text_size, fontface = "bold") +
    ylim(2.6, 10.8) +
    scale_color_brewer(palette = "Set2") +
    guides(color = FALSE) +
    labs(x = "",
         y = "Weighted Average IMDb Rating",
         title = plot_title,
         subtitle = plot_subtitle,
         caption = "zachbogart.com\nSource: IMDb") +
    coord_flip()

Interesting to see how different shows span different rating ranges. Granted it has the smallest number of episodes (41), but Rick and Morty has a very narrow distribution of ratings, all clustered high up the scale. At the other extreme, The Simpsons has a ridiculous number of episodes (683), which span a much wider range. It is also noteworthy that while many shows have episodes averaging a six, The Simpsons and Family Guy are the only ones with episodes that dip below four. While corner cases, they both seem to have an episode that is quite disliked by viewers (or at least those willing to say something on IMDb). Do any of the best/worst episodes ring a bell? Do you agree with IMDb?

Isn’t This Where…

Kinda feel like I’m repeating myself, but this was such a breeze to setup/joy to play with since the earlier post did the heavy-lifting (yes, future-me is thankful). Maybe I’ll grab some more shows in the future. I’m sure there is a lot to dig up and since things are all in place, it should be easy to ask questions about multiple shows together. Should make for some good viewing.

Getting More Complicated

  • Could look into getting the actual reviews people leave about episodes or shows. Getting my feet wet with sentiment analysis would be worth doing.
  • I think I’m not done with The Simpsons just yet. Would be interesting to see that show’s long run split up a bit, maybe tell some more specific stories (season-by-season, Treehouse of Horror, etc.)

Alright. Till next time!

<vsauce_voice>And as always…thanks for watching.</vsauce_voice>

Learning

  • Color was an accident in terms of getting The Simpsons to be yellow. Would want to get better at manually binding palette colors to a given part of the data.
  • Factoring out styling variables proved useful when I was deciding between displaying a horizontal versus vertical boxplot. Allowed for easy tweaking across multiple plots.
  • Would be nice to allow for links to headers in a post in blogdown. I found some things online, but I’ll have to look it over more thoroughly.
  • Another blogdown question: how to get twitter URLs to render as cards with an image. With Twitter’s Card Validator, it doesn’t seem to include an image. Wondering how I add one without getting too much into the HUGO weeds.

Image Credit

large sofa by Zach Bogart from the Noun Project

Zach Bogart
Zach Bogart
Data Explorer

Science, Design, & Data. I’ll know it when I see it.

Related