Looking into YouTube Free Movies

You get what you (don’t) pay for

I noticed that YouTube has a wide selection of movies, but the free ones didn’t seem very good. I was interested in seeing what kind of ratings the free movies on YouTube receive. Let’s figure it out!

Go to the Movies

To start, let’s import some things.

import pandas as pd
import requests
from bs4 import BeautifulSoup

import seaborn as sns
import matplotlib.pyplot as plt

We can use requests to grab the HTML.

url = 'https://www.youtube.com/feed/storefront?bp=kgEmCGQSIlBMSFBUeFR4dEMwaWJWWnJUMl9XS1dVbDJTQXhzS3VLd3iiBQIoAg%3D%3D'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify()[:200])

<!DOCTYPE html>
<html data-cast-api-enabled="true" lang="en">
 <head>
  <style name="www-roboto">
   @font-face{font-family:'Roboto';font-style:normal;font-weight:500;src:local('Roboto Medium'),local(

At this point, we can do some inspecting to find that each film has a yt-lockup-title class that we can access. With a little bit of fiddling, we can get a dataframe of the movie names. Each string also follows a common format ("Title - Duration: H:MM:SS."), which we can use to split it into separate columns.

html_films = soup.find_all(class_="yt-lockup-title")

for film in html_films[:5]:
    print(film.get_text())
Dino King - Duration: 1:28:47.
Snow Queen - Duration: 1:16:07.
Beyond Beyond - Duration: 1:19:24.
Igor (U.S) - Duration: 1:26:31.
Sleepover - Duration: 1:29:29.
movies = [film.get_text() for film in html_films]
movies[:6]
['Dino King - Duration: 1:28:47.',
 'Snow Queen - Duration: 1:16:07.',
 'Beyond Beyond - Duration: 1:19:24.',
 'Igor (U.S) - Duration: 1:26:31.',
 'Sleepover - Duration: 1:29:29.',
 'The Secret of Nimh - Duration: 1:22:46.']
df = pd.DataFrame(movies)
df.rename(columns={0: 'movie'}, inplace=True)
df[df.movie.str.contains(' - Duration: ')].head()

movie
0 Dino King - Duration: 1:28:47.
1 Snow Queen - Duration: 1:16:07.
2 Beyond Beyond - Duration: 1:19:24.
3 Igor (U.S) - Duration: 1:26:31.
4 Sleepover - Duration: 1:29:29.
df = df.movie.str.split(' - Duration: ', expand=True)
df[1] = df[1].str.rstrip('.')
df = df.reset_index()
df.rename(columns={0: 'yt_title', 1: 'yt_duration', 'index': 'yt_id'}, inplace=True)
df.head()

yt_id yt_title yt_duration
0 0 Dino King 1:28:47
1 1 Snow Queen 1:16:07
2 2 Beyond Beyond 1:19:24
3 3 Igor (U.S) 1:26:31
4 4 Sleepover 1:29:29
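The split above relies on every string containing ' - Duration: '. As a minimal sketch of a more defensive version, the block below uses a regular expression anchored at the end of the string and falls back to no duration when the pattern is missing. The names LOCKUP_RE and parse_lockup are hypothetical, and the third sample string is made up to show the fallback.

```python
import re

# Matches strings such as "Dino King - Duration: 1:28:47."
# Anchoring at the end keeps any " - " inside the title itself intact.
LOCKUP_RE = re.compile(r"^(?P<title>.+?) - Duration: (?P<duration>[\d:]+)\.?$")

def parse_lockup(text):
    """Return (title, duration), or (text, None) when no duration is present."""
    text = text.strip()
    match = LOCKUP_RE.match(text)
    if match is None:
        return text, None
    return match.group("title"), match.group("duration")

samples = [
    "Dino King - Duration: 1:28:47.",
    "Igor (U.S) - Duration: 1:26:31.",
    "Free to watch",  # made-up string without a duration
]
parsed = [parse_lockup(s) for s in samples]
```

The resulting list of tuples could be passed to pd.DataFrame with columns=['yt_title', 'yt_duration'] to get the same shape as the table above.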

Convert Duration to minutes

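The duration string isn’t much use for analysis. Let’s turn it into whole minutes (dropping the seconds).

def split_time(x):
    # Durations look like 'H:MM:SS'; pad a shorter 'MM:SS' value with zero hours
    numbers = [int(n) for n in x.split(':')]
    numbers = [0] * (3 - len(numbers)) + numbers
    hours, minutes, _seconds = numbers
    return hours * 60 + minutes
df['yt_minutes'] = df['yt_duration'].apply(split_time)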

IMDb Data

  • We will use data from the IMDb datasets to get info on the movie ratings.
# low_memory=False reads each file in one pass, which avoids the mixed-type DtypeWarning
imdb_ratings = pd.read_csv('/Users/zachbogart/Downloads/title.ratings.tsv', sep='\t', low_memory=False)
imdb_basics = pd.read_csv('/Users/zachbogart/Downloads/title.basics.tsv', sep='\t', low_memory=False)
imdb = imdb_basics.merge(imdb_ratings, how='left', on='tconst')
imdb.shape
(6831547, 11)

Let’s look only at the items that are movies.

imdb.titleType.value_counts()
tvEpisode       4869408
short            741081
movie            551301
video            265727
tvSeries         184466
tvMovie          121175
tvMiniSeries      31078
tvSpecial         29209
videoGame         25548
tvShort           12554
Name: titleType, dtype: int64
imdb = imdb.loc[imdb.titleType == 'movie']
imdb.shape
(551301, 11)
imdb.head()

tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres averageRating numVotes
8 tt0000009 movie Miss Jerry Miss Jerry 0 1894 \N 45 Romance 5.9 153.0
145 tt0000147 movie The Corbett-Fitzsimmons Fight The Corbett-Fitzsimmons Fight 0 1897 \N 20 Documentary,News,Sport 5.2 346.0
332 tt0000335 movie Soldiers of the Cross Soldiers of the Cross 0 1900 \N \N Biography,Drama 6.1 40.0
499 tt0000502 movie Bohemios Bohemios 0 1905 \N 100 \N 3.8 6.0
571 tt0000574 movie The Story of the Kelly Gang The Story of the Kelly Gang 0 1906 \N 70 Biography,Crime,Drama 6.1 574.0
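The preview shows that the IMDb files mark missing values with the text \N, which pandas does not treat as null by default. As a minimal sketch, the block below reads a tiny in-memory stand-in for title.basics.tsv (two rows copied from the preview, with fewer columns) twice, once as-is and once with na_values, to show the difference.

```python
import io
import pandas as pd

# Toy stand-in for title.basics.tsv; values copied from the preview above.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000009\tmovie\tMiss Jerry\t1894\t45\n"
    "tt0000335\tmovie\tSoldiers of the Cross\t1900\t\\N\n"
)

# Default read: "\N" stays a plain string, so the column is text.
raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# With na_values, "\N" becomes NaN and the column is parsed as numeric.
clean = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

raw_missing = int(raw["runtimeMinutes"].isna().sum())
clean_missing = int(clean["runtimeMinutes"].isna().sum())
```

One thing to watch if this is applied to the full files: endYear is \N for the movie rows shown above, so a plain dropna() on the merged frame would then discard them. Something like dropna(subset=['averageRating']) would be the safer choice in that case.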

Try Joining

We are playing this pretty fast and loose, but we can get a rough sense of the movie ratings using a simple join and dropping any corner cases.

joined = df.merge(imdb, how='left', left_on='yt_title', right_on='primaryTitle')
# drop rows with any null value (titles with no IMDb match, movies with no rating)
joined = joined.dropna().sort_values('primaryTitle')

Many titles match several IMDb entries. Let’s not deal with that overlap and keep only the movies that have exactly one match. (We could also match on duration to recover more movies, but let’s ignore that for now.)

joined.yt_title.value_counts()
Romeo and Juliet      10
Happily Ever After    10
Zoo                   10
The Suspect            8
Gone                   8
                      ..
A Cowgirl's Story      1
Mad Money              1
Bakery in Brooklyn     1
Snow Queen             1
17 Miracles            1
Name: yt_title, Length: 262, dtype: int64
singles = joined.groupby('yt_title')['yt_title'].filter(lambda x: len(x) == 1)
easy = joined[joined.yt_title.isin(singles)]
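The text above mentions matching on duration as a way to recover titles with several IMDb candidates. Here is a minimal sketch of that idea, not something the original analysis ran. It assumes the joined frame has the yt_id, yt_minutes and runtimeMinutes columns built earlier. The helper name match_by_runtime, the 3-minute tolerance, and the toy rows (including the tconst values) are all made up for illustration.

```python
import pandas as pd

def match_by_runtime(joined, tolerance=3):
    """Keep, per YouTube listing, the IMDb candidate closest in runtime."""
    out = joined.copy()
    # runtimeMinutes arrives as text and may hold IMDb's "\N" marker.
    out["runtime"] = pd.to_numeric(out["runtimeMinutes"], errors="coerce")
    out["gap"] = (out["runtime"] - out["yt_minutes"]).abs()
    # Rows with an unknown runtime have a NaN gap and are dropped here too.
    out = out[out["gap"] <= tolerance]
    # idxmin returns the row label with the smallest gap in each group.
    best = out.groupby("yt_id")["gap"].idxmin()
    return out.loc[best].drop(columns=["runtime", "gap"])

# Toy data: two candidates for one listing, one listing with no usable runtime.
toy = pd.DataFrame({
    "yt_id": [0, 0, 1],
    "yt_title": ["Zoo", "Zoo", "Gone"],
    "yt_minutes": [93, 93, 88],
    "tconst": ["tt_a", "tt_b", "tt_c"],  # hypothetical identifiers
    "runtimeMinutes": ["94", "130", "\\N"],
})
matched = match_by_runtime(toy)
```

The YouTube duration and the IMDb runtime can differ by a few minutes, so the tolerance is a guess that would need tuning against real matches.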

What Do We Find?

The Highest (and Lowest) Rated

  • What are the top-rated movies available? And the worst-rated?
cols = ['yt_title', 'averageRating', 'numVotes']
easy[cols].sort_values('averageRating', ascending=False).head(10)

yt_title averageRating numVotes
478 The Usual Suspects 8.5 956313.0
429 Zeitgeist: Moving Forward 8.2 17246.0
440 Bones Brigade: An Autobiography 8.1 2306.0
433 Requiem for the American Dream 8.1 8297.0
712 Citizenfour 8.0 50670.0
413 Sound City 7.8 11887.0
308 Muscle Shoals 7.7 3400.0
303 Fat, Sick & Nearly Dead 7.5 8338.0
272 The Secret of Roan Inish 7.5 6792.0
319 Unbranded 7.4 1326.0
easy[cols].sort_values('averageRating').head(10)

yt_title averageRating numVotes
10 Disaster Movie 1.9 85558.0
636 Alcatraz 3.3 359.0
169 Spiders 3D 3.3 3960.0
707 The Remains 3.7 2870.0
513 Fist of the Warrior 3.8 369.0
740 Fishing Naked 4.0 782.0
160 Ribbit 4.0 611.0
74 Hollow Creek 4.2 592.0
327 Arthur & Merlin 4.3 1226.0
611 Out of Liberty 4.4 262.0

How Old Are These Movies?

  • Most are a few years old. Some are decades old.
# .assign returns a new DataFrame, so nothing is written to a slice of joined
easy = easy.assign(startYear=pd.to_numeric(easy.startYear))
g = sns.relplot(data=easy,
                x='startYear',
                y='averageRating',
                kind='scatter',
                hue='yt_minutes')
plt.show()

[Figure: scatter plot of averageRating against startYear, with points colored by yt_minutes]

Scraping the Bottom of the Barrel

  • Looks like the movies are pretty poorly rated, averaging around 6 out of 10
ax = easy.averageRating.hist()
plt.show()

[Figure: histogram of averageRating]

print(f"{easy.averageRating.mean():.2f}")
5.95
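An average of 5.95 is hard to judge without something to compare it to. As a minimal sketch, assuming the easy and imdb frames built earlier, the hypothetical helper rating_gap below (not part of the original notebook) subtracts the mean rating of a reference set from the mean of a sample. The toy frames are made up so the block runs on its own.

```python
import pandas as pd

def rating_gap(sample, population, col="averageRating"):
    """Mean of `col` in `sample` minus its mean in `population` (NaN skipped)."""
    return sample[col].mean() - population[col].mean()

# Made-up ratings, only to show the call; these are not real IMDb values.
toy_sample = pd.DataFrame({"averageRating": [5.0, 6.0]})
toy_population = pd.DataFrame({"averageRating": [6.0, 7.0, None]})
gap = rating_gap(toy_sample, toy_population)
```

With the real data the call would be rating_gap(easy, imdb). A negative result would mean the free movies sit below the average rated IMDb movie.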

Overall

You get what you (don’t) pay for.

Image Credit

integrated system by Zach Bogart from the Noun Project

Zach Bogart