Luke Gulson: https://lgulson12.github.io/Data-Science-Final
Light housekeeping to get pandas working and properly connected to Google Colab.
import pandas as pd
import numpy as np
import re
pd.set_option("display.max_rows", None, "display.max_columns", None)
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Colab_Notebooks/Final_Tutorial
By Luke Gulson
Project Goals
The goal of this project is to explore whether TV viewers currently favor shorter, more quickly consumable content. People tend to associate today's younger generations (Millennials/Gen Z) with shorter attention spans, and I set out to see whether data about television viewership supports this generalization. Additionally, I've sampled some data on trending YouTube videos to see if the same pattern emerged there. Specifically, I wanted to see whether there were any trends or correlations between the runtimes of these visual media forms and the reception they have received.
By analyzing the relationships between the following variables, I believe we can work toward answering that question.
The first two tables here will be based on five different datasets, all acquired from Kaggle, which together contain data on over 20,000 different shows and movies available on the top streaming services: Netflix, HBO Max, Disney+, Amazon Prime, and Hulu. The data covers titles available in the United States as of May through July 2022. Each unique title has a number of corresponding variables:
id: The title ID on JustWatch.
title: The name of the title.
show_type: TV show or movie.
description: A brief description.
release_year: The release year.
age_certification: The age certification.
runtime: The length of the episode (SHOW) or movie.
genres: A list of genres.
production_countries: A list of countries that produced the title.
seasons: Number of seasons if it's a SHOW.
imdb_id: The title ID on IMDB.
imdb_score: Score on IMDB.
imdb_votes: Votes on IMDB.
tmdb_popularity: Popularity on TMDB.
tmdb_score: Score on TMDB.

While all of these variables are nice and dandy, for our purposes we'll only be looking at the variables runtime, imdb_score, imdb_votes, tmdb_popularity, and tmdb_score.
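As a quick illustration of working with just those columns, here's a minimal sketch on a toy DataFrame (the values are hypothetical stand-ins, not the actual Kaggle data):

```python
import pandas as pd

# Toy stand-in for one of the streaming-service tables (hypothetical values).
toy = pd.DataFrame({
    "title": ["Show A", "Show B"],
    "runtime": [22, 45],
    "imdb_score": [8.1, 6.4],
    "imdb_votes": [12000, 800],
    "tmdb_popularity": [35.2, 4.7],
    "tmdb_score": [7.9, 6.0],
})

# Keep only the five variables the analysis relies on.
cols_of_interest = ["runtime", "imdb_score", "imdb_votes", "tmdb_popularity", "tmdb_score"]
subset = toy[cols_of_interest]
```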
For clarification, IMDB (Internet Movie DataBase) and TMDB (The Movie DataBase) are separate but very similar sites; TMDB is completely powered by its community, while IMDB sources from internal critics as well as audience reviews. For the purpose of gaining a broader understanding of the subject matter, I'll be analyzing rating scores from both of these sites separately and together.
Now for the other tables I'll be showing, I'll be using another dataset acquired from Kaggle. This dataset includes many variables corresponding to 115 of the highest-trending videos from March 2007 to June 2016. Each video observation contains the following variables:
Position: An index column.
Channel Id: ID of the YouTube channel.
Channel Title: YouTube channel title.
Video Id: ID of the video in the YouTube channel.
Published At: Date the video was published.
Video Title: Title of the video.
Video Description: Description of the video (what the video is about).
Video Category Id: Category ID of the video.
Video Category Label: Type of category the video belongs to.
Duration: Duration of the video.
Duration Sec: Duration of the video in seconds.
Dimension: Dimension of the video (2D, HD).
Definition: Definition of the video.
Caption: Boolean caption flag (True or False).
Licensed Content: Whether the video contains licensed content.
View Count: Number of people who viewed the video.
Like Count: Number of likes the video got.
Dislike Count: Number of dislikes the video got.
Favorite Count: Number of people who marked the video as a favorite.
Comment Count: Number of people who commented on the video.

While there are a great many variables to be looked at here, we'll only be using durationSec, viewCount, likeCount, and dislikeCount.
In order to actually compare these variables, I first need to manipulate the datasets a bit, with the final goal being one large DataFrame that contains all of the unique TV shows from these different streaming services.
First we need to read in all of the datasets that we have.
netflix_df = pd.read_csv(r'netflix_titles.csv') #import Netflix movies and shows data to a DF
hbo_max_df = pd.read_csv(r'hbo_max_titles.csv') #import HBO Max movies and shows data to a DF
disneyp_df = pd.read_csv(r'disneyp_titles.csv') #import Disney+ movies and shows data to a DF
amazonp_df = pd.read_csv(r'amazonp_titles.csv') #import Amazon Prime movies and shows data to a DF
hulu_df = pd.read_csv(r'hulu_titles.csv') #import Hulu movies and shows data to a DF
youtube_df = pd.read_csv(r'Trending videos on youtube dataset.csv') #import trending YouTube video data to a DF
Then we can move on to combining all of the DataFrames into one that contains only shows, not movies. I decided movies are consumed less regularly and are therefore less relevant to this analysis.
#Creates one cohesive DataFrame by merging each streaming service's DataFrame together.
shared_cols = ['id', 'title', 'type', 'description', 'release_year', 'age_certification', 'runtime', 'genres', 'production_countries', 'seasons', 'imdb_id', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
combined_df = netflix_df.merge(hbo_max_df, on = shared_cols, how = 'outer')
combined_df = combined_df.merge(disneyp_df, on = shared_cols, how = 'outer')
combined_df = combined_df.merge(amazonp_df, on = shared_cols, how = 'outer')
combined_df = combined_df.merge(hulu_df, on = shared_cols, how = 'outer')
#Removes all instances of shows that appear on multiple platforms so each title appears only once.
#(drop_duplicates returns a new DataFrame, so the result must be assigned back.)
combined_df = combined_df.drop_duplicates()
#Makes a DataFrame which only contains observations that are shows, and not movies.
all_shows_df = combined_df[combined_df.type == 'SHOW']
Now, I've merged all of the databases from the streaming services together and made a new DataFrame containing only TV shows.
#Display the type of each variable in the DataFrame
all_shows_df.dtypes
We can see here that the variables runtime, seasons, imdb_score, imdb_votes, tmdb_popularity, and tmdb_score are already stored as either int64 or float64. This is helpful because we can use them directly as numerical data in our visualizations. Had they been represented as objects or strings, I would've had to cast them to an int or float myself.
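For reference, had a column come in as strings, the cast could use pandas' to_numeric; a minimal sketch on toy values (not the actual data):

```python
import pandas as pd

# Hypothetical runtime column read in as strings rather than numbers.
raw = pd.Series(["25", "42", "N/A"])

# errors='coerce' turns unparseable entries into NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")
```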
#Display the data type of all of youtube_df's features
youtube_df.dtypes
Just as with the show DataFrames, we can now see that durationSec, viewCount, likeCount, and dislikeCount are all represented as either int64 or float64. This is beneficial to us, as we can manipulate them as numerical data to view relationships and perform the necessary calculations.
Now we can do some work to display our representation of the relationship between runtime and IMDB scores.
#Finds the 10th percentile of imdb_votes
votes_minimum = all_shows_df.imdb_votes.quantile(0.10)
This value is useful, as it gives us a basis for 'enough votes' to qualify as a legitimate contribution to the analysis. I chose the 10th percentile, as it keeps the large majority of the shows while excluding outliers with so few votes that a handful of reviews could greatly skew the data and make it harder to reach an accurate conclusion.
#Creates a DataFrame that includes only shows that are above the established vote minimum
imdb_quantiled_df = all_shows_df[all_shows_df.imdb_votes > votes_minimum]
#Displays a scatter plot comparing runtime and imdb_score
imdb_quantiled_df.plot.scatter(x = 'runtime', y= 'imdb_score', s = 5)
With this chart we can see a representation of the relationship between a show's runtime and its IMDB score. At first glance, this data appears to counter my claim: shows with longer runtimes appear to be reviewed just as positively or negatively as shows with shorter runtimes.
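The eyeball test can be backed up with a number using pandas' built-in Pearson correlation. Here's a minimal sketch on toy values (the full DataFrame isn't reproduced here):

```python
import pandas as pd

# Toy data: runtimes paired with essentially unrelated scores (hypothetical values).
toy = pd.DataFrame({
    "runtime": [22, 30, 45, 60, 90],
    "imdb_score": [7.1, 6.8, 7.3, 6.9, 7.0],
})

# A Pearson r near 0 would support "no visible relationship between runtime and score".
r = toy["runtime"].corr(toy["imdb_score"])
```

On the real data, the same one-liner on imdb_quantiled_df's runtime and imdb_score columns would quantify what the scatter plot suggests.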
Now we can start a similar process to find a visual representation of the relationship between TV show runtimes and their respective tmdb scores.
#Finds the 10th percentile of tmdb_popularity
popularity_minimum = all_shows_df.tmdb_popularity.quantile(0.10)
While the variable tmdb_popularity represents a slightly different quantity from imdb_votes, it is still a good idea to remove the least popular shows from the dataset, since their few reviews could greatly skew the data in a way that would make it harder to reach an accurate conclusion. As with imdb_votes, I'll be cutting this data down using the 10th-percentile value as a baseline.
#Creates a new DataFrame with only the shows that are above the established tmdb_popularity minimum
tmdb_quantiled_df = all_shows_df[all_shows_df.tmdb_popularity > popularity_minimum]
tmdb_quantiled_df.plot.scatter(x = 'runtime', y = 'tmdb_score', s = 3)
With this chart we can see a representation of the relationship between a show's runtime and its TMDB score. Once again, this data appears to counter my claim: there is almost no visible correlation between runtimes and TMDB scores.
For an extra layer of analysis, I'll be looking at a new variable which combines and averages the IMDB and TMDB scores of each show. This way we can analyze the data with less dependence on either reviewing site's individual bias.
First, I'll establish the same minimum constraints on a DataFrame as was done with the first two tables, based on imdb_votes and tmdb_popularity respectively.
#Creates a new DataFrame that has the constraints of both votes_minimum and popularity_minimum
#(.copy() so we can safely add a new column to it later)
imdb_tmdb_df = all_shows_df[(all_shows_df.imdb_votes > votes_minimum) & (all_shows_df.tmdb_popularity > popularity_minimum)].copy()
Now, I'll create a new variable, titled combined_score, which will represent the average of a particular show's imdb_score and tmdb_score.
#Creates the variable combined_score, which averages the imdb_score and the tmdb_score
imdb_tmdb_df['combined_score'] = (imdb_tmdb_df['imdb_score'] + imdb_tmdb_df['tmdb_score']) / 2
Time to display the data in another scatterplot.
#Displays a scatterplot which compares runtime and combined_score
imdb_tmdb_df.plot.scatter(x = 'runtime', y = 'combined_score', s = 3)
To not much surprise, this graph does not provide any support for the idea of decreased appreciation of long-form video. It doesn't suggest that any particular show length is more or less likely to be reviewed well.
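Another way to probe the same question is to bin runtimes into bands and compare the mean combined score per band; a sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical shows: runtimes in minutes paired with combined scores.
toy = pd.DataFrame({
    "runtime": [15, 25, 35, 48, 55, 70, 95, 120],
    "combined_score": [7.0, 6.5, 7.2, 6.8, 7.1, 6.9, 7.0, 6.7],
})

# Group shows into short / medium / long runtime bands and average scores per band.
toy["runtime_band"] = pd.cut(toy["runtime"], bins=[0, 30, 60, 180])
band_means = toy.groupby("runtime_band", observed=True)["combined_score"].mean()
```

If my claim held, the long band's mean would sit clearly below the short band's; near-equal band means point the same way as the scatter plot.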
For the next table we'll be looking at, we'll move on to analyzing the dataset from YouTube.
First, let's just take a look at the relationship between viewCount and durationSec. This will give an indication of whether the length of a video has an impact on how popular it is.
#Display youtube_df as a scatterplot that compares durationSec and viewCount
youtube_df.plot.scatter(x = 'durationSec', y = 'viewCount', s = 5)
As we can see, this graph doesn't really display anything readable. Let's cut down this DataFrame so that it doesn't display the unnecessary outliers.
#Create a new DataFrame with maximum constraints on durationSec and viewCount
#(.copy() so we can safely add a new column to it later)
youtube_cut = youtube_df[youtube_df['durationSec'] < 1500]
youtube_cut = youtube_cut[youtube_cut['viewCount'] < 10000].copy()
Now let's take a look at this slimmed down graph.
#Display youtube_cut as a scatterplot that compares durationSec and viewCount
youtube_cut.plot.scatter(x = 'durationSec', y = 'viewCount', s = 10)
This plot doesn't read very well, but with the help of a logarithmic transformation we can notice a bit more of a pattern.
#Create a new variable viewCountLog that represents a log base 2 transformation of the viewCount variable
youtube_cut['viewCountLog'] = np.log2(youtube_cut['viewCount'])
youtube_cut.plot.scatter(x = 'durationSec', y = 'viewCountLog', s = 5)
As we can see here, the data becomes sparser as durationSec increases, but the plot we've created reflects a very slight positive relationship. It is not definitively indicative of a real correlation, but it provides a bit of insight.
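That 'very slight positive relationship' can be quantified with a rank-based (Spearman) correlation, which is less sensitive to the skew that remains after the log transform; a sketch on toy values (not the real dataset):

```python
import pandas as pd

# Hypothetical durations (seconds) paired with log2-transformed view counts.
toy = pd.DataFrame({
    "durationSec": [60, 120, 240, 480, 900],
    "viewCountLog": [10.1, 10.4, 10.3, 10.9, 11.2],
})

# Spearman correlates the ranks of the values rather than the raw values.
rho = toy["durationSec"].corr(toy["viewCountLog"], method="spearman")
```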
Now we'll be taking a look at how video duration impacts a particular video's reviews. Specifically, we'll be comparing durationSec to a calculated difference of likeCount and dislikeCount.

First, let's create a new feature in the DataFrame that will represent the overall review score of a video based on its likeCount and dislikeCount. We'll call this new feature reviewScore.
#Computes and adds the new feature: reviewScore to youtube_df
youtube_df['reviewScore'] = youtube_df['likeCount'] - youtube_df['dislikeCount']
Now that we have that, let's take a look at a graphical representation of that relationship.
#Displays a scatterplot comparing durationSec and reviewScore
youtube_df.plot.scatter(x = 'durationSec', y = 'reviewScore', s = 5)
We can see that the same issue is presenting itself as with Table 4, so let's slim it down again.
#Creates a new DataFrame with a maximum constraint on durationSec
#(.copy() so we can safely add a new column to it later)
youtube_slimmed = youtube_df[youtube_df['durationSec'] < 1000].copy()
Let's take a look at that graph again.
#Displays a scatterplot comparing durationSec and reviewScore
youtube_slimmed.plot.scatter(x = 'durationSec', y = 'reviewScore', s = 5)
This is another situation in which a logarithmic transformation would provide a bit more insight.
#Create a new variable reviewScoreLog that represents a log base 2 transformation of the reviewScore variable
youtube_slimmed['reviewScoreLog'] = np.log2(youtube_slimmed['reviewScore'])
youtube_slimmed.plot.scatter(x = 'durationSec', y = 'reviewScoreLog', s = 5)
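One caveat with this particular transform: np.log2 returns NaN for negative reviewScore values and -inf for zero, so videos with more dislikes than likes silently drop out of (or distort) the plot. A sketch of guarding against that, on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical review scores, including zero and a net-negative score.
toy = pd.DataFrame({"reviewScore": [128, 0, -5, 32]})

# Keep only strictly positive scores before taking the log.
positive = toy[toy["reviewScore"] > 0].copy()
positive["reviewScoreLog"] = np.log2(positive["reviewScore"])
```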
This scatter plot really just reads as an almost even distribution of scores. Once again, we're left with a graphical representation of this data that does not demonstrate any relationship between video runtime and overall reception.
Starting with the TV show data: Let's begin by finding that curve
Currently, it seems that the further I delve into this topic, the less of a relationship I find where I thought there would be one. Perhaps one could simply conclude that digital media is not indicative of this change in human attention span. I do believe there's more to be seen here and more that I could discover. Perhaps it's time to shift my analysis toward a side question that is more insightful and actually supported by trends that I find. I've thought of two plausible models that I could build in order to further develop my investigation from here.
I believe that shifting my focus towards these two models could move me towards some other questions, specifically what makes a particular piece of video media 'good'. I believe that this line of analysis will make my work more insightful and creative going forward.
Linked below are the sources where I found the datasets I used throughout:
Below is just a way of converting my file to HTML.
%%shell
jupyter nbconvert /content/drive/MyDrive/Colab_Notebooks/Final_Tutorial/Luke_Gulson_Final_Tutorial.ipynb