This project extracts movie data from LetterBoxd and performs Exploratory Data Analysis to understand trends in movies and different movie genres.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
df = pd.read_json("feb2021.json")
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | |
---|---|---|---|---|---|---|---|
0 | Parasite | 2019.0 | 133 | [comedy, drama, thriller] | 4.6 | Korean | Weighted average of 4.60 based on 705,993Â ratings |
1 | Call Me by Your Name | 2017.0 | 132 | [drama, romance] | 4.1 | English | Weighted average of 4.09 based on 355,937Â ratings |
2 | Hereditary | 2018.0 | 127 | [thriller, horror, mystery] | 4.0 | English | Weighted average of 3.99 based on 365,955Â ratings |
3 | Spirited Away | 2001.0 | 125 | [fantasy, animation, family] | 4.5 | Japanese | Weighted average of 4.47 based on 386,917Â ratings |
4 | Moonlight | 2016.0 | 111 | [drama] | 4.2 | English | Weighted average of 4.20 based on 347,384Â ratings |
df.shape
(94051, 7)
df.columns
Index(['title', 'year', 'duration', 'genre', 'rating', 'language', 'viewers'], dtype='object')
df.isnull().sum()
title 0
year 3379
duration 0
genre 0
rating 70251
language 586
viewers 70251
dtype: int64
df = df.loc[~(df.year.isnull())]
df = df.loc[~(df.rating.isnull())]
df = df.loc[~(df.language.isnull())]
df.isnull().sum()
title 0
year 0
duration 0
genre 0
rating 0
language 0
viewers 0
dtype: int64
df.columns
Index(['title', 'year', 'duration', 'genre', 'rating', 'language', 'viewers'], dtype='object')
df.year = df.year.astype('int')
df.year = df.year.astype('category')
df.year.value_counts()
2019 1247
2018 1218
2017 1196
2016 1090
2015 1003
...
1895 8
1894 5
1893 1
1888 1
1886 1
Name: year, Length: 131, dtype: int64
df.loc[(df.duration=="")]
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | |
---|---|---|---|---|---|---|---|
715 | The King's Man | 2021 | [action, adventure, comedy] | 3.3 | English | Weighted average of 3.29 based on 50Â ratings | |
969 | The Matrix Resurrections | 2021 | [action, adventure, science fiction] | 3.8 | English | Weighted average of 3.76 based on 97Â ratings | |
1958 | Coming 2 America | 2021 | [family, comedy] | 3.3 | English | Weighted average of 3.27 based on 57Â ratings | |
2270 | Minions: The Rise of Gru | 2021 | [family, action, adventure, comedy, animation] | 2.6 | English | Weighted average of 2.65 based on 237Â ratings | |
2891 | Hotel Transylvania 4 | 2021 | [adventure, animation, family, fantasy] | 3.3 | English | Weighted average of 3.30 based on 113Â ratings | |
... | ... | ... | ... | ... | ... | ... | ... |
41910 | Murdered | 2012 | [crime] | 3.0 | French | Weighted average of 3.05 based on 30Â ratings | |
42148 | New Moon | 2010 | [action] | 2.9 | English | Weighted average of 2.90 based on 31Â ratings | |
42179 | Uthaman | 2001 | [] | 3.1 | Malayalam | Weighted average of 3.08 based on 32Â ratings | |
42906 | Daniël Arends: Joko 79 | 2008 | [comedy] | 3.4 | Dutch | Weighted average of 3.42 based on 30 ratings | |
45660 | Simiocracia (CrĂłnica de la gran resaca econĂłmica) | 2012 | [comedy, animation, documentary] | 3.3 | Spanish | Weighted average of 3.25 based on 32Â ratings |
851 rows Ă— 7 columns
len(df.loc[(df.duration=="")])
851
df =df.loc[~(df.duration=="")]
df.duration = df.duration.astype("int")
df.duration.describe()
count 22793.000000
mean 84.253631
std 69.173581
min 1.000000
25% 65.000000
50% 90.000000
75% 103.000000
max 3708.000000
Name: duration, dtype: float64
df.columns
Index(['title', 'year', 'duration', 'genre', 'rating', 'language', 'viewers'], dtype='object')
df.genre.head()
0 [comedy, drama, thriller]
1 [drama, romance]
2 [thriller, horror, mystery]
3 [fantasy, animation, family]
4 [drama]
Name: genre, dtype: object
myGenre = set()
df.shape
(22793, 7)
df = df.reset_index()
df.drop("index", axis=1, inplace=True)
for i in range(0,len(df)):
for movGen in df.loc[i].genre:
myGenre.add(movGen)
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | |
---|---|---|---|---|---|---|---|
0 | Parasite | 2019 | 133 | [comedy, drama, thriller] | 4.6 | Korean | Weighted average of 4.60 based on 705,993Â ratings |
1 | Call Me by Your Name | 2017 | 132 | [drama, romance] | 4.1 | English | Weighted average of 4.09 based on 355,937Â ratings |
2 | Hereditary | 2018 | 127 | [thriller, horror, mystery] | 4.0 | English | Weighted average of 3.99 based on 365,955Â ratings |
3 | Spirited Away | 2001 | 125 | [fantasy, animation, family] | 4.5 | Japanese | Weighted average of 4.47 based on 386,917Â ratings |
4 | Moonlight | 2016 | 111 | [drama] | 4.2 | English | Weighted average of 4.20 based on 347,384Â ratings |
myGenre
{'action',
'adventure',
'animation',
'comedy',
'crime',
'documentary',
'drama',
'family',
'fantasy',
'history',
'horror',
'music',
'mystery',
'romance',
'science fiction',
'thriller',
'tv movie',
'war',
'western'}
len(myGenre)
19
for col in myGenre:
df[col] = 0
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | thriller | tv movie | mystery | ... | action | music | history | comedy | science fiction | romance | crime | animation | war | adventure | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Parasite | 2019 | 133 | [comedy, drama, thriller] | 4.6 | Korean | Weighted average of 4.60 based on 705,993Â ratings | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | Call Me by Your Name | 2017 | 132 | [drama, romance] | 4.1 | English | Weighted average of 4.09 based on 355,937Â ratings | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | Hereditary | 2018 | 127 | [thriller, horror, mystery] | 4.0 | English | Weighted average of 3.99 based on 365,955Â ratings | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | Spirited Away | 2001 | 125 | [fantasy, animation, family] | 4.5 | Japanese | Weighted average of 4.47 based on 386,917Â ratings | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | Moonlight | 2016 | 111 | [drama] | 4.2 | English | Weighted average of 4.20 based on 347,384Â ratings | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows Ă— 26 columns
for i in range(0,len(df)):
for x in df.loc[i]["genre"]:
df.at[i, x]=1
df.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | thriller | tv movie | mystery | ... | action | music | history | comedy | science fiction | romance | crime | animation | war | adventure | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Parasite | 2019 | 133 | [comedy, drama, thriller] | 4.6 | Korean | Weighted average of 4.60 based on 705,993Â ratings | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | Call Me by Your Name | 2017 | 132 | [drama, romance] | 4.1 | English | Weighted average of 4.09 based on 355,937Â ratings | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | Hereditary | 2018 | 127 | [thriller, horror, mystery] | 4.0 | English | Weighted average of 3.99 based on 365,955Â ratings | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | Spirited Away | 2001 | 125 | [fantasy, animation, family] | 4.5 | Japanese | Weighted average of 4.47 based on 386,917Â ratings | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | Moonlight | 2016 | 111 | [drama] | 4.2 | English | Weighted average of 4.20 based on 347,384Â ratings | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows Ă— 26 columns
df.columns
Index(['title', 'year', 'duration', 'genre', 'rating', 'language', 'viewers',
'thriller', 'tv movie', 'mystery', 'documentary', 'western', 'family',
'horror', 'drama', 'fantasy', 'action', 'music', 'history', 'comedy',
'science fiction', 'romance', 'crime', 'animation', 'war', 'adventure'],
dtype='object')
df.rating = df.rating.astype("float")
df.rating.describe()
count 22793.000000
mean 3.137270
std 0.429866
min 1.000000
25% 2.900000
50% 3.200000
75% 3.400000
max 4.600000
Name: rating, dtype: float64
df.language.value_counts().head(10)
English 13344
French 1276
Japanese 1078
Spanish 952
German 644
Portuguese 588
Italian 547
Chinese 340
Hindi 315
Korean 303
Name: language, dtype: int64
df.viewers.head()
0 Weighted average of 4.60 based on 705,993Â ratings
1 Weighted average of 4.09 based on 355,937Â ratings
2 Weighted average of 3.99 based on 365,955Â ratings
3 Weighted average of 4.47 based on 386,917Â ratings
4 Weighted average of 4.20 based on 347,384Â ratings
Name: viewers, dtype: object
df.viewers = df.viewers.apply(lambda x: int(x.split()[-2].replace(",","")))
df.viewers.describe()
count 22793.000000
mean 2443.237705
std 18353.907074
min 30.000000
25% 50.000000
50% 102.000000
75% 324.000000
max 705993.000000
Name: viewers, dtype: float64
fig = plt.figure(figsize=(20,6))
axis1 = fig.add_subplot(1,2,1)
axis2 = fig.add_subplot(1,2,2)
sns.boxplot(x = "rating", data=df, ax=axis1)
sns.boxplot(x = "duration", data=df, ax=axis2)
plt.show()
sns.boxplot(x = "viewers", data=df)
plt.show()
df.columns
Index(['title', 'year', 'duration', 'genre', 'rating', 'language', 'viewers',
'thriller', 'tv movie', 'mystery', 'documentary', 'western', 'family',
'horror', 'drama', 'fantasy', 'action', 'music', 'history', 'comedy',
'science fiction', 'romance', 'crime', 'animation', 'war', 'adventure'],
dtype='object')
df.columns[0:7]
Index(['title', 'year', 'duration', 'genre', 'rating', 'language', 'viewers'], dtype='object')
df.columns[7:]
Index(['thriller', 'tv movie', 'mystery', 'documentary', 'western', 'family',
'horror', 'drama', 'fantasy', 'action', 'music', 'history', 'comedy',
'science fiction', 'romance', 'crime', 'animation', 'war', 'adventure'],
dtype='object')
pd.melt(df, id_vars = df.columns[0:7], value_vars=df.columns[7:])
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | variable | value | |
---|---|---|---|---|---|---|---|---|---|
0 | Parasite | 2019 | 133 | [comedy, drama, thriller] | 4.6 | Korean | 705993 | thriller | 1 |
1 | Call Me by Your Name | 2017 | 132 | [drama, romance] | 4.1 | English | 355937 | thriller | 0 |
2 | Hereditary | 2018 | 127 | [thriller, horror, mystery] | 4.0 | English | 365955 | thriller | 1 |
3 | Spirited Away | 2001 | 125 | [fantasy, animation, family] | 4.5 | Japanese | 386917 | thriller | 0 |
4 | Moonlight | 2016 | 111 | [drama] | 4.2 | English | 347384 | thriller | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
433062 | Streets of Rio | 2007 | 97 | [drama] | 3.1 | English | 30 | adventure | 0 |
433063 | Whose Dominion? | 2017 | 10 | [history, documentary] | 2.9 | English | 41 | adventure | 0 |
433064 | Lazania | 2019 | 90 | [comedy] | 2.4 | Persian (Farsi) | 33 | adventure | 0 |
433065 | Prisoners | 2019 | 101 | [comedy] | 2.5 | Persian (Farsi) | 30 | adventure | 0 |
433066 | Muraren | 2002 | 95 | [] | 3.5 | English | 33 | adventure | 0 |
433067 rows Ă— 9 columns
df_pivot = pd.melt(df, id_vars = df.columns[0:7], value_vars=df.columns[7:])
df_pivot
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | variable | value | |
---|---|---|---|---|---|---|---|---|---|
0 | Parasite | 2019 | 133 | [comedy, drama, thriller] | 4.6 | Korean | 705993 | thriller | 1 |
1 | Call Me by Your Name | 2017 | 132 | [drama, romance] | 4.1 | English | 355937 | thriller | 0 |
2 | Hereditary | 2018 | 127 | [thriller, horror, mystery] | 4.0 | English | 365955 | thriller | 1 |
3 | Spirited Away | 2001 | 125 | [fantasy, animation, family] | 4.5 | Japanese | 386917 | thriller | 0 |
4 | Moonlight | 2016 | 111 | [drama] | 4.2 | English | 347384 | thriller | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
433062 | Streets of Rio | 2007 | 97 | [drama] | 3.1 | English | 30 | adventure | 0 |
433063 | Whose Dominion? | 2017 | 10 | [history, documentary] | 2.9 | English | 41 | adventure | 0 |
433064 | Lazania | 2019 | 90 | [comedy] | 2.4 | Persian (Farsi) | 33 | adventure | 0 |
433065 | Prisoners | 2019 | 101 | [comedy] | 2.5 | Persian (Farsi) | 30 | adventure | 0 |
433066 | Muraren | 2002 | 95 | [] | 3.5 | English | 33 | adventure | 0 |
433067 rows Ă— 9 columns
df_pivot = df_pivot.loc[df_pivot.value==1]
df_pivot.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | variable | value | |
---|---|---|---|---|---|---|---|---|---|
0 | Parasite | 2019 | 133 | [comedy, drama, thriller] | 4.6 | Korean | 705993 | thriller | 1 |
2 | Hereditary | 2018 | 127 | [thriller, horror, mystery] | 4.0 | English | 365955 | thriller | 1 |
7 | 1917 | 2019 | 119 | [thriller, war, drama, action] | 4.1 | English | 358009 | thriller | 1 |
8 | Gone Girl | 2014 | 149 | [drama, thriller, mystery] | 4.0 | English | 445648 | thriller | 1 |
10 | The Lighthouse | 2019 | 109 | [thriller, fantasy, drama, horror] | 4.1 | English | 304207 | thriller | 1 |
df_pivot.shape
(39384, 9)
df_pivot.loc[df_pivot.title=="Soul"]
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | genre | rating | language | viewers | variable | value | |
---|---|---|---|---|---|---|---|---|---|
113982 | Soul | 2020 | 102 | [family, drama, music, comedy, animation, fant... | 4.1 | English | 281921 | family | 1 |
159568 | Soul | 2020 | 102 | [family, drama, music, comedy, animation, fant... | 4.1 | English | 281921 | drama | 1 |
182361 | Soul | 2020 | 102 | [family, drama, music, comedy, animation, fant... | 4.1 | English | 281921 | fantasy | 1 |
227947 | Soul | 2020 | 102 | [family, drama, music, comedy, animation, fant... | 4.1 | English | 281921 | music | 1 |
273533 | Soul | 2020 | 102 | [family, drama, music, comedy, animation, fant... | 4.1 | English | 281921 | comedy | 1 |
364705 | Soul | 2020 | 102 | [family, drama, music, comedy, animation, fant... | 4.1 | English | 281921 | animation | 1 |
df_pivot.drop(["genre", "value"], axis=1, inplace=True)
df_pivot.head()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
title | year | duration | rating | language | viewers | variable | |
---|---|---|---|---|---|---|---|
0 | Parasite | 2019 | 133 | 4.6 | Korean | 705993 | thriller |
2 | Hereditary | 2018 | 127 | 4.0 | English | 365955 | thriller |
7 | 1917 | 2019 | 119 | 4.1 | English | 358009 | thriller |
8 | Gone Girl | 2014 | 149 | 4.0 | English | 445648 | thriller |
10 | The Lighthouse | 2019 | 109 | 4.1 | English | 304207 | thriller |
df_pivot.rename(columns={"variable": "genre"}, inplace=True)
df = df_pivot
def annotate_graph_both(ax):
total_height = 0
for bar in ax.patches:
total_height=total_height+bar.get_height()
# for bar in ax.patches:
# ax.annotate(format((bar.get_height()/total_height )* 100, '.1f')+" %",
# (bar.get_x() + bar.get_width() / 2, bar.get_height()/2),
# ha='center', va='center',
# size=10, xytext=(0, 8),
# textcoords='offset points')
for bar in ax.patches:
ax.annotate(format((bar.get_height()), '.2f'),
(bar.get_x() + bar.get_width() / 2, bar.get_height()),
ha='center', va='center',
size=10, xytext=(0, 8),
textcoords='offset points')
return ax
grouped_rating_median = df.loc[:,['genre', 'rating']].groupby(['genre']).median().sort_values(by='rating')
grouped_rating_median = pd.DataFrame(grouped_rating_median)
grouped_rating_median = grouped_rating_median.reset_index()
grouped_rating_median
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
genre | rating | |
---|---|---|
0 | horror | 2.8 |
1 | thriller | 2.9 |
2 | science fiction | 2.9 |
3 | action | 3.0 |
4 | adventure | 3.0 |
5 | tv movie | 3.0 |
6 | romance | 3.1 |
7 | mystery | 3.1 |
8 | western | 3.1 |
9 | family | 3.1 |
10 | crime | 3.1 |
11 | comedy | 3.1 |
12 | fantasy | 3.1 |
13 | war | 3.2 |
14 | drama | 3.2 |
15 | animation | 3.3 |
16 | history | 3.3 |
17 | music | 3.4 |
18 | documentary | 3.4 |
grouped_rating_mean = df.loc[:,['genre', 'rating']].groupby(['genre']).mean().sort_values(by='rating')
grouped_rating_mean = pd.DataFrame(grouped_rating_mean)
grouped_rating_mean = grouped_rating_mean.reset_index()
grouped_rating_mean
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
genre | rating | |
---|---|---|
0 | horror | 2.800131 |
1 | science fiction | 2.889947 |
2 | thriller | 2.918565 |
3 | action | 2.938684 |
4 | adventure | 2.986467 |
5 | family | 2.991188 |
6 | tv movie | 3.015921 |
7 | fantasy | 3.041466 |
8 | mystery | 3.053564 |
9 | comedy | 3.057474 |
10 | romance | 3.078985 |
11 | crime | 3.104104 |
12 | western | 3.125157 |
13 | drama | 3.192025 |
14 | war | 3.228829 |
15 | animation | 3.239053 |
16 | history | 3.295238 |
17 | music | 3.354450 |
18 | documentary | 3.410717 |
fig = plt.figure(figsize=(20,8), dpi=600)
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
sns.despine()
fig1=sns.barplot(x="genre",y="rating",data=grouped_rating_median, order=grouped_rating_median.genre, ax=ax1)
annotate_graph_both(fig1)
fig1.set_ylabel("median rating")
sns.despine()
fig2=sns.barplot(x="genre",y="rating",data=grouped_rating_mean, order=grouped_rating_mean.genre, ax=ax2)
annotate_graph_both(fig2)
fig2.set_ylabel("avg. rating")
plt.show()
COMEDY IS HEAVILY AFFECTED BY OUTLIERS
fig = plt.figure(figsize=(20,6))
sns.boxplot(x = "genre", y="rating" ,data=df, order=grouped.index)
plt.show()
fig = plt.figure(figsize=(20,6))
sns.violinplot(x = "genre", y="rating" ,data=df, order=grouped_rating_median.index)
plt.show()
#--------------------------------------------- DURATOIN -------------------------------------------------#
grouped_duration_median = df.loc[:,['genre', 'duration']].groupby(['genre']).median().sort_values(by='duration')
grouped_duration_median = pd.DataFrame(grouped_duration_median)
grouped_duration_median = grouped_duration_median.reset_index()
grouped_duration_median
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
genre | duration | |
---|---|---|
0 | animation | 11 |
1 | documentary | 78 |
2 | horror | 88 |
3 | tv movie | 89 |
4 | family | 89 |
5 | fantasy | 89 |
6 | science fiction | 89 |
7 | comedy | 91 |
8 | music | 92 |
9 | western | 93 |
10 | mystery | 93 |
11 | adventure | 94 |
12 | crime | 95 |
13 | thriller | 95 |
14 | action | 96 |
15 | drama | 97 |
16 | romance | 99 |
17 | war | 102 |
18 | history | 106 |
grouped_duration_mean = df.loc[:,['genre', 'duration']].groupby(['genre']).mean().sort_values(by='duration')
grouped_duration_mean = pd.DataFrame(grouped_duration_mean)
grouped_duration_mean = grouped_duration_mean.reset_index()
grouped_duration_mean
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
genre | duration | |
---|---|---|
0 | animation | 49.370629 |
1 | documentary | 76.599204 |
2 | fantasy | 77.324976 |
3 | horror | 79.642826 |
4 | family | 80.514555 |
5 | science fiction | 81.720459 |
6 | comedy | 85.384798 |
7 | tv movie | 88.597368 |
8 | western | 89.503145 |
9 | music | 90.194379 |
10 | adventure | 94.015736 |
11 | thriller | 96.223235 |
12 | mystery | 99.488470 |
13 | romance | 99.847783 |
14 | action | 100.359537 |
15 | drama | 101.025998 |
16 | crime | 101.373134 |
17 | war | 103.414414 |
18 | history | 112.403880 |
fig = plt.figure(figsize=(20,8), dpi=600)
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
sns.despine()
fig1=sns.barplot(x="genre",y="duration",data=grouped_duration_median, order=grouped_duration_median.genre, ax=ax1)
annotate_graph_both(fig1)
fig1.set_ylabel("median duration")
sns.despine()
fig2=sns.barplot(x="genre",y="duration",data=grouped_duration_mean, order=grouped_duration_mean.genre, ax=ax2)
annotate_graph_both(fig2)
fig2.set_ylabel("avg. duration")
plt.show()
fig = plt.figure(figsize=(20,6))
sns.boxplot(x = "genre", y="duration" ,data=df.loc[~(df.duration>300)], order=grouped_rating_median.genre)
plt.show()
fig = plt.figure(figsize=(20,6))
sns.violinplot(x = "genre", y="duration" ,data=df.loc[~(df.duration>300)], order=grouped_rating_median.genre)
plt.show()
df.columns
Index(['title', 'year', 'duration', 'rating', 'language', 'viewers', 'genre'], dtype='object')
fig = plt.figure(figsize=(20,6), dpi=300)
# axis1 = fig.add_subplot(1,2,1)
# axis2 = fig.add_subplot(1,2,2)
sns.scatterplot(hue = "genre", y="rating",x="year", size="duration",data=df, sizes=(10,250))
# sns.boxplot(x = "duration", data=df, ax=axis2)
plt.show()
#----------------------------------------------------------------------------------------------------------------#
grouped_viewers_mean = df.loc[:,['genre', 'viewers']].groupby(['genre']).mean().sort_values(by='viewers')
grouped_viewers_mean = pd.DataFrame(grouped_viewers_mean)
grouped_viewers_mean = grouped_viewers_mean.reset_index()
grouped_viewers_mean
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
genre | viewers | |
---|---|---|
0 | documentary | 378.595886 |
1 | tv movie | 652.826316 |
2 | western | 1712.000000 |
3 | music | 2240.846604 |
4 | horror | 2624.518535 |
5 | comedy | 2695.007110 |
6 | drama | 3087.450755 |
7 | war | 3091.150901 |
8 | animation | 3097.246907 |
9 | romance | 3221.518977 |
10 | crime | 3540.656716 |
11 | history | 3729.504409 |
12 | thriller | 4492.473424 |
13 | family | 5471.936271 |
14 | mystery | 5710.663522 |
15 | action | 6085.403806 |
16 | fantasy | 7251.688525 |
17 | science fiction | 8458.423280 |
18 | adventure | 10093.222659 |
grouped_viewers_median = df.loc[:,['genre', 'viewers']].groupby(['genre']).median().sort_values(by='viewers')
grouped_viewers_median = pd.DataFrame(grouped_viewers_median)
grouped_viewers_median = grouped_viewers_median.reset_index()
grouped_viewers_median
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
genre | viewers | |
---|---|---|
0 | documentary | 80.5 |
1 | music | 99.0 |
2 | tv movie | 101.0 |
3 | animation | 109.0 |
4 | western | 125.5 |
5 | drama | 127.0 |
6 | comedy | 129.0 |
7 | war | 142.0 |
8 | romance | 143.0 |
9 | history | 145.0 |
10 | horror | 146.0 |
11 | family | 153.0 |
12 | mystery | 157.5 |
13 | action | 165.0 |
14 | thriller | 166.5 |
15 | science fiction | 171.0 |
16 | crime | 171.5 |
17 | adventure | 179.0 |
18 | fantasy | 191.0 |
fig = plt.figure(figsize=(20,8), dpi=600)
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
sns.despine()
fig1=sns.barplot(x="genre",y="viewers",data=grouped_viewers_median, order=grouped_viewers_median.genre, ax=ax1)
annotate_graph_both(fig1)
fig1.set_ylabel("median viewers")
sns.despine()
fig2=sns.barplot(x="genre",y="viewers",data=grouped_viewers_mean, order=grouped_viewers_mean.genre, ax=ax2)
annotate_graph_both(fig2)
fig2.set_ylabel("avg. viewers")
plt.show()
Documentries are highest rated but least viewed. Appraently, "documentries" enojoy an eclectic group who more often than not like what they see and rate it very highly.