
zhuoqunw/Big-data-analysis-on-movie-dataset


Big Data Analysis on Movie Dataset

SI 618 final project

Main Tools:

💻 Data analysis: Spark SQL, PySpark, Hadoop
🎨 Data visualization: Python Altair

Motivation

It’s no secret that the movie business has an inherent gender bias. Female actors usually earn less than male actors. There are fewer female protagonists in movies, and female characters usually lack the serious development and depth of their male counterparts. How to quantify this gender bias is one of the topics of interest. Inspired by a FiveThirtyEight article on the relationship between female prominence in movies (evaluated by the Bechdel test) and movie budget and box office, I decided to explore this topic further in this project. Specifically, I wanted to know what kinds of movies pass the Bechdel test. Thus, in this project, I examined and discussed the relationship between passing the Bechdel test and a set of movie characteristics: release decade, country of production, movie genre, crew gender, IMDb rating, budget, domestic and international box office, and return on investment (ROI).

Data Source

* Bechdel movie dataset from BechdelTest.com

* Box Office Mojo dataset from Kaggle

* IMDb movies extensive dataset from Kaggle

Data Manipulation

workflow


Analysis and Visualization

Some of the findings:

💡 For more analyses and visualizations, as well as details of the data preprocessing and manipulation, please refer to the final report.

1. Overall, the percentage of movies passing the Bechdel test (represented by the green bars) has increased over the decades.

time_trend
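A minimal sketch of how a per-decade pass percentage like the one charted above could be computed. It uses plain Python rather than the project's Spark SQL so that it runs anywhere; the function name, the `(year, passed)` row layout, and the sample rows are assumptions made for illustration and are not taken from the project's code or data.

```python
from collections import defaultdict


def pass_rate_by_decade(movies):
    """Return {decade: percent of movies passing} for (year, passed) pairs.

    Hypothetical helper: the row layout is an assumption, not the project's schema.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for year, passed in movies:
        decade = (year // 10) * 10  # e.g. 1994 -> 1990
        totals[decade] += 1
        if passed:
            passes[decade] += 1
    # Sort by decade so the result reads chronologically.
    return {d: 100.0 * passes[d] / totals[d] for d in sorted(totals)}


# Made-up rows for illustration only (not from the project data).
sample = [
    (1975, False), (1978, True),
    (1994, True), (1996, False), (1999, True),
    (2011, True),
]
rates = pass_rate_by_decade(sample)
```

In Spark, the same idea would presumably be a group-by on a derived decade column followed by an average of a 0/1 pass flag.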

2. Overall, the higher the IMDb rating, the lower the percentage of movies passing the Bechdel test.

rating

Another interesting finding is that this negative relationship between IMDb rating and the percentage passing the Bechdel test doesn’t differ between male and female voters.

male_female_rating

3. Movies passing the test (represented by the highlighted orange bar) have the lowest median budget, but not the lowest box office or ROI.

budget

box office

roi
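A rough sketch of how median budget and median ROI could be compared across Bechdel test outcomes. The README does not define ROI, so the formula used here, (gross - budget) / budget, is an assumption; the function names, the `(outcome, budget, gross)` row layout, the outcome labels, and the sample numbers are likewise hypothetical.

```python
from collections import defaultdict
from statistics import median


def roi(budget, gross):
    """Return on investment, ASSUMED here to be (gross - budget) / budget."""
    return (gross - budget) / budget


def medians_by_outcome(movies):
    """Return {outcome: {"median_budget": ..., "median_roi": ...}}.

    `movies` is an iterable of (outcome, budget, gross) tuples; this layout
    is hypothetical and not the project's actual schema.
    """
    groups = defaultdict(lambda: {"budget": [], "roi": []})
    for outcome, budget, gross in movies:
        if budget <= 0:
            continue  # skip rows where ROI is undefined
        groups[outcome]["budget"].append(budget)
        groups[outcome]["roi"].append(roi(budget, gross))
    return {
        outcome: {
            "median_budget": median(values["budget"]),
            "median_roi": median(values["roi"]),
        }
        for outcome, values in groups.items()
    }


# Made-up numbers (in millions) for illustration only.
sample = [
    ("pass", 10, 30), ("pass", 20, 30), ("pass", 40, 40),
    ("fail", 50, 100), ("fail", 100, 150),
]
summary = medians_by_outcome(sample)
```

Medians are used instead of means because a few very large budgets or grosses would otherwise dominate each group.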
