This project was done as part of Udacity's Data Analyst Nanodegree - Term 1.
The TMDb Movie dataset, one of Udacity's curated datasets, was selected for investigation using NumPy and Pandas. The dataset contains information on around 10,000 movies. For each movie, it includes aspects such as popularity, budget, revenue, cast, directors, production house, release date, runtime, and rating.
- Assessed the data and brainstormed questions that could be answered using the data
- Performed necessary cleaning steps to unify formats, deal with missing data and prepare the dataset for analysis
- Wrangled and explored the data using Pandas and NumPy to gather insights about the relationships between different aspects, created visualizations using Matplotlib, and made inferences to answer the research questions
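
As a rough illustration of the cleaning step above, here is a minimal sketch and not the project's actual notebook code. The column names (`budget`, `revenue`, `release_date`), the helper name `clean_movies`, the toy rows, and the choice to treat zero budgets and revenues as missing are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a movies table (column names are assumed)."""
    # Drop exact duplicate rows and work on a copy of the result.
    out = df.drop_duplicates().copy()
    # Assumption: a budget or revenue of 0 means "unknown", not a real value.
    for col in ("budget", "revenue"):
        out[col] = out[col].replace(0, np.nan)
    # Unify the date format; values that cannot be parsed become NaT.
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    # Derive the release month for the month-based questions.
    out["release_month"] = out["release_date"].dt.month
    return out.reset_index(drop=True)


# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "original_title": ["A", "A", "B", "C"],
    "budget": [100, 100, 0, 50],
    "revenue": [300, 300, 20, 0],
    "release_date": ["2015-06-09", "2015-06-09", "2014-12-25", "not a date"],
})
cleaned = clean_movies(toy)
print(cleaned)
```

Whether zeros should be treated as missing, and how unparseable dates should be handled, depends on the real data and would need to be checked against it.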
- How have movie production trends varied over the years?
- What are the top 20 highest grossing movies? What are the top 20 most expensive movies?
- How do budgets correlate with revenues? Do higher-budget movies have higher revenues?
- Are certain release months associated with higher revenues?
- Which months have seen the most releases?
- How do ratings correlate with commercial success (profits)?
- What runtimes are associated with each genre?
- Who are the top 20 directors who made highly rated films? Only directors who made at least 5 movies in the period 1960 - 2015 covered by the dataset are considered.
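
The last question in the list above combines a minimum-count filter with a ranking, which can be sketched with a Pandas group-by. This is only an illustration under assumptions: the column names `director` and `vote_average`, the helper name `top_directors`, and the toy ratings are hypothetical, and mean rating is just one possible way to define "highly rated". If a row lists several directors in one field, it would need to be split before grouping.

```python
import pandas as pd


def top_directors(df: pd.DataFrame, min_movies: int = 5, n: int = 20) -> pd.DataFrame:
    """Rank directors by mean rating, keeping those with enough movies."""
    # Count movies and average the rating for each director.
    stats = df.groupby("director")["vote_average"].agg(
        movies="count", mean_rating="mean"
    )
    # Keep only directors who meet the minimum movie count.
    qualified = stats[stats["movies"] >= min_movies]
    # Highest mean rating first, limited to the top n.
    return qualified.sort_values("mean_rating", ascending=False).head(n)


# Hypothetical toy data: X and Y have 5 movies each, Z has only 2.
toy = pd.DataFrame({
    "director": ["X"] * 5 + ["Y"] * 5 + ["Z"] * 2,
    "vote_average": [8, 7, 9, 8, 8, 6, 6, 6, 6, 6, 9.5, 9.5],
})
result = top_directors(toy)
print(result)
```

In the toy data, Z has the highest ratings but is excluded for having fewer than 5 movies, which is the effect the minimum-count condition is meant to have.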