In this competition, you're presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries. You can collect other publicly available data to use in your model predictions, but in the spirit of this competition, use only data that would have been available before a movie's release.
- data_analysis.ipynb
- feature_engineering.ipynb
- drop: belongs_to_collection, genres, homepage, imdb_id, original_title, overview, poster_path, production_companies, production_countries, spoken_languages, status, tagline, title, Keywords, cast, crew
- parse release_date into month, day and year
- revenue
- budget
- popularity
- runtime
- revenue
- budget
- popularity
- budget
- revenue
- genres
- spoken_languages
- production_companies
- belongs_to_collection
- production_countries
- Keywords
- cast (actor name)
- crew (department + name)
- cast (actor name)
- genres
- crew (department + name)
- spoken_languages
- production_companies
- belongs_to_collection
- production_countries
- Keywords
- tagline
- title
- overview
- homepage is not null
- parse homepage for domain, https...
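
The drop, release_date and homepage steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the sample rows, the new column names (`release_month`, `has_homepage`, `homepage_domain`, and so on) and the helper `split_homepage` are all hypothetical, and the month/day/two-digit-year date format is an assumption about the data.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical two-row sample; the real data comes from the competition CSV.
df = pd.DataFrame({
    "release_date": ["2/20/15", "8/6/04"],
    "homepage": ["https://www.example.com/movie", None],
    "status": ["Released", "Released"],
    "budget": [14000000, 40000000],
})

# Drop a column that is not used as a feature (one entry from the drop list).
df = df.drop(columns=["status"])

# Parse release_date into month, day and year.
# Assumption: dates are formatted as month/day/two-digit-year.
dates = pd.to_datetime(df["release_date"], format="%m/%d/%y")
df["release_month"] = dates.dt.month
df["release_day"] = dates.dt.day
df["release_year"] = dates.dt.year

# homepage features: a presence flag, the domain, and whether it uses HTTPS.
df["has_homepage"] = df["homepage"].notna()


def split_homepage(url):
    """Return (domain, uses_https) for a URL, or (None, False) if missing."""
    if pd.isna(url):
        return None, False
    parsed = urlparse(url)
    return parsed.netloc, parsed.scheme == "https"


parts = [split_homepage(url) for url in df["homepage"]]
df["homepage_domain"] = [domain for domain, _ in parts]
df["homepage_https"] = [https for _, https in parts]

# The raw columns are no longer needed once the features exist.
df = df.drop(columns=["release_date", "homepage"])
```

Two-digit years are ambiguous for older films (a `%y` parse maps values onto a fixed century window), so a real pipeline would likely need a correction step for dates that land in the future.
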
- (2.972, 83%) baseline
- (2.329, 58%) log transformed revenue
- (2.329, 58%) log transformed budget, popularity
- (2.325, 58%) clear outliers for popularity
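
The log transforms and the popularity outlier step listed above might look something like the sketch below. The sample values, the new column names, the 0.99 quantile cutoff and the `rmsle` helper are assumptions made for illustration; the notes do not say how outliers were cleared or which metric the scores use (RMSLE is assumed here because it pairs naturally with a log-transformed target).

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the transforms.
df = pd.DataFrame({
    "budget": [0, 1_000_000, 50_000_000, 200_000_000],
    "popularity": [0.5, 8.2, 12.1, 290.0],
    "revenue": [10_000, 3_000_000, 120_000_000, 900_000_000],
})

# log1p computes log(1 + x), so zero budgets stay finite.
for col in ["revenue", "budget", "popularity"]:
    df[f"log_{col}"] = np.log1p(df[col])

# One possible way to handle popularity outliers: cap at a high quantile.
# The 0.99 cutoff is an assumption, not a value taken from these notes.
cap = df["popularity"].quantile(0.99)
df["popularity_capped"] = df["popularity"].clip(upper=cap)


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between two arrays."""
    diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))


# A model trained on log_revenue predicts in log space;
# expm1 maps those predictions back to the original revenue scale.
pred_revenue = np.expm1(df["log_revenue"])
```

Training on `log_revenue` and scoring with RMSLE on the back-transformed predictions is equivalent to minimising RMSE in log space, which is one plausible reason the log-transformed target improves on the baseline.
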