Create an ETL pipeline from raw data to a SQL database.
Amazing Prime wants to develop an algorithm to predict which low budget movies will become popular so they can buy the streaming rights for those movies. To have some fun and connect with the local coding community, Amazing Prime is sponsoring a hackathon where the goal is to create their prediction algorithm using a clean database of movies which we will create. We are tasked with automating the ETL process by creating one function to extract multiple data sets from Wikipedia and Kaggle, transform the data sets and clean out any null or unwanted data and then combine these data sets into a workable form and load them into a SQL database for the hackathon participants to use.
Wikipedia Movie Dataset
- wikipedia-movies.json
- movies_metadata.csv
- ratings.csv
Using the wikipedia-movies.json, movies_metadata.csv and ratings.csv files, we begin by loading each into our Juypter notebook to verify the data looks good by performing various checks( .head(), .tail(), .sample()
) to make sure the data is in the proper format and not corrupted.
Inspecting our data, we can identify where problems occur. Once we have identified the problems, we can develop a plan and make sure it is worth the time to fix. Once we have our plan, we can then execute it.
- entries with missing of null values
- removing un-needed and duplicate columns
- renaming columns for better understanding
- converting data types using regular expressions (text to numeric, dates, Booleans)
We found both data sets contained the same categories for multiple columns. Before merging, we used our 'Inspect, Plan, Execute' method to identify which set would be more useful to include and how to fill in missing data.
Using the to_sql
method, we connected Python to PostgreSQL to import our DataFrames and create workable databases for the hackathon participants to use.
We were successful in creating one function to perform the ETL process and create two databases with over 26 million entries combined.
E-mail: boyerjason700@gmail.com