Movies: Extract, Transform, Load

Overview

Create an ETL pipeline from raw data to a SQL database.

Amazing Prime wants to develop an algorithm that predicts which low-budget movies will become popular so it can buy the streaming rights to those movies. To have some fun and connect with the local coding community, Amazing Prime is sponsoring a hackathon whose goal is to build that prediction algorithm on top of a clean movie database, which we will provide. We are tasked with automating the ETL process in a single function: extract multiple data sets from Wikipedia and Kaggle, transform the data sets and clean out any null or unwanted data, combine them into a workable form, and load the result into a SQL database for the hackathon participants to use.

Resources

Wikipedia Movie Dataset

  • wikipedia-movies.json

Kaggle: The Movies Dataset

  • movies_metadata.csv
  • ratings.csv

Process

1. Extract

Using the wikipedia-movies.json, movies_metadata.csv, and ratings.csv files, we begin by loading each into our Jupyter notebook and spot-checking it (.head(), .tail(), .sample()) to confirm the data is in the proper format and not corrupted.
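
A minimal sketch of this step (the extract helper is our own illustration, and it assumes the three files sit in the working directory):

```python
import json
import pandas as pd

def extract(wiki_file, kaggle_file, ratings_file):
    # The Wikipedia data is a JSON list of dicts, one per movie
    with open(wiki_file, mode='r') as file:
        wiki_movies_raw = json.load(file)
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)

    # The Kaggle metadata and MovieLens ratings are flat CSV files
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)
    return wiki_movies_df, kaggle_metadata, ratings

wiki_movies_df, kaggle_metadata, ratings = extract(
    'wikipedia-movies.json', 'movies_metadata.csv', 'ratings.csv')

# Spot-check that each data set loaded in the expected shape
print(wiki_movies_df.head())
print(kaggle_metadata.tail())
print(ratings.sample(5))
```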

2. Transform

Inspecting our data, we can identify where problems occur. Once we have identified the problems, we can develop a plan and make sure it is worth the time to fix. Once we have our plan, we can then execute it.

Problems encountered

  • entries with missing or null values
  • removing unneeded and duplicate columns
  • renaming columns for better understanding
  • converting data types using regular expressions (text to numeric, dates, Booleans); see the sketch below
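
As an example of the regex conversions, here is a simplified dollar-parsing helper; parse_dollars and the 'Box office' column name are illustrative, and the actual notebook handles more edge cases:

```python
import re
import numpy as np

def parse_dollars(s):
    """Convert strings like '$123.4 million' or '$1,234,567' to a float."""
    if not isinstance(s, str):
        return np.nan
    # Form one: "$123.4 million" or "$1.2 billion"
    match = re.match(r'\$\s*(\d+(?:\.\d+)?)\s*(million|billion)', s,
                     flags=re.IGNORECASE)
    if match:
        scale = 10**6 if match.group(2).lower() == 'million' else 10**9
        return float(match.group(1)) * scale
    # Form two: "$123,456,789"
    match = re.match(r'\$\s*(\d{1,3}(?:,\d{3})+)', s)
    if match:
        return float(match.group(1).replace(',', ''))
    return np.nan

# Hypothetical usage on a raw Wikipedia column:
# wiki_movies_df['box_office'] = wiki_movies_df['Box office'].map(parse_dollars)
```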

Merging DataFrames

We found that both data sets contained the same categories across multiple columns. Before merging, we used our 'Inspect, Plan, Execute' method to decide which set's version of each column was more useful to keep and how to fill in missing data.
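
A sketch of that merge-and-backfill strategy, continuing from the frames above; the join key and column names are assumptions for illustration:

```python
import pandas as pd

# Join the Wikipedia and Kaggle frames on a shared movie identifier
movies_df = pd.merge(wiki_movies_df, kaggle_metadata,
                     on='imdb_id', suffixes=('_wiki', '_kaggle'))

def fill_missing_kaggle_data(df, kaggle_column, wiki_column):
    """Keep the Kaggle value, falling back to Wikipedia where it is zero,
    then drop the redundant Wikipedia column."""
    df[kaggle_column] = df.apply(
        lambda row: row[wiki_column] if row[kaggle_column] == 0
        else row[kaggle_column],
        axis=1)
    df.drop(columns=wiki_column, inplace=True)

# For each duplicated category, keep the more complete Kaggle column
# and backfill its gaps from Wikipedia
fill_missing_kaggle_data(movies_df, 'runtime', 'running_time')
fill_missing_kaggle_data(movies_df, 'budget_kaggle', 'budget_wiki')
fill_missing_kaggle_data(movies_df, 'revenue', 'box_office')
```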

3. Load

Using the to_sql method, we connected Python to PostgreSQL, imported our DataFrames, and created workable tables for the hackathon participants to query.
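
A minimal sketch of the load step, continuing from the merged movies_df above and assuming a local PostgreSQL server (the connection string is a placeholder):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; substitute your own credentials
db_string = 'postgresql://postgres:password@localhost:5432/movie_data'
engine = create_engine(db_string)

# The merged movie data fits in memory, so one call suffices
movies_df.to_sql(name='movies', con=engine, if_exists='replace')

# The ~26-million-row ratings file is imported in chunks to bound memory use
for i, chunk in enumerate(pd.read_csv('ratings.csv', chunksize=1_000_000)):
    chunk.to_sql(name='ratings', con=engine,
                 if_exists='replace' if i == 0 else 'append')
```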

Summary

We were successful in creating one function to perform the entire ETL process, producing two SQL tables with over 26 million rows combined.
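
Put together, that single function might look like the sketch below, where transform and load are hypothetical names wrapping the steps described above:

```python
def run_etl(wiki_file, kaggle_file, ratings_file):
    """Hypothetical wrapper: one call runs the whole pipeline."""
    wiki_movies_df, kaggle_metadata, ratings = extract(
        wiki_file, kaggle_file, ratings_file)
    movies_df = transform(wiki_movies_df, kaggle_metadata)  # clean + merge
    load(movies_df, ratings)                                # write to PostgreSQL

run_etl('wikipedia-movies.json', 'movies_metadata.csv', 'ratings.csv')
```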

Contact

E-mail: boyerjason700@gmail.com
