"ETL is a type of data integration process referring to three distinct but interrelated steps (Extract, Transform and Load) and is used to synthesize data from multiple sources many times to build a Data Warehouse, Data Hub, or Data Lake." -Punit Pathak
- https://www.kaggle.com/stackoverflow/so-survey-2017
- https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey
- https://www.kaggle.com/mchirico/stack-overflow-developer-survey-results-2019
The above datasets were all converted to .csv format in order to work with them in a Jupyter Notebook.
The same process was applied to the 2018 and 2019 datasets as well.
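As a minimal sketch of reading the converted files, assuming pandas; the file names here are placeholders for the converted Kaggle downloads:

```python
import pandas as pd

# File names below are placeholders for the converted Kaggle .csv exports.
survey_2017 = pd.read_csv("survey_results_2017.csv", low_memory=False)
survey_2018 = pd.read_csv("survey_results_2018.csv", low_memory=False)
survey_2019 = pd.read_csv("survey_results_2019.csv", low_memory=False)
```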
- Cleaned data - dropped NAs
- Inspected column names to find columns that contained the same information across all three years
- Filtered data - kept only the columns we wanted: 'id', 'gender', 'race', 'country', 'education_level', 'undergrad_major', 'years_coding', 'dev_type', 'salary'
- Renamed columns to be uniform across all 3 years (2017-2019); a sketch of these steps follows this list
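A minimal sketch of those cleaning steps for a single year, assuming pandas; the 2017 source column names in the mapping are hypothetical examples, and each survey year would need its own mapping:

```python
import pandas as pd

# Hypothetical mapping from 2017 source columns to the uniform names;
# the 2018 and 2019 surveys need their own mappings.
COLUMNS_2017 = {
    "Respondent": "id",
    "Gender": "gender",
    "Race": "race",
    "Country": "country",
    "FormalEducation": "education_level",
    "MajorUndergrad": "undergrad_major",
    "YearsProgram": "years_coding",
    "DeveloperType": "dev_type",
    "Salary": "salary",
}

def clean_survey(df: pd.DataFrame, column_map: dict) -> pd.DataFrame:
    """Keep only the wanted columns, rename them uniformly, and drop NAs."""
    df = df[list(column_map)]           # filter to the columns we want
    df = df.rename(columns=column_map)  # uniform names across all 3 years
    return df.dropna()                  # drop rows with missing values

survey_2017_clean = clean_survey(survey_2017, COLUMNS_2017)
```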
- Loaded the cleaned data into a relational database (.sql)
Below is the code written to create each table in Postgres and then connect to our database:
- SQL Database: 'stackoverflow_survey_db'
- Tables within Database: 'survey_2017', 'survey_2018', 'survey_2019'
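The original DDL is not reproduced here; the following is a representative sketch of creating one year's table from Python with SQLAlchemy. The connection string, credentials, and column types are assumptions:

```python
from sqlalchemy import create_engine, text

# Hypothetical connection string; substitute real credentials.
engine = create_engine(
    "postgresql://postgres:password@localhost:5432/stackoverflow_survey_db"
)

# One table per survey year; survey_2018 and survey_2019 use the same shape.
create_survey_2017 = text("""
    CREATE TABLE IF NOT EXISTS survey_2017 (
        id              INT PRIMARY KEY,
        gender          VARCHAR,
        race            VARCHAR,
        country         VARCHAR,
        education_level VARCHAR,
        undergrad_major VARCHAR,
        years_coding    VARCHAR,
        dev_type        VARCHAR,
        salary          FLOAT
    );
""")

with engine.begin() as conn:
    conn.execute(create_survey_2017)
```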
Below is a sample of the code used to connect to the database and insert the cleaned data:
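Again a hedged sketch rather than the original snippet, assuming the engine from the sketch above and that `clean_survey` was applied to each year's DataFrame:

```python
from sqlalchemy import create_engine

# Hypothetical connection string; substitute real credentials.
engine = create_engine(
    "postgresql://postgres:password@localhost:5432/stackoverflow_survey_db"
)

# Append each cleaned DataFrame into its matching Postgres table.
survey_2017_clean.to_sql("survey_2017", engine, if_exists="append", index=False)
survey_2018_clean.to_sql("survey_2018", engine, if_exists="append", index=False)
survey_2019_clean.to_sql("survey_2019", engine, if_exists="append", index=False)
```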
The SQL database and tables were structured so that each survey year would have a separate table with columns common across all 3 years.
In the future, new tables can be added with the same columns by iterating through the same process for each year (read in the .csv, clean, transform, insert into the SQL table, etc.); a sketch of such a loop is shown below.
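A hedged sketch of that per-year loop, reusing the hypothetical helpers, mappings, and engine from the sketches above (COLUMNS_2018 and COLUMNS_2019 would be defined like COLUMNS_2017):

```python
# Hypothetical per-year inputs: (csv path, source-to-uniform column mapping).
YEARS = {
    2017: ("survey_results_2017.csv", COLUMNS_2017),
    2018: ("survey_results_2018.csv", COLUMNS_2018),
    2019: ("survey_results_2019.csv", COLUMNS_2019),
}

for year, (csv_path, column_map) in YEARS.items():
    df = pd.read_csv(csv_path, low_memory=False)  # read in .csv
    df = clean_survey(df, column_map)             # clean and transform
    df.to_sql(f"survey_{year}", engine,           # insert into SQL table
              if_exists="append", index=False)
```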


