Clean, impute, handle outliers, engineer features, visualize, analyze, containerize, parallelize the workload, and build a pipeline
The four milestones together build a data engineering pipeline
View the Screenshots » · Demo Video · Report Bug · Be a Contributor
This project works with NYC taxi data. It starts by exploring and improving the green taxi dataset: cleaning, visualizing, and preparing the data for later analysis or machine learning. The code is then packaged into a Docker image so it can be run anywhere, and the prepared data is loaded into a PostgreSQL database for easy access. Similar preparation steps are applied to the yellow taxi data using PySpark. Finally, the tasks are orchestrated with Airflow in Docker, which makes it easier to clean, transform, and load data. Overall, the project demonstrates data cleaning, data preparation, and task orchestration.
Note: Each milestone folder has its own README explaining how to use it.
The goal of Milestone 1 is to load a CSV file, perform exploratory data analysis with visualization, extract additional data, perform feature engineering, and pre-process the data for downstream use cases such as ML and data analysis. The dataset is the NYC green taxis dataset, which contains records of trips made in NYC by green taxis.
There are multiple datasets for this case study (one dataset per month). Download the dataset from here.
I used the October 2016 dataset; the code is reproducible and works with any month and year.
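The Milestone 1 steps (load, clean, impute, handle outliers, engineer features) can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the column names, the median imputation, the IQR capping rule, and the two derived features are all assumptions chosen for the example, and the inline sample stands in for a real monthly CSV.

```python
import io

import pandas as pd

# Tiny inline sample standing in for a monthly green taxi CSV.
# Column names are assumptions; adjust them to the real file.
SAMPLE_CSV = """lpep pickup datetime,lpep dropoff datetime,passenger count,trip distance
2016-10-01 08:00:00,2016-10-01 08:10:00,1,1.0
2016-10-01 08:00:00,2016-10-01 08:10:00,1,1.0
2016-10-03 09:00:00,2016-10-03 09:30:00,,2.0
2016-10-04 10:00:00,2016-10-04 10:20:00,3,3.0
2016-10-05 11:00:00,2016-10-05 11:15:00,2,100.0
"""


def prepare_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, impute, cap outliers, and add features to a trips table."""
    out = df.copy()

    # Clean: normalize column names and drop exact duplicate rows.
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    out = out.drop_duplicates()

    # Parse the pickup and dropoff timestamps.
    for col in ("lpep_pickup_datetime", "lpep_dropoff_datetime"):
        out[col] = pd.to_datetime(out[col])

    # Impute: fill missing passenger counts with the column median.
    median_passengers = out["passenger_count"].median()
    out["passenger_count"] = out["passenger_count"].fillna(median_passengers)

    # Outliers: cap trip distance at Q3 + 1.5 * IQR (and at 0 from below).
    q1, q3 = out["trip_distance"].quantile([0.25, 0.75])
    upper = q3 + 1.5 * (q3 - q1)
    out["trip_distance"] = out["trip_distance"].clip(lower=0, upper=upper)

    # Feature engineering: trip duration in minutes and a weekend flag.
    duration = out["lpep_dropoff_datetime"] - out["lpep_pickup_datetime"]
    out["trip_duration_min"] = duration.dt.total_seconds() / 60
    out["is_weekend"] = out["lpep_pickup_datetime"].dt.dayofweek >= 5

    return out.reset_index(drop=True)


trips = prepare_trips(pd.read_csv(io.StringIO(SAMPLE_CSV)))
print(trips[["passenger_count", "trip_distance", "trip_duration_min", "is_weekend"]])
```

To run it on a real file, replace `io.StringIO(SAMPLE_CSV)` with the path to the downloaded CSV. Because nothing in the function depends on a specific month, the same call works for any month and year.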
The objective of Milestone 2 is to package the Milestone 1 code in a Docker image that can be run anywhere. In addition, the cleaned and prepared dataset and the lookup table are loaded into a PostgreSQL database, which acts as the data warehouse.
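The database load can be sketched with `DataFrame.to_sql`. This is a hedged example under stated assumptions: the table names and the lookup table's columns are hypothetical, and the demo uses an in-memory SQLite connection so it runs without a database server. For PostgreSQL, the same function would typically receive a SQLAlchemy engine instead, as noted in the comment.

```python
import sqlite3

import pandas as pd


def load_tables(tables: dict[str, pd.DataFrame], conn) -> None:
    """Write each DataFrame to its own table, replacing any existing table."""
    for name, frame in tables.items():
        frame.to_sql(name, conn, if_exists="replace", index=False)


# Hypothetical cleaned dataset and lookup table (columns are assumptions).
cleaned = pd.DataFrame(
    {"passenger_count": [1.0, 2.0, 3.0], "trip_distance": [1.0, 2.0, 3.0]}
)
lookup = pd.DataFrame(
    {"column": ["passenger_count"], "original": ["missing"], "imputed": [2.0]}
)

# Demo target: in-memory SQLite. For PostgreSQL, pass an engine such as
# sqlalchemy.create_engine("postgresql://user:password@host:5432/dbname"),
# where the credentials, host, and database name are placeholders.
conn = sqlite3.connect(":memory:")
load_tables({"green_taxi_10_2016": cleaned, "lookup_green_taxi_10_2016": lookup}, conn)

row_count = conn.execute("SELECT COUNT(*) FROM green_taxi_10_2016").fetchone()[0]
print(f"loaded {row_count} rows")
```

Using `if_exists="replace"` keeps the load idempotent, so re-running the container does not duplicate rows. Whether to replace or append is a design choice that depends on how the warehouse is used.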
The goal of Milestone 3 is to preprocess the New York yellow taxis dataset with PySpark, performing basic data preparation and basic analysis to gain a better understanding of the data. Use the same month and year as for the green taxis in Milestone 1. Datasets: download the yellow taxis dataset.
Milestone 4 orchestrates the tasks performed in Milestones 1 and 2 using Airflow in Docker. It works primarily on the green taxi dataset and, for simplicity, pre-processes it with pandas only. The tasks from Milestones 1 and 2 run in the following order: read the CSV file (green_taxis) >> clean and transform >> write to CSV (both the cleaned dataset and the lookup table) >> extract additional resources (GPS coordinates) >> integrate them with the cleaned dataset and write back to CSV >> load both CSV files (lookup and cleaned dataset) into the PostgreSQL database as two separate tables.
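The task order above can be sketched as plain Python functions that hand data to each other through CSV files. This is an illustration only and does not use Airflow itself; in an Airflow DAG, each function would typically become one task (for example via `PythonOperator`) with dependencies declared in the same order. The function names, column names, and table names are hypothetical, the GPS extraction step is represented by an already prepared CSV, and SQLite stands in for PostgreSQL so the sketch runs on its own.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd


def clean_and_transform(raw_csv: Path, cleaned_csv: Path, lookup_csv: Path) -> None:
    """Read the raw CSV, impute missing values, and write cleaned + lookup CSVs."""
    df = pd.read_csv(raw_csv)
    median = df["passenger_count"].median()
    df["passenger_count"] = df["passenger_count"].fillna(median)
    df.to_csv(cleaned_csv, index=False)
    # The lookup table records what each imputation did.
    lookup = pd.DataFrame(
        [{"column": "passenger_count", "original": "missing", "imputed": median}]
    )
    lookup.to_csv(lookup_csv, index=False)


def integrate_gps(cleaned_csv: Path, gps_csv: Path) -> None:
    """Join GPS coordinates onto the cleaned dataset and write it back to CSV."""
    df = pd.read_csv(cleaned_csv)
    gps = pd.read_csv(gps_csv)
    df.merge(gps, on="pu_location", how="left").to_csv(cleaned_csv, index=False)


def load_to_db(cleaned_csv: Path, lookup_csv: Path, conn) -> None:
    """Load both CSV files into the database as two separate tables."""
    pd.read_csv(cleaned_csv).to_sql("green_taxi", conn, if_exists="replace", index=False)
    pd.read_csv(lookup_csv).to_sql("lookup", conn, if_exists="replace", index=False)


def run_pipeline(raw_csv: Path, gps_csv: Path, workdir: Path, conn) -> None:
    """Run the tasks in the same order the orchestrator would schedule them."""
    cleaned_csv = workdir / "cleaned.csv"
    lookup_csv = workdir / "lookup.csv"
    clean_and_transform(raw_csv, cleaned_csv, lookup_csv)
    integrate_gps(cleaned_csv, gps_csv)
    load_to_db(cleaned_csv, lookup_csv, conn)


# Small end-to-end demo with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    (work / "raw.csv").write_text(
        "pu_location,passenger_count\nHarlem,1\nAstoria,\nHarlem,3\n"
    )
    (work / "gps.csv").write_text(
        "pu_location,latitude,longitude\nHarlem,40.81,-73.95\nAstoria,40.76,-73.92\n"
    )
    demo_conn = sqlite3.connect(":memory:")
    run_pipeline(work / "raw.csv", work / "gps.csv", work, demo_conn)
    loaded_rows = demo_conn.execute("SELECT COUNT(*) FROM green_taxi").fetchone()[0]
    print(f"green_taxi rows: {loaded_rows}")
```

Passing data between steps as files rather than in memory mirrors how separate orchestrated tasks usually communicate, since each task may run in its own process.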