NYC-Taxi-Demand-Prediction

A repository containing the notebook for my big data project, which involves EDA and machine learning on the NY Taxi Fare dataset.

Motivation, Challenge & Accomplishment

  • Motivation: Building and deploying a data science project with cloud technologies such as Apache Spark and AWS. After gaining enough theoretical knowledge of the data science workflow and machine learning techniques, and implementing a few projects locally, I wanted to go further and work through a workflow typically used in industry for development and deployment (that is, cloud operations), so I searched for a large enough dataset to do it with.

  • Challenge: Getting up and running with Spark and its Python API, PySpark, as well as managing clusters on AWS EMR, none of which I had done before. This also kicked off my attempt to complete one data science project each month, starting with this one for October.

  • Accomplishment: Successfully performed EDA and feature engineering on the dataset and used Spark MLlib to build random forest (RF) and decision tree (DT) models, achieving an RMSE of 4.28.
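
For readers new to Spark MLlib, here is a minimal sketch of how decision tree and random forest regressors can be fitted and scored by RMSE. It is an illustration under assumptions, not the notebook's actual code: the function names `rmse` and `fit_and_score`, the label column `fare_amount`, the `numTrees=20` setting, and the feature column list are all hypothetical. The plain-Python `rmse` helper shows what the metric computes; the MLlib evaluator reports the same quantity on a Spark DataFrame.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def fit_and_score(train_df, test_df, feature_cols, label_col="fare_amount"):
    """Fit DT and RF regressors with Spark MLlib and return RMSE per model.

    Assumes train_df and test_df are Spark DataFrames with numeric feature
    columns; label_col and feature_cols are hypothetical names.
    """
    # Imported lazily so the rmse helper above works without a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import (
        DecisionTreeRegressor,
        RandomForestRegressor,
    )

    # MLlib estimators expect all features packed into one vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse"
    )
    models = {
        "decision_tree": DecisionTreeRegressor(
            featuresCol="features", labelCol=label_col
        ),
        "random_forest": RandomForestRegressor(
            featuresCol="features", labelCol=label_col, numTrees=20
        ),
    }

    scores = {}
    for name, model in models.items():
        fitted = Pipeline(stages=[assembler, model]).fit(train_df)
        scores[name] = evaluator.evaluate(fitted.transform(test_df))
    return scores
```

With a Spark session available, `fit_and_score(train_df, test_df, ["col_a", "col_b"])` would return a dictionary mapping each model name to its test RMSE; the column names here are placeholders.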


Tech Stack

  • Python 3
  • pandas
  • matplotlib
  • seaborn
  • PySpark
  • AWS EMR
  • AWS EC2
  • Spark MLlib