BigData-Ops-on-TLC-Yellow-Taxi

Big Data analytics on New York City's Yellow Taxi trip data set for the year 2017 (5.17 GB), using Big Data tools such as Hadoop, HBase, Sqoop, MapReduce, AWS EMR, and AWS RDS (MySQL).


The data set for the assignment can be downloaded from these links:
yellow_tripdata_2017-01.csv
yellow_tripdata_2017-02.csv
yellow_tripdata_2017-03.csv
yellow_tripdata_2017-04.csv
yellow_tripdata_2017-05.csv
yellow_tripdata_2017-06.csv

Data Dictionary


Task

The Big Data tools used are the Hadoop framework, Apache HBase, and Apache Sqoop. The work ran on an AWS EMR cluster (m4.xlarge, 64 GB) with all the services installed; additional services (AWS RDS MySQL, etc.) were installed as needed.

Data Ingestion Tasks:

Task 1. Create an RDS instance in your AWS account and upload the data from two files (yellow_tripdata_2017-01.csv & yellow_tripdata_2017-02.csv) from the dataset. Make sure to create an appropriate schema for the data sets before uploading them to RDS.
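
As a rough illustration of the schema step, the sketch below generates a MySQL `CREATE TABLE` statement and a matching `LOAD DATA` statement. The column names are assumed to follow the 2017 yellow taxi CSV header (check them against the Data Dictionary); the SQL types and the table name `yellow_tripdata` are assumptions made for this example, not the schema actually used in the project.

```python
# Column names are assumed to match the 2017 yellow taxi CSV header.
# The MySQL types are illustrative guesses, not the project's real schema.
COLUMNS = [
    ("VendorID", "INT"),
    ("tpep_pickup_datetime", "DATETIME"),
    ("tpep_dropoff_datetime", "DATETIME"),
    ("passenger_count", "INT"),
    ("trip_distance", "DECIMAL(10,2)"),
    ("RatecodeID", "INT"),
    ("store_and_fwd_flag", "CHAR(1)"),
    ("PULocationID", "INT"),
    ("DOLocationID", "INT"),
    ("payment_type", "INT"),
    ("fare_amount", "DECIMAL(10,2)"),
    ("extra", "DECIMAL(10,2)"),
    ("mta_tax", "DECIMAL(10,2)"),
    ("tip_amount", "DECIMAL(10,2)"),
    ("tolls_amount", "DECIMAL(10,2)"),
    ("improvement_surcharge", "DECIMAL(10,2)"),
    ("total_amount", "DECIMAL(10,2)"),
]


def create_table_sql(table_name):
    """Build a CREATE TABLE statement from the COLUMNS list."""
    cols = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in COLUMNS)
    return f"CREATE TABLE IF NOT EXISTS `{table_name}` (\n{cols}\n);"


def load_data_sql(table_name, csv_path):
    """Build a LOAD DATA statement that skips the CSV header row."""
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' INTO TABLE `{table_name}` "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' IGNORE 1 LINES;"
    )


# Hypothetical table name; the file name comes from the dataset list above.
print(create_table_sql("yellow_tripdata"))
print(load_data_sql("yellow_tripdata", "yellow_tripdata_2017-01.csv"))
```

The generated statements can be pasted into a MySQL client connected to the RDS instance. `LOAD DATA LOCAL INFILE` requires local-infile support to be enabled on both client and server.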

Task 2. Use a Sqoop command to ingest the data from RDS into an HBase table.
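
One possible shape for the Sqoop import is sketched below as a small Python helper that assembles the command line. The flags are standard Sqoop import options, but every value is a placeholder or an assumption: the JDBC URL, the user name, the table names `yellow_tripdata` and `yellow_trips`, the column family `trip`, and the row-key column `trip_id` (the trip data has no obvious unique column, so a key column would have to be added or a composite key used).

```python
import shlex


def sqoop_import_args(jdbc_url, username, source_table, hbase_table,
                      column_family, row_key_column):
    """Return the argv list for a Sqoop import from MySQL into HBase."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                    # prompt for the password instead of embedding it
        "--table", source_table,
        "--hbase-table", hbase_table,
        "--column-family", column_family,
        "--hbase-row-key", row_key_column,
        "--hbase-create-table",  # create the HBase table if it does not exist
        "-m", "1",               # single mapper, so no --split-by column is needed
    ]


# All values below are hypothetical placeholders.
args = sqoop_import_args(
    jdbc_url="jdbc:mysql://<rds-endpoint>:3306/<database>",
    username="admin",
    source_table="yellow_tripdata",
    hbase_table="yellow_trips",
    column_family="trip",
    row_key_column="trip_id",
)
print(" ".join(shlex.quote(arg) for arg in args))
```

The printed command can be run on the EMR master node. A single mapper keeps the example simple; a larger import would normally raise `-m` and name a `--split-by` column.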

Task 3. Bulk import the data from the next two files in the dataset (yellow_tripdata_2017-03.csv and yellow_tripdata_2017-04.csv) from your EMR cluster into your HBase table using the relevant code. Note: Task 3 only requires importing these two CSV files.
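
A batch ingest along these lines could be sketched as follows. This is not the project's `batch_ingest.py`; it assumes the HappyBase client library, a hypothetical column family `trip`, and a simple sequential row key. The function takes any object with a HappyBase-style `batch()` method, so the logic can be exercised without a running HBase.

```python
import csv

COLUMN_FAMILY = "trip"  # hypothetical column family name


def ingest_rows(table, lines, start_id=0, batch_size=1000):
    """Write CSV rows to an HBase table and return the number of rows written.

    `table` is expected to behave like a happybase.Table; `lines` is any
    iterable of CSV text lines whose first line is the header.
    """
    count = 0
    with table.batch(batch_size=batch_size) as batch:
        for row in csv.DictReader(lines):
            # Sequential row key: an assumption made for this sketch.
            row_key = str(start_id + count).encode()
            data = {
                f"{COLUMN_FAMILY}:{name}".encode(): value.encode()
                for name, value in row.items()
                if name is not None and value is not None
            }
            batch.put(row_key, data)
            count += 1
    return count


# Possible usage on the cluster (requires the happybase package and a
# running HBase Thrift server; host and table name are hypothetical):
#
#   import happybase
#   connection = happybase.Connection("localhost")
#   with open("yellow_tripdata_2017-03.csv", newline="") as handle:
#       ingest_rows(connection.table("yellow_trips"), handle)
```

Batching groups many puts into one round trip, which matters for files of this size. When loading several files, `start_id` should continue from the previous file's count so that row keys do not collide.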

MapReduce Tasks:

Task 4. Write MapReduce code to answer the following questions using the files you’ve downloaded to your EMR instance:

  • Which vendor has the most trips, and what is the total revenue generated by that vendor?
  • Which pickup location generates the most revenue?
  • What are the different payment types used by customers and their count? The final results should be in a sorted format.
  • What is the average trip time for different pickup locations?
  • Calculate the average tip-to-revenue ratio of the drivers for different pickup locations, in sorted order.
  • How does revenue vary over time? Calculate the average trip revenue per month, analysing it by hour of the day (day vs night) and by day of the week (weekday vs weekend).

NOTE: It's recommended to use MRJob for completing the MapReduce tasks above.
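
To make the map and reduce steps concrete, here is a minimal sketch of the payment-type question. The mapper and reducer are plain generator functions with the same key/value shape as MRJob's `mapper` and `reducer` methods, and a small local driver stands in for the shuffle so the logic can be checked without a cluster. The column position of `payment_type` is an assumption based on the 2017 CSV header.

```python
from collections import defaultdict

# Assumed zero-based position of payment_type in the 2017 CSV column order.
PAYMENT_TYPE_INDEX = 9


def mapper(_, line):
    """Emit (payment_type, 1) for every valid trip record."""
    fields = line.strip().split(",")
    if len(fields) <= PAYMENT_TYPE_INDEX:
        return  # blank or truncated line
    payment_type = fields[PAYMENT_TYPE_INDEX]
    if not payment_type.isdigit():
        return  # header row or malformed value
    yield payment_type, 1


def reducer(payment_type, counts):
    """Sum the counts emitted for one payment type."""
    yield payment_type, sum(counts)


def run_locally(lines):
    """Simulate map, shuffle, and reduce, then sort by count (descending)."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    totals = [pair for key, values in grouped.items()
              for pair in reducer(key, values)]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


# Made-up sample records in the assumed column order.
sample = [
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,"
    "trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,"
    "payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,"
    "improvement_surcharge,total_amount",
    "1,2017-01-09 11:13:28,2017-01-09 11:25:45,1,3.30,1,N,263,161,1,12.5,0,0.5,2,0,0.3,15.3",
    "1,2017-01-09 11:32:27,2017-01-09 11:36:01,1,0.90,1,N,186,234,1,5,0,0.5,1.45,0,0.3,7.25",
    "2,2017-01-09 11:38:20,2017-01-09 11:42:05,1,1.10,1,N,164,161,2,5.5,0,0.5,0,0,0.3,6.3",
]
print(run_locally(sample))
```

In an MRJob job, the two functions would become methods of an `MRJob` subclass, and the final sort would typically be a second step in which every pair is sent to a single reducer key. The other questions follow the same pattern with a different key (for example the pickup location ID) and a different aggregate.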

Optional Task:

Task 5. Use the Sqoop export command to export the results of each of the MapReduce tasks above to your RDS instance. Use the RDS connection string to visualise the dataset using a dashboarding tool (Google Data Studio, Tableau, or Power BI).

Assets

  1. RDS.pdf: A document containing the code, with explanations, used to load the datasets mentioned above into an AWS RDS instance. It should include the code along with screenshots of the EMR instance showing the table creation.
  2. Ingestiontask.pdf: A document containing the code to create the HBase table. The file should also include the Sqoop command to ingest data from RDS into the HBase table. The document should be well commented, explaining the code.
  3. batch_ingest.py: The script used to ingest the batch data into the HBase table.
  4. The Python code used for the MapReduce tasks. Answers to the queries and screenshots of the results of the MapReduce tasks must be included, in sequence, in a separate document (MapReducetasks.pdf).