
Zomato-Project

Getting my hands on an official Big Data project.

External source onboarding to HDFS

  1. Download the Zomato restaurant data into the zomato_raw_files folder

Converting unstructured data into CSV

  1. Copy the first three files from zomato_raw_files (Unix) to the zomato_etl/source/json folder
  2. Write an application to convert each JSON file into a CSV file with the suffix _<filedate>.csv and store it in the zomato_etl/source/csv/ (Unix) folder (a sketch of this conversion appears after the field list below)

For example:

  json1 -> zomato_20190609.csv
  json2 -> zomato_20190610.csv
  json3 -> zomato_20190611.csv
  json4 -> zomato_20190612.csv
  json5 -> zomato_20190613.csv

Note: each "zomato_*.csv" file should contain only the fields below:

  1. Restaurant ID
  2. Restaurant Name
  3. Country Code
  4. City
  5. Address
  6. Locality
  7. Locality Verbose
  8. Longitude
  9. Latitude
  10. Cuisines
  11. Average Cost for two
  12. Currency
  13. Has Table booking
  14. Has Online delivery
  15. Is delivering now
  16. Switch to order menu
  17. Price range
  18. Aggregate rating
  19. Rating text
  20. Votes
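
As a reference for the conversion step above, here is a minimal Spark/Scala sketch that flattens one JSON file into a CSV with these 20 fields. The nested `restaurants[].restaurant` layout, the argument order, and the output path handling are assumptions about the raw Zomato files, not part of the original spec.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Sketch: flatten one Zomato JSON file into a CSV with the 20 required fields.
// The nested "restaurants[].restaurant" structure and the argument order are assumptions.
object JsonToCsv {
  def main(args: Array[String]): Unit = {
    val Array(inputJson, outputDir, filedate) = args // e.g. json1, .../source/csv, 20190609

    val spark = SparkSession.builder().appName("zomato_json_to_csv").getOrCreate()

    val raw = spark.read.option("multiLine", value = true).json(inputJson)

    // Explode the nested restaurant array and keep only the required fields.
    val restaurants = raw
      .select(explode(col("restaurants")).as("r"))
      .select(
        col("r.restaurant.id").as("restaurant_id"),
        col("r.restaurant.name").as("restaurant_name"),
        col("r.restaurant.location.country_id").as("country_code"),
        col("r.restaurant.location.city").as("city"),
        col("r.restaurant.location.address").as("address"),
        col("r.restaurant.location.locality").as("locality"),
        col("r.restaurant.location.locality_verbose").as("locality_verbose"),
        col("r.restaurant.location.longitude").as("longitude"),
        col("r.restaurant.location.latitude").as("latitude"),
        col("r.restaurant.cuisines").as("cuisines"),
        col("r.restaurant.average_cost_for_two").as("average_cost_for_two"),
        col("r.restaurant.currency").as("currency"),
        col("r.restaurant.has_table_booking").as("has_table_booking"),
        col("r.restaurant.has_online_delivery").as("has_online_delivery"),
        col("r.restaurant.is_delivering_now").as("is_delivering_now"),
        col("r.restaurant.switch_to_order_menu").as("switch_to_order_menu"),
        col("r.restaurant.price_range").as("price_range"),
        col("r.restaurant.user_rating.aggregate_rating").as("aggregate_rating"),
        col("r.restaurant.user_rating.rating_text").as("rating_text"),
        col("r.restaurant.user_rating.votes").as("votes")
      )

    // Spark writes a folder of part files under this path; rename/merge afterwards
    // if a single flat zomato_<filedate>.csv file is required under source/csv.
    restaurants
      .coalesce(1)
      .write
      .option("header", "true")
      .mode("overwrite")
      .csv(s"$outputDir/zomato_$filedate.csv")

    spark.stop()
  }
}
```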

Creation of External/Internal Hive tables

  1. Move these CSV files from the zomato_etl/source/csv folder to HDFS <HDFS LOCATION>

  2. Create an external table named "zomato", partitioned by filedate, and load zomato_<filedate>.csv into the respective partition (a DDL sketch for all three tables follows this list):
    1. The table should have all columns as per the CSV file
    2. A new partition should be created whenever a new file arrives

  3. Create a Hive managed table named "dim_country" using the country_code.csv file, as per the details below:
    1. The table should have all columns as per the CSV file

  4. Create a zomato_summary_log table with the schema given below:
    1. Job id
    2. Job Step
    3. Spark submit command
    4. Job Start time
    5. Job End time
    6. Job status
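
The DDL below is a hedged sketch of what the hive/ddl statements for these three tables might look like, issued through spark.sql so the example stays in Scala. The database name zomato_db, the HDFS location, and the all-STRING column types are assumptions; the same statements could instead be kept as .hql files under hive/ddl and run through beeline.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the table DDL, issued through spark.sql so the example stays in Scala.
// Database name, HDFS location and column types are assumptions.
object CreateZomatoTables {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("zomato_create_tables")
      .enableHiveSupport()
      .getOrCreate()

    // External table over the CSV files, partitioned by filedate.
    // <username> in the LOCATION path is a placeholder.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS zomato_db.zomato (
        restaurant_id STRING, restaurant_name STRING, country_code STRING,
        city STRING, address STRING, locality STRING, locality_verbose STRING,
        longitude STRING, latitude STRING, cuisines STRING,
        average_cost_for_two STRING, currency STRING, has_table_booking STRING,
        has_online_delivery STRING, is_delivering_now STRING,
        switch_to_order_menu STRING, price_range STRING, aggregate_rating STRING,
        rating_text STRING, votes STRING)
      PARTITIONED BY (filedate STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/<username>/zomato_etl_<username>/zomato_ext/zomato'
    """)

    // Managed dimension table loaded from country_code.csv.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS zomato_db.dim_country (
        country_code STRING, country_name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
    """)

    // Audit/log table written by the wrapper script after every module.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS zomato_db.zomato_summary_log (
        job_id STRING, job_step STRING, spark_submit_command STRING,
        job_start_time STRING, job_end_time STRING, job_status STRING)
    """)

    spark.stop()
  }
}
```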

Transformation using Hive and Spark

  1. Write a Spark application in Scala to load a summary table named "zomato_summary" and apply the following transformations (a sketch of this application follows the load-strategy list below):
    1. Create the "zomato_summary" table partitioned by p_filedate and p_country_name
    2. The schema for the zomato_summary table is:

    1. Restaurant ID
    2. Restaurant Name
    3. Country Code
    4. City
    5. Address
    6. Locality
    7. Locality Verbose
    8. Longitude
    9. Latitude
    10. Cuisines
    11. Average Cost for two
    12. Currency
    13. Has Table booking
    14. Has Online delivery
    15. Is delivering now
    16. Switch to order menu
    17. Price range
    18. Aggregate rating
    19. Rating text
    20. Votes
    21. m_rating_colour
    22. m_cuisines
    23. p_filedate
    24. p_country_name
    25. create_datetime
    26. user_id
1. Data in this Hive table should be stored in ORC format
2. Add audit columns "create_datetime" and "user_id" to the zomato_summary table
3. Derive a column "Rating Colour" (m_rating_colour) based on the rules listed below:

| Rating Text | Aggregate Rating | m_rating_colour |
| ----------- | ---------------- | --------------- |
| Poor        | 1.9-2.4          | Red             |
| Average     | 2.5-3.4          | Amber           |
| Good        | 3.5-3.9          | Light Green     |
| Very Good   | 4.0-4.4          | Green           |
| Excellent   | 4.5-5            | Gold            |

4. Derive a column "m_cuisines" and map the Indian cuisines (Andhra, Goan, Hyderabadi, North Indian, etc.) to "Indian" and the rest of the cuisines to "World Cuisines"
5. Filter out the restaurants with NULL/blank Cuisines values
6. Populate "NA" for NULL/blank values in string columns
7. There should be no duplicate records in the summary table
  1. The Spark application should be able to perform the following load strategies:

    1. Manual: load the data for a particular filedate & country_name
    2. Historical: load the data historically for all filedates & country_names
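
Below is a minimal sketch of such a Spark/Scala application covering the transformation rules and the manual/historical switch. The database and column names, the left join to dim_country for p_country_name, and the argument convention are assumptions, not the original implementation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch of the zomato_summary load. Database/column names, the dim_country join
// and the manual/historical argument convention are assumptions.
object ZomatoSummaryLoad {
  def main(args: Array[String]): Unit = {
    // args(0): "manual" or "historical"; args(1)/args(2): filedate / country_name for manual runs
    val loadType = args(0).toLowerCase

    val spark = SparkSession.builder()
      .appName("zomato_summary_load")
      .enableHiveSupport()
      .getOrCreate()

    val zomato  = spark.table("zomato_db.zomato")
    val country = spark.table("zomato_db.dim_country")

    // Manual load: restrict to one filedate; historical load: take everything.
    val scoped =
      if (loadType == "manual") zomato.filter(col("filedate") === args(1)) else zomato

    val stringCols = scoped.schema.fields.filter(_.dataType.typeName == "string").map(_.name)
    val rating     = col("aggregate_rating").cast("double")

    val summary = scoped
      .join(country, Seq("country_code"), "left")
      // Filter out restaurants with NULL/blank cuisines.
      .filter(col("cuisines").isNotNull && trim(col("cuisines")) =!= "")
      // Derive m_rating_colour from the aggregate-rating bands.
      .withColumn("m_rating_colour",
        when(rating.between(1.9, 2.4), "Red")
          .when(rating.between(2.5, 3.4), "Amber")
          .when(rating.between(3.5, 3.9), "Light Green")
          .when(rating.between(4.0, 4.4), "Green")
          .when(rating.between(4.5, 5.0), "Gold")
          .otherwise("NA"))
      // Map Indian cuisines to "Indian", everything else to "World Cuisines".
      .withColumn("m_cuisines",
        when(col("cuisines").rlike("(?i)andhra|goan|hyderabadi|north indian|south indian"),
          "Indian").otherwise("World Cuisines"))
      // Audit columns.
      .withColumn("create_datetime", current_timestamp())
      .withColumn("user_id", lit(sys.props.getOrElse("user.name", "unknown")))
      // Partition columns.
      .withColumn("p_filedate", col("filedate"))
      .withColumn("p_country_name", coalesce(col("country_name"), lit("NA")))
      // Replace NULLs in string columns with "NA" (blank strings can be handled the
      // same way with when/trim before this step).
      .na.fill("NA", stringCols)
      // No duplicate records in the summary table.
      .dropDuplicates()

    val output =
      if (loadType == "manual") summary.filter(col("p_country_name") === args(2)) else summary

    // Write as ORC, partitioned by p_filedate and p_country_name.
    output.write
      .format("orc")
      .mode("overwrite")
      .partitionBy("p_filedate", "p_country_name")
      .saveAsTable("zomato_db.zomato_summary")

    spark.stop()
  }
}
```
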
  2. Create a shell script wrapper to execute the complete flow as given

  3. The Spark application and shell script should be parametrized (dbname, tablename, filters, arguments, etc.)

  4. Check whether another instance of the same application is already running:
    1. Exit and send a notification if an instance is already running or the previous run failed

  5. The user should be able to execute each module separately AND all the modules together

  6. Module 1: call the application that converts the JSON files to CSV
    1. Capture logs in a file
    2. Check the execution status
    3. If it failed, add a failure entry to the log table, send a failure notification, and exit
    4. If it passed, add a success entry to the log table and move to the next step

  7. Module 2: execute the command to load the CSV files into the Hive external/managed tables (a new partition should be created whenever a file with a new filedate is loaded)
    1. Capture logs in a file
    2. Check the execution status
    3. If it failed, add a failure entry to the log table, send a failure notification, and exit
    4. If it passed, add a success entry to the log table and move to the next step

  8. Module 3: call the Spark application to load the zomato summary table
    1. Capture logs in a file
    2. Check the execution status
    3. If it failed, add a failure entry to the log table, send a failure notification, and exit
    4. If it passed, add a success entry to the log table and send a final notification

  9. Purge the last 7 days of logs from the log directory

  10. Write a beeline command to insert log details for each successful and unsuccessful execution, containing the details below:
    1. Job id
    2. Job Step
    3. Spark submits that were triggered
    4. Job Start time
    5. Job End time
    6. Job status

  11. Once the execution for the 3 JSON files is completed, move these files into the archive folder

  12. Add a new source file into the source folder and execute the complete workflow again

  13. Schedule the job to run daily at 01:00 AM using crontab

  14. Execute the complete job and perform unit testing to check the complete ETL flow and data-loading anomalies

  15. Document execution statistics (start time, end time, total execution time, no. of executors, no. of cores, driver memory) along with the application URLs

  16. Create a unit test case document and report bugs/observations.

  17. Follow standard coding guidelines, e.g.:

  18. Variable names: all lower case, with individual words separated by an underscore

  19. Add comments in the code for each function/procedure

  20. Code alignment and indentation should be proper

  21. Perform exception handling

  22. Built-in functions, code, locations, table and database names, etc. should be in lower case

  23. Folder Structure on Linux:

  • zomato_etl
    • source
      • json
      • csv
    • archive
    • hive
      • ddl
      • dml
    • spark
      • jars
      • scala
    • script (shell scripts and property files)
    • logs (log_ddmmyyyy_hhmm.log)
  • zomato_raw_files
  24. Folder Structure on HDFS:
  • zomato_etl_<username>
    • log
    • zomato_ext
      • zomato
      • dim_country
