Getting my hands on an official Big Data Project
- External source onboarding to HDFS
- Download the Zomato restaurant data into the zomato_raw_files folder
- Convert the unstructured data into CSV data
- Copy the first three files from zomato_raw_files (Unix) to the zomato_etl/source/json folder
- Write an application to convert each JSON file into a CSV file with the suffix *_filedate.csv and store them in the zomato_etl/source/csv/ (Unix) folder (see the Scala sketch after the field list below)
For example:
json1 -> zomato_20190609.csv
json2 -> zomato_20190610.csv
json3 -> zomato_20190611.csv
json4 -> zomato_20190612.csv
json5 -> zomato_20190613.csv
- Note: "zomato_*.csv" should have the below fields only:
  - Restaurant ID
  - Restaurant Name
  - Country Code
  - City
  - Address
  - Locality
  - Locality Verbose
  - Longitude
  - Latitude
  - Cuisines
  - Average Cost for two
  - Currency
  - Has Table booking
  - Has Online delivery
  - Is delivering now
  - Switch to order menu
  - Price range
  - Aggregate rating
  - Rating text
  - Votes
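The JSON-to-CSV application can be sketched in Spark/Scala as below. The nested field paths (restaurants[].restaurant.*, location.*, user_rating.*) are assumptions about the raw Zomato export, not something this document specifies, so they would need to be adjusted to the actual files:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ZomatoJsonToCsv {
  def main(args: Array[String]): Unit = {
    // Hypothetical usage: ZomatoJsonToCsv <input_json_path> <output_dir> <filedate>
    val Array(inputJson, outputDir, filedate) = args

    val spark = SparkSession.builder().appName("zomato_json_to_csv").getOrCreate()

    // multiLine because each raw file is assumed to hold one large JSON document
    val raw = spark.read.option("multiLine", true).json(inputJson)

    // keep only the 20 fields required by the "zomato_*.csv" layout above
    val restaurants = raw
      .select(explode(col("restaurants")).as("r"))
      .select(
        col("r.restaurant.id").as("restaurant_id"),
        col("r.restaurant.name").as("restaurant_name"),
        col("r.restaurant.location.country_id").as("country_code"),
        col("r.restaurant.location.city").as("city"),
        col("r.restaurant.location.address").as("address"),
        col("r.restaurant.location.locality").as("locality"),
        col("r.restaurant.location.locality_verbose").as("locality_verbose"),
        col("r.restaurant.location.longitude").as("longitude"),
        col("r.restaurant.location.latitude").as("latitude"),
        col("r.restaurant.cuisines").as("cuisines"),
        col("r.restaurant.average_cost_for_two").as("average_cost_for_two"),
        col("r.restaurant.currency").as("currency"),
        col("r.restaurant.has_table_booking").as("has_table_booking"),
        col("r.restaurant.has_online_delivery").as("has_online_delivery"),
        col("r.restaurant.is_delivering_now").as("is_delivering_now"),
        col("r.restaurant.switch_to_order_menu").as("switch_to_order_menu"),
        col("r.restaurant.price_range").as("price_range"),
        col("r.restaurant.user_rating.aggregate_rating").as("aggregate_rating"),
        col("r.restaurant.user_rating.rating_text").as("rating_text"),
        col("r.restaurant.votes").as("votes")
      )

    // coalesce(1) so a single part file can later be renamed to zomato_<filedate>.csv
    // by the wrapper script
    restaurants.coalesce(1)
      .write.mode("overwrite")
      .option("header", true)
      .csv(s"$outputDir/zomato_$filedate")

    spark.stop()
  }
}
```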
- Creation of External/Internal Hive tables
- Move these CSV files from the zomato_etl/source/csv folder to HDFS <HDFS LOCATION>
- Create an External Table named "zomato" partitioned by filedate and load zomato_<filedate>.csv into the respective partition (a DDL sketch follows this item):
  1. The table should have all columns as per the CSV file.
  2. A new partition should be created whenever a new file arrives.
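A minimal sketch of the DDL, written through spark.sql so all the examples stay in Scala; the column types (all STRING here), the delimiter and the <HDFS LOCATION> placeholder are assumptions to be adjusted:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zomato_ddl").enableHiveSupport().getOrCreate()

// Illustrative DDL: column types, delimiter and location are assumptions
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS zomato (
    restaurant_id STRING, restaurant_name STRING, country_code STRING,
    city STRING, address STRING, locality STRING, locality_verbose STRING,
    longitude STRING, latitude STRING, cuisines STRING,
    average_cost_for_two STRING, currency STRING, has_table_booking STRING,
    has_online_delivery STRING, is_delivering_now STRING,
    switch_to_order_menu STRING, price_range STRING,
    aggregate_rating STRING, rating_text STRING, votes STRING
  )
  PARTITIONED BY (filedate STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '<HDFS LOCATION>/zomato_ext/zomato'
  TBLPROPERTIES ('skip.header.line.count' = '1')
""")
```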
- Create a Hive Managed Table named "dim_country" using the country_code.csv file as per the below details (a DDL sketch follows this item):
  1. The table should have all columns as per the CSV file.
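Likewise for the dimension table; the two-column layout of country_code.csv (code plus country name) and the load path are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dim_country_ddl").enableHiveSupport().getOrCreate()

// Illustrative managed-table DDL; adjust the columns to the actual country_code.csv
spark.sql("""
  CREATE TABLE IF NOT EXISTS dim_country (
    country_code STRING,
    country_name STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  TBLPROPERTIES ('skip.header.line.count' = '1')
""")

// Illustrative local path; point this at wherever country_code.csv actually lives
spark.sql("LOAD DATA LOCAL INPATH 'zomato_etl/source/csv/country_code.csv' INTO TABLE dim_country")
```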
- Create the zomato_summary_log table with the schema given below (a DDL sketch follows this item):
  1. Job ID
  2. Job Step
  3. Spark submit command
  4. Job Start time
  5. Job End time
  6. Job status
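And the job-audit table; column names follow the six fields above, the types and storage format are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zomato_log_ddl").enableHiveSupport().getOrCreate()

// Illustrative layout for the job-audit table
spark.sql("""
  CREATE TABLE IF NOT EXISTS zomato_summary_log (
    job_id               STRING,
    job_step             STRING,
    spark_submit_command STRING,
    job_start_time       TIMESTAMP,
    job_end_time         TIMESTAMP,
    job_status           STRING
  )
  STORED AS ORC
""")
```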
- Transformation using Hive and Spark
- Write a Spark application in Scala to load a summary table named "zomato_summary" and apply the following transformations (a Scala sketch follows this list):
  1. Create the "zomato_summary" table partitioned by p_filedate and p_country_name.
  2. The schema for the zomato_summary table is given below:
- Restaurant ID
- Restaurant Name
- Country Code
- City
- Address
- Locality
- Locality Verbose
- Longitude
- Latitude
- Cuisines
- Average Cost for two
- Currency
- Has Table booking
- Has Online delivery
- Is delivering now
- Switch to order menu
- Price range
- Aggregate rating
- Rating text
- Votes
- m_rating_colour
- m_cuisines
- p_filedate
- p_country_name
- create_datetime
- user_id
  3. Data in this Hive table should be in ORC format.
  4. Add audit columns "create_datetime" and "user_id" to the zomato_summary table.
  5. Derive a column "Rating Colour" (m_rating_colour) based on the rules listed below:

     | Rating Text | Aggregate Rating | m_rating_colour |
     | --- | --- | --- |
     | Poor | 1.9-2.4 | Red |
     | Average | 2.5-3.4 | Amber |
     | Good | 3.5-3.9 | Light Green |
     | Very Good | 4.0-4.4 | Green |
     | Excellent | 4.5-5.0 | Gold |

  6. Derive a column "m_cuisines" and map the Indian (Andhra, Goan, Hyderabadi, North Indian, etc.) cuisines to "Indian" and the rest of the cuisines to "World Cuisines".
  7. Filter out the restaurants with NULL/BLANK "Cuisines" values.
  8. Populate "NA" in case of null/blank values for string columns.
  9. There should be no duplicate records in the summary table.
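A condensed Scala sketch of this load, assuming the zomato and dim_country tables above (with dim_country exposing country_code/country_name); the Indian-cuisine keyword list, the handling of blanks and the full-overwrite write mode are assumptions to be refined:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ZomatoSummaryLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("zomato_summary_load")
      .enableHiveSupport()
      .getOrCreate()
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    val zomato  = spark.table("zomato")
    val country = spark.table("dim_country")

    // assumed keyword list used to bucket cuisines into "Indian"
    val indianCuisines = Seq("indian", "andhra", "goan", "hyderabadi", "mughlai",
      "rajasthani", "bengali", "chettinad", "kerala", "punjabi")
    val isIndian = indianCuisines
      .map(k => lower(col("cuisines")).contains(k))
      .reduce(_ || _)

    val summary = zomato
      // rule 7: drop restaurants with NULL/blank cuisines
      .filter(col("cuisines").isNotNull && trim(col("cuisines")) =!= "")
      .join(country, Seq("country_code"), "left")
      // rule 5: derive the rating colour from the aggregate rating
      .withColumn("m_rating_colour",
        when(col("aggregate_rating").between(1.9, 2.4), "Red")
          .when(col("aggregate_rating").between(2.5, 3.4), "Amber")
          .when(col("aggregate_rating").between(3.5, 3.9), "Light Green")
          .when(col("aggregate_rating").between(4.0, 4.4), "Green")
          .when(col("aggregate_rating").between(4.5, 5.0), "Gold")
          .otherwise("NA"))
      // rule 6: Indian vs World Cuisines
      .withColumn("m_cuisines", when(isIndian, "Indian").otherwise("World Cuisines"))
      // partition columns
      .withColumn("p_filedate", col("filedate"))
      .withColumn("p_country_name", coalesce(col("country_name"), lit("NA")))
      // rule 4: audit columns
      .withColumn("create_datetime", current_timestamp())
      .withColumn("user_id", lit(System.getProperty("user.name")))
      .drop("filedate", "country_name")
      // rule 8: NA for null string columns (blank values would need an extra pass)
      .na.fill("NA")
      // rule 9: no duplicate records
      .dropDuplicates()

    // rule 3: ORC, partitioned by p_filedate and p_country_name
    summary.write
      .mode("overwrite")
      .format("orc")
      .partitionBy("p_filedate", "p_country_name")
      .saveAsTable("zomato_summary")

    spark.stop()
  }
}
```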
- The Spark application should be able to perform the following load strategies (a sketch follows this list):
  - Manual: load the data for a particular filedate and country_name.
  - Historical: load the data historically for all filedates and country_names.
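One way to express the two strategies, assuming the summary DataFrame from the previous sketch already carries p_filedate and p_country_name; the argument layout is a hypothetical CLI contract, not something the document fixes:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical arguments:
//   manual <filedate> <country_name>  -> rebuild a single partition
//   historical                        -> rebuild everything
def applyLoadStrategy(summary: DataFrame, args: Array[String]): DataFrame =
  args.toList match {
    case "manual" :: filedate :: countryName :: Nil =>
      summary.filter(col("p_filedate") === filedate && col("p_country_name") === countryName)
    case "historical" :: Nil =>
      summary // full history, no filter
    case other =>
      sys.error(s"unsupported load strategy: ${other.mkString(" ")}")
  }
```

In manual mode the write would then typically set spark.sql.sources.partitionOverwriteMode=dynamic so that only the selected partition is replaced, rather than overwriting the whole table.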
- Create a shell script wrapper to execute the complete flow as given below
- The Spark application and shell script should be parameterized (dbname, tablename, filters, arguments, etc.)
- Check if another instance of the same application is already running:
  1. Exit and send a notification if an instance is already running or the previous application run failed.
- The user should be able to execute each module separately AND all the modules together
- Module 1: Call the application that converts JSON files to CSV.
  1. Capture logs in a file.
  2. Check the execution status.
  3. If failed, add a failure entry into the log table, send a failure notification, and exit.
  4. If passed, add a success entry into the log table and move to the next step.
- Module 2: Execute the command to load the CSV files into the Hive external/managed tables (a new partition should be created whenever a new file with a new filedate is loaded; see the sketch after this item).
  1. Capture logs in a file.
  2. Check the execution status.
  3. If failed, add a failure entry into the log table, send a failure notification, and exit.
  4. If passed, add a success entry into the log table and move to the next step.
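A sketch of the partition load Module 2 could trigger, again via spark.sql to stay in one language (in practice this could equally be a beeline/Hive statement); the filedate value and the <HDFS LOCATION> path are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zomato_load_partition").enableHiveSupport().getOrCreate()

// Illustrative value; the wrapper would pass this in as a parameter
val filedate = "20190609"

// Moves the staged CSV into the matching partition, creating it if needed
spark.sql(s"""
  LOAD DATA INPATH '<HDFS LOCATION>/zomato_$filedate.csv'
  INTO TABLE zomato
  PARTITION (filedate = '$filedate')
""")
```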
- Module 3: Call the Spark application to load the zomato summary table.
  1. Capture logs in a file.
  2. Check the execution status.
  3. If failed, add a failure entry into the log table, send a failure notification, and exit.
  4. If passed, add a success entry into the log table and send a final notification.
- Purge the last 7 days of logs from the log directory
- Write a beeline command to insert log details for each successful and unsuccessful execution, containing the below details (a sketch of the underlying INSERT follows this item):
  1. Job ID
  2. Job Step
  3. Spark submit that got triggered
  4. Job Start time
  5. Job End time
  6. Job status
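The statement behind that beeline call could look like the following; it is shown through spark.sql only to keep the examples in one language, and every value is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zomato_log_entry").enableHiveSupport().getOrCreate()

// Placeholder values; the wrapper would normally assemble this statement and
// pass it to `beeline -e "..."` with the real job details
spark.sql("""
  INSERT INTO zomato_summary_log
  SELECT '20190609_01',
         'module_3_spark_summary_load',
         'spark-submit --class ZomatoSummaryLoad zomato_etl.jar historical',
         current_timestamp(),
         current_timestamp(),
         'SUCCESS'
""")
```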
- Once the execution for the 3 JSON files is completed, move these files into the archive folder
- Add a new source file into the source folder and execute the complete workflow again
- Schedule the job to run daily at 01:00 AM using crontab
- Execute the complete job and perform unit testing to check the complete ETL flow and data-loading anomalies
- Document execution statistics (Start time, End time, Total time taken for execution, No. of executors, No. of cores, Driver memory) along with application URLs
- Create a unit test case document and report bugs/observations.
- Standard coding guidelines, e.g.:
  - Variable names: all lower case, with individual words separated by an underscore.
  - Add comments in code for each function/procedure.
  - Code alignment and indentation should be proper.
  - Perform exception handling.
  - Built-in functions, code, locations, table and database names, etc. should be in lower case.
- Folder Structure on Linux:
  - zomato_etl
    - source
      - json
      - csv
      - archive
    - hive
      - ddl
      - dml
    - spark
      - jars
      - scala
    - script (shell scripts and property files)
    - logs (log_ddmmyyyy_hhmm.log)
  - source
    - zomato_raw_files
- Folder Structure on HDFS:
  - zomato_etl_<username>
    - log
    - zomato_ext
      - zomato
      - dim_country