Getting my hands on an official Big Data Project
- External source onboarding to HDFS
- Download the Zomato restaurant data into the zomato_raw_files folder
- Convert the unstructured data into CSV data
- Copy the first three files from zomato_raw_files (Unix) to the zomato_etl/source/json folder
- Write an application to convert each JSON file into a CSV file with the suffix *_filedate.csv and store them in the zomato_etl/source/csv/ (Unix) folder (see the Scala sketch after the field list below)
For example:
json1 -> zomato_20190609.csv
json2 -> zomato_20190610.csv
json3 -> zomato_20190611.csv
json4 -> zomato_20190612.csv
json5 -> zomato_20190613.csv
- Note: "zomato_*.csv" should have the below fields only:
  - Restaurant ID
  - Restaurant Name
  - Country Code
  - City
  - Address
  - Locality
  - Locality Verbose
  - Longitude
  - Latitude
  - Cuisines
  - Average Cost for two
  - Currency
  - Has Table booking
  - Has Online delivery
  - Is delivering now
  - Switch to order menu
  - Price range
  - Aggregate rating
  - Rating text
  - Votes
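The JSON-to-CSV application can be sketched in Spark/Scala as below. The nested field paths (restaurants[].restaurant.*, location.*, user_rating.*) are assumptions about the raw Zomato export, not something this document specifies, so they would need to be adjusted to the actual files:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ZomatoJsonToCsv {
  def main(args: Array[String]): Unit = {
    // Hypothetical usage: ZomatoJsonToCsv <input_json_path> <output_dir> <filedate>
    val Array(inputJson, outputDir, filedate) = args

    val spark = SparkSession.builder().appName("zomato_json_to_csv").getOrCreate()

    // multiLine because each raw file is assumed to hold one large JSON document
    val raw = spark.read.option("multiLine", true).json(inputJson)

    // keep only the 20 fields required by the "zomato_*.csv" layout above
    val restaurants = raw
      .select(explode(col("restaurants")).as("r"))
      .select(
        col("r.restaurant.id").as("restaurant_id"),
        col("r.restaurant.name").as("restaurant_name"),
        col("r.restaurant.location.country_id").as("country_code"),
        col("r.restaurant.location.city").as("city"),
        col("r.restaurant.location.address").as("address"),
        col("r.restaurant.location.locality").as("locality"),
        col("r.restaurant.location.locality_verbose").as("locality_verbose"),
        col("r.restaurant.location.longitude").as("longitude"),
        col("r.restaurant.location.latitude").as("latitude"),
        col("r.restaurant.cuisines").as("cuisines"),
        col("r.restaurant.average_cost_for_two").as("average_cost_for_two"),
        col("r.restaurant.currency").as("currency"),
        col("r.restaurant.has_table_booking").as("has_table_booking"),
        col("r.restaurant.has_online_delivery").as("has_online_delivery"),
        col("r.restaurant.is_delivering_now").as("is_delivering_now"),
        col("r.restaurant.switch_to_order_menu").as("switch_to_order_menu"),
        col("r.restaurant.price_range").as("price_range"),
        col("r.restaurant.user_rating.aggregate_rating").as("aggregate_rating"),
        col("r.restaurant.user_rating.rating_text").as("rating_text"),
        col("r.restaurant.votes").as("votes")
      )

    // coalesce(1) so a single part file can later be renamed to zomato_<filedate>.csv
    // by the wrapper script
    restaurants.coalesce(1)
      .write.mode("overwrite")
      .option("header", true)
      .csv(s"$outputDir/zomato_$filedate")

    spark.stop()
  }
}
```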
- Creation of External/Internal Hive tables
- Move these CSV files from the zomato_etl/source/csv folder to HDFS <HDFS LOCATION>
- Create an External Table named "zomato" partitioned by filedate and load zomato_<filedate>.csv into the respective partition (a DDL sketch follows this item):
  1. The table should have all columns as per the CSV file.
  2. A new partition should be created whenever a new file arrives.
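A minimal sketch of the DDL, written through spark.sql so all the examples stay in Scala; the column types (all STRING here), the delimiter and the <HDFS LOCATION> placeholder are assumptions to be adjusted:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zomato_ddl").enableHiveSupport().getOrCreate()

// Illustrative DDL: column types, delimiter and location are assumptions
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS zomato (
    restaurant_id STRING, restaurant_name STRING, country_code STRING,
    city STRING, address STRING, locality STRING, locality_verbose STRING,
    longitude STRING, latitude STRING, cuisines STRING,
    average_cost_for_two STRING, currency STRING, has_table_booking STRING,
    has_online_delivery STRING, is_delivering_now STRING,
    switch_to_order_menu STRING, price_range STRING,
    aggregate_rating STRING, rating_text STRING, votes STRING
  )
  PARTITIONED BY (filedate STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '<HDFS LOCATION>/zomato_ext/zomato'
  TBLPROPERTIES ('skip.header.line.count' = '1')
""")
```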
- Create a Hive Managed Table named "dim_country" using the country_code.csv file as per the below details (a DDL sketch follows this item):
  1. The table should have all columns as per the CSV file.
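Likewise for the dimension table; the two-column layout of country_code.csv (code plus country name) and the load path are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dim_country_ddl").enableHiveSupport().getOrCreate()

// Illustrative managed-table DDL; adjust the columns to the actual country_code.csv
spark.sql("""
  CREATE TABLE IF NOT EXISTS dim_country (
    country_code STRING,
    country_name STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  TBLPROPERTIES ('skip.header.line.count' = '1')
""")

// Illustrative local path; point this at wherever country_code.csv actually lives
spark.sql("LOAD DATA LOCAL INPATH 'zomato_etl/source/csv/country_code.csv' INTO TABLE dim_country")
```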
- Create the zomato_summary_log table with the schema given below (a DDL sketch follows this item):
  1. Job ID
  2. Job Step
  3. Spark submit command
  4. Job Start time
  5. Job End time
  6. Job status
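And the job-audit table; column names follow the six fields above, the types and storage format are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zomato_log_ddl").enableHiveSupport().getOrCreate()

// Illustrative layout for the job-audit table
spark.sql("""
  CREATE TABLE IF NOT EXISTS zomato_summary_log (
    job_id               STRING,
    job_step             STRING,
    spark_submit_command STRING,
    job_start_time       TIMESTAMP,
    job_end_time         TIMESTAMP,
    job_status           STRING
  )
  STORED AS ORC
""")
```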
- Transformation using Hive and Spark
- Write a Spark application in Scala to load a summary table named "zomato_summary" and apply the following transformations (a Scala sketch follows this list):
  1. Create the "zomato_summary" table partitioned by p_filedate and p_country_name.
  2. The schema for the zomato_summary table is given below:
- Restaurant ID
- Restaurant Name
- Country Code
- City
- Address
- Locality
- Locality Verbose
- Longitude
- Latitude
- Cuisines
- Average Cost for two
- Currency
- Has Table booking
- Has Online delivery
- Is delivering now
- Switch to order menu
- Price range
- Aggregate rating
- Rating text
- Votes
- m_rating_colour
- m_cuisines
- p_filedate
- p_country_name
- create_datetime
- user_id
  3. Data in this Hive table should be in ORC format.
  4. Add audit columns "create_datetime" and "user_id" to the zomato_summary table.
  5. Derive a column "Rating Colour" (m_rating_colour) based on the rules listed below:

     | Rating Text | Aggregate Rating | m_rating_colour |
     | --- | --- | --- |
     | Poor | 1.9-2.4 | Red |
     | Average | 2.5-3.4 | Amber |
     | Good | 3.5-3.9 | Light Green |
     | Very Good | 4.0-4.4 | Green |
     | Excellent | 4.5-5.0 | Gold |

  6. Derive a column "m_cuisines" and map the Indian (Andhra, Goan, Hyderabadi, North Indian, etc.) cuisines to "Indian" and the rest of the cuisines to "World Cuisines".
  7. Filter out the restaurants with NULL/BLANK "Cuisines" values.
  8. Populate "NA" in case of null/blank values for string columns.
  9. There should be no duplicate records in the summary table.
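A condensed Scala sketch of this load, assuming the zomato and dim_country tables above (with dim_country exposing country_code/country_name); the Indian-cuisine keyword list, the handling of blanks and the full-overwrite write mode are assumptions to be refined:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ZomatoSummaryLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("zomato_summary_load")
      .enableHiveSupport()
      .getOrCreate()
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    val zomato  = spark.table("zomato")
    val country = spark.table("dim_country")

    // assumed keyword list used to bucket cuisines into "Indian"
    val indianCuisines = Seq("indian", "andhra", "goan", "hyderabadi", "mughlai",
      "rajasthani", "bengali", "chettinad", "kerala", "punjabi")
    val isIndian = indianCuisines
      .map(k => lower(col("cuisines")).contains(k))
      .reduce(_ || _)

    val summary = zomato
      // rule 7: drop restaurants with NULL/blank cuisines
      .filter(col("cuisines").isNotNull && trim(col("cuisines")) =!= "")
      .join(country, Seq("country_code"), "left")
      // rule 5: derive the rating colour from the aggregate rating
      .withColumn("m_rating_colour",
        when(col("aggregate_rating").between(1.9, 2.4), "Red")
          .when(col("aggregate_rating").between(2.5, 3.4), "Amber")
          .when(col("aggregate_rating").between(3.5, 3.9), "Light Green")
          .when(col("aggregate_rating").between(4.0, 4.4), "Green")
          .when(col("aggregate_rating").between(4.5, 5.0), "Gold")
          .otherwise("NA"))
      // rule 6: Indian vs World Cuisines
      .withColumn("m_cuisines", when(isIndian, "Indian").otherwise("World Cuisines"))
      // partition columns
      .withColumn("p_filedate", col("filedate"))
      .withColumn("p_country_name", coalesce(col("country_name"), lit("NA")))
      // rule 4: audit columns
      .withColumn("create_datetime", current_timestamp())
      .withColumn("user_id", lit(System.getProperty("user.name")))
      .drop("filedate", "country_name")
      // rule 8: NA for null string columns (blank values would need an extra pass)
      .na.fill("NA")
      // rule 9: no duplicate records
      .dropDuplicates()

    // rule 3: ORC, partitioned by p_filedate and p_country_name
    summary.write
      .mode("overwrite")
      .format("orc")
      .partitionBy("p_filedate", "p_country_name")
      .saveAsTable("zomato_summary")

    spark.stop()
  }
}
```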
- The Spark application should be able to perform the following load strategies (a sketch follows this list):
  - Manual: load the data for a particular filedate and country_name.
  - Historical: load the data historically for all filedates and country_names.
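One way to express the two strategies, assuming the summary DataFrame from the previous sketch already carries p_filedate and p_country_name; the argument layout is a hypothetical CLI contract, not something the document fixes:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical arguments:
//   manual <filedate> <country_name>  -> rebuild a single partition
//   historical                        -> rebuild everything
def applyLoadStrategy(summary: DataFrame, args: Array[String]): DataFrame =
  args.toList match {
    case "manual" :: filedate :: countryName :: Nil =>
      summary.filter(col("p_filedate") === filedate && col("p_country_name") === countryName)
    case "historical" :: Nil =>
      summary // full history, no filter
    case other =>
      sys.error(s"unsupported load strategy: ${other.mkString(" ")}")
  }
```

In manual mode the write would then typically set spark.sql.sources.partitionOverwriteMode=dynamic so that only the selected partition is replaced, rather than overwriting the whole table.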
- Create a shell script wrapper to execute the complete flow as given below
- The Spark application and shell script should be parameterized (dbname, tablename, filters, arguments, etc.)
- Check if another instance of the same application is already running:
  1. Exit and send a notification if an instance is already running or the previous application run failed.
- The user should be able to execute each module separately AND all the modules together
- Module 1: Call the application that converts JSON files to CSV.
  1. Capture logs in a file.
  2. Check the execution status.
  3. If failed, add a failure entry into the log table, send a failure notification, and exit.
  4. If passed, add a success entry into the log table and move to the next step.
- Module 2: Execute the command to load the CSV files into the Hive external/managed tables (a new partition should be created whenever a new file with a new filedate is loaded; see the sketch after this item).
  1. Capture logs in a file.
  2. Check the execution status.
  3. If failed, add a failure entry into the log table, send a failure notification, and exit.
  4. If passed, add a success entry into the log table and move to the next step.
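A sketch of the partition load Module 2 could trigger, again via spark.sql to stay in one language (in practice this could equally be a beeline/Hive statement); the filedate value and the <HDFS LOCATION> path are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zomato_load_partition").enableHiveSupport().getOrCreate()

// Illustrative value; the wrapper would pass this in as a parameter
val filedate = "20190609"

// Moves the staged CSV into the matching partition, creating it if needed
spark.sql(s"""
  LOAD DATA INPATH '<HDFS LOCATION>/zomato_$filedate.csv'
  INTO TABLE zomato
  PARTITION (filedate = '$filedate')
""")
```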
- Module 3: Call the Spark application to load the zomato summary table.
  1. Capture logs in a file.
  2. Check the execution status.
  3. If failed, add a failure entry into the log table, send a failure notification, and exit.
  4. If passed, add a success entry into the log table and send a final notification.
- Purge the last 7 days of logs from the log directory
- Write a beeline command to insert log details for each successful and unsuccessful execution, containing the below details (a sketch of the underlying INSERT follows this item):
  1. Job ID
  2. Job Step
  3. Spark submit that got triggered
  4. Job Start time
  5. Job End time
  6. Job status
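The statement behind that beeline call could look like the following; it is shown through spark.sql only to keep the examples in one language, and every value is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("zomato_log_entry").enableHiveSupport().getOrCreate()

// Placeholder values; the wrapper would normally assemble this statement and
// pass it to `beeline -e "..."` with the real job details
spark.sql("""
  INSERT INTO zomato_summary_log
  SELECT '20190609_01',
         'module_3_spark_summary_load',
         'spark-submit --class ZomatoSummaryLoad zomato_etl.jar historical',
         current_timestamp(),
         current_timestamp(),
         'SUCCESS'
""")
```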
- Once the execution for the 3 JSON files is completed, move these files into the archive folder
- Add a new source file into the source folder and execute the complete workflow again
- Schedule the job to run daily at 01:00 AM using crontab
- Execute the complete job and perform unit testing to check the complete ETL flow and data-loading anomalies
- Document execution statistics (Start time, End time, Total time taken for execution, No. of executors, No. of cores, Driver memory) along with application URLs
- Create a unit test case document and report bugs/observations.
- Standard coding guidelines, e.g.:
  - Variable names: all lower case, with individual words separated by an underscore.
  - Add comments in code for each function/procedure.
  - Code alignment and indentation should be proper.
  - Perform exception handling.
  - Built-in functions, code, locations, table and database names, etc. should be in lower case.
- Folder Structure on Linux:
  - zomato_etl
    - source
      - json
      - csv
      - archive
    - hive
      - ddl
      - dml
    - spark
      - jars
      - scala
    - script (shell scripts and property files)
    - logs (log_ddmmyyyy_hhmm.log)
  - source
    - zomato_raw_files
- Folder Structure on HDFS:
  - zomato_etl_<username>
    - log
    - zomato_ext
      - zomato
      - dim_country