This project is a proof-of-concept ETL pipeline built around a Kaggle dataset of salary information for people living in San Francisco. The dataset is a CSV file holding personal information such as name and salary, and is extracted from local DBFS. The raw data is batch-loaded into a staging area, where two new columns are added: ingest date and ingest time. The staged data is then cleaned and analyzed with various Python libraries. Finally, the cleaned data is loaded into a SQL database (SQLite, queried through DBeaver) and query operations are run against it.

Tech stack: Azure Databricks, DBeaver/SQLite
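A minimal sketch of the batch-ingest step on Databricks, where a `spark` session is predefined in every notebook. The DBFS paths (`/FileStore/tables/salaries.csv`, `/FileStore/staging/salaries`) are illustrative placeholders, not the project's actual locations:

```python
from pyspark.sql import functions as F

# Read the raw CSV from DBFS; first row holds column names,
# and Spark infers column types from the data.
raw_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("dbfs:/FileStore/tables/salaries.csv")
)

# Stamp every row with the date and time it was ingested.
staged_df = (
    raw_df
    .withColumn("ingest_date", F.current_date())
    .withColumn("ingest_time", F.date_format(F.current_timestamp(), "HH:mm:ss"))
)

# Land the stamped data in the staging area.
staged_df.write.mode("overwrite").parquet("dbfs:/FileStore/staging/salaries")
```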
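A hedged sketch of the cleaning and analysis step using pandas. The column names (`EmployeeName`, `JobTitle`, `TotalPay`) follow the public Kaggle SF Salaries schema and are an assumption about this project's copy of the data:

```python
import pandas as pd

# The POC dataset is small, so collecting the staged Spark frame
# into pandas for cleaning is practical.
df = staged_df.toPandas()

# Coerce the pay column to numeric; non-numeric strings such as
# "Not Provided" become NaN.
df["TotalPay"] = pd.to_numeric(df["TotalPay"], errors="coerce")

# Drop exact duplicates and rows missing the fields we analyze.
df = df.drop_duplicates().dropna(subset=["EmployeeName", "JobTitle", "TotalPay"])

# Example analysis: average total pay per job title.
avg_pay = (
    df.groupby("JobTitle")["TotalPay"]
      .mean()
      .sort_values(ascending=False)
)
print(avg_pay.head(10))
```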
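A minimal sketch of the load-and-query step, assuming the cleaned frame is written to a local SQLite file (`salaries.db`) that DBeaver then connects to; the database file and table name are illustrative:

```python
import sqlite3

conn = sqlite3.connect("salaries.db")

# Write the cleaned data; replace the table if it already exists.
df.to_sql("sf_salaries", conn, if_exists="replace", index=False)

# Example query: the five highest-paid job titles on average.
cursor = conn.execute(
    """
    SELECT JobTitle, AVG(TotalPay) AS avg_pay
    FROM sf_salaries
    GROUP BY JobTitle
    ORDER BY avg_pay DESC
    LIMIT 5
    """
)
for row in cursor.fetchall():
    print(row)

conn.close()
```

The same queries can be run interactively in DBeaver by opening `salaries.db` as a SQLite connection.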