- Data Engineer Role: My objective is to curate accurate, analysis-ready datasets for advanced analytics.
- Data Scientist Role: I apply statistical methods to analyze data in depth and provide predictions that inform decisions.
This project combines Data Engineering and Data Science skill sets. I used an Azure-based pipeline for the ETL process; after the data is cleaned and transformed, it is passed to R for further analysis.
- Azure Data Factory: For Data Integration
- Data Lake Gen 2: Storage for Raw & Transformed Data
- Azure Databricks: Data Transformations
- Azure Synapse Analytics: Advanced Analytics
- R Studio: Advanced Data Science Analysis
- Download the dataset from Kaggle.
- Store the data in this GitHub repository.
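
For reference, here is a minimal sketch of the download step using the official Kaggle API client; the dataset slug and local path are placeholders rather than the actual values used in this project.

```python
# Minimal sketch: download a dataset with the Kaggle API client.
# Assumes ~/.kaggle/kaggle.json is configured; the slug and path are placeholders.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_download_files(
    "owner/dataset-slug",  # placeholder: replace with the actual Kaggle dataset
    path="data/raw",
    unzip=True,
)
```
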
- Use Azure Data Factory to ingest the raw data from the source.
- Copy the raw data into Azure Data Lake Gen 2.
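
In this project the copy into the lake is handled by a Data Factory pipeline. Purely as an illustration of what that step produces, the sketch below uploads local raw files into Data Lake Gen 2 with the Azure SDK; the storage account, credential, container, and folder names are placeholders.

```python
# Illustrative only: upload local raw files into Azure Data Lake Gen 2.
# The project itself performs this copy with an Azure Data Factory pipeline.
import os
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=os.environ["AZURE_STORAGE_KEY"],  # account key or SAS (placeholder)
)
filesystem = service.get_file_system_client("<container>")  # placeholder container

for name in os.listdir("data/raw"):
    file_client = filesystem.get_file_client(f"raw-data/{name}")  # placeholder folder
    with open(os.path.join("data/raw", name), "rb") as f:
        file_client.upload_data(f, overwrite=True)
```
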
- Set up an environment that bridges Azure Data Lake Gen 2 and Azure Databricks.
- Review the codebase here.
- Note: Ensure the final data is loaded back into Azure Data Lake Gen 2 under the "transformed-data" folder.
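
Below is a minimal Databricks (PySpark) sketch of this stage, assuming a service principal with access to the storage account. The client ID, secret scope, tenant, account, container, and file names are placeholders, and the clean-up step shown is only a stand-in; the real transformations live in the linked codebase.

```python
# Databricks notebook sketch (PySpark): mount ADLS Gen 2, read raw data,
# apply a placeholder transformation, and write the result to transformed-data.
# `dbutils` and `spark` are provided by the Databricks runtime.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("<scope>", "<secret-name>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/tokyo",
    extra_configs=configs,
)

# Placeholder transformation: drop duplicates and empty rows from one raw file.
raw = spark.read.csv("/mnt/tokyo/raw-data/<file>.csv", header=True, inferSchema=True)
clean = raw.dropDuplicates().na.drop()

(clean.repartition(1)
      .write.mode("overwrite")
      .option("header", "true")
      .csv("/mnt/tokyo/transformed-data/<file>"))
```
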
The transformed data can then be loaded into any preferred environment, such as Azure Synapse Analytics or R Studio.
- Generate an access token and note the Blob storage endpoint.
- Connect to the 'tokyodatasources' container within the Azure Blob storage endpoint.
- Load the data into R Studio and proceed with analysis.
- Review the codebase here.
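
The R code for this step is in the linked codebase. As an illustration of the same connection pattern (endpoint, token, container, file), here is a sketch using the Azure Blob SDK for Python; the storage account, SAS token, and file name are placeholders.

```python
# Illustration of the connection pattern used in this step (endpoint + token +
# container + file); the project performs the actual analysis in R.
# The account name, SAS token, and file name are placeholders.
import pandas as pd
from azure.storage.blob import ContainerClient

container = ContainerClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    container_name="tokyodatasources",
    credential="<sas-token>",
)

# Pull one transformed file down and load it for analysis.
downloader = container.download_blob("transformed-data/<file>.csv")
with open("transformed.csv", "wb") as out:
    out.write(downloader.readall())

df = pd.read_csv("transformed.csv")
print(df.describe())
```
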
Note: I did not use Azure Synapse Analytics as my primary analysis tool for this project; however, here is a brief guide based on my knowledge of it.
- Why use a Serverless SQL Pool?

- Pay-per-Query: Costs are based only on actual usage.
- Direct Data Analysis: Directly query large datasets.
- Versatile Data Format Support: Handles Parquet, CSV, JSON, etc.
- Familiarity: Employs T-SQL for querying.
- Integrated with Azure Synapse Studio: Aids in data exploration and visualization.
- Security: Incorporates Azure Active Directory authentication.
- Use Case: Best suited for ad-hoc data exploration and analytics.
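
To make the "direct data analysis" point concrete, here is a sketch of an ad-hoc serverless query over files in the lake, run from Python via pyodbc; the workspace name, credentials, storage account, container, and path are all placeholders.

```python
# Sketch: ad-hoc query against files in the data lake via a Synapse serverless
# SQL pool. Workspace, credentials, storage account, and paths are placeholders.
import pyodbc

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/transformed-data/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS rows;
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "UID=<sql-admin-user>;PWD=<password>;Encrypt=yes;"
)

for row in conn.cursor().execute(query):
    print(row)
```
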
- Construct a Lake Database, given that the dataset resides on Azure Data Lake Gen 2.
- Create an external table from the Data Lake. You'll need to:
- Name the external table.
- Create a linked service to your storage account.
- Specify the input file or folder.
- Define the source file format settings. Be attentive to headers.
- Configure General, Columns & Relationships based on your dataset.
- Finally, validate and publish your configurations.
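
For reference, here is a rough T-SQL equivalent of the wizard steps above, executed with the same pyodbc connection pattern as the earlier sketch; the data source, file format, table name, columns, and paths are all placeholders to adapt to your dataset.

```python
# Rough T-SQL equivalent of the external-table wizard steps, executed from Python.
# Every name, column, and path below is a placeholder for your own dataset.
import pyodbc

statements = [
    """CREATE EXTERNAL DATA SOURCE TokyoLake
           WITH (LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>')""",
    """CREATE EXTERNAL FILE FORMAT CsvWithHeader
           WITH (FORMAT_TYPE = DELIMITEDTEXT,
                 FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2))""",
    """CREATE EXTERNAL TABLE dbo.TransformedData (
           Col1 VARCHAR(200),
           Col2 VARCHAR(200)
       ) WITH (
           LOCATION = 'transformed-data/',
           DATA_SOURCE = TokyoLake,
           FILE_FORMAT = CsvWithHeader
       )""",
]

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=<database>;"          # an existing serverless database, not master
    "UID=<sql-admin-user>;PWD=<password>;Encrypt=yes;"
)
conn.autocommit = True              # avoid wrapping the DDL in an implicit transaction

cursor = conn.cursor()
for ddl in statements:
    cursor.execute(ddl)
```
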
