
Lake House Architecture with Azure

Problem Statement

This project demonstrates the implementation of a Lake House Architecture using various Azure services. Data is extracted from an on-premises SQL Server database by Azure Data Factory, which orchestrates the movement and transformation process. The data lands in Azure Data Lake Storage Gen2, where it is transformed with Azure Databricks, served for analysis through Azure Synapse Analytics, and finally visualized in Microsoft Power BI. The ETL pipeline is automated with Azure Data Factory triggers and has passed end-to-end testing.

Project Objective, Plan and Approach

  • Objective

The primary goal of this project is to implement a robust data pipeline using Azure services, following the Lake House Architecture. This architecture will facilitate seamless data ingestion, transformation, and visualization, providing a scalable solution for big data analytics and business intelligence.

  • Project Plan

      1. Set Up the On-Premises SQL Server Database
      2. Ingest Data to Azure Data Lake Storage Gen2 using Azure Data Factory
      3. Transform Data in Azure Databricks
      4. Load Transformed Data to Azure Synapse Analytics
      5. Visualize Data using Power BI
      6. Automate Pipelines and Ensure End-to-End Data Flow
    
  • Approach

      Data Ingestion: Use Azure Data Factory to copy data from an on-premises SQL Server database to Azure Data Lake Storage Gen2 (Bronze Layer).
      Data Transformation: Utilize Azure Databricks for transforming and cleansing data, moving it from the Bronze to Silver and then to the Gold Layer.
      Data Analytics: Use Azure Synapse Analytics to create views and queries for data analysis.
      Data Visualization: Integrate Power BI with Azure Synapse Analytics for real-time data visualization.
      Automation and Monitoring: Automate data pipelines and set up triggers to ensure continuous data flow and synchronization.
    
  • Tools and Services Used

      1. Azure Data Factory: Orchestrates data movement and transformation.
      2. Azure Data Lake Storage Gen2: Stores raw and transformed data.
      3. Azure Databricks: Performs data transformation and big data analytics.
      4. Azure Synapse Analytics: Provides advanced data analytics and SQL capabilities.
      5. Azure Key Vault: Secures sensitive information such as passwords and secrets.
      6. Azure Active Directory (AAD): Manages identities and access control.
      7. Microsoft Power BI: Visualizes data and provides business intelligence.
    

Project Steps

Step 1: Set Up the On-Premises SQL Server

Prepare the on-premises SQL Server database for data extraction.

  • Restore Backup and Create Login:

      -- Login and user that Azure Data Factory will use to read the source tables
      CREATE LOGIN ks WITH PASSWORD = 'Migrationproject#login';
      CREATE USER ks FOR LOGIN ks;
    
  • Configuration Checks:

      Ensure SQL Server and Windows Authentication (mixed mode) is enabled.
      Ensure TCP port 1433 allows inbound connections (a quick connectivity check is sketched after this list).
      Add the user to the db_datareader role on the source database.
    
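A minimal connectivity check, run before wiring up Data Factory, can confirm the login, authentication mode, and firewall settings. This sketch is not part of the repository; it assumes the pyodbc package and the Microsoft ODBC Driver 17 for SQL Server are installed, and the server and database names are placeholders.

      import pyodbc

      # Placeholders: replace with the actual host name and restored database.
      conn = pyodbc.connect(
          "DRIVER={ODBC Driver 17 for SQL Server};"
          "SERVER=<your_sql_server_host>,1433;"
          "DATABASE=<your_source_database>;"
          "UID=ks;PWD=Migrationproject#login;"
      )
      print(conn.cursor().execute("SELECT @@SERVERNAME, DB_NAME()").fetchone())
      conn.close()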


Step 2: Azure Key Vault Configuration

Securely store and manage sensitive information required for data access and transfer.

  • Assign Access to IAM:

      Grant the signed-in user the access needed to create secrets in Key Vault.
      Ensure the Azure AD user account has the "Key Vault Administrator" or "Key Vault Secrets Officer" role.
    
  • Create Secrets:

      Create secrets for the username and password in Azure Key Vault.
    
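The same step can be scripted with the Azure SDK for Python instead of the portal. This is a sketch only; the vault URL and secret names are assumptions, and the identity running it needs one of the Key Vault roles above.

      from azure.identity import DefaultAzureCredential
      from azure.keyvault.secrets import SecretClient

      # Authenticate as the signed-in Azure AD identity.
      credential = DefaultAzureCredential()
      client = SecretClient(
          vault_url="https://<your-key-vault-name>.vault.azure.net",
          credential=credential,
      )

      # Secret names are placeholders; the ADF linked service references them later.
      client.set_secret("sqlserver-username", "ks")
      client.set_secret("sqlserver-password", "<on-premises SQL Server password>")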

Step 3: Azure Data Factory Configuration

Use Azure Data Factory (ADF) to copy data from the on-premises SQL Server to Azure Data Lake Storage Gen2.

  • Install Self-Hosted Integration Runtime (SHIR):

      Install SHIR on the machine hosting the on-premises SQL Server database (or on a machine that can reach it).
      Under Data Factory > Manage > Integration Runtimes, create a new "Azure Self-Hosted" runtime.
    


  • Create Data Pipeline:

      Author a pipeline with a Copy Data activity to ingest data from the on-premises SQL Server into Azure Data Lake Storage Gen2 (Bronze Layer).
      Configure the linked services to read the SQL credentials from the Azure Key Vault secrets created in Step 2.
    
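After a successful run, a short script can confirm that the Copy Data activity landed one folder per table in the bronze container. This is a sketch, not repository code; it assumes the azure-identity and azure-storage-file-datalake packages, and that the bronze container holds a SalesLT folder as in the later steps.

      from azure.identity import DefaultAzureCredential
      from azure.storage.filedatalake import DataLakeServiceClient

      service = DataLakeServiceClient(
          account_url="https://migrationstorageks.dfs.core.windows.net",
          credential=DefaultAzureCredential(),
      )

      # List the per-table folders created under bronze/SalesLT.
      bronze = service.get_file_system_client("bronze")
      for path in bronze.get_paths("SalesLT"):
          print(path.name)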


Step 4: Configure Azure Data Lake Storage Gen2

Give Azure Databricks access to the Azure Data Lake (via Spark configuration or a mount point) so it can read and process the data.

  • Configure Data Lake Access in Azure Databricks:

      # OAuth (service principal) access to the ADLS Gen2 account through the ABFS driver.
      # The storage account is named "migrationstorageks" elsewhere in the project, so the same
      # name is used here; the client id, secret, tenant id, and account key are placeholders.
      storage_account_name = "migrationstorageks"
      storage_account_access_key = "<your_storage_account_access_key>"

      configs = {
          f"fs.azure.account.auth.type.{storage_account_name}.dfs.core.windows.net": "OAuth",
          f"fs.azure.account.oauth.provider.type.{storage_account_name}.dfs.core.windows.net":
              "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          f"fs.azure.account.oauth2.client.id.{storage_account_name}.dfs.core.windows.net": "<your_client_id>",
          f"fs.azure.account.oauth2.client.secret.{storage_account_name}.dfs.core.windows.net": "<your_client_secret>",
          f"fs.azure.account.oauth2.client.endpoint.{storage_account_name}.dfs.core.windows.net":
              "https://login.microsoftonline.com/<your_tenant_id>/oauth2/token",
      }

      for k, v in configs.items():
          spark.conf.set(k, v)

      # Account-key access is an alternative to OAuth; only one of the two is required.
      spark.conf.set(
          f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
          storage_account_access_key,
      )
    
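If an actual mount point is preferred (so later notebooks can use paths such as /mnt/bronze), a sketch along the following lines would work. The secret scope name, secret name, and mount points are assumptions; dbutils.secrets.get keeps the service principal's client secret out of the notebook by reading it from a Key Vault-backed scope.

      # Assumed: a Databricks secret scope "keyvault-scope" backed by the Key Vault from Step 2,
      # holding the service principal's client secret under the name "databricks-app-secret".
      client_secret = dbutils.secrets.get(scope="keyvault-scope", key="databricks-app-secret")

      oauth_configs = {
          "fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id": "<your_client_id>",
          "fs.azure.account.oauth2.client.secret": client_secret,
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<your_tenant_id>/oauth2/token",
      }

      # One mount per layer; repeat for the silver and gold containers.
      dbutils.fs.mount(
          source="abfss://bronze@migrationstorageks.dfs.core.windows.net/",
          mount_point="/mnt/bronze",
          extra_configs=oauth_configs,
      )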

Step 5: Data Transformation in Azure Databricks

Transform and cleanse data, moving it through the Bronze, Silver, and Gold layers.

  • Transform Data from Bronze to Silver Layer:

      from pyspark.sql.functions import from_utc_timestamp, date_format
      from pyspark.sql.types import TimestampType

      # bronzePath, silverPath, and table_name are defined earlier in the notebook
      # (a sketch of these variables follows this list).
      for i in table_name:
          path = bronzePath + i + '/' + i + '.parquet'
          df = spark.read.format('parquet').load(path)

          # Normalise every date/datetime column to a yyyy-MM-dd string.
          for col in df.columns:
              if "Date" in col or "date" in col:
                  df = df.withColumn(col, date_format(from_utc_timestamp(df[col].cast(TimestampType()), "UTC"), "yyyy-MM-dd"))

          output_path = silverPath + i + '/'
          df.write.format('delta').mode("overwrite").save(output_path)
    
  • Transform Data from Silver to Gold Layer:

      # Rename CamelCase columns (e.g. SalesOrderID -> Sales_Order_ID) before writing the gold layer.
      for name in table_name:
          path = silverPath + name
          df = spark.read.format('delta').load(path)

          for old_col_name in df.columns:
              # Insert an underscore before each uppercase letter that does not follow
              # another uppercase letter, then strip any leading underscore.
              new_col_name = "".join(
                  "_" + char if char.isupper() and i > 0 and not old_col_name[i - 1].isupper() else char
                  for i, char in enumerate(old_col_name)
              ).lstrip("_")
              df = df.withColumnRenamed(old_col_name, new_col_name)

          output_path = goldPath + name + '/'
          df.write.format('delta').mode("overwrite").save(output_path)
    
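Both loops above rely on a few variables defined elsewhere in the notebook. A minimal sketch of how they might look, assuming the three layers are mounted as /mnt/bronze, /mnt/silver, and /mnt/gold and that each source table landed in its own folder under SalesLT:

      # Layer paths (mount points are an assumption; abfss:// URIs work equally well).
      bronzePath = "/mnt/bronze/SalesLT/"
      silverPath = "/mnt/silver/SalesLT/"
      goldPath = "/mnt/gold/SalesLT/"

      # The ADF copy activity created one folder per table, so the folder names double as the table list.
      table_name = [f.name.rstrip("/") for f in dbutils.fs.ls(bronzePath)]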


Step 6: Configure Azure Synapse Analytics

Expose the transformed (gold-layer) data through Azure Synapse Analytics for advanced analytics and SQL querying.

  • Create a Serverless SQL Database.

  • Create a Stored Procedure to Dynamically Create Views:

      USE gold_db
      GO

      -- Creates (or refreshes) a serverless view over one Delta table in the gold layer.
      CREATE OR ALTER PROC CreateSQLServerlessView_gold
          @ViewName NVARCHAR(100)
      AS
      BEGIN
          DECLARE @statement NVARCHAR(MAX)

          SET @statement = N'CREATE OR ALTER VIEW ' + @ViewName + ' AS
              SELECT *
              FROM OPENROWSET(
                  BULK ''https://migrationstorageks.dfs.core.windows.net/gold/SalesLT/' + @ViewName + '/'',
                  FORMAT = ''DELTA''
              ) AS [result]'

          EXEC (@statement)
      END
      GO
    
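The procedure is typically invoked once per gold table, either from a Synapse pipeline or from a short script. The following Python sketch (not repository code) assumes pyodbc with the Microsoft ODBC Driver 17 for SQL Server and Azure AD interactive sign-in; the serverless endpoint name is a placeholder and the table list is illustrative.

      import pyodbc

      # Serverless SQL endpoint of the Synapse workspace (placeholder name).
      conn = pyodbc.connect(
          "DRIVER={ODBC Driver 17 for SQL Server};"
          "SERVER=<your-synapse-workspace>-ondemand.sql.azuresynapse.net;"
          "DATABASE=gold_db;"
          "Authentication=ActiveDirectoryInteractive;UID=<your_aad_user>;"
      )
      conn.autocommit = True  # avoid wrapping the DDL in a client-side transaction

      # Illustrative table names; in practice, iterate over the folders in the gold container.
      for table in ["Address", "Customer", "Product", "SalesOrderHeader"]:
          conn.cursor().execute("{CALL CreateSQLServerlessView_gold (?)}", table)
      conn.close()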


Step 7: Integrate with Power BI

Visualize data using Power BI, providing business intelligence and insights.

  • Connect Power BI to Azure Synapse Analytics:

      Use the serverless SQL endpoint from Synapse Workspace properties.
      Load the data and design the Power BI dashboard as per the required KPIs.
    


Step 8: End-to-End Testing

Ensure the entire pipeline is automated and functioning correctly.

  • Automate Pipelines in Azure Data Factory:

      Configure a scheduled trigger to automate data ingestion and transformation.
    
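Alongside the scheduled trigger, a pipeline run can also be started and checked on demand, which is handy during testing. A sketch assuming the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, factory, and pipeline names are placeholders.

      from azure.identity import DefaultAzureCredential
      from azure.mgmt.datafactory import DataFactoryManagementClient

      adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

      # Trigger the ingestion/transformation pipeline and report its status.
      run = adf.pipelines.create_run("<resource-group>", "<data-factory-name>", "<pipeline-name>")
      status = adf.pipeline_runs.get("<resource-group>", "<data-factory-name>", run.run_id)
      print(status.status)  # Queued / InProgress / Succeeded / Failed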


  • Update Data and Verify:

      Insert new data into the SQL Server database.
      Verify the updated data in Power BI to ensure the pipeline runs successfully.
    
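For example, a small script can add a test row to one of the source tables before the next scheduled run. The connection string matches Step 1, and the table and column list assume the AdventureWorksLT sample schema; adjust both to whatever database was actually restored.

      import pyodbc

      conn = pyodbc.connect(
          "DRIVER={ODBC Driver 17 for SQL Server};"
          "SERVER=<your_sql_server_host>,1433;"
          "DATABASE=<your_source_database>;"
          "UID=ks;PWD=Migrationproject#login;"
      )

      # Assumes the SalesLT.Address table; rowguid and ModifiedDate use their column defaults.
      conn.cursor().execute(
          "INSERT INTO SalesLT.Address (AddressLine1, City, StateProvince, CountryRegion, PostalCode) "
          "VALUES (?, ?, ?, ?, ?)",
          "1 Test Street", "Testville", "Washington", "United States", "98052",
      )
      conn.commit()
      conn.close()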


This documentation provides a comprehensive overview of the steps and configurations required to implement the Lake House Architecture using Azure services.