This project demonstrates an end-to-end data engineering pipeline on Azure, designed around a hypothetical business scenario and built to deepen my understanding of data workflows.
The pipeline addresses a business need for insight into customer demographics and sales trends: data from an on-premises SQL Server database is ingested, transformed, and visualized in a Power BI dashboard. The dashboard presents key metrics on gender distribution and product category performance and supports exploration by date, product category, and gender.
The business identified a need for better insight into customer demographics, specifically gender distribution and its impact on purchasing behavior. Key objectives include:
- Sales by Gender and Product Category: Dashboard metrics for total sales, revenue, and customer demographics.
- Data Filtering: Flexible filtering options by product category, gender, and date.
- User-Friendly Interface: An intuitive dashboard for stakeholders to interact with key metrics.
To meet these needs, the project includes the following phases:
- Extract customer and sales data from an on-premises SQL Server.
- Load the data into Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF).
- Clean and process data using Azure Databricks, creating structured layers: Bronze (raw data), Silver (cleansed), and Gold (aggregated insights); the layer layout is sketched after this list.
- Load the transformed data into Azure Synapse Analytics.
- Use Power BI to create interactive dashboards, enabling business users to view sales and demographic insights.
- Automate the pipeline to refresh daily, ensuring up-to-date insights.
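As a concrete illustration of the Bronze/Silver/Gold layout, the sketch below assumes one ADLS Gen2 container per layer in a single storage account; the account name `medallionsa` and the `SalesLT` prefix are hypothetical choices, not requirements.

```python
# Hypothetical ADLS Gen2 layout for the medallion architecture.
# The storage account and container names are illustrative only.
STORAGE_ACCOUNT = "medallionsa"

def layer_path(layer: str, table: str) -> str:
    """Build the abfss URI for a table in a given medallion layer."""
    return (
        f"abfss://{layer}@{STORAGE_ACCOUNT}.dfs.core.windows.net/"
        f"SalesLT/{table}"
    )

bronze = layer_path("bronze", "Customer")  # raw copies from SQL Server
silver = layer_path("silver", "Customer")  # cleansed, typed records
gold = layer_path("gold", "SalesByGenderCategory")  # aggregated insights
```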
The solution is built from the following components:
- Azure Data Factory: Orchestration of data movement and transformation.
- Azure Data Lake Storage: Storage for raw and processed data.
- Azure Databricks: Data transformation and processing.
- Azure Synapse Analytics: Data warehousing and SQL-based analytics.
- Power BI: Data visualization and reporting.
- Azure Key Vault: Secure management of credentials.
- SQL Server: On-premises database as the data source.
To reproduce the project, you need:
- An Azure account with sufficient credits
- Access to an on-premises SQL Server database
The build proceeds step by step. First, prepare the environment:
- Create a resource group and provision the required Azure services (Data Factory, Data Lake Storage, Databricks, Synapse Analytics, Key Vault); a scripted alternative is sketched after this step.
- Set up SQL Server on-premises and restore the AdventureWorks sample database.
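Provisioning is usually done in the Azure portal; as a scripted alternative, here is a minimal sketch using `azure-identity` and `azure-mgmt-resource`. The subscription ID, resource group name, and region are placeholders, and the individual services can be created the same way or through the portal.

```python
# Minimal sketch: create the project's resource group with the Azure SDK.
# Subscription ID, group name, and region are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

rg = client.resource_groups.create_or_update(
    "rg-sales-pipeline",  # hypothetical resource group name
    {"location": "eastus2"},
)
print(rg.name, rg.location)
```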
Next, ingest the raw data:
- Use an ADF copy activity to land the source tables from SQL Server in the Bronze layer of ADLS (a conceptual sketch of one copy follows).
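The copy itself is configured in ADF Studio, with a self-hosted integration runtime reaching the on-premises server and credentials retrieved from Key Vault. Purely as a conceptual stand-in for what one copy activity does, the sketch below pulls a table with `pyodbc`/`pandas` and lands it as Parquet in Bronze; the connection string, account names, and the `adlfs`-backed `storage_options` are assumptions.

```python
# Conceptual stand-in for one ADF copy activity: SQL Server table -> Bronze.
# Server, database, credentials, and storage names are placeholders.
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksLT2022;"
    "UID=pipeline_user;PWD=<password>;TrustServerCertificate=yes;"
)
df = pd.read_sql("SELECT * FROM SalesLT.Customer", conn)

# Writing an abfss:// URI requires the adlfs package (resolved via fsspec).
df.to_parquet(
    "abfss://bronze@medallionsa.dfs.core.windows.net/SalesLT/Customer.parquet",
    storage_options={"account_name": "medallionsa", "account_key": "<key>"},
)
```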
Then transform the data:
- Configure Databricks to authenticate against ADLS.
- Clean and aggregate the data in Databricks notebooks, promoting it from Bronze to Silver to Gold (see the sketch after this step).
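In a Databricks notebook, ADLS access can be configured with a service principal whose secret sits in a Key Vault-backed secret scope; the scope name, key name, and table schemas below are assumptions, and `spark`/`dbutils` are the notebook's built-in globals.

```python
# Databricks notebook sketch: service-principal auth to ADLS Gen2,
# then a minimal Bronze -> Silver -> Gold pass. Names are placeholders.
from pyspark.sql import functions as F

account = "medallionsa"  # hypothetical storage account
sp_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")

spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", "<client-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", sp_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

def path(layer: str, table: str) -> str:
    return f"abfss://{layer}@{account}.dfs.core.windows.net/SalesLT/{table}"

# Bronze -> Silver: deduplicate and normalize the date column.
bronze = spark.read.parquet(path("bronze", "Customer.parquet"))
silver = (bronze.dropDuplicates(["CustomerID"])
                .withColumn("ModifiedDate", F.to_date("ModifiedDate")))
silver.write.mode("overwrite").format("delta").save(path("silver", "Customer"))

# Silver -> Gold: hypothetical sales aggregate by gender and category.
sales = spark.read.format("delta").load(path("silver", "Sales"))
gold = (sales.groupBy("Gender", "ProductCategory")
             .agg(F.sum("LineTotal").alias("TotalSales")))
gold.write.mode("overwrite").format("delta").save(path("gold", "SalesByGenderCategory"))
```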
Then serve the results:
- Load the aggregated Gold data into Azure Synapse Analytics (one loading option is sketched after this step).
- Connect Power BI to Synapse and build dashboards that visualize the business metrics.
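One way to load Gold tables into a Synapse dedicated SQL pool from Databricks is the built-in Azure Synapse connector, which stages data through ADLS; the JDBC URL, staging container, and table name below are placeholders. (Reading the Gold layer directly through Synapse serverless SQL views is an alternative.)

```python
# Sketch: write the Gold DataFrame to a Synapse dedicated SQL pool using
# the Databricks Azure Synapse connector. All names are placeholders.
jdbc_url = (
    "jdbc:sqlserver://<workspace>.sql.azuresynapse.net:1433;"
    "database=salesdw;user=loader;password=<password>;encrypt=true"
)

(gold.write
     .format("com.databricks.spark.sqldw")
     .option("url", jdbc_url)
     .option("dbTable", "gold.SalesByGenderCategory")
     .option("tempDir", "abfss://staging@medallionsa.dfs.core.windows.net/synapse")
     .option("forwardSparkAzureStorageCredentials", "true")
     .mode("overwrite")
     .save())
```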
Finally, automate and monitor the pipeline:
- Schedule the pipeline in ADF to run daily.
- Use ADF's monitoring tools to track the success of pipeline runs; runs can also be launched and polled programmatically, as sketched after this step.
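The sketch below uses the `azure-mgmt-datafactory` SDK to start a run and poll its status; the subscription, resource group, factory, and pipeline names are placeholders.

```python
# Sketch: trigger an ADF pipeline run and poll its status with the SDK.
# Subscription, resource group, factory, and pipeline names are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf.pipelines.create_run("rg-sales-pipeline", "adf-sales", "CopySqlToBronze")
while True:
    status = adf.pipeline_runs.get("rg-sales-pipeline", "adf-sales", run.run_id).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(30)
print(f"Pipeline finished with status: {status}")
```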
To close the loop, secure and validate the solution:
- Manage permissions through Microsoft Entra ID (formerly Azure Active Directory).
- Insert new rows into the source SQL Server database to verify the end-to-end flow and confirm that the Power BI report reflects the update (a test sketch follows).
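To exercise the full path, insert a few rows at the source and check that they surface in the Gold aggregates and the Power BI report after the next scheduled run. The table and columns below are a simplified, hypothetical stand-in; the real AdventureWorks tables have more required columns.

```python
# Sketch: insert a test row at the source to validate the daily refresh.
# The table and column names are a simplified, hypothetical stand-in.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksLT2022;"
    "UID=pipeline_user;PWD=<password>;TrustServerCertificate=yes;"
)
with conn.cursor() as cur:
    cur.execute(
        "INSERT INTO dbo.SalesTest (CustomerName, Gender, ProductCategory, LineTotal) "
        "VALUES (?, ?, ?, ?)",
        ("Test Customer", "F", "Bikes", 1499.99),
    )
conn.commit()
```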
This project provides a structured pipeline for analyzing customer demographics and their impact on sales performance. With the automated daily refresh, stakeholders always have access to current, actionable insights.