This project involves setting up an Azure Databricks environment, integrating it with Azure storage accounts, automating data processing workflows, and implementing CI/CD pipelines to ensure seamless integration and deployment of data and notebooks.
An Azure Resource Group was created to organize and manage all related resources.
Two storage accounts were created to store and manage the project data.
Within the projectstgaccount storage account, three containers were created, with the landing container designated for storing raw data.
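For reference, container creation can also be scripted. The minimal sketch below uses the azure-storage-blob Python SDK; the credential and the names of the two containers other than landing are placeholders, since only landing is named above.

```python
from azure.storage.blob import BlobServiceClient

# Sketch only: the credential and the non-"landing" container names are
# placeholders, not taken from the project.
service = BlobServiceClient(
    account_url="https://projectstgaccount.blob.core.windows.net",
    credential="<account-key-or-sas>",
)

for name in ("landing", "container2", "container3"):  # "landing" holds raw data
    service.create_container(name)
```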
Three folders were created in the medallion structure to organize data systematically.
An Azure Databricks workspace was established to facilitate data processing and analysis.
A Databricks access connector was created and assigned the Storage Blob Data Contributor role on the two storage accounts, ensuring secure data access.
A metastore was created and attached to the Azure Databricks workspace, and a development catalog was then set up.
Storage credentials and external locations were configured to manage data access and storage.
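As an illustration, an external location can be registered with Unity Catalog SQL from a notebook. The location, credential, container, and path names below are assumptions, and the storage credential is assumed to have already been created from the access connector.

```python
# Run in a Databricks notebook on a Unity Catalog-enabled cluster.
# Names are illustrative; the storage credential is assumed to exist already
# (created from the Databricks access connector).
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS landing_ext
    URL 'abfss://landing@projectstgaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL project_access_connector_cred)
""")
```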
A Databricks cluster was created to execute data processing tasks.
All provided files were manually run to verify that paths and variable names were correctly defined.
All schemas were created in the dev catalog.
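A minimal sketch of the catalog and schema DDL is shown below, assuming the standard medallion layer names (bronze, silver, gold), which are not spelled out above.

```python
# Issued from a Databricks notebook; the schema names follow the common
# medallion convention and are an assumption about this project.
spark.sql("CREATE CATALOG IF NOT EXISTS dev")
for layer in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS dev.{layer}")
```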
Autoscaling was enabled, and workflows were created to automate the execution of data processing tasks.
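For context, an autoscaling job cluster in a Databricks workflow is typically declared as in the sketch below (Databricks Jobs API style); the Spark version, node type, and worker counts are placeholders, not the project's actual settings.

```python
# Fragment of a Databricks Jobs API job definition; all values are illustrative.
job_cluster = {
    "job_cluster_key": "processing_cluster",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "autoscale": {"min_workers": 1, "max_workers": 4},  # autoscaling range
    },
}
```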
Keys and parameters for dbutils widgets were created to handle dynamic configurations.
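A minimal example of such a widget is shown below; the widget name, default value, and derived variable are illustrative, not the project's actual keys.

```python
# Runs inside a Databricks notebook, where dbutils is available.
dbutils.widgets.text("env", "dev", "Environment")   # key, default, label (illustrative)
env = dbutils.widgets.get("env")

# The value can then drive dynamic configuration, e.g. a catalog name.
catalog_name = f"{env}_catalog"   # hypothetical naming convention
```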
Triggers were created to automate task execution. Multiple triggers were cloned to manage different data streams, such as raw roads and raw traffic. New files added to Azure Data Lake Storage (ADLS) initiate the triggers, ensuring incremental data processing and successful job completion.
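File-arrival triggers of this kind can be expressed in a job definition roughly as follows; the ADLS path is a placeholder based on the container named earlier, and the timing value is illustrative.

```python
# Fragment of a Databricks Jobs API job definition with a file arrival trigger.
# The abfss:// URL and the timing value are placeholders.
trigger = {
    "pause_status": "UNPAUSED",
    "file_arrival": {
        "url": "abfss://landing@projectstgaccount.dfs.core.windows.net/raw_roads/",
        "min_time_between_triggers_seconds": 60,
    },
}
```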
Processed data was integrated with Power BI for comprehensive reporting and analysis.
A CI/CD pipeline was established to automate the deployment process. When there is a push to the main branch, all folders are copied to the live folder, which requires admin access to interact with. This setup ensures seamless integration and deployment of all notebooks to different environments, keeping the live folder updated with the latest data.
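One way such a deployment step can push notebooks into the live folder is through the Databricks Workspace import API. The sketch below assumes the host, token, and paths are supplied as secure pipeline variables; it is an illustration under those assumptions, not the project's actual pipeline script.

```python
import base64
import os
import requests

# Assumed to be provided as secure Azure DevOps pipeline variables.
HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-xxxx.azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]

def deploy_notebook(local_path: str, workspace_path: str) -> None:
    """Import one notebook source file into the live folder of the workspace."""
    with open(local_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post(
        f"{HOST}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "path": workspace_path,   # e.g. /live/Notebooks/<notebook> (hypothetical)
            "format": "SOURCE",
            "language": "PYTHON",
            "content": content,
            "overwrite": True,
        },
        timeout=30,
    )
    resp.raise_for_status()
```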
This project demonstrates the efficient setup and automation of an Azure Databricks environment. It includes secure data integration, automated workflows, and comprehensive reporting, enhanced by a robust CI/CD pipeline to ensure consistent and up-to-date data deployment across different environments. This approach facilitates seamless integration, deployment, and data accessibility while maintaining data integrity and security.
At the end of the project, the workspace appears as shown in the image. It includes the main branch with all the changes already pulled and all the files organized in the Notebooks folder. The setup ensures all project components are easily accessible and well-structured. A pipeline was created to facilitate the movement of data from the development catalog to the UAT catalog, requiring admin approval. The Azure DevOps interface illustrates the stages of deployment, ensuring a controlled and authorized transition of data between these environments.