This repository contains data engineering tasks and projects, focused on the Azure cloud platform and on data analysis with Python and Jupyter notebooks.
The repository is divided into the following parts:
- part_a: Contains Jupyter notebooks and scripts related to the first set of data engineering tasks.
- part_b: Contains Jupyter notebooks and scripts for the second set of tasks.
- requirements.txt: Specifies the Python dependencies required to run the project.
- azure_vm_setup.sh: A Bash script that sets up an Azure VM, installs the Azure CLI, logs in, and downloads files from Azure Blob Storage.
- Python
- Jupyter Notebooks
- Azure Cloud Services: Networking, Identity, and Compute.
- Pandas: Data manipulation and analysis.
- Python 3.x
- Dependencies from requirements.txt, installed with pip install -r requirements.txt
Make sure you have set up your .env file in the root directory.
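
As a quick sanity check before running anything, the .env file can be parsed and checked for missing entries. The following is only a minimal sketch using the standard library; the variable names AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_CONTAINER are hypothetical placeholders, not names taken from this repository, so replace them with whatever the scripts and notebooks actually read.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file and return them as a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without a key/value separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    # Hypothetical variable names; replace them with the ones the project expects.
    required = ["AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_CONTAINER"]
    env = load_env_file()
    missing = missing_keys(env, required)
    if missing:
        print("Missing .env entries:", ", ".join(missing))
    else:
        os.environ.update(env)
        print("All required .env entries are set.")
```

If the project already depends on a .env loading package, that package is the better choice; this sketch only avoids assuming one.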
- Clone the repository:
git clone https://github.com/andreas789/dtu_de.git
- Navigate into the cloned repository:
cd dtu_de
- Install the required dependencies:
pip install -r requirements.txt
- Run any of the scripts or notebooks in part_a or part_b.
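
After the steps above, it can help to confirm that the packages listed in requirements.txt are actually installed in the active environment. This is a rough sketch, not part of the repository: the helper names requirement_names and missing_packages are made up for illustration, and the parsing handles only simple name-and-version lines.

```python
import re
from importlib import metadata
from pathlib import Path


def requirement_names(lines):
    """Extract bare package names from requirements.txt-style lines."""
    names = []
    for raw_line in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Ignore blanks and pip options such as "-r other.txt" or "-e .".
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    req_file = Path("requirements.txt")
    if req_file.exists():
        names = requirement_names(req_file.read_text(encoding="utf-8").splitlines())
        missing = missing_packages(names)
        print("Missing packages:", ", ".join(missing) if missing else "none")
    else:
        print("requirements.txt not found; run this from the repository root.")
```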
The azure_vm_setup.sh script performs the following actions:
- SSH into the Azure VM.
- Install the Azure CLI.
- Log into Azure using the CLI.
- Download files from an Azure Blob Storage container to the VM.
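
The download step can also be driven from Python by wrapping the Azure CLI. The sketch below is not the contents of azure_vm_setup.sh; it only illustrates how an az storage blob download-batch call might be assembled. The account, container, and directory names are hypothetical, and the command is printed rather than run unless dry_run is set to False.

```python
import shlex
import subprocess


def build_download_command(account_name, container, destination):
    """Build the `az storage blob download-batch` argument list."""
    return [
        "az", "storage", "blob", "download-batch",
        "--account-name", account_name,
        "--source", container,
        "--destination", destination,
    ]


def download_container(account_name, container, destination, dry_run=True):
    """Print the command and run it only when dry_run is False."""
    command = build_download_command(account_name, container, destination)
    print(shlex.join(command))
    if not dry_run:
        # Assumes the Azure CLI is on PATH and `az login` has already succeeded.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical account, container, and directory names.
    download_container("mystorageaccount", "raw-data", "./data")
```

Keeping the dry run as the default makes it easy to inspect the command before any files are transferred.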
- Cloud-based data engineering
- Data manipulation with Pandas
- Azure services for data solutions