
Data warehouse tech stack with PostgreSQL, Airflow, dbt, and PowerBI

Project Overview

This project focuses on analyzing vehicle trajectories with SQL and Python. The analysis covers calculations of each vehicle's maximum and minimum speeds, quickest times, and total distance traveled. Additionally, dbt is utilized for data transformation and to create lineage graphs.
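
As a rough sketch of what these calculations look like in Python (the column names track_id, speed, and distance below are placeholders, not necessarily the exact headers in the dataset):

import pandas as pd

# Load one trajectory export; the file name and column names here are illustrative
df = pd.read_csv("trajectories.csv")

# Per-vehicle maximum speed, minimum speed, and total distance travelled
summary = df.groupby("track_id").agg(
    max_speed=("speed", "max"),
    min_speed=("speed", "min"),
    total_distance=("distance", "sum"),
)
print(summary.head())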

Data

pNEUMA is a comprehensive, large-scale dataset containing naturalistic trajectories for half a million vehicles, collected by a fleet of drones and fixed roadside cameras in the busy downtown area of Athens, Greece.

The vehicle trajectory data is derived from traffic footage analysis, with each file corresponding to a specific (area, date, time) and averaging around 87 MB in size.

Tools to be used

PostgreSQL

PostgreSQL is a robust, open-source object-relational database system that builds on the SQL language with additional features to securely handle and scale complex data workloads. In this project, PostgreSQL serves as the data warehouse for storing vehicle trajectory data extracted from footage captured by drone swarms and stationary roadside cameras. The data is ingested into PostgreSQL from a data source (such as a CSV file) via an Airflow task.
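
As an illustration of what such a load task might do (not the exact code in this repository; the table name, columns, and credentials below are placeholders, and psycopg2 is assumed to be installed), a CSV file can be bulk-loaded into PostgreSQL like this:

import psycopg2

# Placeholder credentials; the project reads these from a .env file (see Installation)
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="warehouse", user="postgres", password="secret"
)

with conn, conn.cursor() as cur:
    # Staging table with illustrative columns for the trajectory data
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw_trajectories (
            track_id INTEGER,
            vehicle_type TEXT,
            lat DOUBLE PRECISION,
            lon DOUBLE PRECISION,
            speed DOUBLE PRECISION,
            time_s DOUBLE PRECISION
        )
    """)
    # Bulk-load the CSV file using PostgreSQL's COPY command
    with open("trajectories.csv") as f:
        cur.copy_expert("COPY raw_trajectories FROM STDIN WITH CSV HEADER", f)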

Airflow

Apache Airflow is an open-source platform designed for programmatically creating, scheduling, and monitoring workflows. In this project, Airflow is utilized to schedule and automate data pipeline tasks, such as loading data into PostgreSQL and executing transformation scripts with dbt. Each Airflow task is defined as a Python function and structured within a Directed Acyclic Graph (DAG), which outlines the task sequence and dependencies.
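
A minimal sketch of how such a DAG can be laid out (task names and the dbt path below are illustrative; see dags/vehicle_trajectory_dag.py for the actual definition):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def load_csv_to_postgres():
    # Read the CSV files from FILE_PATH and insert them into PostgreSQL
    ...


with DAG(
    dag_id="vehicle_trajectory_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_data = PythonOperator(
        task_id="load_data",
        python_callable=load_csv_to_postgres,
    )
    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="cd /path/to/dbt && dbt run",
    )

    # Load the raw data first, then run the dbt transformations
    load_data >> run_dbt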

dbt (Data Build Tool)

dbt is a command-line tool that helps data analysts and engineers perform data transformations more efficiently within their data warehouses. In this project, dbt is used to transform data stored in PostgreSQL by running SQL scripts defined in the dbt project. These scripts handle tasks like aggregation, joining, filtering, and data cleaning. Once transformed, the data is reloaded into PostgreSQL for analysis. dbt also includes features for data testing (to verify it meets specific criteria) and documenting data models.

PowerBI

PowerBI helps you make sense of your data by letting you connect to and query your data sources, build dashboards to visualize the data, and share them with your company. In this project, PowerBI is used to create visualizations and dashboards from the transformed data: it connects to the PostgreSQL database and presents the views created by dbt in various formats (tables, line charts, bar charts, and so on).

Getting Started

Prerequisites

  • Python 3.x
  • PostgreSQL
  • dbt
  • Airflow
  • WSL (recommended since dbt and Airflow can have issues in the Windows terminal; not needed if you're not on a Windows system)

Installation

1. Clone the repository

git clone https://github.com/AlazarG19/Data-Warehouse-Using-PostgreSQL-DBT-and-Airflow.git

2. Set up WSL (recommended for Windows users)

wsl -l -v 

(lists all installed WSL distros and their versions)

wsl -d Ubuntu-22.04 

(starts Ubuntu-22.04, the distro used for this project)

3. Set up the virtual environment

python3 -m venv venv 

or

py -3.10 -m venv venv

(for Python 3.10, the version used for this project)

4. Activate the virtual environment

source venv/bin/activate

5. Install the required Python packages

pip install -r requirements.txt

6. Set up PostgreSQL

  • Install PostgreSQL
  • Create a new database
  • Create a .env file in both the dags folder and the dbt folder and add credentials like the following:

HOST=your_db_host
DATABASE=your_db
PORT=your_port
USER=your_user
PASSWORD=your_password
FILE_PATH=path_to_the_folder_where_you_will_put_the_csv_data_files

When importing the .env file you might encounter some issues, so the best approach is to use

set -o allexport; source .env; set +o allexport

This loads the environment variables from the .env file into the current shell session.
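
If you prefer to load the variables inside Python (for example in the DAG) rather than in the shell, a sketch using the python-dotenv package looks like this (python-dotenv is an assumption here; check requirements.txt for what the project actually uses):

import os

from dotenv import load_dotenv  # assumes the python-dotenv package is installed

# override=True so the .env values win over variables already set in the shell (e.g. USER)
load_dotenv(override=True)

conn_params = {
    "host": os.getenv("HOST"),
    "dbname": os.getenv("DATABASE"),
    "port": os.getenv("PORT"),
    "user": os.getenv("USER"),
    "password": os.getenv("PASSWORD"),
}
csv_dir = os.getenv("FILE_PATH")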

7. Set up dbt

  • Install dbt (included in requirements.txt)
  • Move to the dbt path
cd dbt
dbt init my_project

If no profiles.yml is found, create it in the same folder. The credentials are imported from the .env file, so be sure to follow the previous steps.
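
Once the profile is in place, the usual dbt commands can be run from inside the dbt project folder, for example:

dbt run            (builds the models as views/tables in PostgreSQL)
dbt test           (runs the data tests defined in the project)
dbt docs generate  (builds the documentation and lineage graphs)
dbt docs serve     (serves the docs locally so you can browse the lineage)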

8. Set up Airflow

  • Install Airflow (included in requirements.txt)

  • Initialize the Airflow database

airflow db init
  • Start the Airflow web server
airflow webserver --port 8080
  • Start the Airflow scheduler
airflow scheduler
  • To do all of the above together, use
airflow standalone

This is only recommended for development

  • To change the Airflow config location, set AIRFLOW_HOME (otherwise it resets to the default every time you close and reopen the shell). This is the directory where your airflow.cfg will live; without it, Airflow may not find your DAGs and in some cases the server won't start.
export AIRFLOW_HOME=$(pwd)
  • Access the Airflow web interface
http://localhost:8080
  • Create a new DAG in Airflow using the dags/vehicle_trajectory_dag.py file
  • Trigger the DAG to run (from the web interface, or with the CLI commands shown after this list)
  • Monitor the DAG run in the Airflow web interface
  • Check the transformed data in your PostgreSQL database
  • Create visualizations in PowerBI using the transformed data
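
If you prefer the command line to the web interface, the DAG can also be unpaused and triggered with the Airflow CLI (the DAG id below is an assumption based on the file name; use whatever dag_id is defined in dags/vehicle_trajectory_dag.py):

airflow dags unpause vehicle_trajectory_dag
airflow dags trigger vehicle_trajectory_dag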

Usage

  1. Load data into PostgreSQL
  2. Transform data using dbt
  3. Create visualizations in PowerBI
  4. Schedule the data pipeline in Airflow
  5. Monitor the data pipeline in Airflow
  6. Analyze the data in PostgreSQL
  7. Document the data models in dbt
  8. Update the data pipeline as needed
  9. Repeat steps 1-8 as needed

Visualisation

Visualisations from PowerBI

(Dashboard screenshots: Visualisation 1, Visualisation 2, Visualisation 3)

Visualisations from the dbt docs

Average Distance by Vehicle Type, Average Speed by Vehicle Type, Location Density Analysis, Location Traffic Analysis, Performance Summary by Type, Route Efficiency, Total Distance Traveled by Type, Vehicle Type Distribution, Vehicle Type Proportion

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements
