This project focuses on analyzing vehicle trajectories with SQL and Python. The analysis covers calculations of each vehicle's maximum and minimum speeds, quickest times, and total distance traveled. Additionally, dbt is utilized for data transformation and to create lineage graphs.
pNEUMA is a comprehensive, large-scale dataset containing naturalistic trajectories for half a million vehicles, collected by a fleet of drones and fixed roadside cameras in the busy downtown area of Athens, Greece.
The vehicle trajectory data is derived from traffic footage analysis, with each file corresponding to a specific (area, date, time) and averaging around 87 MB in size.
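To make the metrics concrete, the sketch below computes per-vehicle maximum and minimum speed, travel duration, and an approximate total distance from a flattened trajectory table. It is only an illustration: the file name and the column names (`track_id`, `speed`, `time`) are assumptions about a long-format export of the pNEUMA data, not the exact schema used in this repo.

```python
import pandas as pd

# Hypothetical long-format trajectory table: one row per (vehicle, timestamp).
# File and column names are assumptions, not the exact schema used in this repo.
df = pd.read_csv("trajectories_long.csv")  # columns: track_id, speed (km/h), time (s)

df = df.sort_values(["track_id", "time"])

stats = df.groupby("track_id").agg(
    max_speed=("speed", "max"),
    min_speed=("speed", "min"),
    duration_s=("time", lambda t: t.max() - t.min()),
)

# Rough total distance per vehicle: integrate speed over time
# (speed in km/h -> m/s, time step in seconds).
df["dt"] = df.groupby("track_id")["time"].diff().fillna(0)
stats["distance_m"] = (df["speed"] / 3.6 * df["dt"]).groupby(df["track_id"]).sum()

print(stats.head())
```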
PostgreSQL is a robust, open-source object-relational database system that builds on the SQL language with additional features to securely handle and scale complex data workloads. In this project, PostgreSQL serves as the data warehouse for storing vehicle trajectory data extracted from footage captured by drone swarms and stationary roadside cameras. The data is ingested into PostgreSQL from a data source (such as a CSV file) via an Airflow task.
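A minimal sketch of that ingestion step, assuming psycopg2 is used and the target table already exists (the table name `vehicle_trajectories` and the CSV file name are placeholders, not the exact names used in this repo):

```python
import os

import psycopg2

# Connection details come from the .env credentials described in the setup steps below.
conn = psycopg2.connect(
    host=os.getenv("HOST"),
    dbname=os.getenv("DATABASE"),
    port=os.getenv("PORT"),
    user=os.getenv("USER"),
    password=os.getenv("PASSWORD"),
)

csv_path = os.path.join(os.getenv("FILE_PATH", "."), "trajectories.csv")  # placeholder file name

with conn, conn.cursor() as cur:
    # COPY streams the CSV straight into an existing table.
    with open(csv_path) as f:
        cur.copy_expert(
            "COPY vehicle_trajectories FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
```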
Apache Airflow is an open-source platform designed for programmatically creating, scheduling, and monitoring workflows. In this project, Airflow is utilized to schedule and automate data pipeline tasks, such as loading data into PostgreSQL and executing transformation scripts with dbt. Each Airflow task is defined as a Python function and structured within a Directed Acyclic Graph (DAG), which outlines the task sequence and dependencies.
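A minimal DAG sketch in that style; the DAG id, schedule, and task name are illustrative and not necessarily those used in dags/vehicle_trajectory_dag.py:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_csv_to_postgres():
    """Placeholder for the ingestion logic (e.g. the COPY-based load sketched earlier)."""


with DAG(
    dag_id="vehicle_trajectory_pipeline",  # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_task = PythonOperator(
        task_id="load_data",
        python_callable=load_csv_to_postgres,
    )
```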
dbt is a command-line tool that helps data analysts and engineers perform data transformations more efficiently within their data warehouses. In this project, dbt is used to transform data stored in PostgreSQL by running SQL scripts defined in the dbt project. These scripts handle tasks like aggregation, joining, filtering, and data cleaning. Once transformed, the data is reloaded into PostgreSQL for analysis. dbt also includes features for data testing (to verify it meets specific criteria) and documenting data models.
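One way to wire dbt into the same DAG is a pair of BashOperator tasks that shell out to the dbt CLI after the load task; the project path below is an assumption, and these operators would sit inside the same `with DAG(...)` block as the sketch above:

```python
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/path/to/repo/dbt/my_project"  # assumed location of the dbt project

# These tasks belong inside the same `with DAG(...)` block as load_task above.
dbt_run = BashOperator(
    task_id="dbt_run",
    bash_command=f"cd {DBT_PROJECT_DIR} && dbt run",
)

dbt_test = BashOperator(
    task_id="dbt_test",
    bash_command=f"cd {DBT_PROJECT_DIR} && dbt test",
)

# Transform only after the raw data has been loaded, then test the resulting models.
load_task >> dbt_run >> dbt_test
```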
PowerBI helps you make sense of your data by letting you connect to and query your data sources, build dashboards to visualize the data, and share them across your company. In this project, PowerBI is used to create visualizations and dashboards from the transformed data. It connects to the PostgreSQL database and presents the data imported from the different views created by dbt in various formats (tables, line charts, bar charts, etc.).
- Python 3.x
- PostgreSQL
- dbt
- Airflow
- WSL (recommended, since dbt and Airflow can run into issues in the Windows terminal; skip this if you are not on a Windows system)
git clone https://github.com/AlazarG19/Data-Warehouse-Using-PostgreSQL-DBT-and-Airflow.git
wsl -l -v
(lists all installed WSL distros)
wsl -d Ubuntu-22.04
(the distro I had installed and used for this project)
python3 -m venv venv
or
py -3.10 -m venv venv
(for Python 3.10, the version I used)
source venv/bin/activate
pip install -r requirements.txt
- Install PostgreSQL
- Create a new database
- Create a .env file in both the dags directory and the dbt directory and add credentials like:
HOST=your_db_host
DATABASE=your_db
PORT=your_port
USER=your_user
PASSWORD=your_password
FILE_PATH=path_to_the_folder_where_you_will_put_the_csv_data_files
When importing the .env file you might run into some issues, so the best approach is to use
set -o allexport; source .env; set +o allexport
which loads the environment variables from the .env file into the current shell session. (For loading them inside a Python script, see the sketch below.)
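The same variables can also be read inside Python without sourcing the file; a minimal sketch, assuming python-dotenv is installed (add it to requirements.txt if it is not already there):

```python
import os

import psycopg2
from dotenv import load_dotenv

# Reads HOST, DATABASE, PORT, USER, PASSWORD and FILE_PATH from the .env file
# in the current directory into the process environment.
load_dotenv()

conn = psycopg2.connect(
    host=os.getenv("HOST"),
    dbname=os.getenv("DATABASE"),
    port=os.getenv("PORT"),
    user=os.getenv("USER"),
    password=os.getenv("PASSWORD"),
)
print(conn.get_dsn_parameters())  # quick sanity check that the credentials work
conn.close()
```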
- Install dbt (included in requirements.txt)
- Move into the dbt directory
cd dbt
dbt init my_project
If no profiles.yml is found, create it in the same folder. The credentials are imported from the .env file, so be sure to follow the previous steps; a sketch of such a profiles.yml is shown below.
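A minimal profiles.yml sketch, assuming the profile is called my_project and the credentials come from the environment variables defined earlier (adjust the profile name, schema, and threads to match your setup):

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: "{{ env_var('HOST') }}"
      port: "{{ env_var('PORT') | as_number }}"
      user: "{{ env_var('USER') }}"
      password: "{{ env_var('PASSWORD') }}"
      dbname: "{{ env_var('DATABASE') }}"
      schema: public   # placeholder schema
      threads: 4
```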
- Install Airflow (included in requirements.txt)
- Initialize the Airflow database
airflow db init
- Start the Airflow web server
airflow webserver --port 8080
- Start the Airflow scheduler
airflow scheduler
- To do all of the above in one step, use
airflow standalone
This is only recommended for development
- To change the Airflow config location (since it will otherwise fall back to the default every time you close and reopen the terminal), set AIRFLOW_HOME. This will be the directory where your airflow.cfg lives; without it, Airflow may not locate your DAGs and in some cases the server won't start.
export AIRFLOW_HOME=$(pwd)
- Access the Airflow web interface
http://localhost:8080
- Create a new DAG in Airflow using the dags/vehicle_trajectory_dag.py file
- Trigger the DAG to run
- Monitor the DAG run in the Airflow web interface
- Check the transformed data in your PostgreSQL database
- Create visualizations in PowerBI using the transformed data
- Load data into PostgreSQL
- Transform data using dbt
- Create visualizations in PowerBI
- Schedule the data pipeline in Airflow
- Monitor the data pipeline in Airflow
- Analyze the data in PostgreSQL
- Document the data models in dbt
- Update the data pipeline as needed
- Repeat steps 1-8 as needed
This project is licensed under the MIT License - see the LICENSE file for details.