This project analyzes and visualizes how the number of lawyers in France has evolved over time. Using public data, I created a data pipeline to process, analyze, and visualize this information.
The main objective is to provide insights into the growth and distribution of the legal profession in France.
The project uses data from the French government's open data platform. The dataset is accessed directly from its source URL.
This CSV file contains detailed information about the number of lawyers per bar association in France over time. The data is regularly updated, ensuring that our analysis reflects the most current trends in the French legal profession.
Our data pipeline leverages two powerful tools for efficient data processing and transformation:
dlt (data load tool) is an open-source Python library that simplifies extracting, normalizing, and loading data. Key features include:
- Automated schema inference and evolution
- Built-in data verification and error handling
- Support for various data sources and destinations
In this project, dlt is used for efficient data extraction from the CSV source and loading into our data processing pipeline.
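As an illustration of what such an extraction step can look like with dlt, here is a minimal sketch. The source URL, pipeline name, and table name below are placeholders, not the values used in the project's `app/pipeline.py`:

```python
import dlt
import pandas as pd

# Placeholder URL: the real dataset URL lives in the project's pipeline code.
CSV_URL = "https://example.org/lawyers_per_bar.csv"

# Read the published CSV into a DataFrame (the real file may use a different separator).
df = pd.read_csv(CSV_URL)

# A dlt pipeline that loads into a local DuckDB file.
pipeline = dlt.pipeline(
    pipeline_name="french_lawyers",
    destination="duckdb",
    dataset_name="raw",
)

# dlt infers the schema from the DataFrame and creates or evolves the table as needed.
load_info = pipeline.run(df, table_name="lawyers_per_bar")
print(load_info)
```

Running something like this produces (or updates) a local DuckDB database file that the SQL transformations can then build on.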
yato is a lightweight SQL transformation orchestrator designed to work seamlessly with DuckDB. Its main advantages are:
- Efficient execution of SQL queries in the correct order
- Easy management of dependencies between transformations
In this project, yato, billed as the smallest DuckDB SQL orchestrator on Earth, runs our SQL transformations on the extracted data, producing a clean and well-structured dataset for analysis.
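For reference, orchestrating the transformations with yato typically amounts to only a few lines. This sketch assumes yato's `Yato(database_path, sql_folder, schema)` entry point and uses illustrative file and folder names rather than the ones in this repository:

```python
from yato import Yato

# Illustrative names; the project's actual database file and SQL folder may differ.
yato = Yato(
    database_path="lawyers.duckdb",  # DuckDB file produced by the extraction step
    sql_folder="sql/",               # folder containing the .sql transformation files
    schema="transform",              # schema where the transformed tables are created
)

# yato reads the SQL files, resolves the dependencies between them,
# and runs them against DuckDB in the correct order.
yato.run()
```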
Together, dlt and yato, backed by DuckDB, form a flexible, maintainable, and easy-to-understand data pipeline that is the backbone of our analysis: dlt loads the raw data into DuckDB, and yato transforms it there with plain SQL queries.
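Because everything ends up in a single DuckDB file, the resulting tables can also be inspected directly with the DuckDB Python API. A minimal sketch, assuming an illustrative database file and table name (the real ones depend on the pipeline configuration):

```python
import duckdb

# Illustrative file name; point this at the DuckDB file created by the pipeline.
con = duckdb.connect("lawyers.duckdb")

# List every table created by the load and transformation steps.
print(con.sql("SHOW ALL TABLES"))

# Example ad hoc query against a hypothetical transformed table.
print(
    con.sql(
        "SELECT year, SUM(lawyer_count) AS total_lawyers "
        "FROM transform.lawyers_per_year "
        "GROUP BY year ORDER BY year"
    )
)
```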
To set up the project environment:
1. Clone the repository:

   ```bash
   git clone https://github.com/MohamedBsh/test-dlt-yato-avocado.git
   cd test-dlt-yato-avocado
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```
To run the data pipeline and generate visualizations:
1. Ensure your virtual environment is activated.

2. Execute the main script:

   ```bash
   python app/pipeline.py
   ```

3. Clean the database:

   ```bash
   python app/data_explorer.py
   ```

   Choose 'clean'.

4. Explore the database:

   ```bash
   python app/data_explorer.py
   ```

   Choose 'explore'.

5. Visualize the data:

   ```bash
   python generate_plot.py
   ```