This repository contains a mini project demonstrating an ETL (Extract, Transform, Load) pipeline for processing sales data using Docker and Apache Airflow for orchestration.
- Docker and Docker Compose installed
- Python 3.x
- Apache Airflow: Set up Apache Airflow, and ensure it's running in a Docker container
- Postgresql (you can use pgAdmin4 or DBeaver )
- Make sure you have PostgreSQL installed and running.
- Use your preferred database management tool (e.g., pgAdmin4 or DBeaver) to create a new database for this project and set Airflow as the owner.
- Connect to your PostgreSQL instance and execute this SQL script to create the table
-- SQL script to create the sales_data table
CREATE TABLE etl( Item_Identifier character varying(45) NOT NULL ,
Item_Weight FLOAT NOT NULL ,
Item_Fat_Content character varying(45) NOT NULL,
Item_Visibility FLOAT NOT NULL , Item_Type character varying(45) NOT NULL,
Item_MRP FLOAT NOT NULL ,
Outlet_Identifier character varying(45) NOT NULL,
Outlet_Establishment_Year INTEGER NOT NULL,
Outlet_Size character varying(45) NOT NULL,
Outlet_Location_Type character varying(45) NOT NULL,
Outlet_Type character varying(45) NOT NULL,
Item_Outlet_Sale FLOAT NOT NULL,
source character varying(45) NOT NULL
);
- Open the Airflow web UI.
- Go to the "Admin" section and select "Connections."
- Click "Create" to add a new connection.
- Set the connection type to "PostgreSQL."
- Fill in the required connection details (host, port, schema, login, password).
Once the PostgreSQL connection is properly configured, you can trigger the Airflow DAG, and it will smoothly commence the ETL process.
Following this, you will find the data successfully uploaded to your PostgreSQL database.
In this structure:
/docker-compose
contains the Docker Compose file for defining and running the Docker containers./etl_script
contains the Python script for the ETL pipeline (pipeline.py
)./data
contains the input data files (Train.csv
andTest.csv
).Dockerfile
specifies how to build the Docker image for the ETL pipeline.requirements.txt
lists the Python dependencies needed to run the ETL pipeline.README.md
provides information about the project and how to set it up.