This project implements an ETL (Extract, Transform, Load) pipeline using Pandas, a versatile Python library for data manipulation and analysis. The pipeline extracts data from sources such as JSON and CSV files, transforms it through cleaning, aggregation, and manipulation operations on Pandas DataFrames, and loads the transformed data into CSV output files. It relies on the following technologies:
- Python: Programming language used for scripting and data transformations.
- Pandas: Library for data manipulation and analysis.
- Git: Version control system for collaborative development.
- Pytest: Framework for testing Python code.
- JSON and CSV: File formats for data extraction and storage (a short reading example follows this list).
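Because both input formats load into the same DataFrame structure, the downstream transform and load steps never need to know where the data came from. As a quick illustration (the file names here are placeholders, not the repository's actual data files):

```python
import pandas as pd

# Each source format becomes a plain DataFrame, so the rest of the
# pipeline is format-agnostic.
trips_df = pd.read_csv("data/sample_trips.csv")    # placeholder file name
zones_df = pd.read_json("data/sample_zones.json")  # placeholder file name

print(trips_df.head())
print(zones_df.head())
```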
The project is structured as follows:
- main.py: Entry point for the ETL pipeline; orchestrates the extract, transform, and load steps (an orchestration sketch follows this list).
- scripts/
  - extract.py: Contains classes for data extraction from JSON and CSV files.
  - transform.py: Implements data transformations using the Pandas DataFrame API.
  - load.py: Handles loading of transformed data into a CSV file.
- tests/: Directory for unit tests.
- data/: Directory containing sample JSON and CSV data files.
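Based on this layout, main.py might wire the modules together roughly as follows. This is a hedged sketch: the `run` method names and their signatures are assumptions, not the repository's actual API.

```python
from scripts.extract import Extract
from scripts.transform import Transform
from scripts.load import Load

def main():
    # Extraction: read the sample JSON and CSV files into DataFrames
    raw_frames = Extract().run("data/")          # hypothetical method name

    # Transformation: clean, enrich, and join the extracted data
    result = Transform().run(raw_frames)         # hypothetical method name

    # Loading: persist the final DataFrame to the output CSV
    Load().run(result, "output/etl_output.csv")  # hypothetical method name

if __name__ == "__main__":
    main()
```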
- Initialization: The `main.py` script initializes the pipeline and orchestrates the data flow.
- Extraction: Data is extracted from JSON and CSV files using the `Extract` class in `extract.py`.
- Transformation: The `Transform` class in `transform.py` processes the extracted data, performing operations such as adding new columns, joining datasets, and enforcing data quality (a hedged sketch follows this list).
- Loading: Transformed data is loaded into a CSV file using the `Load` class in `load.py`.
- Output: The final output is stored in the `output/etl_output.csv` file.
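The transformation step is where most of the Pandas work happens. The sketch below shows the kinds of operations described above, adding a derived column, joining two datasets, and dropping bad rows; the column names are assumptions rather than the project's real schema.

```python
import pandas as pd

def transform(trips: pd.DataFrame, zones: pd.DataFrame) -> pd.DataFrame:
    """Illustrative transform; column names are placeholders."""
    # Add a new column derived from an existing one
    trips = trips.assign(trip_minutes=trips["duration_seconds"] / 60)

    # Join the two datasets on a shared key
    joined = trips.merge(zones, on="zone_id", how="left")

    # Data quality: drop rows that are missing required values
    return joined.dropna(subset=["zone_id", "trip_minutes"])
```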
- Schema Mismatch: Ensuring that the schema of the extracted data matches the format expected by the transformation steps (a validation sketch follows this list).
- Missing Columns: Handling errors caused by columns that are absent from the extracted data files.
- Logging and Configuration: Tuning the logging level and configuration parameters to keep large pipeline runs debuggable.
- Unit Testing: Setting up unit tests with Pytest to verify the correctness of data transformations and pipeline behavior (an example test appears at the end of this README).
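The first two challenges can be mitigated by validating the extracted data before transformation begins. A minimal sketch, assuming illustrative column names:

```python
import pandas as pd

# Columns the transform step depends on; these names are illustrative.
REQUIRED_COLUMNS = {"pickup_datetime", "dropoff_datetime", "zone_id"}

def validate_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast with a clear message instead of a KeyError mid-transform."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Extracted data is missing columns: {sorted(missing)}")
    return df
```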
To run the ETL pipeline:
- Ensure you have Python and Pandas installed on your system.
- Clone this repository:

  ```bash
  git clone https://github.com/jteoni/pandas-etl-nyctaxi
  cd pandas-etl-nyctaxi
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Execute the main script:

  ```bash
  python main.py
  ```
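To run the unit tests mentioned above, invoke Pytest from the repository root:

```bash
pytest tests/
```

A transformation test might look like the following sketch; the imported class, its `run` method, and the column names are assumptions about the project's API, not its actual code.

```python
import pandas as pd

from scripts.transform import Transform  # hypothetical import path

def test_transform_adds_derived_column():
    raw = pd.DataFrame({"duration_seconds": [60, 120], "zone_id": [1, 2]})
    result = Transform().run(raw)  # hypothetical method name

    # The derived column should exist and be computed from the input
    assert "trip_minutes" in result.columns
    assert result["trip_minutes"].tolist() == [1.0, 2.0]
```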