A mini ETL pipeline in Python that performs batch data validation using the Luhn algorithm. An extension of a freeCodeCamp project to demonstrate data quality concepts.

zenleonardo/luhn-data-validator

Data Validation Pipeline: An Extension of the Luhn Algorithm Project

Python Badge Pandas Badge

Read in Portuguese / Ler em Português

📖 About the Project

This project originated from the "Learn How to Work with Numbers and Strings by Implementing the Luhn Algorithm" module, part of the freeCodeCamp "Scientific Computing with Python" course.

Going beyond the original exercise, I extended the core concept into a real-world data engineering scenario. Instead of a single function that validates one number, I built a mini ETL pipeline that reads a batch of "dirty" records from a CSV file, cleans them, applies the Luhn validation logic, and loads the enriched results into a new file.

This demonstrates a practical approach to solving data quality problems at scale.
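At the heart of the pipeline is the Luhn checksum itself. A minimal sketch of the validation step (the function name `luhn_is_valid` is illustrative; the actual implementation in `src/main.py` may differ):

```python
def luhn_is_valid(card_number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    # Keep only digits, so inputs like "7992-7398-713" are handled.
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit,
    # subtracting 9 when doubling produces a two-digit number.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

For example, `luhn_is_valid("79927398713")` returns `True`, while changing the final check digit makes it fail.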

✨ Relevance to Data Engineering

This project showcases fundamental skills required for building robust data pipelines:

  • ETL Process: A complete Extract, Transform, and Load workflow using Python and Pandas.
  • Data Quality & Validation: Applying a specific business rule (the Luhn algorithm) to programmatically check data integrity.
  • Data Cleaning: Preprocessing raw data to standardize formats before validation.
  • Batch Processing: Handling an entire file of records, which simulates a real-world data processing task.

📋 Pipeline Overview

  1. Extract: Reads a CSV file containing card numbers in various formats.
  2. Transform: Cleans each number (removing spaces/hyphens) and applies the Luhn algorithm to check its validity, creating a new is_valid column.
  3. Load: Writes a new CSV file containing the original data, the cleaned data, and the validation result.

🚀 How to Run

  1. Clone the repository.
  2. Create and activate a virtual environment.
  3. Install the required dependencies:
    pip install -r requirements.txt
  4. Run the main script to execute the pipeline:
    python3 src/main.py
