Read in Portuguese / Ler em Português
This project originated from the "Learn How to Work with Numbers and Strings by Implementing the Luhn Algorithm" module, part of the freeCodeCamp "Scientific Computing with Python" course.
Going beyond the original exercise, I extended the core concept into a real-world data engineering scenario. Instead of a single function that validates one number, I built a mini-ETL pipeline that reads a batch of "dirty" records from a CSV file, cleans them, applies the Luhn validation logic, and loads the enriched results into a new file.
This demonstrates a practical approach to solving data quality problems at scale.
This project showcases fundamental skills required for building robust data pipelines:
- ETL Process: A complete Extract, Transform, and Load workflow using Python and Pandas.
- Data Quality & Validation: Applying a specific business rule (the Luhn algorithm) to programmatically check data integrity (a minimal check sketch follows this list).
- Data Cleaning: Preprocessing raw data to standardize formats before validation.
- Batch Processing: Handling an entire file of records, which simulates a real-world data processing task.
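The validation rule itself is small. Below is a minimal sketch of a Luhn check in Python; the function name `luhn_is_valid` is illustrative and may not match the helper actually used in `src/main.py`.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    if not number.isdigit():
        return False
    total = 0
    # Walk the digits from right to left, doubling every second one.
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # Equivalent to summing the two digits of the product.
        total += digit
    return total % 10 == 0
```

For example, `luhn_is_valid("4111111111111111")` returns `True`, while changing any single digit of that number makes it return `False`.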
- Extract: Reads a CSV file containing card numbers in various formats.
- Transform: Cleans each number (removing spaces/hyphens) and applies the Luhn algorithm to check its validity, creating a new `is_valid` column.
- Load: Writes a new CSV file containing the original data, the cleaned data, and the validation result (see the pipeline sketch after this list).
- Clone the repository.
- Create and activate a virtual environment.
- Install the required dependencies: `pip install -r requirements.txt`
- Run the main script to execute the pipeline: `python3 src/main.py`