Latest release: v1.0.0-portfolio-ready
## Contents

- Objective
- Setup
- Project Structure
- Pipeline
- Skills Demonstrated
- Challenges and Insights
- API Documentation
- License
## Quick Start

```bash
git clone https://github.com/space-lumps/ecommerce-data-cleaning.git
cd ecommerce-data-cleaning

# Create & activate virtual environment (recommended: uv)
uv venv
source .venv/bin/activate   # or .venv\Scripts\activate on Windows

# Install dependencies + editable package
uv pip install -e .

# Copy sample data (no Kaggle needed)
cp data/samples/*.csv data/raw/

# Run the full pipeline
uv run python run_pipeline.py
```

## Objective

Build a reproducible, production-style data cleaning and validation pipeline for the Olist e-commerce dataset.
This project demonstrates:

- Structured modular pipeline design
- Explicit schema enforcement
- Data type auditing and validation
- Reproducible execution using `uv`
- Clean `src/` package architecture
- Module-based execution (`python -m`)
## Setup

```bash
git clone https://github.com/space-lumps/ecommerce-data-cleaning.git
cd ecommerce-data-cleaning
```

The pipeline can run using either the included sample dataset or the full Kaggle dataset.
**Option A: use the included sample data**

Copy the sample CSVs into `data/raw/`:

```bash
cp data/samples/*.csv data/raw/
```

Ensure `data/raw/` contains only one dataset version (either samples or full Kaggle data).

Run the pipeline:

```bash
uv run python run_pipeline.py
```

**Option B: use the full Kaggle dataset** (expected location: `data/raw/`)
- Download the dataset from Kaggle: Brazilian E-Commerce Public Dataset by Olist
- Extract the CSV files.
- Move all `.csv` files into `data/raw/`.
Alternatively, install the Kaggle CLI:

```bash
pip install kaggle
```

Place `kaggle.json` in `~/.kaggle/`, then run:

```bash
kaggle datasets download -d olistbr/brazilian-ecommerce -p data/raw --unzip
```

Data directory layout:

```text
data/raw/      # Original source files
data/interim/  # Optional intermediate artifacts
data/clean/    # Cleaned parquet outputs
data/samples/  # Static lightweight dataset (no Kaggle required)
```
## Project Structure

```text
ecommerce-data-cleaning/
├── .github/
│   └── workflows/
│       └── ci.yml
├── data/
│   ├── clean/
│   ├── interim/
│   ├── raw/
│   └── samples/
│       └── [sample CSV files]        # e.g., olist_orders_dataset.csv (truncated for brevity)
├── docs/
│   ├── api/
│   │   └── [generated API docs]      # e.g., index.html (via pdoc)
│   ├── data_dictionary.md
│   ├── schema_contract.md
│   └── validation_strategy.md
├── reports/
│   └── [generated reports]           # e.g., raw_profile.csv (generated at runtime)
├── src/
│   └── ecom_pipeline/
│       ├── __init__.py
│       ├── config/
│       │   ├── __init__.py
│       │   └── schema_contract.py
│       ├── pipeline/
│       │   ├── __init__.py
│       │   ├── audit_all_clean_dtypes.py
│       │   ├── enforce_schema.py
│       │   ├── generate_data_dictionary.py
│       │   ├── profile_raw.py
│       │   ├── sanity_check_raw.py
│       │   ├── standardize_columns.py
│       │   ├── validate_clean_schema.py
│       │   └── validate_schema_contract.py
│       └── utils/
│           ├── __init__.py
│           ├── io.py
│           └── logging.py
├── tests/
│   ├── test_io.py                    # smoke tests for io utils
│   └── test_pipeline_e2e.py          # end-to-end pipeline smoke test
├── .pre-commit-config.yaml
├── .gitignore
├── .ruff.toml
├── LICENSE
├── README.md
├── pyproject.toml
├── requirements-lock.txt
├── requirements.txt
├── run_pipeline.py
└── uv.lock
```
The project follows a proper `src/` layout. All reusable code lives inside the `ecom_pipeline` package; tests live in a top-level `tests/` directory.
Key files:

- `pyproject.toml`: defines the installable package and dependencies.
- `uv.lock`: locked dependency graph for reproducible environments (used by `uv`).
- `requirements.txt`: traditional dependency list (optional compatibility).
- `requirements-lock.txt`: pinned dependency versions (optional compatibility).

For this project, `uv` + `pyproject.toml` + `uv.lock` are the authoritative installation method.
## Pipeline

Run the full pipeline:

```bash
uv run python run_pipeline.py
```

Run individual modules:

```bash
uv run python -m ecom_pipeline.pipeline.sanity_check_raw
uv run python -m ecom_pipeline.pipeline.profile_raw
uv run python -m ecom_pipeline.pipeline.generate_data_dictionary
uv run python -m ecom_pipeline.pipeline.standardize_columns
uv run python -m ecom_pipeline.pipeline.enforce_schema
uv run python -m ecom_pipeline.pipeline.validate_clean_schema
uv run python -m ecom_pipeline.pipeline.audit_all_clean_dtypes
uv run python -m ecom_pipeline.pipeline.validate_schema_contract
```

Stages:

- **Sanity Check Raw**: confirms raw files exist and are readable.
- **Profile Raw**: profiles source datasets before transformation.
- **Generate Data Dictionary**: generates `docs/data_dictionary.md` from `reports/raw_profile.csv`.
- **Standardize Columns**: applies consistent column naming.
- **Enforce Schema**: applies explicit casting rules to produce clean parquet outputs.
- **Validate Clean Schema**: verifies data types match expectations.
- **Audit Dtypes**: flags suspicious type patterns using heuristics.
- **Validate Schema Contract**: enforces required columns, primary key uniqueness, and logical dtype guarantees.
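The sequential execution above can be sketched as a simple stage runner. This is a hypothetical illustration in the spirit of `run_pipeline.py`, not the repository's actual implementation; the stage names mirror the modules listed above, but the callables here are stand-ins:

```python
# Hypothetical stage-runner sketch (not the repo's run_pipeline.py):
# run each stage in order and stop at the first failure.
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def run_stages(stages):
    """Run (name, callable) stages in order; stop at the first failure."""
    completed = []
    for name, stage in stages:
        log.info("running stage: %s", name)
        try:
            stage()
        except Exception:
            log.exception("stage failed: %s", name)
            break
        completed.append(name)
    return completed

def _simulated_failure():
    raise ValueError("simulated bad cast")

demo = [
    ("sanity_check_raw", lambda: None),
    ("profile_raw", lambda: None),
    ("enforce_schema", _simulated_failure),
    ("validate_clean_schema", lambda: None),  # never reached: run stops on failure
]
print(run_stages(demo))  # ['sanity_check_raw', 'profile_raw']
```

Stopping at the first failing stage keeps later stages from reading half-written artifacts, which matches the fail-fast validation described below.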
Outputs:

- `data/clean/*.parquet`
- `docs/data_dictionary.md`
- `reports/raw_profile.csv`
- `reports/clean_schema_audit.csv`
- `reports/clean_dtypes_full.csv`
- `reports/clean_dtypes_flags.csv`
- `reports/clean_contract_audit.csv`
The pipeline enforces structural guarantees after cleaning.
- `enforce_schema.py`: ensures consistent logical dtypes (str, datetime, numeric)
- `validate_clean_schema.py`: verifies expected dtype families after casting
- `validate_schema_contract.py`: enforces:
  - Required columns
  - Primary key uniqueness
  - Logical dtype expectations
  - Structural dataset integrity
If any contract rule fails, the dataset is considered invalid.
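As a sketch, the first two contract rules (required columns, primary-key uniqueness) might look like this in pandas. The `check_contract` helper and its signature are illustrative assumptions, not the actual `validate_schema_contract.py` API:

```python
# Illustrative contract check (hypothetical, not the project's API):
# returns a list of violations; an empty list means the frame is valid.
import pandas as pd

def check_contract(df, required_columns, primary_key):
    """Return human-readable contract violations for one table."""
    errors = []
    missing = set(required_columns) - set(df.columns)
    if missing:
        errors.append(f"missing required columns: {sorted(missing)}")
    if primary_key in df.columns and df[primary_key].duplicated().any():
        errors.append(f"primary key not unique: {primary_key}")
    return errors

orders = pd.DataFrame({
    "order_id": ["a", "b", "b"],  # duplicate key -> contract violation
    "order_status": ["delivered", "shipped", "shipped"],
})
print(check_contract(orders, ["order_id", "order_status"], "order_id"))
# → ['primary key not unique: order_id']
```

Returning all violations as data (rather than raising on the first one) makes it easy to dump the full result into an audit report such as `reports/clean_contract_audit.csv`.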
## Skills Demonstrated

- Python package architecture (`src/` layout)
- Schema-driven data cleaning
- Defensive data validation
- Reproducible execution environments (`uv`)
- Logging and structured reporting
- End-to-end testing with pytest and CI via GitHub Actions (automated validation on samples)
- Clean project organization for portfolio use
## Challenges and Insights

This project involved iterating on a real-world data cleaning pipeline, revealing several practical lessons in data engineering:

- **Handling inconsistent data types in schema validation**: Python/pandas dtypes such as "str" vs. "string" vs. "object" carry real nuance. For example, CSV imports default to "object" and required explicit coercion in `enforce_schema.py` to match expected schemas, preventing downstream errors in analysis or ML workflows. This highlighted the importance of strict type enforcement early in pipelines.
- **Balancing reproducibility and simplicity**: Setting up a virtual environment with `uv` for fast, locked dependencies was straightforward, but integrating environment variables (e.g., for data directories) taught me about flexible configuration without hardcoding paths.
- **Modular design for maintainability**: Structuring the code as a package with separate scripts for extraction, validation, and auditing improved testability, but required careful import management to avoid circular dependencies.
- **Testing real-world data quirks**: Sample CSVs had edge cases like missing values and inconsistent formats, reinforcing the need for audits and E2E tests to catch issues that unit tests might miss.
- **Automation trade-offs**: Implementing CI with GitHub Actions automated the checks, but debugging workflow failures (e.g., env var mismatches) emphasized the value of clear logging and test isolation.
These experiences strengthened my approach to building robust, scalable data pipelines.
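The first lesson above can be reproduced in a few lines. This snippet only illustrates the "object" vs. "string" dtype nuance; it is not code from `enforce_schema.py`:

```python
# pandas reads CSV text columns as "object" by default, so a schema
# check expecting "string" fails until the column is explicitly cast.
import io

import pandas as pd

csv = io.StringIO("order_id,price\nA1,10.5\nA2,7.0\n")
df = pd.read_csv(csv)

print(df["order_id"].dtype)   # object: the pandas default for text
df["order_id"] = df["order_id"].astype("string")
print(df["order_id"].dtype)   # string: explicit, contract-friendly dtype
```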
Planned next steps:

- Add foreign key integrity validation (cross-table checks; partially implemented)
- Add domain/value constraints (e.g., non-negative price, valid order_status domain)
- Optional: Add test coverage reporting to CI (pytest-cov)
- Optional: Containerize with Dockerfile
- Explore dlt for declarative orchestration (in a separate branch)
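The planned domain/value constraints could take a shape like the following sketch. `check_values` and the example status domain are assumptions for illustration; none of this is implemented in the repo yet:

```python
# Sketch of possible domain/value constraints (not yet in the repo):
# non-negative prices and a closed order_status domain.
import pandas as pd

# Example domain only; the real Olist status domain is larger.
VALID_STATUSES = {"delivered", "shipped", "canceled", "invoiced"}

def check_values(df):
    """Count violations per rule; all zeros means the frame passes."""
    return {
        "negative_price": int((df["price"] < 0).sum()),
        "bad_order_status": int((~df["order_status"].isin(VALID_STATUSES)).sum()),
    }

items = pd.DataFrame({
    "price": [10.0, -1.0, 5.5],
    "order_status": ["delivered", "shipped", "teleported"],
})
print(check_values(items))  # {'negative_price': 1, 'bad_order_status': 1}
```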
## API Documentation

The project is structured as an installable Python package (`ecom_pipeline`).

Full API reference (auto-generated from docstrings): → View API Documentation
## License

MIT License. Copyright (c) 2026 Corin Stedman. See the LICENSE file for details.