Structured data-cleaning pipeline built on the Olist e-commerce dataset, focusing on reproducible ingestion, schema profiling, and light standardization using Python and pandas. Emphasizes clear pipeline stages, auditability, and portfolio-ready documentation.


E-commerce Data Cleaning (Olist)

CI · License: MIT · Python 3.11+

Latest release: v1.0.0-portfolio-ready

Quick Start

git clone https://github.com/space-lumps/ecommerce-data-cleaning.git
cd ecommerce-data-cleaning

# Create & activate virtual environment (recommended: uv)
uv venv
source .venv/bin/activate   # or .venv\Scripts\activate on Windows

# Install dependencies + editable package
uv pip install -e .

# Copy sample data (no Kaggle needed)
cp data/samples/*.csv data/raw/

# Run the full pipeline
uv run python run_pipeline.py

Objective

Build a reproducible, production-style data cleaning and validation pipeline for the Olist e-commerce dataset.

This project demonstrates:

  • Structured modular pipeline design
  • Explicit schema enforcement
  • Data type auditing and validation
  • Reproducible execution using uv
  • Clean src/ package architecture
  • Module-based execution (python -m)

Setup

Clone the repository

git clone https://github.com/space-lumps/ecommerce-data-cleaning.git
cd ecommerce-data-cleaning

Dataset Setup

The pipeline can run using either the included sample dataset or the full Kaggle dataset.

Option A — Use included sample dataset (no Kaggle required)

  1. Copy sample CSVs into data/raw/:

cp data/samples/*.csv data/raw/

  2. Ensure data/raw/ contains only one dataset version (either samples or full Kaggle data).

  3. Run the pipeline:

uv run python run_pipeline.py

Option B - Download full dataset from Kaggle

Expected location:

data/raw/

Manual Download

  1. Download the dataset from Kaggle: Brazilian E-Commerce Public Dataset by Olist
  2. Extract the CSV files.
  3. Move all .csv files into: data/raw/

Kaggle CLI (recommended)

Install Kaggle CLI:

pip install kaggle

Place kaggle.json in ~/.kaggle/, then run:

kaggle datasets download -d olistbr/brazilian-ecommerce -p data/raw --unzip

Data Directories

data/raw/       # Original source files
data/interim/   # Optional intermediate artifacts
data/clean/     # Cleaned parquet outputs
data/samples/   # Static lightweight dataset (no Kaggle required)
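
The directory names above are fixed by the repository layout; how the code resolves them is an implementation detail of utils/io.py. A minimal sketch of one way to make these paths configurable through an environment variable (the DATA_DIR variable name and the data/ default are assumptions for illustration, not necessarily what the package does):

# Hypothetical path-resolution sketch; DATA_DIR is an assumed variable name.
import os
from pathlib import Path

DATA_DIR = Path(os.environ.get("DATA_DIR", "data"))
RAW_DIR = DATA_DIR / "raw"
INTERIM_DIR = DATA_DIR / "interim"
CLEAN_DIR = DATA_DIR / "clean"
SAMPLES_DIR = DATA_DIR / "samples"

for directory in (RAW_DIR, INTERIM_DIR, CLEAN_DIR):
    directory.mkdir(parents=True, exist_ok=True)  # ensure output dirs exist before writing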

Project Structure

ecommerce-data-cleaning/
├── .github/
│   └── workflows/
│       └── ci.yml
├── data/
│   ├── clean/
│   ├── interim/
│   ├── raw/
│   └── samples/
│       └── [sample CSV files]     # e.g., olist_orders_dataset.csv (truncated for brevity)
├── docs/
│   ├── api/
│   │   └── [generated API docs]   # e.g., index.html (via pdoc)
│   ├── data_dictionary.md
│   ├── schema_contract.md
│   └── validation_strategy.md
├── reports/
│   └── [generated reports]        # e.g., raw_profile.csv (generated at runtime)
├── src/
│   └── ecom_pipeline/
│       ├── __init__.py
│       ├── config/
│       │   ├── __init__.py
│       │   └── schema_contract.py
│       ├── pipeline/
│       │   ├── __init__.py
│       │   ├── audit_all_clean_dtypes.py
│       │   ├── enforce_schema.py
│       │   ├── generate_data_dictionary.py
│       │   ├── profile_raw.py
│       │   ├── sanity_check_raw.py
│       │   ├── standardize_columns.py
│       │   ├── validate_clean_schema.py
│       │   └── validate_schema_contract.py
│       └── utils/
│           ├── __init__.py
│           ├── io.py
│           └── logging.py
├── tests/
│   ├── test_io.py                 # smoke tests for io utils
│   └── test_pipeline_e2e.py       # end-to-end pipeline smoke test
├── .pre-commit-config.yaml
├── .gitignore
├── .ruff.toml
├── LICENSE
├── README.md
├── pyproject.toml
├── requirements-lock.txt
├── requirements.txt
├── run_pipeline.py
└── uv.lock

The project follows a standard src/ layout: all reusable code lives inside the ecom_pipeline package, and tests live in a top-level tests/ directory.


Dependency Files

  • pyproject.toml — Defines the installable package and dependencies.
  • uv.lock — Locked dependency graph for reproducible environments (used by uv).
  • requirements.txt — Traditional dependency list (optional compatibility).
  • requirements-lock.txt — Pinned dependency versions (optional compatibility).

For this project, uv with pyproject.toml and uv.lock is the authoritative installation method.


Pipeline

Execution

Run the full pipeline:

uv run python run_pipeline.py

Run individual modules:

uv run python -m ecom_pipeline.pipeline.sanity_check_raw
uv run python -m ecom_pipeline.pipeline.profile_raw
uv run python -m ecom_pipeline.pipeline.generate_data_dictionary
uv run python -m ecom_pipeline.pipeline.standardize_columns
uv run python -m ecom_pipeline.pipeline.enforce_schema
uv run python -m ecom_pipeline.pipeline.validate_clean_schema
uv run python -m ecom_pipeline.pipeline.audit_all_clean_dtypes
uv run python -m ecom_pipeline.pipeline.validate_schema_contract

Pipeline Stages

  1. Sanity Check Raw – confirms raw files exist and are readable.
  2. Profile Raw – profiles source datasets before transformation.
  3. Generate Data Dictionary – generates docs/data_dictionary.md from reports/raw_profile.csv.
  4. Standardize Columns – applies consistent column naming.
  5. Enforce Schema – applies explicit casting rules to produce clean parquet outputs.
  6. Validate Clean Schema – verifies data types match expectations.
  7. Audit Dtypes – flags suspicious type patterns using heuristics.
  8. Validate Schema Contract – enforces required columns, primary key uniqueness, and logical dtype guarantees.
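
A minimal sketch of how these stages can be chained from a single entry point, roughly what run_pipeline.py does. The module names come from the repository layout above; the assumption that each module exposes a main() callable is illustrative, not taken from the source.

# Orchestration sketch: assumes each pipeline module exposes a main() callable.
import logging

from ecom_pipeline.pipeline import (
    sanity_check_raw,
    profile_raw,
    generate_data_dictionary,
    standardize_columns,
    enforce_schema,
    validate_clean_schema,
    audit_all_clean_dtypes,
    validate_schema_contract,
)

STAGES = [
    sanity_check_raw,
    profile_raw,
    generate_data_dictionary,
    standardize_columns,
    enforce_schema,
    validate_clean_schema,
    audit_all_clean_dtypes,
    validate_schema_contract,
]


def run() -> None:
    for stage in STAGES:
        logging.info("Running stage: %s", stage.__name__)
        stage.main()  # assumed entry point; each module is also runnable via python -m


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run()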

Outputs

  • data/clean/*.parquet
  • docs/data_dictionary.md
  • reports/raw_profile.csv
  • reports/clean_schema_audit.csv
  • reports/clean_dtypes_full.csv
  • reports/clean_dtypes_flags.csv
  • reports/clean_contract_audit.csv

Validation & Schema Enforcement

The pipeline enforces structural guarantees after cleaning.

1. Deterministic Type Casting

  • Implemented in enforce_schema.py
  • Ensures consistent logical dtypes (str, datetime, numeric)
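
The exact casting rules live in enforce_schema.py; the sketch below only illustrates the pattern on the orders table (the orders.parquet output filename is an assumption).

# Illustrative casting pattern, not the project's actual rules.
import pandas as pd

df = pd.read_csv("data/raw/olist_orders_dataset.csv")

# Identifiers become nullable pandas strings instead of generic "object".
df["order_id"] = df["order_id"].astype("string")
df["customer_id"] = df["customer_id"].astype("string")

# Timestamps become proper datetimes; unparseable values become NaT rather than failing.
df["order_purchase_timestamp"] = pd.to_datetime(df["order_purchase_timestamp"], errors="coerce")

df.to_parquet("data/clean/orders.parquet", index=False)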

2. Clean Schema Validation

  • Implemented in validate_clean_schema.py
  • Verifies expected dtype families after casting
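
Conceptually this means checking dtype families rather than exact dtype strings. A minimal sketch, assuming a hypothetical expected-dtype mapping for the orders table (the real expectations are defined in the package):

# Illustrative dtype-family check; the real logic lives in validate_clean_schema.py.
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_string_dtype

EXPECTED = {  # hypothetical expectations for the orders table
    "order_id": is_string_dtype,
    "order_purchase_timestamp": is_datetime64_any_dtype,
}

df = pd.read_parquet("data/clean/orders.parquet")
failures = [col for col, check in EXPECTED.items() if col not in df or not check(df[col])]
if failures:
    raise ValueError(f"Clean schema validation failed for columns: {failures}")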

3. Schema Contract Validation

  • Implemented in validate_schema_contract.py
  • Enforces:
    • Required columns
    • Primary key uniqueness
    • Logical dtype expectations
    • Structural dataset integrity

If any contract rule fails, the dataset is considered invalid.
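
A minimal sketch of what such a contract check can look like; the actual contract lives in config/schema_contract.py, and the required columns and primary key below are assumptions chosen for illustration.

# Hypothetical contract check: required columns plus primary-key completeness and uniqueness.
import pandas as pd


def check_contract(df: pd.DataFrame, required: set[str], primary_key: str) -> list[str]:
    errors: list[str] = []
    missing = required - set(df.columns)
    if missing:
        errors.append(f"missing required columns: {sorted(missing)}")
    if primary_key not in df.columns:
        errors.append(f"primary key '{primary_key}' is missing")
    else:
        if df[primary_key].isna().any():
            errors.append(f"primary key '{primary_key}' contains nulls")
        if df[primary_key].duplicated().any():
            errors.append(f"primary key '{primary_key}' contains duplicates")
    return errors


orders = pd.read_parquet("data/clean/orders.parquet")
problems = check_contract(orders, {"order_id", "customer_id", "order_status"}, "order_id")
if problems:
    raise ValueError("; ".join(problems))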


Skills Demonstrated

  • Python package architecture (src/ layout)
  • Schema-driven data cleaning
  • Defensive data validation
  • Reproducible execution environments (uv)
  • Logging and structured reporting
  • End-to-end testing with pytest and automated CI validation on sample data via GitHub Actions
  • Clean project organization for portfolio use

Challenges and Insights

This project involved iterating on a real-world data cleaning pipeline, revealing several practical lessons in data engineering:

  • Handling Inconsistent Data Types in Schema Validation: Python/pandas dtypes such as "str", "string", and "object" behave differently; CSV imports default text columns to "object", so enforce_schema.py coerces them explicitly to match the expected schema and prevent downstream errors in analysis or ML workflows (see the sketch after this list). This highlighted the importance of strict type enforcement early in pipelines.
  • Balancing Reproducibility and Simplicity: Setting up a virtual environment with uv for fast, locked dependencies was straightforward, but integrating environment variables (e.g., for data directories) taught me to keep configuration flexible instead of hardcoding paths.
  • Modular Design for Maintainability: Structuring the code as a package with separate scripts for extraction, validation, and auditing improved testability, but required careful import management to avoid circular dependencies.
  • Testing Real-World Data Quirks: The sample CSVs contain edge cases such as missing values and inconsistent formats, reinforcing the need for audits and end-to-end tests to catch issues that unit tests might miss.
  • Automation Trade-offs: Implementing CI with GitHub Actions automated the checks, but debugging workflow failures (e.g., environment variable mismatches) emphasized the value of clear logging and test isolation.
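
The dtype distinction called out in the first bullet is easy to reproduce; a tiny self-contained example (the column name is arbitrary):

# "object" vs "string": pandas reads CSV text as object; casting makes the intent explicit.
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO("order_status\ndelivered\nshipped\n"))
print(df["order_status"].dtype)   # object, the default for text read from CSV

df["order_status"] = df["order_status"].astype("string")
print(df["order_status"].dtype)   # string, the nullable pandas StringDtype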

These experiences strengthened my approach to building robust, scalable data pipelines.


Future Improvements

  • Add foreign key integrity validation (cross-table checks – partially implemented)
  • Add domain/value constraints (e.g., non-negative price, valid order_status domain)
  • Optional: Add test coverage reporting to CI (pytest-cov)
  • Optional: Containerize with Dockerfile
  • Explore dlt for declarative orchestration (in a separate branch)

API Documentation

The project is structured as an installable Python package (ecom_pipeline).

Full API reference (auto-generated from docstrings):
View API Documentation


License

MIT License

Copyright (c) 2026 Corin Stedman

See the LICENSE file for details.
