Manual data cleaning can cost teams many hours of productive time. This project demonstrates an automated, self-service pipeline that turns fragmented, messy CSV exports into a clean dataset and an interactive analytical dashboard.
Because the cleaning rules are applied the same way on every upload, the pipeline reduces manual errors and gives decision-makers a consistent view of the data without waiting for a hand-built report.
Raw data from CRM or sales exports is often inconsistent. This system is designed to handle:
- Date Inconsistency: Mixed formats and missing-value placeholders (e.g., "Jan 23", "2023-01-01", "N/A").
- Financial Formatting: Currency symbols and strings that prevent mathematical analysis.
- Categorical Noise: Inconsistent regional/product naming (e.g., "north" vs "North").
- Data Redundancy: Duplicate transactions that skew revenue totals.
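The four problems above could be handled along the following lines. This is a minimal sketch rather than the project's actual code: the function name `clean_sales` and the column names `date`, `amount`, and `region` are assumptions made for illustration. Dates are parsed one value at a time so that mixed formats do not depend on a single inferred format; the other steps use Pandas column-wise operations.

```python
import pandas as pd


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize a raw sales export (hypothetical columns: date, amount, region)."""
    out = df.copy()

    # Date inconsistency: parse each value on its own; anything that cannot
    # be parsed (for example "N/A") becomes NaT instead of raising.
    out["date"] = out["date"].map(lambda v: pd.to_datetime(v, errors="coerce"))

    # Financial formatting: drop everything except digits, the decimal point
    # and a minus sign, then convert to a numeric column.
    digits_only = out["amount"].astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
    out["amount"] = pd.to_numeric(digits_only, errors="coerce")

    # Categorical noise: trim whitespace and unify capitalization,
    # so "north" and " North " both become "North".
    out["region"] = out["region"].str.strip().str.title()

    # Data redundancy: drop exact duplicates after normalization, so rows
    # that differed only in formatting collapse into one.
    return out.drop_duplicates().reset_index(drop=True)
```

Deduplicating after normalization is a deliberate choice in this sketch: two rows that differ only in capitalization or currency formatting would otherwise both survive and inflate revenue totals.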
I developed a custom Python-based engine that performs:
- Automated Validation: Ensures the uploaded file meets business requirements.
- Vectorized Cleaning: Uses Pandas column-wise operations instead of row-by-row loops, so files with thousands of rows are cleaned quickly.
- Regex Sanitization: Advanced text processing to extract numeric values from dirty strings.
- Interactive BI: A Streamlit dashboard that allows stakeholders to explore data in real time.
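As an illustration of the validation step, the check below rejects an upload before any cleaning runs. It is a sketch under stated assumptions: the required columns (`date`, `amount`, `region`) and the function name `validate_upload` are hypothetical, and the real business requirements may differ.

```python
import pandas as pd

# Hypothetical schema; replace with the columns the business actually requires.
REQUIRED_COLUMNS = {"date", "amount", "region"}


def validate_upload(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the file is usable."""
    problems = []

    if df.empty:
        problems.append("file contains no rows")

    # Compare header names case-insensitively and ignore stray whitespace.
    present = {str(col).strip().lower() for col in df.columns}
    missing = sorted(REQUIRED_COLUMNS - present)
    if missing:
        problems.append("missing columns: " + ", ".join(missing))

    return problems
```

Returning a list of messages instead of raising on the first failure lets the dashboard show every problem with the file at once.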
The project is built with:
- Engine: Python, Pandas, NumPy
- Visualization: Plotly Express (Interactive Charts)
- Deployment: Streamlit Cloud (SaaS Interface)
Key features:
- One-Click Upload: Users drag a raw CSV and get results instantly.
- Smart Metrics: Real-time calculation of Total Revenue and Transaction Volume.
- Automated Audit: Shows exactly how many duplicates were removed and rows processed.
- Export Ready: Download the cleaned data immediately for use in other tools.
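The metrics, audit, and export features could be computed roughly as follows. This is a sketch, not the project's implementation: the function names are hypothetical, and it assumes the input has already been normalized and has a numeric `amount` column.

```python
import pandas as pd


def dedupe_with_audit(normalized: pd.DataFrame):
    """Drop duplicate rows and report what was removed (assumes numeric 'amount')."""
    duplicate_mask = normalized.duplicated()
    deduped = normalized[~duplicate_mask].reset_index(drop=True)

    audit = {
        "rows_processed": len(normalized),
        "duplicates_removed": int(duplicate_mask.sum()),
        "transaction_volume": len(deduped),
        "total_revenue": float(deduped["amount"].sum()),
    }
    return deduped, audit


def to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize the cleaned data for download, without the index column."""
    return df.to_csv(index=False).encode("utf-8")
```

In a Streamlit app, the bytes returned by `to_csv_bytes` could be passed to `st.download_button`, and the audit values could be displayed as dashboard metrics.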
Getting started:
- Live Demo: [INSERT YOUR STREAMLIT URL HERE]
- Local Setup:
  - Clone this repository.
  - Install dependencies: `pip install -r requirements.txt`
  - Run the app: `streamlit run app.py`
