This repository contains a complete solution for the Slooze data engineering challenge, implementing both Part A (Data Collection) and Part B (Exploratory Data Analysis) with a focus on B2B marketplace data from IndiaMART.
- Target Platform: IndiaMART (B2B marketplace)
- Product Categories: Industrial machinery, CNC machines, packaging machinery, textile machinery, food processing, construction equipment
- Approach: Custom web scraper using Selenium + BeautifulSoup
- Features: Rate limiting, user agent rotation, robust error handling
- Data Processing: Comprehensive cleaning and structuring pipeline
- Analysis: Price patterns, category distribution, geographical insights, data quality assessment
- Visualizations: Interactive charts and graphs using matplotlib, seaborn, and plotly
- Insights: Actionable recommendations and market insights
data-engineering-challenge-main/
├── scraper/
│ ├── __init__.py
│ └── simple_indiamart_scraper.py # Working IndiaMART scraper
├── data_processing/
│ ├── __init__.py
│ └── data_cleaner.py # Data cleaning pipeline
├── analysis/
│ ├── __init__.py
│ └── eda_analysis.py # EDA and visualization
├── public/
│ └── FFFFFF-1.png # Logo
├── main.py # Complete pipeline execution
├── run_simple_indiamart.py # IndiaMART scraper only
├── run_analysis_only.py # Analysis only
├── requirements.txt # Dependencies
├── real_indiamart_data.json # Scraped data (JSON)
├── real_indiamart_data.csv # Scraped data (CSV)
├── cleaned_indiamart_data.csv # Processed data
├── eda_insights.json # Analysis insights
├── *.png # Visualization files
└── README.md # This file
- Python 3.8+
- Chrome browser (for Selenium)
- Internet connection
1. Clone the repository

       git clone <repository-url>
       cd data-engineering-challenge-main

2. Create a virtual environment (recommended)

       python3 -m venv venv
       source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies

       pip install -r requirements.txt

4. Run the complete pipeline

       python main.py

Run individual components:

    # Run scraper only
    python run_simple_indiamart.py

    # Run analysis only (requires existing data)
    python run_analysis_only.py

- Selenium WebDriver: Handles dynamic content and JavaScript
- Rate Limiting: Respects website policies with random delays
- User Agent Rotation: Avoids detection using fake-useragent
- Error Handling: Robust exception handling and retry mechanisms
- Data Extraction: Product details, pricing, company info, specifications
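The rate limiting and user agent rotation described above can be sketched with a pair of small helpers. This is a minimal, self-contained illustration only: the repository's scraper uses Selenium with fake-useragent, whereas the static `USER_AGENTS` pool and the function names here are assumptions for the example.

```python
import random
import time

# Hypothetical static UA pool; the real scraper draws from fake-useragent.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def pick_user_agent() -> str:
    """Rotate user agents by sampling uniformly from the pool."""
    return random.choice(USER_AGENTS)

def polite_delay(min_s: float = 2.0, max_s: float = 5.0) -> float:
    """Sleep a random interval between requests to respect rate limits."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Calling `polite_delay()` between page fetches and setting a fresh `pick_user_agent()` value on each session keeps request patterns irregular enough to avoid trivial bot detection.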
- Text Cleaning: Normalization and standardization
- Price Extraction: Numeric price parsing from text
- Location Parsing: City, state, and country extraction
- Product Categorization: AI-based category classification
- Quality Metrics: Completeness scoring and validation
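The price-extraction and location-parsing steps above can be illustrated as follows. This is a simplified sketch; the function names and the regex are assumptions, not the repository's exact implementation.

```python
import re
from typing import Optional

def extract_price(text: str) -> Optional[float]:
    """Pull a numeric price out of free text like '₹ 1,25,000 / Piece'."""
    match = re.search(r"([\d,]+(?:\.\d+)?)", text)
    if not match:
        return None
    return float(match.group(1).replace(",", ""))

def parse_location(text: str) -> dict:
    """Split a location string like 'Mumbai, Maharashtra' into fields."""
    parts = [p.strip() for p in text.split(",") if p.strip()]
    return {
        "city": parts[0] if parts else None,
        "state": parts[1] if len(parts) > 1 else None,
    }
```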
- Statistical Analysis: Descriptive statistics and distributions
- Price Analysis: Range analysis, outlier detection, category-wise pricing
- Geographical Analysis: Regional distribution and patterns
- Category Analysis: Product type distribution and trends
- Data Quality Assessment: Completeness and reliability metrics
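As a sketch of the price analysis, the snippet below computes descriptive statistics and flags outliers with the standard 1.5×IQR rule (the function name and column name are assumptions; the repository may use a different outlier criterion).

```python
import pandas as pd

def price_summary(df: pd.DataFrame, price_col: str = "price") -> dict:
    """Descriptive stats plus an IQR-based outlier count for a price column."""
    prices = df[price_col].dropna()
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
    return {
        "mean": float(prices.mean()),
        "median": float(prices.median()),
        "outlier_count": int(len(outliers)),
    }
```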
- Completeness Score: Average data field completion rate
- High Quality Records: Records with ≥80% completeness
- Field-wise Analysis: Individual field completion rates
- Validation Rules: Data consistency and format validation
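The completeness metrics above can be sketched as a single report function: per-record completeness is the share of non-null fields, and a record counts as high quality when that share is at least the threshold (0.8 per the metric above). Function and key names are illustrative, not the repository's exact API.

```python
import pandas as pd

def completeness_report(df: pd.DataFrame, threshold: float = 0.8) -> dict:
    """Average completeness, high-quality record count, and per-field rates."""
    per_record = df.notna().mean(axis=1)  # fraction of filled fields per row
    return {
        "avg_completeness": float(per_record.mean()),
        "high_quality_records": int((per_record >= threshold).sum()),
        "field_completion": df.notna().mean().to_dict(),
    }
```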
- Chrome Driver: Automatically downloads and manages ChromeDriver
- Rate Limiting: Built-in delays to respect website policies
- Error Handling: Graceful handling of network issues and parsing errors
- Data Validation: Comprehensive data quality checks and cleaning
© Slooze. All Rights Reserved.
