Slooze Data Engineering Challenge - Complete Solution

Challenge Overview

This repository contains a complete solution for the Slooze data engineering challenge, implementing both Part A (Data Collection) and Part B (Exploratory Data Analysis) with a focus on B2B marketplace data from IndiaMART.

Solution Architecture

Part A - Data Collection

  • Target Platform: IndiaMART (B2B marketplace)
  • Product Categories: Industrial machinery, CNC machines, packaging machinery, textile machinery, food processing, construction equipment
  • Approach: Custom web scraper using Selenium + BeautifulSoup
  • Features: Rate limiting, user agent rotation, robust error handling

Part B - Exploratory Data Analysis

  • Data Processing: Comprehensive cleaning and structuring pipeline
  • Analysis: Price patterns, category distribution, geographical insights, data quality assessment
  • Visualizations: Interactive charts and graphs using matplotlib, seaborn, and plotly
  • Insights: Actionable recommendations and market insights

Project Structure

data-engineering-challenge-main/
├── scraper/
│   ├── __init__.py
│   └── simple_indiamart_scraper.py   # Working IndiaMART scraper
├── data_processing/
│   ├── __init__.py
│   └── data_cleaner.py               # Data cleaning pipeline
├── analysis/
│   ├── __init__.py
│   └── eda_analysis.py               # EDA and visualization
├── public/
│   └── FFFFFF-1.png                  # Logo
├── main.py                          # Complete pipeline execution
├── run_simple_indiamart.py          # IndiaMART scraper only
├── run_analysis_only.py             # Analysis only
├── requirements.txt                 # Dependencies
├── real_indiamart_data.json         # Scraped data (JSON)
├── real_indiamart_data.csv          # Scraped data (CSV)
├── cleaned_indiamart_data.csv       # Processed data
├── eda_insights.json                # Analysis insights
├── *.png                            # Visualization files
└── README.md                        # This file

Quick Start

Prerequisites

  • Python 3.8+
  • Chrome browser (for Selenium)
  • Internet connection

Installation

  1. Clone the repository

    git clone <repository-url>
    cd data-engineering-challenge-main
  2. Create virtual environment (recommended)

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Run the complete pipeline

    python main.py

Alternative Execution Options

Run only the IndiaMART scraper:

python run_simple_indiamart.py

Run only the analysis (requires existing data):

python run_analysis_only.py

Technical Implementation

Web Scraper Features

  • Selenium WebDriver: Handles dynamic content and JavaScript
  • Rate Limiting: Respects website policies with random delays
  • User Agent Rotation: Avoids detection using fake-useragent
  • Error Handling: Robust exception handling and retry mechanisms
  • Data Extraction: Product details, pricing, company info, specifications
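The rate-limiting and user-agent-rotation ideas above can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: the actual scraper uses Selenium and fake-useragent, and the agent pool and function names here are hypothetical.

```python
import random
import time

# Illustrative user-agent pool; the real scraper generates these
# dynamically with fake-useragent rather than a hard-coded list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def pick_user_agent():
    """Rotate user agents by choosing one at random per request."""
    return random.choice(USER_AGENTS)

def rate_limited_fetch(url, min_delay=2.0, max_delay=5.0):
    """Sleep a random interval before each request, then return the
    request metadata that would be handed to the browser driver."""
    time.sleep(random.uniform(min_delay, max_delay))
    return {"url": url, "headers": {"User-Agent": pick_user_agent()}}
```

The random delay (rather than a fixed one) makes the request cadence harder to fingerprint while still respecting the target site's load.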

Data Processing Pipeline

  • Text Cleaning: Normalization and standardization
  • Price Extraction: Numeric price parsing from text
  • Location Parsing: City, state, and country extraction
  • Product Categorization: AI-based category classification
  • Quality Metrics: Completeness scoring and validation
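The price-extraction step above can be illustrated with a small regex-based parser. This is a hedged sketch, not the code in data_cleaner.py; it assumes prices arrive as free text with Indian digit grouping.

```python
import re
from typing import Optional

def extract_price(text: str) -> Optional[float]:
    """Parse the first number out of a free-text price string.
    Handles Indian digit grouping, e.g. 'Rs 1,50,000 / Piece' -> 150000.0.
    Returns None when no numeric price is present."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    if not match:
        return None
    return float(match.group().replace(",", ""))
```

Strings like "Price on Request", common on B2B listings, fall through to None so downstream analysis can treat them as missing rather than zero.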

EDA Analysis

  • Statistical Analysis: Descriptive statistics and distributions
  • Price Analysis: Range analysis, outlier detection, category-wise pricing
  • Geographical Analysis: Regional distribution and patterns
  • Category Analysis: Product type distribution and trends
  • Data Quality Assessment: Completeness and reliability metrics
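One common way to implement the outlier detection mentioned above is Tukey's 1.5 × IQR rule; the sketch below (using only the standard library's statistics module) shows the idea, though the repository's eda_analysis.py may use a different method.

```python
from statistics import quantiles
from typing import List

def price_outliers(prices: List[float]) -> List[float]:
    """Flag prices outside the 1.5 * IQR Tukey fences."""
    q1, _, q3 = quantiles(prices, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [p for p in prices if p < lower or p > upper]
```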

Data Quality Metrics

  • Completeness Score: Average data field completion rate
  • High Quality Records: Records with ≥80% completeness
  • Field-wise Analysis: Individual field completion rates
  • Validation Rules: Data consistency and format validation
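The completeness score and the ≥80% high-quality threshold can be computed as below. The field schema here is illustrative, not the actual column set used by the pipeline.

```python
from typing import Dict, List

# Hypothetical expected schema for a scraped listing.
REQUIRED_FIELDS = ["product_name", "price", "company", "location"]

def completeness_score(record: Dict[str, object],
                       fields: List[str] = REQUIRED_FIELDS) -> float:
    """Fraction of expected fields that are present and non-empty."""
    filled = sum(1 for f in fields if record.get(f) not in (None, "", []))
    return filled / len(fields)

def high_quality_records(records, threshold: float = 0.8):
    """Keep records meeting the >= 80% completeness bar."""
    return [r for r in records if completeness_score(r) >= threshold]
```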

Important Notes

  • Chrome Driver: Automatically downloads and manages ChromeDriver
  • Rate Limiting: Built-in delays to respect website policies
  • Error Handling: Graceful handling of network issues and parsing errors
  • Data Validation: Comprehensive data quality checks and cleaning

License

© Slooze. All Rights Reserved.
