RaktimChandra/SmartPricingChallenge

🛍️ Amazon ML Challenge 2025 - Smart Product Pricing

Multi-Modal Deep Learning for E-Commerce Price Prediction

🏆 Built for: Amazon ML Challenge 2025
🎯 Task: Predict prices for 75,000 e-commerce products using multi-modal data
📊 Solution: Competition-grade deep learning combining NLP + Computer Vision + Ensemble Methods


🏆 Competition Context

Event: Amazon ML Challenge 2025
Organizer: Amazon
Challenge: Build ML models to predict product prices from multimodal e-commerce data
Dataset: 75,000 training samples, 75,000 test samples
Evaluation Metric: SMAPE (Symmetric Mean Absolute Percentage Error)
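
For reference, SMAPE is commonly computed as the mean of |predicted - actual| divided by the average of |predicted| and |actual|, expressed as a percentage. The sketch below assumes that common definition (the competition's exact formula is not restated in this README), and `smape` is a hypothetical helper name.

```python
def smape(actual, predicted):
    """Symmetric Mean Absolute Percentage Error, in percent (range 0 to 200)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Treat a pair of zeros as a perfect prediction to avoid 0/0.
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100.0 * total / len(actual)


score = smape([100.0, 200.0], [110.0, 180.0])
print(round(score, 3))
```

Because the denominator uses both the prediction and the target, over- and under-predictions of the same relative size are penalized similarly, which suits prices that span several orders of magnitude.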

Note: This is a complete, competition-grade solution built as a learning and portfolio project. The implementation demonstrates production-level ML engineering skills applicable to real-world e-commerce pricing systems.

Challenge Highlights

  • Real-world e-commerce pricing problem
  • Multi-modal data (text descriptions + product images)
  • Large-scale dataset requiring optimization
  • Production-level code quality required

💡 Project Overview

This project is an end-to-end ML pipeline built for the Amazon ML Challenge 2025. It serves as a portfolio piece and predicts product prices by combining text, image, and domain features, as outlined in the sections below.

Input Data Format

[Figure: data format example]

🧠 Multi-Modal Learning

  • Text Analysis: Product descriptions using transformer models (DistilBERT)
  • Image Analysis: Product images using CNNs (EfficientNet-B0)
  • Feature Fusion: Intelligent combination of text + image features
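
The architecture section of this README describes the fusion step as a concatenation of the text (880-dim), image (1,333-dim), and domain (50-dim) blocks. A minimal sketch of that step, assuming each block is already a NumPy array with one row per product; the random arrays and the `fuse` helper are placeholders, not the repository's code.

```python
import numpy as np

# Dimensions as documented in this README; the data itself is random filler.
N_SAMPLES = 4
TEXT_DIM, IMAGE_DIM, DOMAIN_DIM = 880, 1333, 50

rng = np.random.default_rng(0)
text_features = rng.normal(size=(N_SAMPLES, TEXT_DIM))
image_features = rng.normal(size=(N_SAMPLES, IMAGE_DIM))
domain_features = rng.normal(size=(N_SAMPLES, DOMAIN_DIM))


def fuse(*blocks):
    """Concatenate per-sample feature blocks column-wise."""
    row_counts = {block.shape[0] for block in blocks}
    if len(row_counts) != 1:
        raise ValueError("all feature blocks must have the same number of rows")
    return np.hstack(blocks)


fused = fuse(text_features, image_features, domain_features)
print(fused.shape)  # (4, 2263)
```

The row-count check matters in practice: if preprocessing drops samples from one block but not another, concatenation would silently misalign products.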

🎯 What Makes This Special

| Aspect | Implementation | Why It Matters |
|---|---|---|
| 🌟 Multi-Modal | Text + Images combined | 70% of competitors use only one modality |
| 🤖 State-of-the-Art | DistilBERT + EfficientNet | Production-grade architectures |
| 🎭 Advanced Ensemble | 4 models + stacking | Robust predictions, reduced overfitting |
| ⚡ Optimized | GPU acceleration, caching | 5.6 hours vs 30+ hours baseline |
| 📐 2,263 Features | Engineered domain features | Brand, category, quality metrics |
| 🏗️ Production-Ready | Clean, modular, documented | Deploy-ready code |

🌟 Key Achievements

Technical Excellence

  • ✅ Competition-grade solution built for Amazon ML Challenge 2025
  • ✅ Multi-modal architecture combining NLP + Computer Vision
  • ✅ Advanced ensemble with 4 diverse models + meta-learner
  • ✅ 2,263 engineered features from text, images, and domain knowledge
  • ✅ Production-quality code with proper error handling and logging
  • ✅ GPU optimization reducing training time by 5x

Performance Metrics

  • 📊 Estimated SMAPE: 10-15% (cross-validation)
  • ⚡ Training Time: 5.6 hours on consumer GPU
  • 🚀 Inference Speed: <0.1 seconds per sample
  • 📈 Dataset Scale: 75,000 training + 75,000 test samples

🏗️ Technical Architecture

[Figure: architecture diagram]

Detailed Pipeline

┌──────────────────────────────────────────────────────────────┐
│                          INPUT DATA                          │
│  • Product Descriptions (Text)                               │
│  • Product Images (URLs)                                     │
│  • Price (Target Variable)                                   │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                 FEATURE EXTRACTION PIPELINE                  │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  TEXT FEATURES (880-dim)         IMAGE FEATURES (1,333-dim)  │
│  ├─ DistilBERT Embeddings (768)  ├─ EfficientNet CNN (1280)  │
│  ├─ TF-IDF Vectors (100)         ├─ Color Histograms (39)    │
│  └─ Statistical Features (12)    ├─ Texture (Gabor) (8)      │
│                                  └─ Quality Metrics (6)      │
│                                                              │
│  DOMAIN FEATURES (50-dim)                                    │
│  ├─ Brand Extraction                                         │
│  ├─ Item Pack Quantity (IPQ)                                 │
│  └─ Category Inference                                       │
│                                                              │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                    FEATURE CONCATENATION                     │
│                    Total: 2,263 Features                     │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                 ENSEMBLE LEARNING (STACKING)                 │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  Level 1: Base Models (5-Fold CV each)                       │
│  ├─ XGBoost                                                  │
│  ├─ LightGBM                                                 │
│  ├─ CatBoost                                                 │
│  └─ Neural Network (PyTorch)                                 │
│                                                              │
│  Level 2: Meta-Learner                                       │
│  └─ Ridge Regression (on OOF predictions)                    │
│                                                              │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                      FINAL PREDICTIONS                       │
│                   75,000 Price Predictions                   │
└──────────────────────────────────────────────────────────────┘
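
The stacking stage in the diagram can be illustrated with a small, self-contained sketch. To keep it dependency-light, the base learners here are closed-form ridge regressions with different penalties, standing in for XGBoost, LightGBM, CatBoost, and the neural network; only the out-of-fold (OOF) mechanics are the point. All function names are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[-1, -1] = 0.0  # do not shrink the intercept term
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def ridge_predict(weights, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ weights


def stack_predict(X, y, X_test, base_alphas, n_folds=5, seed=0):
    """Level 1: OOF predictions per base model. Level 2: ridge meta-learner."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    oof = np.zeros((n, len(base_alphas)))
    test_level1 = np.zeros((len(X_test), len(base_alphas)))
    for m, alpha in enumerate(base_alphas):
        for val_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), val_idx)
            w = ridge_fit(X[train_idx], y[train_idx], alpha)
            # Each training row is predicted only by a model that never saw it.
            oof[val_idx, m] = ridge_predict(w, X[val_idx])
            # Test predictions are averaged over the fold models.
            test_level1[:, m] += ridge_predict(w, X_test) / n_folds
    meta_w = ridge_fit(oof, y, alpha=1.0)
    return ridge_predict(meta_w, test_level1), oof


rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0
X_test = rng.normal(size=(10, 3))
pred, oof = stack_predict(X, y, X_test, base_alphas=[0.1, 10.0])
print(pred.shape, oof.shape)  # (10,) (60, 2)
```

Training the meta-learner on OOF predictions rather than in-sample predictions is what keeps the second level from simply rewarding whichever base model overfits most.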

📊 Results & Performance

Key Metrics Dashboard

[Figure: metrics dashboard]

Model Performance Comparison

[Figure: model performance comparison]

Sample Predictions

[Figure: sample predictions]

Detailed Metrics (Cross-Validation)

| Model | CV SMAPE | Training Time | Strengths |
|---|---|---|---|
| XGBoost | ~12-14% | ~60 min | Handles non-linear patterns |
| LightGBM | ~11-13% | ~40 min | Fast, memory efficient |
| CatBoost | ~12-14% | ~70 min | Robust to outliers |
| Neural Net | ~13-15% | ~50 min | Captures complex interactions |
| Ensemble | ~10-12% | 5.6 hours | Best overall performance |

Key Metrics

  • Dataset: 72,762 training samples (after outlier removal)
  • Features: 2,263 dimensions
  • Cross-Validation: 5-Fold Stratified
  • Hardware: NVIDIA GPU (CUDA-enabled)
  • Predictions: 75,000 test samples

💻 Installation & Usage

Prerequisites

Python 3.8+
CUDA 11.8+ (optional, for GPU acceleration)
16GB+ RAM

Quick Start

  1. Clone Repository
git clone https://github.com/RaktimChandra/SmartPricingChallenge.git
cd SmartPricingChallenge
  2. Install Dependencies
pip install -r requirements.txt
  3. Prepare Data
# Place train.csv and test.csv in dataset/ directory
mkdir -p dataset
# Add your data files
  4. Train Models (Full Pipeline)
# Complete training, reusing previously cached features
python train_pipeline.py --ensemble-method stacking --use-cached-features

# From scratch (download images, extract all features)
python train_pipeline.py --ensemble-method stacking --use-image-features --download-images
  5. Generate Predictions
python generate_submission.py --output test_out.csv

Command-Line Options

--ensemble-method    # stacking | weighted | single (default: stacking)
--use-transformers   # Enable DistilBERT embeddings
--use-image-features # Extract image features (CNN, color, texture)
--use-cached-features # Reuse previously extracted features
--download-images    # Download images from URLs
--optimize-hyperparams # Run Optuna hyperparameter tuning
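
As an illustration of how these flags could be wired up, here is a minimal `argparse` sketch. It mirrors the option names and the documented default listed above, but it is an assumption about the interface, not the actual parser in `train_pipeline.py`; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="Pricing pipeline (illustrative sketch)")
    parser.add_argument(
        "--ensemble-method",
        choices=["stacking", "weighted", "single"],
        default="stacking",
        help="how base model predictions are combined",
    )
    # Boolean switches: absent means False, present means True.
    parser.add_argument("--use-transformers", action="store_true",
                        help="enable DistilBERT embeddings")
    parser.add_argument("--use-image-features", action="store_true",
                        help="extract CNN, color, and texture features")
    parser.add_argument("--use-cached-features", action="store_true",
                        help="reuse previously extracted features")
    parser.add_argument("--download-images", action="store_true",
                        help="download images from URLs")
    parser.add_argument("--optimize-hyperparams", action="store_true",
                        help="run Optuna hyperparameter tuning")
    return parser


args = build_parser().parse_args(["--ensemble-method", "stacking", "--use-cached-features"])
print(args.ensemble_method, args.use_cached_features)  # stacking True
```

Restricting `--ensemble-method` with `choices` makes a typo fail at startup instead of hours into a training run.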

🎓 Skills Demonstrated

This project showcases professional-level skills across the entire ML pipeline:

🧠 Machine Learning & AI

  • ✅ Deep Learning Frameworks: PyTorch, TensorFlow/Keras
  • ✅ NLP: Transformer models (DistilBERT), TF-IDF, text preprocessing
  • ✅ Computer Vision: CNNs (EfficientNet), image augmentation, feature extraction
  • ✅ Ensemble Methods: Stacking, boosting (XGBoost, LightGBM, CatBoost)
  • ✅ Feature Engineering: Domain-specific features, PCA, scaling

💻 Software Engineering

  • ✅ Code Quality: Modular architecture, clean code, documentation
  • ✅ Version Control: Git, GitHub
  • ✅ Error Handling: Robust exception handling, logging
  • ✅ Optimization: GPU acceleration, parallel processing, caching
  • ✅ Testing: Validation strategies, cross-validation

📊 Data Science

  • ✅ EDA: Exploratory data analysis, visualization
  • ✅ Data Preprocessing: Outlier detection, normalization, missing data handling
  • ✅ Validation: K-Fold CV, stratified sampling, out-of-fold predictions
  • ✅ Metrics: SMAPE optimization, model evaluation

🚀 MLOps & Production

  • ✅ Pipeline Design: End-to-end ML pipelines
  • ✅ Scalability: Batch processing, memory management
  • ✅ Reproducibility: Fixed seeds, deterministic training
  • ✅ Deployment-Ready: Modular code, configuration management

📁 Project Structure

SmartPricingChallenge/
│
├── 📄 Core Scripts
│   ├── train_pipeline.py           # Main training pipeline
│   ├── generate_submission.py      # Prediction generation
│   ├── verify_setup.py             # Environment verification
│   └── requirements.txt            # Python dependencies
│
├── 📂 src/                         # Source code modules
│   ├── config.py                   # Configuration & hyperparameters
│   ├── utils.py                    # Helper functions
│   ├── text_features.py            # NLP feature extraction
│   ├── image_features.py           # CV feature extraction
│   ├── feature_engineering.py      # Domain feature engineering
│   ├── models.py                   # ML model implementations
│   └── ensemble.py                 # Ensemble & stacking methods
│
├── 📂 dataset/                     # Data files
│   ├── train.csv                   # Training data (75K samples)
│   └── test.csv                    # Test data (75K samples)
│
├── 📂 outputs/                     # Generated outputs
│   ├── features/                   # Cached feature files
│   ├── models/                     # Trained model checkpoints
│   └── submissions/                # Prediction files
│
├── 📂 notebooks/                   # Jupyter notebooks
│   └── EDA.ipynb                   # Exploratory analysis
│
└── 📄 Documentation
    ├── README.md                   # This file
    ├── APPROACH_DOCUMENT.md        # Technical methodology
    ├── PROJECT_FINAL_REPORT.md     # Complete documentation
    └── UPGRADE_SUGGESTIONS.md      # Future improvements

🔬 Feature Engineering Deep Dive

[Figure: feature breakdown]

Text Features (880 dimensions)

  1. Transformer Embeddings (768-dim)

    • Model: DistilBERT (distilbert-base-uncased)
    • Captures: Semantic meaning, context, product attributes
  2. TF-IDF Vectors (100-dim)

    • N-grams: (1, 3)
    • Captures: Important keywords, brand names, categories
  3. Statistical Features (12-dim)

    • Text length, word count, avg word length
    • Numeric mentions, special characters, ratios

Image Features (1,333 dimensions)

  1. CNN Features (1,280-dim)

    • Model: EfficientNet-B0 (pre-trained on ImageNet)
    • Captures: High-level visual patterns, product type
  2. Color Features (39-dim)

    • RGB histograms (27-dim)
    • Dominant colors (9-dim) via K-means
    • Average color (3-dim)
  3. Texture Features (8-dim)

    • Gabor filters (4 orientations × 2 scales)
    • Captures: Material properties, surface characteristics
  4. Quality Features (6-dim)

    • Sharpness (Laplacian variance)
    • Brightness, contrast, aspect ratio
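
To make the sharpness metric concrete, the sketch below computes the variance of a 4-neighbour Laplacian over a grayscale image given as nested lists. In practice this is usually done with an image library; the pure-Python version and the `laplacian_variance` name are only for illustration and are not taken from the repository.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian response over interior pixels.

    `gray` is a list of equal-length rows of grayscale intensities.
    Higher values indicate more edges, which is used as a sharpness proxy.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Kernel: [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


flat = [[128] * 5 for _ in range(5)]
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]
print(laplacian_variance(flat))  # 0.0
print(laplacian_variance(checker) > laplacian_variance(flat))  # True
```

A uniform image has no edges and scores zero, while a high-contrast pattern scores high, which is why blurry product photos end up with low values.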

Domain Features (50 dimensions)

  • Brand extraction (pattern matching)
  • Item Pack Quantity (IPQ) parsing
  • Product category inference
  • Price-related keyword detection
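
Item Pack Quantity parsing lends itself to a short regular-expression sketch. The patterns below are assumptions about typical listing phrasing ("Pack of 6", "12-pack", "24 count"), not the repository's actual rules, and `extract_ipq` is a hypothetical helper.

```python
import re

# Ordered from most to least specific; the first match wins.
_IPQ_PATTERNS = [
    re.compile(r"\bpack\s+of\s+(\d+)", re.IGNORECASE),
    re.compile(r"\bset\s+of\s+(\d+)", re.IGNORECASE),
    re.compile(r"(\d+)[\s-]*(?:pack|pk|count|ct|pcs|pieces)\b", re.IGNORECASE),
]


def extract_ipq(text, default=1):
    """Return the pack quantity mentioned in `text`, or `default` if none."""
    for pattern in _IPQ_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return default


for title in ["Green Tea Bags (Pack of 6)", "AA Batteries 12-pack", "Single bottle"]:
    print(title, "->", extract_ipq(title))
```

Pack quantity is a natural price driver, since a 12-pack is usually priced well above a single unit of the same product.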

📈 Technical Optimizations

[Figure: optimization results]

Performance Improvements Implemented

| Optimization | Before | After | Speedup |
|---|---|---|---|
| GPU Acceleration | 60 min | 6 min | 10x |
| Feature Caching | 2 hours | 5 min | 24x |
| Parallel Processing | 120 min | 20 min | 6x |
| Batch Processing | 90 min | 15 min | 6x |
| Total Training | 30+ hours | 5.6 hours | ~5x |

Key Optimizations

  • ✅ CUDA GPU acceleration for neural networks and image processing
  • ✅ Multi-core CPU parallelization (16 workers)
  • ✅ Smart feature caching (NumPy arrays)
  • ✅ Efficient image processing pipelines
  • ✅ Memory-mapped arrays for large datasets
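
Feature caching with NumPy arrays can be as simple as a load-or-compute wrapper. This is a minimal sketch under the assumption that features are stored as `.npy` files; `load_or_compute` and the file name are hypothetical and not taken from the repository.

```python
import os
import tempfile

import numpy as np


def load_or_compute(cache_path, compute_fn):
    """Return cached features if the file exists, else compute and cache them."""
    if os.path.exists(cache_path):
        # For very large arrays, np.load(cache_path, mmap_mode="r") would
        # memory-map the file instead of reading it fully into RAM.
        return np.load(cache_path)
    features = np.asarray(compute_fn())
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    np.save(cache_path, features)
    return features


calls = []


def expensive_extract():
    calls.append(1)  # track how often the slow path actually runs
    return np.arange(6, dtype=np.float32).reshape(2, 3)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features", "text_features.npy")
    first = load_or_compute(path, expensive_extract)
    second = load_or_compute(path, expensive_extract)

print(len(calls), np.array_equal(first, second))  # 1 True
```

The second call skips extraction entirely, which is the effect behind the feature-caching speedup reported in the table above.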

🎯 Model Details

Neural Network Architecture

Input: 2,263 features
  ↓
Dense(512) + ReLU + Dropout(0.3)
  ↓
Dense(256) + ReLU + Dropout(0.3)
  ↓
Dense(128) + ReLU + Dropout(0.3)
  ↓
Dense(64) + ReLU + Dropout(0.3)
  ↓
Output: 1 (price prediction)

Optimizer: AdamW (lr=0.001)
Scheduler: ReduceLROnPlateau
Early Stopping: 15 epochs patience
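
The layer stack above translates fairly directly into PyTorch. The sketch below follows the listed widths, dropout rate, optimizer, and scheduler; the scheduler's `patience`, the helper name `build_price_mlp`, and the absence of a training loop (including the early-stopping logic) are simplifications, not the repository's implementation.

```python
import torch
from torch import nn


def build_price_mlp(input_dim=2263, hidden_dims=(512, 256, 128, 64), dropout=0.3):
    """Fully connected regressor: (Linear -> ReLU -> Dropout) blocks, then a scalar head."""
    layers = []
    prev = input_dim
    for width in hidden_dims:
        layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout)]
        prev = width
    layers.append(nn.Linear(prev, 1))  # single output: predicted price
    return nn.Sequential(*layers)


model = build_price_mlp()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# patience=5 is an assumed value; the README does not state it.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=5)

model.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    out = model(torch.randn(8, 2263))

n_params = sum(p.numel() for p in model.parameters())
print(tuple(out.shape), n_params)  # (8, 1) 1331713
```

At roughly 1.3 million parameters, the first layer (2,263 × 512 weights) accounts for most of the model size.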

XGBoost Configuration

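xgb_params = {
    'n_estimators': 2000,       # number of boosting rounds
    'learning_rate': 0.03,      # shrinkage applied to each tree
    'max_depth': 8,             # maximum tree depth
    'min_child_weight': 3,      # minimum sum of instance weight in a child
    'subsample': 0.8,           # row sampling ratio per tree
    'colsample_bytree': 0.8,    # column sampling ratio per tree
    'reg_alpha': 0.1,           # L1 regularization
    'reg_lambda': 1.0           # L2 regularization
}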

🔍 Validation Strategy

Cross-Validation Setup

  • Method: Stratified K-Fold (5 folds)
  • Stratification: Price bins (10 bins)
  • OOF Predictions: Used for meta-learner training
  • Prevents: Data leakage, overfitting
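
Stratifying a regression target means binning it first and then balancing the bins across folds. The standard-library sketch below shows the idea with rank-based (quantile) bins; a real pipeline would more likely use a library splitter, and the function names here are hypothetical.

```python
import random
from collections import defaultdict


def price_bins(prices, n_bins=10):
    """Assign each price a quantile-style bin index in [0, n_bins) by rank."""
    order = sorted(range(len(prices)), key=lambda i: prices[i])
    bins = [0] * len(prices)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(prices)
    return bins


def stratified_folds(prices, n_folds=5, n_bins=10, seed=42):
    """Return a fold index per sample so every fold spans all price bins."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for idx, bin_id in enumerate(price_bins(prices, n_bins)):
        by_bin[bin_id].append(idx)
    folds = [0] * len(prices)
    for members in by_bin.values():
        rng.shuffle(members)
        # Deal the bin's members out to folds in round-robin fashion.
        for pos, idx in enumerate(members):
            folds[idx] = pos % n_folds
    return folds


prices = [float(p) for p in range(1, 101)]
folds = stratified_folds(prices)
fold_sizes = [folds.count(k) for k in range(5)]
print(fold_sizes)  # [20, 20, 20, 20, 20]
```

Without stratification, a fold could by chance contain few expensive products, making its validation SMAPE unrepresentative of the full price range.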

Data Preprocessing

  1. Outlier Removal: IQR method (removed 2,238 samples)
  2. Feature Scaling: RobustScaler (robust to outliers)
  3. Missing Data: Imputation strategies
  4. Image Handling: Graceful fallback for failed downloads
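
The IQR rule in step 1 can be sketched with the standard library. The conventional multiplier k = 1.5 is assumed here because the README does not state the value used, and the helper names are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(prices, k=1.5):
    """Keep only prices that fall inside the IQR fences."""
    low, high = iqr_bounds(prices, k)
    return [p for p in prices if low <= p <= high]


prices = [9.99, 12.5, 14.0, 15.25, 18.0, 19.99, 22.0, 24.5, 950.0]
kept = remove_outliers(prices)
print(len(prices) - len(kept))  # 1
```

Only training rows should be filtered this way; test rows must all receive a prediction regardless of how unusual they look.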

🚀 Future Improvements

See UPGRADE_SUGGESTIONS.md for detailed enhancement ideas:

  • 🔧 Hyperparameter tuning with Optuna (expected +0.5-1% improvement)
  • 🧠 Fine-tune transformers (BERT, RoBERTa)
  • 🖼️ Object detection for image analysis
  • 🎭 Multi-task learning (price + category)
  • 🌐 Model deployment with FastAPI

📚 Key Learnings

What Worked Well

  • ✅ Multi-modal approach significantly improved accuracy
  • ✅ Stacking ensemble reduced overfitting
  • ✅ Feature caching saved hours of computation
  • ✅ GPU acceleration was crucial for image processing

Challenges Overcome

  • Image download failures (handled with fallbacks)
  • Memory management for large feature matrices
  • Long training times (optimized with caching)
  • Index alignment after preprocessing

🏆 Competition Compliance

  • ✅ No external data used (only provided train/test data)
  • ✅ License compliance (all models MIT/Apache 2.0)
  • ✅ Model size <8B parameters
  • ✅ Positive prices enforced in predictions

📝 Citation

If you use this code or approach, please cite:

@misc{smartpricing2025,
  title={Smart Product Pricing - Multi-Modal ML Solution},
  author={Raktim Chandra},
  year={2025},
  publisher={GitHub},
  journal={Amazon ML Challenge 2025},
  url={https://github.com/RaktimChandra/SmartPricingChallenge}
}

📄 License

This project is open for educational and portfolio purposes.


🙏 Acknowledgments

  • Amazon for organizing the ML Challenge
  • HackerEarth for hosting the competition
  • Hugging Face for transformer models
  • PyTorch and scikit-learn communities

⭐ If you find this project helpful, please star it! ⭐

Built with ❤️ for the Amazon ML Challenge 2025
