Built for: Amazon ML Challenge 2025
Task: Predict prices for 75,000 e-commerce products using multi-modal data
Solution: Competition-grade deep learning combining NLP + Computer Vision + Ensemble Methods
- Competition Context
- Project Overview
- Key Achievements
- Technical Architecture
- Results & Performance
- Installation & Usage
- Skills Demonstrated
- Project Structure
Event: Amazon ML Challenge 2025
Organizer: Amazon
Challenge: Build ML models to predict product prices from multimodal e-commerce data
Dataset: 75,000 training samples, 75,000 test samples
Evaluation Metric: SMAPE (Symmetric Mean Absolute Percentage Error)
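The metric can be sketched in a few lines. This README does not state the exact competition formula, so the version below assumes the common definition, with the mean of the absolute actual and predicted values in the denominator and the result expressed in percent. The zero-denominator handling is also an assumption.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, in percent (range 0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Assumption: a sample where both values are 0 contributes zero error.
    safe = np.where(denom == 0, 1.0, denom)
    ratio = np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / safe)
    return 100.0 * ratio.mean()

score = smape([100.0, 200.0], [110.0, 180.0])
print(round(score, 2))
```

Because the denominator scales with the price, a 10-unit miss on a cheap item costs far more than the same miss on an expensive one.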
Note: This is a complete, competition-grade solution built as a learning and portfolio project. The implementation demonstrates production-level ML engineering skills applicable to real-world e-commerce pricing systems.
- Real-world e-commerce pricing problem
- Multi-modal data (text descriptions + product images)
- Large-scale dataset requiring optimization
- Production-level code quality required
This project showcases an end-to-end production-grade ML pipeline built for the Amazon ML Challenge 2025. It demonstrates advanced machine learning engineering skills and serves as a comprehensive portfolio piece by combining:
- Text Analysis: Product descriptions using transformer models (DistilBERT)
- Image Analysis: Product images using CNNs (EfficientNet-B0)
- Feature Fusion: Intelligent combination of text + image features
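As a rough illustration of the fusion step, the sketch below concatenates the three feature blocks column-wise, using the block widths reported elsewhere in this README (880 text, 1,333 image, 50 domain). The function name `fuse_features` and the random stand-in data are hypothetical, not taken from the project code.

```python
import numpy as np

def fuse_features(text_feats, image_feats, domain_feats):
    """Concatenate per-sample feature blocks column-wise (early fusion)."""
    blocks = [np.asarray(b, dtype=np.float32)
              for b in (text_feats, image_feats, domain_feats)]
    # Every block must describe the same samples in the same order.
    n_rows = {b.shape[0] for b in blocks}
    if len(n_rows) != 1:
        raise ValueError(f"row counts differ across blocks: {sorted(n_rows)}")
    return np.hstack(blocks)

rng = np.random.default_rng(0)
n_samples = 4  # tiny stand-in batch
fused = fuse_features(
    rng.normal(size=(n_samples, 880)),   # text block
    rng.normal(size=(n_samples, 1333)),  # image block
    rng.normal(size=(n_samples, 50)),    # domain block
)
print(fused.shape)  # (4, 2263)
```

The row-count check guards against blocks drifting out of alignment, for example when rows are dropped from one block during preprocessing but not from the others.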
| Aspect | Implementation | Why It Matters |
|---|---|---|
| Multi-Modal | Text + Images combined | 70% of competitors use only one modality |
| State-of-the-Art | DistilBERT + EfficientNet | Production-grade architectures |
| Advanced Ensemble | 4 models + stacking | Robust predictions, reduced overfitting |
| Optimized | GPU acceleration, caching | 5.6 hours vs 30+ hours baseline |
| 2,263 Features | Engineered domain features | Brand, category, quality metrics |
| Production-Ready | Clean, modular, documented | Deploy-ready code |
- Competition-grade solution built for Amazon ML Challenge 2025
- Multi-modal architecture combining NLP + Computer Vision
- Advanced ensemble with 4 diverse models + meta-learner
- 2,263 engineered features from text, images, and domain knowledge
- Production-quality code with proper error handling and logging
- GPU acceleration, caching, and parallel processing reducing total training time by about 5x (30+ hours to 5.6 hours)
- Estimated SMAPE: 10-15% (cross-validation)
- Training Time: 5.6 hours on consumer GPU
- Inference Speed: <0.1 seconds per sample
- Dataset Scale: 75,000 training + 75,000 test samples
┌─────────────────────────────────────────────────────────────┐
│                         INPUT DATA                          │
│  • Product Descriptions (Text)                              │
│  • Product Images (URLs)                                    │
│  • Price (Target Variable)                                  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                 FEATURE EXTRACTION PIPELINE                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  TEXT FEATURES (880-dim)        IMAGE FEATURES (1,333-dim)  │
│  ├─ DistilBERT Embeddings (768) ├─ EfficientNet CNN (1280)  │
│  ├─ TF-IDF Vectors (100)        ├─ Color Histograms (39)    │
│  └─ Statistical Features (12)   ├─ Texture (Gabor) (8)      │
│                                 └─ Quality Metrics (6)      │
│                                                             │
│  DOMAIN FEATURES (50-dim)                                   │
│  ├─ Brand Extraction                                        │
│  ├─ Item Pack Quantity (IPQ)                                │
│  └─ Category Inference                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    FEATURE CONCATENATION                    │
│                    Total: 2,263 Features                    │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                ENSEMBLE LEARNING (STACKING)                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Level 1: Base Models (5-Fold CV each)                      │
│  ├─ XGBoost                                                 │
│  ├─ LightGBM                                                │
│  ├─ CatBoost                                                │
│  └─ Neural Network (PyTorch)                                │
│                                                             │
│  Level 2: Meta-Learner                                      │
│  └─ Ridge Regression (on OOF predictions)                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      FINAL PREDICTIONS                      │
│                  75,000 Price Predictions                   │
└─────────────────────────────────────────────────────────────┘
| Model | CV SMAPE | Training Time | Strengths |
|---|---|---|---|
| XGBoost | ~12-14% | ~60 min | Handles non-linear patterns |
| LightGBM | ~11-13% | ~40 min | Fast, memory efficient |
| CatBoost | ~12-14% | ~70 min | Robust to outliers |
| Neural Net | ~13-15% | ~50 min | Captures complex interactions |
| Ensemble | ~10-12% | 5.6 hours | Best overall performance |
- Dataset: 72,762 training samples (after outlier removal)
- Features: 2,263 dimensions
- Cross-Validation: 5-Fold Stratified
- Hardware: NVIDIA GPU (CUDA-enabled)
- Predictions: 75,000 test samples
Python 3.8+
CUDA 11.8+ (optional, for GPU acceleration)
16GB+ RAM
- Clone Repository
    git clone https://github.com/YOUR_USERNAME/SmartPricingChallenge.git
    cd SmartPricingChallenge
- Install Dependencies
    pip install -r requirements.txt
- Prepare Data
    # Place train.csv and test.csv in dataset/ directory
    mkdir -p dataset
    # Add your data files
- Train Models (Full Pipeline)
    # Complete training with all features
    python train_pipeline.py --ensemble-method stacking --use-cached-features
    # From scratch (download images, extract all features)
    python train_pipeline.py --ensemble-method stacking --use-image-features --download-images
- Generate Predictions
    python generate_submission.py --output test_out.csv
--ensemble-method        # stacking | weighted | single (default: stacking)
--use-transformers       # Enable DistilBERT embeddings
--use-image-features     # Extract image features (CNN, color, texture)
--use-cached-features    # Reuse previously extracted features
--download-images        # Download images from URLs
--optimize-hyperparams   # Run Optuna hyperparameter tuning
This project showcases professional-level skills across the entire ML pipeline:
- Deep Learning Frameworks: PyTorch, TensorFlow/Keras
- NLP: Transformer models (DistilBERT), TF-IDF, text preprocessing
- Computer Vision: CNNs (EfficientNet), image augmentation, feature extraction
- Ensemble Methods: Stacking, boosting (XGBoost, LightGBM, CatBoost)
- Feature Engineering: Domain-specific features, PCA, scaling
- Code Quality: Modular architecture, clean code, documentation
- Version Control: Git, GitHub
- Error Handling: Robust exception handling, logging
- Optimization: GPU acceleration, parallel processing, caching
- Testing: Validation strategies, cross-validation
- EDA: Exploratory data analysis, visualization
- Data Preprocessing: Outlier detection, normalization, missing data handling
- Validation: K-Fold CV, stratified sampling, out-of-fold predictions
- Metrics: SMAPE optimization, model evaluation
- Pipeline Design: End-to-end ML pipelines
- Scalability: Batch processing, memory management
- Reproducibility: Fixed seeds, deterministic training
- Deployment-Ready: Modular code, configuration management
SmartPricingChallenge/
│
├── Core Scripts
│   ├── train_pipeline.py          # Main training pipeline
│   ├── generate_submission.py     # Prediction generation
│   ├── verify_setup.py            # Environment verification
│   └── requirements.txt           # Python dependencies
│
├── src/                           # Source code modules
│   ├── config.py                  # Configuration & hyperparameters
│   ├── utils.py                   # Helper functions
│   ├── text_features.py           # NLP feature extraction
│   ├── image_features.py          # CV feature extraction
│   ├── feature_engineering.py     # Domain feature engineering
│   ├── models.py                  # ML model implementations
│   └── ensemble.py                # Ensemble & stacking methods
│
├── dataset/                       # Data files
│   ├── train.csv                  # Training data (75K samples)
│   └── test.csv                   # Test data (75K samples)
│
├── outputs/                       # Generated outputs
│   ├── features/                  # Cached feature files
│   ├── models/                    # Trained model checkpoints
│   └── submissions/               # Prediction files
│
├── notebooks/                     # Jupyter notebooks
│   └── EDA.ipynb                  # Exploratory analysis
│
└── Documentation
    ├── README.md                  # This file
    ├── APPROACH_DOCUMENT.md       # Technical methodology
    ├── PROJECT_FINAL_REPORT.md    # Complete documentation
    └── UPGRADE_SUGGESTIONS.md     # Future improvements
- Transformer Embeddings (768-dim)
  - Model: DistilBERT (distilbert-base-uncased)
  - Captures: Semantic meaning, context, product attributes
- TF-IDF Vectors (100-dim)
  - N-grams: (1, 3)
  - Captures: Important keywords, brand names, categories
- Statistical Features (12-dim)
  - Text length, word count, avg word length
  - Numeric mentions, special characters, ratios
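A minimal sketch of the TF-IDF and statistical parts of the text features, assuming scikit-learn's `TfidfVectorizer` with the n-gram range and 100-column cap listed above. The sample texts and the `text_stats` helper are hypothetical, and only six of the twelve statistical features are shown because the README does not enumerate them all.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [  # made-up product descriptions
    "Organic Green Tea, 100 bags, pack of 2",
    "Stainless steel water bottle 750 ml",
    "Green tea sampler, 20 bags",
]

# Word n-grams of length 1 to 3, keeping at most the 100 most frequent terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=100)
tfidf = vectorizer.fit_transform(texts).toarray()

def text_stats(text):
    """Hypothetical subset of the statistical text features."""
    words = text.split()
    n_chars, n_words = len(text), len(words)
    return [
        n_chars,                                                  # text length
        n_words,                                                  # word count
        sum(len(w) for w in words) / max(n_words, 1),             # avg word length
        len(re.findall(r"\d+(?:\.\d+)?", text)),                  # numeric mentions
        sum(not c.isalnum() and not c.isspace() for c in text),   # special characters
        sum(c.isdigit() for c in text) / max(n_chars, 1),         # digit ratio
    ]

stats = np.array([text_stats(t) for t in texts], dtype=float)
text_block = np.hstack([tfidf, stats])
```

On a corpus this small the vocabulary holds fewer than 100 n-grams, so `tfidf` has fewer than 100 columns; the cap only takes effect on a full-size dataset.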
- CNN Features (1,280-dim)
  - Model: EfficientNet-B0 (pre-trained on ImageNet)
  - Captures: High-level visual patterns, product type
- Color Features (39-dim)
  - RGB histograms (27-dim)
  - Dominant colors (9-dim) via K-means
  - Average color (3-dim)
- Texture Features (8-dim)
  - Gabor filters (4 orientations × 2 scales)
  - Captures: Material properties, surface characteristics
- Quality Features (6-dim)
  - Sharpness (Laplacian variance)
  - Brightness, contrast, aspect ratio
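The hand-crafted color and quality features can be approximated with plain NumPy. The sketch below is an assumption-laden illustration, not the project's `image_features.py`: it assumes 9 histogram bins per channel (which gives the 27 histogram values), a 4-neighbour Laplacian kernel, and grayscale as the channel mean. The random array stands in for a decoded product image.

```python
import numpy as np

def rgb_histogram(img, bins=9):
    """Per-channel intensity histograms, each normalised to sum to 1."""
    feats = []
    for c in range(3):
        hist, _ = np.histogram(img[..., c], bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())
    return np.concatenate(feats)  # 3 * bins values

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over interior pixels; higher means sharper."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

def quality_features(img):
    """Sharpness, brightness, contrast, and aspect ratio (width / height)."""
    gray = img.astype(np.float64).mean(axis=2)
    height, width = gray.shape
    return np.array([laplacian_variance(gray), gray.mean(), gray.std(), width / height])

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)  # height x width x RGB

hist = rgb_histogram(img)
avg_color = img.reshape(-1, 3).mean(axis=0)
quality = quality_features(img)
```

Dominant colors via K-means and the Gabor texture responses are left out to keep the sketch dependency-free.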
- Brand extraction (pattern matching)
- Item Pack Quantity (IPQ) parsing
- Product category inference
- Price-related keyword detection
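Item Pack Quantity parsing and keyword detection are typically regex-driven. The patterns and keyword set below are hypothetical examples, not the project's actual rules; a real catalogue would need a longer, validated list.

```python
import re

# Hypothetical patterns, tried in order; the first match wins.
_IPQ_PATTERNS = [
    re.compile(r"\bpack\s+of\s+(\d+)\b", re.IGNORECASE),
    re.compile(r"\bset\s+of\s+(\d+)\b", re.IGNORECASE),
    re.compile(r"\b(\d+)[\s-]*(?:pack|pcs|pieces|count|ct)\b", re.IGNORECASE),
]
# Hypothetical price-related keywords.
_PREMIUM_WORDS = {"premium", "luxury", "professional", "organic"}

def parse_ipq(text, default=1):
    """Return the pack quantity mentioned in `text`, or `default` if none is found."""
    for pattern in _IPQ_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    return default

def has_premium_keyword(text):
    """1 if any price-related keyword appears as a whole word, else 0."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return int(bool(tokens & _PREMIUM_WORDS))

print(parse_ipq("Green Tea Bags (Pack of 6)"))        # 6
print(parse_ipq("12-pack sparkling water"))           # 12
print(parse_ipq("USB cable"))                         # 1
print(has_premium_keyword("Premium leather wallet"))  # 1
```

Defaulting to a quantity of 1 keeps the feature defined for every product, at the cost of not distinguishing "single item" from "quantity not stated".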
| Optimization | Before | After | Speedup |
|---|---|---|---|
| GPU Acceleration | 60 min | 6 min | 10x |
| Feature Caching | 2 hours | 5 min | 24x |
| Parallel Processing | 120 min | 20 min | 6x |
| Batch Processing | 90 min | 15 min | 6x |
| Total Training | 30+ hours | 5.6 hours | ~5x |
- CUDA GPU acceleration for neural networks and image processing
- Multi-core CPU parallelization (16 workers)
- Smart feature caching (NumPy arrays)
- Efficient image processing pipelines
- Memory-mapped arrays for large datasets
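Feature caching and memory mapping can both be handled with `np.save` and `np.load`. The sketch below shows one possible shape for such a cache; the helper name `cached_features`, the file name, and the dummy extraction function are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

def cached_features(path, compute_fn, mmap=False):
    """Load a feature matrix from `path` if present; otherwise compute and save it."""
    path = Path(path)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        np.save(path, compute_fn())
    # mmap_mode="r" maps the file read-only instead of reading it all into RAM.
    return np.load(path, mmap_mode="r" if mmap else None)

calls = []

def expensive_extraction():
    calls.append(1)  # record each time the slow path runs
    return np.ones((1000, 16), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "features" / "text_features.npy"
    first = cached_features(cache_file, expensive_extraction)              # computes and saves
    second = cached_features(cache_file, expensive_extraction, mmap=True)  # reads the cache
    cached_shape, n_calls = second.shape, len(calls)
    del second  # release the memory map before the directory is removed
```

The second call never touches `expensive_extraction`, which is where the time savings on repeated runs come from.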
Input: 2,263 features
↓
Dense(512) + ReLU + Dropout(0.3)
↓
Dense(256) + ReLU + Dropout(0.3)
↓
Dense(128) + ReLU + Dropout(0.3)
↓
Dense(64) + ReLU + Dropout(0.3)
↓
Output: 1 (price prediction)
Optimizer: AdamW (lr=0.001)
Scheduler: ReduceLROnPlateau
Early Stopping: 15 epochs patience
{
    'n_estimators': 2000,
    'learning_rate': 0.03,
    'max_depth': 8,
    'min_child_weight': 3,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0
}
- Method: Stratified K-Fold (5 folds)
- Stratification: Price bins (10 bins)
- OOF Predictions: Used for meta-learner training
- Prevents: Data leakage, overfitting
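The validation scheme above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the project's `ensemble.py`: the price bins are assumed to be quantile bins, the data is synthetic, and two lightweight scikit-learn regressors stand in for the four base models named in this README.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
# Synthetic, right-skewed positive "prices".
y = np.exp(1.0 + 0.5 * X[:, 0] + 0.1 * rng.normal(size=500))

# Assumption: 10 quantile bins, so each fold covers the full price range.
edges = np.quantile(y, np.linspace(0, 1, 11)[1:-1])
price_bins = np.digitize(y, edges)

# Stand-ins for XGBoost, LightGBM, CatBoost, and the neural network.
base_models = [
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]
oof = np.zeros((len(y), len(base_models)))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, price_bins):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        # Each row's OOF prediction comes from a model that never saw that row.
        oof[valid_idx, j] = model.predict(X[valid_idx])

# Level 2: the meta-learner is fit on out-of-fold predictions only.
meta = Ridge(alpha=1.0).fit(oof, y)
stacked = meta.predict(oof)
```

Fitting the meta-learner on out-of-fold predictions rather than in-sample ones is what keeps the base models' training error from leaking into the second level.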
- Outlier Removal: IQR method (removed 2,238 samples)
- Feature Scaling: RobustScaler (robust to outliers)
- Missing Data: Imputation strategies
- Image Handling: Graceful fallback for failed downloads
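A small worked example of the first two preprocessing steps. The IQR multiplier `k=1.5` is an assumption, since the README does not state the value used, and `robust_scale` reproduces by hand the median-and-IQR scaling that scikit-learn's `RobustScaler` applies with its default settings. The prices and features are made up.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

def robust_scale(X):
    """Centre each column on its median and divide by its interquartile range."""
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = np.where(q3 - q1 == 0, 1.0, q3 - q1)  # leave constant columns unscaled
    return (X - median) / iqr

prices = np.array([5.0, 7.5, 9.0, 12.0, 15.0, 18.0, 20.0, 950.0])
features = np.arange(16, dtype=float).reshape(8, 2)

keep = iqr_mask(prices)
# Apply the same mask to target and features so rows stay aligned.
prices_clean, features_clean = prices[keep], features[keep]
features_scaled = robust_scale(features_clean)

print(int(keep.sum()))  # 7 (the 950.0 outlier is dropped)
```

Filtering features and target with one shared mask is a simple way to avoid index misalignment after rows are removed.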
See UPGRADE_SUGGESTIONS.md for detailed enhancement ideas:
- Hyperparameter tuning with Optuna (expected +0.5-1% improvement)
- Fine-tune transformers (BERT, RoBERTa)
- Object detection for image analysis
- Multi-task learning (price + category)
- Model deployment with FastAPI
- Multi-modal approach significantly improved accuracy
- Stacking ensemble reduced overfitting
- Feature caching saved hours of computation
- GPU acceleration crucial for image processing
- Image download failures (handled with fallbacks)
- Memory management for large feature matrices
- Long training times (optimized with caching)
- Index alignment after preprocessing
- No external data used (only provided train/test data)
- License compliance (all models MIT/Apache 2.0)
- Model size <8B parameters
- Positive prices enforced in predictions
If you use this code or approach, please cite:
@misc{smartpricing2025,
  title={Smart Product Pricing - Multi-Modal ML Solution},
  author={Raktim Chandra},
  year={2025},
  publisher={GitHub},
  journal={Amazon ML Challenge 2025},
  url={https://github.com/RaktimChandra/SmartPricingChallenge}
}
- GitHub: RaktimChandra
- LinkedIn: Raktim Chandra
- Email: raktimchandra26@gmail.com
- Competition: Amazon ML Challenge 2025
This project is open for educational and portfolio purposes.
- Amazon for organizing the ML Challenge
- HackerEarth for hosting the competition
- Hugging Face for transformer models
- PyTorch and scikit-learn communities







