This repository contains the solution for the ML Challenge 2025, a competition focused on predicting e-commerce product prices from their catalog descriptions and images. The solution employs a powerful and efficient two-stage, multimodal deep learning approach to tackle the problem.
- Best Validation SMAPE Score: 73.9450%
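For reference, one common formulation of SMAPE is sketched below. This is an illustrative helper, not necessarily the exact scorer used by the competition.

```python
import numpy as np

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric Mean Absolute Percentage Error, expressed as a percentage (0-200)."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Guard against division by zero when both true and predicted prices are 0.
    denom = np.where(denom == 0, 1.0, denom)
    return float(np.mean(np.abs(y_pred - y_true) / denom) * 100)
```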
To handle the large dataset and the complexity of training multimodal models within a hackathon timeline, a Two-Stage Pre-computation Strategy was implemented. This approach decouples the slow feature extraction from the fast model training, allowing for rapid iteration.
The first stage involves using large, pre-trained deep learning models as feature extractors. These models are run only once to process the entire dataset and save the resulting high-dimensional feature vectors (embeddings).
- Text Feature Extraction: A pre-trained `distilbert-base-uncased` model from the Hugging Face `transformers` library was used to convert product descriptions (`catalog_content`) into 768-dimension text embeddings.
- Image Feature Extraction: A pre-trained `efficientnet_b0` model from the `timm` library was used to convert product images into 1280-dimension image embeddings.
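The snippet below sketches how these two extractors can be loaded and applied to a single batch. The model names match those listed above, but the helper names, the mean-pooling of token embeddings, and the input sizes are illustrative assumptions rather than excerpts from the notebook.

```python
import torch
import timm
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text encoder: DistilBERT yields a 768-dim vector per token; mean-pool over real tokens.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text_model = AutoModel.from_pretrained("distilbert-base-uncased").to(device).eval()

# Image encoder: EfficientNet-B0 with the classifier removed yields 1280-dim pooled features.
image_model = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0).to(device).eval()

@torch.no_grad()
def embed_text(texts):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=256,
                    return_tensors="pt").to(device)
    hidden = text_model(**enc).last_hidden_state            # (B, T, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()       # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)              # mean over real tokens -> (B, 768)

@torch.no_grad()
def embed_images(pixel_batch):
    # Expects an already-preprocessed batch of shape (B, 3, 224, 224).
    return image_model(pixel_batch.to(device))               # (B, 1280)
```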
This process was executed in a memory-efficient manner by processing data in batches and saving each batch's embeddings directly to disk, preventing RAM crashes in the Colab environment.
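The batch-and-save pattern might look roughly like the following, reusing the `embed_text` helper from the sketch above. The batch size, the `.npy` format, and the file naming are assumptions; the actual notebook may differ.

```python
import os
import numpy as np

def export_embeddings(texts, out_dir, batch_size=256):
    """Embed `texts` in chunks and write each chunk straight to disk,
    so only one batch of embeddings ever lives in RAM."""
    os.makedirs(out_dir, exist_ok=True)
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        emb = embed_text(batch).cpu().numpy()                # (batch, 768)
        np.save(os.path.join(out_dir, f"batch_{start // batch_size:05d}.npy"), emb)

# e.g. export_embeddings(train_df["catalog_content"].tolist(),
#                        "embeddings_batched/train_text")
```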
The second stage involves training a small, fast neural network on the pre-computed embeddings.
- Input: The text and image embeddings from Stage 1 are concatenated to form a single 2048-dimension feature vector for each product.
- Model: A simple feed-forward neural network (Regression Head) with two hidden layers, `BatchNorm`, and `Dropout` was trained to map these features to the final price prediction.
- Speed: This training process is incredibly fast, completing in just a few minutes on a GPU, which allows for extensive experimentation with hyperparameters like learning rate and model architecture.
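A minimal sketch of such a regression head is shown below. The hidden-layer sizes and dropout rate are illustrative assumptions; only the overall shape (2048-dim input, two hidden layers with `BatchNorm` and `Dropout`, scalar output) follows the description above.

```python
import torch.nn as nn

class RegressionHead(nn.Module):
    """Maps a concatenated 2048-dim (768 text + 1280 image) embedding to a price."""
    def __init__(self, in_dim=2048, hidden1=512, hidden2=128, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden1),
            nn.BatchNorm1d(hidden1),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden1, hidden2),
            nn.BatchNorm1d(hidden2),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden2, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)
```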
```
.
├── ML_Challenge_2025/
│ ├── dataset/
│ │ ├── train.csv # Training data
│ │ ├── test.csv # Test data
│ │ └── images/ # Downloaded product images
│ │
│ ├── embeddings_batched/
│ │ ├── train_text/ # Saved text embeddings for training set
│ │ ├── train_image/ # Saved image embeddings for training set
│ │ ├── test_text/ # Saved text embeddings for test set
│ │ └── test_image/ # Saved image embeddings for test set
│ │
│ ├── ML_Challange.ipynb # Main Colab notebook with all code
│ ├── fast_regression_model.pth # Saved weights of the trained model
│ └── test_out.csv # Final submission file
│
└── README.md # You are here
```
This project was developed in Google Colab using a GPU runtime.
- Setup Google Drive:
  - Create a folder named `ML_Challenge_2025` in your Google Drive.
  - Inside it, create a `dataset` folder and upload `train.csv` and `test.csv`.
  - Run an image downloader script to populate the `dataset/images/` folder (a sketch of such a script follows this step).
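The repository does not pin down the downloader itself; a minimal sketch is shown below. It assumes the CSVs expose an image URL column and an ID column (hypothetically named `image_link` and `sample_id` here); adjust the column names to the real schema.

```python
import os
import requests
import pandas as pd

def download_images(csv_path, out_dir="dataset/images",
                    url_col="image_link", id_col="sample_id"):
    """Fetch each product image and save it as dataset/images/<id>.jpg.
    The column names are assumptions about the CSV schema."""
    os.makedirs(out_dir, exist_ok=True)
    df = pd.read_csv(csv_path)
    for _, row in df.iterrows():
        dest = os.path.join(out_dir, f"{row[id_col]}.jpg")
        if os.path.exists(dest):
            continue  # skip images that are already downloaded
        try:
            resp = requests.get(row[url_col], timeout=10)
            resp.raise_for_status()
            with open(dest, "wb") as f:
                f.write(resp.content)
        except requests.RequestException:
            pass  # skip images that fail to download
```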
- Part 1 - Generate Embeddings:
  - Open the `ML_Challange.ipynb` notebook in Google Colab and set the runtime to GPU.
  - Run the "Part 1" code cells. This will process all text and images and save the embeddings into the `embeddings_batched` folder in your Drive. (Note: this is the slow part.)
- Part 2 - Train and Predict:
  - Once Part 1 is complete, run the "Part 2" code cells in the same notebook.
  - This will load the saved embeddings, train the fast regression model, and generate the final `test_out.csv` file in your project directory (see the sketch after these steps).
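In spirit, Part 2 stitches the saved batches back together, scores the test set, and writes the submission. The sketch below illustrates that flow; the loading pattern, file paths, and output column name are assumptions rather than excerpts from the notebook, and it reuses the `RegressionHead` sketch from earlier.

```python
import glob
import numpy as np
import pandas as pd
import torch

def load_batched(dir_path):
    """Concatenate the per-batch .npy files saved in Part 1, in batch order."""
    files = sorted(glob.glob(f"{dir_path}/*.npy"))
    return np.concatenate([np.load(f) for f in files], axis=0)

# Build the 2048-dim test features: 768-dim text + 1280-dim image embeddings.
X_test = np.hstack([load_batched("embeddings_batched/test_text"),
                    load_batched("embeddings_batched/test_image")])

model = RegressionHead()  # defined in the Stage 2 sketch above
model.load_state_dict(torch.load("fast_regression_model.pth", map_location="cpu"))
model.eval()

with torch.no_grad():
    preds = model(torch.from_numpy(X_test).float()).numpy()

# Hypothetical submission format: one predicted price per test row.
pd.DataFrame({"price": preds}).to_csv("test_out.csv", index=False)
```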
- PyTorch: Core deep learning framework.
- Transformers (Hugging Face): For loading the DistilBERT text model.
- timm (PyTorch Image Models): For loading the EfficientNet-B0 image model.
- Pandas: For data manipulation.
- Scikit-learn: For data splitting.
- Pillow: For image processing.