The goal of this project was to predict used car prices from tabular vehicle listing data. I aimed to build a robust machine learning pipeline that could handle challenges common in real-world datasets, such as missing values, high-cardinality categorical features, and diverse vehicle specifications.
This work was submitted for the DSN AI Bootcamp Qualification Hackathon, which focused on practical predictive modeling tasks.
- Source: Hackathon-provided data
- Size:
  - 188,533 training entries (13 columns, including the target `price`)
  - 125,690 test entries (12 columns, no target)
- Key Features:
  - Numeric: `horsepower`, `mileage`, `model_year`, `num_speeds`
  - Categorical: `brand`, `base_model`, `transmission_type`, `body_style`, `engine`
- Target Variable: `price` (continuous numeric)
The dataset had both numeric and categorical columns, some missing values, and high-cardinality features, which required careful preprocessing.
The focus here was on keeping the data accurate while preparing it for modeling.
- `clean_title` (11.3% missing) → imputed using accident history correlations
- `fuel_type` (2.7% missing) → imputed from engine specifications
- `accident` (1.3% missing) → imputed appropriately (a group-based imputation sketch follows this list)
- Standardized case and formatting inconsistencies
- Identified rare categories (<0.5% frequency) and train-test mismatches
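A minimal sketch of the group-based imputation idea, assuming pandas and the column names above (`impute_by_group` is an illustrative helper, not a library function):

```python
import pandas as pd

def impute_by_group(df: pd.DataFrame, target: str, group_cols: list) -> pd.Series:
    """Fill missing values in `target` with the most frequent value
    observed within each group (e.g., clean_title within accident status)."""
    group_mode = df.groupby(group_cols)[target].transform(
        lambda s: s.mode().iloc[0] if not s.mode().empty else pd.NA
    )
    return df[target].fillna(group_mode)

df["clean_title"] = impute_by_group(df, "clean_title", ["accident"])
df["fuel_type"] = impute_by_group(df, "fuel_type", ["engine"])
```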
- Parsed engine specifications into horsepower, displacement, and cylinder count
- Normalized mileage and derived vehicle age from `model_year`
- Created binary flags: `has_accident`, `has_clean_title`, `is_luxury_brand`
- Log-transformed `price` and `mileage` to reduce skewness
- Imputed missing categorical values using combinations of `(brand, model, model_year)` and added null indicators for numeric features (see the feature engineering sketch below)
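A condensed sketch of these steps, assuming pandas and an engine string format like `'252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel'`; the regexes, `REFERENCE_YEAR`, the luxury-brand set, and the category labels checked in the flags are all assumptions for illustration:

```python
import re
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2024  # assumed listing snapshot year
LUXURY_BRANDS = {"BMW", "Mercedes-Benz", "Audi", "Porsche", "Lexus"}  # illustrative subset

def parse_engine(spec):
    """Extract horsepower, displacement (litres), and cylinder count from a
    raw engine string such as '252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel'."""
    spec = str(spec)
    hp = re.search(r"([\d.]+)\s*HP", spec, re.I)
    disp = re.search(r"([\d.]+)\s*L\b", spec, re.I)
    cyl = re.search(r"(\d+)\s*Cylinder", spec, re.I)
    return (
        float(hp.group(1)) if hp else np.nan,
        float(disp.group(1)) if disp else np.nan,
        float(cyl.group(1)) if cyl else np.nan,
    )

df[["horsepower", "displacement", "num_cylinders"]] = df["engine"].apply(parse_engine).tolist()
df["car_age"] = REFERENCE_YEAR - df["model_year"]
df["log_mileage"] = np.log1p(df["mileage"])  # tame right skew
df["log_price"] = np.log1p(df["price"])      # modeling target
df["has_accident"] = (df["accident"] != "None reported").astype(int)  # assumed label for accident-free rows
df["has_clean_title"] = (df["clean_title"] == "Yes").astype(int)      # assumed label
df["is_luxury_brand"] = df["brand"].isin(LUXURY_BRANDS).astype(int)
```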
- `brand`: 57 unique values
- `base_model`: 542 unique values
- `engine`: 1,117 unique specifications → parsed into structured features
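One common way to tame such high-cardinality columns, tying in with the rare-category threshold identified during cleaning, is to collapse levels seen in fewer than 0.5% of training rows into a shared bucket so train and test levels stay aligned. A sketch, assuming pandas DataFrames `df_train`/`df_test` (`bucket_rare` is an illustrative helper):

```python
def bucket_rare(train_col, test_col, min_frac=0.005):
    """Replace categories below the frequency threshold with a shared 'rare' label."""
    freqs = train_col.value_counts(normalize=True)
    keep = set(freqs[freqs >= min_frac].index)
    return (
        train_col.where(train_col.isin(keep), "rare"),
        test_col.where(test_col.isin(keep), "rare"),
    )

df_train["base_model"], df_test["base_model"] = bucket_rare(
    df_train["base_model"], df_test["base_model"]
)
```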
EDA helped guide feature engineering and model decisions.
Numeric Features:
- `horsepower` positively correlated with price (0.25)
- `mileage`, `log_mileage`, and `car_age` negatively correlated with price
- Observed skewed distributions and outliers, which motivated log transformations
Categorical Features:
- `brand` and `material` were top predictors by variance, though `material` was mostly uninformative and dropped
- High-cardinality features like `base_model` showed long-tail distributions
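The numeric correlations above can be reproduced in one line, assuming the engineered DataFrame from the preprocessing step:

```python
num_cols = ["horsepower", "mileage", "log_mileage", "car_age"]
print(df[num_cols].corrwith(df["price"]))  # Pearson correlation of each feature with price
```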
I focused on gradient boosting algorithms, which perform well on tabular datasets.
Algorithms Tested:
- LightGBM (LGBMRegressor): fast and effective on large datasets
- CatBoost (CatBoostRegressor): handles categorical features robustly
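A minimal example of CatBoost's native categorical handling, assuming a feature matrix `X_train` and log-transformed target `y_log_train`; the column list and hyperparameters shown are placeholders:

```python
from catboost import CatBoostRegressor, Pool

cat_cols = ["brand", "base_model", "transmission_type", "body_style", "fuel_type"]
train_pool = Pool(X_train, y_log_train, cat_features=cat_cols)

model = CatBoostRegressor(iterations=2000, learning_rate=0.05,
                          loss_function="RMSE", verbose=500)
model.fit(train_pool)  # no manual encoding needed for cat_cols
```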
Hyperparameter Tuning & Cross-Validation:
- Used Optuna for automated hyperparameter search
- 5-fold stratified CV on binned log-transformed prices ensured stable performance
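A condensed sketch of this tuning loop, assuming a feature matrix `X` (DataFrame) and log-transformed target `y_log` (Series); the search space shown is illustrative, not the exact one used:

```python
import numpy as np
import optuna
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

# Stratify on quantile bins of the log price so every fold sees the full price range.
bins = np.digitize(y_log, np.quantile(y_log, np.linspace(0, 1, 11)[1:-1]))

def objective(trial):
    params = {
        "n_estimators": 2000,
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 31, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 100),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for tr_idx, va_idx in cv.split(X, bins):
        model = lgb.LGBMRegressor(**params)
        model.fit(
            X.iloc[tr_idx], y_log.iloc[tr_idx],
            eval_set=[(X.iloc[va_idx], y_log.iloc[va_idx])],
            callbacks=[lgb.early_stopping(100, verbose=False)],
        )
        preds = model.predict(X.iloc[va_idx])
        scores.append(mean_squared_error(y_log.iloc[va_idx], preds) ** 0.5)  # RMSE in log space
    return float(np.mean(scores))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
```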
Ensembling:
- Final predictions were generated by stacking LightGBM, CatBoost, and a Ridge baseline using a Ridge meta-model
- The meta-model learned optimal weights for each base model, improving generalization
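A minimal stacking sketch using scikit-learn's `StackingRegressor`, assuming numerically encoded features in `X_train`/`X_test` and a log1p-transformed target; the hyperparameters are placeholders:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

stack = StackingRegressor(
    estimators=[
        ("lgbm", LGBMRegressor(n_estimators=1000, random_state=42)),
        ("cat", CatBoostRegressor(iterations=1000, verbose=0, random_state=42)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-model learns the blend weights
    cv=5,  # out-of-fold base predictions feed the meta-model
)
stack.fit(X_train, y_log_train)
price_pred = np.expm1(stack.predict(X_test))  # invert log1p back to dollars
```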
- LightGBM CV RMSE: 0.4852 – 0.4927
- CatBoost CV RMSE: 0.4859 – 0.4939
- Stacked ensemble mean predicted price: $43,401
The ensemble consistently reduced error compared to individual models.
- The pipeline effectively handled missing values, skewed distributions, and high-cardinality features
- Key insights: `horsepower`, `brand`, and vehicle age are strong predictors
- Future improvements could include:
  - Adding external features (e.g., market trends, location)
  - More advanced encoding for rare categories
  - Automated feature selection or feature importance-guided pruning