The goal of this project was to predict used car prices from tabular vehicle listing data. I aimed to build a robust machine learning pipeline that could handle challenges common in real-world datasets, such as missing values, high-cardinality categorical features, and diverse vehicle specifications.
This work was submitted for the DSN AI Bootcamp Qualification Hackathon, which focused on practical predictive modeling tasks.
- Source: Hackathon-provided data
- Size:
  - 188,533 training entries (13 columns, including the target `price`)
  - 125,690 test entries (12 columns, no target)
- Key Features:
  - Numeric: `horsepower`, `mileage`, `model_year`, `num_speeds`
  - Categorical: `brand`, `base_model`, `transmission_type`, `body_style`, `engine`
- Target Variable: `price` (continuous numeric)
The dataset had both numeric and categorical columns, some missing values, and high-cardinality features, which required careful preprocessing.
The focus here was on keeping the data accurate while preparing it for modeling.
- `clean_title` (11.3% missing) → imputed using accident history correlations
- `fuel_type` (2.7% missing) → imputed from engine specifications
- `accident` (1.3% missing) → imputed appropriately (a group-based imputation sketch follows this list)
- Standardized case and formatting inconsistencies
- Identified rare categories (<0.5% frequency) and train-test mismatches
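A minimal sketch of the group-based imputation idea, assuming pandas and the column names above (`impute_by_group` is an illustrative helper, not a library function):

```python
import pandas as pd

def impute_by_group(df: pd.DataFrame, target: str, group_cols: list) -> pd.Series:
    """Fill missing values in `target` with the most frequent value
    observed within each group (e.g., clean_title within accident status)."""
    group_mode = df.groupby(group_cols)[target].transform(
        lambda s: s.mode().iloc[0] if not s.mode().empty else pd.NA
    )
    return df[target].fillna(group_mode)

df["clean_title"] = impute_by_group(df, "clean_title", ["accident"])
df["fuel_type"] = impute_by_group(df, "fuel_type", ["engine"])
```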
- Parsed engine specifications into horsepower, displacement, and cylinder count
- Normalized mileage and derived vehicle age from `model_year`
- Created binary flags: `has_accident`, `has_clean_title`, `is_luxury_brand`
- Log-transformed `price` and `mileage` to reduce skewness
- Imputed missing categorical values using combinations of `(brand, model, model_year)` and added null indicators for numeric features (see the feature engineering sketch below)
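A condensed sketch of these steps, assuming pandas and an engine string format like `'252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel'`; the regexes, `REFERENCE_YEAR`, the luxury-brand set, and the category labels checked in the flags are all assumptions for illustration:

```python
import re
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2024  # assumed listing snapshot year
LUXURY_BRANDS = {"BMW", "Mercedes-Benz", "Audi", "Porsche", "Lexus"}  # illustrative subset

def parse_engine(spec):
    """Extract horsepower, displacement (litres), and cylinder count from a
    raw engine string such as '252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel'."""
    spec = str(spec)
    hp = re.search(r"([\d.]+)\s*HP", spec, re.I)
    disp = re.search(r"([\d.]+)\s*L\b", spec, re.I)
    cyl = re.search(r"(\d+)\s*Cylinder", spec, re.I)
    return (
        float(hp.group(1)) if hp else np.nan,
        float(disp.group(1)) if disp else np.nan,
        float(cyl.group(1)) if cyl else np.nan,
    )

df[["horsepower", "displacement", "num_cylinders"]] = df["engine"].apply(parse_engine).tolist()
df["car_age"] = REFERENCE_YEAR - df["model_year"]
df["log_mileage"] = np.log1p(df["mileage"])  # tame right skew
df["log_price"] = np.log1p(df["price"])      # modeling target
df["has_accident"] = (df["accident"] != "None reported").astype(int)  # assumed label for accident-free rows
df["has_clean_title"] = (df["clean_title"] == "Yes").astype(int)      # assumed label
df["is_luxury_brand"] = df["brand"].isin(LUXURY_BRANDS).astype(int)
```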
- `brand`: 57 unique values
- `base_model`: 542 unique values
- `engine`: 1,117 unique specifications → parsed into structured features
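One common way to tame such high-cardinality columns, tying in with the rare-category threshold identified during cleaning, is to collapse levels seen in fewer than 0.5% of training rows into a shared bucket so train and test levels stay aligned. A sketch, assuming pandas DataFrames `df_train`/`df_test` (`bucket_rare` is an illustrative helper):

```python
def bucket_rare(train_col, test_col, min_frac=0.005):
    """Replace categories below the frequency threshold with a shared 'rare' label."""
    freqs = train_col.value_counts(normalize=True)
    keep = set(freqs[freqs >= min_frac].index)
    return (
        train_col.where(train_col.isin(keep), "rare"),
        test_col.where(test_col.isin(keep), "rare"),
    )

df_train["base_model"], df_test["base_model"] = bucket_rare(
    df_train["base_model"], df_test["base_model"]
)
```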
EDA helped guide feature engineering and model decisions.
Numeric Features:
- `horsepower` positively correlated with price (0.25)
- `mileage`, `log_mileage`, and `car_age` negatively correlated with price
- Observed skewed distributions and outliers, which motivated log transformations
Categorical Features:
- `brand` and `material` were top predictors by variance, though `material` was mostly uninformative and dropped
- High-cardinality features like `base_model` showed long-tail distributions
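The numeric correlations above can be reproduced in one line, assuming the engineered DataFrame from the preprocessing step:

```python
num_cols = ["horsepower", "mileage", "log_mileage", "car_age"]
print(df[num_cols].corrwith(df["price"]))  # Pearson correlation of each feature with price
```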
I focused on gradient boosting algorithms, which perform well on tabular datasets.
Algorithms Tested:
- LightGBM (LGBMRegressor): fast and effective on large datasets
- CatBoost (CatBoostRegressor): handles categorical features robustly
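A minimal example of CatBoost's native categorical handling, assuming a feature matrix `X_train` and log-transformed target `y_log_train`; the column list and hyperparameters shown are placeholders:

```python
from catboost import CatBoostRegressor, Pool

cat_cols = ["brand", "base_model", "transmission_type", "body_style", "fuel_type"]
train_pool = Pool(X_train, y_log_train, cat_features=cat_cols)

model = CatBoostRegressor(iterations=2000, learning_rate=0.05,
                          loss_function="RMSE", verbose=500)
model.fit(train_pool)  # no manual encoding needed for cat_cols
```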
Hyperparameter Tuning & Cross-Validation:
- Used Optuna for automated hyperparameter search
- 5-fold stratified CV on binned log-transformed prices ensured stable performance
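A condensed sketch of this tuning loop, assuming a feature matrix `X` (DataFrame) and log-transformed target `y_log` (Series); the search space shown is illustrative, not the exact one used:

```python
import numpy as np
import optuna
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

# Stratify on quantile bins of the log price so every fold sees the full price range.
bins = np.digitize(y_log, np.quantile(y_log, np.linspace(0, 1, 11)[1:-1]))

def objective(trial):
    params = {
        "n_estimators": 2000,
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 31, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 100),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for tr_idx, va_idx in cv.split(X, bins):
        model = lgb.LGBMRegressor(**params)
        model.fit(
            X.iloc[tr_idx], y_log.iloc[tr_idx],
            eval_set=[(X.iloc[va_idx], y_log.iloc[va_idx])],
            callbacks=[lgb.early_stopping(100, verbose=False)],
        )
        preds = model.predict(X.iloc[va_idx])
        scores.append(mean_squared_error(y_log.iloc[va_idx], preds) ** 0.5)  # RMSE in log space
    return float(np.mean(scores))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
```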
Ensembling:
- Final predictions were generated by stacking LightGBM, CatBoost, and a Ridge baseline using a Ridge meta-model
- The meta-model learned optimal weights for each base model, improving generalization
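A minimal stacking sketch using scikit-learn's `StackingRegressor`, assuming numerically encoded features in `X_train`/`X_test` and a log1p-transformed target; the hyperparameters are placeholders:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

stack = StackingRegressor(
    estimators=[
        ("lgbm", LGBMRegressor(n_estimators=1000, random_state=42)),
        ("cat", CatBoostRegressor(iterations=1000, verbose=0, random_state=42)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-model learns the blend weights
    cv=5,  # out-of-fold base predictions feed the meta-model
)
stack.fit(X_train, y_log_train)
price_pred = np.expm1(stack.predict(X_test))  # invert log1p back to dollars
```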
- LightGBM CV RMSE: 0.4852 – 0.4927
- CatBoost CV RMSE: 0.4859 – 0.4939
- Stacked ensemble mean predicted price: $43,401
The ensemble consistently reduced error compared to individual models.
- The pipeline effectively handled missing values, skewed distributions, and high-cardinality features
- Key insights: `horsepower`, `brand`, and vehicle age are strong predictors
- Future improvements could include:
  - Adding external features (e.g., market trends, location)
  - More advanced encoding for rare categories
  - Automated feature selection or feature importance-guided pruning