
Predict used car prices using machine learning. This project includes data cleaning, feature engineering, EDA, gradient-boosting models (LightGBM & CatBoost), and an ensemble approach, with careful handling of missing values and high-cardinality features for robust predictions.

Ell06arch/DSN_AI_Bootcamp_Qualification_Entry


Used Car Price Prediction – DSN AI Bootcamp Qualification Hackathon

Project Overview

The goal of this project was to predict used car prices from tabular vehicle listing data. I aimed to build a robust machine learning pipeline that could handle challenges common in real-world datasets, such as missing values, high-cardinality categorical features, and diverse vehicle specifications.

This work was submitted for the DSN AI Bootcamp Qualification Hackathon, which focused on practical predictive modeling tasks.


Dataset

  • Source: Hackathon-provided data

  • Size:

    • 188,533 training entries (13 columns, including the target price)
    • 125,690 test entries (12 columns, no target)
  • Key Features:

    • Numeric: horsepower, mileage, model_year, num_speeds
    • Categorical: brand, base_model, transmission_type, body_style, engine
  • Target Variable: price (continuous numeric)

The dataset had both numeric and categorical columns, some missing values, and high-cardinality features, which required careful preprocessing.


Data Cleaning & Preprocessing

The focus here was on keeping the data accurate while preparing it for modeling.

Missing Value Handling

  • clean_title (11.3% missing) → imputed using accident history correlations
  • fuel_type (2.7% missing) → imputed from engine specifications
  • accident (1.3% missing) → imputed appropriately
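The imputation from correlated columns described above can be sketched as a grouped-mode fill: missing `clean_title` values take the most frequent value among rows with the same accident history. The toy data below is illustrative, not from the actual dataset.

```python
import pandas as pd

# Toy listings: clean_title is missing for some rows; fill each gap with the
# most common clean_title among rows sharing the same accident history.
df = pd.DataFrame({
    "accident": ["none", "none", "at_least_1", "at_least_1", "none"],
    "clean_title": ["yes", "yes", "no", None, None],
})

# Per-group mode, broadcast back to the original row order.
mode_per_group = df.groupby("accident")["clean_title"].transform(
    lambda s: s.mode().iloc[0] if not s.mode().empty else s
)
df["clean_title"] = df["clean_title"].fillna(mode_per_group)
```

The same pattern applies to `fuel_type` imputed from engine specifications, just with a different grouping column.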

Categorical Feature Auditing

  • Standardized case and formatting inconsistencies
  • Identified rare categories (<0.5% frequency) and train-test mismatches
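A minimal sketch of this audit, with a hypothetical brand column and a larger rarity threshold so the toy data triggers it (the project used <0.5%):

```python
import pandas as pd

train = pd.DataFrame({"brand": ["toyota", "Toyota", "ford", "ford", "ford",
                                "lucid", "bmw", "bmw", "toyota", "ford"]})
test = pd.DataFrame({"brand": ["toyota", "ford", "rivian"]})

# Standardize case and whitespace before comparing categories.
for frame in (train, test):
    frame["brand"] = frame["brand"].str.strip().str.lower()

# Flag rare categories by relative frequency (<20% here for the toy data).
freq = train["brand"].value_counts(normalize=True)
rare = set(freq[freq < 0.2].index)

# Categories present in test but never seen in train (a train-test mismatch).
unseen = set(test["brand"]) - set(train["brand"])
```

Rare and unseen categories can then be grouped into an "other" bucket or handled by the encoder.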

Feature Engineering

  • Parsed engine specifications into horsepower, displacement, and cylinder count
  • Normalized mileage and derived vehicle age from model_year
  • Created binary flags: has_accident, has_clean_title, is_luxury_brand
  • Log-transformed price and mileage to reduce skewness
  • Imputed missing categorical values using combinations of (brand, model, model_year) and added null indicators for numeric features
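The flag and transform steps above can be sketched as follows; the luxury-brand list and the 2024 snapshot year are assumptions for illustration:

```python
import numpy as np
import pandas as pd

LUXURY = {"bmw", "mercedes-benz", "audi", "porsche", "lexus"}  # assumed list
SNAPSHOT_YEAR = 2024  # assumed data snapshot year for deriving vehicle age

df = pd.DataFrame({
    "brand": ["bmw", "toyota"],
    "model_year": [2018, 2012],
    "mileage": [30000, 120000],
    "price": [25000, 8000],
    "accident": ["none", "at_least_1"],
    "clean_title": ["yes", None],
})

df["car_age"] = SNAPSHOT_YEAR - df["model_year"]
df["has_accident"] = (df["accident"] != "none").astype(int)
df["has_clean_title"] = (df["clean_title"] == "yes").astype(int)
df["is_luxury_brand"] = df["brand"].isin(LUXURY).astype(int)
# log1p tames the right skew in price and mileage.
df["log_price"] = np.log1p(df["price"])
df["log_mileage"] = np.log1p(df["mileage"])
```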

High-Cardinality Features

  • Brand: 57 unique values
  • Base_model: 542 unique values
  • Engine: 1,117 unique specifications → parsed into structured features
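Parsing the 1,117 raw engine strings into structured columns might look like the regex sketch below; the exact string format ("335.0HP 3.0L 6 Cylinder Engine") is an assumption about the listing data:

```python
import re

def parse_engine(spec):
    """Extract horsepower, displacement (L), and cylinder count from a raw
    engine description string; missing pieces come back as None."""
    hp = re.search(r"([\d.]+)\s*HP", spec, re.I)
    disp = re.search(r"([\d.]+)\s*L", spec, re.I)
    cyl = re.search(r"(\d+)\s*Cylinder", spec, re.I)
    return {
        "horsepower": float(hp.group(1)) if hp else None,
        "displacement": float(disp.group(1)) if disp else None,
        "cylinders": int(cyl.group(1)) if cyl else None,
    }

parsed = parse_engine("335.0HP 3.0L 6 Cylinder Engine")
```

Applied column-wise, this replaces one high-cardinality categorical with three numeric features the boosting models can split on directly.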

Exploratory Data Analysis (EDA)

EDA helped guide feature engineering and model decisions.

Numeric Features:

  • horsepower positively correlated with price (0.25)
  • mileage, log_mileage, and car_age negatively correlated with price
  • Observed skewed distributions and outliers, which motivated log transformations

Categorical Features:

  • brand ranked among the top predictors by between-group price variance; material initially scored highly but proved mostly uninformative and was dropped
  • High-cardinality features like base_model showed long-tail distributions

Modeling Approach

I focused on gradient boosting algorithms, which perform well on tabular datasets.

Algorithms Tested:

  • LightGBM (LGBMRegressor): fast and effective on large datasets
  • CatBoost (CatBoostRegressor): handles categorical features robustly

Hyperparameter Tuning & Cross-Validation:

  • Used Optuna for automated hyperparameter search
  • 5-fold stratified CV on binned log-transformed prices ensured stable performance
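Stratifying folds on a continuous target requires binning it first; a sketch of that step on synthetic prices (decile bins are an assumption about the bin count):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic right-skewed prices stand in for the real target.
rng = np.random.default_rng(0)
log_price = np.log1p(rng.lognormal(mean=10.0, sigma=0.8, size=1000))

# Regression targets can't be stratified directly, so bin log_price into
# deciles and stratify the folds on the bin label instead.
edges = np.quantile(log_price, np.linspace(0.1, 0.9, 9))
labels = np.digitize(log_price, edges)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [len(va) for _, va in skf.split(log_price, labels)]
```

Each fold then preserves the price-bin distribution, which keeps per-fold RMSE estimates comparable.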

Ensembling:

  • Final predictions were generated by stacking LightGBM, CatBoost, and a Ridge baseline using a Ridge meta-model
  • The meta-model learned optimal weights for each base model, improving generalization
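A self-contained sketch of this stacking structure; sklearn's `GradientBoostingRegressor` stands in for LightGBM and CatBoost so the example runs without extra dependencies, and the synthetic data replaces the real listings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("gbm_a", GradientBoostingRegressor(random_state=0)),      # LightGBM stand-in
        ("gbm_b", GradientBoostingRegressor(max_depth=2, random_state=1)),  # CatBoost stand-in
        ("ridge", Ridge(alpha=1.0)),                               # linear baseline
    ],
    final_estimator=Ridge(),  # meta-model learns weights over base predictions
    cv=5,                     # out-of-fold predictions feed the meta-model
)
stack.fit(X, y)
preds = stack.predict(X)
```

Because the meta-model is trained on out-of-fold base predictions, its learned weights reflect generalization rather than training fit.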

Results

  • LightGBM CV RMSE: 0.4852 – 0.4927
  • CatBoost CV RMSE: 0.4859 – 0.4939
  • Stacked Ensemble Mean Price: $43,401

The ensemble consistently reduced error compared to individual models.


Conclusion & Next Steps

  • The pipeline effectively handled missing values, skewed distributions, and high-cardinality features

  • Key insights: horsepower, brand, and vehicle age are strong predictors

  • Future improvements could include:

    • Adding external features (e.g., market trends, location)
    • More advanced encoding for rare categories
    • Automated feature selection or feature importance-guided pruning
