
Automobile Price Prediction

Machine learning regression model for predicting 1985 import vehicle prices. Achieves a test R² of 0.917 and a cross-validation R² of 0.894 with Lasso regression.


Overview

Predicts automobile prices from 25 vehicle specification features (brand, engine, body design, performance). Addresses data leakage, missing values (18%), multicollinearity (VIF > 16,000), outliers (10% of samples), and high-cardinality categoricals.

Final Model: Lasso Regression (alpha=10.0)

  • Test: R² = 0.917, RMSE = $1,987
  • Cross-validation: R² = 0.894 ± 0.027
  • Features: 42 (6 PCA + 36 categorical), 29 non-zero coefficients
  • Overfitting gap: 3.3%

See Model_Comparison_Report.md for why Lasso was selected over XGBoost despite 16% higher test error.
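
As a quick check, the sparsity figure above can be reproduced from the saved artifact. A minimal sketch, assuming the artifact is a bare scikit-learn Lasso exposing coef_ rather than a full pipeline:

import joblib
import numpy as np

model = joblib.load('models/final_lasso_model.joblib')
# Lasso zeroes out weak features; count the survivors
print(np.sum(model.coef_ != 0))  # expected: 29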


Quick Start

Installation

pip install -r requirements.txt

Load Trained Model

import joblib
model = joblib.load('models/final_lasso_model.joblib')
predictions = model.predict(X_new)  # Requires 42 engineered features

Run Full Pipeline

Open notebooks/auto-price-prediction.ipynb for complete data preparation, modeling, and evaluation workflow.
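
To launch it locally with the standard Jupyter CLI:

jupyter notebook notebooks/auto-price-prediction.ipynb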

Dataset

Property  Details
Source    1985 Auto Imports Database (UCI ML Repository)
Samples   200 vehicles (205 original, 5 removed)
Features  26 attributes (25 predictors + price)
Split     158 train / 40 test (stratified)

See data/raw/dataset-info.txt for full metadata.
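
The README does not state the stratification variable; for a continuous target, stratified splitting is typically done on binned prices. A sketch of that approach, assuming the notebook's cleaning has already reduced the data to 200 rows (bin count and random_state are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/raw/auto_imports.csv')  # then apply the notebook's cleaning steps
# Bin the continuous target so both splits cover the full price range
price_bins = pd.qcut(df['price'], q=5, labels=False)
train_df, test_df = train_test_split(df, test_size=40, stratify=price_bins, random_state=42)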

Project Structure

Core directories:

  • data/raw/ - Original dataset (auto_imports.csv)
  • data/processed/train-test/ - Train/test split
  • notebooks/ - Full analysis pipeline (auto-price-prediction.ipynb)
  • src/ - Reusable modules (utils, statistical analysis, model evaluation)
  • models/ - Trained model artifacts (final_lasso_model.joblib)
  • reports/ - Detailed analysis reports

Working with the Notebook

Import pattern used: The notebook imports functions from src/ modules using:

from src.utils import memory_usage, dataframe_memory_usage
from src.statistical_analysis import normality_test_with_skew_kurt, spearman_correlation_with_target
from src.model_evaluation import evaluate_regression_model, hyperparameter_tuning

Running analysis: The notebook contains the full ML pipeline. Execute cells sequentially for:

  1. Data loading and cleaning
  2. Statistical analysis (normality tests, correlation)
  3. Feature engineering (PCA on numerical features; see the sketch after this list)
  4. Model comparison (10 algorithms tested)
  5. Hyperparameter tuning (5-fold GridSearchCV)
  6. Final model selection and persistence
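
For step 3, the six PCA components cited in the model summary can be produced along these lines. A sketch only: the actual column selection lives in the notebook, and X_train_numeric / X_test_numeric are hypothetical names for the numeric feature blocks:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is scale-sensitive, so standardize the numeric block first
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_train_numeric)

pca = PCA(n_components=6)  # 6 components, per the final model's feature count
X_train_pca = pca.fit_transform(X_num_scaled)
X_test_pca = pca.transform(scaler.transform(X_test_numeric))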

Model Training Workflow

Base model evaluation:

from src.model_evaluation import evaluate_regression_model

metrics = evaluate_regression_model(model, X_train, y_train, X_test, y_test)
# Returns: MAE, MSE, RMSE, R², Adjusted R², MSLE, MAPE, CV R², Training R², Overfit, Training Time
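
The same helper can be looped over candidate estimators for the base-model comparison. A sketch, assuming the helper fits each model internally (which the Training Time metric suggests):

from sklearn.linear_model import Lasso, LinearRegression, Ridge

from src.model_evaluation import evaluate_regression_model

candidates = {'Linear': LinearRegression(), 'Ridge': Ridge(), 'Lasso': Lasso()}
results = {name: evaluate_regression_model(m, X_train, y_train, X_test, y_test)
           for name, m in candidates.items()}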

Hyperparameter tuning:

from src.model_evaluation import hyperparameter_tuning

best_models, best_params, times = hyperparameter_tuning(
    models={'Lasso': Lasso()},
    param_grids={'Lasso': {'alpha': [0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}},
    X_train=X_train,
    y_train=y_train,
    scoring_metric='neg_mean_squared_error',
    cv_folds=5
)
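
Note that neg_mean_squared_error is negated because scikit-learn maximizes scores. To read a tuning result as RMSE in dollars, a standalone sketch using GridSearchCV directly (which the helper presumably wraps):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(Lasso(), {'alpha': [0.1, 1, 10]},
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(X_train, y_train)
rmse = np.sqrt(-grid.best_score_)  # undo the negation, then square-root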

Statistical Analysis Functions

Normality testing:

from src.statistical_analysis import normality_test_with_skew_kurt

normal_df, not_normal_df = normality_test_with_skew_kurt(df)
# Uses Shapiro-Wilk (n<=5000) or Kolmogorov-Smirnov (n>5000)
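
Under the hood this presumably combines scipy's tests with moment statistics. A minimal standalone equivalent for a single column (df is assumed to be the cleaned dataframe):

from scipy import stats

x = df['price'].dropna()
# Shapiro-Wilk suits small samples such as this one (n=200)
stat, p = stats.shapiro(x)
print(f"W={stat:.3f}, p={p:.4f}, skew={stats.skew(x):.2f}, kurtosis={stats.kurtosis(x):.2f}")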

Multicollinearity detection:

from src.statistical_analysis import calculate_vif

vif_data, high_vif_features = calculate_vif(data, exclude_target='TARGET', multicollinearity_threshold=5.0)
# Returns VIF scores and features exceeding threshold
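
The helper most likely wraps statsmodels' variance_inflation_factor; a minimal equivalent for numeric-only data (adding an intercept via add_constant is an assumption):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = add_constant(data.drop(columns=['TARGET']))  # numeric predictors only
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif[vif > 5.0])  # features exceeding the threshold used above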

Spearman correlation:

from src.statistical_analysis import spearman_correlation_with_target

corr_data = spearman_correlation_with_target(
    data,
    non_normal_cols=['col1', 'col2'],
    target_col='TARGET',
    plot=True,
    table=True
)
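
Pandas can reproduce the table portion directly. A one-line equivalent, using the same hypothetical column names as above:

# Rank-based correlation is robust for the non-normal columns flagged earlier
corr = data[['col1', 'col2']].corrwith(data['TARGET'], method='spearman')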

Model Persistence

Loading the final model:

import joblib
model = joblib.load('models/final_lasso_model.joblib')
predictions = model.predict(X_new)
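
For completeness, the artifact itself is written with the matching joblib call (standard joblib API; the path mirrors the repo layout):

import joblib

joblib.dump(model, 'models/final_lasso_model.joblib')  # writes the artifact loaded above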

Key Design Decisions

Model selection criteria (weighted):

  1. Generalization (CV R² stability) - 40%
  2. Accuracy (Test RMSE/R²) - 30%
  3. Stability (overfitting gap) - 20%
  4. Efficiency (training time, interpretability) - 10%
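
Applied mechanically, the weighting reduces to a simple composite. An illustrative calculation only; how each criterion is normalized is not specified here (see Model_Comparison_Report.md for the actual scoring):

# Each criterion pre-scaled to [0, 1], higher is better
def composite_score(generalization, accuracy, stability, efficiency):
    return (0.40 * generalization + 0.30 * accuracy
            + 0.20 * stability + 0.10 * efficiency)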

Why Lasso over XGBoost:

  • XGBoost achieved a lower test RMSE ($1,663 vs $1,987) but showed an 8.3-point CV-test gap versus Lasso's 2.3-point gap
  • Lasso offers interpretability (29 sparse coefficients) versus XGBoost's black-box ensemble
  • Training: 11.5x faster (0.014s vs 0.161s)
  • Inference: roughly 600x faster
  • Trade-off: accept 2.5% higher error for a 4.1-point better CV R² and full transparency

Reports

Detailed analysis reports live in reports/; see Model_Comparison_Report.md for the model selection rationale.

Development

Code Quality

# Format code and sort imports
black .
isort .

# Run pre-commit hooks
pre-commit run --all-files

Pre-commit Hooks

  • black (88-char lines)
  • isort (black-compatible)
  • nbqa-black (notebooks)
  • Validation (YAML, JSON, trailing whitespace)