Machine learning regression model for predicting 1985 import vehicle prices. Achieves a test R² of 0.917 and a cross-validated R² of 0.894 with Lasso regression.
Predicts automobile prices from 25 vehicle specification features (brand, engine, body design, performance). Addresses data leakage, missing values (18%), multicollinearity (VIF > 16,000), outliers (10% of samples), and high-cardinality categorical features.
Final Model: Lasso Regression (alpha=10.0)
- Test: R² = 0.917, RMSE = $1,987
- Cross-validation: R² = 0.894 ± 0.027
- Features: 42 (6 PCA + 36 categorical), 29 non-zero coefficients
- Overfitting gap: 3.3%
See Model_Comparison_Report.md for why Lasso was selected over XGBoost despite 16% higher test error.
```bash
pip install -r requirements.txt
```

```python
import joblib

model = joblib.load('models/final_lasso_model.joblib')
predictions = model.predict(X_new)  # Requires 42 engineered features
```

Open notebooks/auto-price-prediction.ipynb for the complete data preparation, modeling, and evaluation workflow.
| Property | Details |
|---|---|
| Source | 1985 Auto Imports Database (UCI ML Repository) |
| Samples | 200 vehicles (205 original, 5 removed) |
| Features | 26 attributes (25 predictors + price) |
| Split | 158 train / 40 test (stratified) |
See data/raw/dataset-info.txt for full metadata.
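Stratifying a split on a continuous target like price requires binning it first. A minimal sketch of that idea with synthetic stand-in prices (the notebook's exact binning scheme is an assumption here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the 200-vehicle dataset: synthetic prices for illustration
rng = np.random.default_rng(42)
df = pd.DataFrame({'price': rng.uniform(5_000, 45_000, size=200)})

# Bin the continuous target so train_test_split can stratify on it
price_bins = pd.qcut(df['price'], q=5, labels=False)

train_df, test_df = train_test_split(
    df, test_size=40, stratify=price_bins, random_state=42
)
```

Stratifying on price bins keeps the train and test sets balanced across the price range, which matters with only 200 samples.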
Core directories:
- data/raw/ - Original dataset (auto_imports.csv)
- data/processed/train-test/ - Train/test split
- notebooks/ - Full analysis pipeline (auto-price-prediction.ipynb)
- src/ - Reusable modules (utils, statistical analysis, model evaluation)
- models/ - Trained model artifacts (final_lasso_model.joblib)
- reports/ - Detailed analysis reports
Import pattern used: The notebook imports functions from src/ modules:

```python
from src.utils import memory_usage, dataframe_memory_usage
from src.statistical_analysis import normality_test_with_skew_kurt, spearman_correlation_with_target
from src.model_evaluation import evaluate_regression_model, hyperparameter_tuning
```

Running analysis: The notebook contains the full ML pipeline. Execute cells sequentially for:
- Data loading and cleaning
- Statistical analysis (normality tests, correlation)
- Feature engineering (PCA on numerical features)
- Model comparison (10 algorithms tested)
- Hyperparameter tuning (5-fold GridSearchCV)
- Final model selection and persistence
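The PCA step in the pipeline above can be sketched as follows (synthetic stand-in data; the actual numeric column count in the notebook is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_numeric = rng.normal(size=(200, 14))  # stand-in for the numeric features

# Scale first, then compress the correlated numeric columns into 6
# orthogonal components to defuse the extreme multicollinearity
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=6))
X_pca = pca_pipeline.fit_transform(X_numeric)

# X_pca (200, 6) would then be concatenated with the 36 one-hot
# categorical columns to form the 42-feature design matrix.
```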
Base model evaluation:

```python
from src.model_evaluation import evaluate_regression_model

metrics = evaluate_regression_model(model, X_train, y_train, X_test, y_test)
# Returns: MAE, MSE, RMSE, R², Adjusted R², MSLE, MAPE, CV R², Training R², Overfit, Training Time
```

Hyperparameter tuning:
```python
from sklearn.linear_model import Lasso

from src.model_evaluation import hyperparameter_tuning

best_models, best_params, times = hyperparameter_tuning(
    models={'Lasso': Lasso()},
    param_grids={'Lasso': {'alpha': [0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000]}},
    X_train=X_train,
    y_train=y_train,
    scoring_metric='neg_mean_squared_error',
    cv_folds=5
)
```

Normality testing:
```python
from src.statistical_analysis import normality_test_with_skew_kurt

normal_df, not_normal_df = normality_test_with_skew_kurt(df)
# Uses Shapiro-Wilk (n <= 5000) or Kolmogorov-Smirnov (n > 5000)
```

Multicollinearity detection:
```python
from src.statistical_analysis import calculate_vif

vif_data, high_vif_features = calculate_vif(data, exclude_target='TARGET', multicollinearity_threshold=5.0)
# Returns VIF scores and features exceeding the threshold
```

Spearman correlation:
```python
from src.statistical_analysis import spearman_correlation_with_target

corr_data = spearman_correlation_with_target(
    data,
    non_normal_cols=['col1', 'col2'],
    target_col='TARGET',
    plot=True,
    table=True
)
```

Loading the final model:
```python
import joblib

model = joblib.load('models/final_lasso_model.joblib')
predictions = model.predict(X_new)
```

Model selection criteria (weighted):
- Generalization (CV R² stability) - 40%
- Accuracy (Test RMSE/R²) - 30%
- Stability (overfitting gap) - 20%
- Efficiency (training time, interpretability) - 10%
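As an illustration, these weights can be applied as a simple weighted sum over normalized per-criterion scores. The scores below are hypothetical; only the weights come from this README:

```python
weights = {'generalization': 0.40, 'accuracy': 0.30,
           'stability': 0.20, 'efficiency': 0.10}

# Hypothetical normalized scores (0-1, higher is better) per criterion
scores = {
    'Lasso':   {'generalization': 0.95, 'accuracy': 0.85,
                'stability': 0.95, 'efficiency': 1.00},
    'XGBoost': {'generalization': 0.75, 'accuracy': 1.00,
                'stability': 0.60, 'efficiency': 0.40},
}

# Weighted sum per model; the highest total wins
weighted = {name: sum(weights[c] * s[c] for c in weights)
            for name, s in scores.items()}
best = max(weighted, key=weighted.get)
```

With generalization weighted at 40%, a model that wins only on raw accuracy can still lose the overall score, which is the shape of the Lasso-vs-XGBoost decision below.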
Why Lasso over XGBoost:
- XGBoost achieved a lower test RMSE ($1,663 vs $1,987) but showed an 8.3-point CV-test gap versus Lasso's 2.3-point gap
- Lasso is interpretable (29 sparse coefficients) where XGBoost is a black box
- Training: 11.5x faster (0.014 s vs 0.161 s)
- Inference: roughly 600x faster
- Trade-off: accept 2.5% higher error for 4.1 points better CV R² and full transparency
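The sparsity behind the interpretability claim can be verified by counting non-zero coefficients after fitting. A sketch on synthetic stand-in data (the repo's real 42-feature matrix is what yields the 29 non-zero coefficients reported above):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in: 158 samples, 42 features, 5 truly informative ones
rng = np.random.default_rng(3)
X = rng.normal(size=(158, 42))
true_coef = np.zeros(42)
true_coef[:5] = [300.0, -200.0, 150.0, 400.0, -100.0]  # price-scale effects
y = X @ true_coef + rng.normal(scale=50.0, size=158)

# L1 penalty drives uninformative coefficients exactly to zero
lasso = Lasso(alpha=10.0).fit(X, y)
n_nonzero = int(np.count_nonzero(lasso.coef_))
```

Each surviving coefficient maps directly to a named feature's dollar effect on price, which is the transparency XGBoost cannot offer.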
Detailed analysis in reports/:
- Complete_Data_Analysis_Report.md - Full methodology, statistical analysis, and results
- Model_Comparison_Report.md - Model selection rationale and performance comparison
- Challenges_Report.md - Technical challenges and solutions
- GALLERY.md - Visualizations
```bash
# Format code
black .
isort .

# Run pre-commit hooks
pre-commit run --all-files
```

Pre-commit hooks:
- black (88-char lines)
- isort (black-compatible)
- nbqa-black (notebooks)
- Validation (YAML, JSON, trailing whitespace)
- MIT License - Copyright (c) 2025 Dhanesh B. B.
- GitHub: https://github.com/dhaneshbb