California Housing Data Analysis - Complete Machine Learning Pipeline

This project implements a comprehensive machine learning pipeline for predicting California housing prices, following Chapter 2 of "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow". The project demonstrates best practices in data science workflow from data exploration to production-ready model deployment.

Dataset Overview

  • Size: 20,640 housing records
  • Features: 10 attributes (9 predictors plus the target) spanning geographic, demographic, and economic data
  • Target: median_house_value (continuous regression target)
  • Missing Values: 207 missing values in total_bedrooms feature
  • Categorical Feature: ocean_proximity with 5 categories

Complete Implementation Pipeline

1. Data Loading & Initial Exploration

  • ✅ Loaded housing data from housing.csv using pandas
  • ✅ Comprehensive data inspection (df.info(), df.describe())
  • ✅ Missing value analysis and categorical feature exploration
  • ✅ Data type validation and memory usage optimization
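The loading and inspection steps above can be sketched as follows. This is a minimal illustration using a tiny synthetic frame with the same schema; in the notebook the frame comes from `pd.read_csv("housing.csv")` instead, where `total_bedrooms` has 207 missing values.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for: housing = pd.read_csv("housing.csv")
housing = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24],
    "median_income": [8.3252, 8.3014, 7.2574],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})

housing.info()                     # dtypes, non-null counts, memory usage
print(housing.describe())          # summary statistics for numeric columns

# Missing-value and categorical exploration
missing = housing["total_bedrooms"].isna().sum()
print(missing)                     # 1 in this sample; 207 in the full dataset
print(housing["ocean_proximity"].value_counts())
```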

2. Data Visualization & Analysis

  • Box plots for numerical features to identify outliers and distributions
  • Histograms showing distributions of all numerical attributes
  • Geographic visualization (longitude vs latitude scatter plots)
  • Population-weighted scatter plots with house values as color mapping
  • Scatter matrix for pairwise feature relationships
  • Correlation analysis with target variable identification
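The population-weighted geographic scatter described above can be sketched like this. The coordinates and values here are randomly generated stand-ins for the real columns, and the headless `Agg` backend is used so the figure is saved rather than displayed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: save to file instead of showing a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real longitude/latitude/population columns
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "longitude": rng.uniform(-124.0, -114.0, 200),
    "latitude": rng.uniform(32.0, 42.0, 200),
    "population": rng.uniform(100, 5000, 200),
    "median_house_value": rng.uniform(50_000, 500_000, 200),
})

# Marker size encodes population, color encodes house value
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True)
plt.legend()
plt.savefig("california_scatter.png")
```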

3. Advanced Data Splitting Strategy

  • Stratified sampling using income categories for representative train-test splits
  • ✅ Created income_cat feature with 5 bins (0-1.5, 1.5-3.0, 3.0-4.5, 4.5-6.0, 6.0+)
  • ✅ Used StratifiedShuffleSplit to ensure proportional representation
  • ✅ Proper cleanup of temporary features after splitting
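The stratified split above can be sketched as follows, using a synthetic `median_income` column in place of the real data. The five bins and the `StratifiedShuffleSplit` call match the steps listed; everything else (sample size, seed) is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic incomes standing in for the real median_income column
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 10.0, 1000)})

# Temporary income_cat feature with the 5 bins used for stratification
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# One 80/20 split whose test set preserves the income-category proportions
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train = housing.iloc[train_idx]
    strat_test = housing.iloc[test_idx]

# Clean up the temporary feature after splitting
strat_train = strat_train.drop("income_cat", axis=1)
strat_test = strat_test.drop("income_cat", axis=1)
```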

4. Feature Engineering

  • Custom transformer (CombinedAttributesAdder) for automated feature creation
  • New engineered features:
    • rooms_per_household = total_rooms / households
    • bedrooms_per_room = total_bedrooms / total_rooms
    • population_per_household = population / households
  • Correlation re-analysis showing improved predictive power
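A custom transformer of this kind can be sketched as below, following the book's pattern. The hard-coded column indices assume the numeric columns arrive in the dataset's usual order; adjust them if the column layout differs.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column indices into the numeric array (assumed order: ..., total_rooms,
# total_bedrooms, population, households, ...)
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Appends the engineered ratio features to the input array."""

    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household,
                         population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

# 8 input columns -> 11 output columns (3 engineered features appended)
X = np.ones((2, 8))
X_extended = CombinedAttributesAdder().transform(X)
```

Because the transformer implements `fit`/`transform`, it slots directly into a scikit-learn `Pipeline` alongside the imputer and scaler.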

5. Comprehensive Data Preprocessing Pipeline

  • Missing value handling with SimpleImputer (median strategy)
  • Categorical encoding:
    • OrdinalEncoder (demonstration)
    • OneHotEncoder (production pipeline)
  • Custom transformation pipeline:
    • Numerical: Imputer → AttributeAdder → StandardScaler
    • Categorical: OneHotEncoder
  • ColumnTransformer for unified preprocessing
  • Final prepared data: (16,512 samples × 16 features)
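The unified preprocessing can be sketched with a `ColumnTransformer` as below. For brevity this sketch omits the custom `CombinedAttributesAdder` step and uses a tiny toy frame; in the notebook the numeric pipeline is Imputer → AttributeAdder → StandardScaler over the real columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: one numeric column (with a missing value) and one categorical
housing = pd.DataFrame({
    "median_income": [1.0, 2.0, np.nan, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})
num_attribs = ["median_income"]
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill NaNs with the median
    ("scaler", StandardScaler()),                   # zero mean, unit variance
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

prepared = full_pipeline.fit_transform(housing)
# 1 scaled numeric column + 3 one-hot columns (one per category seen)
```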

6. Multi-Model Training & Evaluation

  • Linear Regression: cross-validation RMSE ≈ 69,104
  • Decision Tree: cross-validation RMSE ≈ 71,630 (overfitting detected)
  • Random Forest: cross-validation RMSE ≈ 50,436 (best performance)
  • Support Vector Machine: Comprehensive hyperparameter tuning
  • Cross-validation (10-fold) for robust model evaluation
  • Performance comparison with standardized metrics
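The 10-fold cross-validation used above can be sketched as follows. Synthetic regression data stands in for the prepared housing features; note that scikit-learn scoring is "higher is better", so MSE comes back negated and must be flipped before taking the square root.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the (16,512 x 16) prepared training data
X, y = make_regression(n_samples=200, n_features=16, noise=10.0,
                       random_state=42)

forest = RandomForestRegressor(n_estimators=30, random_state=42)

# neg_mean_squared_error: negate, then sqrt, to recover per-fold RMSE
scores = cross_val_score(forest, X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("mean RMSE:", rmse_scores.mean(), "std:", rmse_scores.std())
```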

7. Advanced Hyperparameter Optimization

  • Grid Search on Random Forest with 18 parameter combinations
  • Randomized Search for efficient parameter exploration
  • SVM hyperparameter tuning with 50 combinations:
    • Linear kernel: C values [10, 30, 100, 300, 1000, 3000, 10000, 30000]
    • RBF kernel: C values [1, 3, 10, 30, 100, 300, 1000] × gamma values [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
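The grid search over the Random Forest can be sketched as below. The 18-combination grid follows the book's layout (3×4 bootstrapped plus 2×3 non-bootstrapped); the synthetic data and 5-fold CV are illustrative stand-ins.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data
X, y = make_regression(n_samples=100, n_features=8, noise=5.0,
                       random_state=42)

# 3*4 + 2*3 = 18 parameter combinations
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10],
     "max_features": [2, 3, 4]},
]

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

`RandomizedSearchCV` has the same interface but samples parameter settings from distributions instead of enumerating a grid, which scales better when the search space is large.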

8. Model Selection & Final Evaluation

  • Best model: Random Forest with optimized hyperparameters
  • Optimal parameters: max_features=8, n_estimators=30
  • Feature importance analysis identifying key predictors
  • Test set evaluation with final model
  • Statistical confidence intervals (95% confidence level)
  • Error analysis using both t-distribution and normal distribution
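The 95% confidence interval on the test RMSE can be sketched with `scipy.stats` as below. The per-sample squared errors here are randomly generated placeholders, not the model's real errors; the interval is computed on the mean squared error and then square-rooted to get RMSE bounds.

```python
import numpy as np
from scipy import stats

# Hypothetical per-sample squared errors on a test set (placeholder data)
rng = np.random.default_rng(42)
squared_errors = rng.exponential(scale=2.5e9, size=4128)

confidence = 0.95
# t-interval for the mean squared error, then sqrt for an RMSE interval
lower, upper = np.sqrt(stats.t.interval(
    confidence,
    len(squared_errors) - 1,          # degrees of freedom
    loc=squared_errors.mean(),
    scale=stats.sem(squared_errors),  # standard error of the mean
))
print(lower, upper)
```

With thousands of test samples the t-distribution and normal-distribution intervals are nearly identical, which is why the notebook computes both as a sanity check.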

Key Results & Insights

Model Performance Ranking

  1. Random Forest (Optimized): RMSE ≈ 49,898 ⭐ BEST
  2. Random Forest (Default): RMSE ≈ 50,436
  3. Linear Regression: RMSE ≈ 69,104
  4. Decision Tree: RMSE ≈ 71,630

Most Important Features (Feature Importance Analysis)

  1. median_income - Primary economic indicator
  2. rooms_per_household - Engineered feature showing high predictive power
  3. population_per_household - Demographic density indicator
  4. longitude - Geographic location factor
  5. latitude - Geographic location factor
  6. housing_median_age - Property age indicator

Technical Achievements

  • Production-ready preprocessing pipeline with automated transformations
  • Robust stratified sampling ensuring representative data splits
  • Advanced feature engineering with custom transformers
  • Comprehensive model comparison with statistical validation
  • Optimized hyperparameters through systematic search
  • Statistical rigor with confidence intervals and error analysis

Requirements

  • Python 3.x
  • pandas
  • numpy
  • matplotlib
  • seaborn
  • scikit-learn
  • scipy (for statistical analysis)

Project Structure

├── Chapter2.ipynb   # Complete ML pipeline implementation
├── housing.csv      # California housing dataset (20,640 records)
└── README.md        # This documentation

Usage

  1. Run the notebook: Execute Chapter2.ipynb cells sequentially
  2. Model training: The pipeline automatically trains and compares multiple models
  3. Best model: Random Forest with optimized hyperparameters is selected
  4. Predictions: Use final_model.predict() with preprocessed data

Key Learnings

  • Stratified sampling is crucial for representative train-test splits
  • Feature engineering significantly improves model performance
  • Random Forest outperforms simpler models for this regression task
  • Hyperparameter optimization provides substantial performance gains
  • Cross-validation is essential for reliable model evaluation
  • Custom transformers enable reusable preprocessing pipelines

This project demonstrates a complete, production-ready machine learning pipeline following industry best practices for data science workflow.
