This project implements a comprehensive machine learning pipeline for predicting California housing prices, following Chapter 2 of "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow". The project demonstrates best practices in the data science workflow, from data exploration to a production-ready model.
- Size: 20,640 housing records
- Features: 10 attributes including geographic, demographic, and economic data
- Target: `median_house_value` (continuous regression target)
- Missing Values: 207 missing values in the `total_bedrooms` feature
- Categorical Feature: `ocean_proximity` with 5 categories
- ✅ Loaded housing data from `housing.csv` using pandas
- ✅ Comprehensive data inspection (`df.info()`, `df.describe()`)
- ✅ Missing value analysis and categorical feature exploration
- ✅ Data type validation and memory usage optimization
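The loading and inspection steps above boil down to a few pandas calls. A minimal sketch, assuming `housing.csv` sits next to the notebook:

```python
import pandas as pd

# Load the dataset and run the standard first-look checks.
housing = pd.read_csv("housing.csv")

housing.info()                                     # dtypes, non-null counts, memory usage
print(housing.describe())                          # summary statistics for numerical columns
print(housing["ocean_proximity"].value_counts())   # distribution of the 5 categories
print(housing.isnull().sum())                      # per-column missing-value counts
```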
- ✅ Box plots for numerical features to identify outliers and distributions
- ✅ Histograms showing distributions of all numerical attributes
- ✅ Geographic visualization (longitude vs latitude scatter plots)
- ✅ Population-weighted scatter plots with house values mapped to color
- ✅ Scatter matrix for pairwise feature relationships
- ✅ Correlation analysis with target variable identification
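For reference, the population-weighted map and the correlation check can be sketched as follows. The `numeric_only=True` flag assumes pandas ≥ 1.5; older versions skipped the non-numeric column automatically:

```python
import matplotlib.pyplot as plt

# Geographic scatter plot: circle size tracks population, color tracks house value.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True, figsize=(10, 7))
plt.legend()
plt.show()

# Correlation of every numerical feature with the target.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```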
- ✅ Stratified sampling using income categories for representative train-test splits
- ✅ Created `income_cat` feature with 5 bins (0-1.5, 1.5-3.0, 3.0-4.5, 4.5-6.0, 6.0+)
- ✅ Used `StratifiedShuffleSplit` to ensure proportional representation
- ✅ Proper cleanup of temporary features after splitting (see the sketch below)
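A sketch of the stratified split, along the lines of the book's approach (it assumes the DataFrame still has its default integer index):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Bin median_income into the five categories listed above.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# Single stratified 80/20 split keyed on the income categories.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_idx]
    strat_test_set = housing.loc[test_idx]

# Drop the temporary stratification feature once the split is done.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
```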
- ✅ Custom transformer (`CombinedAttributesAdder`) for automated feature creation
- ✅ New engineered features:
  - `rooms_per_household` = total_rooms / households
  - `bedrooms_per_room` = total_bedrooms / total_rooms
  - `population_per_household` = population / households
- ✅ Correlation re-analysis showing improved predictive power
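A sketch of what such a custom transformer can look like. The column indices are an assumption about the order of the numerical attributes and must match the actual training data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed positions of the source columns in the numerical feature array.
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Appends the three ratio features to a NumPy array of housing data."""

    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]
```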
- ✅ Missing value handling with `SimpleImputer` (median strategy)
- ✅ Categorical encoding:
  - `OrdinalEncoder` (demonstration)
  - `OneHotEncoder` (production pipeline)
- ✅ Custom transformation pipeline:
  - Numerical: `SimpleImputer` → `CombinedAttributesAdder` → `StandardScaler`
  - Categorical: `OneHotEncoder`
- ✅ `ColumnTransformer` for unified preprocessing
- ✅ Final prepared data: 16,512 samples × 16 features
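Putting the pieces together, the preprocessing stage can be sketched like this. `housing_features` and `housing_labels` are hypothetical names derived from the stratified training set above, and `CombinedAttributesAdder` is the transformer sketched earlier:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Split the stratified training set into features and labels.
housing_features = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

# Numerical branch: impute -> add ratio features -> scale.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

# One ColumnTransformer applies both branches and concatenates the results.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing_features)
print(housing_prepared.shape)   # (16512, 16): 8 numeric + 3 engineered + 5 one-hot
```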
- ✅ Linear Regression: RMSE ≈ 69,104
- ✅ Decision Tree: RMSE ≈ 71,630 (overfitting detected)
- ✅ Random Forest: RMSE ≈ 50,436 (best performance)
- ✅ Support Vector Machine: Comprehensive hyperparameter tuning
- ✅ Cross-validation (10-fold) for robust model evaluation
- ✅ Performance comparison with standardized metrics
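The 10-fold cross-validation follows the standard scikit-learn pattern. A sketch, with `housing_prepared` and `housing_labels` carried over from the preprocessing sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

forest_reg = RandomForestRegressor(random_state=42)
scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)  # convert negated MSE back to RMSE
print(forest_rmse_scores.mean(), forest_rmse_scores.std())
```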
- ✅ Grid Search on Random Forest with 18 parameter combinations
- ✅ Randomized Search for efficient parameter exploration
- ✅ SVM hyperparameter tuning with 50 combinations:
- Linear kernel: C values [10, 30, 100, 300, 1000, 3000, 10000, 30000]
- RBF kernel: C values [1, 3, 10, 30, 100, 300, 1000] × gamma values [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
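Both searches follow the same `GridSearchCV` pattern, sketched below. The SVM grids are the ones listed above; the Random Forest grid is an assumption based on the book's 18-combination grid, which is consistent with the count and optimal parameters reported in this README. `RandomizedSearchCV` works the same way with `param_distributions` in place of `param_grid`:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Random Forest grid: 3x4 + 2x3 = 18 parameter combinations.
rf_param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           rf_param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)

# SVM grids listed above: 8 linear + 7x6 RBF = 50 combinations.
svm_param_grid = [
    {"kernel": ["linear"],
     "C": [10., 30., 100., 300., 1000., 3000., 10000., 30000.]},
    {"kernel": ["rbf"],
     "C": [1., 3., 10., 30., 100., 300., 1000.],
     "gamma": [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]
svm_grid_search = GridSearchCV(SVR(), svm_param_grid, cv=5,
                               scoring="neg_mean_squared_error")
svm_grid_search.fit(housing_prepared, housing_labels)
```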
- ✅ Best model: Random Forest with optimized hyperparameters
- ✅ Optimal parameters: `max_features=8`, `n_estimators=30`
- ✅ Feature importance analysis identifying key predictors
- ✅ Test set evaluation with final model
- ✅ Statistical confidence intervals (95% confidence level)
- ✅ Error analysis using both t-distribution and normal distribution
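The 95% confidence interval is computed from the squared test-set errors. A sketch using the t-distribution, with `final_predictions` and `y_test` as hypothetical names for the final model's test-set output and the test labels:

```python
import numpy as np
from scipy import stats

# 95% confidence interval for the test-set RMSE via the t-distribution.
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))
print(interval)
```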
- Random Forest (Optimized): RMSE ≈ 49,898 ⭐ BEST
- Random Forest (Default): RMSE ≈ 50,436
- Linear Regression: RMSE ≈ 69,104
- Decision Tree: RMSE ≈ 71,630
- `median_income` - Primary economic indicator
- `rooms_per_household` - Engineered feature showing high predictive power
- `population_per_household` - Demographic density indicator
- `longitude` - Geographic location factor
- `latitude` - Geographic location factor
- `housing_median_age` - Property age indicator
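This ranking comes from pairing the Random Forest's importance scores with the feature names. A sketch reusing `grid_search` (the fitted Random Forest search) and the pipeline names from the earlier sketches:

```python
# Pair each importance score with its feature name and print a ranked list.
feature_importances = grid_search.best_estimator_.feature_importances_

extra_attribs = ["rooms_per_household", "population_per_household",
                 "bedrooms_per_room"]
cat_one_hot_attribs = list(
    full_pipeline.named_transformers_["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs

for score, name in sorted(zip(feature_importances, attributes), reverse=True):
    print(f"{score:.4f}  {name}")
```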
- ✅ Production-ready preprocessing pipeline with automated transformations
- ✅ Robust stratified sampling ensuring representative data splits
- ✅ Advanced feature engineering with custom transformers
- ✅ Comprehensive model comparison with statistical validation
- ✅ Optimized hyperparameters through systematic search
- ✅ Statistical rigor with confidence intervals and error analysis
- Python 3.x
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- scipy (for statistical analysis)
```
├── Chapter2.ipynb    # Complete ML pipeline implementation
├── housing.csv       # California housing dataset (20,640 records)
└── README.md         # This documentation
```
- Run the notebook: Execute `Chapter2.ipynb` cells sequentially
- Model training: The pipeline automatically trains and compares multiple models
- Best model: Random Forest with optimized hyperparameters is selected
- Predictions: Use `final_model.predict()` with preprocessed data (see the sketch below)
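A minimal prediction sketch, assuming `new_data` is a hypothetical DataFrame with the same raw columns as `housing.csv` (minus the target) and `final_model` is the tuned Random Forest from the notebook:

```python
# Raw data must pass through the same preprocessing pipeline before prediction.
new_data_prepared = full_pipeline.transform(new_data)
predictions = final_model.predict(new_data_prepared)
```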
- Stratified sampling is crucial for representative train-test splits
- Feature engineering significantly improves model performance
- Random Forest outperforms simpler models for this regression task
- Hyperparameter optimization provides substantial performance gains
- Cross-validation is essential for reliable model evaluation
- Custom transformers enable reusable preprocessing pipelines
This project demonstrates a complete, production-ready machine learning pipeline following industry best practices for the data science workflow.