🏠 California Housing Price Prediction

Capstone Project – Purwadhika JCDS 3004-009 (Module 3: Machine Learning)

Author: Bonifasius Sinurat

Overview

This project builds a machine learning model to predict the median house price in California using the 1990 census dataset. Beyond accuracy, the project translates predictions into business actions: ROI analysis, negotiation range, and key price drivers for decision makers.

Stakeholders: Chief Marketing Officer (CMO) & Sales Director

Goal: Reliable pricing support for faster decisions and better margins.

Repository Structure

California House/ – dataset
catboost_info/ – CatBoost training logs
png/ – images used in docs/README
- actual_vs_predicted.png
- feature_importances.png
- pipeline.png
app.py – Streamlit app
CA_housing_price_regressor.sav – trained model (pickle)
California Housing Price Prediction.pdf – slides
capstone_m3_machine_learning.ipynb – main notebook
requirements.txt – Python dependencies
runtime.txt – Python version for Streamlit (e.g., 3.11)

Dataset

Source: Pace & Barry (1997), California Housing (1990 Census)
Rows: 14,448 districts
Target: Median House Values
Features: 9 numerical + 1 categorical (oceanProx)
Notes:
- Censoring at $500,001 for high-end prices
- medIncome is recorded in tens of thousands USD (e.g., 3.5 → $35,000)

Methodology

Preprocessing & Feature Engineering

Missing values: KNNImputer (137 NaN)
Outliers: Winsorizer (Gaussian, fold 2.5)
Scaling: RobustScaler
Encoding: Binary / Ordinal / One-Hot
Target transform: log ↔ exp (via TransformedTargetRegressor)
New features:
- roomsPerHouseholds, bedroomsPerRoom, popPerHouseholds
- isManyRooms (rooms > 95th percentile)
- housingAgeBin (young/middle/old)
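
The steps above can be sketched roughly as follows. This is an illustration, not the notebook's code: the raw column names (`totalRooms`, `totalBedrooms`, `population`, `households`, `housingAge`), the age bin edges, and the use of `Ridge` as a lightweight stand-in for CatBoost are all assumptions, and the Winsorizer and categorical encoding steps are left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the ratio, flag, and bin features listed above.

    Raw column names and bin edges are assumptions for illustration.
    """
    out = df.copy()
    out["roomsPerHouseholds"] = out["totalRooms"] / out["households"]
    out["bedroomsPerRoom"] = out["totalBedrooms"] / out["totalRooms"]
    out["popPerHouseholds"] = out["population"] / out["households"]
    # Flag districts above the 95th percentile of room counts.
    # In a real split, compute this threshold on the training set only.
    threshold = out["totalRooms"].quantile(0.95)
    out["isManyRooms"] = (out["totalRooms"] > threshold).astype(int)
    # Hypothetical bin edges: <=15 young, 16-35 middle, >35 old.
    out["housingAgeBin"] = pd.cut(
        out["housingAge"],
        bins=[-np.inf, 15, 35, np.inf],
        labels=["young", "middle", "old"],
    )
    return out


def build_model() -> TransformedTargetRegressor:
    """Impute, scale, and fit on log(price); predictions are mapped back with exp."""
    steps = make_pipeline(KNNImputer(n_neighbors=5), RobustScaler(), Ridge(alpha=1.0))
    return TransformedTargetRegressor(regressor=steps, func=np.log, inverse_func=np.exp)


# Synthetic demo data (made up, not the census dataset).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "totalRooms": rng.integers(200, 5000, n).astype(float),
    "totalBedrooms": rng.integers(50, 1000, n).astype(float),
    "population": rng.integers(100, 4000, n).astype(float),
    "households": rng.integers(50, 1500, n).astype(float),
    "housingAge": rng.integers(1, 52, n).astype(float),
    "medIncome": rng.uniform(0.5, 15.0, n),
})
demo.loc[demo.index % 20 == 0, "totalBedrooms"] = np.nan  # simulate missing values
price = 50_000 + 30_000 * demo["medIncome"] + rng.normal(0, 5_000, n)

features = add_features(demo)
numeric_cols = features.select_dtypes(include="number").columns
model = build_model().fit(features[numeric_cols], price)
preds = model.predict(features[numeric_cols])
```

Because the regressor is trained on the log of the price and predictions are exponentiated, the output is always positive, which suits a price target.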

Models Evaluated

Linear, Ridge, Lasso, KNN, Decision Tree, Random Forest, XGBoost, CatBoost

Best Model – CatBoost

MAE: $27,533
MAPE: 14.97% (meets ≤ 15% target)
RMSLE: 0.2113
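
For reference, the three metrics can be computed as in the sketch below. The exact definitions used in the notebook are not shown here, so this assumes the common conventions: MAPE as a percentage of the actual value, and RMSLE based on `log(1 + x)`. The sample prices are made up.

```python
import math


def mae(actual, predicted):
    """Mean absolute error, in the same units as the target (USD)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, as a percentage of the actual value."""
    return 100 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual, predicted):
    """Root mean squared log error, using log(1 + x) on both sides."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up example values, not project results.
actual = [200_000, 300_000]
predicted = [220_000, 270_000]
print(mae(actual, predicted), mape(actual, predicted), rmsle(actual, predicted))
```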

Top Drivers (Feature Importance)

Ocean Proximity (~25.1%)
Median Income (~21.6%)
Latitude (~14%)
Longitude (~14%)

Visuals

Pipeline: png/pipeline.png

Feature Importance: png/feature_importances.png

Actual vs Predicted: png/actual_vs_predicted.png

Business Insights

Negotiation Anchor (around a $250,000 prediction)
- MAE-based range: $222,467 – $277,533
- MAPE-based range: $212,575 – $287,425
ROI Check (example scenario)
- Total investment: $236,000
- Selling fee: 6%
- ROI target: ≥ 10% (profitability threshold)
- → No deal below $276,170 (= $236,000 × 1.10 ÷ (1 − 0.06))
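
The figures above can be reproduced with a few lines of arithmetic. Both ranges listed are centered on a point prediction of $250,000 with the reported MAE ($27,533) and MAPE (14.97%). The ROI formula is an assumption, chosen because it reproduces the stated floor price: ROI = (sale price × (1 − fee) − investment) ÷ investment.

```python
def mae_range(prediction, mae):
    """Negotiation range of +/- one MAE around the predicted price."""
    return prediction - mae, prediction + mae


def mape_range(prediction, mape_pct):
    """Negotiation range of +/- MAPE percent around the predicted price."""
    margin = prediction * mape_pct / 100
    return prediction - margin, prediction + margin


def min_sale_price(investment, selling_fee, roi_target):
    """Lowest sale price that still meets the ROI target after the selling fee.

    Assumed ROI definition: (price * (1 - fee) - investment) / investment.
    """
    return investment * (1 + roi_target) / (1 - selling_fee)


low_mae, high_mae = mae_range(250_000, 27_533)
low_mape, high_mape = mape_range(250_000, 14.97)
floor_price = min_sale_price(236_000, selling_fee=0.06, roi_target=0.10)
print(low_mae, high_mae, round(low_mape), round(high_mape), round(floor_price))
```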

Actionable ROI Framework

| ROI Range | Decision | Description |
| --- | --- | --- |
| < 8% | ❌ Reject | Below minimum margin |
| 8–10% | ⚠️ Review | Requires manual validation |
| ≥ 10% | ✅ Proceed | Meets business target |
| ≥ 14% | 💰 Prioritize | High-margin opportunity |
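
As a decision rule, the framework above could be encoded like this. The function name is hypothetical, and the boundary handling is one possible reading of the table: exactly 10% counts as Proceed, exactly 8% as Review, and 14% or more as Prioritize.

```python
def roi_decision(roi_pct: float) -> str:
    """Map an ROI percentage to a decision label, checking the highest tier first."""
    if roi_pct >= 14:
        return "Prioritize"
    if roi_pct >= 10:
        return "Proceed"
    if roi_pct >= 8:
        return "Review"
    return "Reject"


for roi in (5.0, 9.0, 10.0, 15.5):
    print(roi, roi_decision(roi))
```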

Financial Impact
- At a 6% selling fee, each $1 of prediction error shifts commission by ≈ $0.06.
- Reducing MAE by $3,223 (CatBoost vs RandomForest) → ≈ $193K annual revenue uplift ($3,223 × 6% × 1,000 transactions).
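
The uplift estimate is simple arithmetic, sketched below. It rests on the assumption that the full MAE reduction carries through to the price on which commission is earned, for every transaction.

```python
def annual_commission_uplift(mae_reduction, commission_rate, transactions):
    """Commission protected per year if each deal is priced closer by `mae_reduction`."""
    return mae_reduction * commission_rate * transactions


uplift = annual_commission_uplift(3_223, commission_rate=0.06, transactions=1_000)
print(round(uplift))
```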

Business Takeaway

With a consistent ROI-based decision rule, the model enables property investors and sales teams to act with confidence. Every $1,000 improvement in prediction accuracy protects roughly $60 in commission, turning data accuracy into profit efficiency.

How to Run (Local)

Install dependencies:
```
pip install -r requirements.txt
```
Run the notebook:

Open capstone_m3_machine_learning.ipynb in Jupyter/VS Code and run all cells.
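
To reuse the trained model outside the notebook, a minimal sketch follows. It assumes the `.sav` file is a standard pickle, as the repository structure indicates, and that the packages from requirements.txt are installed so the object can be restored. Only unpickle files you trust.

```python
import pickle
from pathlib import Path

MODEL_PATH = Path("CA_housing_price_regressor.sav")


def load_model(path=MODEL_PATH):
    """Load the pickled regressor from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Usage (input_df must have the same columns the model was trained on):
# model = load_model()
# predicted_prices = model.predict(input_df)
```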

Limitations & Next Steps

Error grows on luxury/high-end prices due to censoring and limited samples
Add geospatial signals (distance to coast/city center/POIs)
Expand high-end segment coverage
Consider quantile regression / conformal prediction for price intervals
Regular retraining & validation; use SHAP for explainability
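
As an illustration of the conformal prediction idea mentioned above, here is a minimal split-conformal sketch. It is not part of the current project: it assumes a held-out calibration set with actual and predicted prices, and the numbers are made up.

```python
import math


def conformal_half_width(calib_actual, calib_pred, alpha=0.1):
    """Half-width of a (1 - alpha) interval from absolute calibration residuals."""
    residuals = sorted(abs(a - p) for a, p in zip(calib_actual, calib_pred))
    n = len(residuals)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return residuals[rank - 1]


def prediction_interval(prediction, half_width):
    """Symmetric price interval around a point prediction."""
    return prediction - half_width, prediction + half_width


# Toy calibration set (made-up numbers, not project results).
calib_actual = [180_000, 240_000, 310_000, 150_000]
calib_pred = [170_000, 260_000, 280_000, 155_000]
half_width = conformal_half_width(calib_actual, calib_pred, alpha=0.5)
low, high = prediction_interval(250_000, half_width)
print(low, high)
```

Unlike the MAE- and MAPE-based ranges, an interval built this way comes with a coverage guarantee on new data, provided the calibration set is exchangeable with it.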

Deployment

This repo includes runtime.txt (e.g., python-3.11) for Streamlit Cloud.

Live App: California Housing Price Predictor

License

Educational use for Purwadhika Module 3 – Machine Learning capstone project.

For other uses, please provide proper attribution.
