Capstone Project – Purwadhika JCDS 3004-009 (Module 3: Machine Learning)
Author: Bonifasius Sinurat
This project builds a machine learning model to predict the median house price in California using the 1990 census dataset. Beyond accuracy, the project translates predictions into business actions: ROI analysis, negotiation range, and key price drivers for decision makers.
Stakeholders: Chief Marketing Officer (CMO) & Sales Director
Goal: Reliable pricing support for faster decisions and better margins.
- California House/ – dataset
- catboost_info/ – CatBoost training logs
- png/ – images used in docs/README
  - actual_vs_predicted.png
  - feature_importances.png
  - pipeline.png
- app.py – Streamlit app
- CA_housing_price_regressor.sav – trained model (pickle)
- California Housing Price Prediction.pdf – slides
- capstone_m3_machine_learning.ipynb – main notebook
- requirements.txt – dependencies (or requirements)
- runtime.txt – Python version for Streamlit (e.g., 3.11)
- Source: Pace & Barry (1997), California Housing (1990 Census)
- Rows: 14,448 districts
- Target: Median House Values
- Features: 9 numerical + 1 categorical (oceanProx)
- Notes:
  - Censoring at $500,001 for high-end prices
  - medIncome is recorded in tens of thousands USD (e.g., 3.5 → $35,000)
- Missing values: KNNImputer (137 NaN)
- Outliers: Winsorizer (Gaussian, fold 2.5)
- Scaling: RobustScaler
- Encoding: Binary / Ordinal / One-Hot
- Target transform: log ↔ exp (via TransformedTargetRegressor)
- New features:
  - roomsPerHouseholds, bedroomsPerRoom, popPerHouseholds
  - isManyRooms (rooms > 95th percentile)
  - housingAgeBin (young/middle/old)
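As a rough illustration of how the imputation, scaling, and log ↔ exp target transform listed above can be chained, here is a minimal sketch on synthetic data. It is not the notebook's actual pipeline: the Winsorizer, encoders, and engineered features are left out, `Ridge` stands in for CatBoost to keep the example light, and the data and variable names are made up for the demo.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (assumption): 3 numeric features, a positive skewed target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(12 + 0.3 * X[:, 0] + rng.normal(scale=0.1, size=200))

# Knock out about 5% of the feature values so the imputer has work to do.
X[rng.random(X.shape) < 0.05] = np.nan

# Features: impute -> robust-scale -> regress. Ridge is a placeholder regressor.
feature_pipeline = make_pipeline(KNNImputer(n_neighbors=5), RobustScaler(), Ridge())

# Target: fit on log(y), then map predictions back to dollars with exp.
model = TransformedTargetRegressor(
    regressor=feature_pipeline,
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)
pred = model.predict(X[:5])
print(pred)
```

Because the inverse transform is `exp`, predictions come back on the original price scale and are always positive.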
Linear, Ridge, Lasso, KNN, Decision Tree, Random Forest, XGBoost, CatBoost
Best Model – CatBoost
- MAE: $27,533
- MAPE: 14.97% (meets ≤ 15% target)
- RMSLE: 0.2113
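For readers unfamiliar with the three metrics, the sketch below shows one common way to compute them in plain Python. The function names and toy values are illustrative assumptions; the notebook may rely on library implementations instead.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (USD)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.15 means 15%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Toy values for illustration only, not taken from the project's test set.
actual = [200_000, 350_000, 120_000]
predicted = [180_000, 385_000, 126_000]

print(f"MAE:   ${mae(actual, predicted):,.0f}")
print(f"MAPE:  {mape(actual, predicted):.2%}")
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```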
Top Drivers (Feature Importance)
- Ocean Proximity (~25.1%)
- Median Income (~21.6%)
- Latitude (~14%)
- Longitude (~14%)
Negotiation Anchor (around a $250,000 prediction)
- MAE-based range: $222,467 – $277,533
- MAPE-based range: $212,575 – $287,425
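The two ranges can be reproduced from the reported MAE ($27,533) and MAPE (14.97%). A minimal sketch follows, assuming a symmetric band around the point prediction; the listed figures correspond to a central prediction of $250,000. The helper names are hypothetical.

```python
def mae_range(prediction, mae):
    """Absolute band: prediction plus or minus the mean absolute error."""
    return prediction - mae, prediction + mae

def mape_range(prediction, mape):
    """Relative band: prediction scaled down and up by the MAPE fraction."""
    return prediction * (1 - mape), prediction * (1 + mape)

prediction = 250_000   # central prediction implied by the listed ranges
MAE = 27_533           # reported CatBoost MAE in USD
MAPE = 0.1497          # reported CatBoost MAPE as a fraction

low_mae, high_mae = mae_range(prediction, MAE)
low_mape, high_mape = mape_range(prediction, MAPE)

print(f"MAE-based range:  ${low_mae:,.0f} to ${high_mae:,.0f}")
print(f"MAPE-based range: ${low_mape:,.0f} to ${high_mape:,.0f}")
```

The MAPE-based band widens as the predicted price grows, while the MAE-based band keeps a fixed dollar width.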
ROI Check (example scenario)
- Total investment: $236,000
- Selling fee: 6%
- ROI target: ≥ 10% (profitability threshold)
- → No Deal Below $276,170
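The minimum selling price follows from the scenario's numbers if ROI is defined as net proceeds after the selling fee, minus the investment, divided by the investment. That definition is an assumption here, though it does reproduce the $276,170 threshold. A minimal sketch with hypothetical function names:

```python
def roi(selling_price, investment, fee_rate):
    """ROI after the selling fee is deducted from the sale price."""
    net_proceeds = selling_price * (1 - fee_rate)
    return (net_proceeds - investment) / investment

def min_selling_price(investment, fee_rate, roi_target):
    """Lowest sale price that still reaches the ROI target."""
    return investment * (1 + roi_target) / (1 - fee_rate)

investment = 236_000
fee_rate = 0.06
roi_target = 0.10

threshold = min_selling_price(investment, fee_rate, roi_target)
print(f"No deal below ${threshold:,.0f}")
print(f"ROI at threshold: {roi(threshold, investment, fee_rate):.2%}")
```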
Actionable ROI Framework
| ROI Range | Decision | Description |
| --- | --- | --- |
| < 8% | ❌ Reject | Below minimum margin |
| 8–10% | ⚠️ Review | Requires manual validation |
| ≥ 10% | ✅ Proceed | Meets business target |
| ≥ 14% | 💰 Prioritize | High-margin opportunity |
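The framework above can be expressed as a small decision function. The sketch assumes the bands are checked from highest to lowest, so an ROI of 14% or more is labelled Prioritize rather than Proceed, and an ROI of exactly 10% counts as Proceed. The function name is hypothetical.

```python
def roi_decision(roi):
    """Map an ROI fraction (0.10 means 10%) to the framework's decision label."""
    if roi >= 0.14:
        return "Prioritize"   # high-margin opportunity
    if roi >= 0.10:
        return "Proceed"      # meets business target
    if roi >= 0.08:
        return "Review"       # requires manual validation
    return "Reject"           # below minimum margin

for value in (0.05, 0.09, 0.10, 0.15):
    print(f"{value:.0%}: {roi_decision(value)}")
```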
Financial Impact
- Each $1 prediction error affects ≈ $0.06 in commission.
- Reducing MAE by $3,223 (CatBoost vs RandomForest) → ≈ $193K annual revenue uplift (based on 1,000 transactions).
With a consistent ROI-based decision rule, the model enables property investors and sales teams to act with confidence. Every $1,000 improvement in prediction accuracy protects roughly $60 in commission, turning data accuracy into profit efficiency.
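The uplift figure can be checked with a few lines of arithmetic. This sketch assumes the 6% selling fee is the commission rate and that the MAE reduction applies to each of the 1,000 transactions; the variable names are illustrative.

```python
commission_rate = 0.06      # 6% selling fee, treated here as the commission rate
mae_reduction = 3_223       # USD improvement in MAE, CatBoost vs RandomForest
transactions_per_year = 1_000

# Commission exposed per $1 of prediction error equals the commission rate.
uplift_per_transaction = mae_reduction * commission_rate
annual_uplift = uplift_per_transaction * transactions_per_year

print(f"Per transaction: ${uplift_per_transaction:,.2f}")
print(f"Annual uplift:   ${annual_uplift:,.0f}")
```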
- Install dependencies: pip install -r requirements.txt
- Run the notebook: open capstone_m3_machine_learning.ipynb in Jupyter/VS Code and run all cells.
- Error grows on luxury/high-end prices due to censoring and limited samples
- Add geospatial signals (distance to coast/city center/POIs)
- Expand high-end segment coverage
- Consider quantile regression / conformal prediction for price intervals
- Regular retraining & validation; use SHAP for explainability
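To make the conformal prediction idea concrete, here is a minimal sketch of split conformal intervals in plain Python. It is not part of the current project: the calibration values are invented, and the function names are hypothetical. The half-width is the finite-sample-corrected quantile of absolute residuals on a held-out calibration set.

```python
import math

def conformal_halfwidth(cal_actual, cal_pred, alpha=0.1):
    """Half-width giving roughly (1 - alpha) coverage under exchangeability."""
    residuals = sorted(abs(a - p) for a, p in zip(cal_actual, cal_pred))
    n = len(residuals)
    # Rank of the conformal quantile, with the (n + 1) finite-sample correction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return residuals[k - 1]

def price_interval(prediction, halfwidth):
    """Symmetric interval around a point prediction."""
    return prediction - halfwidth, prediction + halfwidth

# Invented calibration data for illustration only.
cal_actual = [210_000, 180_000, 320_000, 150_000, 275_000, 400_000, 230_000]
cal_pred = [200_000, 195_000, 300_000, 155_000, 290_000, 360_000, 228_000]

halfwidth = conformal_halfwidth(cal_actual, cal_pred, alpha=0.25)
low, high = price_interval(250_000, halfwidth)
print(f"Interval: ${low:,.0f} to ${high:,.0f}")
```

Unlike the fixed MAE band, the width here is tied to a chosen coverage level, and it can be computed per price segment if high-end errors need a wider interval.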
This repo includes runtime.txt (e.g., python-3.11) for Streamlit Cloud.
Live App: California Housing Price Predictor
Educational use for Purwadhika Module 3 – Machine Learning capstone project.
For other uses, please provide proper attribution.


