
This initiative focuses on predicting housing prices using regression modeling, creating a regression model specifically designed to provide millennials with an easy-to-use tool to estimate the appropriate cost of a house.


sean-atkinson/machine_learning_real_estate_forecasting


Machine Learning and Real Estate Forecasting

Table of Contents


Problem Statement
Executive Summary
Models and Conclusion
Data Dictionary
Datasets and Libraries Used

Problem Statement

Many of us believe that homeownership is a natural thing we all should aspire to. It is a sign that our hard work has paid off. It’s our way of saying to the world that we made it.

Unfortunately, this dream is becoming more fantasy than reality for millennials. While 89% of millennials want to own a home, it will take a staggering two decades before 67% of them can afford one.

Still, even with the odds against them, many millennials are pursuing homeownership. Sadly, even if millennials do somehow overcome those odds and purchase a home, many still aren’t finding the security they hoped they’d find through owning a home.

Two-thirds of millennials have home-buying regrets and are more likely than the generations before them to think a home isn’t a good investment or that they overpaid for their house.

That’s where we at MyFirstHome come in.

Our goal is simple: provide millennial homebuyers with an easy-to-use tool that answers the question, how much should a home cost?

Not how much a realtor says a house should cost.

Not how much a house with a porch you really like should cost.

And not how much a house should cost because one two blocks away sold for $60K above asking.

Our hope is that this will help them make better purchasing decisions in the future.

Executive Summary

(Back to table of contents)
Since this is a labor of love for our fellow millennials rather than a for-profit venture, we wanted to build it on easily accessible information. We aim to have a turnkey process that can be used across North America.

And so, for our test case, we've picked Ames, Iowa.

Why Ames?

  1. The data is publicly available.

  2. With a population of around 66,000, there is enough data to work with for a test case, but not so much that we’d be overwhelmed.

  3. Its “ridiculously friendly people” helped to lift it to a top 15 rating in Livability.com’s Top 100 Best Places to Live, so it’s a place we could actually see millennials living in.

The data set we worked with had 80 features and 2051 observations.

You can read an explanation of all the original features here.

While it was nice to have 80 features to work with, our goal, above all else, is simplicity.

This is a labor of love for our fellow millennials, with an aim to help, not profit from them. We are not looking to charge exorbitant subscription fees or depend heavily on outside investment.

Since that is the case, we wanted to create a model that focused on essential features with information that is easy to obtain and does not depend on a “special sauce” that might not be easily reproducible from one location to another.

We tried four general types of models:

  • Linear Regression
  • Lasso
  • Ridge
  • KNN

While the results varied from model to model, we generally saw a final test R² score of 91% to 92%.

The exception was KNN, whose final R² score was on average 6 to 7 percentage points lower than those of the other three model types we tested.
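For readers unfamiliar with the metric, here is a minimal sketch of how an R² score is computed: one minus the ratio of the squared prediction errors to the squared deviations from the mean. The function name and the four prices below are made up for illustration and are not taken from the Ames data.

```python
def r2_score_manual(actual, predicted):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    # Sum of squared errors between the actual and predicted values.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Sum of squared deviations of the actual values from their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Illustrative sale prices only (not real observations).
actual_prices = [150_000, 200_000, 250_000, 300_000]
predicted_prices = [160_000, 190_000, 260_000, 290_000]

score = r2_score_manual(actual_prices, predicted_prices)
print(f"R2: {score:.3f}")
```

A score of 1.0 means the predictions match the actual prices exactly, while a score of 0 means the model does no better than always predicting the mean price.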

Models

(Back to table of contents)

| Model | Parameters | CV Score | Training Score | Test Score |
| --- | --- | --- | --- | --- |
| Linear Regression | Polynomial Features(include_bias=False), Standard Scaler, poly degree: 2, lr fit intercept: True | 88.2% | 90.5% | 91.4% |
| Lasso | Polynomial Features(include_bias=False), Standard Scaler, LassoCV, poly degree: 2, lasso eps: 0.001, lasso max iter: 1000, lasso n_alphas: 100, lasso normalize: False, lasso tol: 0.0001 | 88.9% | 89.7% | 92.1% |
| Ridge | Polynomial Features(include_bias=False), Standard Scaler, RidgeCV, poly degree: 2, poly interaction_only: False, ridge fit intercept: True | 88.9% | 89.7% | 92.1% |
| KNN | Polynomial Features(include_bias=False), Standard Scaler, KNeighborsClassifier, poly degree: 2, knn metric: minkowski, knn n_neighbors: 5, knn p: 2, knn weights: uniform | 82.8% | 88.8% | 85.0% |
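As a rough illustration of how the three score columns can be produced, the sketch below chains the steps named in the Ridge row (polynomial features, standard scaling, RidgeCV) in a scikit-learn `Pipeline`. It is not our exact notebook code: the data is synthetic, and the `alphas` grid, the 5-fold CV, and the random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (NOT the Ames data): 300 rows, 4 numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same step order as the Ridge row: polynomial expansion, scaling, then RidgeCV.
ridge_pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge", RidgeCV(alphas=np.logspace(-3, 3, 13))),  # assumed alpha grid
])

# CV score: mean R^2 over 5 folds of the training data.
cv_score = cross_val_score(ridge_pipe, X_train, y_train, cv=5).mean()

# Training and test scores: R^2 after fitting on the full training split.
ridge_pipe.fit(X_train, y_train)
train_score = ridge_pipe.score(X_train, y_train)
test_score = ridge_pipe.score(X_test, y_test)

print(f"CV R2: {cv_score:.3f}  train R2: {train_score:.3f}  test R2: {test_score:.3f}")
```

Wrapping the scaler and the polynomial step inside the pipeline means they are refit on each training fold, so the cross-validation score is not inflated by information leaking from the held-out fold.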

Although the two are tied at the precision shown in the table, Ridge performed slightly better than Lasso, so that is the model we chose in the end.

Our final model's results:

Ridge: Actual vs Predicted

Our model struggles a bit at higher price points, but this is not something we are particularly concerned about: since millennials on average have $87,448, most simply do not have the budget for those price points.

What's more, while our tool can be a useful guide, once you get to those price points we would advise you to seek out more specialized help.

Data Dictionary

(Back to table of contents)

| Feature | Type | Engineered | Description |
| --- | --- | --- | --- |
| bsmt_exposure | ordinal | No | Refers to walkout or garden level walls (range: 0-4) |
| bsmt_quality | ordinal | No | Evaluates the general condition of the basement (range: 0-5) |
| bsmtfin_type_1 | ordinal | No | Rating of basement finished area (range: 0-6) |
| central_air | nominal | No | Whether a property has central air conditioning (range: 0-1) |
| exter_qual | ordinal | No | Evaluates the quality of the material on the exterior (range: 1-5) |
| fireplace_qu | ordinal | No | Evaluates fireplace quality (range: 0-5) |
| garage_finish | ordinal | No | Evaluates the interior finish of the garage (range: 0-3) |
| garage_qual | ordinal | No | Evaluates the quality of the garage (range: 0-5) |
| heating_qc | ordinal | No | Evaluates the quality and condition of the heating (range: 1-5) |
| kitchen_qual | ordinal | No | Evaluates the quality of the kitchen (range: 1-5) |
| mas_vnr_type | nominal | No | Masonry veneer type (range: 1-5) |
| overall_qual | ordinal | No | Rates the overall material and finish of the house (range: 1-10) |
| paved_drive | ordinal | No | Whether a property has a paved driveway (range: 0-1) |
| total_sf | float | Yes | Combined square footage of 1st floor, 2nd floor, basement area, open porch area, and wood deck area |
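To make the dictionary concrete, here is a small sketch of how the engineered `total_sf` column and one of the ordinal columns could be built with pandas. The raw area column names, the label-to-score mapping, and the two sample rows are assumptions for illustration; the names in the actual dataset may differ.

```python
import pandas as pd

# Hypothetical raw column names for the five areas that make up total_sf.
AREA_COLUMNS = [
    "1st_flr_sf", "2nd_flr_sf", "total_bsmt_sf", "open_porch_sf", "wood_deck_sf",
]

# Assumed mapping from quality labels to the 1-5 ordinal scale.
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def add_features(df):
    """Return a copy of df with total_sf added and kitchen_qual encoded as 1-5."""
    out = df.copy()
    # Treat a missing area (for example, no basement) as zero square feet.
    out["total_sf"] = out[AREA_COLUMNS].fillna(0).sum(axis=1).astype(float)
    out["kitchen_qual"] = out["kitchen_qual"].map(QUALITY_SCALE)
    return out


# Two made-up homes; the second has no recorded basement area.
homes = pd.DataFrame({
    "1st_flr_sf": [1000, 1200],
    "2nd_flr_sf": [500, 0],
    "total_bsmt_sf": [800, None],
    "open_porch_sf": [40, 0],
    "wood_deck_sf": [120, 0],
    "kitchen_qual": ["Gd", "TA"],
})

features = add_features(homes)
print(features[["total_sf", "kitchen_qual"]])
```

Filling missing areas with zero is one possible choice; dropping those rows instead would be equally reasonable depending on how the gaps arose.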

Datasets and Libraries Used

(Back to table of contents)

Datasets:

Libraries: matplotlib, numpy, pandas, seaborn, and sklearn.
