
czarina-ds/regression-analysis-for-estimating-prices

 
 

Regression Analysis for Estimating Prices

Overview

Business and Data Understanding

Real estate agencies in King County, Washington may be able to improve their advisory services by identifying the features that matter most in home valuation. Doing so would let real estate agents quote more accurate prices to clients, backed by historical records. Using publicly available data, I describe patterns in real estate transactions, including the features that likely drive prices.

The King County data spans one year, between 2014 and 2015, with over 21,000 real estate transactions and 21 columns, one of which, price, is the dependent variable to predict. A full description of every column is provided alongside the other data files in the repository's data folder.

Exploratory Data Analysis

The number of houses sold per month peaked in spring and summer, then declined over the following months to its lowest point in January. Monthly house prices followed roughly the same pattern.
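A minimal sketch of how monthly sales counts and prices could be tallied, using only the standard library. The sample records, the `YYYYMMDD` date strings, the helper name `monthly_summary`, and the choice of the median as the monthly price are all assumptions for illustration; the notebook's own aggregation may differ.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical (date, price) records; not taken from the King County data.
sales = [
    ("20140502", 450000.0),
    ("20140520", 510000.0),
    ("20140715", 480000.0),
    ("20150110", 390000.0),
]

def monthly_summary(records):
    """Group sales by (year, month); return the count and median price per month."""
    prices_by_month = defaultdict(list)
    for date_str, price in records:
        sold_on = datetime.strptime(date_str, "%Y%m%d")
        prices_by_month[(sold_on.year, sold_on.month)].append(price)
    return {
        month: {"count": len(prices), "median_price": median(prices)}
        for month, prices in sorted(prices_by_month.items())
    }

summary = monthly_summary(sales)
for month, stats in summary.items():
    print(month, stats)
```

Plotting the `count` and `median_price` values against the month keys would show whether both series rise and fall together.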

Price is strongly correlated with sqft_living, grade, sqft_living15, and bathrooms.
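As an illustration of what "strongly correlated" measures, here is a small sketch of the Pearson correlation coefficient in plain Python. The toy values below are invented and the function name `pearson_r` is hypothetical; the notebook most likely relies on a library routine for this.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values, not rows from the King County data.
sqft_living = [1000, 1500, 2000, 2500, 3000]
price = [300000, 410000, 500000, 640000, 700000]

r = pearson_r(sqft_living, price)
print(f"correlation between sqft_living and price: {r:.3f}")
```

A value near 1 means price tends to rise almost linearly with the feature; a value near 0 means no linear relationship.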

Let's visualize their relationships and distributions.

Geospatial Mapping

"Location, location, location"

Location is important in real estate and in analysis!

Let's map the data points.

The geographic concentrations reveal which parts of the county hold the more expensive houses, shown in darker colors, such as the island at the center. The most expensive houses, those sold for over a million dollars, sit in roughly the same spots as the dark dots.

Locate the highest priced houses in the data in the following map:

Interactive Maps

I created choropleth maps (map notebook) to further understand how house prices vary by location.

To interact with the maps, please use the notebook viewer.

Data Modeling and Results

Model Performance:

Baseline to Bestimate Model

Performance improved substantially from the baseline model, simple_lr, to poly_tuned_rf, our bestimate model:

  • from an R-squared of 0.47 to 0.88, and
  • from a Root Mean Squared Error of USD 217,000 to USD 107,000
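For readers unfamiliar with the two metrics, the sketch below computes R-squared and Root Mean Squared Error from scratch. The prices and predictions are invented for illustration, and the function names are hypothetical; this is not the notebook's evaluation code.

```python
from math import sqrt

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root Mean Squared Error, in the same units as the target (USD here)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices and predictions, not model output from the notebook.
actual = [300000.0, 450000.0, 600000.0, 750000.0]
predicted = [320000.0, 430000.0, 610000.0, 740000.0]

print(f"R-squared: {r_squared(actual, predicted):.3f}")
print(f"RMSE: USD {rmse(actual, predicted):,.0f}")
```

Because RMSE is expressed in dollars, the drop reported above can be read directly as a smaller typical prediction error.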

To visualize the difference, let's plot how far the baseline model's predictions fall from the actual prices versus how much closer the bestimate model's predictions are:

For advisory purposes, the five features with the highest mean feature importance in the model are:

  1. Square footage of living space
  2. Distance to Seattle
  3. Square footage of living space of the nearest 15 neighbors
  4. Distance to Redmond
  5. Total distance to both Seattle and Redmond
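The distance features in this list could be derived from each house's latitude and longitude. The sketch below uses the haversine great-circle formula, which is an assumption: the README does not say how the distances were computed. The city coordinates are approximate, and the names `haversine_km` and `distance_features` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Approximate city-center coordinates (assumed, not taken from the notebook).
SEATTLE = (47.6062, -122.3321)
REDMOND = (47.6740, -122.1215)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def distance_features(lat, lon):
    """Distances from one house to Seattle, to Redmond, and their sum."""
    to_seattle = haversine_km(lat, lon, *SEATTLE)
    to_redmond = haversine_km(lat, lon, *REDMOND)
    return {
        "dist_seattle_km": to_seattle,
        "dist_redmond_km": to_redmond,
        "dist_total_km": to_seattle + to_redmond,
    }

# Hypothetical house location for illustration.
features = distance_features(47.5112, -122.257)
print(features)
```

Keeping the two distances and their total as separate columns lets a tree-based model split on whichever one is most informative.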

The other important features that follow are the population density and population of the city, the overall grade of the house's construction and design, whether the house is on a waterfront, and the zipcode.

Model Deployment

Finally, I deployed the Random Forest regression model (demo) as a prototype of a client-facing Home Value Estimator application.


SOURCE CODE: Main Notebook

Contact

Feel free to contact me with any questions and connect with me on LinkedIn.
