LandingPad Realtors is a real estate business that helps families with school-aged children relocate to King County and find the perfect home to meet their family's needs. LandingPad provides potential homeowners with home purchase options within their ideal budget.
- Stakeholder: LandingPad Realtors
- Business Case: I have been hired by LandingPad to accurately predict housing prices within the King County housing market. Executives at LandingPad want to launch a multimedia campaign to reach their target audience of young families moving to the King County area, and they want a reliable model that can be refined over time as more information becomes available.
I will start by identifying the characteristics of homes that increase housing costs. The effect of each relevant feature will then be identified and communicated to the team at LandingPad. This project is grounded in a statistical analysis of house prices in the King County House Sales dataset and in building a multiple linear regression model that accurately predicts the sale price of a house in King County.
In this project I will use the CRISP-DM method. The data comes from the King County House Sales dataset, found in `kc_house_data.csv`.
The dataset can be found in the data folder of this repository along with a file called `column_names.md`, which describes the features within the dataset. More information about the features is available on the King County Assessor's website.
The King County House Sales Dataset includes sales data for 21,597 homes with 20 features including but not limited to:
Name | Description | Final Datatype | Numeric or Categorical | Target or Feature |
---|---|---|---|---|
id | Unique identifier for a house | int | Numeric | Feature |
date | Date house was sold | datetime | Numeric | Feature |
price | Sale price (prediction target) | int | Numeric | Target |
bedrooms | Number of bedrooms | int | Numeric | Feature |
bathrooms | Number of bathrooms | float | Numeric | Feature |
sqft_living | Square footage of living space in the home | int | Numeric | Feature |
sqft_lot | Square footage of the lot | int | Numeric | Feature |
floors | Number of floors (levels) in house | float | Numeric | Feature |
waterfront | Whether the house is on a waterfront | float | Categorical | Feature |
view | Quality of view from house | float | Categorical | Feature |
condition | How good the overall condition of the house is. Related to the maintenance of house | int | Numeric | Feature |
grade | Overall grade of the house. Related to the construction and design of the house | int | Numeric | Feature |
yr_built | Year when house was built | int | Numeric | Feature |
yr_renovated | Year when house was renovated | int | Numeric | Feature |
lat | Latitude coordinate | float | Numeric | Feature |
long | Longitude coordinate | float | Numeric | Feature |
Importing libraries at the beginning gives access to modules and other tools that make the tasks in this project manageable to implement. The main libraries used in this project are:
- `pandas`: a data analysis and manipulation library which allows for flexible reading, writing, and reshaping of data
- `numpy`: a key library that brings the computational power of languages like C to Python
- `matplotlib`: a comprehensive visualization library
- `seaborn`: a data visualization library based on matplotlib
Read in the data from `kc_house_data.csv` using `.read_csv()` from the pandas library.
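As a minimal sketch of this step, the snippet below wraps `pd.read_csv` in a small helper. The helper name `load_house_data` and the choice to parse `date` while loading are assumptions made for illustration, not necessarily what the notebook does.

```python
import pandas as pd


def load_house_data(path_or_buffer):
    """Read the King County sales CSV into a DataFrame.

    `load_house_data` is a hypothetical helper name. Parsing `date` at
    load time is an assumption, chosen to match the datetime type listed
    in the feature table.
    """
    return pd.read_csv(path_or_buffer, parse_dates=["date"])
```

Assuming the file sits in the data folder of the repository, `df = load_house_data("data/kc_house_data.csv")` followed by `df.info()` gives a first look at the columns and their datatypes.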
To clean the data, I typically address missing data, placeholders, and datatypes. This is the most important step of this project: if the data is not appropriate for the model, the results will be inherently inaccurate and the model will produce lackluster predictions.
To dig deeper into the data, I will:
- Review the datatypes found within the entire dataframe
- Address duplicates, missing and placeholder data
- Address incorrect or incongruous datatypes for the model
- Explore correlation between features
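The steps above could be sketched roughly as follows. The function names, the `numeric_cols` argument, and the idea that placeholders show up as non-numeric strings such as `"?"` are assumptions for illustration; the actual cleaning decisions depend on what the data review turns up.

```python
import pandas as pd


def clean_sales(df, numeric_cols):
    """Drop duplicate rows and coerce the given columns to numeric.

    Non-numeric placeholders (for example a hypothetical "?") become NaN,
    so they can be counted and handled together with missing values.
    """
    cleaned = df.drop_duplicates().copy()
    for col in numeric_cols:
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    return cleaned


def price_correlations(df, target="price"):
    """Correlation of each numeric feature with the target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)
```

After cleaning, `cleaned.dtypes` and `cleaned.isna().sum()` show the datatypes and the number of missing values per column, and `price_correlations(cleaned)` ranks the numeric features by the strength of their correlation with `price`.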
First, I set the dependent variable (`y`) to be the `price`. Then I chose the feature most highly correlated with price to be the baseline independent variable (`X`).
Finally, I followed this methodology:
- Build a linear regression using `statsmodels`
- Describe the overall model performance
- Interpret its coefficients.
This simple linear regression model is statistically significant overall, and explains 36.5% of the variance in house price. Both the intercept and the coefficient for sqft_living are statistically significant.
The intercept is a small negative number, meaning a home with 0 square feet of living space would cost around $0.
The coefficient for sqft_living is about 157, which means that for each additional square foot of living space, I expect the price to increase by about $157.
The results summary from the `statsmodels` ordinary least squares (OLS) model shows:
For the first question, I looked for correlations between attributes and used price as my target variable. I explored data related to this question using visualizations created with `seaborn`, `plotly express`, and `matplotlib`.
For the second question, I removed features with high p-values and features that were highly correlated with one another, and truncated the data so that it better met the assumptions of a linear regression model:
- linearity: the predictor features have a linear relationship with the target
- normality: the model residuals are approximately normally distributed (bell-shaped)
- homoscedasticity: the residuals have roughly constant variance across the range of predictions
- no multicollinearity: the predictor features are not highly correlated with one another

I explored data related to this question using visualizations created with `seaborn`, `plotly express`, and `matplotlib`.
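One way to spot candidate features for removal because of high correlation is to list every pair of features whose absolute correlation exceeds a cutoff. The sketch below does that with pandas; the function name and the 0.75 threshold are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(features, threshold=0.75):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = features.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is reported once
    # and the diagonal (self-correlation of 1.0) is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
```

Each entry in the result is indexed by a pair of feature names, so dropping one feature from each reported pair is a simple way to reduce multicollinearity before refitting the model.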
For the third question, I used data from the GreatSchools website and created a function that calculated the distance from each home to the closest school rated Above Average or higher (between 7 and 10, inclusive). Based on this analysis, I recommend that LandingPad:
- Curate a set of listings, using the interactive map, of homes with between 2 and 5 bedrooms.
- Use the interactive map to narrow down homes that have a minimum sale price of 470K dollars.
- Show families homes that are within 10 miles of an Above Average school. This will allow LandingPad to reach a broader set of home owners who are within its target market.
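The 10-mile criterion depends on the distance from a home to the nearest Above Average school. The sketch below shows one way such a distance could be computed; it is not the notebook's actual function. The haversine great-circle formula, the function names, and the `lat`, `long`, and `rating` keys used for each school are all assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def nearest_top_school_miles(home_lat, home_long, schools, min_rating=7):
    """Distance to the closest school rated min_rating to 10, or None.

    `schools` is assumed to be an iterable of dicts with hypothetical
    keys "lat", "long", and "rating".
    """
    distances = [
        haversine_miles(home_lat, home_long, s["lat"], s["long"])
        for s in schools
        if min_rating <= s["rating"] <= 10
    ]
    return min(distances) if distances else None
```

A home would then meet the recommendation when `nearest_top_school_miles(...)` returns a value of 10 or less.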
Moving forward, I would like to explore the effect of distance to local attractions (e.g., parks, third places, places of worship) on sale price in the King County dataset.
.
└── real_estate_linear_regression/
    ├── README.md
    ├── final_project_phase_2.ipynb
    ├── notebook.pdf
    ├── presentation.pdf
    ├── Images/
    └── .gitignore