Zillow first launched in 2006. Since then, our company has accumulated a living database of more than 110 million US homes. Not only do we keep track of homes on the market, but off the market as well. What do we do with this data? One value we create is the Zestimate® - the estimate of a home's market value.
The purpose of this project is determine what influences this value based on previous property data during the two months of high real estate demand - May and June.
We want to be able to predict :
- Key Drivers of market value for single unit properties
- Distribution of tax rates for each county
- Final model created to predict the values of single unit properties that the tax district assesses
- A reproducable Github repo containing
- Walkthrough of the DS pipeline within a jupyter notebook
- Acquire/Prepare modules in .py files
- In a seperate presentation, answers for the
- States and counties properties are located in
- Distribution of tax rates for each county
- Tableau Workbooks on:
How does the current Zestimate® perform? According to FreeStoneProperties,
"The median error for larger markets is usually around 2% of the sale price of the home. But the problem with Zestimates is that when they are wrong, they can be significantly wrong. For example, depending on the metro area, Zillow might be within 5% of the sale price only 62% of the time."
The main tables within the Zillow database are predictions_2016 and predictions_2017. These contain the observations of each house: the tax value and several different features of the house. The data definition table is below.
Feature | Definition | Type |
---|---|---|
parcelid | Assigned by your local tax assessment office, unique for each property | discrete |
calculatedfinishedsquarefeet | Square feet: length (ft.) x width (ft.) | continuous |
bedroomcnt | Number of bedrooms in a unit | discrete |
bathroomcnt | Number of bathrooms in a unit | discrete |
taxvaluedollarcnt | Property taxes of unit for the year | continuous |
fips | Federal Information Processing Standards (6037, 6059, 6111, or null) uniquely identify geographic areas | categorical |
latitude | Distance of a place north or south of the earth's equator | continuous |
longitude | Distance of a place east or west of the meridian | continuous |
propertycountylandusecode | Land use zones are the codes that the government uses to classify parcels of land (chars with numbers) | discrete |
propertyzoningdesc | Zoning refers to municipal or local laws or regulations that dictate how real property can and cannot be used in certain geographic areas (chars with numbers) | discrete |
The visual below takes a more in-depth look at the original database. We can see how the properties tables for 2016 and 2017 contain a majority of the data. These tables also have a whopping amount of 52 columns. Before prepping the data, we can use this visual to make ideas on which features we may not need, can be combined, etc.
The descriptive tables shown below have various ranges of values and corresponding meanings. Though we will not be joining these tables during the aquire stage, it has been added for reference. Theses values in the main predictions tables will need to be scaled so they have the same weight.
For better quality images, these visuals can also be found on my pinterest here.
Taking a surface look as the database, it is apparent feature engineering will be neccessary to create manageable exploration and modeling from the 52 columns. This may include...
- Creating simpler features, such as story count being 1, 2, or more
- Combining different features, such as
- Removing features with high null values
- Removing unnecessary features, such as long. and latt. when we also have region ids
- Scaling the final dataframe for modeling
More square footage increases value
Null hypothesis: House square footage is independent of the tax value
Alternative hypothesis: The larger the square feet, the greater the tax value
Older and younger houses have higher values (or could this be like a reverse bell curve? Old houses have historic value and new houses have value?)
Null hypothesis: The age of the house is independent of the tax value
Alternative hypothesis: Newer houses have higher tax values
Location affects property value
Null hypothesis: The location of the house is independent of the tax values
Alternative hypothesis: One county has higher tax values than the average
- Data was aquired from the Zillow SQL Database.
- Login credentials are required.
- The functions are stored in the Acquire.py file.
- File is a reproducible component for gathering the data.
- Created a prep.py file.
- Data is split into train, validate, and test.
- Dataset is ready to be analyzed.
- Data is scaled as necessary.
- Erroneous or invalid data is identified.
- File is a reproducible component that handles missing values, fixes data integrity issues, changes data types, and scales data.
- Run at least 1 t-test and 1 correlation test.
- Visualize all combinations of variables in some way(s).
- What independent variables are correlated with the dependent?
- Which independent variables are correlated with other independent variables?
Summarized your takeaways and conclusions.
- Developed a regression model that performs better than a baseline.
- Extablished a baseline model.
- Documented various algorithms and/or hyperparameters.
- Plotted the residuals, computed the evaluation metrics (SSE, RMSE, and/or MSE), compared to baseline, and plotted y by ^y.
- Created a model.py file as
- File is a reproducible component that will have the functions to fit, predict and evaluate the final model on the test data set.
Key Drivers of market value for single unit properties are (listed from highest correlation to lowest):
- Square footage of property
- Bathroom count
- Bedroom count
Observations:
- Most units are 3 bedrooms/ 2 bath
- Whether a property has a garage, it greatly affects that total square footage
- Three counties were listed in database:
- Los Angeles County
- Orange County
- Ventura County
- Orange County has the highest average tax value for properties
- Los Angeles County has the most amount of properties
All files are reproducible and available for download and use (need login credentials for access to Zillow company database).
- Read this README.md
- Download the Aquire.py, Prepare.py, and Final_Report.ipynb files into your working directory
- Run the Final_Report.ipynb notebook
- Have fun doing your own exploring, modeling, and more!
Bethany Thompson
Dani Bojado