Estimating Bay Area Rent Prices - Project Overview

Used Selenium to scrape data from over 12,000 apartment listings on apartments.com in the San Jose, Oakland and San Francisco areas
Cleaned the data and engineered features from the text descriptions of apartment amenities, applying NLP techniques to identify which amenities might be useful to include in the final model
Built an ML model that estimates rent prices (RMSE ~$365 on the test set) from inputs including number of bedrooms, number of bathrooms, square footage and amenities
Productionized the model and collaborated with a friend to create an interactive web app, which can be accessed here: http://3.128.33.149/

Packages Used and Sources Referenced

Python Version: 3.7
Packages:

Web Scraping: selenium, pandas, re
Data Cleaning/Feature Engineering: pandas, numpy, re, matplotlib, seaborn, sklearn, nltk
EDA/Model Building: pandas, numpy, matplotlib, seaborn, scipy, sklearn, xgboost
To Install Requirements to Run Pickled ML Model: pip install -r requirements.txt

Sources Referenced:

Tutorial on scraping Glassdoor using selenium
Selenium unofficial documentation
Guide on productionizing an ML model (used for reference on how to pickle and load an ML model)
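
For orientation, pickling and reloading a model generally follows the pattern below. This is a minimal sketch rather than the repository's code: the dict is a hypothetical stand-in for the fitted estimator, and the file name `model.pkl` is an assumption.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator; any picklable object works the same way.
model = {"name": "rent_estimator", "coefficients": [0.5, 1.25]}

# Assumed file name, written to a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")

    # Serialize the object to disk.
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a prediction script or web app would at startup.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
```

Unpickling can execute arbitrary code, so only load pickle files from a source you trust, and load them with the same package versions that were used to create them.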

Data Cleaning & Feature Engineering

Created 3 new features from address: city, neighborhood and zip code
Stripped text from numerical features (bedrooms, bathrooms, rent, square footage)
For listings that had a range for rent and/or square footage, converted the range into an average
Applied NLP techniques (bag of words) and fit a random forest regression model using just the bedroom size and raw text from the amenities column to gain insight on features that may be useful to extract from the amenities text
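
The text-stripping and range-averaging steps above could look roughly like the following. This is a sketch under assumptions: the helper name `parse_numeric` and the example strings are hypothetical, and the raw formats on apartments.com may differ.

```python
import re
from statistics import mean

def parse_numeric(text):
    """Return the numbers found in a listing field as one float, averaging a range."""
    # Drop thousands separators, then pull out every number in the string.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text.replace(",", ""))]
    if not numbers:
        return None  # no digits at all, e.g. "Call for Rent"
    # A single value is returned unchanged; a range becomes its average.
    return mean(numbers)

# Hypothetical raw values, for illustration only.
rent = parse_numeric("$2,500 - $3,100")
sqft = parse_numeric("850 sq ft")
baths = parse_numeric("1.5 ba")
missing = parse_numeric("Call for Rent")
```

Returning `None` for fields without digits keeps missing values explicit, so they can be dropped or imputed in a later step.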

Feature Importances from NLP Model
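
The bag-of-words encoding behind this step can be illustrated with the standard library alone. The amenity strings below are made up, and the random forest fit is left out; a matrix like this one, together with the bedroom feature, would be the kind of input such a model receives.

```python
import re
from collections import Counter

# Made-up amenity descriptions, for illustration only.
amenities = [
    "Washer/Dryer In Unit, Dishwasher, Pool",
    "Dishwasher, Fitness Center, Pool",
    "Washer/Dryer Hookup, Balcony",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# One Counter per listing: token -> number of occurrences.
counts = [Counter(tokenize(text)) for text in amenities]

# The sorted vocabulary fixes the column order of the document-term matrix.
vocabulary = sorted(set().union(*counts))

# Each row is one listing, each column one vocabulary token.
matrix = [[c[token] for token in vocabulary] for c in counts]
```

Tokens whose columns turn out to matter for predicting rent are candidates for dedicated amenity features.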

Exploratory Data Analysis

Performed more feature engineering during my exploratory data analysis:

Removed some of the outliers (some rent prices were $15,000+)
Chose to drop neighborhood feature and use only zip codes and cities
Consolidated cities that appeared less frequently into an “Other” category
Consolidated zip codes that appeared less frequently into a “City Name - Other” category
Created dummy variables for categorical features
Took natural log of rent and square footage columns to address positive skewness observed in distributions
Dropped some of the amenities features that were infrequent, unclear or less impactful
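
Three of the steps above (consolidating infrequent zip codes, creating dummy variables and taking the natural log of rent) can be sketched as follows. The rows, the `MIN_COUNT` threshold and the helper name are assumptions made for illustration; the notebook's actual cutoffs are not stated here.

```python
import math
from collections import Counter

# Made-up rows for illustration: (city, zip code, rent in dollars).
listings = [
    ("Oakland", "94612", 2400.0),
    ("Oakland", "94612", 2600.0),
    ("Oakland", "94607", 2100.0),
    ("San Jose", "95112", 2300.0),
    ("San Jose", "95112", 2500.0),
]

MIN_COUNT = 2  # hypothetical frequency threshold

zip_counts = Counter(zip_code for _, zip_code, _ in listings)

def consolidate_zip(city, zip_code):
    """Keep frequent zip codes; fold rare ones into a '<City> - Other' label."""
    return zip_code if zip_counts[zip_code] >= MIN_COUNT else f"{city} - Other"

zip_labels = [consolidate_zip(city, zip_code) for city, zip_code, _ in listings]

# Dummy (one-hot) columns: one 0/1 indicator per remaining category.
categories = sorted(set(zip_labels))
dummies = [[int(label == category) for category in categories] for label in zip_labels]

# The natural log compresses the long right tail of the rent distribution.
log_rent = [math.log(rent) for _, _, rent in listings]
```

When a model is fit on log rent, its predictions are on the log scale and need `math.exp` applied to get back to dollars.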

Model Building

Compared 4 different models and evaluated performance on validation set:

Multiple Linear Regression
Support Vector Machine
Random Forest
XGBoost

Choosing Final Model (Random Forest)

After tuning hyperparameters, the final random forest model achieved the following results on the test set:

Root Mean Square Error: 364.64 (in dollars)
R² Score: 0.896
Adjusted R² Score: 0.895
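
For reference, the three metrics above can be computed as in this sketch. The function name and the sample values are hypothetical and are not the project's data; `n_features` is the number of predictors used in the adjusted R² correction.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return (RMSE, R², adjusted R²) for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R² penalizes R² for the number of predictors.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, r2, adj_r2

# Made-up rents in dollars, for illustration only.
y_true = [2000.0, 2500.0, 3000.0, 3500.0]
y_pred = [2100.0, 2400.0, 3100.0, 3400.0]
rmse, r2, adj_r2 = regression_metrics(y_true, y_pred, n_features=1)
```

With many test rows and comparatively few features, R² and adjusted R² end up close together, which is consistent with the 0.896 and 0.895 reported above.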

Feature Importances for Final Random Forest Model

bryandaetz1/Apartment_Rent_Prices
