Scraping data from apartments.com to train ML model that estimates rent prices for apartments in the Bay Area

bryandaetz1/Apartment_Rent_Prices

Estimating Bay Area Rent Prices - Project Overview

  • Used Selenium to scrape data from over 12,000 apartment listings on apartments.com in the San Jose, Oakland and San Francisco areas
  • Cleaned the data and engineered features from the text descriptions of apartment amenities, applying NLP techniques to identify which amenities might be worth including in the final model
  • Created an ML model that estimates rent prices (RMSE ~$365 on test set) given a number of inputs including # of bedrooms, # of bathrooms, square footage and amenities
  • Productionized the ML model and collaborated with a friend to build an interactive web app, which can be accessed here: http://3.128.33.149/

Packages Used and Sources Referenced

Python Version: 3.7
Packages:

  • Web Scraping: selenium, pandas, re
  • Data Cleaning/Feature Engineering: pandas, numpy, re, matplotlib, seaborn, sklearn, nltk
  • EDA/Model Building: pandas, numpy, matplotlib, seaborn, scipy, sklearn, xgboost
  • To install the requirements needed to run the pickled ML model: pip install -r requirements.txt
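
The pickled model mentioned above is loaded with Python's pickle module. The sketch below shows the general round trip, not the repository's actual code: the toy data, the feature order (bedrooms, bathrooms, square feet) and the in-memory byte string are all assumptions made for illustration.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data (hypothetical): bedrooms, bathrooms, square feet.
X = np.array([[1, 1, 650], [2, 1, 900], [2, 2, 1100], [3, 2, 1400]], dtype=float)
y = np.array([2200.0, 2900.0, 3300.0, 4100.0])

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Serialize the fitted model. With a file on disk this would be
# pickle.dump(model, f) on a file opened in "wb" mode.
blob = pickle.dumps(model)

# Later (for example inside a web app), restore the model and predict.
restored = pickle.loads(blob)
prediction = restored.predict(np.array([[2, 1, 950]], dtype=float))
print(prediction)
```

Unpickling a scikit-learn model generally works best with the same library versions that were used to save it, which is one reason to install from requirements.txt before loading.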

Sources Referenced:

  • Tutorial on scraping Glassdoor using selenium
  • Selenium unofficial documentation
  • Guide on productionizing an ML model (used for reference on how to pickle and load an ML model)

Data Cleaning & Feature Engineering

  • Created 3 new features from address: city, neighborhood and zip code
  • Stripped text from numerical features (bedrooms, bathrooms, rent, square footage)
  • For listings that had a range for rent and/or square footage, converted the range into an average
  • Applied a bag-of-words representation to the amenities text and fit a random forest regression model using just the number of bedrooms and the raw amenities text, to gain insight into which features might be worth extracting from the amenities column
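
The address and numeric cleaning steps above could look roughly like the following. This is a sketch under assumptions: the "street, city, ST 12345" address layout, the example strings and the helper names parse_address and to_number are hypothetical, and neighborhood extraction is left out because its source format is not described here.

```python
import re

def parse_address(address):
    """Return (city, zip_code) from an address like 'street, city, ST 12345'."""
    match = re.search(r",\s*([^,]+),\s*[A-Z]{2}\s+(\d{5})\s*$", address)
    if match is None:
        return None, None
    return match.group(1).strip(), match.group(2)

def to_number(text):
    """Strip surrounding text from a numeric field.

    If the field holds a range (two numbers), return the average of the
    endpoints; if it holds no number at all, return None.
    """
    found = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    values = [float(v.replace(",", "")) for v in found]
    if not values:
        return None
    return sum(values) / len(values)

city, zip_code = parse_address("123 Main St, San Jose, CA 95112")
rent = to_number("$2,500 - $3,100")   # range, averaged
sqft = to_number("650 sq ft")         # single value with trailing text
baths = to_number("1.5 ba")
```

Averaging every number found in the string is a simplification that suits fields containing either one value or a low-high range.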

Feature Importances from NLP Model

bow_feature_importances
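
A minimal sketch of how such bag-of-words feature importances can be produced is given below. The amenity strings, rents and model settings are invented for illustration and are not the project's data or tuned configuration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical listings: raw amenities text, bedroom count and monthly rent.
amenities = [
    "pool gym washer dryer",
    "gym parking",
    "washer dryer balcony",
    "pool parking balcony gym",
    "parking",
    "washer dryer pool",
]
bedrooms = np.array([1, 1, 2, 3, 0, 2], dtype=float)
rent = np.array([2900.0, 2400.0, 3100.0, 4300.0, 1900.0, 3400.0])

# Bag of words: one column per term, 1 if the term appears in the listing.
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(amenities)

# Put the bedroom count in front of the term columns.
X = hstack([csr_matrix(bedrooms.reshape(-1, 1)), bow]).tocsr()
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
feature_names = ["bedrooms"] + terms

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, rent)

# Rank features by impurity-based importance, highest first.
importances = forest.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"{feature_names[idx]:>10s}  {importances[idx]:.3f}")
```

Terms that rank highly in a model like this are candidates for dedicated amenity features, which is the use the section describes.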

Exploratory Data Analysis

Performed additional feature engineering during exploratory data analysis:

  • Removed some of the outliers (some rent prices were $15,000+)
  • Chose to drop neighborhood feature and use only zip codes and cities
  • Consolidated cities that appeared less frequently into an “Other” category
  • Consolidated zip codes that appeared less frequently into a “City Name - Other” category
  • Created dummy variables for categorical features
  • Took the natural log of the rent and square footage columns to reduce the positive skew observed in their distributions
  • Dropped some of the amenities features that were infrequent, unclear or less impactful
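
Several of the steps above can be sketched in pandas as follows. The sample rows, the $15,000 cutoff, the frequency threshold of 2, the column names and the order in which zip codes and cities are consolidated are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["San Jose", "San Jose", "Oakland", "Oakland",
             "San Francisco", "San Francisco", "Alameda", "San Francisco"],
    "zip_code": ["95112", "95112", "94607", "94612",
                 "94103", "94103", "94501", "94105"],
    "rent": [2400, 2600, 2300, 2500, 3500, 3700, 2800, 18000],
    "sqft": [700, 800, 650, 750, 600, 900, 850, 2500],
})

def consolidate_rare(series, min_count, other_label):
    """Replace values seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other_label)

# Drop extreme rents (the cutoff here is illustrative).
df = df[df["rent"] < 15000].copy()

# Rare zip codes become "<City Name> - Other"; rare cities become "Other".
df["zip_code"] = consolidate_rare(df["zip_code"], 2, df["city"] + " - Other")
df["city"] = consolidate_rare(df["city"], 2, "Other")

# Log-transform the right-skewed columns.
df["log_rent"] = np.log(df["rent"])
df["log_sqft"] = np.log(df["sqft"])

# One dummy column per remaining category.
encoded = pd.get_dummies(df, columns=["city", "zip_code"])
print(encoded.columns.tolist())
```

Consolidating rare categories before creating dummies keeps the number of columns manageable and avoids dummy variables backed by only a handful of listings.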

heatmap

bed_barplot bath_barplot

table scatter

rent_by_city

rent_by_zip

Model Building

Compared 4 different models and evaluated their performance on the validation set:

  • Multiple Linear Regression
  • Support Vector Machine
  • Random Forest
  • XGBoost
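
A comparison like this can be run with a simple loop over candidate models. The sketch below uses synthetic data and default hyperparameters, so its scores say nothing about the project's results; it also assumes the models are fit on log rent and scored in dollars, and it includes XGBoost only if the xgboost package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data (hypothetical): bedrooms and log square footage -> log rent.
rng = np.random.default_rng(0)
n = 300
bedrooms = rng.integers(0, 4, size=n)
log_sqft = np.log(rng.uniform(400, 1500, size=n))
X = np.column_stack([bedrooms, log_sqft])
log_rent = 4.0 + 0.15 * bedrooms + 0.55 * log_sqft + rng.normal(0, 0.05, size=n)

X_train, X_val, y_train, y_val = train_test_split(
    X, log_rent, test_size=0.2, random_state=0
)

models = {
    "Multiple Linear Regression": LinearRegression(),
    "Support Vector Machine": SVR(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
try:
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(n_estimators=100, random_state=0)
except ImportError:
    pass  # xgboost not installed; compare the remaining models

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Convert log-rent predictions back to dollars before scoring.
    pred_dollars = np.exp(model.predict(X_val))
    scores[name] = np.sqrt(mean_squared_error(np.exp(y_val), pred_dollars))

for name, rmse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:28s} validation RMSE: ${rmse:,.2f}")
```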

comparing_ML_models

distribution

Choosing Final Model (Random Forest)

After tuning hyperparameters, the final random forest model achieved the following results on the test set:

Root Mean Square Error: 364.64 (in dollars)
R² Score: 0.896
Adjusted R² Score: 0.895
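
For reference, the three metrics above can be computed as in the sketch below. The function name regression_report and the sample values are hypothetical. Since the rent column was log-transformed earlier, predictions would presumably be converted back to dollars (for example with np.exp) before computing an RMSE in dollars.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return (RMSE, R^2, adjusted R^2) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size

    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

    rmse = np.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, r2, adj_r2

# Hypothetical rents in dollars, for a model with 2 features.
actual = [2000, 2500, 3000, 3500, 4000]
predicted = [2100, 2400, 3100, 3400, 4000]
rmse, r2, adj_r2 = regression_report(actual, predicted, n_features=2)
print(f"RMSE: {rmse:.2f}  R2: {r2:.3f}  Adjusted R2: {adj_r2:.3f}")
```

With many test listings and comparatively few features, the adjustment factor is close to 1, which is consistent with the adjusted score reported above sitting just below the plain R² score.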

final_model_dist

Feature Importances for Final Random Forest Model

rf_feature_importances
