Property Rental Price Estimation in Paris

Author: Waldy Setiono (waldysetiono@gmail.com)

Background: An online accomodation and property rental company based in South East Asia plans to expand its service area to Europe and the first city will be Paris. The company is now conducting research to determine the pricing strategy in order to compete with existing similar companies. There has been no standard method to determine how much a property owner should cost their customers so this company considers analyzing historical pricing data of one of its major competitors to predict its potential future service prices. This project aims to make a predictive model that can estimate property rental price based on facility, location, room capacity, and other related features.

Data: The data used in this project is from insideairbnb.com.

Outline

Data Preparation
Exploratory Data Analysis and Data Cleaning
Feature Engineering
Modeling and Evaluation

Data Preparation

Import packages and load data

# Import packages
import pandas as pd
import numpy as np
import requests
from zipfile import ZipFile
from io import BytesIO
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from statsmodels.tsa.seasonal import seasonal_decompose
% pip install geopandas
import geopandas as gpd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
import xgboost as xgb
from xgboost import plot_importance
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score
import time
from keras import models, layers, optimizers, regularizers

Requirement already satisfied: geopandas in /usr/local/lib/python3.7/dist-packages (0.10.2)
Requirement already satisfied: pyproj>=2.2.0 in /usr/local/lib/python3.7/dist-packages (from geopandas) (3.2.1)
Requirement already satisfied: shapely>=1.6 in /usr/local/lib/python3.7/dist-packages (from geopandas) (1.7.1)
Requirement already satisfied: fiona>=1.8 in /usr/local/lib/python3.7/dist-packages (from geopandas) (1.8.20)
Requirement already satisfied: pandas>=0.25.0 in /usr/local/lib/python3.7/dist-packages (from geopandas) (1.1.5)
Requirement already satisfied: click>=4.0 in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (7.1.2)
Requirement already satisfied: cligj>=0.5 in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (0.7.2)
Requirement already satisfied: six>=1.7 in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (1.15.0)
Requirement already satisfied: attrs>=17 in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (21.2.0)
Requirement already satisfied: munch in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (2.5.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (2021.5.30)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (57.4.0)
Requirement already satisfied: click-plugins>=1.0 in /usr/local/lib/python3.7/dist-packages (from fiona>=1.8->geopandas) (1.1.1)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.0->geopandas) (1.19.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.0->geopandas) (2018.9)

r = requests.get("https://github.com/waldysetio/price-estimation/blob/main/data/listings.zip?raw=true")
files = ZipFile(BytesIO(r.content))
data = pd.read_csv(files.open("listings.csv"))
data

	id	listing_url	scrape_id	last_scraped	name	description	neighborhood_overview	picture_url	host_id	host_url	host_name	host_since	host_location	host_about	host_response_time	host_response_rate	host_acceptance_rate	host_is_superhost	host_thumbnail_url	host_picture_url	host_neighbourhood	host_listings_count	host_total_listings_count	host_verifications	host_has_profile_pic	host_identity_verified	neighbourhood	neighbourhood_cleansed	neighbourhood_group_cleansed	latitude	longitude	property_type	room_type	accommodates	bathrooms	bathrooms_text	bedrooms	beds	amenities	price	minimum_nights	maximum_nights	minimum_minimum_nights	maximum_minimum_nights	minimum_maximum_nights	maximum_maximum_nights	minimum_nights_avg_ntm	maximum_nights_avg_ntm	calendar_updated	has_availability	availability_30	availability_60	availability_90	availability_365	calendar_last_scraped	number_of_reviews	number_of_reviews_ltm	number_of_reviews_l30d	first_review	last_review	review_scores_rating	review_scores_accuracy	review_scores_cleanliness	review_scores_checkin	review_scores_communication	review_scores_location	review_scores_value	license	instant_bookable	calculated_host_listings_count	calculated_host_listings_count_entire_homes	calculated_host_listings_count_private_rooms	calculated_host_listings_count_shared_rooms	reviews_per_month
0	5396	https://www.airbnb.com/rooms/5396	20210909211521	2021-09-11	Explore the heart of old Paris	Cozy, well-appointed and graciously designed s...	You are within walking distance to the Louvre,...	https://a0.muscache.com/pictures/52413/f9bf76f...	7903	https://www.airbnb.com/users/show/7903	Borzou	2009-02-14	İstanbul, İstanbul, Turkey	The flat is owned by journalists who spend a l...	within an hour	100%	89%	f	https://a0.muscache.com/im/users/7903/profile_...	https://a0.muscache.com/im/users/7903/profile_...	Saint-Paul - Ile Saint-Louis	1.0	1.0	['email', 'phone', 'facebook', 'google', 'revi...	t	t	Paris, Ile-de-France, France	Hôtel-de-Ville	NaN	48.852470	2.358350	Entire rental unit	Entire home/apt	2	NaN	1 bath	NaN	1.0	["Hot water kettle", "Cooking basics", "Smoke ...	$110.00	2	1125	2.0	2.0	1125.0	1125.0	2.0	1125.0	NaN	t	0	3	29	29	2021-09-11	260	35	3	2013-09-22	2020-08-08	4.51	4.55	4.47	4.78	4.82	4.96	4.53	7510402838018	f	1	1	0	0	2.68
1	7397	https://www.airbnb.com/rooms/7397	20210909211521	2021-09-11	MARAIS - 2ROOMS APT - 2/4 PEOPLE	VERY CONVENIENT, WITH THE BEST LOCATION !<br /...	NaN	https://a0.muscache.com/pictures/67928287/330b...	2626	https://www.airbnb.com/users/show/2626	Franck	2008-08-30	Paris, Île-de-France, France	I am a writer,51, author of novels, books of l...	within an hour	100%	80%	t	https://a0.muscache.com/im/pictures/user/58f00...	https://a0.muscache.com/im/pictures/user/58f00...	Le Marais	4.0	4.0	['email', 'phone', 'reviews', 'jumio', 'offlin...	t	t	NaN	Hôtel-de-Ville	NaN	48.859090	2.353150	Entire rental unit	Entire home/apt	4	NaN	1 bath	2.0	2.0	["Cooking basics", "Smoke alarm", "Iron", "Ove...	$100.00	10	130	10.0	10.0	130.0	130.0	10.0	130.0	NaN	t	0	0	20	257	2021-09-11	278	9	2	2011-08-11	2021-08-18	4.70	4.79	4.44	4.91	4.88	4.92	4.70	7510400829623	f	1	1	0	0	2.26
2	7964	https://www.airbnb.com/rooms/7964	20210909211521	2021-09-11	Large & sunny flat with balcony !	Very large & nice apartment all for you! <br /...	NaN	https://a0.muscache.com/pictures/4471349/6fb3d...	22155	https://www.airbnb.com/users/show/22155	Anaïs	2009-06-18	Paris, Île-de-France, France	Hello ! \r\nOur apartment is great and I am su...	within a day	60%	0%	f	https://a0.muscache.com/im/users/22155/profile...	https://a0.muscache.com/im/users/22155/profile...	Gare du Nord - Gare de I'Est	0.0	0.0	['email', 'phone', 'reviews', 'jumio', 'offlin...	t	t	NaN	Opéra	NaN	48.874170	2.342450	Entire rental unit	Entire home/apt	2	NaN	1 bath	1.0	1.0	["Essentials", "TV with standard cable", "Wifi...	$130.00	6	365	6.0	6.0	365.0	365.0	6.0	365.0	NaN	t	13	43	73	348	2021-09-11	6	0	0	2014-09-11	2015-09-14	4.80	5.00	5.00	5.00	5.00	5.00	5.00	7510903576564	f	1	1	0	0	0.07
3	9359	https://www.airbnb.com/rooms/9359	20210909211521	2021-09-11	Cozy, Central Paris: WALK or VELIB EVERYWHERE !	Location! Location! Location! Just bring your ...	NaN	https://a0.muscache.com/pictures/c2965945-061f...	28422	https://www.airbnb.com/users/show/28422	Bernadette	2009-07-29	New York, New York, United States	I am a Native New Yorker (yes, I was born and ...	within an hour	100%	20%	f	https://a0.muscache.com/im/users/28422/profile...	https://a0.muscache.com/im/users/28422/profile...	Châtelet - Les Halles - Beaubourg	3.0	3.0	['email', 'phone', 'reviews', 'jumio', 'offlin...	t	t	NaN	Louvre	NaN	48.860060	2.348630	Entire rental unit	Entire home/apt	2	NaN	1 bath	1.0	1.0	["Essentials", "Wifi", "Kitchen", "Hair dryer"...	$75.00	180	365	180.0	180.0	365.0	365.0	180.0	365.0	NaN	t	0	0	0	28	2021-09-11	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Available with a mobility lease only ("bail mo...	f	1	1	0	0	NaN
4	9952	https://www.airbnb.com/rooms/9952	20210909211521	2021-09-11	Paris petit coin douillet	Je suis une dame retraitée, qui propose un agr...	Vibrant neighborhood, full of bars, cafés, fre...	https://a0.muscache.com/pictures/ae822d16-74d2...	33534	https://www.airbnb.com/users/show/33534	Elisabeth	2009-08-24	Paris, Île-de-France, France	Parisienne retraitée, dynamique et accueillant...	within an hour	100%	100%	t	https://a0.muscache.com/im/pictures/user/4f775...	https://a0.muscache.com/im/pictures/user/4f775...	République	1.0	1.0	['email', 'phone', 'reviews', 'jumio', 'offlin...	t	t	Paris, Ile-de-France, France	Popincourt	NaN	48.863730	2.370930	Entire rental unit	Entire home/apt	2	NaN	1 bath	1.0	1.0	["TV", "Cooking basics", "Smoke alarm", "Lugga...	$80.00	4	31	4.0	4.0	31.0	31.0	4.0	31.0	NaN	t	1	2	2	198	2021-09-11	31	6	1	2016-08-04	2021-06-23	4.94	4.97	4.87	5.00	4.90	4.90	4.94	7511101582862	f	1	1	0	0	0.50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
50128	52162381	https://www.airbnb.com/rooms/52162381	20210909211521	2021-09-11	Cosy Apartment 4P/1BD Canal St-Martin	This very cute apartment will be perfect for a...	This apartment is located in a very lively dis...	https://a0.muscache.com/pictures/prohost-api/H...	325819242	https://www.airbnb.com/users/show/325819242	Checkmyguest	2020-01-09	Paris, Île-de-France, France	NaN	NaN	NaN	NaN	f	https://a0.muscache.com/im/pictures/user/3cfaf...	https://a0.muscache.com/im/pictures/user/3cfaf...	Pigalle - Saint-Georges	5.0	5.0	['email', 'phone']	t	t	Paris, Île-de-France, France	Entrepôt	NaN	48.874007	2.363030	Entire rental unit	Entire home/apt	4	NaN	1 bath	1.0	2.0	["Hot water kettle", "TV", "Cooking basics", "...	$142.00	1	1125	1.0	3.0	1125.0	1125.0	2.9	1125.0	NaN	t	26	56	86	361	2021-09-11	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	7511005130319	t	31	31	0	0	NaN
50129	52162674	https://www.airbnb.com/rooms/52162674	20210909211521	2021-09-11	Tour Eiffel / Passy // magnifique appartement 4P	Ce superbe appartement est situé dans le fameu...	NaN	https://a0.muscache.com/pictures/01ba90ae-6838...	353064334	https://www.airbnb.com/users/show/353064334	Home Suite	2020-07-02	FR	NaN	within an hour	75%	99%	f	https://a0.muscache.com/im/pictures/user/31e4e...	https://a0.muscache.com/im/pictures/user/31e4e...	NaN	4.0	4.0	['email', 'phone', 'offline_government_id', 's...	t	t	NaN	Passy	NaN	48.853539	2.281483	Entire condominium (condo)	Entire home/apt	4	NaN	1 bath	1.0	2.0	["Private entrance", "Essentials", "TV", "Wifi...	$138.00	1	1125	1.0	3.0	2.0	1125.0	1.4	705.4	NaN	t	19	49	79	354	2021-09-11	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	7511605709975	t	2	2	0	0	NaN
50130	52163316	https://www.airbnb.com/rooms/52163316	20210909211521	2021-09-11	❉Comfortable studio for 2p - Paris 3❉	Come and discover my charming studio. This apa...	PLACE DES VOSGES <br />It is the jewel of the ...	https://a0.muscache.com/pictures/miso/Hosting-...	50502817	https://www.airbnb.com/users/show/50502817	WeHost	2015-12-04	Paris, Île-de-France, France	Conciergerie Airbnb	NaN	NaN	NaN	f	https://a0.muscache.com/im/pictures/user/a6ba3...	https://a0.muscache.com/im/pictures/user/a6ba3...	Commerce - Dupleix	175.0	175.0	['email', 'phone', 'reviews', 'jumio', 'govern...	t	f	Paris, Île-de-France, France	Temple	NaN	48.861777	2.364823	Entire rental unit	Entire home/apt	2	NaN	1 bath	1.0	1.0	["Hot water kettle", "Refrigerator", "Shower g...	$61.00	3	1125	3.0	3.0	1125.0	1125.0	3.0	1125.0	NaN	t	21	51	81	261	2021-09-11	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	7510305662032	t	61	54	4	0	NaN
50131	52165011	https://www.airbnb.com/rooms/52165011	20210909211521	2021-09-11	Cosy flat between Bastille and République in P...	Located at the heart of Paris, between some of...	Staying in the 11th arrondissement will enable...	https://a0.muscache.com/pictures/prohost-api/H...	125797498	https://www.airbnb.com/users/show/125797498	Welkeys	2017-04-14	Paris, Île-de-France, France	Bienvenue Chez Vous !\r\n\r\nWelkeys est une s...	within an hour	92%	98%	f	https://a0.muscache.com/im/pictures/user/76dfd...	https://a0.muscache.com/im/pictures/user/76dfd...	2nd Arrondissement	138.0	138.0	['email', 'phone', 'google', 'reviews', 'offli...	t	t	Paris, Île-de-France, France	Popincourt	NaN	48.857923	2.375364	Entire rental unit	Entire home/apt	2	NaN	1 bath	1.0	1.0	["Refrigerator", "Essentials", "Bed linens", "...	$100.00	2	1125	2.0	2.0	2.0	1125.0	2.0	33.7	NaN	t	25	55	85	287	2021-09-11	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	7511100347794	t	33	33	0	0	NaN
50132	52166001	https://www.airbnb.com/rooms/52166001	20210909211521	2021-09-11	Bail Mobilité Montmartre	I'm renting a charming rooftop flat in montmar...	NaN	https://a0.muscache.com/pictures/50262813/7d17...	18045582	https://www.airbnb.com/users/show/18045582	Betty	2014-07-13	Paris, Île-de-France, France	bonjour ,\r\n je suis psychologue et formatric...	within an hour	100%	100%	t	https://a0.muscache.com/im/users/18045582/prof...	https://a0.muscache.com/im/users/18045582/prof...	Pigalle - Saint-Georges	1.0	1.0	['email', 'phone', 'reviews', 'offline_governm...	t	t	NaN	Opéra	NaN	48.880568	2.335568	Entire rental unit	Entire home/apt	3	NaN	1 bath	1.0	3.0	["Cooking basics", "Luggage dropoff allowed", ...	$69.00	30	1125	30.0	30.0	1125.0	1125.0	30.0	1125.0	NaN	t	0	17	47	126	2021-09-11	0	0	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Available with a mobility lease only ("bail mo...	f	2	2	0	0	NaN

50133 rows × 74 columns

Check boolean and numerical categories

# Plotting the distribution of numerical and boolean categories
data.hist(figsize=(25,25));

Drop unrelated columns

This project is not going to implement sentiment analysis or text processing to make a predictive model, maybe in the future. Hence, for now columns containing reviews, summary, rules, and so on will be dropped. Columns that have just single unique value will be eliminated as well. Other columns that do not seem to be related to price will also be removed for simplicity reason but we might use them later if necessary.

# Drop unused columns 
data = data.drop(["listing_url", "scrape_id", "last_scraped", "name", 
           "description", "neighborhood_overview", "picture_url", "host_id", 
           "host_url", "host_name", "host_location", "host_about", "host_thumbnail_url",
           "host_picture_url", "host_thumbnail_url", "host_picture_url", "host_verifications",
           "calendar_last_scraped", "number_of_reviews_l30d", "number_of_reviews_ltm",
           "first_review", "last_review", "license", "reviews_per_month", "host_neighbourhood",
           "host_listings_count", "host_has_profile_pic",
           "host_acceptance_rate", "host_total_listings_count", 
           "host_identity_verified", "neighbourhood", 
           "minimum_minimum_nights", "maximum_minimum_nights", "minimum_maximum_nights", 
           "maximum_maximum_nights", "minimum_nights_avg_ntm", "maximum_nights_avg_ntm", 
           "amenities", "has_availability", "availability_30", "availability_60", 
           "availability_90", "availability_365", "calculated_host_listings_count",
           "calculated_host_listings_count_entire_homes", "calculated_host_listings_count_private_rooms",
           "calculated_host_listings_count_shared_rooms"
           ], axis=1)

Missing values

# Calculate missing values
missing_data_pecentage = data.isna().sum()/len(data.index)*100

# Plot missing values
missing_data_pecentage.plot(kind="bar", color="darkorange", figsize=(10,7))
plt.xlabel("Variables")
plt.ylabel("Missing values (%)")
plt.show()

We are going to drop columns with significant missing values. Row(s) in the "id" column that contains NaN will also be dropped because we can't subtitute data in "id" column with alternative data such as median, mean, mode, or else.

# Drop columns with significant missing data
data = data.drop(["review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness", 
                  "review_scores_checkin", "review_scores_communication", "calendar_updated",
                  "review_scores_location", "review_scores_value", "neighbourhood_group_cleansed", 
                  "host_response_time", "host_response_rate"],
                  axis=1)

# Find row in which "id" that contains NaN value(s)
data[data["id"].isnull()]

	id	host_since	host_is_superhost	neighbourhood_cleansed	latitude	longitude	property_type	room_type	accommodates	bathrooms	bathrooms_text	bedrooms	beds	price	minimum_nights	maximum_nights	number_of_reviews	instant_bookable

It seems there is no missing values in "id" column.

Check duplicates

data[data.duplicated()]

	id	host_since	host_is_superhost	neighbourhood_cleansed	latitude	longitude	property_type	room_type	accommodates	bathrooms	bathrooms_text	bedrooms	beds	price	minimum_nights	maximum_nights	number_of_reviews	instant_bookable

Descriptive statistics

# Print basic statistics of the data
data.describe()

	id	latitude	longitude	accommodates	bathrooms	bedrooms	beds	minimum_nights	maximum_nights	number_of_reviews
count	5.013300e+04	50133.000000	50133.000000	50133.000000	0.0	40700.000000	49542.000000	50133.000000	5.013300e+04	50133.000000
mean	2.594199e+07	48.863973	2.344874	3.053019	NaN	1.374103	1.689092	112.930226	9.999625e+02	20.834600
std	1.566613e+07	0.018182	0.033154	1.637042	NaN	1.017934	1.383799	170.370611	4.466482e+04	44.955323
min	5.396000e+03	48.813080	2.223870	0.000000	NaN	1.000000	0.000000	1.000000	1.000000e+00	0.000000
25%	1.214614e+07	48.850810	2.324160	2.000000	NaN	1.000000	1.000000	2.000000	2.400000e+02	1.000000
50%	2.544425e+07	48.865280	2.347970	2.000000	NaN	1.000000	1.000000	4.000000	1.125000e+03	5.000000
75%	4.004825e+07	48.878450	2.369280	4.000000	NaN	2.000000	2.000000	365.000000	1.125000e+03	21.000000
max	5.216600e+07	48.905680	2.473190	16.000000	NaN	50.000000	90.000000	9999.000000	1.000000e+07	1596.000000

# Print the statistics including columns with object data type
data.describe(include=['object'])

	host_since	host_is_superhost	neighbourhood_cleansed	property_type	room_type	bathrooms_text	price	instant_bookable
count	50099	50099	50133	50133	50133	50030	50133	50133
unique	3965	2	20	66	4	31	856	2
top	2019-09-02	f	Buttes-Montmartre	Entire rental unit	Entire home/apt	1 bath	$80.00	f
freq	504	43197	5355	38191	41329	37301	2133	34160

# Print data types of each column
data.dtypes

id                          int64
host_since                 object
host_is_superhost          object
neighbourhood_cleansed     object
latitude                  float64
longitude                 float64
property_type              object
room_type                  object
accommodates                int64
bathrooms                 float64
bathrooms_text             object
bedrooms                  float64
beds                      float64
price                      object
minimum_nights              int64
maximum_nights              int64
number_of_reviews           int64
instant_bookable           object
dtype: object

###Inspect columns

longitude and latitude

We will create a new dataframe consisting longitude and latitude then drop these two variables from the main dataframe later.

# Make a dataframe from longitude and latitude
long_lat = data[['longitude', 'latitude']]

# Drop longitude and latitude from main dataframe
data = data.drop(['longitude', 'latitude'], axis=1)

host_since

We will create a new feature called "active_days" by substracting the last_scraped date to host_since.

# Convert to datetime
data.host_since = pd.to_datetime(data.host_since) 

# Calculate the number of days
data['active_days'] = (datetime(2019, 4, 9) - data.host_since).astype('timedelta64[D]')

# Print mean and median
print("Mean days as host:", round(data['active_days'].mean(),0))
print("Median days as host:", data['active_days'].median())

# Replace null values with the median
data.active_days.fillna(data.active_days.median(), inplace=True)

Mean days as host: 1203.0
Median days as host: 1356.0

room_type and property_type

# Print categories in room_type
data.room_type.value_counts()

Entire home/apt    41329
Private room        7250
Hotel room          1209
Shared room          345
Name: room_type, dtype: int64

# Print categories in property_type
data.property_type.value_counts()

Entire rental unit             38191
Private room in rental unit     5053
Room in boutique hotel          1714
Entire condominium (condo)      1519
Entire loft                      875
                               ...  
Private room in chalet             1
Dome house                         1
Entire bed and breakfast           1
Entire cottage                     1
Barn                               1
Name: property_type, Length: 66, dtype: int64

Since room_type and property_type have similar categories, we will use only room type, drop the property_type, and categorize them to three labels which are "entire unit", "private room", and "other".

# Replacing other categories with 'other'
data.loc[~data.property_type.isin(['Entire home/apt', 'Private room']), 'room_type'] = 'Other'

# Drop property_type
data = data.drop(["property_type"], axis=1)

bathrooms, bedrooms and beds

# Print categories in bathrooms
data.bathrooms.unique()

array([nan])

Drop "bathrooms" since it only contains NaN values.

# Drop "bathrooms"
data = data.drop(["bathrooms"], axis=1)

Let's change categorical data of bathrooms_text to numerical.

# Print categories in bathrooms_text
data.bathrooms_text.unique()

array(['1 bath', '1 private bath', '1.5 baths', '2 baths',
       '1 shared bath', '1.5 shared baths', nan, '2.5 baths',
       '2 shared baths', '4 baths', 'Half-bath', '3 baths', '3.5 baths',
       '4.5 baths', 'Shared half-bath', '7 shared baths', '0 baths',
       '0 shared baths', '5 baths', 'Private half-bath',
       '2.5 shared baths', '6.5 shared baths', '6 baths', '8 baths',
       '3 shared baths', '7 baths', '50 baths', '5.5 baths', '6.5 baths',
       '10 baths', '29 baths', '23 baths'], dtype=object)

# Replace bathrooms_text data to numerical
data.bathrooms_text.replace({
    '1 bath': 1,
    '1 private bath': 1,
    '1.5 baths': 1,
    '2 baths': 2,
    '1 shared bath': 1,
    '1.5 shared baths': 1,
    '2.5 baths': 2,
    '2 shared baths': 2,
    '4 baths': 4,
    'Half-bath': 1,
    '3 baths': 3,
    '3.5 baths': 3,
    '4.5 baths': 4,
    '5.5 baths': 5,
    '6.5 baths': 6,
    'Shared half-bath': 1,
    '7 shared baths': 7,
    '0 baths': 0,
    '0 shared baths': 0,
    '5 baths': 5,
    'Private half-bath': 1,
    '2.5 shared baths': 2,
    '6.5 shared baths': 6,
    '6 baths': 6,
    '8 baths': 8,
    '3 shared baths': 3,
    '7 baths': 7, 
    '10 baths': 10,
    '23 baths': 23,
    '29 baths': 29,
    '50 baths': 50,

    }, inplace=True)

# Replace null values with the median
data.bathrooms_text.fillna(data.bathrooms_text.median(), inplace=True)

# Change data type to integer
data.bathrooms_text = data.bathrooms_text.astype('int32')

# Change column name
data = data.rename(columns = {'bathrooms_text':'bathrooms'})

Change NaN values in beds to median.

data.beds.unique()

array([ 1.,  2.,  3.,  0.,  4.,  5.,  6.,  8., nan,  9.,  7., 11., 12.,
       16., 18., 10., 79., 77., 90., 83., 85., 13., 14., 40.])

# Replace null values with the median
data.beds.fillna(data.beds.median(), inplace=True)

# Change data type to integer
data.beds = data.beds.astype('int32')

price

We will drop the currency sign from price strings and change them to integer.

# Format price
data.price = data.price.str[1:-3]
data.price = data.price.str.replace(",", "")
data.price = data.price.astype('int32')
data.price

0        110
1        100
2        130
3         75
4         80
        ... 
50128    142
50129    138
50130     61
50131    100
50132     69
Name: price, Length: 50133, dtype: int32

host_is_superhost

# Print value counts
print(data.host_is_superhost.unique())
print(data.host_is_superhost.value_counts())

['f' 't' nan]
f    43197
t     6902
Name: host_is_superhost, dtype: int64

# Replace binary categorical data to 0 and 1
data.host_is_superhost.replace({'f': 0, 't': 1}, inplace=True)

# Replace null values with the median
data.host_is_superhost.fillna(data.host_is_superhost.median(), inplace=True)

# Change data type to integer
data.host_is_superhost = data.host_is_superhost.astype('int32')

instant_bookable

# Print value counts
print(data.instant_bookable.unique())
print(data.instant_bookable.value_counts())

['f' 't']
f    34160
t    15973
Name: instant_bookable, dtype: int64

# Replace binary categorical data to 0 and 1
data.instant_bookable.replace({'f': 0, 't': 1}, inplace=True)

##Exploratory Data Analysis

Trend of Service Adoption

Let's see the trend of hosts joining the service. We can see that the number kept increasing since 2009 until the peak in 2015 and it has been relatively slowing down until today with significant increase in 2019.

# Create dataframes for time series analysis
ts_host_since = pd.DataFrame(data.set_index('host_since').resample('MS').size())

# Rename columns
ts_host_since = ts_host_since.rename(columns={0: 'hosts'})
ts_host_since.index.rename('month', inplace=True)

def decompose_time_series(df, title=''):
    """
    Plots the original time series and its decomposition into trend, seasonal and residual.
    """
    # Decomposing the time series
    decomposition = seasonal_decompose(df)
    
    # Getting the trend, seasonality and noise
    trend = decomposition.trend
    seasonal = decomposition.seasonal
    residual = decomposition.resid
    
    # Plotting the original time series and the decomposition
    plt.figure(figsize=(12,10))
    plt.suptitle(title, fontsize=12, y=1)
    plt.subplots_adjust(top=0.80)
    plt.subplot(411)
    plt.plot(df, label='Original')
    plt.legend(loc='upper left')
    plt.subplot(412)
    plt.plot(trend, label='Trend')
    plt.legend(loc='upper left')
    plt.subplot(413)
    plt.plot(seasonal,label='Seasonality')
    plt.legend(loc='upper left')
    plt.subplot(414)
    plt.plot(residual, label='Residuals')
    plt.legend(loc='upper left')
    plt.tight_layout()

# Call function of trend time series
decompose_time_series(ts_host_since, title='Number of hosts joining the service each month')

# Drop host_since as it is no longer needed
data.drop(['host_since'], axis=1, inplace=True)

Price Distribution

# Print max and min of price
print(f"Nightly prices range from ${min(data.price)} to ${max(data.price)}.")

Nightly prices range from $0 to $11600.

# Plot the distribution of prices
plt.figure(figsize=(20,4))
data.price.hist(bins=100, range=(0,1000))
plt.margins(x=0)
plt.axvline(200, color='orange', linestyle='--')
plt.title("Nightly prices in Paris up to $1000", fontsize=16)
plt.xlabel("Price ($)")
plt.ylabel("Number of listings")
plt.show()

# Distribution of prices from £200 upwards
plt.figure(figsize=(20,4))
data.price.hist(bins=100, range=(200, max(data.price)))
plt.margins(x=0)
plt.axvline(500, color='orange', linestyle='--')
plt.axvline(1000, color='red', linestyle='--')
plt.title("Nightly prices in Paris of more than $200", fontsize=16)
plt.xlabel("Price ($)")
plt.ylabel("Number of listings")
plt.show()

# Replacing values under $10 with $10
data.loc[data.price <= 10, 'price'] = 10

# Replacing values over $1000 with $1000
data.loc[data.price >= 1000, 'price'] = 1000

Capacity

# Plot number of people that can be accomodated based on median price
plt.figure(figsize=(10,5))
data.groupby('accommodates').price.median().plot(kind='bar')
plt.title('Median price of places accommodating different number of guests', fontsize=14)
plt.xlabel('Number of guests accommodated', fontsize=13)
plt.ylabel('Median price ($)', fontsize=13)
plt.xticks(rotation=0)
plt.xlim(left=0.5)
plt.show()

Neighborhood

We are going to see the distribution of price and the number of properties based on location.

# Renaming the neighbourhood column
data.rename(columns={'neighbourhood_cleansed': 'area'}, inplace=True)

# Importing the London borough boundary GeoJSON file as a dataframe in geopandas
map_df = gpd.read_file('https://raw.githubusercontent.com/waldysetio/price-estimation/main/data/neighbourhoods.geojson')
map_df.head()

	neighbourhood	neighbourhood_group	geometry
0	Batignolles-Monceau	None	MULTIPOLYGON (((2.29517 48.87396, 2.29504 48.8...
1	Palais-Bourbon	None	MULTIPOLYGON (((2.32090 48.86306, 2.32094 48.8...
2	Buttes-Chaumont	None	MULTIPOLYGON (((2.38943 48.90122, 2.39014 48.9...
3	Opéra	None	MULTIPOLYGON (((2.33978 48.88203, 2.33982 48.8...
4	Entrepôt	None	MULTIPOLYGON (((2.36469 48.88437, 2.36485 48.8...

# Dropping the empty column
map_df.drop('neighbourhood_group', axis=1, inplace=True)

# Creating a dataframe of listing counts and median price by borough
area_df = pd.DataFrame(data.groupby('area').size())
area_df.rename(columns={0: 'number_of_listings'}, inplace=True)
area_df['median_price'] = data.groupby('area').price.median().values

# Joining the dataframes
area_map_df = map_df.set_index('neighbourhood').join(area_df)

# Plotting the number of listings in each area
fig1, ax1 = plt.subplots(1, figsize=(15, 6))
area_map_df.plot(column='number_of_listings', cmap='Blues', ax=ax1)
ax1.axis('off')
ax1.set_title('Number of listings in each Paris area', fontsize=14)
sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=0, vmax=9000))
sm._A = [] # Creates an empty array for the data range
cbar = fig1.colorbar(sm)
plt.show()

# Plotting the median price of listings in each area
fig2, ax2 = plt.subplots(1, figsize=(17, 6))
area_map_df.plot(column='median_price', cmap='Blues', ax=ax2)
ax2.axis('off')
ax2.set_title('Median price of listings in each Paris area', fontsize=14)
sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=min(area_map_df.median_price), vmax=max(area_map_df.median_price)))
sm._A = [] # Create an empty array for the data range
cbar = fig2.colorbar(sm)
plt.show()

Boolean Features

def binary_count_and_price_plot(col, figsize=(8,3)):
    """
    Plots a simple bar chart of the counts of true and false categories in the column specified,
    next to a bar chart of the median price for each category.
    A figure size can optionally be specified.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
    fig.suptitle(col, fontsize=16, y=1)
    plt.subplots_adjust(top=0.80) # So that the suptitle does not overlap with the ax plot titles
    
    data.groupby(col).size().plot(kind='bar', ax=ax1, color=['firebrick', 'seagreen'])
    ax1.set_xticklabels(labels=['false', 'true'], rotation=0)
    ax1.set_title('Category count')
    ax1.set_xlabel('')
    
    data.groupby(col).price.median().plot(kind='bar', ax=ax2, color=['firebrick', 'seagreen'])
    ax2.set_xticklabels(labels=['false', 'true'], rotation=0)
    ax2.set_title('Median price ($)')
    ax2.set_xlabel('')
    
    plt.show()

Superhosts

# Correlation between price and whether host is super host
binary_count_and_price_plot('host_is_superhost')
print(data.host_is_superhost.value_counts(normalize=True))

0    0.862326
1    0.137674
Name: host_is_superhost, dtype: float64

Instant booking

# Correlation between price and instant bookable
binary_count_and_price_plot('instant_bookable')
print(data.instant_bookable.value_counts(normalize=True))

0    0.681388
1    0.318612
Name: instant_bookable, dtype: float64

##Feature Engineering

One-hot Encoding

# Make dummy variables
transformed_data = pd.get_dummies(data)

Adressing Multicolinearity

def multi_collinearity_heatmap(df, figsize=(11,9)):
    
    """
    Creates a heatmap of correlations between features in the dataframe. A figure size can optionally be set.
    """
    
    # Set the style of the visualization
    sns.set(style="white")

    # Create a covariance matrix
    corr = df.corr()

    # Generate a mask the size of our covariance matrix
    mask = np.zeros_like(corr, dtype=np.bool)
    mask[np.triu_indices_from(mask)] = True

    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=figsize)

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, vmax=corr[corr != 1.0].max().max());

# Plot correlation of transformed_data
multi_collinearity_heatmap(transformed_data, figsize=(20,20))

There seems to be strong relationship between room_type_other with room_type_Private and bedrooms with beds. We will remove one of the two for these two pairs.

# Drop collinear features
to_drop = ['beds',
           'bedrooms',
           'room_type_Other', 
           'room_type_Private room']
to_drop.extend(list(transformed_data.columns[transformed_data.columns.str.endswith('nan')]))

transformed_data.drop(to_drop, axis=1, inplace=True)

# See the multicollinearity without area feature
multi_collinearity_heatmap(transformed_data.drop(list(transformed_data.columns[transformed_data.columns.str.startswith('area')]), axis=1), figsize=(25,22))

Address Skewed Data

numerical_columns = ['accommodates', 'bathrooms', 'active_days', 'maximum_nights', 'minimum_nights', 'number_of_reviews', 'price']

transformed_data[numerical_columns].hist(figsize=(10,11));

# Log transforming columns
numerical_columns = [i for i in numerical_columns if i not in ['host_days_active']] # Remove items not to be transformed

for col in numerical_columns:
    transformed_data[col] = transformed_data[col].astype('float64').replace(0.0, 0.01) # Replace 0s with 0.01
    transformed_data[col] = np.log(transformed_data[col])

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)

transformed_data[numerical_columns].hist(figsize=(10,11));

# Separating X and y
X = transformed_data.drop('price', axis=1)
y = transformed_data.price

# Scaling
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=list(X.columns))

##Modeling and Evaluation

# Splitting into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

XGB

# Modeling
xgb_reg_start = time.time()

xgb_reg = xgb.XGBRegressor()
xgb_reg.fit(X_train, y_train)
training_preds_xgb_reg = xgb_reg.predict(X_train)
val_preds_xgb_reg = xgb_reg.predict(X_test)

xgb_reg_end = time.time()

print(f"Time taken to run: {round((xgb_reg_end - xgb_reg_start)/60,1)} minutes")
print("\nTraining MSE:", round(mean_squared_error(y_train, training_preds_xgb_reg),4))
print("Validation MSE:", round(mean_squared_error(y_test, val_preds_xgb_reg),4))
print("\nTraining r2:", round(r2_score(y_train, training_preds_xgb_reg),4))
print("Validation r2:", round(r2_score(y_test, val_preds_xgb_reg),4))

[14:52:27] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Time taken to run: 0.1 minutes

Training MSE: 0.1971
Validation MSE: 0.2054

Training r2: 0.5501
Validation r2: 0.5187

# Calculate weights
ft_weights_xgb_reg = pd.DataFrame(xgb_reg.feature_importances_, columns=['weight'], index=X_train.columns)
ft_weights_xgb_reg.sort_values('weight', inplace=True)
ft_weights_xgb_reg

	weight
room_type_Entire home/apt	0.000000
area_Vaugirard	0.000000
area_Entrepôt	0.000000
area_Batignolles-Monceau	0.000000
area_Observatoire	0.008662
maximum_nights	0.010759
area_Panthéon	0.010760
id	0.011895
active_days	0.012367
area_Opéra	0.013760
area_Bourse	0.015044
area_Popincourt	0.018647
area_Passy	0.019185
number_of_reviews	0.020651
area_Reuilly	0.020975
area_Temple	0.023728
area_Gobelins	0.025696
area_Palais-Bourbon	0.025961
area_Hôtel-de-Ville	0.029520
host_is_superhost	0.029991
area_Louvre	0.030567
area_Luxembourg	0.032430
minimum_nights	0.032503
area_Élysée	0.043453
area_Buttes-Chaumont	0.048962
area_Ménilmontant	0.053697
area_Buttes-Montmartre	0.062794
instant_bookable	0.077156
bathrooms	0.137204
accommodates	0.183633

# Plotting feature importances
plt.figure(figsize=(8,20))
plt.barh(ft_weights_xgb_reg.index, ft_weights_xgb_reg.weight, align='center') 
plt.title("Feature importances in the XGBoost model", fontsize=14)
plt.xlabel("Feature importance")
plt.margins(y=0.01)
plt.show()

Note: This project is still ongoing. Feature engineering and model development will be improved to increase metrics of the model's quality.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
data		data
images		images
README.md		README.md
price-estimation.ipynb		price-estimation.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Property Rental Price Estimation in Paris

Outline

Data Preparation

About

Releases

Packages

Languages

waldysetio/price-estimation

Folders and files

Latest commit

History

Repository files navigation

Property Rental Price Estimation in Paris

Outline

Data Preparation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages