Regression analysis and Exploratory Data Analysis of New York City - AirBNB Open Dataset.

imnikhilanand/NYC-AirBNB-Open-Data-Analysis

NYC-AirBNB-Open-Data-Analysis

About Dataset

Context:
Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

Content:
This data file includes all the information needed to learn about hosts and geographical availability, along with the metrics needed to make predictions and draw conclusions.

Acknowledgements:
This public dataset is part of Airbnb, and the original source can be found on this website.

Inspiration:
What can we learn about different hosts and areas? What can we learn from predictions (e.g., locations, prices, reviews)? Which hosts are the busiest and why? Is there any noticeable difference in traffic among different areas, and what could be the reason for it?

Exploratory Data Analysis

Observations:

  • Number of unique neighbourhood groups (boroughs): 5
  • Number of unique room types: 3
  • There are 48895 different Airbnbs listed in the dataset.
  • There are 37457 different Airbnb owners in the dataset.
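The counts above can be reproduced with a few pandas calls. The following is a minimal sketch, assuming the listings are loaded into a DataFrame with columns named `id`, `host_id`, `neighbourhood_group` and `room_type` (the README mentions these fields, but the exact spelling of the column names is an assumption). The `sample` frame is made-up data standing in for the real CSV.

```python
import pandas as pd


def dataset_overview(listings: pd.DataFrame) -> dict:
    """Count distinct boroughs, room types, listings, and hosts."""
    return {
        "neighbourhood_groups": listings["neighbourhood_group"].nunique(),
        "room_types": listings["room_type"].nunique(),
        "listings": listings["id"].nunique(),
        "hosts": listings["host_id"].nunique(),
    }


# Tiny made-up sample; the real data would come from the dataset's CSV file.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_id": [10, 10, 11, 12],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})
overview = dataset_overview(sample)
```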

Airbnbs per user

num_of_houses_per_user


num_of_houses_per_user_graph


Observations:

  • ~86% of the owners have only 1 Airbnb.
  • ~9% of the owners have 2 Airbnbs.
  • ~2.5% of the owners have 3 Airbnbs.
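One way to arrive at percentages like these is to count listings per host and then count how many hosts fall into each bucket. This is a sketch under the assumption that each row is one listing and that the host identifier column is called `host_id`; the sample values are invented.

```python
import pandas as pd


def host_share_by_listing_count(listings: pd.DataFrame) -> pd.Series:
    """Percentage of hosts owning 1, 2, 3, ... listings."""
    listings_per_host = listings.groupby("host_id").size()
    share = listings_per_host.value_counts(normalize=True).sort_index()
    return share * 100


# Made-up sample: hosts 1, 3, 4 own one listing, host 2 owns two, host 5 owns three.
sample_hosts = pd.DataFrame({"host_id": [1, 2, 2, 3, 4, 5, 5, 5]})
host_share = host_share_by_listing_count(sample_hosts)
```

The resulting Series is indexed by the number of listings, so `host_share.loc[1]` is the share of single-listing hosts.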

Distribution of Airbnbs based on the neighborhood

num_of_houses_per_user


num_of_houses_per_user_graph


Observations:

  • ~44% of the Airbnbs are located in Manhattan.
  • ~41% of the Airbnbs are located in Brooklyn.
  • ~15% of the Airbnbs are located in the Bronx, Queens and Staten Island combined.
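Shares per category, such as the borough split above, can be computed with a normalized value count. The helper below is a hypothetical sketch (its name and the sample data are not from the repository); the column name `neighbourhood_group` is taken from the README.

```python
import pandas as pd


def category_share(listings: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of listings in each category of `column`, rounded to 0.1."""
    return listings[column].value_counts(normalize=True).mul(100).round(1)


# Made-up sample with a 40/40/20 split.
sample_groups = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Queens"] * 2
})
borough_share = category_share(sample_groups, "neighbourhood_group")
```

The same helper applies unchanged to `room_type` for the room-type distribution.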

Distribution of Airbnbs based on the type of room

num_of_houses_per_user


num_of_houses_per_user_graph


Observations:

  • ~52% of the Airbnbs are entire homes/apartments.
  • ~45% of the Airbnbs are private rooms.
  • ~2% of the Airbnbs are shared rooms.

Distribution of Airbnbs based on the price

KDE_plot_price.png

Observations:

  • The mean price of the Airbnbs is around $100.
  • The prices of most Airbnbs are concentrated around $100.
  • A very small percentage of Airbnbs are priced above $300.
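To back observations like these with numbers, one could locate the peak of a kernel density estimate and measure the share of listings above a price threshold. The sketch below uses `scipy.stats.gaussian_kde` on synthetic prices; the data, the function name and the $300 threshold applied to it are illustrative assumptions, not the repository's code.

```python
import numpy as np
from scipy.stats import gaussian_kde


def price_density_peak(prices, grid_size=512):
    """Return the price at which a Gaussian KDE of `prices` is highest."""
    prices = np.asarray(prices, dtype=float)
    grid = np.linspace(prices.min(), prices.max(), grid_size)
    density = gaussian_kde(prices)(grid)
    return float(grid[np.argmax(density)])


# Synthetic prices: a bulk around $100 plus a few expensive listings.
rng = np.random.default_rng(0)
synthetic_prices = np.concatenate([
    rng.normal(100, 20, 500),
    rng.uniform(300, 1000, 10),
])
peak = price_density_peak(synthetic_prices)
share_above_300 = float((synthetic_prices > 300).mean() * 100)
```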

Distribution of Airbnbs based on the price in different neighborhoods

KDE_plot_price.png

Observations:

  • The mean price of Airbnbs is highest in Manhattan.

  • The mean price in Brooklyn is slightly lower than in Manhattan, and most Brooklyn prices are concentrated around the mean.

Region and Airbnbs

KDE_plot_price.png

Regionwise Room type

KDE_plot_price.png

Availability of rooms

KDE_plot_price.png

Scatterplot of features

KDE_plot_price.png

Observation:

  • There was no significant correlation between any pair of features in the dataset.
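A scatterplot impression like this can be cross-checked numerically by listing every feature pair whose Pearson correlation exceeds a cutoff. The following sketch is illustrative only: the 0.5 cutoff, the function name and the toy columns are assumptions.

```python
import pandas as pd


def correlated_pairs(features: pd.DataFrame, threshold: float = 0.5) -> list:
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = features.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Toy data: y is perfectly correlated with x, z is uncorrelated with both.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [2, -1, 0, -1, 2],
})
pairs = correlated_pairs(toy)
```

An empty list for the real features would be consistent with the observation above.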

Regression Analysis

To build a regression model for predicting Airbnb prices, we kept only the listings priced below $200 per night. This filter was applied because most listings are priced below $200, as the price density plot (PDF) above shows.

Before building the model, the following columns were removed:

  • Primary keys (irrelevant) - Id, Host_id

  • Categorical variables - neighbourhood_group, room_type (dropped because these groups have been encoded as separate indicator columns)

  • One category from each encoded variable - Staten Island, Shared Room (these columns are redundant given the remaining categories)
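The preprocessing steps above might look roughly like the sketch below. It assumes columns named `id`, `host_id`, `price`, `neighbourhood_group` and `room_type`, and it uses one-hot (indicator) encoding via `pandas.get_dummies`, which is an assumption inferred from predictor names such as "Manhattan" and "Entire home/apt"; the repository's actual encoding step may differ. The sample rows are invented.

```python
import pandas as pd


def prepare_features(listings: pd.DataFrame, max_price: float = 200) -> pd.DataFrame:
    """Filter by price, encode the categorical columns, drop unused columns."""
    below_cap = listings[listings["price"] < max_price]
    # One indicator column per category, named after the category itself.
    encoded = pd.get_dummies(
        below_cap,
        columns=["neighbourhood_group", "room_type"],
        prefix="",
        prefix_sep="",
        dtype=int,
    )
    # Keys carry no signal; one category per variable is dropped as the baseline.
    unused = ["id", "host_id", "Staten Island", "Shared room"]
    return encoded.drop(columns=[c for c in unused if c in encoded.columns])


raw = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "host_id": [10, 11, 12, 13, 14],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Staten Island", "Queens", "Manhattan"],
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room", "Entire home/apt"],
    "price": [100, 250, 150, 80, 199],
})
prepared = prepare_features(raw)
```

Dropping one category per variable avoids perfectly collinear indicator columns once an intercept is added to the regression.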


To build the model, stepwise regression modeling was performed.

1st model

Predictors:

  • Entire home/apt
  • Manhattan
  • latitude
  • longitude
  • number_of_reviews
  • calculated_host_listings_count

Results:

  • price

OLS Summary:

Dep. Variable: price R-squared: 0.477
Model: OLS Adj. R-squared: 0.476
Method: Least Squares F-statistic: 454.3
Date: Mon, 27 Jun 2022 Prob (F-statistic): 0.00
Time: 22:22:09 Log-Likelihood: -14665.
No. Observations: 3000 AIC: 2.934e+04
Df Residuals: 2993 BIC: 2.939e+04
Df Model: 6
Covariance Type: nonrobust

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -5833.4460 | 1499.096 | -3.891 | 0.000 | -8772.808 | -2894.084 |
| Entire home/apt | 53.5024 | 1.200 | 44.586 | 0.000 | 51.150 | 55.855 |
| Manhattan | 20.4296 | 1.869 | 10.933 | 0.000 | 16.766 | 24.093 |
| latitude | -46.8338 | 14.847 | -3.155 | 0.002 | -75.944 | -17.723 |
| longitude | -105.5753 | 15.209 | -6.941 | 0.000 | -135.397 | -75.753 |
| number_of_reviews | 0.0469 | 0.015 | 3.186 | 0.001 | 0.018 | 0.076 |
| calculated_host_listings_count | 0.0502 | 0.027 | 1.848 | 0.065 | -0.003 | 0.103 |

Omnibus: 130.288 Durbin-Watson: 2.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 149.226
Skew: 0.505 Prob(JB): 3.95e-33
Kurtosis: 3.416 Cond. No. 2.26e+05

KDE_plot_price.png

Let's visualize the residual plot:

KDE_plot_price.png

Observation:

  • The residuals are negative for the data points at the beginning and later become positive, so there is an increasing trend in the residual errors.

To reduce the linear trend in the residual errors, we can apply transformations to the predictors and to the response (price).

2nd model

Predictors:

  • Entire home/apt
  • Manhattan
  • latitude
  • longitude
  • number_of_reviews
  • calculated_host_listings_count
  • square of latitude
  • square of longitude
  • square of number of reviews
  • square of calculated host listings count

Results:

  • square root of price
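The transformations listed for this model amount to adding squared copies of selected predictors and taking the square root of the response. A minimal sketch follows; the `_2` suffix mirrors the names in the regression output (for example `longitude_2`), while the helper name and the sample values are made up.

```python
import numpy as np
import pandas as pd


def add_squared_terms(features: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of `features` with a `<name>_2` column per listed column."""
    out = features.copy()
    for col in columns:
        out[f"{col}_2"] = out[col] ** 2
    return out


sample_features = pd.DataFrame({
    "number_of_reviews": [3, 4],
    "calculated_host_listings_count": [1, 2],
})
with_squares = add_squared_terms(
    sample_features, ["number_of_reviews", "calculated_host_listings_count"]
)

# Square-root transform of the response.
sample_price = pd.Series([100.0, 144.0])
sqrt_price = np.sqrt(sample_price)
```

Predictions from a model fitted on the transformed response are on the square-root scale and have to be squared to get back to dollars.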

OLS Regression Results:

Dep. Variable: price R-squared: 0.512
Model: OLS Adj. R-squared: 0.510
Method: Least Squares F-statistic: 313.6
Date: Mon, 27 Jun 2022 Prob (F-statistic): 0.00
Time: 22:27:22 Log-Likelihood: -5610.3
No. Observations: 3000 AIC: 1.124e+04
Df Residuals: 2989 BIC: 1.131e+04
Df Model: 10
Covariance Type: nonrobust

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -9.322e+04 | 3.45e+04 | -2.705 | 0.007 | -1.61e+05 | -2.57e+04 |
| Entire home/apt | 2.7036 | 0.059 | 45.998 | 0.000 | 2.588 | 2.819 |
| longitude | 1415.4808 | 937.737 | 1.509 | 0.131 | -423.194 | 3254.156 |
| latitude | 7135.9938 | 559.093 | 12.764 | 0.000 | 6039.748 | 8232.240 |
| Manhattan | 0.9104 | 0.096 | 9.457 | 0.000 | 0.722 | 1.099 |
| number_of_reviews | 0.0008 | 0.002 | 0.429 | 0.668 | -0.003 | 0.004 |
| calculated_host_listings_count | -0.0081 | 0.004 | -2.264 | 0.024 | -0.015 | 0.001 |
| longitude_2 | 9.6156 | 6.344 | 1.516 | 0.130 | -2.824 | 22.055 |
| latitude_2 | -87.6058 | 6.861 | -12.768 | 0.000 | -101.059 | -74.152 |
| number_of_reviews_2 | 1.099e-05 | 1.02e-05 | 1.076 | 0.282 | -9.04e-06 | 3.1e-05 |
| calculated_host_listings_count_2 | 3.112e-05 | 1.24e-05 | 2.516 | 0.012 | 6.87e-06 | 5.54e-05 |

Omnibus: 73.297 Durbin-Watson: 2.017
Prob(Omnibus): 0.000 Jarque-Bera (JB): 85.001
Skew: 0.335 Prob(JB): 3.49e-19
Kurtosis: 3.481 Cond. No. 9.12e+09

KDE_plot_price.png

Let's visualize the residual plot:

KDE_plot_price.png

Let's check the condition of equal variance using Levene's test:

statistic=18.386483526081904
pvalue=1.860245656713538e-05
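Levene's test compares the spread of two or more groups, so the residuals have to be split into groups first. The README does not say how that split was made; the sketch below assumes a split at the median fitted value and uses `scipy.stats.levene` on synthetic residuals whose spread grows with the fitted value.

```python
import numpy as np
from scipy.stats import levene


def levene_on_residuals(fitted, residuals):
    """Levene's test on residuals below vs. above the median fitted value."""
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    cutoff = np.median(fitted)
    low = residuals[fitted <= cutoff]
    high = residuals[fitted > cutoff]
    return levene(low, high)


# Synthetic heteroscedastic residuals: the noise scale equals the fitted value.
rng = np.random.default_rng(0)
fitted_values = np.linspace(1, 10, 400)
noisy_residuals = rng.normal(0, fitted_values)
levene_result = levene_on_residuals(fitted_values, noisy_residuals)
```

A small p-value, as in the output above, is evidence against equal variance across the groups.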

Observation:

  • There is still some upward trend remaining in the residuals, but it has been reduced significantly.
  • The constant variance condition is not met in this case (Levene's test rejects equal variance, p < 0.05), so the unequal variance has to be reduced too.

To reduce the linear trend in the residual errors and bring them closer to constant variance, we can apply further transformations to the predictors and the response. In addition to the transformations, we used weighted least squares (WLS) regression this time to fit the model better.

3rd model

Predictors:

  • Entire home/apt
  • Manhattan
  • latitude
  • longitude
  • number_of_reviews
  • calculated_host_listings_count
  • square of latitude
  • square of longitude
  • square of number of reviews
  • square of calculated host listings count
  • cube of longitude
  • cube of calculated host listings count
  • combined effect of longitude and calculated host listings count
  • combined effect of number of reviews and calculated host listings count
  • combined effect of Manhattan, entire apartment or not and number of reviews
  • combined effect of Manhattan, entire apartment or not and availability of apartment
  • combined effect of Manhattan, longitude and latitude
  • combined effect of entire apartment, longitude, latitude and Manhattan
  • combined effect of squared longitude, squared latitude and number of reviews
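The README does not define how a "combined effect" is computed. A usual choice for interaction terms, assumed in the sketch below, is the element-wise product of the columns involved. The output names `longitude_chlc` and `manhattan_long_lat` are borrowed from the regression table; the helper and the sample rows are invented.

```python
import pandas as pd


def add_interactions(features: pd.DataFrame, interactions: dict) -> pd.DataFrame:
    """Add one product column per entry of `interactions` (name -> columns)."""
    out = features.copy()
    for name, cols in interactions.items():
        out[name] = out[list(cols)].prod(axis=1)
    return out


sample_rows = pd.DataFrame({
    "Manhattan": [1, 0],
    "longitude": [-74.0, -73.9],
    "latitude": [40.7, 40.6],
    "calculated_host_listings_count": [2, 5],
})
with_interactions = add_interactions(sample_rows, {
    "longitude_chlc": ["longitude", "calculated_host_listings_count"],
    "manhattan_long_lat": ["Manhattan", "longitude", "latitude"],
})
```

Because `Manhattan` is a 0/1 indicator, `manhattan_long_lat` is zero for every listing outside Manhattan.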

Results:

  • fourth root of price

WLS Regression Results:

Dep. Variable: price R-squared: 0.645
Model: WLS Adj. R-squared: 0.643
Method: Least Squares F-statistic: 284.7
Date: Mon, 27 Jun 2022 Prob (F-statistic): 0.00
Time: 22:47:10 Log-Likelihood: -7739.2
No. Observations: 3000 AIC: 1.552e+04
Df Residuals: 2980 BIC: 1.564e+04
Df Model: 19
Covariance Type: nonrobust

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -9126.3851 | 1430.266 | -6.381 | 0.000 | -1.19e+04 | -6321.976 |
| Entire home/apt | 0.4937 | 0.011 | 45.199 | 0.000 | 0.472 | 0.515 |
| number_of_reviews | -1.0537 | 0.184 | -5.726 | 0.000 | -1.415 | -0.693 |
| calculated_host_listings_count | -1.0105 | 1.327 | -0.761 | 0.447 | -3.613 | 1.592 |
| longitude_2 | -2.1726 | 0.764 | -2.844 | 0.004 | -3.671 | -0.675 |
| latitude_2 | 23.6591 | 1.271 | 18.621 | 0.000 | 21.168 | 26.150 |
| number_of_reviews_2 | 1.236e-05 | 2.72e-06 | 4.551 | 0.000 | 7.04e-06 | 1.77e-05 |
| calculated_host_listings_count_2 | 0.0003 | 2.3e-05 | 10.968 | 0.000 | 0.000 | 0.000 |
| longitude_3 | -0.0196 | 0.007 | -2.846 | 0.004 | -0.033 | -0.006 |
| latitude_3 | -0.3873 | 0.021 | -18.629 | 0.000 | -0.428 | -0.347 |
| calculated_host_listings_count_3 | -6.423e-07 | 6.95e-08 | -9.244 | 0.000 | -7.79e-07 | -5.06e-07 |
| longitude_chlc | 0.0489 | 0.015 | 3.319 | 0.001 | 0.020 | 0.078 |
| latitude_number_of_reviews | 0.0454 | 0.008 | 5.717 | 0.000 | 0.030 | 0.061 |
| latitude_chls | 0.1133 | 0.017 | 6.712 | 0.000 | 0.080 | 0.146 |
| number_of_reviews_chls | 3.562e-08 | 1.26e-08 | 2.835 | 0.005 | 1.1e-08 | 6.03e-08 |
| manhattan_entire_apt_num_reviews | -0.0022 | 0.001 | -3.214 | 0.001 | -0.003 | -0.001 |
| manhattan_entire_apt_available | 0.0006 | 8.02e-05 | 7.405 | 0.000 | 0.000 | 0.001 |
| manhattan_long_lat | -8.5e-05 | 5.13e-06 | -16.559 | 0.000 | -9.51e-05 | -7.49e-05 |
| entire_home_long_lat_manhattan | 5.484e-05 | 6.46e-06 | 8.485 | 0.000 | 4.22e-05 | 6.75e-05 |
| long_2_lat_2_number_of_reviews | -8.766e-08 | 1.71e-08 | -5.128 | 0.000 | -1.21e-07 | -5.41e-08 |

Omnibus: 1170.517 Durbin-Watson: 1.884
Prob(Omnibus): 0.000 Jarque-Bera (JB): 248821.793
Skew: -0.667 Prob(JB): 0.00
Kurtosis: 47.596 Cond. No. 1.21e+14

KDE_plot_price.png

Let's visualize the residual plot:

KDE_plot_price.png

Let's check the condition of equal variance using Levene's test:

statistic=2.885560933873544
pvalue=0.08948063279615048

Let's check the conditions of normality of this model:

KDE_plot_price.png

Observation:

  • The trend in the residuals has been removed significantly. Some data points still have large residual errors.
  • The constant variance condition is met in this case, as Levene's test shows (p ≈ 0.089, so equal variance is not rejected at the 5% level).
  • The normality condition of the residuals is not met, as some residuals are outliers. We can still move ahead with this model, since linear regression is fairly robust to departures from normality.
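Besides a visual check, the normality of residuals can be summarized with skewness, kurtosis and the Jarque-Bera test, the same quantities reported at the bottom of the regression summary. The sketch below uses `scipy.stats` on synthetic heavy-tailed residuals; it is an illustration, not the repository's code.

```python
import numpy as np
from scipy.stats import jarque_bera, kurtosis, skew


def normality_report(residuals) -> dict:
    """Skewness, Pearson kurtosis (normal = 3), and the Jarque-Bera test."""
    residuals = np.asarray(residuals, dtype=float)
    jb_stat, jb_pvalue = jarque_bera(residuals)
    return {
        "skew": float(skew(residuals)),
        "kurtosis": float(kurtosis(residuals, fisher=False)),
        "jarque_bera": float(jb_stat),
        "p_value": float(jb_pvalue),
    }


# Synthetic heavy-tailed residuals (Student's t with 2 degrees of freedom).
rng = np.random.default_rng(0)
heavy_tailed = rng.standard_t(2, size=3000)
report = normality_report(heavy_tailed)
```

A kurtosis far above 3, as in the WLS summary (47.596), indicates a few very large residuals rather than a broadly skewed distribution.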

Future Scope:

  • We can remove the outlier or high leverage points to further improve the model.
  • We can explore Tree based or Neural network based techniques to improve the model further.
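As a starting point for the first item, outliers and high-leverage points can be flagged from the design matrix and the residuals. The sketch below uses plain NumPy; the cutoffs (leverage above 2p/n, absolute studentized residual above 3) are common rules of thumb and are assumptions, as are the function name and the synthetic data.

```python
import numpy as np


def flag_influential(design, response, resid_cutoff: float = 3.0):
    """Boolean mask of rows with high leverage or a large studentized residual."""
    X = np.asarray(design, dtype=float)
    y = np.asarray(response, dtype=float)
    n, p = X.shape
    # Leverage is the diagonal of the hat matrix, computed via a thin QR.
    q, _ = np.linalg.qr(X)
    leverage = np.sum(q ** 2, axis=1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    studentized = resid / np.sqrt(sigma2 * (1 - leverage))
    return (leverage > 2 * p / n) | (np.abs(studentized) > resid_cutoff)


# Synthetic line with one corrupted observation at index 25.
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 1 + 2 * x + rng.normal(0, 1, 50)
y[25] += 100
flags = flag_influential(np.column_stack([np.ones(50), x]), y)
```

Rows where the mask is true could be inspected or dropped before refitting the model.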
