Airbnb is an online marketplace that connects people who want to rent out their homes with people looking for accommodation in that area. NYC is the most populous city in the United States, and one of the most popular destinations for tourism and business in the world.
Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to find a more unique, personalized way of experiencing the world. Today Airbnb has become a one-of-a-kind service used all over the world.
Data analysts have become crucial to a company that offers millions of listings. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers’ and providers’ behavior on the platform, implementing innovative additional services, guiding marketing initiatives, and much more.
The most basic information about the dataset comes from df.info(), which lists each column with its dtype and non-null count.
On basic inspection, each property name belongs to a single host, but one host can have several properties in an area. So host_name is a categorical variable here. neighbourhood_group (Manhattan, Brooklyn, Queens, Bronx, Staten Island), neighbourhood and room_type (Private room, Shared room, Entire home/apt) also fall into this category.
id, latitude, longitude, price, minimum_nights, number_of_reviews, last_review, reviews_per_month, calculated_host_listings_count and availability_365 are numerical variables.
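A minimal sketch of that categorical/numerical split, assuming a tiny made-up DataFrame that only borrows a few of the dataset's column names (the rows are not real listings):

```python
import pandas as pd

# Made-up sample rows; only the column names come from the dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "host_name": ["Ann", "Ann", "Raj"],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [150, 90, 40],
    "reviews_per_month": [0.5, 1.2, None],
})

df.info()  # prints dtype and non-null count for every column

# Numeric dtypes on one side, everything else (text labels) on the other.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

A column such as id is numeric by dtype but is really an identifier, so a dtype-based split like this is only a starting point.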
After looking at the five-number summary of the data, I was curious about the distribution of price over the entire dataset, and found something like this.
I used seaborn’s distplot to plot this distribution curve (newer seaborn releases deprecate distplot in favour of histplot and displot).
As we can see, the distribution is positively skewed, with a long tail at the extreme right. The skewness is 19.118939 and the kurtosis is 585.672879. A skewness above 1 together with such a high kurtosis points to a good number of outliers; we will come back to this when we handle outliers.
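Here is a small sketch of how such figures can be computed with pandas. The prices are made up (mostly modest values plus one extreme listing), so the numbers it prints will not match the ones quoted above:

```python
import pandas as pd

# Illustrative prices only: one extreme value creates a long right tail.
prices = pd.Series([50, 60, 75, 80, 100, 120, 150, 200, 10000], name="price")

print(prices.describe())         # count, mean, std, min, quartiles, max

skewness = prices.skew()         # > 0 means a longer right tail
excess_kurtosis = prices.kurt()  # Fisher definition: a normal distribution gives 0

print(f"skewness: {skewness:.3f}, kurtosis: {excess_kurtosis:.3f}")
```

With a single outlier like this, the mean is pulled far above the median, which is the same pattern a strongly right-skewed price column shows.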
We’ll look at the relationship between these two numerical variables using a seaborn scatter plot, as below:
Next comes the correlation matrix, to understand how the features are related to each other. I plotted it as a seaborn heatmap to see the strength of the relationship between each pair of variables.
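A hedged sketch of a correlation heatmap along these lines. The rows are invented, the colour map and figure size are my own choices, and the Agg backend is only there so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows using column names from the dataset.
df = pd.DataFrame({
    "price": [150, 90, 40, 300, 75, 120],
    "minimum_nights": [1, 3, 2, 30, 1, 5],
    "number_of_reviews": [45, 10, 0, 2, 120, 30],
    "availability_365": [365, 120, 0, 300, 45, 200],
    "room_type": ["Private room", "Entire home/apt", "Shared room",
                  "Entire home/apt", "Private room", "Private room"],
})

# Pearson correlation, restricted to the numeric columns.
corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between numerical features")
fig.tight_layout()
```

Fixing vmin and vmax at -1 and 1 keeps the colour scale symmetric, so positive and negative correlations of the same strength look equally intense.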
Let’s now check the distribution of price across Manhattan, Brooklyn, Queens, Bronx and Staten Island. Instead of checking the distribution for each category one by one, we can simply draw a violin plot to get the overall statistics for each group, including the median price per neighbourhood group. As usual, Manhattan is the costliest place to stay, with listings priced at more than 140 USD, followed by Brooklyn at around 80 USD on average.
Queens and Staten Island have similar listing prices.
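Since the median and the average can differ a lot on skewed prices, it may help to compute both per neighbourhood group. A minimal sketch with made-up rows (the prices are not taken from the dataset):

```python
import pandas as pd

# Invented listings; each borough gets one expensive outlier on purpose.
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn",
                            "Queens", "Queens"],
    "price": [120, 150, 900, 70, 90, 400, 60, 80],
})

# Median, mean and listing count per borough, highest median first.
summary = (
    df.groupby("neighbourhood_group")["price"]
      .agg(["median", "mean", "count"])
      .sort_values("median", ascending=False)
)
print(summary)
```

The same grouping can be drawn with sns.violinplot(data=df, x="neighbourhood_group", y="price"), where the inner box marks the median of each group.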
The bar plot above shows, for each neighbourhood group of NYC, the neighbourhood whose listings have the highest average price per day.
Among these top neighbourhoods, the top two, Fort Wadsworth and Sea Gate, are in Staten Island and Brooklyn respectively.
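One way to get the top neighbourhood per group is a two-level groupby followed by idxmax. This is a sketch on invented rows, not the data behind the bar plot:

```python
import pandas as pd

# Made-up listings; neighbourhood names are only used as labels here.
df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Brooklyn",
                            "Queens", "Queens", "Queens"],
    "neighbourhood": ["Williamsburg", "Williamsburg", "Bushwick",
                      "Astoria", "Flushing", "Flushing"],
    "price": [200, 100, 80, 90, 60, 70],
})

# Average price for every (borough, neighbourhood) pair.
avg_price = (
    df.groupby(["neighbourhood_group", "neighbourhood"])["price"]
      .mean()
      .reset_index(name="avg_price")
)

# Keep the row with the highest average inside each borough.
top_rows = avg_price.groupby("neighbourhood_group")["avg_price"].idxmax()
top_per_group = avg_price.loc[top_rows]
print(top_per_group)
```

A neighbourhood with very few listings can top a ranking like this on the strength of one expensive listing, so it is worth checking the listing count next to the average.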
Seaborn’s stripplot function always treats one of the variables as categorical and draws data at ordinal positions (0, 1, … n) on the relevant axis, even when the data has a numeric or date type. So what can we conclude from this other kind of scatter plot?
Private rooms received the most reviews per month, with Manhattan on top at more than 50 reviews per month for a Private room.
Manhattan and Queens got the most reviews for the Entire home/apt room type.
Shared rooms received fewer reviews than the other room types, and those came mainly from Staten Island, followed by Bronx.
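The same comparison can be read off a small table instead of the strip plot. A minimal sketch, assuming invented review rates (none of these values come from the dataset):

```python
import pandas as pd

# Made-up review rates for two room types in two boroughs.
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Queens",
                            "Manhattan", "Queens", "Queens"],
    "room_type": ["Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Entire home/apt"],
    "reviews_per_month": [52.0, 2.0, 20.5, 13.0, 14.0, 1.0],
})

# Highest reviews_per_month for each room type (rows) and borough (columns).
max_reviews = (
    df.groupby(["room_type", "neighbourhood_group"])["reviews_per_month"]
      .max()
      .unstack()
)
print(max_reviews)
```

Swapping max() for median() would show the typical listing rather than the single most-reviewed one, which is what the extreme points in a strip plot tend to draw the eye to.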
Now, let’s check the distribution of room types across all neighbourhood groups of NYC!
From the two scatterplots of latitude vs longitude, we can infer that there are very few shared rooms throughout NYC compared to Private room and Entire home/apt listings.
95% of the listings on Airbnb are either Private room or Entire home/apt. Very few guests opted for shared rooms on Airbnb.
Also, guests mostly prefer these room types when they are looking for a rental on Airbnb, as we found earlier in our analysis.
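A share like this comes straight from value_counts with normalize=True. In the sketch below the counts are made up and chosen only so the arithmetic is easy to follow:

```python
import pandas as pd

# 20 invented listings: 10 entire homes, 9 private rooms, 1 shared room.
room_type = pd.Series(
    ["Entire home/apt"] * 10 + ["Private room"] * 9 + ["Shared room"] * 1,
    name="room_type",
)

# Percentage of listings per room type, rounded to one decimal place.
share = room_type.value_counts(normalize=True).mul(100).round(1)
print(share)
```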
The scatterplot shows how price varies across these coordinates in a more authentic way, drawn over the original NYC boroughs map: I saved the map image in my local directory and then read it using the cv2 imread function.
We can infer that Manhattan has the widest range of high prices, followed by Brooklyn and Queens, which makes it the costliest place to stay in NYC.
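A rough sketch of the overlay technique, under several assumptions: a plain grey array stands in for the map image (so the snippet does not need the file or cv2), the bounding box is an approximate guess for NYC, and the points and prices are invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder for the map image. With cv2.imread the array comes back in
# BGR order and would need converting to RGB before plotting.
map_img = np.full((100, 100, 3), 230, dtype=np.uint8)

# Assumed bounding box of the image: (lon_min, lon_max, lat_min, lat_max).
extent = (-74.26, -73.69, 40.49, 40.92)

# Invented listing coordinates and prices.
lon = np.array([-73.98, -73.95, -73.85, -74.10])
lat = np.array([40.75, 40.68, 40.73, 40.60])
price = np.array([250, 120, 80, 60])

fig, ax = plt.subplots(figsize=(6, 6))
# extent maps the image's pixels onto longitude/latitude axes.
ax.imshow(map_img, extent=extent, aspect="auto", zorder=0)
points = ax.scatter(lon, lat, c=price, cmap="jet", s=20, alpha=0.6, zorder=1)
fig.colorbar(points, ax=ax, label="price (USD)")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
```

The extent has to match the real geographic bounds of whatever map image is used, otherwise the points will not line up with the boroughs.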
Listing availability over the year throughout NYC?
I’ve plotted a scatterplot of listing availability throughout NYC over a year, using hue and marker size to encode the availability ranges.
Bronx and Staten Island have listings that are mostly available throughout the year, which might be because they are less costly than other boroughs such as Manhattan, Brooklyn and Queens.
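Availability ranges like the ones used for the hue can be built with pd.cut. The bin edges and labels below are my own assumption, since the exact ranges are not stated, and the values are made up:

```python
import pandas as pd

# Invented availability values (days per year).
availability = pd.Series([0, 15, 90, 180, 300, 365], name="availability_365")

# Assumed bin edges; intervals are closed on the right, e.g. (90, 180].
bins = [-1, 90, 180, 270, 365]
labels = ["0-90", "91-180", "181-270", "271-365"]

availability_range = pd.cut(availability, bins=bins, labels=labels)
print(availability_range.value_counts().sort_index())
```

The resulting column can then be passed as both hue and size to sns.scatterplot, so that each availability range gets its own colour and marker size.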
I’ve almost reached the end of the analysis. There may be a few more analyses that could be done, but every story has an ending!