Airbnb-Reviews-Text-Analysis

Text analysis of user reviews for AirBnB listings in Bronx and Staten Island. Covers "Document Term Matrix", "Topicmodels" and "Sentiment Analysis"

Running the app directly from RStudio

shiny::runGitHub("Airbnb-Reviews-Text-Analysis", "sameer-sinha")

Data Details

Data used for this project can be found here, under the New York City, New York, United States heading. We used the compressed files because they contain more data.

Data Cleaning

Because one objective of our project was to create predictive models for price estimation, we carried out the necessary data cleaning steps, using JMP Pro 13.

  • From the initial analysis of the price field in listing.csv, we observed some significant outliers. To build a robust predictive model, we kept only values between $1 and $500.
  • Several important features had NULL values. We handled the missing values as follows:
    • Removed unwanted columns
    • For continuous features, replaced null values with the mean
    • For nominal features, replaced null values with the mode or created an additional category for null values
  • A new field was created to represent the price of the listing using the formula: Price = Price + Cleaning Price
  • After checking the distribution of price and other continuous features, we applied a log transformation to some of the features so that their distribution would be closer to a normal distribution
  • The cleaned data was saved in the file listing 2.csv, which we then used for predictive modeling and text analysis
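
The project did these steps in JMP Pro 13. As a rough illustration only, the same steps might look like this in base R. The toy data frame and its column names (`price`, `cleaning_fee`, `room_type`) are assumptions, not the actual fields of listing.csv.

```r
# Toy stand-in for the listings data; column names are assumptions.
listings <- data.frame(
  price        = c(80, 120, 0, 950, 200, 60),
  cleaning_fee = c(20, NA, 10, 100, 50, NA),
  room_type    = c("Entire home", "Private room", "Private room",
                   "Entire home", "Private room", NA),
  stringsAsFactors = FALSE
)

# Keep only listings priced between $1 and $500.
cleaned <- listings[listings$price >= 1 & listings$price <= 500, ]

# Continuous feature: replace missing values with the column mean.
cleaned$cleaning_fee[is.na(cleaned$cleaning_fee)] <-
  mean(cleaned$cleaning_fee, na.rm = TRUE)

# Nominal feature: replace missing values with the most frequent value.
stat_mode <- function(x) names(which.max(table(x)))
cleaned$room_type[is.na(cleaned$room_type)] <- stat_mode(cleaned$room_type)

# Total price = price + cleaning price, then a log transform.
cleaned$total_price <- cleaned$price + cleaned$cleaning_fee
cleaned$log_total_price <- log(cleaned$total_price)
```

Creating a separate "missing" category, the alternative mentioned above for nominal features, would replace the mode step with an assignment of a fixed label.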

Data Preparation

Apart from the basic data cleaning, some processing was required to prepare the data for text analytics. The code for this can be found in the repository files. Because the text-processing algorithms take time to run, especially on data of this size, I saved the results in .RData files and loaded them in the ui.R and server.R files of the Shiny app.
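
A minimal sketch of the save-once, load-later pattern described here, using base R `save()` and `load()`. The object name `term_counts` and the file name are hypothetical, not the names used in the repository.

```r
# Stand-in for an expensive text-processing result (hypothetical object).
term_counts <- c(clean = 42, host = 37, subway = 15)

# Save once, after the slow preprocessing step has finished.
rdata_path <- file.path(tempdir(), "text_results.RData")  # hypothetical file name
save(term_counts, file = rdata_path)

# Later (for example in ui.R or server.R), load the precomputed objects
# instead of recomputing them. load() returns the names it restored.
loaded_env <- new.env()
loaded_names <- load(rdata_path, envir = loaded_env)
```

Loading into a separate environment is optional; calling `load(rdata_path)` directly restores the objects into the current environment.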

Text Analytics

Text Mining

  • Used the tm package in R for text mining.
  • Created a corpus of user reviews and carried out the necessary cleaning of the corpus
  • Used SnowballC to implement Porter's word stemming algorithm
  • Created a Document Term Matrix (DTM) from the cleaned corpus.
  • Term counts from this DTM are used to find high-frequency terms based on either
    • A given range of top percentage
    • A minimum frequency
  • Used the wordcloud package to create a word cloud
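
The steps above can be sketched roughly as follows. This is an illustration on a few made-up reviews, not the repository's code; the exact cleaning steps, the 25% share, and the frequency cutoff of 3 are assumptions.

```r
library(tm)         # corpus handling and DocumentTermMatrix
library(SnowballC)  # Porter stemmer used by tm::stemDocument

# Toy reviews standing in for the real review text.
reviews <- c(
  "The host was friendly and the room was very clean.",
  "Clean room, friendly host, close to the subway!",
  "Noisy street but the host was helpful.",
  "Great location, clean apartment and helpful hosts."
)

# Build and clean the corpus (the choice of steps is an assumption).
corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)   # Porter stemming via SnowballC
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)

# Term counts across all documents, highest first.
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Option 1: terms with at least a minimum frequency.
min_freq_terms <- findFreqTerms(dtm, lowfreq = 3)

# Option 2: the top share of terms by frequency (here the top 25%).
top_share <- 0.25
top_terms <- head(term_freq, ceiling(length(term_freq) * top_share))

# Word cloud of the term counts, drawn only if wordcloud is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(term_freq), term_freq, min.freq = 1)
}
```

For a corpus of realistic size, `as.matrix(dtm)` can be memory-hungry; `slam::col_sums(dtm)` computes the same counts on the sparse representation.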

Topic Models

  • Used the slam package to remove sparse terms from the DTM and thereby reduce its dimension
  • Used the topicmodels package in R to cluster groups of words into 10 topics and to get the posterior probability of the topics for each document and of the terms for each topic
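
A minimal sketch of this workflow on toy data. The project used 10 topics; the sketch uses `k = 2` only because the toy corpus is tiny. The sparsity rule (keep terms occurring at least twice) and the seed are assumptions.

```r
library(tm)
library(slam)         # col_sums / row_sums on sparse matrices
library(topicmodels)  # LDA and posterior

# Toy reviews (already lowercase, no punctuation).
reviews <- c(
  "clean room friendly host",
  "friendly host clean apartment",
  "subway close station walk",
  "close subway station bus",
  "clean room quiet host",
  "station walk subway bus"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(reviews)))

# Drop sparse terms: keep only terms that occur at least twice overall.
dtm <- dtm[, col_sums(dtm) >= 2]

# LDA requires every document to contain at least one remaining term.
dtm <- dtm[row_sums(dtm) > 0, ]

k <- 2  # the project used 10 topics
lda_model <- LDA(dtm, k = k, control = list(seed = 1234))

post <- posterior(lda_model)
doc_topics  <- post$topics  # documents x topics, rows sum to 1
topic_terms <- post$terms   # topics x terms, rows sum to 1

# Three most probable terms per topic.
top_terms <- terms(lda_model, 3)
```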

Sentiment Analysis

  • Used qdap for sentiment analysis
  • Created a polarity score for each review using the polarity function of the qdap package
  • Combined the polarity variable with the reviews data frame to plot results and check the correlation between polarity and user ratings
  • Helpful and Not Helpful variables were created based on a threshold (depending on the minimum and maximum values), and a spineplot was created for visualization
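
A rough sketch of the polarity step on made-up reviews. The `rating` column is hypothetical, and the source does not say which variable the threshold is applied to or how it is derived; the sketch assumes the polarity score and uses the midpoint between its minimum and maximum.

```r
library(qdap)

# Toy reviews with a hypothetical rating column.
reviews <- data.frame(
  comments = c("great clean wonderful place",
               "dirty terrible noisy room",
               "nice friendly helpful host",
               "awful rude horrible stay"),
  rating = c(5, 1, 4, 2),
  stringsAsFactors = FALSE
)

# One polarity score per review; qdap returns them in the `all` data frame.
pol <- polarity(reviews$comments)
reviews$polarity <- pol$all$polarity

# Correlation between polarity and user ratings.
polarity_rating_cor <- cor(reviews$polarity, reviews$rating)

# Assumed threshold: midpoint between the minimum and maximum polarity.
threshold <- (min(reviews$polarity) + max(reviews$polarity)) / 2
reviews$helpful <- factor(
  ifelse(reviews$polarity > threshold, "Helpful", "Not Helpful")
)

# Spineplot of the thresholded variable against the rating.
spineplot(helpful ~ factor(rating), data = reviews)
```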

Visualization of Text Mining and Topic Models

  • Used the ggplot2 package to draw bar graphs
  • Used shiny to create interactive visualizations
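
As an illustration of how the two packages fit together, here is a small self-contained app with a slider that filters a term-frequency bar graph. The input and output IDs, the helper `freq_plot`, and the toy counts are all hypothetical; the actual ui.R and server.R in the repository may be organized differently.

```r
library(shiny)
library(ggplot2)

# Toy term counts; in the project these would come from the saved results.
term_freq <- data.frame(
  term  = c("host", "clean", "room", "subway", "quiet"),
  count = c(40, 35, 22, 12, 5)
)

# Build the bar graph for terms with at least `min_freq` occurrences.
freq_plot <- function(min_freq) {
  shown <- term_freq[term_freq$count >= min_freq, ]
  ggplot(shown, aes(x = reorder(term, count), y = count)) +
    geom_col() +
    coord_flip() +
    labs(x = "Term", y = "Frequency")
}

ui <- fluidPage(
  sliderInput("min_freq", "Minimum term frequency",
              min = 1, max = 40, value = 10),
  plotOutput("freq_plot")
)

server <- function(input, output) {
  # Redraw the plot whenever the slider value changes.
  output$freq_plot <- renderPlot(freq_plot(input$min_freq))
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app interactively
```

Keeping the plotting logic in a plain function makes it possible to check the plot outside of a running Shiny session.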

Future Agendas

  • Extend the application by using tf-idf (term frequency–inverse document frequency) weights in place of raw term frequencies as DTM entries, since tf-idf measures the relative importance of a word to a document.
  • Use a precompiled list of words with positive and negative meanings
  • Create visualization for sentiment analysis
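
For the tf-idf item, a minimal sketch of what the change could look like with tm's `weightTfIdf` weighting. This is a possible direction on toy reviews, not something the project currently implements.

```r
library(tm)

# Toy reviews: "clean" and "room" occur in every document.
reviews <- c(
  "clean room friendly host",
  "clean room near subway",
  "clean room noisy street"
)
corpus <- VCorpus(VectorSource(reviews))

# Raw counts versus tf-idf weights as DTM entries.
dtm_counts <- DocumentTermMatrix(corpus)
dtm_tfidf  <- DocumentTermMatrix(corpus,
                                 control = list(weighting = weightTfIdf))

counts <- as.matrix(dtm_counts)
tfidf  <- as.matrix(dtm_tfidf)
```

Terms that occur in every review, such as `clean` here, get a tf-idf weight of zero, while terms specific to one review, such as `friendly`, keep a positive weight.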
