This assignment covers the text and sentiment analysis module of the visual analytics project. It is part of the bigger Shiny-based Visual Analytics Application (Shiny-VAA) project, in which each team member is required to select a sub-module from the proposed Shiny-VAA and to complete the tasks below.
Airbnb is an online marketplace platform for accommodation rental. Founded in 2008 by Brian Chesky and Joe Gebbia, who put an air mattress in their living room and offered bed & breakfast (hence “Airbnb”), the company has grown into one of the most popular short-term accommodation rental platforms in multiple countries around the world.
With millions of listings in 220 countries and over 100,000 cities, Airbnb has a rich store of data from transactions between hosts and guests. This includes structured data, such as price, number of facilities (e.g. bedrooms, bathrooms), and minimum and maximum number of nights’ stay, as well as unstructured text data, such as the description of the accommodation and reviews by guests.
This assignment focuses on analysis of unstructured text data from Airbnb’s online marketplace.
Multiple attempts at text and sentiment analysis have been made on Airbnb data sets using tools such as SAS Enterprise Miner, Python, and others. Many practitioners prefer to perform text analysis in Python, using tools like VADER (Valence Aware Dictionary and sEntiment Reasoner) for sentiment analysis, while R is typically preferred for data analysis. That said, R does have a rich set of tools specifically developed for text and sentiment analysis.
Text analytics on Airbnb data typically consists of analysing user reviews to understand the correlation between the words and phrases used and the rating given to the accommodation. Such analysis can reveal the kind of accommodation, facilities, and service standards that guests expect, and what would be considered ‘good’ or ‘bad’ accommodation or service, thus leading to the guest’s high or low rating of the accommodation.
This assignment attempts to conduct text and sentiment analysis on the host’s description of the accommodation rather than the guest’s review of their experience. The aim is to analyse how the way a host “sells” their listing might correlate with a high or low rating by a guest. This approach places greater focus on the host (rather than the guest) and on how the host emphasises the accommodation’s strengths and unique selling points.
This assignment (and the project as a whole) also attempts to develop an interactive application, available freely to anyone, in which input variables such as the region of the country, the review score, and the number of sentiment topics can be manipulated and customised based on the application user’s requirements. This will allow users to focus on their unique areas of concern. A review of current Airbnb analysis platforms revealed only one such interactive website; however, it is a paid service.
Some examples of other text and sentiment analysis done on Airbnb data are listed below:
1. Sentiment Analysis of Airbnb Boston Listing Reviews
2. AirBNB Data Analysis
3. Text analysis and Sentiment analysis of AirBnb Users’ reviews using SAS Enterprise Miner
4. Airbnb Price Prediction: Data Analysis with Python | Making Models (I)
5. Airbnb Rental Listings Dataset Mining
6. AIRDNA
The text analysis module should display the wordcloud and the topic model on separate panes. On the left-hand pane, the user will be able to customise controls such as the region of interest within the country, the rating score to be analysed, the number of clusters for topic modelling, and the number of words for each topic. Other customisations that could be included are the size of the wordcloud and the colour scheme.
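As an illustration, a minimal sketch of how such a layout could be wired up in Shiny is shown below; all input IDs (region, score, n_topics, n_words), defaults, and choices are placeholders rather than the final module design.
library(shiny)
# Sketch of the module layout: controls on the left pane, wordcloud and
# topic model on separate panes on the right. All IDs are illustrative.
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("region", "Region of interest", choices = NULL),
      sliderInput("score", "Review score range", min = 0, max = 100,
                  value = c(80, 100)),
      numericInput("n_topics", "Number of topics (k)", value = 3, min = 2),
      numericInput("n_words", "Words per topic", value = 10, min = 5)
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Wordcloud", plotOutput("wordcloud")),
        tabPanel("Topic model", plotOutput("topics"))
      )
    )
  )
)
server <- function(input, output, session) {
  # Rendering logic for the wordcloud and topic model would go here
}
shinyApp(ui, server)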
# Install (if necessary) and load all required packages
packages <- c('tidyverse', 'stringr', 'dplyr', 'lubridate', 'anytime', 'shiny',
              'shinydashboard', 'plotly', 'corrplot', 'heatmaply', 'tidytext',
              'tm', 'SnowballC', 'wordcloud', 'topicmodels')
for (p in packages){
  if (!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
# Import the raw Airbnb listings data
airbnb_raw <- read_csv("data/listings.csv")
Many of the columns are not relevant to the analysis, so only the relevant columns are selected (by position).
airbnb <- select(airbnb_raw, c(1, 6, 7, 8, 10, 13, 14, 15, 16, 17, 18, 19, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 49, 50, 51, 52, 53, 55, 56,
57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 68, 69, 70, 71,
72, 73, 74, 75, 76, 77, 80))
# Convert percentage strings (e.g. "95%") to numeric proportions
airbnb$host_response_rate <-
  as.numeric(str_remove_all(airbnb$host_response_rate, "%")) / 100
airbnb$host_acceptance_rate <-
  as.numeric(str_remove_all(airbnb$host_acceptance_rate, "%")) / 100
# Strip the currency symbol and thousands separator from price,
# which would otherwise produce NA when coerced to numeric
airbnb$price <- as.numeric(gsub("[$,]", "", airbnb$price))
# Recode the literal string "N/A" as a true missing value
airbnb$host_response_time <- na_if(airbnb$host_response_time, "N/A")
# Save the cleaned data set
write.csv(airbnb, "data/Airbnb.csv", row.names = FALSE)
# Load the prepared subset of listings for the Victoria region
airbnb <- read_csv("data/Airbnb_victoria.csv")
# Shiny app will allow the analyst to select the region of concern and also the range of review scores for analysis.
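For reference, the filtering step the app would apply before tokenisation might look like the sketch below. The column names neighbourhood_cleansed and review_scores_rating are assumptions based on the Inside Airbnb listings schema, and filter_listings is a hypothetical helper.
library(dplyr)
# Sketch: subset listings by region and review-score range, mirroring
# the controls the Shiny app would expose. Column names are assumptions.
filter_listings <- function(data, regions, score_range) {
  data %>%
    filter(neighbourhood_cleansed %in% regions,
           between(review_scores_rating, score_range[1], score_range[2]))
}
# Example usage: listings in two regions with review scores of 90-100
# airbnb_subset <- filter_listings(airbnb, c("Melbourne", "Yarra"), c(90, 100))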
The first step in preparing unstructured textual data for analysis is to tokenise the data. Tokenisation is the process of breaking text up into units, or ‘tokens’ (here, individual words). The output is a tibble of all the words in the descriptions.
# Extract the description column and normalise the text encoding
airbnb_desc <- airbnb$description
airbnb_text <- iconv(airbnb_desc, to = "UTF-8")
airbnb_tibble <- tibble(airbnb_text)
# Tokenise the descriptions into individual words
airbnb_tibble %>%
  unnest_tokens(word, airbnb_text)
# A tibble: 5,408,725 x 1
word
<chr>
1 this
2 two
3 bedroom
4 two
5 bathroom
6 beautifully
7 appointed
8 home
9 is
10 set
# ... with 5,408,715 more rows
The next step is to create a corpus. A corpus is a body of all texts included in the analysis.
# Create a corpus
airbnb_corpus <- Corpus(VectorSource(airbnb_text))
The next step is to clean the corpus. This includes removing extra white space, converting all text to lower case (otherwise upper- and lower-case words are treated as different even when spelt the same), and removing numbers, punctuation, and stopwords (common English words that do not add value to the analysis, such as ‘a’, ‘the’, ‘it’, etc.). The final cleaning step is stemming: reducing all words to their root form (e.g. ‘swimming’ to ‘swim’) so that variants of the same word do not skew the analysis.
# Remove white spaces between text
airbnb_corpus_clean <- tm_map(airbnb_corpus, stripWhitespace)
# Transform all characters to lower case
airbnb_corpus_clean <- tm_map(airbnb_corpus_clean, content_transformer(tolower))
# Remove numbers
airbnb_corpus_clean <- tm_map(airbnb_corpus_clean, removeNumbers)
# Remove punctuation
airbnb_corpus_clean <- tm_map(airbnb_corpus_clean, removePunctuation)
# Remove common words that do not add value to sentiment analysis
airbnb_corpus_clean <- tm_map(airbnb_corpus_clean, removeWords,
c((stopwords("english")), "%bbr", "%br"))
# Stem words to their root form
airbnb_corpus_clean <- tm_map(airbnb_corpus_clean, stemDocument)
The next step is to create a document-term matrix: a matrix in which each row represents a document, each column represents a term, and each cell records the frequency of that term in that document.
airbnb_corpus_dtm <- DocumentTermMatrix(airbnb_corpus_clean)
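As a quick sanity check (not part of the original flow), tm’s inspect() summarises the matrix, and removeSparseTerms() is an optional step that drops very rare terms; the 0.99 threshold below is an illustrative value.
# Check the dimensions and sparsity of the document-term matrix
inspect(airbnb_corpus_dtm)
# Optionally drop very rare terms (0.99 keeps terms appearing in at
# least ~1% of documents) to keep downstream matrices manageable
airbnb_corpus_dtm_small <- removeSparseTerms(airbnb_corpus_dtm, 0.99)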
The visualisations can now be prepared.
The first visualisation is a word cloud. A word cloud is a visual representation of the frequency of words in the corpus. The more common the word, the larger it will appear.
The wordcloud below shows the words most commonly appearing in the listings’ descriptions.
# Create the data frame from which the wordcloud will be built
sums <- as.data.frame(colSums(as.matrix(airbnb_corpus_dtm)))
sums <- rownames_to_column(sums)
colnames(sums) <- c("terms", "count")
sums <- arrange(sums, desc(count))
top_words <- sums
# Wordcloud will show the most common terms used in the 'description' column
wordcloud(words = top_words$terms, freq = top_words$count, min.freq = 100,
          max.words = 1000, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Topic modeling using Latent Dirichlet Allocation
The next visualisation is a topic model. Topic modeling is an unsupervised machine learning technique that detects word and phrase patterns in documents and clusters them into groups known as topics.
Latent Dirichlet Allocation (LDA) is one common topic modeling technique. Its basic assumption is that documents about similar topics use similar words (the distributional hypothesis). The purpose of LDA is to map the corpus to a set of topics that covers a significant number of the words in the documents.
LDA treats each document as written from a mixture of topics, and each topic as a distribution over words; for example, a topic related to accommodation would assign high probability to words such as ‘bedroom’ and ‘bathroom’. Since every word in a document can be assigned a probability of belonging to each topic, the goal of LDA is to determine the mixture of topics that each document contains. In tidytext, the per-topic word probabilities are extracted as the beta matrix.
airbnb_lda <- LDA(airbnb_corpus_dtm, k = 3)
# Shiny app will allow the analyst to determine the number of topics (k).
airbnb_topics <- tidy(airbnb_lda, matrix = "beta")
# Display the 10 most common terms within each topic
airbnb_top_terms <- airbnb_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
# Shiny app will allow the analyst to determine the top n number of words to list.
# Visualise the output
airbnb_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
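For completeness, the per-document topic mixtures (the gamma matrix) can be extracted in the same way; the sketch below, not part of the module above, shows the dominant topic for each description.
# Per-document topic proportions from the fitted LDA model
airbnb_gamma <- tidy(airbnb_lda, matrix = "gamma")
# Dominant topic for each listing description
airbnb_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()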
Thanks for visiting my blog!
This post is a data visualisation assignment for the MITB programme of the Singapore Management University.