Elon_Musk_Tweets.Rmd

---
title: "Text Mining - Elon Musk Tweets"
author: "Siddharth Sudhakar"
date: "06/06/2020"
output:
  html_document:
    css: style1.css
    highlight: monochrome
    theme: cosmo
    toc: yes
    toc_depth: 4
    toc_float: no
  pdf_document:
    toc: yes
    toc_depth: '4'

---

## <b>Introduction</b> 

Just finished Text Mining with R course on Datacamp,really excited to apply my new skills into practice!!!. 
The dataset which I will be using can be found [here](https://data.world/adamhelsinger/elon-musk-tweets-until-4-6-17#).It contains Elon Musk's tweets from 2010-06-04 to 2017-04-05.It includes Tweet ID, Date & Time Tweet was created, Tweets & Mentions.<br>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
```
I will be using the following packages
```{r}
library(tidyverse)
library(wordcloud)
library(RColorBrewer)
library(qdap)
library(tm)
library(gridExtra)
library(dendextend)
library(ggthemes)
library(RWeka)
library(tidytext)
library(textdata)
library(lubridate)
library(anytime)

theme_func <- function() {
  theme_minimal() +
  theme(
    text = element_text(family = "serif", color = "gray25"),
    plot.subtitle = element_text(size = 12,hjust = 0.5,color = "gray45"),
    plot.caption = element_text(color = "gray30"),
    plot.background = element_rect(fill = "gray95"),
    plot.margin = unit(c(5, 10, 5, 10), units = "mm"),
    plot.title = element_text(hjust = 0.5),
    strip.text = element_text(color = "white")
  )
  }
  
data <- read_csv("elonmusk_tweets.csv")
str(data)


```



Lets have a look at some of the latest tweets from him.
```{r}
data <- data[!data$text == '',] 
data$text[1:5]
```

## <b>Text Preprocessing</b>
Which includes converting all letters to lower case, removing URL, removing anything other than English letter and space, removig stopwords,extra white space and perform stemming. 

```{r}
clean_corpus <- function(corpus) {
  # Transform to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  # remove URLs 
  removeURL <- function(x) gsub("http[^[:space:]]*", "", x) 
  corpus <- tm_map(corpus, content_transformer(removeURL)) 
  # remove anything other than English letters or space 
  removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x) 
  corpus <- tm_map(corpus, content_transformer(removeNumPunct))
  # Strip whitespace
  corpus <- tm_map(corpus, stripWhitespace)
    # Add more stopwords
  corpus <- tm_map(corpus,removeWords, words = c(stopwords("en"),"brt","amp","s"))
  #stemming
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}
```


we have our text data as a vector, we convert this vector to a corpus.And then call the clean corpus function to the text data.Now reexamine the contents of the first document. <br>
Have a look at the difference before and after Text Preprocessing.

```{r echo=FALSE}
tm_vector <- VectorSource(data$text)

# Apply VCorpus
tm_corpus <- VCorpus(tm_vector)

# Examine the first document's contents
content(tm_corpus[[1]])

# Clean the text
clean_tweets <- clean_corpus(tm_corpus)

# Reexamine the contents of the first doc
content(clean_tweets[[1]])

```
Wow! Thats some serious processing done.



```{r echo=FALSE}
# Create tf_dtm
tweets_tdm <- TermDocumentMatrix(clean_tweets,control = list(wordLengths = c(1, Inf)))

tweets_tdm
# Create tf_dtm_m
tweets_m <- as.matrix(tweets_tdm)


```

## <b>Top Frequent Words in his Tweets</b> 

Let us have a quick look at the top 15 most frequently used words by Elon Musk in his tweets.
```{r fig.align = "center"}
term_frequency <- rowSums(tweets_m)

# Sort term_frequency in decreasing order
term_frequency <- sort(term_frequency,decreasing = TRUE)


# Plot a barchart of the 10 most common words
barplot(term_frequency[1:15], col = "tan", las = 2) 
```


## <b>Clustering Words</b>

Here first we perform cluster analysis on the dissimilarities of the distance matrix. And then, visualize the word frequency distances using a dendrogram .
```{r fig.align = "center"}
# remove sparse terms
tdm2 <- removeSparseTerms(tweets_tdm, sparse=0.96)
m2 <- as.matrix(tdm2)
# cluster terms
distMatrix <- dist(scale(m2))
hc <- hclust(distMatrix)

hcd <- as.dendrogram(hc)

# Change the branch color to red for "marvin" and "gaye"
hcd_colored <- branches_attr_by_labels(hcd, c("teslamotor", "car","launch","rocket"), "red")

# Plot hcd_colored
plot(hcd_colored, main = "Cluster Dendrogram") 

```

We can see that car and teslamotor are clustered into one group, whereas launch and rocket are clustered into another group.

## <b>Word Association</b>
Another way to think about word relationships is with the findAssocs() function in the tm package. For any given word, findAssocs() calculates its correlation with every other word in a TDM. 

Let us find the words which are associated with stock,spacex and falcon.
```{r fig.align = "center"}
findAssocs(tweets_tdm, "stock", 0.4)
findAssocs(tweets_tdm, "spacex", 0.2)
findAssocs(tweets_tdm, "falcon", 0.2)
```


## <b>Working with Bigrams</b>

Here our focus is on tokens containing two words as it can help to extract useful phrases which leads to some additional insights.

```{r fig.align = "center"}
# Make tokenizer function 
tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Create unigram_dtm
unigram_dtm <- DocumentTermMatrix(clean_tweets)

# Create bigram_dtm
bigram_dtm <- DocumentTermMatrix(
  clean_tweets, 
  control = list(tokenize = tokenizer)
)

# Create bigram_dtm_m
bigram_dtm_m <- as.matrix(bigram_dtm)

# Create freq
freq <- colSums(bigram_dtm_m)

# Create bi_words
bi_words <- names(freq)


# Plot a wordcloud
par(bg="black") 
wordcloud(bi_words, freq,max.words=500,random.order=FALSE,c(4,0.4), col=terrain.colors(length(bi_words) , alpha=0.9) , rot.per=0.3)
```

Wow!! There are so many interesting Bi-grams - tesla model,model x,spacex dragon,space station,climate change,falcon rocket,launch pad,solar power and so on.


# <b>Sentiment Analysis</b>

This sentiment analysis algorithm is based on <b>"Bing"</b> lexicon.The Bing lexicon labels words as positive or negative.The lexicon object is obtained using [tidytext's](https://cran.r-project.org/web/packages/tidytext/) get_sentiments() function.
In the below steps we assign a polarity to each word and classify them as positive or negative based on the polarity.
```{r fig.align = "center"}
#Transform sentences into words
data_tibble <- data %>%
  unnest_tokens(output = "words", input = text, token = "words")
#Remove stop words from tibble
tweets_tibble_clean <- data_tibble %>%
  anti_join(stop_words, by=c("words"="word"))

word_freq <- tweets_tibble_clean %>% 
  # Inner join to bing lexicon by words = word
  inner_join(get_sentiments("bing"), by = c("words" = "word")) %>% 
  
  # Count by words and sentiment, weighted by count
  count(words, sentiment) %>%
  # Spread sentiment, using n as values
  spread(sentiment, n, fill = 0) %>%

  # Mutate to add a polarity column
  mutate(polarity = positive - negative)%>%
  filter(abs(polarity) >= 9) %>%
  mutate(
    pos_or_neg = ifelse(polarity > 0, "positive", "negative")
  )

```



## <b>Words that Contribute the most to Positive/Negative Sentiment Scores</b>

```{r fig.align = "center"}
# Plot polarity vs. (words reordered by polarity), filled by pos_or_neg
ggplot(word_freq, aes(x = reorder(words, polarity),y =  polarity, fill = pos_or_neg)) +
  geom_col() + 
    labs(
    x = "words",
    title = "Sentiment Word Frequency"
      )+
  theme_func()+

  # Rotate text and vertically justify
  theme(axis.text.x = element_text(angle = 55)) 
  
  

```

Majority of the words spoken by him are positive with love,awesome and cool having the greatest score.Similarly hard,risk and wrong are given highest score among the negative words.

## <b>Sentiment by Year</b>
Now we transition to the [AFINN](http://www2.imm.dtu.dk/pubdb/pubs/6010-full.html) lexicon.The AFINN lexicon has numeric values from 5 to -5, not just positive or negative. Unlike the Bing lexicon's sentiment, the AFINN lexicon's sentiment score column is called value.In the below steps I calculate the average sentiment score for each year.

```{r fig.align = "center"}
year_tibble <- data%>%
mutate(date = anytime::anydate(data$created_at))%>%
mutate(year = format(as.Date(date, format="%d/%m/%Y"),"%Y"))%>%
unnest_tokens(output = "words", input = text, token = "words")%>%
  anti_join(stop_words, by=c("words"="word"))%>%

  group_by(words)%>%
  mutate(count=n())

word_sentiments <- year_tibble %>%
  inner_join(get_sentiments("afinn"), by = c("words"="word")) %>%
  group_by(year) %>%
  summarize(score = sum(value *count) / sum(count))


word_sentiments %>%
  mutate(year = reorder(year, score)) %>%
  ggplot(aes(year, score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ggtitle("Sentiment Score of Elon Musk Tweets") + 
  labs(subtitle = "2010-2017",
       y = "Average sentiment score") +
  theme_func()

```

Initially I was puzzled looking at this,why only the year 2010 has an average negative sentiment score that too by a large margin.


Then I found out that there was only one tweet in the year 2010 and it was....
```{r}
twenty_ten <- data%>%
mutate(date = anytime::anydate(data$created_at))%>%
mutate(year = format(as.Date(date, format="%d/%m/%Y"),"%Y"))%>%
filter(year==2010)%>%
select(text,year)  

twenty_ten
```

## <b>Sentiment Analysis by hour of the day</b>
Here again we use the AFINN lexicon to calculate the average sentiment score by hour and visualise it using lollipop plot.
```{r fig.align = "center"}
hour_tibble <- data%>%
mutate(hour=format(data$created_at, "%H"))%>%
unnest_tokens(output = "words", input = text, token = "words")%>%
  anti_join(stop_words, by=c("words"="word"))%>%

  group_by(words)%>%
  mutate(count=n())

letters_sentiments1 <- hour_tibble %>%
  inner_join(get_sentiments("afinn"), by = c("words"="word")) %>%
  group_by(hour) %>%
  summarize(score = sum(value *count) / sum(count))

ggplot(letters_sentiments1,aes(hour,score))+
  geom_segment(aes(x=hour,xend=hour,y=0,yend=score))+
  geom_point(size=5,color="red",fill=alpha("orange",0.3),alpha=0.7,shape=21,stroke=2)+
  theme_func()+
  labs(title = "Sentiment Score in Each Hour")

```


On an average we can say the sentiment ranges from 0.5 to 1.5,with a huge spike in the sentiment score in the 11th hour followed by very low sentiment score in the 12th,13th and 14th hour of the day.

## <b>NRC Lexicon</b>
The [NRC Word Emotion Association Lexicon](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm) tags words according to Plutchik's 8 emotions plus positive and negative.But since we have performed analysis on Positive and Negative sentiments, we will concentrate more on the rest of the 8 sentiments


```{r fig.align = "center"}
#Transform sentences into words
data_tibble <- data %>%
  unnest_tokens(output = "words", input = text, token = "words")
#Remove stop words from tibble
tweets_tibble_clean <- data_tibble %>%
  anti_join(stop_words, by=c("words"="word"))%>%
  group_by(words)%>%
  summarise(count=n())


word_freq_nrc <- tweets_tibble_clean %>% 
  # Inner join to bing lexicon by term = word
  inner_join(get_sentiments("nrc"), by = c("words" = "word")) %>% 
      filter(!grepl("positive|negative", sentiment)) %>%
  group_by(sentiment) %>% 
  summarize(total_count = sum(count))

ggplot(word_freq_nrc, aes(reorder(x = sentiment,total_count), y = total_count,fill = sentiment)) +
  # Add a column geom
  geom_col()+
  theme_func()+
    coord_flip() + theme(legend.position="none") +
    labs(x = "Sentiments",title = "Top Sentiments using NRC Lexicon") 
  

```

It seems Elon Musk's tweets have been more of anticipation and trust, and a good amount of fear and joy.

```{r fig.align = "center"}
data %>%
  unnest_tokens(output = "words", input = text, token = "words")%>%
  anti_join(stop_words, by=c("words"="word"))%>%
  count(words) %>%
  inner_join(get_sentiments("nrc"), by = c("words"="word")) %>%
  group_by(sentiment) %>%
      filter(!grepl("positive|negative", sentiment)) %>%
  top_n(5, n) %>%
  ungroup() %>%
  mutate(word = reorder(words, n)) %>%
   ggplot(aes(word, n,fill=sentiment)) +
   geom_col() +
   coord_flip() +
   facet_wrap(~ sentiment, scales = "free") +
  theme_func()+
  labs(y = "count",title = "Top Words under Each Sentiment")
```

The above plot clearly tells us what causes the following emotions, for example he might be anticipating launch of a new rocket or production of a tesla car. 

Yeah I think with that my first text mining project has come to an end.Looking forward to play with more interesting datasets.Source code for this post can be found [here](https://github.com/siddharth271101/Data-Analysis-with-R).