---
title: "Vaccine Sentiment Analysis"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, message=FALSE}
source("C:/Users/16502/Documents/R demo/630 NLP/Scripts/NLP_envi.R")
NLP_envi("C:/Users/16502/Documents/R demo/630 NLP/Scripts/")
```
```{r}
library(RColorBrewer)
library(wordcloud2)
library(text2vec)
library(Matrix)
library(stm)
library(pals)
library(tidytext)
library(widyr)
library(irlba)
```
## **Overview**
<p>
The CDC officially confirmed the first US case of the Omicron variant earlier this month, and the urgency to reach the remaining 80 million unvaccinated Americans is back at the forefront. However, the sentiments of these individuals range from distrust of vaccines to blind faith in more holistic approaches against COVID-19, so it might be relevant for government agencies to analyze the public's most candid sentiments towards vaccines in order to reach those who remain hesitant.<p/>
The fear of side effects and the desire to “wait and see if vaccines are safe” were cited as the top two concerns of a group of unvaccinated individuals surveyed by the Census Household Pulse in July 2021. Given that enough time has passed since the vaccine rollouts began, it is worth asking whether the perception of vaccines became more positive over time.<p/>
## **The Dataset**
<p/>
To answer this question, a dataset of tweets about the Pfizer vaccine was sourced from Kaggle. The dataset contains 10,919 tweets and includes 9 features, such as user_name, user_location, user_followers, date, text, hashtags, retweets and favorites. The tweets were collected between December 2020 and November 2021. Only the “date” and “text” features were used in the NLP analysis portion of this project.
```{r}
#Upload Dataset
vaccine_tweets <- as.data.frame(read_csv("C:/Users/16502/Documents/R demo/630 NLP/tweets_cln1.csv" ))
head(vaccine_tweets,10)
```
## **Exploratory Data Analysis**
<p/>
There are 7,149 unique users and 10,912 unique tweets in this dataset, with tweets generated from all over the world, notably from Malaysia, Australia, Germany and as far as Bangladesh. In terms of popular hashtags, “PfizerBionTech” and “CovidVaccine” seemed to be the most prominent, although the hashtags (or combinations of hashtags) were indexed as a list within each entry, so the result of a simple wordcloud() of the hashtag column is more representative of the most frequent combinations of hashtags. It is also worth noting that most of these tweets were not retweeted.
**Feature Name: user_name ** <p/>
```{r}
# The number of NA values in the user_name column
sum(is.na(vaccine_tweets$user_name))
# The number of unique values in the user_name column
length(unique(vaccine_tweets$user_name))
```
**Feature Name: user_location ** <p/>
```{r}
# The number of NA values in the user_location column
sum(is.na(vaccine_tweets$user_location))
# The number of unique values in the user_location column
length(unique(vaccine_tweets$user_location))
```
```{r}
loc <- as.data.frame(table(vaccine_tweets$user_location))
wordcloud2(data = loc, size=1.6, color='random-dark')
```
**Feature Name: user_followers ** <p/>
```{r}
# The number of NA values in the user_followers column
sum(is.na(vaccine_tweets$user_followers))
# Summary statistics for the user_followers column
summary(vaccine_tweets$user_followers)
```
**Feature Name: date ** <p/>
```{r}
# The number of NA values in the date column
sum(is.na(vaccine_tweets$date))
# The number of unique values in the date column
length(unique(vaccine_tweets$date))
# Summary of the date range
summary(vaccine_tweets$date)
```
**Feature Name: hashtags ** <p/>
```{r}
# The number of NA values in the hashtags column
sum(is.na(vaccine_tweets$hashtags))
# The number of unique values in the user_hashtags column
length(unique(vaccine_tweets$hashtags))
```
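As noted in the EDA overview, a simple wordcloud of the raw hashtag column mainly surfaces the most frequent hashtag combinations. Below is a minimal sketch of that view, mirroring the location wordcloud above; the `hash` data frame is introduced here purely for illustration.
```{r, eval=FALSE}
# Hashtag (combination) frequencies; each entry stores the original hashtag list
hash <- as.data.frame(table(vaccine_tweets$hashtags))
wordcloud2(data = hash, size = 1.6, color = 'random-dark')
```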
**Feature Name: text ** <p/>
```{r}
# The number of NA values in the text column
sum(is.na(vaccine_tweets$text))
# The number of unique values in the text column
length(unique(vaccine_tweets$text))
```
**Retweets Distribution**
```{r}
counts <- table(vaccine_tweets$retweets)
barplot(counts, main="Was the Tweet Retweeted?",
col="#69b3a2",
xlab="Tweet Status")
```
## **Pre-Processing Workflow**
**Exploring the Duplicated Tweets**
```{r,results='hide'}
# Determine entries that were repeated
n_occur <- data.frame(table(vaccine_tweets$text))
n_occur[n_occur$Freq > 1,]
```
On closer inspection, the duplicated tweets were typically created by the same user, either on different dates or when the user had a different follower count. Keeping these tweets might be sensible for a network analysis, but for this project the duplicated tweets were dropped.
```{r,results='hide'}
vaccine_tweets[vaccine_tweets$text %in% n_occur$Var1[n_occur$Freq > 1],]
```
```{r}
# New dimension matches the number of unique tweets
vaccine_tweets <- vaccine_tweets[!duplicated(vaccine_tweets$text),]
dim(vaccine_tweets)
```
```{r}
## Initialize list of words we want to remove
to_remove<- c("vaccine", "covid", "covid19", "vax","coronavirus", "dose" )
```
```{r}
## Remove emojis, links and words included in our to_remove list
tweet <- pre_process_corpusAG(vaccine_tweets, "text",
remove_html = TRUE,
remove_strings = to_remove,
remove_hex = TRUE,
replace_emojis = TRUE,
replace_numbers = TRUE,
root_gen = 'lemmatize')
vaccine_tweets$text_preprocessed <- tweet
```
**Create a Document Term Matrix Object**
```{r}
# tokenize data with itoken function
it <- itoken(vaccine_tweets$text_preprocessed, tokenizer = word_tokenizer)
vocab_full <- create_vocabulary(it, ngram = c(1, 3))
# We impose lower and upper bounds to truncate the resulting vocabulary
# a term had to appear in more than two documents
lbound <- 2
# a term appearing in 45 or more documents is likely too common to be informative
ubound <- 45
vocab <- vocab_full[vocab_full$doc_count > lbound & vocab_full$doc_count < ubound,]
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)
sparse_corpus <- Matrix(dtm, sparse = T)
```
## **Methods and Results**
### **Frequency Analysis**
**Generate a "Baseline" Frequency Table**
```{r}
freq_table <- data.frame(term = colnames(dtm),
n = colSums(dtm),
freq = colSums(dtm)/sum(dtm)
)
freq_table <- freq_table[order(freq_table$n, decreasing = T),]
```
```{r}
head(freq_table)
```
<p/>
As illustrated above, the top 3 terms in this table seemed to allude to the lockdowns in Europe, while the terms “announce_ban” and “allergic_reaction” could allude to the ban on social gatherings and the concern over adverse effects from vaccines, both of which were relevant concerns during the early days of the pandemic. Given the need for more granularity in the main frequency table, several proxy terms that could be indicative of positive or negative sentiment were selected. These terms, in addition to the vaccine brands mentioned in the corpus, were used to subset the main frequency table. <p/>
**Select Proxy Terms that might be Indicative of Negative or Positive Sentiment**
```{r}
pos <- c( "thank_science",
"igottheshot",
"getvaccinatednow_getvaccinatednow",
"fullyvaccinated_pfizerbiontech",
"go_pfizerbiontech",
"safe_effective",
"vaccinate_thank",
"grateful_receive",
"thisisourshot",
"get_vaccinate_pfizerbiontech",
"highly_effective",
"pfizerbiontech_get",
"grateful_pfizerbiontech"
)
```
```{r}
# Neg Tags
neg <- c ("side_effect_pfizerbiontech",
"pregnant",
"sideeffects",
"die_norway",
"allergy",
"ban_purchase",
"challenge" ,
"bet",
"ban_purchase_pfizerbiontech",
"pressure",
"america",
"bad",
"adverse",
"ill",
"worry",
"free",
"body", "soreness",
"evidence",
"trump",
"effect_pfizerbiontech" ,
"privilege",
"regulator",
"use_fda_view",
"conservative",
"anaphylaxis",
"god",
"fear",
"mass",
"study_show",
"experimental",
"realdonaldtrump",
"guess",
"pharma",
"wake",
"fail",
"wrong",
"wont" ,
"ban_purchase"
)
```
**Generate a Frequency Table that is More Specific to our Inquiry** <p/>
For our purposes, the vaccine brands were considered 'neutral terms' since they could be mentioned in an informative context and therefore do not project a clear type of sentiment.
```{r}
# Segment the frequency table to just include the proxy terms and the vaccine brands
vacs <- c("pfizerbiontech","moderna","sputnik","astrazeneca","johnson" )
vax_tags <- c (vacs, neg, pos)
pattern <- paste(vax_tags, collapse = "|")
overlaps <-grep(pattern, freq_table$term, ignore.case = TRUE, value = TRUE)
```
```{r}
# Segment the main frequency table with mentions of vaccine names
vacs_frqtb <- freq_table[freq_table$term %in% overlaps, ]
vacs_frqtb <- vacs_frqtb [order(vacs_frqtb$n, decreasing = T),][1:30,]
head(vacs_frqtb)
```
**Examine the Frequency Table Just Composed of the Proxy Terms**<p/>
From this approach, we determined the proxy terms that had the most "signal" relative to the other terms.
```{r}
#Frequency of Negative Terms
pattern_neg <- paste(neg, collapse = "|")
overlaps_neg <-grep(pattern_neg, freq_table$term, ignore.case = TRUE, value = TRUE)
neg_frqtb <- freq_table[freq_table$term %in% overlaps_neg, ]
neg_frqtb <- neg_frqtb [order(neg_frqtb$n, decreasing = T),][1:30,]
head(neg_frqtb)
```
```{r}
#Frequency of Positive Terms
pattern_pos <- paste(pos, collapse = "|")
overlaps_pos <-grep(pattern_pos, freq_table$term, ignore.case = TRUE, value = TRUE)
pos_frqtb <- freq_table[freq_table$term %in% overlaps_pos, ]
pos_frqtb <- pos_frqtb [order(pos_frqtb$n, decreasing = T),][1:30,]
head(pos_frqtb)
```
<p/>
**Graph the New Frequency Table**<p/>
```{r, warning=FALSE}
ggplot(vacs_frqtb, aes(x = reorder(term, freq), y = freq, fill = term)) +
geom_bar(stat = "identity", show.legend = F) + coord_flip() + xlab("Vaccine Mentions")
```
<p/>
With this new table, more terms that could be related to vaccine sentiment can be observed. Furthermore, graphing this new frequency table reveals some “marbling” of negative and positive proxy terms with more neutral terms. This marbling indicated that these terms co-occurred often enough based on frequency alone, and these co-occurrences are contextualized further later in this project.<p/>
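As a quick illustrative check of that co-occurrence (not part of the original workflow), document-level co-occurrence counts can be read straight off the dtm; the terms below are assumed to have survived the vocabulary pruning.
```{r, eval=FALSE}
# Term-by-term co-occurrence counts: number of tweets in which both terms appear
co_occ <- crossprod(1 * (dtm > 0))
co_occ[c("bad", "adverse"), c("thank_pfizerbiontech", "use_pfizerbiontech")]
```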
### **Time Series Analysis**
<p/>
**Time Series Analysis of Vaccine Brands** <p/>
```{r}
vax_names <- c("sputnikv","johnson", "thank_pfizerbiontech", "amp_moderna", "use_pfizerbiontech" )
# Subset the dtm by term name so the column labels stay aligned with the terms
vax_freq <- dtm[, intersect(vax_names, colnames(dtm))]
vax_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(vax_freq))
)
vax_freq$id <- 1:nrow(vax_freq)
df_vax_name <- melt(vax_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_vax_name, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
<p/>
The time series analysis of the vaccine brands revealed that these terms appeared often enough in the corpus to have some probable connection to the proxy terms. If the resulting time series graph, as pictured above, had looked sparse, it would have been difficult to connect our terms to the historical data around our working time frame.<p/>
**Time Series Analysis of Proxy Terms with Vaccine Brands**<p/>
```{r}
b_terms <- c("bad", "adverse", "side_effect_pfizerbiontech", "thank_science", "igottheshot", "fullyvaccinated_pfizerbiontech","sputnikv","johnson", "amp_moderna", "use_pfizerbiontech")
# Subset the dtm by term name so the column labels stay aligned with the terms
b_freq <- dtm[, intersect(b_terms, colnames(dtm))]
b_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(b_freq))
)
b_freq$id <- 1:nrow(b_freq)
df_b_term <- melt(b_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_b_term, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
<p/>
Layering the negative and positive proxy terms with the vaccine brands produces the denser lines seen above, reminiscent of the “marbling” effect that was evident in the second frequency table.<p/>
**Time Series Analysis of 3 Negative Proxy Terms**
<p/>
Dedicated time series graphs were also created separately for the negative and positive terms to ensure that the signals of these terms were not being conflated from one another.
```{r}
neg_terms <- c("bad", "adverse", "side_effect_pfizerbiontech")
# Subset the dtm by term name so the column labels stay aligned with the terms
neg_freq <- dtm[, intersect(neg_terms, colnames(dtm))]
neg_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(neg_freq))
)
neg_freq$id <- 1:nrow(neg_freq)
df_neg_term <- melt(neg_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_neg_term, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
**Time Series Analysis of 3 Positive Proxy Terms**<p/>
```{r}
pos_terms <- c("thank_science", "igottheshot","fullyvaccinated_pfizerbiontech")
# Subset the dtm by term name so the column labels stay aligned with the terms
pos_freq <- dtm[, intersect(pos_terms, colnames(dtm))]
pos_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(pos_freq))
)
pos_freq$id <- 1:nrow(pos_freq)
df_pos_term <- melt(pos_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_pos_term, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
<p/>
The figures above illustrated how the negative terms seemed to be denser from January until April compared to the positive terms but the positive terms did maintain their relevance beyond July whereas the negative terms dwindled significantly.<p/>
**Time Series Analysis of Both Negative and Positive Proxy Terms**<p/>
```{r}
both_terms <- c("thank_science", "igottheshot","fullyvaccinated_pfizerbiontech",
"bad", "adverse", "side_effect_pfizerbiontech")
# Subset the dtm by term name so the column labels stay aligned with the terms
both_freq <- dtm[, intersect(both_terms, colnames(dtm))]
both_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(both_freq))
)
both_freq$id <- 1:nrow(both_freq)
df_both_term <- melt(both_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_both_term, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
<p/>
In isolating just the interaction between the negative and positive terms as shown above, we could observe a roughly equal contention between the positive sentiment (represented by the pink and blue hues) and the negative sentiment (green hue) between January and July. Nevertheless, both sentiments seem to dwindle after July.
### **TF/IDF Analysis **
<p/>
The TF/IDF approach was used to capture the ‘semantic contribution’ of the terms in the dtm, and the results for the terms “bad” and “adverse” as well as “igottheshot” and “thank_science” were specifically analyzed and cross-referenced with historical data.
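For reference, the scores computed in the next two chunks follow the standard definition (the notation here is added for clarity and is not from the original write-up), where $N$ is the number of tweets, $\mathrm{df}(t)$ the number of tweets containing term $t$, and $\mathrm{tf}(t,d)$ the raw count of $t$ in tweet $d$:
$$
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$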
```{r}
# compute IDF
number_of_docs <- nrow(sparse_corpus)
term_in_docs <- colSums(sparse_corpus > 0)
idf <- log(number_of_docs / term_in_docs)
```
```{r}
# compute TF/IDF
tf_idf <- t(t(sparse_corpus) * idf)
colnames(tf_idf) <- colnames(sparse_corpus)
rownames(tf_idf) <- vaccine_tweets$date
```
**Semantic Significance of Vaccine Brands** <p/>
By determining when each vaccine brand was most semantically significant, we are able to identify the appropriate "historical data" against which to cross-reference our working timetable.
```{r}
selected <- c("johnson" , "amp_moderna", "sputnikv","thank_pfizerbiontech","use_pfizerbiontech")
tf_idf_select <- tf_idf[, which(colnames(tf_idf) %in% selected)]
colnames(tf_idf_select) <- selected
tf_idf_select <- data.frame(date= vaccine_tweets$date,
as.data.frame(as.matrix(tf_idf_select))
)
tf_idf_select$id <- 1:nrow(tf_idf_select)
```
```{r, fig.width= 5, fig.height= 15,warning=FALSE}
# create visual
tf_idf_select <- melt(tf_idf_select, id.vars = c("id", "date"), variable.name = "vax", value.name = "tfidf")
# Order by Date
df_select <- tf_idf_select[order(tf_idf_select$tfidf, decreasing = T),][1:130,]
df_select <- df_select[order(df_select$date),]
ggplot(data = df_select, aes(x = reorder(date, tfidf), y = tfidf, fill = vax)) + geom_bar(stat = "identity") + coord_flip() + xlab("date")
```
### **Some Relevant Historical Data**
<p/>
**December 11:** FDA Agrees to EUA for COVID-19 Vaccine From Pfizer, BioNTech <p/>
**January 22:** New Data Shows Moderna Vaccine Can Induce Rare Anaphylactic Reactions <p/>
**February 26:** Vaccine Acceptance Among Americans Increases (KFF.ORG survey) <p/>
**March 14:** AstraZeneca Refutes Reports Linking Its Vaccine to Blood Clots <p/>
**April 7:** European Medicines Agency Finds Rare Clotting Links to AZ Vaccine <p/>
**April 13:** CDC, FDA, Recommend Pausing J&J Vaccine <p/>
**June 1 :** Employers Can Require COVID-19 Vaccine <p/>
**August 20:** Booster Plan Backlash <p/>
**Semantic Significance of Proxy Terms** <p/>
For this analysis, we focused on two sets of proxy terms: three negative and three positive.
```{r, fig.width= 5, fig.height= 15, warning=FALSE}
sentiment <- c("bad", "adverse", "side_effect_pfizerbiontech", "thank_science", "igottheshot", "fullyvaccinated_pfizerbiontech")
tf_idf_senti <- tf_idf[, which(colnames(tf_idf) %in% sentiment)]
colnames(tf_idf_senti) <- sentiment
tf_idf_senti <- data.frame(date= vaccine_tweets$date,
as.data.frame(as.matrix(tf_idf_senti))
)
tf_idf_senti$id <- 1:nrow(tf_idf_senti)
# create visual
tf_idf_senti <- melt(tf_idf_senti, id.vars = c("id", "date"), variable.name = "vax", value.name = "tfidf")
# Order by Date
df_senti <- tf_idf_senti[order(tf_idf_senti$tfidf, decreasing = T),][1:130,]
df_senti <- df_senti[order(df_senti$date),]
ggplot(data = df_senti, aes(x = reorder(date, tfidf), y = tfidf, fill = vax)) + geom_bar(stat = "identity") + coord_flip() + xlab("date")
```
### **Semantic Contribution of Certain Terms**
<p/>
**“bad” (peach) :** Late May - Mid-September <p/>
**“igottheshot” (blue):** Mid December - Early Feb <p/>
**“thank_science” (teal):** Mid-December, Late Feb and certain days in April, May, June and July <p/>
**“adverse” (green):** Several days in Dec 2020, Jan, Feb, and a day in March and April <p/>
First, the term “bad” had the most semantic contribution between late May and mid-September, which coincided with news of adverse events from the AstraZeneca and J&J vaccines in addition to mounting opposition to vaccine mandates and booster plans across the US around August. On the other hand, the term “igottheshot” was semantically significant between mid-December and early February, which coincided with the FDA's authorization of the Pfizer vaccine and the period when many people eagerly got vaccinated. Nevertheless, it should be noted that the terms “adverse” and “side_effect_pfizerbiontech” were also significant on certain days when “igottheshot” was semantically significant, which can be attributed to the news of adverse events from the Moderna vaccine reported around the same time that the Pfizer vaccine was being rolled out in many parts of the country. Through these approaches, we were able to roughly contextualize and confirm the probable sentiment of these selected terms as far as our analysis was concerned.<p/>
**Time Series Analysis of TF-IDF Score of Selected Terms**<p/>
However, in determining whether the positive sentiments were more pronounced compared to the negative sentiments (or vice versa), it might be helpful to see these TF-IDF scores graphed linearly in time. The results are shown below. <p/>
**"adverse"**
```{r,warning=FALSE}
tf_idf_b <- tf_idf[, which(colnames(tf_idf) == "adverse")]
tf_idf_b <- data.frame(date= vaccine_tweets$date, adverse = tf_idf_b)
tf_idf_b$id <- 1:nrow(tf_idf_b)
ggplot(tf_idf_b, aes(x = date, y = adverse)) +
geom_line(color = 'blue') + geom_point(color = 'blue')
```
**"igottheshot"**
```{r,warning=FALSE}
tf_idf_b <- tf_idf[, which(colnames(tf_idf) == "igottheshot")]
tf_idf_b <- data.frame(date= vaccine_tweets$date, igottheshot = tf_idf_b)
tf_idf_b$id <- 1:nrow(tf_idf_b)
ggplot(tf_idf_b, aes(x = date, y = igottheshot)) +
geom_line(color = 'red') + geom_point(color = 'red')
```
<p/>
For the first set of proxy terms, shown above, both sentiments grew sparse after July, while the positive sentiment (via the term “igottheshot") was most pronounced between January and February. This is in contrast to the negative sentiment, represented by the word “adverse”, which was pronounced for a longer duration, specifically between January and April.<p/>
**"side_effect_pfizerbiontech"**
```{r,warning=FALSE}
tf_idf_b <- tf_idf[, which(colnames(tf_idf) == "side_effect_pfizerbiontech")]
tf_idf_b <- data.frame(date= vaccine_tweets$date, side_effect_pfizerbiontech = tf_idf_b)
tf_idf_b$id <- 1:nrow(tf_idf_b)
ggplot(tf_idf_b, aes(x = date, y = side_effect_pfizerbiontech)) +
geom_line(color = 'purple') + geom_point(color = 'purple')
```
**"thank_pfizerbiontech"**
```{r,warning=FALSE}
tf_idf_b <- tf_idf[, which(colnames(tf_idf) == "thank_pfizerbiontech")]
tf_idf_b <- data.frame(date= vaccine_tweets$date, thank_pfizerbiontech = tf_idf_b)
tf_idf_b$id <- 1:nrow(tf_idf_b)
ggplot(tf_idf_b, aes(x = date, y = thank_pfizerbiontech)) +
geom_line(color = 'green') + geom_point(color = 'green')
```
<p/>
However, the positive sentiment represented by the term “thank_pfizerbiontech” seemed to be more dominant between January and July compared to “side_effect_pfizerbiontech”, and it also differed significantly from the results for “igottheshot”, which became irrelevant after March.<p/>
When the graphs of the two negative and positive terms were superimposed on one another, we could again observe a roughly equal contention between the positive and negative terms; a sketch of how the series can be superimposed is shown below, and the original superimposed images appear on page 9 of the documentation.
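The following is a minimal sketch (assuming `tf_idf`, `vaccine_tweets`, `melt` and `ggplot` from the chunks above), not the exact figure from the documentation.
```{r, eval=FALSE}
# Superimpose one positive and one negative TF-IDF series in a single plot
pair <- c("thank_pfizerbiontech", "side_effect_pfizerbiontech")
tf_idf_pair <- data.frame(date = vaccine_tweets$date,
                          as.data.frame(as.matrix(tf_idf[, pair])))
tf_idf_pair$id <- 1:nrow(tf_idf_pair)
df_pair <- melt(tf_idf_pair, id.vars = c("id", "date"),
                variable.name = "term", value.name = "tfidf")
ggplot(df_pair, aes(x = date, y = tfidf, color = term)) +
  geom_line(linetype = 2) + geom_point()
```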
### **Topic Modeling**
<p/>
To contextualize any overarching themes in the whole corpora, topic modeling was used.<p/>
**Mod 1**
<p/>
For the first LDA model, we used 3 topics (K=3) along with the default value for alpha, which is 50/K, and the default value for eta, which is 0.01. Given that the corpus is composed of tweets from different countries, it might be difficult even for the LDA approach to converge on coherent topics. As such, topic modeling was mainly used to generate some baseline themes, and further tuning of these models with searchK() was not pursued (a rough sketch of such a search appears at the end of this section).
```{r,results='hide'}
Mod1 <- stm(sparse_corpus, init.type = 'LDA', seed = 12345, K = 3)
```
```{r}
# Derive the topic content
Mod1_topic_content <- as.data.frame(t(exp(Mod1$beta$logbeta[[1]])))
# Create topic names
Mod1_topic_names <- apply(Mod1_topic_content, 2, function(x) {paste(Mod1$vocab[order(x,decreasing = T)[1:5]], collapse = " ")})
# Topic prevalence
Mod1_topic_prev <- as.data.frame(Mod1$theta)
df1 <- Mod1_topic_prev
colnames(df1) <- Mod1_topic_names
df1$date <- as.character(vaccine_tweets$date)
df1 <- melt(df1, id.vars = 'date', value.name = 'proportion', variable.name = 'topic')
```
```{r}
apply(Mod1_topic_content, 2, function(x) {Mod1$vocab[order(x, decreasing = T)[1:20]]})
```
```{r}
head(Mod1_topic_names,12)
```
Other than 'V2', we did not get any topics that could be indicative of vaccine sentiment. Given that we are mostly focusing on American users, 'V3' was disregarded.
```{r}
# Baseline mean score for Mod1
mean(apply(Mod1_topic_prev, 1, max))
```
The low mean score of this model indicated that documents were only weakly associated with their dominant topics and that there was considerable heterogeneity among the topics in the corpus.
```{r}
# Topic prevalence
Mod1_topic_prev <- as.data.frame(Mod1$theta)
df1 <- Mod1_topic_prev
colnames(df1) <- Mod1_topic_names
df1$date <- substr(vaccine_tweets$date,1,7)
df1$date <- as.character(df1$date)
df1 <- melt(df1, id.vars = 'date', value.name = 'proportion', variable.name = 'topic')
```
While the mean score was not encouraging, it might still be interesting to graph these topic distributions by month.
```{r,fig.width=10,fig.height=11}
ggplot(df1, aes(x = topic, y = proportion, fill = topic)) + geom_bar(stat = 'identity') +
theme(axis.text.x = element_blank(), axis.text.y = element_blank(), legend.position = "bottom", legend.text=element_text(size=8),legend.title = element_text( size=8),legend.key.size = unit(2,"mm")) +
coord_flip() + facet_wrap(~ date, ncol = 10) +guides(fill = guide_legend(title.position = "top", ncol = 1))
```
**Model 2**
<p/>
In this model, K was set to 8 in order to estimate the upper range for when we can observe a decrease in mean score.
```{r,results='hide'}
Mod2 <- stm(sparse_corpus, init.type = 'LDA', seed = 12345, K = 8)
```
```{r}
# derive the topic content
Mod2_topic_content <- as.data.frame(t(exp(Mod2$beta$logbeta[[1]])))
# Create topic names
Mod2_topic_names <- apply(Mod2_topic_content, 2, function(x) {paste(Mod2$vocab[order(x,decreasing = T)[1:5]], collapse = " ")})
# Topic prevalence
Mod2_topic_prev <- as.data.frame(Mod2$theta)
df2 <- Mod2_topic_prev
colnames(df2) <- Mod2_topic_names
df2$date <- as.character(vaccine_tweets$date)
df2 <- melt(df2, id.vars = 'date', value.name = 'proportion', variable.name = 'topic')
```
```{r}
apply(Mod2_topic_content, 2, function(x) {Mod2$vocab[order(x, decreasing = T)[1:20]]})
```
```{r}
head(Mod2_topic_names,12)
```
```{r}
# Baseline mean score
mean(apply(Mod2_topic_prev, 1, max))
```
The mean score decreased significantly for the model with K=8, and the resulting topic names were again not indicative of clear vaccine sentiment.
**Model 3**
<p/>
To explore the LDA approach more fully, we then created a model with K=5.
```{r,results='hide'}
Mod3 <- stm(sparse_corpus, init.type = 'LDA', seed = 12345, K = 5)
```
```{r}
# derive the topic content
Mod3_topic_content <- as.data.frame(t(exp(Mod3$beta$logbeta[[1]])))
# Create topic names
Mod3_topic_names <- apply(Mod3_topic_content, 2, function(x) {paste(Mod3$vocab[order(x,decreasing = T)[1:5]], collapse = " ")})
# Topic prevalence
Mod3_topic_prev <- as.data.frame(Mod3$theta)
df3 <- Mod3_topic_prev
colnames(df3) <- Mod3_topic_names
df3$date <- as.character(vaccine_tweets$date)
df3 <- melt(df3, id.vars = 'date', value.name = 'proportion', variable.name = 'topic')
```
```{r}
apply(Mod3_topic_content, 2, function(x) {Mod3$vocab[order(x, decreasing = T)[1:20]]})
```
```{r}
head(Mod3_topic_names,12)
```
```{r}
# Baseline mean score
mean(apply(Mod3_topic_prev, 1, max))
```
Ultimately, increasing the number of topics decreased the mean score, and therefore Mod1 with K=3 was the best of the 3 models. But as mentioned earlier, even Mod1 was not an accurate representation of the topics in the corpus based on its mean score and the resulting topic names.
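Although model tuning was not pursued here, a rough sketch of what a systematic search over K with stm's searchK() could look like is shown below. The conversion helper is ad hoc (an assumption of this sketch), since searchK() expects documents in stm's list format rather than a sparse matrix.
```{r, eval=FALSE}
# Convert the sparse dtm into stm's list format: one 2 x n matrix per document
# (first row = term indices into the vocabulary, second row = counts)
keep <- rowSums(sparse_corpus) > 0   # drop documents with no retained terms
m <- sparse_corpus[keep, ]
docs <- lapply(seq_len(nrow(m)), function(i) {
  idx <- which(m[i, ] > 0)
  rbind(as.integer(idx), as.integer(m[i, idx]))
})
# Compare the three values of K used above; plot() shows held-out likelihood,
# residuals, semantic coherence and the lower bound for each K
k_search <- searchK(docs, vocab = colnames(m), K = c(3, 5, 8))
plot(k_search)
```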
### **Word Embedding**
<p/>
To further vet the proxy terms that were tracked in the previous analyses, a word embedding approach was used. Specifically, if a term was closely associated with a vaccine brand and/or some of the other negative or positive proxy terms in our list, then the term was considered a good tracker for negative or positive sentiment.
```{r}
# create all possible ngrams given context_window= 9
skipgrams <- unnest_tokens(vaccine_tweets, ngram,text_preprocessed, token = "ngrams", n = 9)
#Generate ngramID
skipgrams$ngramID <- 1:nrow(skipgrams)
#Generate skipgramID and concatenate all the info
skipgrams$skipgramID <- paste(skipgrams$X_unit_id, skipgrams$ngramID, sep = '_')
#Divide each ngram by word
skipgrams <- unnest_tokens(skipgrams, word, ngram)
```
**Create Probability Table of the Skip Gram Model**
</p>
Now that we have the indices for each separate context window and a row for each term that appears in each window, we want to calculate the co-occurrence of terms in context windows.
```{r, warning=FALSE}
# use pairwise_count() to get frequency
skipgram_probs <- pairwise_count(skipgrams, word, skipgramID, diag = T, sort = T)
#calculate probability
skipgram_probs$p <- skipgram_probs$n/sum(skipgram_probs$n)
```
**Divide by Word** <p/>
Normalize these probabilities by the overall frequency of each term in the corpus.
```{r}
unigram_probs <- unnest_tokens(vaccine_tweets, word, text_preprocessed)
unigram_probs <- count(unigram_probs, word, sort = T)
unigram_probs$p <- unigram_probs$n/sum(unigram_probs$n)
```
**Creating Normalized Model**
```{r}
# drop word pairs that appear 20 times or fewer
lbound <- 20
normed_probs <- skipgram_probs[skipgram_probs$n > lbound,]
# rename the cols names
colnames(normed_probs) <- c('word1', 'word2', 'n', 'p_all')
# merge the list of pairs and their probabilities with the unigram probabilities we just calculated.
normed_probs <- merge(normed_probs, unigram_probs[, c('word', 'p')], by.x = 'word2', by.y = 'word', all.x = T)
normed_probs <- merge(normed_probs, unigram_probs[, c('word', 'p')], by.x = 'word1', by.y = 'word', all.x = T)
#calculate the normalized probability by dividing the pair’s probability by each of the individual terms’ probabilities
normed_probs$p_combined <- normed_probs$p_all/normed_probs$p.x/normed_probs$p.y
normed_probs <- normed_probs[order(normed_probs$p_combined, decreasing = T),]
```
**PMI Metric Integrated**
<p/>
In determining the terms with the highest likelihood of association, the PMI (pointwise mutual information) metric was used. This metric is a useful transformation of the normalized probability that equals 0 when the co-occurrence of two terms is exactly what we would expect by chance.
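Concretely, with $p(x,y)$ the pair probability and $p(x)$, $p(y)$ the unigram probabilities computed above:
$$
\mathrm{PMI}(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)}
$$
which is 0 exactly when the two terms co-occur as often as independence would predict.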
```{r}
# The following code block takes a while to run.
# Create PMI score col
#normed_probs$pmi <- log(normed_probs$p_combined)
# Create PMI_Matrix
#pmi_matrix <- cast_sparse(normed_probs, word1, word2, pmi)
#pmi_matrix@x[is.na(pmi_matrix@x)] <- 0
# Compress pmi_matrix
#pmi_svd <- irlba(pmi_matrix, 256, maxit = 1e6, fastpath=FALSE)
#word_vectors <- pmi_svd$u
#rownames(word_vectors) <- rownames(pmi_matrix)
```
The following code blocks were run on the University's virtual machine. Due to the long processing time for these methods, the results were captured as images rather than as a csv file.
```{r, out.width = "300px"}
knitr::include_graphics("C:/Users/16502/Downloads/bad.png")
#bad <- word_vectors %*% word_vectors["bad",] %>% as.data.frame() %>%
#rename(bad = V1) %>% arrange(-bad)
```
```{r,out.width = "300px"}
knitr::include_graphics("C:/Users/16502/Downloads/igottheshot.jpg")
#igottheshot <- word_vectors %*% word_vectors["igottheshot",] %>% as.data.frame() %>%
#rename(igottheshot = V1) %>% arrange(-igottheshot)
```
```{r,out.width = "300px"}
knitr::include_graphics("C:/Users/16502/Downloads/sputnik.jpg")
#sputnik <- word_vectors %*% word_vectors["sputnik",] %>% as.data.frame() %>%
#rename(sputnik = V1) %>% arrange(-sputnik)
```
<p/> The results of the word embedding approach indicated that the word “bad” was a good tracker for negative sentiment, given that “pfizer” appeared as the second and third most associated term for “bad”. The results for “igottheshot” and “sputnik” were not as forthcoming, but fortunately the positive terms that were selected had clear positive connotations. Due to the long processing time for this approach, only 3 terms were vetted.
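As a possible refinement (not part of the original run), the commented-out dot products above could be replaced with cosine similarities once `word_vectors` is available; `nearest_terms()` below is a hypothetical helper written for this sketch.
```{r, eval=FALSE}
# Hypothetical helper: rank terms by cosine similarity to a query term
nearest_terms <- function(word_vectors, term, n = 10) {
  norms <- sqrt(rowSums(word_vectors^2))
  norms[norms == 0] <- 1                    # guard against zero-length vectors
  unit_vecs <- word_vectors / norms         # L2-normalize each row
  sims <- unit_vecs %*% unit_vecs[term, ]   # dot products of unit vectors = cosines
  head(sort(sims[, 1], decreasing = TRUE), n)
}
# nearest_terms(word_vectors, "bad")
```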
### **VADER Lexicon**
The VADER lexicon was used to generate sentiment scores that were then compared with the several time series graphs generated previously. In doing so, the ranges of the compound scores for the processed and unprocessed tweets were compared, and the results from the unprocessed corpus were used for the next analysis.
```{r}
library(vader)
```
**Using the Unprocessed Text**
```{r}
m <- vader_df(vaccine_tweets$text)
```
```{r}
# Create df with the values we want
raw_vader_senti <- data.frame(
text = m$text,
compound = m$compound,
neg = m$neg,
pos = m$pos,
neu = m$neu,
date = vaccine_tweets$date
)
raw_vader_senti$id <- 1:nrow(raw_vader_senti)
```
**Using the Processed Text**
```{r}
b <- vader_df(vaccine_tweets$text_preprocessed)
```
```{r}
# Create df with the values we want
pros_vader_senti <- data.frame(
text = b$text,
compound = b$compound,
neg = b$neg,
pos = b$pos,
neu = b$neu,
date = vaccine_tweets$date
)
pros_vader_senti$id <- 1:nrow(pros_vader_senti)
```
**Examine the Difference Between the Results of the Processed and Raw Text**
```{r}
range(pros_vader_senti$compound)
```
```{r}
range(raw_vader_senti$compound)
```
The ranges were not too different between the two approaches, and we further confirmed this visually by looking at the graphed compound scores of the processed and raw text. Ultimately, the results from the raw text were used.
```{r}
plot(pros_vader_senti$date, pros_vader_senti$compound)
```
```{r}
plot(raw_vader_senti$date, raw_vader_senti$compound)
```
<p/>
To better observe the contention between the positive and negative sentiment scores, the compound scores were graphed using a color gradient.
```{r,warning=FALSE}
sp <- ggplot(raw_vader_senti, aes(date, compound)) +
geom_point(aes(color = compound)) + labs(color = "Compound Score")
#Sequential color scheme.
# Specify the colors for low and high ends of gradient
sp + scale_color_gradient(low = "blue", high = "red")
# Diverging color scheme
# Specify also the colour for mid point
mid <- mean(raw_vader_senti$compound)
sp + scale_color_gradient2(midpoint = mid, low = "blue", mid = "white",
high = "red", space = "Lab" )
```
<p/>
As shown above, the compound scores for each tweet ranged between -1 and +1, where a score close to -1 indicates negative sentiment and vice versa. A denser coloration in the upper quadrants can be seen, which indicates that there were more positive sentiments in the corpus according to the VADER lexicon. However, the difference was very subtle across all of the months. Furthermore, the dwindling of sentiments after July could be observed through this approach as well.<p/>
## **Analysis**
<p/>
Overall, we were able to observe two recurring themes throughout all of our approaches.
First, the results of the time series analyses of both term frequency and semantic contribution, as well as the distribution of the VADER-generated sentiment scores, all illustrated the dwindling of both sentiments over time. A possible explanation is that both the novelty and the perceived risk of getting vaccinated faded as more people became vaccinated, which can reduce users' need to tweet either type of sentiment. Furthermore, the dataset was sourced by an outside contributor to Kaggle, and the decreased volume of tweets during the later months of 2021 could be due to how the dataset was generated from the Twitter API.
<p/>
Second, the roughly balanced contention between the positive and negative sentiments was observed repeatedly across the methods mentioned earlier. This could be attributed to the dynamic nature of the issues around vaccines. For instance, the interplay between lower infection and/or hospitalization rates and news about adverse effects from certain vaccines can create equal forces from both ends, given how pervasive these topics are in society at this moment.
<p/>
Given the results across all of the techniques used in our methodology, it might be more appropriate to conclude that vaccine sentiment did not become more positive but instead remained roughly constant. As illustrated in the several graphs generated above, the volume of positive sentiment across time was more or less countered by the negative sentiment. To conclude that vaccine sentiment became more positive, we would need to see a more dramatic difference between the positive and negative sentiments.
## **Limitations**
<p/>
It should be noted that the methodology of this project has certain limitations that prevent any robust conclusions from being drawn. For instance, the validation process for the proxy terms relied heavily on historical data specific to the United States. However, as mentioned in the EDA part of the project, a significant number of tweets were generated outside of the United States. While this does not prevent us from analyzing vaccine sentiment, since COVID-19 is a global issue, it does weaken our validation process and potentially over-generalizes our insights, which might make them less valuable for US-based organizations that want to understand their community's sentiment towards the vaccine. In this iteration of the project, data volume was prioritized over subsetting the dataset by location, which would have unnecessarily reduced the size of our dataset.
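For reference, a location-based subset could look roughly like the sketch below; the pattern is illustrative and deliberately incomplete, and was not run as part of this project.
```{r, eval=FALSE}
# Illustrative only: keep tweets whose free-text user_location suggests a US location
us_pattern <- "USA|United States|U\\.S\\.|New York|California|Texas"
has_us_loc <- !is.na(vaccine_tweets$user_location) &
  grepl(us_pattern, vaccine_tweets$user_location, ignore.case = TRUE)
us_tweets <- vaccine_tweets[has_us_loc, ]
nrow(us_tweets)
```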
Additionally, our analysis was derived from observations of select proxy terms. More terms could be studied through the battery of approaches used in this project, which could still confirm or contradict our initial results. However, our methods were good starting points for the purposes of this project, and suggested refinements are mentioned in the conclusion.
<p/>
## **Conclusion**
<p/>
Through the several approaches employed to determine whether vaccine sentiment became more positive, it was evident that both types of sentiment dwindled over time and that the positive sentiments were marginally more pronounced than the negative sentiments between December 2020 and November 2021. Given that the positive sentiment was only marginally more dominant than the negative sentiment for the terms we specifically analyzed, it was more appropriate to conclude that sentiment remained roughly constant and did not become more positive over time, in contrast to our initial assumption. The “unsupervised” nature of our inquiry would simply require a more dramatic contrast between these sentiments in order to assert that any shift in sentiment occurred.
<p/>
Furthermore, only some terms were analyzed using our methodology, and a more systematic approach could further enhance the robustness of our conclusion. For instance, a dataset with a “clean” location feature could be used next time to add more specificity to the historical data and the type of sentiment to expect. Then, instead of using the word embedding approach as a supplementary vetting tool for proxy terms, we could use it as the primary tool for creating our list of proxy terms. This would reduce the number of terms to track so that we could pragmatically analyze all of them through the various time series analyses used in this project. Hence, by observing the overall behavior of these well-vetted words in terms of their frequency, semantic contribution and VADER-generated sentiment scores, we would be able to systematically chart any shifts in sentiment for all the terms and cross-reference them precisely with historical data.
<p/>
Ultimately, vaccination efforts include some element of translational work. Given that palpable insights can be derived from certain NLP approaches, government agencies can leverage these tools to approach the unvaccinated members of the community appropriately and mitigate the further spread of the virus in the United States.