---
title: "Vaccine Sentiment Analysis"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, message=FALSE}
source("C:/Users/16502/Documents/R demo/630 NLP/Scripts/NLP_envi.R")
NLP_envi("C:/Users/16502/Documents/R demo/630 NLP/Scripts/")
```
```{r}
library(RColorBrewer)
library(wordcloud2)
library(text2vec)
library(Matrix)
library(stm)
library(pals)
library(tidytext)
library(widyr)
library(irlba)
```
## **Overview**
<p>
The CDC officially confirmed the first US case of the Omicron variant earlier this month, and the urgency to reach the remaining 80 million unvaccinated Americans is back at the forefront. However, the sentiments of these individuals range from distrust of vaccines to blind faith in more holistic approaches against COVID-19, so it might be relevant for government agencies to analyze the public's most candid sentiments towards vaccines in order to reach those who remain hesitant.<p/>
The fear of side effects and the desire to “wait and see if vaccines are safe” were cited as the top two concerns of a group of unvaccinated individuals surveyed by the Census Household Pulse in July 2021. Given that enough time has passed since the vaccine rollouts began, it is worth asking whether the perception of vaccines became more positive over time.<p/>
## **The Dataset**
<p/>
To answer this question, a dataset of tweets about the Pfizer vaccine was sourced from Kaggle. The dataset contains 10,919 tweets and includes 9 features, such as user_name, user_location, user_followers, date, text, hashtags, retweets and favorites. The tweets were collected between December 2020 and November 2021. Only the “date” and “text” features were used in the NLP analysis portion of this project.
```{r}
#Upload Dataset
vaccine_tweets <- as.data.frame(read_csv("C:/Users/16502/Documents/R demo/630 NLP/tweets_cln1.csv" ))
head(vaccine_tweets,10)
```
## **Exploratory Data Analysis**
<p/>
There are 7,149 unique users and 10,912 unique tweets in this dataset, with tweets generated from all over the world, notably from Malaysia, Australia, Germany and as far as Bangladesh. In terms of popular hashtags, “PfizerBionTech” and “CovidVaccine” seemed to be the most prominent, although the hashtags (or combinations of hashtags) were indexed as a list within each entry, so the result of a simple wordcloud() of the hashtag column is more representative of the most frequent combinations of hashtags. It is also worth noting that most of these tweets were not retweeted.
**Feature Name: user_name ** <p/>
```{r}
# The number of NA values in the user_name column
sum(is.na(vaccine_tweets$user_name))
# The number of unique values in the user_name column
length(unique(vaccine_tweets$user_name))
```
**Feature Name: user_location ** <p/>
```{r}
# The number of NA values in the user_location column
sum(is.na(vaccine_tweets$user_location))
# The number of unique values in the user_location column
length(unique(vaccine_tweets$user_location))
```
```{r}
loc <- as.data.frame(table(vaccine_tweets$user_location))
wordcloud2(data = loc, size=1.6, color='random-dark')
```
**Feature Name: user_followers ** <p/>
```{r}
# The number of NA values in the user_followers column
sum(is.na(vaccine_tweets$user_followers))
# Summary statistics for the user_followers column
summary(vaccine_tweets$user_followers)
```
**Feature Name: date ** <p/>
```{r}
# The number of NA values in the date column
sum(is.na(vaccine_tweets$date))
# The number of unique values in the date column
length(unique(vaccine_tweets$date))
# Summary of the date range
summary(vaccine_tweets$date)
```
**Feature Name: hashtags ** <p/>
```{r}
# The number of NA values in the hashtags column
sum(is.na(vaccine_tweets$hashtags))
# The number of unique values in the user_hashtags column
length(unique(vaccine_tweets$hashtags))
```
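As noted in the EDA overview, a simple wordcloud of the raw hashtag column mainly surfaces the most frequent hashtag combinations. Below is a minimal sketch of that view, mirroring the location wordcloud above; the `hash` data frame is introduced here purely for illustration.
```{r, eval=FALSE}
# Hashtag (combination) frequencies; each entry stores the original hashtag list
hash <- as.data.frame(table(vaccine_tweets$hashtags))
wordcloud2(data = hash, size = 1.6, color = 'random-dark')
```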
**Feature Name: text ** <p/>
```{r}
# The number of NA values in the text column
sum(is.na(vaccine_tweets$text))
# The number of unique values in the text column
length(unique(vaccine_tweets$text))
```
**Retweets Distribution**
```{r}
counts <- table(vaccine_tweets$retweets)
barplot(counts, main="Was the Tweet Retweeted?",
col="#69b3a2",
xlab="Tweet Status")
```
## **Pre-Processing Workflow**
**Exploring the Duplicated Tweets**
```{r,results='hide'}
# Determine entries that were repeated
n_occur <- data.frame(table(vaccine_tweets$text))
n_occur[n_occur$Freq > 1,]
```
On closer inspection, the duplicated tweets were typically created by the same user, either on different dates or when the user had a different follower count. Keeping these tweets might be sensible for a network analysis, but for this project the duplicated tweets were dropped.
```{r,results='hide'}
vaccine_tweets[vaccine_tweets$text %in% n_occur$Var1[n_occur$Freq > 1],]
```
```{r}
# New dimension matches the number of unique tweets
vaccine_tweets <- vaccine_tweets[!duplicated(vaccine_tweets$text),]
dim(vaccine_tweets)
```
```{r}
## Initialize list of words we want to remove
to_remove<- c("vaccine", "covid", "covid19", "vax","coronavirus", "dose" )
```
```{r}
## Remove emojis, links and words included in our to_remove list
tweet <- pre_process_corpusAG(vaccine_tweets, "text",
remove_html = TRUE,
remove_strings = to_remove,
remove_hex = TRUE,
replace_emojis = TRUE,
replace_numbers = TRUE,
root_gen = 'lemmatize')
vaccine_tweets$text_preprocessed <- tweet
```
**Create a Document Term Matrix Object**
```{r}
# tokenize data with itoken function
it <- itoken(vaccine_tweets$text_preprocessed, tokenizer = word_tokenizer)
vocab_full <- create_vocabulary(it, ngram = c(1, 3))
# We impose lower and upper bounds to truncate the resulting vocabulary
# a term had to appear in more than two documents
lbound <- 2
# a term appearing in 45 or more documents is likely too common to be informative
ubound <- 45
vocab <- vocab_full[vocab_full$doc_count > lbound & vocab_full$doc_count < ubound,]
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)
sparse_corpus <- Matrix(dtm, sparse = T)
```
## **Methods and Results**
### **Frequency Analysis**
**Generate a "Baseline" Frequency Table**
```{r}
freq_table <- data.frame(term = colnames(dtm),
n = colSums(dtm),
freq = colSums(dtm)/sum(dtm)
)
freq_table <- freq_table[order(freq_table$n, decreasing = T),]
```
```{r}
head(freq_table)
```
<p/>
As illustrated above, the top 3 terms in this table seemed to allude to the lockdowns in Europe, while the terms “announce_ban” and “allergic_reaction” could allude to the ban on social gatherings and the concern over adverse effects from vaccines, both of which were relevant concerns during the early days of the pandemic. Given the need for more granularity in the main frequency table, several proxy terms that could be indicative of positive or negative sentiment were selected. These terms, in addition to the vaccine brands mentioned in the corpus, were used to subset the main frequency table. <p/>
**Select Proxy Terms that might be Indicative of Negative or Positive Sentiment**
```{r}
pos <- c( "thank_science",
"igottheshot",
"getvaccinatednow_getvaccinatednow",
"fullyvaccinated_pfizerbiontech",
"go_pfizerbiontech",
"safe_effective",
"vaccinate_thank",
"grateful_receive",
"thisisourshot",
"get_vaccinate_pfizerbiontech",
"highly_effective",
"pfizerbiontech_get",
"grateful_pfizerbiontech"
)
```
```{r}
# Neg Tags
neg <- c ("side_effect_pfizerbiontech",
"pregnant",
"sideeffects",
"die_norway",
"allergy",
"ban_purchase",
"challenge" ,
"bet",
"ban_purchase_pfizerbiontech",
"pressure",
"america",
"bad",
"adverse",
"ill",
"worry",
"free",
"body", "soreness",
"evidence",
"trump",
"effect_pfizerbiontech" ,
"privilege",
"regulator",
"use_fda_view",
"conservative",
"anaphylaxis",
"god",
"fear",
"mass",
"study_show",
"experimental",
"realdonaldtrump",
"guess",
"pharma",
"wake",
"fail",
"wrong",
"wont" ,
"ban_purchase"
)
```
**Generate a Frequency Table that is More Specific to our Inquiry** <p/>
For our purposes, the vaccine brands were considered 'neutral terms' since they could be mentioned in an informative context and therefore do not project a clear type of sentiment.
```{r}
# Segment the frequency table to just include the proxy terms and the vaccine brands
vacs <- c("pfizerbiontech","moderna","sputnik","astrazeneca","johnson" )
vax_tags <- c (vacs, neg, pos)
pattern <- paste(vax_tags, collapse = "|")
overlaps <-grep(pattern, freq_table$term, ignore.case = TRUE, value = TRUE)
```
```{r}
# Segment the main frequency table with mentions of vaccine names
vacs_frqtb <- freq_table[freq_table$term %in% overlaps, ]
vacs_frqtb <- vacs_frqtb [order(vacs_frqtb$n, decreasing = T),][1:30,]
head(vacs_frqtb)
```
**Examine the Frequency Table Just Composed of the Proxy Terms**<p/>
From this approach, we determined the proxy terms that had the most "signal" relative to the other terms.
```{r}
#Frequency of Negative Terms
pattern_neg <- paste(neg, collapse = "|")
overlaps_neg <-grep(pattern_neg, freq_table$term, ignore.case = TRUE, value = TRUE)
neg_frqtb <- freq_table[freq_table$term %in% overlaps_neg, ]
neg_frqtb <- neg_frqtb [order(neg_frqtb$n, decreasing = T),][1:30,]
head(neg_frqtb)
```
```{r}
#Frequency of Positive Terms
pattern_pos <- paste(pos, collapse = "|")
overlaps_pos <-grep(pattern_pos, freq_table$term, ignore.case = TRUE, value = TRUE)
pos_frqtb <- freq_table[freq_table$term %in% overlaps_pos, ]
pos_frqtb <- pos_frqtb [order(pos_frqtb$n, decreasing = T),][1:30,]
head(pos_frqtb)
```
<p/>
**Graph the New Frequency Table**<p/>
```{r, warning=FALSE}
ggplot(vacs_frqtb, aes(x = reorder(term, freq), y = freq, fill = term)) +
geom_bar(stat = "identity", show.legend = F) + coord_flip() + xlab("Vaccine Mentions")
```
<p/>
With this new table, more terms that could be related to vaccine sentiment can be observed. Furthermore, graphing this new frequency table reveals some “marbling” of negative and positive proxy terms with more neutral terms. This marbling indicated that these terms co-occurred often enough based on frequency alone, and these co-occurrences are contextualized further later in this project.<p/>
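As a quick illustrative check of that co-occurrence (not part of the original workflow), document-level co-occurrence counts can be read straight off the dtm; the terms below are assumed to have survived the vocabulary pruning.
```{r, eval=FALSE}
# Term-by-term co-occurrence counts: number of tweets in which both terms appear
co_occ <- crossprod(1 * (dtm > 0))
co_occ[c("bad", "adverse"), c("thank_pfizerbiontech", "use_pfizerbiontech")]
```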
### **Time Series Analysis**
<p/>
**Time Series Analysis of Vaccine Brands** <p/>
```{r}
vax_names <- c("sputnikv","johnson", "thank_pfizerbiontech", "amp_moderna", "use_pfizerbiontech" )
# Subset the dtm by term name so the column labels stay aligned with the terms
vax_freq <- dtm[, intersect(vax_names, colnames(dtm))]
vax_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(vax_freq))
)
vax_freq$id <- 1:nrow(vax_freq)
df_vax_name <- melt(vax_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_vax_name, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
<p/>
The time series analysis of the vaccine brands revealed that these terms appeared often enough in the corpus to have some probable connection to the proxy terms. If the resulting time series graph, as pictured above, had looked sparse, it would have been difficult to connect our terms to the historical data around our working time frame.<p/>
**Time Series Analysis of Proxy Terms with Vaccine Brands**<p/>
```{r}
b_terms <- c("bad", "adverse", "side_effect_pfizerbiontech", "thank_science", "igottheshot", "fullyvaccinated_pfizerbiontech","sputnikv","johnson", "amp_moderna", "use_pfizerbiontech")
# Subset the dtm by term name so the column labels stay aligned with the terms
b_freq <- dtm[, intersect(b_terms, colnames(dtm))]
b_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(b_freq))
)
b_freq$id <- 1:nrow(b_freq)
df_b_term <- melt(b_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_b_term, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
<p/>
Layering the negative and positive proxy terms with the vaccine brands produces the denser lines seen above, reminiscent of the “marbling” effect that was evident in the second frequency table.<p/>
**Time Series Analysis of 3 Negative Proxy Terms**
<p/>
Dedicated time series graphs were also created separately for the negative and positive terms to ensure that the signals of these terms were not being conflated from one another.
```{r}
neg_terms <- c("bad", "adverse", "side_effect_pfizerbiontech")
# Subset the dtm by term name so the column labels stay aligned with the terms
neg_freq <- dtm[, intersect(neg_terms, colnames(dtm))]
neg_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(neg_freq))
)
neg_freq$id <- 1:nrow(neg_freq)
df_neg_term <- melt(neg_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_neg_term, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
**Time Series Analysis of 3 Positive Proxy Terms**<p/>
```{r}
pos_terms <- c("thank_science", "igottheshot","fullyvaccinated_pfizerbiontech")
# Subset the dtm by term name so the column labels stay aligned with the terms
pos_freq <- dtm[, intersect(pos_terms, colnames(dtm))]
pos_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(pos_freq))
)
pos_freq$id <- 1:nrow(pos_freq)
df_pos_term <- melt(pos_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_pos_term, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
<p/>
The figures above illustrated how the negative terms seemed to be denser from January until April compared to the positive terms but the positive terms did maintain their relevance beyond July whereas the negative terms dwindled significantly.<p/>
**Time Series Analysis of Both Negative and Positive Proxy Terms**<p/>
```{r}
both_terms <- c("thank_science", "igottheshot","fullyvaccinated_pfizerbiontech",
"bad", "adverse", "side_effect_pfizerbiontech")
# Subset the dtm by term name so the column labels stay aligned with the terms
both_freq <- dtm[, intersect(both_terms, colnames(dtm))]
both_freq <- data.frame(date= vaccine_tweets$date ,
as.data.frame(as.matrix(both_freq))
)
both_freq$id <- 1:nrow(both_freq)
df_both_term <- melt(both_freq, id.vars = c("id", "date"), variable.name = "term", value.name = "count")
ggplot(df_both_term, aes(x = date, y = count, color = term)) + geom_line(linetype = 2) + geom_point()
```
<p/>
In isolating just the interaction between the negative and positive terms as shown above, we could observe a roughly equal contention between the positive sentiment (represented by the pink and blue hues) and the negative sentiment (green hue) between January and July. Nevertheless, both sentiments seem to dwindle after July.
### **TF/IDF Analysis **
<p/>
The TF/IDF approach was used to capture the ‘semantic contribution’ of the terms in the dtm, and the results for the terms “bad” and “adverse” as well as “igottheshot” and “thank_science” were specifically analyzed and cross-referenced with historical data.
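For reference, the scores computed in the next two chunks follow the standard definition (the notation here is added for clarity and is not from the original write-up), where $N$ is the number of tweets, $\mathrm{df}(t)$ the number of tweets containing term $t$, and $\mathrm{tf}(t,d)$ the raw count of $t$ in tweet $d$:
$$
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$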
```{r}
# compute IDF
number_of_docs <- nrow(sparse_corpus)
term_in_docs <- colSums(sparse_corpus > 0)
idf <- log(number_of_docs / term_in_docs)
```
```{r}
# compute TF/IDF
tf_idf <- t(t(sparse_corpus) * idf)
colnames(tf_idf) <- colnames(sparse_corpus)
rownames(tf_idf) <- vaccine_tweets$date
```
**Semantic Significance of Vaccine Brands** <p/>
By determining when each vaccine brand was most semantically significant, we are able to identify the appropriate "historical data" against which to cross-reference our working timetable.
```{r}
selected <- c("johnson" , "amp_moderna", "sputnikv","thank_pfizerbiontech","use_pfizerbiontech")
tf_idf_select <- tf_idf[, which(colnames(tf_idf) %in% selected)]
colnames(tf_idf_select) <- selected
tf_idf_select <- data.frame(date= vaccine_tweets$date,
as.data.frame(as.matrix(tf_idf_select))
)
tf_idf_select$id <- 1:nrow(tf_idf_select)
```
```{r, fig.width= 5, fig.height= 15,warning=FALSE}
# create visual
tf_idf_select <- melt(tf_idf_select, id.vars = c("id", "date"), variable.name = "vax", value.name = "tfidf")
# Order by Date
df_select <- tf_idf_select[order(tf_idf_select$tfidf, decreasing = T),][1:130,]
df_select <- df_select[order(df_select$date),]
ggplot(data = df_select, aes(x = reorder(date, tfidf), y = tfidf, fill = vax)) + geom_bar(stat = "identity") + coord_flip() + xlab("date")
```
### **Some Relevant Historical Data**
<p/>
**December 11:** FDA Agrees to EUA for COVID-19 Vaccine From Pfizer, BioNTech <p/>
**January 22:** New Data Shows Moderna Vaccine Can Induce Rare Anaphylactic Reactions <p/>
**February 26:** Vaccine Acceptance Among Americans Increases (KFF.ORG survey) <p/>
**March 14:** AstraZeneca Refutes Reports Linking Its Vaccine to Blood Clots <p/>
**April 7:** European Medicines Agency Finds Rare Clotting Links to AZ Vaccine <p/>
**April 13:** CDC, FDA, Recommend Pausing J&J Vaccine <p/>
**June 1 :** Employers Can Require COVID-19 Vaccine <p/>
**August 20:** Booster Plan Backlash <p/>
**Semantic Significance of Proxy Terms** <p/>
For this analysis, we focused on two sets of proxy terms: three negative and three positive.
```{r, fig.width= 5, fig.height= 15, warning=FALSE}
sentiment <- c("bad", "adverse", "side_effect_pfizerbiontech", "thank_science", "igottheshot", "fullyvaccinated_pfizerbiontech")
tf_idf_senti <- tf_idf[, which(colnames(tf_idf) %in% sentiment)]
colnames(tf_idf_senti) <- sentiment
tf_idf_senti <- data.frame(date= vaccine_tweets$date,
as.data.frame(as.matrix(tf_idf_senti))
)
tf_idf_senti$id <- 1:nrow(tf_idf_senti)
# create visual
tf_idf_senti <- melt(tf_idf_senti, id.vars = c("id", "date"), variable.name = "vax", value.name = "tfidf")
# Order by Date
df_senti <- tf_idf_senti[order(tf_idf_senti$tfidf, decreasing = T),][1:130,]
df_senti <- df_senti[order(df_senti$date),]
ggplot(data = df_senti, aes(x = reorder(date, tfidf), y = tfidf, fill = vax)) + geom_bar(stat = "identity") + coord_flip() + xlab("date")
```
### **Semantic Contribution of Certain Terms**
<p/>
**“bad” (peach) :** Late May - Mid-September <p/>
**“igottheshot” (blue):** Mid December - Early Feb <p/>
**“thank_science” (teal):** Mid-December, Late Feb and certain days in April, May, June and July <p/>
**“adverse” (green):** Several days in Dec 2020, Jan, Feb, and a day in March and April <p/>
First, the term “bad” had the most semantic contribution between late May and mid-September, which coincided with news of adverse events from the AstraZeneca and J&J vaccines in addition to mounting opposition to vaccine mandates and booster plans across the US around August. On the other hand, the term “igottheshot” was semantically significant between mid-December and early February, which coincided with the FDA's authorization of the Pfizer vaccine and the period when many people eagerly got vaccinated. Nevertheless, it should be noted that the terms “adverse” and “side_effect_pfizerbiontech” were also significant on certain days when “igottheshot” was semantically significant, which can be attributed to the news of adverse events from the Moderna vaccine reported around the same time that the Pfizer vaccine was being rolled out in many parts of the country. Through these approaches, we were able to roughly contextualize and confirm the probable sentiment of these selected terms as far as our analysis was concerned.<p/>
**Time Series Analysis of TF-IDF Score of Selected Terms**<p/>
However, in determining whether the positive sentiments were more pronounced compared to the negative sentiments (or vice versa), it might be helpful to see these TF-IDF scores graphed linearly in time. The results are shown below. <p/>
**"adverse"**
```{r,warning=FALSE}
tf_idf_b <- tf_idf[, which(colnames(tf_idf) == "adverse")]
tf_idf_b <- data.frame(date= vaccine_tweets$date, adverse = tf_idf_b)
tf_idf_b$id <- 1:nrow(tf_idf_b)
ggplot(tf_idf_b, aes(x = date, y = adverse)) +
geom_line(color = 'blue') + geom_point(color = 'blue')
```
**"igottheshot"**
```{r,warning=FALSE}
tf_idf_b <- tf_idf[, which(colnames(tf_idf) == "igottheshot")]
tf_idf_b <- data.frame(date= vaccine_tweets$date, igottheshot = tf_idf_b)
tf_idf_b$id <- 1:nrow(tf_idf_b)
ggplot(tf_idf_b, aes(x = date, y = igottheshot)) +
geom_line(color = 'red') + geom_point(color = 'red')
```
<p/>
For the first set of proxy terms, shown above, both sentiments grew sparse after July, while the positive sentiment (via the term “igottheshot") was most pronounced between January and February. This is in contrast to the negative sentiment, represented by the word “adverse”, which was pronounced for a longer duration, specifically between January and April.<p/>
**"side_effect_pfizerbiontech"**
```{r,warning=FALSE}
tf_idf_b <- tf_idf[, which(colnames(tf_idf) == "side_effect_pfizerbiontech")]
tf_idf_b <- data.frame(date= vaccine_tweets$date, side_effect_pfizerbiontech = tf_idf_b)
tf_idf_b$id <- 1:nrow(tf_idf_b)
ggplot(tf_idf_b, aes(x = date, y = side_effect_pfizerbiontech)) +
geom_line(color = 'purple') + geom_point(color = 'purple')
```
**"thank_pfizerbiontech"**
```{r,warning=FALSE}
tf_idf_b <- tf_idf[, which(colnames(tf_idf) == "thank_pfizerbiontech")]
tf_idf_b <- data.frame(date= vaccine_tweets$date, thank_pfizerbiontech = tf_idf_b)
tf_idf_b$id <- 1:nrow(tf_idf_b)
ggplot(tf_idf_b, aes(x = date, y = thank_pfizerbiontech)) +
geom_line(color = 'green') + geom_point(color = 'green')
```
<p/>
However, the positive sentiment represented by the term “thank_pfizerbiontech” seemed to be more dominant between January and July compared to “side_effect_pfizerbiontech”, and it also differed significantly from the results for “igottheshot”, which became irrelevant after March.<p/>
When the graphs of the two negative and positive terms were superimposed on one another, we could again observe a roughly equal contention between the positive and negative terms; a sketch of how the series can be superimposed is shown below, and the original superimposed images appear on page 9 of the documentation.
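The following is a minimal sketch (assuming `tf_idf`, `vaccine_tweets`, `melt` and `ggplot` from the chunks above), not the exact figure from the documentation.
```{r, eval=FALSE}
# Superimpose one positive and one negative TF-IDF series in a single plot
pair <- c("thank_pfizerbiontech", "side_effect_pfizerbiontech")
tf_idf_pair <- data.frame(date = vaccine_tweets$date,
                          as.data.frame(as.matrix(tf_idf[, pair])))
tf_idf_pair$id <- 1:nrow(tf_idf_pair)
df_pair <- melt(tf_idf_pair, id.vars = c("id", "date"),
                variable.name = "term", value.name = "tfidf")
ggplot(df_pair, aes(x = date, y = tfidf, color = term)) +
  geom_line(linetype = 2) + geom_point()
```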
### **Topic Modeling**
<p/>
To contextualize any overarching themes in the whole corpora, topic modeling was used.<p/>
**Mod 1**
<p/>
For the first LDA model, we used 3 topics (K=3) along with the default value for alpha, which is 50/K, and the default value for eta, which is 0.01. Given that the corpus is composed of tweets from different countries, it might be difficult even for the LDA approach to converge on coherent topics. As such, topic modeling was mainly used to generate some baseline themes, and further tuning of these models with searchK() was not pursued (a rough sketch of such a search appears at the end of this section).
```{r,results='hide'}
Mod1 <- stm(sparse_corpus, init.type = 'LDA', seed = 12345, K = 3)
```
```{r}
# Derive the topic content
Mod1_topic_content <- as.data.frame(t(exp(Mod1$beta$logbeta[[1]])))
# Create topic names
Mod1_topic_names <- apply(Mod1_topic_content, 2, function(x) {paste(Mod1$vocab[order(x,decreasing = T)[1:5]], collapse = " ")})
# Topic prevalence
Mod1_topic_prev <- as.data.frame(Mod1$theta)
df1 <- Mod1_topic_prev
colnames(df1) <- Mod1_topic_names
df1$date <- as.character(vaccine_tweets$date)
df1 <- melt(df1, id.vars = 'date', value.name = 'proportion', variable.name = 'topic')
```
```{r}
apply(Mod1_topic_content, 2, function(x) {Mod1$vocab[order(x, decreasing = T)[1:20]]})
```
```{r}
head(Mod1_topic_names,12)
```
Other than 'V2', we did not get any topics that could be indicative of vaccine sentiment. Given that we are mostly focusing on American users, 'V3' was disregarded.
```{r}
# Baseline mean score for Mod1
mean(apply(Mod1_topic_prev, 1, max))
```
The low mean score of this model indicated that documents were only weakly associated with their dominant topics and that there was considerable heterogeneity among the topics in the corpus.
```{r}
# Topic prevalence
Mod1_topic_prev <- as.data.frame(Mod1$theta)
df1 <- Mod1_topic_prev
colnames(df1) <- Mod1_topic_names
df1$date <- substr(vaccine_tweets$date,1,7)
df1$date <- as.character(df1$date)
df1 <- melt(df1, id.vars = 'date', value.name = 'proportion', variable.name = 'topic')
```
While the mean score was not encouraging, it might still be interesting to graph these topic distributions by month.
```{r,fig.width=10,fig.height=11}
ggplot(df1, aes(x = topic, y = proportion, fill = topic)) + geom_bar(stat = 'identity') +
theme(axis.text.x = element_blank(), axis.text.y = element_blank(), legend.position = "bottom", legend.text=element_text(size=8),legend.title = element_text( size=8),legend.key.size = unit(2,"mm")) +
coord_flip() + facet_wrap(~ date, ncol = 10) +guides(fill = guide_legend(title.position = "top", ncol = 1))
```
**Model 2**
<p/>
In this model, K was set to 8 in order to estimate the upper range for when we can observe a decrease in mean score.
```{r,results='hide'}
Mod2 <- stm(sparse_corpus, init.type = 'LDA', seed = 12345, K = 8)
```
```{r}
# derive the topic content
Mod2_topic_content <- as.data.frame(t(exp(Mod2$beta$logbeta[[1]])))
# Create topic names
Mod2_topic_names <- apply(Mod2_topic_content, 2, function(x) {paste(Mod2$vocab[order(x,decreasing = T)[1:5]], collapse = " ")})
# Topic prevalence
Mod2_topic_prev <- as.data.frame(Mod2$theta)
df2 <- Mod2_topic_prev
colnames(df2) <- Mod2_topic_names
df2$date <- as.character(vaccine_tweets$date)
df2 <- melt(df2, id.vars = 'date', value.name = 'proportion', variable.name = 'topic')
```
```{r}
apply(Mod2_topic_content, 2, function(x) {Mod2$vocab[order(x, decreasing = T)[1:20]]})
```
```{r}
head(Mod2_topic_names,12)
```
```{r}
# Baseline mean score
mean(apply(Mod2_topic_prev, 1, max))
```
The mean score decreased significantly for the model with K=8, and the resulting topic names were again not indicative of clear vaccine sentiment.
**Model 3**
<p/>
To explore the LDA approach more fully, we then created a model with K=5.
```{r,results='hide'}
Mod3 <- stm(sparse_corpus, init.type = 'LDA', seed = 12345, K = 5)
```
```{r}
# derive the topic content
Mod3_topic_content <- as.data.frame(t(exp(Mod3$beta$logbeta[[1]])))
# Create topic names
Mod3_topic_names <- apply(Mod3_topic_content, 2, function(x) {paste(Mod3$vocab[order(x,decreasing = T)[1:5]], collapse = " ")})
# Topic prevalence
Mod3_topic_prev <- as.data.frame(Mod3$theta)
df3 <- Mod3_topic_prev
colnames(df3) <- Mod3_topic_names
df3$date <- as.character(vaccine_tweets$date)
df3 <- melt(df3, id.vars = 'date', value.name = 'proportion', variable.name = 'topic')
```
```{r}
apply(Mod3_topic_content, 2, function(x) {Mod3$vocab[order(x, decreasing = T)[1:20]]})
```
```{r}
head(Mod3_topic_names,12)
```
```{r}
# Baseline mean score
mean(apply(Mod3_topic_prev, 1, max))
```
Ultimately, increasing the number of topics decreased the mean score, and therefore Mod1 with K=3 was the best of the 3 models. But as mentioned earlier, even Mod1 was not an accurate representation of the topics in the corpus based on its mean score and the resulting topic names.
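Although model tuning was not pursued here, a rough sketch of what a systematic search over K with stm's searchK() could look like is shown below. The conversion helper is ad hoc (an assumption of this sketch), since searchK() expects documents in stm's list format rather than a sparse matrix.
```{r, eval=FALSE}
# Convert the sparse dtm into stm's list format: one 2 x n matrix per document
# (first row = term indices into the vocabulary, second row = counts)
keep <- rowSums(sparse_corpus) > 0   # drop documents with no retained terms
m <- sparse_corpus[keep, ]
docs <- lapply(seq_len(nrow(m)), function(i) {
  idx <- which(m[i, ] > 0)
  rbind(as.integer(idx), as.integer(m[i, idx]))
})
# Compare the three values of K used above; plot() shows held-out likelihood,
# residuals, semantic coherence and the lower bound for each K
k_search <- searchK(docs, vocab = colnames(m), K = c(3, 5, 8))
plot(k_search)
```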
### **Word Embedding**
<p/>
To further vet the proxy terms that were tracked in the previous analyses, a word embedding approach was used. Specifically, if a term was closely associated with a vaccine brand and/or some of the other negative or positive proxy terms in our list, then the term was considered a good tracker for negative or positive sentiment.
```{r}
# create all possible ngrams given context_window= 9
skipgrams <- unnest_tokens(vaccine_tweets, ngram,text_preprocessed, token = "ngrams", n = 9)
#Generate ngramID
skipgrams$ngramID <- 1:nrow(skipgrams)
#Generate skipgramID and concatenate all the info
skipgrams$skipgramID <- paste(skipgrams$X_unit_id, skipgrams$ngramID, sep = '_')
#Divide each ngram by word
skipgrams <- unnest_tokens(skipgrams, word, ngram)
```
**Create Probability Table of the Skip Gram Model**
</p>
Now that we have the indices for each separate context window and a row for each term that appears in each window, we want to calculate the co-occurrence of terms in context windows.
```{r, warning=FALSE}
# use pairwise_count() to get frequency
skipgram_probs <- pairwise_count(skipgrams, word, skipgramID, diag = T, sort = T)
#calculate probability
skipgram_probs$p <- skipgram_probs$n/sum(skipgram_probs$n)
```
**Divide by Word** <p/>
Normalize these probabilities by the overall frequency of each term in the corpus.
```{r}
unigram_probs <- unnest_tokens(vaccine_tweets, word, text_preprocessed)
unigram_probs <- count(unigram_probs, word, sort = T)
unigram_probs$p <- unigram_probs$n/sum(unigram_probs$n)
```
**Creating Normalized Model**
```{r}
# drop word pairs that appear 20 times or fewer
lbound <- 20
normed_probs <- skipgram_probs[skipgram_probs$n > lbound,]
# rename the cols names
colnames(normed_probs) <- c('word1', 'word2', 'n', 'p_all')
# merge the list of pairs and their probabilities with the unigram probabilities we just calculated.
normed_probs <- merge(normed_probs, unigram_probs[, c('word', 'p')], by.x = 'word2', by.y = 'word', all.x = T)
normed_probs <- merge(normed_probs, unigram_probs[, c('word', 'p')], by.x = 'word1', by.y = 'word', all.x = T)
#calculate the normalized probability by dividing the pair’s probability by each of the individual terms’ probabilities
normed_probs$p_combined <- normed_probs$p_all/normed_probs$p.x/normed_probs$p.y
normed_probs <- normed_probs[order(normed_probs$p_combined, decreasing = T),]
```
**PMI Metric Integrated**
<p/>
In determining the terms with the highest likelihood of association, the PMI (pointwise mutual information) metric was used. This metric is a useful transformation of the normalized probability that equals 0 when the co-occurrence of two terms is exactly what we would expect by chance.
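Concretely, with $p(x,y)$ the pair probability and $p(x)$, $p(y)$ the unigram probabilities computed above:
$$
\mathrm{PMI}(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)}
$$
which is 0 exactly when the two terms co-occur as often as independence would predict.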
```{r}
# The following code block takes a while to run.
# Create PMI score col
#normed_probs$pmi <- log(normed_probs$p_combined)
# Create PMI_Matrix
#pmi_matrix <- cast_sparse(normed_probs, word1, word2, pmi)
#pmi_matrix@x[is.na(pmi_matrix@x)] <- 0
# Compress pmi_matrix
#pmi_svd <- irlba(pmi_matrix, 256, maxit = 1e6, fastpath=FALSE)
#word_vectors <- pmi_svd$u
#rownames(word_vectors) <- rownames(pmi_matrix)
```
The following code blocks were run on the University's virtual machine. Due to the long processing time for these methods, the results were captured as images rather than as a csv file.
```{r, out.width = "300px"}
knitr::include_graphics("C:/Users/16502/Downloads/bad.png")
#bad <- word_vectors %*% word_vectors["bad",] %>% as.data.frame() %>%
#rename(bad = V1) %>% arrange(-bad)
```
```{r,out.width = "300px"}
knitr::include_graphics("C:/Users/16502/Downloads/igottheshot.jpg")
#igottheshot <- word_vectors %*% word_vectors["igottheshot",] %>% as.data.frame() %>%
#rename(igottheshot = V1) %>% arrange(-igottheshot)
```
```{r,out.width = "300px"}
knitr::include_graphics("C:/Users/16502/Downloads/sputnik.jpg")
#sputnik <- word_vectors %*% word_vectors["sputnik",] %>% as.data.frame() %>%
#rename(sputnik = V1) %>% arrange(-sputnik)
```
<p/> The results of the word embedding approach indicated that the word “bad” was a good tracker for negative sentiment, given that “pfizer” appeared as the second and third most associated term for “bad”. The results for “igottheshot” and “sputnik” were not as forthcoming, but fortunately the positive terms that were selected had clear positive connotations. Due to the long processing time for this approach, only 3 terms were vetted.
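As a possible refinement (not part of the original run), the commented-out dot products above could be replaced with cosine similarities once `word_vectors` is available; `nearest_terms()` below is a hypothetical helper written for this sketch.
```{r, eval=FALSE}
# Hypothetical helper: rank terms by cosine similarity to a query term
nearest_terms <- function(word_vectors, term, n = 10) {
  norms <- sqrt(rowSums(word_vectors^2))
  norms[norms == 0] <- 1                    # guard against zero-length vectors
  unit_vecs <- word_vectors / norms         # L2-normalize each row
  sims <- unit_vecs %*% unit_vecs[term, ]   # dot products of unit vectors = cosines
  head(sort(sims[, 1], decreasing = TRUE), n)
}
# nearest_terms(word_vectors, "bad")
```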
### **VADER Lexicon**
The VADER lexicon was used to generate sentiment scores that were then compared with the several time series graphs generated previously. In doing so, the ranges of the compound scores for the processed and unprocessed tweets were compared, and the results from the unprocessed corpus were used for the next analysis.
```{r}
library(vader)
```
**Using the Unprocessed Text**
```{r}
m <- vader_df(vaccine_tweets$text)
```
```{r}
# Create df with the values we want
raw_vader_senti <- data.frame(
text = m$text,
compound = m$compound,
neg = m$neg,
pos = m$pos,
neu = m$neu,
date = vaccine_tweets$date
)
raw_vader_senti$id <- 1:nrow(raw_vader_senti)
```
**Using the Processed Text**
```{r}
b <- vader_df(vaccine_tweets$text_preprocessed)
```
```{r}
# Create df with the values we want
pros_vader_senti <- data.frame(
text = b$text,
compound = b$compound,
neg = b$neg,
pos = b$pos,
neu = b$neu,
date = vaccine_tweets$date
)
pros_vader_senti$id <- 1:nrow(pros_vader_senti)
```
**Examine the Difference Between the Results of the Processed and Raw Text**
```{r}
range(pros_vader_senti$compound)
```
```{r}
range(raw_vader_senti$compound)
```
The ranges were not too different between the two approaches, and we further confirmed this visually by looking at the graphed compound scores of the processed and raw text. Ultimately, the results from the raw text were used.
```{r}
plot(pros_vader_senti$date, pros_vader_senti$compound)
```
```{r}
plot(raw_vader_senti$date, raw_vader_senti$compound)
```
<p/>
To better observe the contention between the positive and negative sentiment scores, the compound scores were graphed using a color gradient.
```{r,warning=FALSE}
sp <- ggplot(raw_vader_senti, aes(date, compound)) +
geom_point(aes(color = compound)) + labs(color = "Compound Score")
#Sequential color scheme.
# Specify the colors for low and high ends of gradient
sp + scale_color_gradient(low = "blue", high = "red")
# Diverging color scheme
# Specify also the colour for mid point
mid <- mean(raw_vader_senti$compound)
sp + scale_color_gradient2(midpoint = mid, low = "blue", mid = "white",
high = "red", space = "Lab" )
```
<p/>
As shown above, the compound scores for each tweet ranged between -1 and +1, where a score close to -1 indicates negative sentiment and vice versa. A denser coloration in the upper quadrants can be seen, which indicates that there were more positive sentiments in the corpus according to the VADER lexicon. However, the difference was very subtle across all of the months. Furthermore, the dwindling of sentiments after July could be observed through this approach as well.<p/>
## **Analysis**
<p/>
Overall, we were able to observe two recurring themes throughout all of our approaches.
First, the results of the time series analyses of both term frequency and semantic contribution, as well as the distribution of the VADER-generated sentiment scores, all illustrated the dwindling of both sentiments over time. A possible explanation is that both the novelty and the perceived risk of getting vaccinated faded as more people became vaccinated, which can reduce users' need to tweet either type of sentiment. Furthermore, the dataset was sourced by an outside contributor to Kaggle, and the decreased volume of tweets during the later months of 2021 could be due to how the dataset was generated from the Twitter API.
<p/>
Second, the roughly balanced contention between the positive and negative sentiments was observed repeatedly across the methods mentioned earlier. This could be attributed to the dynamic nature of the issues around vaccines. For instance, the interplay between lower infection and/or hospitalization rates and news about adverse effects from certain vaccines can create equal forces from both ends, given how pervasive these topics are in society at this moment.
<p/>
Given the results across all of the techniques used in our methodology, it might be more appropriate to conclude that vaccine sentiment did not become more positive but instead remained roughly constant. As illustrated in the several graphs generated above, the volume of positive sentiment across time was more or less countered by the negative sentiment. To conclude that vaccine sentiment became more positive, we would need to see a more dramatic difference between the positive and negative sentiments.
## **Limitations**
<p/>
It should be noted that the methodology of this project has certain limitations that prevent any robust conclusions from being drawn. For instance, the validation process for the proxy terms relied heavily on historical data specific to the United States. However, as mentioned in the EDA part of the project, a significant number of tweets were generated outside of the United States. While this does not prevent us from analyzing vaccine sentiment, since COVID-19 is a global issue, it does weaken our validation process and potentially over-generalizes our insights, which might make them less valuable for US-based organizations that want to understand their community's sentiment towards the vaccine. In this iteration of the project, data volume was prioritized over subsetting the dataset by location, which would have unnecessarily reduced the size of our dataset.
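For reference, a location-based subset could look roughly like the sketch below; the pattern is illustrative and deliberately incomplete, and was not run as part of this project.
```{r, eval=FALSE}
# Illustrative only: keep tweets whose free-text user_location suggests a US location
us_pattern <- "USA|United States|U\\.S\\.|New York|California|Texas"
has_us_loc <- !is.na(vaccine_tweets$user_location) &
  grepl(us_pattern, vaccine_tweets$user_location, ignore.case = TRUE)
us_tweets <- vaccine_tweets[has_us_loc, ]
nrow(us_tweets)
```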
Additionally, our analysis was derived from observations of select proxy terms. More terms could be studied through the battery of approaches used in this project, which could still confirm or contradict our initial results. However, our methods were good starting points for the purposes of this project, and suggested refinements are mentioned in the conclusion.
<p/>
## **Conclusion**
<p/>
Through the several approaches employed to determine whether vaccine sentiment became more positive, it was evident that both types of sentiment dwindled over time and that the positive sentiments were marginally more pronounced than the negative sentiments between December 2020 and November 2021. Given that the positive sentiment was only marginally more dominant than the negative sentiment for the terms we specifically analyzed, it was more appropriate to conclude that sentiment remained roughly constant and did not become more positive over time, in contrast to our initial assumption. The “unsupervised” nature of our inquiry would simply require a more dramatic contrast between these sentiments in order to assert that any shift in sentiment occurred.
<p/>
Furthermore, only some terms were analyzed using our methodology, and a more systematic approach could further enhance the robustness of our conclusion. For instance, a dataset with a “clean” location feature could be used next time to add more specificity to the historical data and the type of sentiment to expect. Then, instead of using the word embedding approach as a supplementary vetting tool for proxy terms, we could use it as the primary tool for creating our list of proxy terms. This would reduce the number of terms to track so that we could pragmatically analyze all of them through the various time series analyses used in this project. Hence, by observing the overall behavior of these well-vetted words in terms of their frequency, semantic contribution and VADER-generated sentiment scores, we would be able to systematically chart any shifts in sentiment for all the terms and cross-reference them precisely with historical data.
<p/>
Ultimately, vaccination efforts include some element of translational work. Given that palpable insights can be derived from certain NLP approaches, government agencies can leverage these tools to approach the unvaccinated members of the community appropriately and mitigate the further spread of the virus in the United States.