---
title: "Text Mining Mark Twain's Novels"
author: "Siddharth Sudhakar"
date: "14/06/2020"
output:
  html_document:
    css: style1.css
    highlight: monochrome
    theme: cosmo
    toc: yes
    toc_depth: 4
    toc_float: no
  pdf_document:
    toc: yes
    toc_depth: '4'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## <b>Introduction</b>
Samuel Langhorne Clemens, known worldwide by his pen name Mark Twain, was an American author renowned as one of the most influential writers in the English language. He authored more than 25 novels; in this project I will be analysing four of his most famous works:
* Adventures of Huckleberry Finn
* The Adventures of Tom Sawyer
* Roughing It
* The Gilded Age
<br>
I will download these novels from [Project Gutenberg](http://www.gutenberg.org/) (a library of over 60,000 free eBooks) using the `gutenbergr` package.
```{r warning=FALSE,message = FALSE}
library(tidyverse)
library(wordcloud)
library(gutenbergr)
library(tidytext)
library(tidyr)
library(ggraph)
library(igraph)
library(reshape2)
```
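If any of these packages are missing, they can be installed first (a setup note I have added; adjust to your environment):
```{r eval=FALSE}
# One-time setup (assumption: packages not yet installed)
install.packages(c("tidyverse", "wordcloud", "gutenbergr", "tidytext",
                   "ggraph", "igraph", "reshape2"))
```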
```{r message = FALSE}
twain_books <- gutenberg_download(c(74, 76, 3177, 3178), meta_fields = "title")
```
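The Project Gutenberg IDs used above (74, 76, 3177, 3178) can be looked up in the metadata bundled with `gutenbergr`; a quick sketch of how one might find them:
```{r eval=FALSE}
# Search the bundled Gutenberg metadata for Twain's works to find their IDs
gutenberg_works(author == "Twain, Mark") %>%
  select(gutenberg_id, title)
```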
## <b>Tokenizing by n-gram</b>
Now that we have the data, the next step is tokenization: we split the text into individual words using `unnest_tokens()`, and then remove stop words with `anti_join()`.
```{r message = FALSE}
tidy_books <- twain_books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
```
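As a quick sanity check (my addition), we can count how many non-stop-word tokens each novel contributes:
```{r message = FALSE}
# Tokens remaining per novel after stop-word removal
tidy_books %>%
  count(title)
```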
Let us have a quick look at the most frequently used words in Mark Twain's novels.
```{r message = FALSE,fig.align = "center"}
tidy_books %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  top_n(10) %>%
  ggplot(aes(word, n, fill = -n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  ggtitle("The Most Common Words in Mark Twain's Novels")
```
Earlier we simply counted the occurrence of each word and listed the most frequent ones. On its own this doesn't tell us much, because we don't know the context in which the words were used, and a word's context can matter nearly as much as its presence. We will therefore now work with tokens of two words (bigrams) or three words (trigrams), which can surface useful phrases and lead to additional insights.
### <b>Bigrams</b>
```{r}
books_bigrams <- twain_books %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
books_bigrams
```
In the steps below, we first split each bigram into its two words, drop any bigram in which either word is a stop word, and then unite the remaining pairs again.
```{r}
bigrams_separated <- books_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_filtered
```
```{r}
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigrams_united
```
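Before weighting by novel, here is a quick look (my addition) at the most common bigrams across all four books combined:
```{r}
# Most frequent bigrams over the whole corpus, stop words removed
bigrams_united %>%
  count(bigram, sort = TRUE)
```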
Raw counts, however, tend to surface the same phrases everywhere, so the next step is to find the bigrams most characteristic of each novel by weighting them with tf-idf.
```{r fig.align = "center"}
bigram_tf_idf <- bigrams_united %>%
count(title, bigram) %>%
bind_tf_idf(bigram, title, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf %>%
group_by(title) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
mutate(bigram = reorder(bigram, tf_idf)) %>%
ggplot(aes(bigram, tf_idf, fill = title)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ title, ncol = 2, scales = "free") +
coord_flip() +
labs(y = "tf-idf of bigram to novel",
x = "")
```
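For reference, `bind_tf_idf()` combines each bigram's term frequency within a novel with its inverse document frequency, idf = log(number of novels / number of novels containing the bigram). A minimal sketch of the same computation done by hand (my addition, not part of the original pipeline):
```{r eval=FALSE}
# Manual tf-idf: tf = a bigram's share of a novel's bigrams,
# idf = log(#novels / #novels containing the bigram)
n_novels <- n_distinct(bigrams_united$title)
bigrams_united %>%
  count(title, bigram) %>%
  group_by(title) %>%
  mutate(tf = n / sum(n)) %>%
  group_by(bigram) %>%
  mutate(idf = log(n_novels / n_distinct(title)),
         tf_idf = tf * idf) %>%
  ungroup() %>%
  arrange(desc(tf_idf))
```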
We can see that the highest tf-idf bigrams in Mark Twain's novels are character names such as "injun joe", "mary jane", "aunt sally", "aunt polly", and "miss watson".
Finally, we will visualize relationships between these bigrams by plotting a network of bigrams with ggraph.
```{r warning=FALSE}
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_graph <- bigram_counts %>%
filter(n > 20) %>%
graph_from_data_frame()
```
```{r fig.align = "center"}
set.seed(2020)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
ggtitle("Network of Bigrams in Mark Twains' Novels")
```
In the word network we can see that quantitative words like "yards", "miles", "dollars", and "feet" all share the common node "hundred". We can also see that salutations such as "aunt", "miss", and "col" are followed by names, and beyond those, many character names appear.
### <b>Trigrams</b>
```{r}
twain_books %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
```
The most popular trigrams are "miss mary jane", "salt lake city", "genuine mexican plug", and "ten/hundred/twenty thousand dollars".
## <b>Sentiment Analysis</b>
### <b>Comparison Cloud</b>
First, we will look at the words that contribute most to positive and negative sentiment. We will use the [Bing](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) lexicon for this purpose.
```{r warning=FALSE,message = FALSE,fig.align = "center"}
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#F8766D", "#00BFC4"),
max.words = 100)
```
### <b>Overall Sentiment of Novels</b>
Here we first find a sentiment score for each word using the Bing lexicon and `inner_join()`. We then divide each novel into sections of 80 lines of text and define an `index` column to keep track of where we are in the narrative (for example, line 161 of a novel falls into index 161 %/% 80 = 2). We use `spread()` to put positive and negative counts into separate columns, and finally calculate the net sentiment (positive - negative). We can then plot the net sentiment score against `index`, which tracks narrative time through each text.
```{r message=FALSE}
section_words <- twain_books %>%
  group_by(title) %>%                  # line numbers restart for each novel
  mutate(section = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)
mark_twain_sentiment <- section_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, index = section %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
```
```{r fig.align = "center",fig.width=10}
ggplot(mark_twain_sentiment, aes(index, sentiment, fill = title)) +
geom_col(show.legend = FALSE) +
facet_wrap(~title, ncol = 2, scales = "free_x")
```
Next we will plot a stacked barchart to find the proportion of positive and negative sentiments in each of the four novels.
```{r message = FALSE,fig.width=10,fig.align = "center"}
books_sent_count <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, sentiment, sort = TRUE)
book_pos <- books_sent_count %>%
  group_by(title) %>%
  # each sentiment's share of the novel's sentiment-bearing words
  mutate(percent = 100 * n / sum(n))
# Plot the share per novel as a stacked bar chart
ggplot(book_pos, aes(title, percent, fill = sentiment)) +
  geom_col() +
  ggtitle("Proportion of Positive/Negative Sentiment Scores")
```
All the novels have a greater proportion of negative sentiment, and earlier, when we divided each novel into sections, we likewise found that most sections were net negative. Even so, I suspect that many words are falsely labelled as positive, because a negation in front of a positive word reverses its meaning.
To verify this, we will reuse the bigrams we collected earlier to find the most common words that are preceded by a negation term.
```{r fig.height=8,fig.align = "center"}
negation_words <- c("not", "no","can't","don't")
negated_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)
negated_words %>%
mutate(contribution = n * value,
word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>%
group_by(word1) %>%
top_n(10, abs(contribution)) %>%
ggplot(aes(word2, contribution, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ word1, scales = "free",ncol=2) +
scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
xlab("Words preceded by negation term") +
ylab("Sentiment value * # of occurrences") +
coord_flip()+
ggtitle("Positive/Negative Words that Follow Negations")
```
Here we can see pairings like "can't help", "don't like", "no good", and "not help", which would otherwise be scored as positive.
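One possible correction (a sketch I am adding, not part of the original analysis) is to flip the AFINN value of any word that follows a negation term before summing up a net score per novel:
```{r eval=FALSE}
# Hedged sketch: reverse the AFINN score of negated words, then recompute
# a net sentiment per novel; reuses bigrams_separated and negation_words
bigrams_separated %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  mutate(value = ifelse(word1 %in% negation_words, -value, value)) %>%
  group_by(title) %>%
  summarise(net_sentiment = sum(value))
```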
### <b>NRC Lexicon</b>
We will use the [NRC Word Emotion Association Lexicon](http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm) to add eight more sentiments to our analysis.<br>
Before we find the major sentiments in the novels, let us first check how individual words contribute to the sentiments in the NRC lexicon.
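A practical note I am adding: in current releases of `tidytext`, the NRC and AFINN lexicons are fetched on first use through the `textdata` package, so you may need to install it and accept the license prompt once:
```{r eval=FALSE}
# One-time setup (assumption: the lexicons have not been downloaded yet)
# install.packages("textdata")
get_sentiments("nrc")   # answer the interactive prompt on first run
```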
```{r message = FALSE,fig.align = "center"}
tidy_books %>%
  count(word) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(!grepl("positive|negative", sentiment)) %>%
  group_by(sentiment) %>%
  top_n(5, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n, fill = sentiment)) +
  facet_grid(~sentiment) +
  geom_col() + # bar height = how often the word appears in the novels
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_blank()) + # hide the count axis after the flip
  xlab(NULL) + ylab(NULL) +
  ggtitle("Words From Mark Twain's Novels") +
  coord_flip() +
  theme(legend.position = "none")
```
```{r message = FALSE,fig.align = "center"}
books_sent_count2 <- tidy_books %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(!grepl("positive|negative", sentiment)) %>%
  count(sentiment, sort = TRUE)
books_sent_count2 %>%
  ggplot(aes(x = reorder(sentiment, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none") +
  ggtitle("Top Sentiments in Mark Twain's Novels using the NRC Lexicon")
```
Next we will look at how the sentiments are distributed within each of the four novels. Since "The Gilded Age" and "Roughing It" are relatively bigger novels in terms of word count, I have divided each sentiment's word count in a novel by that novel's total sentiment word count, so that all the novels are on a similar scale.
```{r fig.width=10,fig.align = "center"}
scores <- tidy_books %>%
  inner_join(get_sentiments("nrc"), by = c("word" = "word")) %>%
  filter(!grepl("positive|negative", sentiment)) %>%
  count(title, sentiment)
scores %>%
  group_by(title) %>%
  mutate(percent = (n / sum(n)) * 100) %>% # scale counts to percentages per novel
  ggplot(aes(x = sentiment, y = percent, fill = sentiment)) +
  geom_col(width = 0.6) +
  coord_flip() +
  facet_wrap(~title, ncol = 4) +
  theme(legend.position = "none") +
  ggtitle("Comparison of Sentiments from Each of the Novels")
```
Interestingly, all the novels show a similar trend for each sentiment. The only major difference is in "The Gilded Age", where the degree of joy is greater than the degrees of sadness and fear, a pattern not followed by the other novels.<br>
Earlier, too, we saw that "The Gilded Age" was relatively more positive than the other novels. <br>
With that we come to the end of this project. Wow, that was a lot of fun, and I learned many new things along the way. The source code can be found here.
## <b>References</b>
[Text Mining with R](https://www.tidytextmining.com/)<br><br>