-
Notifications
You must be signed in to change notification settings - Fork 18
/
topic-models.Rmd
236 lines (175 loc) · 13.1 KB
/
topic-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
# (PART\*) Part III: Going Further {-}
# Topic Models
Anyone who has taken a literature class has been asked to discuss the various themes of a given text. How are such themes discovered? Upon close reading of the text, recurrent references to similar concepts might be bundled in our minds as some general theme hidden throughout the text. Discovery of such themes is part of the joy of reading in general.
One can think of these hidden, or latent, themes much as we would latent variables in factor analysis. Different texts have both contrasting and common themes, but the author doesn't usually come out and announce a theme explicitly. Now consider a situation in which you have a million books. You can’t read that many, but may still want to discover the themes in them. This is where <span class="emph">topic modeling</span> can be useful.
We don't actually analyze raw text, but what we can do is get word counts for every text[^words], and construct a <span class="emph">document term matrix</span> (DTM). In this matrix, rows represent the documents and columns represent terms that are found in all documents. The values represent the counts of how many times a term is found in a text. With many texts, one will easily have thousands of columns, but most texts do not use most of the words, resulting in a very <span class="emph">sparse</span> matrix that is mostly zeros. Still, now that we have a numeric matrix, we can perform analysis on it. The goal is just like the goal PCA and factor analysis- we want to reduce these thousands of columns of terms to a far fewer number of topics. Furthermore, like factor analysis, we will want to interpret the topics, in this case based on the terms associated with them.
## Latent Dirichlet Allocation
The most common approach to topic modeling is <span class="emph">latent dirichlet allocation</span> (LDA), which one can think of as discrete PCA. In my opinion, this should be as much a part of your toolbox as PCA and Factor Analysis, as 'compositional' data, where we have counts of occurrences (out of some total), are quite common. In the past PCA was applied to such data, but it is essentially a less performant approach with less intuitive results.
I have gone into far more detailed demonstration elsewhere, and have no desire to duplicate it. I have [workshop notes](http://m-clark.github.io/text-analysis-with-R/) devoted exclusively to it, as well as hands-on demos [here](http://m-clark.github.io/docs/topic_models/topic-model-demo.html) and [here](http://micl.shinyapps.io/texEx/texEx.Rmd). However we can discuss a couple things, starting with the Dirichlet distribution. A draw from the Dirichlet distribution can be seen as a probability distribution for a k category event. It has one parameter, we'll call $\alpha$, which is often referred to as the concentration parameter. If the k $\alpha$ values are equal, the resulting k probabilities will be equal on average, and with larger $\alpha$, there will be less variance around that probability. When they are unequal, the larger values will result in larger probabilities assigned. Consider the following for `k=5` topics.
```{r dirichlet_draw, echo=1:5}
library(gtools)
probs1 = rdirichlet(n=1000, alpha=rep(1,5))
probs2 = rdirichlet(n=1000, alpha=rep(100,5)) # less variance
probs3 = rdirichlet(n=1000, alpha=(1:5)^2)
map(list(probs1, probs2, probs3), colMeans) %>% map(round, 2)
# one more example of where purrr can only replicate the apply family for anything meaningful, can't return a matrix, can't bind the rows... sigh
```
In topic modeling, the probabilities can represent the probability of various topics, or the probability of terms within topics. However, the thing to note is that LDA can be applied to any appropriate data, it doesn't have to be a document-term matrix resulting from text. Any count-based data matrix might potentially be appropriate.
## Analysis
When it comes to topic modeling, most of the time is spent on processing the text itself. Importing/scraping it, dealing with capitalization, punctuation, removing stopwords, dealing with encoding issues, removing other miscellaneous common words. It is a highly iterative process such that once you initially get to the document-term matrix, you're just going to find the stuff that was missed before and repeat the process with new 'cleaning parameters' in place. So getting to the analysis stage is the hard part. The following image is from the [tidytext book](http://tidytextmining.com/), and gives some sense of the process, as well as some R packages that might be of use to you.
<a href="http://tidytextmining.com/images/tidyflow-ch-6.png"><img src="img/tidyflow.png" style="display:block; margin: 0 auto;" width=75%></a>
In what follows we'll start at the point of having the DTM in place and ready for analysis. For our needs we'll use the <span class="pack">topicModels</span> package for the analysis, and mostly others for post-processing. As mentioned, one of the primary results of such an analysis are the probabilities of terms within topics, which like factor loadings, can aid in interpreting the topics. The other result is the probability that a topic will be present in a given document.
The texts we'll analyze are Dante's Divine Comedy. Each canto within the three texts of Inferno, Purgatory, and Paradise will be treated as a document. I have already created the DTM where stopwords have been removed, but plenty more cleaning could have been applied. To get a sense of what's going into the analysis, I show some of the more common terms, but note also, as in most text analysis, most terms are not present in most documents, leading to a notably sparse DTM.
```{r dtm_setup, echo=1}
load('data/topic/divine_comedy_dtm.RData')
library(gutenbergr)
titles = c('The vision of hell. by Dante Alighieri',
'The Divine Comedy by Dante, Illustrated, Purgatory, Complete',
'The Divine Comedy by Dante, Illustrated, Paradise, Complete')
ids = c('The vision of hell. by Dante Alighieri',
'The Divine Comedy by Dante, Illustrated, Purgatory, Complete',
'The Divine Comedy by Dante, Illustrated, Paradise, Complete')
purgatory = 8795
paradise = 8799
hell = 8789
dc_books = gutenberg_works(gutenberg_id %in% c(purgatory, paradise, hell)) %>%
gutenberg_download(meta_fields = "title") %>%
mutate(title = factor(title, labels=c('Paradise', 'Purgatory', 'Inferno'))) # checked
# dc_whole = gutenberg_works(gutenberg_id==8800) %>%
# gutenberg_download(meta_fields = c('title'))
library(tidytext)
library(stringr)
by_chapter = dc_books %>%
group_by(title) %>%
filter(!str_detect(text, pattern='[0-99]')) %>% # get rid of 'list of cantos
slice(-(1:25)) %>% # get rid of title page
mutate(chapter = cumsum(str_detect(text, regex("^canto ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0)
by_chapter_word = by_chapter %>%
unite(title_chapter, title, chapter) %>%
unnest_tokens(word, text)
by_chapter_bigram = by_chapter %>%
unite(title_chapter, title, chapter) %>%
unnest_tokens(word, text, token='ngrams', n=2)
old_stops0 = read_lines('data/topic/old_english_stop_words.txt')
old_stops = data_frame(word=str_conv(old_stops0, 'UTF8'),
lexicon = 'cltk')
me_stops0 = read_lines('data/topic/middle_english_stop_words.txt')
me_stops = data_frame(word=str_conv(me_stops0, 'UTF8'),
lexicon = 'cltk')
em_stops0 = read_lines('data/topic/early_modern_english_stop_words.txt')
em_stops = data_frame(word=str_conv(em_stops0, 'UTF8'),
lexicon = 'emc')
word_counts = by_chapter_word %>%
# bind_rows(by_chapter_bigram) %>%
anti_join(old_stops) %>%
anti_join(em_stops) %>%
anti_join(me_stops) %>%
anti_join(stop_words) %>%
filter(word!='canto') %>%
count(title_chapter, word, sort = TRUE) %>%
ungroup()
# note that DTM %>% broom::tidy() will produce the same tidy word count df
DT::datatable(head(word_counts,100),
options=list(dom='tp',
autoWidth = TRUE,
# columnDefs = list(list(width = '50px', targets = 0)),
align='center',
columnDefs=list(list(width='100px', targets=0),
list(width='50px', targets=1:2),
list(className = 'dt-center', targets = 0:2))),
rownames=F,
width=500)
divine_comedy_chapters_dtm = word_counts %>%
cast_dtm(title_chapter, word, n)
# with stemming
# divine_comedy_chapters_dtm = word_counts %>%
# cast_dfm(title_chapter, word, n) %>%
# quanteda::dfm_wordstem()
divine_comedy_chapters_dtm
save('divine_comedy_chapters_dtm', file='data/topic/divine_comedy_dtm.RData')
```
As with factor or cluster analysis, one must choose the number of topics to retain. There are various methods/statistics that can help with this, but simple interpretability could be used as well. We'll go with three to see if the three books are uniquely expressive.
```{r lda_divine_comedy}
library(topicmodels)
chapters_lda = LDA(divine_comedy_chapters_dtm, k = 3, control = list(seed = 1234))
chapters_lda
```
First, we can simply look at probable terms. The following shows the top 10 most probable terms for each topic.
```{r top_terms, echo=FALSE}
chapters_lda_td = tidy(chapters_lda)
# chapters_lda_td
top_terms = chapters_lda_td %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(-beta)
# top_terms
# all of the following is because of ggplot's refusal to order data as it is presented
t1 = top_terms %>%
mutate(topic = factor(topic)) %>%
filter(topic==1) %>%
mutate(term = factor(term, levels=unique(term))) %>%
ggplot(aes(term, beta)) +
geom_point(aes(color=topic), size=3, show.legend=F) +
ylab('Probability') +
ylim(c(.0025,.0065)) +
scale_color_manual(values=scales::alpha(palettes$latvian_red$latvian_red), .85) +
theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust=.25)) +
theme_trueMinimal()
t2 = top_terms %>%
mutate(topic = factor(topic)) %>%
filter(topic==2) %>%
mutate(term = factor(term, levels=unique(term))) %>%
ggplot(aes(term, beta)) +
geom_point(aes(color=topic), size=3, show.legend=F) +
ylab('Probability') +
ylim(c(.0025,.0065)) +
scale_color_manual(values=scales::alpha(palettes$latvian_red$triadic[2]), .85) +
theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust=.25)) +
theme_trueMinimal()
t3 = top_terms %>%
mutate(topic = factor(topic)) %>%
filter(topic==3) %>%
mutate(term = factor(term, levels=unique(term))) %>%
ggplot(aes(term, beta)) +
geom_point(aes(color=topic), size=3, show.legend=F) +
ylab('Probability') +
ylim(c(.0025,.0065)) +
scale_color_manual(values=scales::alpha(palettes$latvian_red$triadic[3]), .85) +
theme(axis.text.x = element_text(size = 8, angle = 90, hjust = 1, vjust=.25)) +
theme_trueMinimal()
cowplot::plot_grid(t1, t2, t3, nrow=1, align='v', axis='l')
# ggplotly()
```
For the following I've collapsed across cantos to get an average topic probability for each book. For example, with the Inferno, the topic where **god**, **heaven**, and **virtue** are less probable, while the one with **beneath** and **cried** are more so. With Paradise you have the opposite situation, and are more likely to find the topics regarding **virtue**, **heaven**, **sun**, and **god**. Purgatory, perhaps not surprisingly is a fairly balanced mix of topics[^facomparison].
```{r doc_topics, echo=FALSE, fig.height=6}
chapters_lda_gamma = tidy(chapters_lda, matrix = "gamma")
# chapters_lda_gamma
chapters_lda_gamma = chapters_lda_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
mutate(title = factor(title),
topic = factor(topic)) %>%
rename(canto = chapter)
# chapters_lda_gamma
chapters_lda_gamma %>%
group_by(title, topic) %>%
summarise(`Average Probability`=mean(gamma)) %>%
ggplot(aes(x=topic, y=`Average Probability`, fill = topic)) +
geom_point(aes(color=topic), size=5) +
# scale_color_brewer(type='qual', palette=1) +
# scale_color_manual(values=scales::alpha(c(palettes$orange$orange, palettes$orange$triadic), .75)) +
scale_color_manual(values=scales::alpha(palettes$latvian_red$triadic, .85)) +
# scale_color_manual(values=scales::alpha(NineteenEightyR::sonny(3), .85)) +
ylim(c(0,.6)) +
facet_wrap(~ title, nrow = 1) +
theme_minimal() +
theme(axis.title.x=element_text(size=10),
panel.grid.major.x=element_blank())
# ggplotly()
```
## Summary of Topic Models
This is just a snippet of the potential with topic modeling, but hopefully you get the idea. As in other latent variable approaches, one can see LDA as a dimension reduction technique, where thousands of terms are boiled down to a few topics. However, topic models can also possess a rich interpretive quality. Just remember that what you call 'documents', 'terms' and 'topics' is not limited to text, and is in fact highly flexible.
[^words]: Words don't have to be *just* words, but also might be collections of terms.
[^facomparison]: This is just like in factor analysis where some variables load mostly on one factor while some have notable cross-loadings.