forked from dgrtwo/tidy-text-mining
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path04-tf-idf.Rmd
254 lines (190 loc) · 12.2 KB
/
04-tf-idf.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
# Analyzing word and document frequency: tf-idf {#tfidf}
```{r echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE)
options(width = 100, dplyr.width = 100)
library(ggplot2)
theme_set(theme_light())
```
A central question in text mining and natural language processing is how to quantify what a document is about. Can we do this by looking at the words that make up the document? One measure of how important a word may be is its *term frequency* (tf), how frequently a word occurs in a document; we have examined how to measure word frequency using tidy data principles in [Chapter 2](#tidytext). There are words in a document, however, that occur many times but may not be important; in English, these are probably words like "the", "is", "of", and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a very sophisticated approach to adjusting term frequency for commonly used words.
Another approach is to look at a term's *inverse document frequency* (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term's *tf-idf* (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used. It is intended to measure how important a word is to a document in a collection (or corpus) of documents. It is a rule-of-thumb or heuristic quantity; while it has proved useful in text mining, search engines, etc., its theoretical foundations are considered less than firm by information theory experts. The inverse document frequency for any given term is defined as
$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$
We can use tidy data principles, as described in [Chapter 2](#tidytext), to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.
## Term frequency in Jane Austen's novels
Let's start by looking at the published novels of Jane Austen and examine first term frequency, then tf-idf. We can start just by using dplyr verbs such as `group_by` and `join`. What are the most commonly used words in Jane Austen's novels? (Let's also calculate the total words in each novel here, for later use.)
```{r book_words}
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
unnest_tokens(word, text) %>%
count(book, word, sort = TRUE) %>%
ungroup()
total_words <- book_words %>%
group_by(book) %>%
summarize(total = sum(n))
book_words <- left_join(book_words, total_words)
book_words
```
There is one row in this data frame for each word-book combination; `n` is the number of times that word is used in that book and `total` is the total words in that book. The usual suspects are here, "the", "and", "to", and so forth. Let's look at the distribution of `n/total` for each novel, the number of times a word appears in a novel divided by the total number of terms (words) in that novel. This is exactly what term frequency is.
```{r plot_tf, dependson = "book_words", fig.height=9, fig.width=9}
library(ggplot2)
ggplot(book_words, aes(n/total, fill = book)) +
geom_histogram(alpha = 0.8, show.legend = FALSE) +
xlim(NA, 0.0009) +
labs(title = "Term Frequency Distribution in Jane Austen's Novels") +
facet_wrap(~book, ncol = 2, scales = "free_y")
```
There are very long tails to the right for these novels (those extremely common words!) that we have not shown in these plots. These plots exhibit similar distributions for all the novels, with many words that occur rarely and fewer words that occur frequently.
## Zipf's law
TODO
```{r freq_by_rank, dependson = book_words}
freq_by_rank <- book_words %>%
group_by(book) %>%
mutate(rank = row_number(),
`term frequency` = n/total)
freq_by_rank
```
```{r zipf, dependson = "freq_by_rank", fig.width=9, fig.height=6}
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = book)) +
geom_line(size = 1.2, alpha = 0.8) +
scale_x_log10() +
scale_y_log10()
```
Notice that this plot is in log-log coordinates.
```{r lower_rank, dependson = "freq_by_rank"}
lower_rank <- freq_by_rank %>%
filter(rank < 500)
lm(log10(`term frequency`) ~ log10(rank), data = lower_rank)
```
```{r zipf_fit, dependson = "freq_by_rank", fig.width=9, fig.height=6}
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = book)) +
geom_abline(intercept = -0.77, slope = -1.05, color = "gray50", linetype = 2) +
geom_line(size = 1.2, alpha = 0.8) +
scale_x_log10() +
scale_y_log10()
```
## The `bind_tf_idf` function
The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Jane Austen's novels as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a text, but not *too* common. Let's do that now.
```{r tf_idf, dependson = "book_words"}
book_words <- book_words %>%
bind_tf_idf(word, book, n)
book_words
```
Notice that idf and thus tf-idf are zero for these extremely common words. These are all words that appear in all six of Jane Austen's novels, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection. Let's look at terms with high tf-idf in Jane Austen's works.
```{r desc_idf, dependson = "tf_idf"}
book_words %>%
select(-total) %>%
arrange(desc(tf_idf))
```
Here we see all proper nouns, names that are in fact important in these novels. None of them occur in all of novels, and they are important, characteristic words for each text. Some of the values for idf are the same for different terms because there are 6 documents in this corpus and we are seeing the numerical value for $\ln(6/1)$, $\ln(6/2)$, etc. Let's look at a visualization for these high tf-idf words.
```{r plot_austen, dependson = "desc_idf", fig.height=6, fig.width=9}
plot_austen <- book_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))))
ggplot(plot_austen[1:20,], aes(word, tf_idf, fill = book)) +
geom_bar(alpha = 0.8, stat = "identity") +
labs(title = "Highest tf-idf words in Jane Austen's Novels",
x = NULL, y = "tf-idf") +
coord_flip()
```
Let's look at the novels individually.
```{r plot_separate, dependson = "plot_austen", fig.height=10, fig.width=9}
plot_austen <- plot_austen %>%
group_by(book) %>%
top_n(15) %>%
ungroup
ggplot(plot_austen, aes(word, tf_idf, fill = book)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
labs(title = "Highest tf-idf words in Jane Austen's Novels",
x = NULL, y = "tf-idf") +
facet_wrap(~book, ncol = 2, scales = "free") +
coord_flip()
```
Still all proper nouns! These words are, as measured by tf-idf, the most important to each novel and most readers would likely agree.
## A corpus of physics texts
Let's work with another corpus of documents, to see what terms are important in a different set of works. In fact, let's leave the world of fiction and narrative entirely. Let's download some classic physics texts from Project Gutenberg and see what terms are important in these works, as measured by tf-idf. Let's download [*Discourse on Floating Bodies* by Galileo Galilei](http://www.gutenberg.org/ebooks/37729), [*Treatise on Light* by Christiaan Huygens](http://www.gutenberg.org/ebooks/14725), [*Experiments with Alternate Currents of High Potential and High Frequency* by Nikola Tesla](http://www.gutenberg.org/ebooks/13476), and [*Relativity: The Special and General Theory* by Albert Einstein](http://www.gutenberg.org/ebooks/5001).
This is a pretty diverse bunch. They may all be physics classics, but they were written across a 300-year timespan, and some of them were first written in other languages and then translated to English. Perfectly homogeneous these are not, but that doesn't stop this from being an interesting exercise!
```{r eval = FALSE}
library(gutenbergr)
physics <- gutenberg_download(c(37729, 14725, 13476, 5001),
meta_fields = "author")
```
```{r physics, echo = FALSE}
load("data/physics.rda")
```
```{r physics_words, dependson = "physics"}
physics_words <- physics %>%
unnest_tokens(word, text) %>%
count(author, word, sort = TRUE) %>%
ungroup()
physics_words
```
Here we see just the raw counts, and of course these documents are all different lengths. Let's go ahead and calculate tf-idf.
```{r plot_physics, dependson = "physics_words", fig.width=9, fig.height=6}
physics_words <- physics_words %>%
bind_tf_idf(word, author, n)
plot_physics <- physics_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
mutate(author = factor(author, levels = c("Galilei, Galileo",
"Huygens, Christiaan",
"Tesla, Nikola",
"Einstein, Albert")))
ggplot(plot_physics[1:20,], aes(word, tf_idf, fill = author)) +
geom_bar(alpha = 0.8, stat = "identity") +
labs(title = "Highest tf-idf words in Classic Physics Texts",
x = NULL, y = "tf-idf") +
coord_flip()
```
Nice! Let's look at each text individually.
```{r physics_separate, dependson = "plot_physics", fig.height=7, fig.width=8}
plot_physics <- plot_physics %>%
group_by(author) %>%
top_n(15, tf_idf) %>%
mutate(word = reorder(word, tf_idf))
ggplot(plot_physics, aes(word, tf_idf, fill = author)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
labs(title = "Highest tf-idf words in Classic Physics Texts",
x = NULL, y = "tf-idf") +
facet_wrap(~author, ncol = 2, scales = "free") +
coord_flip()
```
Very interesting indeed. One thing we see here is "gif" in the Einstein text?!
```{r dependson = "physics"}
grep("eq\\.", physics$text, value = TRUE)[1:10]
```
Some cleaning up of the text might be in order. "K1" is the name of a coordinate system for Einstein:
```{r dependson = "physics"}
grep("K1", physics$text, value = TRUE)[1]
```
Maybe it makes sense to keep this one. Also notice that in this line we have "co-ordinate", which explains why there are separate "co" and "ordinate" items in the high tf-idf words for the Einstein text.
"AB", "RC", and so forth are names of rays, circles, angles, and so forth for Huygens.
```{r dependson = "physics"}
grep("AK", physics$text, value = TRUE)[1]
```
Let's remove some of these less meaningful words to make a better, more meaningful plot. Notice that we make a custom list of stop words and use `anti_join` to remove them.
```{r mystopwords, dependson = "plot_physics", fig.height=7, fig.width=8}
mystopwords <- data_frame(word = c("eq", "co", "rc", "ac", "ak", "bn",
"fig", "file", "cg", "cb", "cm"))
physics_words <- anti_join(physics_words, mystopwords, by = "word")
plot_physics <- physics_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(author) %>%
top_n(15, tf_idf) %>%
ungroup %>%
mutate(author = factor(author, levels = c("Galilei, Galileo",
"Huygens, Christiaan",
"Tesla, Nikola",
"Einstein, Albert")))
ggplot(plot_physics, aes(word, tf_idf, fill = author)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
labs(title = "Highest tf-idf words in Classic Physics Texts",
x = NULL, y = "tf-idf") +
facet_wrap(~author, ncol = 2, scales = "free") +
coord_flip()
```
We don't hear enough about ramparts or things being ethereal in physics today.