---
title: "Preparation and Descriptive Analysis of Texts"
author: "Kenneth Benoit and Paul Nulty"
date: "2nd December 2015"
output: html_document
---
Here we will step through the basic elements of preparing a text for analysis. These are tokenization, conversion to lower case, stemming, removing or selecting features, and defining equivalency classes for features, including the use of dictionaries.
### 1. Tokenization
Tokenization in quanteda is very *conservative*: by default, it only removes separator characters.
```{r}
require(quanteda, quietly = TRUE, warn.conflicts = FALSE)
txt <- c(text1="This is $10 in 999 different ways,\n up and down; left and right!",
         text2="@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokenize(txt)
tokenize(txt, verbose=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=TRUE)
tokenize(txt, removeNumbers=FALSE, removePunct=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE, removeSeparators=FALSE)
```
There are several options to the `what` argument:
```{r}
# sentence level
tokenize(c("Kurt Vongeut said; only assholes use semi-colons.",
"Today is Thursday in Canberra: It is yesterday in London.",
"Today is Thursday in Canberra: \nIt is yesterday in London.",
"To be? Or\not to be?"),
what = "sentence")
tokenize(inaugTexts[2], what = "sentence")
# character level
tokenize("My big fat text package.", what="character")
tokenize("My big fat text package.", what="character", removeSeparators=FALSE)
```
If performance is a key issue, two other options provide really fast and simple tokenization: `"fastestword"` and `"fasterword"`. These are less intelligent than the boundary detection used in the default `"word"` method, which is based on **stringi**/ICU boundary detection.
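For example, both of the fast options essentially just split on whitespace, so punctuation stays attached to adjacent tokens (a quick sketch, reusing `txt` from above):
```{r}
# whitespace-only splitting: note the punctuation left attached to tokens
tokenize(txt, what = "fasterword")
tokenize(txt, what = "fastestword")
```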
### 2. Conversion to lower case
This is a tricky step in our workflow, since it is a form of equivalency declaration rather than a tokenization step. It simply turns out to be more efficient to perform lower-casing before tokenization.
As a result, the method `toLower()` is defined for many classes of quanteda objects.
```{r}
methods(toLower)
```
We include options designed to preserve acronyms.
```{r}
test1 <- c(text1 = "England and France are members of NATO and UNESCO",
           text2 = "NASA sent a rocket into space.")
toLower(test1)
toLower(test1, keepAcronyms = TRUE)
test2 <- tokenize(test1, removePunct=TRUE)
toLower(test2)
toLower(test2, keepAcronyms = TRUE)
```
`toLower()` is based on **stringi**, and is therefore Unicode compliant.
```{r}
# Russian
cat(iconv(encodedTexts[8], "windows-1251", "UTF-8"))
cat(toLower(iconv(encodedTexts[8], "windows-1251", "UTF-8")))
head(toLower(stopwords("russian")), 20)
# Arabic
cat(iconv(encodedTexts[6], "ISO-8859-6", "UTF-8"))
cat(toLower(iconv(encodedTexts[6], "ISO-8859-6", "UTF-8")))
head(toLower(stopwords("arabic")), 20)
```
**Note**: `dfm()`, the Swiss Army knife, converts to lower case by default, but this can be turned off using the `toLower = FALSE` argument.
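A minimal illustration, reusing `test1` from above:
```{r}
# by default dfm() lower-cases features; toLower = FALSE preserves case
features(dfm(test1, verbose = FALSE))
features(dfm(test1, toLower = FALSE, verbose = FALSE))
```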
### 3. Removing and selecting features
This can be done when creating a dfm:
```{r}
# with English stopwords and stemming
dfmsInaug2 <- dfm(subset(inaugCorpus, Year > 1980),
                  ignoredFeatures = stopwords("english"), stem = TRUE)
```
Or it can be done **after** creating a dfm:
```{r}
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?"),
toLower = FALSE, verbose = FALSE)
selectFeatures(myDfm, c("s$", ".y"), "keep", valuetype = "regex")
selectFeatures(myDfm, c("s$", ".y"), "remove", valuetype = "regex")
selectFeatures(myDfm, stopwords("english"), "keep", valuetype = "fixed")
selectFeatures(myDfm, stopwords("english"), "remove", valuetype = "fixed")
```
More examples:
```{r}
# removing stopwords
testText <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with
the newspaper from a boy named Seamus, in his mouth."
testCorpus <- corpus(testText)
# note: "also" is not in the default stopwords("english")
features(dfm(testCorpus, ignoredFeatures = stopwords("english")))
# for ngrams
features(dfm(testCorpus, ngrams = 2, ignoredFeatures = stopwords("english")))
features(dfm(testCorpus, ngrams = 1:2, ignoredFeatures = stopwords("english")))
## removing stopwords before constructing ngrams
tokensAll <- tokenize(toLower(testText), removePunct = TRUE)
tokensNoStopwords <- removeFeatures(tokensAll, stopwords("english"))
tokensNgramsNoStopwords <- ngrams(tokensNoStopwords, 2)
features(dfm(tokensNgramsNoStopwords, ngrams = 1:2))
# keep only certain words
dfm(testCorpus, keptFeatures = "*s", verbose = FALSE) # keep only words ending in "s"
dfm(testCorpus, keptFeatures = "s$", valuetype = "regex", verbose = FALSE)
# testing Twitter functions
testTweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
                "2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
                "Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(testTweets, keptFeatures = "#*", removeTwitter = FALSE) # keep only hashtags
dfm(testTweets, keptFeatures = "^#.*$", valuetype = "regex", removeTwitter = FALSE)
```
One very nice feature, recently added, is the ability to create a new dfm with the same feature set as an existing one. This is very useful if, for instance, we train a model on one dfm and need to predict on counts from another: the two feature sets must be equivalent.
```{r}
# selecting on a dfm
textVec1 <- c("This is text one.", "This, the second text.", "Here: the third text.")
textVec2 <- c("Here are new words.", "New words in this text.")
features(dfm1 <- dfm(textVec1))
features(dfm2a <- dfm(textVec2))
(dfm2b <- selectFeatures(dfm2a, dfm1))
identical(features(dfm1), features(dfm2b))
```
### 4. Applying equivalency classes: dictionaries, thesauruses
Dictionary creation is done through the `dictionary()` function, which turns a named list of character patterns into a dictionary object.
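For example, a simple hand-made dictionary (the keys and patterns here are purely illustrative):
```{r}
# a two-key dictionary built directly from a named list
posneg <- dictionary(list(positive = c("good", "great", "happy"),
                          negative = c("bad", "awful", "sad")))
posneg
```
Dictionaries can also be imported from the formats used by commercial packages such as WordStat and LIWC: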
```{r}
# import the Laver-Garry dictionary from http://bit.ly/1FH2nvf
lgdict <- dictionary(file = "http://www.kenbenoit.net/courses/essex2014qta/LaverGarry.cat",
                     format = "wordstat")
dfm(inaugTexts, dictionary=lgdict)
# import a LIWC formatted dictionary
liwcdict <- dictionary(file = "http://www.kenbenoit.net/files/LIWC2001_English.dic",
                       format = "LIWC")
dfm(inaugTexts, dictionary=liwcdict)
```
We apply dictionaries to a dfm using the `applyDictionary()` function. Through the `valuetype` argument, we can match patterns of one of three types: `"glob"`, `"regex"`, or `"fixed"`.
```{r}
myDict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxglob = "tax*",
                          taxregex = "tax.+$",
                          country = c("United_States", "Sweden")))
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             ignoredFeatures = stopwords("english"), verbose = FALSE)
myDfm
# glob format
applyDictionary(myDfm, myDict, valuetype = "glob")
applyDictionary(myDfm, myDict, valuetype = "glob", case_insensitive = FALSE)
# regex v. glob format: note that "united_states" is a regex match for "tax*"
applyDictionary(myDfm, myDict, valuetype = "glob")
applyDictionary(myDfm, myDict, valuetype = "regex", case_insensitive = TRUE)
# fixed format: no pattern matching
applyDictionary(myDfm, myDict, valuetype = "fixed")
applyDictionary(myDfm, myDict, valuetype = "fixed", case_insensitive = FALSE)
```
It is also possible to pass through a dictionary at the time of `dfm()` creation.
```{r}
# dfm with dictionaries
mycorpus <- subset(inaugCorpus, Year>1900)
mydict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxing = "taxing",
                          taxation = "taxation",
                          taxregex = "tax*",
                          country = "united states"))
dictDfm <- dfm(mycorpus, dictionary=mydict)
head(dictDfm)
```
Finally, there is a related "thesaurus" feature, which collapses terms under dictionary keys but, unlike a dictionary, is not exclusive: features not matched by any key are retained in the dfm.
```{r}
mytexts <- c("British English tokenises differently, with more colour.",
             "American English tokenizes color as one word.")
mydict <- dictionary(list(color = "colo*r", tokenize = "tokeni?e*"))
dfm(mytexts, thesaurus = mydict)
```
### 5. Stemming
Stemming relies on the `SnowballC` package's implementation of the Snowball (Porter) stemmers, and is available for the following languages:
```{r}
SnowballC::getStemLanguages()
```
It's not perfect:
```{r}
wordstem(c("win", "winning", "wins", "won", "winner"))
```
but it's fast.
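A rough sketch of the speed claim (timings will vary by machine; the input is just an arbitrary repeated vector):
```{r}
# stem 300,000 tokens; this typically completes in well under a second
system.time(stemmed <- wordstem(rep(c("running", "jumps", "quickly"), 1e5)))
head(stemmed)
```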
Objects to be stemmed must be tokenized, but can be of many different quanteda classes:
```{r}
methods(wordstem)
wordstem(tokenize("This is a winning package, of many packages."))
head(wordstem(dfm(inaugTexts[1:2], verbose = FALSE)))
# same as
head(dfm(inaugTexts[1:2], stem = TRUE, verbose = FALSE))
```
### 6. `dfm()` and its many options
It operates on `character` vectors, `corpus` objects, or `tokenizedText` objects:
```{r, eval=FALSE}
## S3 method for class 'character'
dfm(x, verbose = TRUE, toLower = TRUE,
    removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
    removeTwitter = FALSE, stem = FALSE, ignoredFeatures = NULL,
    keptFeatures = NULL, matrixType = c("sparse", "dense"),
    language = "english", thesaurus = NULL, dictionary = NULL,
    valuetype = c("glob", "regex", "fixed"), dictionary_regex = FALSE, ...)
```
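Many of these options can be combined in a single call; a quick illustration (the argument choices here are arbitrary):
```{r}
# stem, drop stopwords, and keep numbers, all in one step
dfm(inaugTexts[1:2], stem = TRUE, ignoredFeatures = stopwords("english"),
    removeNumbers = FALSE, verbose = FALSE)
```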
### 7. Descriptive statistics

quanteda has a number of descriptive statistics available for reporting on texts. The **simplest of these** is through the `summary()` method:
```{r}
require(quanteda)
txt <- c(sent1 = "This is an example of the summary method for character objects.",
         sent2 = "The cat in the hat swung the bat.")
summary(txt)
```
This also works for corpus objects:
```{r}
summary(corpus(ukimmigTexts, notes = "Created as a demo."))
```
To access the **syllables** of a text, we use `syllables()`:
```{r}
syllables(c("Superman.", "supercalifragilisticexpialidocious", "The cat in the hat."))
```
We can even compute the **Scrabble value** of English words, using `scrabble()`:
```{r}
scrabble(c("cat", "quixotry", "zoo"))
```
We can analyze the **lexical diversity** of texts, using `lexdiv()` on a dfm:
```{r}
myDfm <- dfm(subset(inaugCorpus, Year > 1980), verbose = FALSE)
lexdiv(myDfm, "R")
dotchart(sort(lexdiv(myDfm, "R")))
```
We can analyze the **readability** of texts, using `readability()` on a vector of texts or a corpus:
```{r}
readab <- readability(subset(inaugCorpus, Year > 1980), measure = "Flesch.Kincaid")
dotchart(sort(readab))
```
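As noted, `readability()` also works directly on a plain character vector (a quick illustration; the example sentences are made up):
```{r}
# the same measure computed on a character vector rather than a corpus
readability(c(simple = "The cat sat on the mat.",
              complex = "Institutional accountability necessitates comprehensive organisational transparency."),
            measure = "Flesch.Kincaid")
```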
We can **identify documents and terms that are similar to one another**, using `similarity()`:
```{r}
## Presidential Inaugural Address Corpus
presDfm <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"))
# compute some document similarities
similarity(presDfm, "1985-Reagan", n = 5, margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents", method = "cosine")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents", method = "Hellinger")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents", method = "eJaccard")
# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), method = "cosine")
```
And this can be used for **clustering documents**:
```{r, fig.height=6, fig.width=10}
data(SOTUCorpus, package="quantedaData")
presDfm <- dfm(subset(SOTUCorpus, as.numeric(year) > 1981), verbose = FALSE, stem = TRUE,
               ignoredFeatures = stopwords("english"))
presDfm <- trim(presDfm, minCount=5, minDoc=3)
# hierarchical clustering - get distances on normalized dfm
presDistMat <- dist(as.matrix(weight(presDfm, "relFreq")))
# hierarchical clustering on the distance object
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster)
```
Or we could look at **term clustering** instead:
```{r, fig.height=8, fig.width=12}
# word dendrogram with tf-idf weighting
wordDfm <- sort(weight(presDfm, "tfidf"))
wordDfm <- t(wordDfm)[1:100, ]  # features are rows after transposing; keep the top 100
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab="", main="tf-idf Frequency weighting")
```
Finally, there are a number of helper functions to extract information from quanteda objects:
```{r}
myCorpus <- subset(inaugCorpus, Year > 1980)
# return the number of documents
ndoc(myCorpus)
ndoc(dfm(myCorpus, verbose = FALSE))
# how many tokens (total words)
ntoken(myCorpus)
ntoken("How many words in this sentence?")
# arguments to tokenize can be passed
ntoken("How many words in this sentence?", removePunct = TRUE)
# how many types (unique words)
ntype(myCorpus)
ntype("Yada yada yada. (TADA.)")
ntype("Yada yada yada. (TADA.)", removePunct = TRUE)
ntype(toLower("Yada yada yada. (TADA.)"), removePunct = TRUE)
# can count documents and features
ndoc(inaugCorpus)
myDfm1 <- dfm(inaugCorpus, verbose = FALSE)
ndoc(myDfm1)
nfeature(myDfm1)
myDfm2 <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
nfeature(myDfm2)
# can extract feature labels and document names
head(features(myDfm1), 20)
head(docnames(myDfm1))
# and topfeatures
topfeatures(myDfm1)
topfeatures(myDfm2) # without stopwords
```