---
title: "Preparation and Descriptive Analysis of Texts"
author: "Kenneth Benoit and Paul Nulty"
date: "2nd December 2015"
output: html_document
---
Here we will step through the basic elements of preparing a text for analysis. These are tokenization, conversion to lower case, stemming, removing or selecting features, and defining equivalency classes for features, including the use of dictionaries.
### 1. Tokenization
Tokenization in quanteda is very *conservative*: by default, it only removes separator characters.
```{r}
require(quanteda, quietly = TRUE, warn.conflicts = FALSE)
txt <- c(text1="This is $10 in 999 different ways,\n up and down; left and right!",
         text2="@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokenize(txt)
tokenize(txt, verbose=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=TRUE)
tokenize(txt, removeNumbers=FALSE, removePunct=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE, removeSeparators=FALSE)
```
There are several options to the `what` argument:
```{r}
# sentence level
tokenize(c("Kurt Vongeut said; only assholes use semi-colons.",
"Today is Thursday in Canberra: It is yesterday in London.",
"Today is Thursday in Canberra: \nIt is yesterday in London.",
"To be? Or\not to be?"),
what = "sentence")
tokenize(inaugTexts[2], what = "sentence")
# character level
tokenize("My big fat text package.", what="character")
tokenize("My big fat text package.", what="character", removeSeparators=FALSE)
```
If performance is a key issue, two other options provide really fast and simple tokenization: `"fastestword"` and `"fasterword"`. These are less intelligent than the boundary detection used in the default `"word"` method, which is based on **stringi**/ICU boundary detection.
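For example, both of the fast options essentially just split on whitespace, so punctuation stays attached to adjacent tokens (a quick sketch, reusing `txt` from above):
```{r}
# whitespace-only splitting: note the punctuation left attached to tokens
tokenize(txt, what = "fasterword")
tokenize(txt, what = "fastestword")
```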
### 2. Conversion to lower case
This is a tricky step in our workflow, since it is a form of equivalency declaration rather than a tokenization step. It simply turns out to be more efficient to perform lower-casing before tokenization.
As a result, the method `toLower()` is defined for many classes of quanteda objects.
```{r}
methods(toLower)
```
We include options designed to preserve acronyms.
```{r}
test1 <- c(text1 = "England and France are members of NATO and UNESCO",
           text2 = "NASA sent a rocket into space.")
toLower(test1)
toLower(test1, keepAcronyms = TRUE)
test2 <- tokenize(test1, removePunct=TRUE)
toLower(test2)
toLower(test2, keepAcronyms = TRUE)
```
`toLower()` is based on **stringi**, and is therefore Unicode compliant.
```{r}
# Russian
cat(iconv(encodedTexts[8], "windows-1251", "UTF-8"))
cat(toLower(iconv(encodedTexts[8], "windows-1251", "UTF-8")))
head(toLower(stopwords("russian")), 20)
# Arabic
cat(iconv(encodedTexts[6], "ISO-8859-6", "UTF-8"))
cat(toLower(iconv(encodedTexts[6], "ISO-8859-6", "UTF-8")))
head(toLower(stopwords("arabic")), 20)
```
**Note**: `dfm()`, the Swiss Army knife, converts to lower case by default, but this can be turned off using the `toLower = FALSE` argument.
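A minimal illustration, reusing `test1` from above:
```{r}
# by default dfm() lower-cases features; toLower = FALSE preserves case
features(dfm(test1, verbose = FALSE))
features(dfm(test1, toLower = FALSE, verbose = FALSE))
```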
### 3. Removing and selecting features
This can be done when creating a dfm:
```{r}
# with English stopwords and stemming
dfmsInaug2 <- dfm(subset(inaugCorpus, Year > 1980),
                  ignoredFeatures = stopwords("english"), stem = TRUE)
```
Or it can be done **after** creating a dfm:
```{r}
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?"),
toLower = FALSE, verbose = FALSE)
selectFeatures(myDfm, c("s$", ".y"), "keep", valuetype = "regex")
selectFeatures(myDfm, c("s$", ".y"), "remove", valuetype = "regex")
selectFeatures(myDfm, stopwords("english"), "keep", valuetype = "fixed")
selectFeatures(myDfm, stopwords("english"), "remove", valuetype = "fixed")
```
More examples:
```{r}
# removing stopwords
testText <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with
the newspaper from a boy named Seamus, in his mouth."
testCorpus <- corpus(testText)
# note: "also" is not in the default stopwords("english")
features(dfm(testCorpus, ignoredFeatures = stopwords("english")))
# for ngrams
features(dfm(testCorpus, ngrams = 2, ignoredFeatures = stopwords("english")))
features(dfm(testCorpus, ngrams = 1:2, ignoredFeatures = stopwords("english")))
## removing stopwords before constructing ngrams
tokensAll <- tokenize(toLower(testText), removePunct = TRUE)
tokensNoStopwords <- removeFeatures(tokensAll, stopwords("english"))
tokensNgramsNoStopwords <- ngrams(tokensNoStopwords, 2)
features(dfm(tokensNgramsNoStopwords, ngrams = 1:2))
# keep only certain words
dfm(testCorpus, keptFeatures = "*s", verbose = FALSE) # keep only words ending in "s"
dfm(testCorpus, keptFeatures = "s$", valuetype = "regex", verbose = FALSE)
# testing Twitter functions
testTweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
                "2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
                "Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(testTweets, keptFeatures = "#*", removeTwitter = FALSE) # keep only hashtags
dfm(testTweets, keptFeatures = "^#.*$", valuetype = "regex", removeTwitter = FALSE)
```
One very nice feature, recently added, is the ability to create a new dfm with the same feature set as an existing one. This is very useful if, for instance, we train a model on one dfm and need to predict on counts from another: the two feature sets must be equivalent.
```{r}
# selecting on a dfm
textVec1 <- c("This is text one.", "This, the second text.", "Here: the third text.")
textVec2 <- c("Here are new words.", "New words in this text.")
features(dfm1 <- dfm(textVec1))
features(dfm2a <- dfm(textVec2))
(dfm2b <- selectFeatures(dfm2a, dfm1))
identical(features(dfm1), features(dfm2b))
```
### 4. Applying equivalency classes: dictionaries, thesauruses
Dictionary creation is done through the `dictionary()` function, which turns a named list of character patterns into a dictionary object.
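For example, a simple hand-made dictionary (the keys and patterns here are purely illustrative):
```{r}
# a two-key dictionary built directly from a named list
posneg <- dictionary(list(positive = c("good", "great", "happy"),
                          negative = c("bad", "awful", "sad")))
posneg
```
Dictionaries can also be imported from the formats used by commercial packages such as WordStat and LIWC: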
```{r}
# import the Laver-Garry dictionary from http://bit.ly/1FH2nvf
lgdict <- dictionary(file = "http://www.kenbenoit.net/courses/essex2014qta/LaverGarry.cat",
                     format = "wordstat")
dfm(inaugTexts, dictionary=lgdict)
# import a LIWC formatted dictionary
liwcdict <- dictionary(file = "http://www.kenbenoit.net/files/LIWC2001_English.dic",
                       format = "LIWC")
dfm(inaugTexts, dictionary=liwcdict)
```
We apply dictionaries to a dfm using the `applyDictionary()` function. Through the `valuetype` argument, we can match patterns of one of three types: `"glob"`, `"regex"`, or `"fixed"`.
```{r}
myDict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxglob = "tax*",
                          taxregex = "tax.+$",
                          country = c("United_States", "Sweden")))
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             ignoredFeatures = stopwords("english"), verbose = FALSE)
myDfm
# glob format
applyDictionary(myDfm, myDict, valuetype = "glob")
applyDictionary(myDfm, myDict, valuetype = "glob", case_insensitive = FALSE)
# regex v. glob format: note that "united_states" is a regex match for "tax*"
applyDictionary(myDfm, myDict, valuetype = "glob")
applyDictionary(myDfm, myDict, valuetype = "regex", case_insensitive = TRUE)
# fixed format: no pattern matching
applyDictionary(myDfm, myDict, valuetype = "fixed")
applyDictionary(myDfm, myDict, valuetype = "fixed", case_insensitive = FALSE)
```
It is also possible to pass through a dictionary at the time of `dfm()` creation.
```{r}
# dfm with dictionaries
mycorpus <- subset(inaugCorpus, Year>1900)
mydict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxing = "taxing",
                          taxation = "taxation",
                          taxregex = "tax*",
                          country = "united states"))
dictDfm <- dfm(mycorpus, dictionary=mydict)
head(dictDfm)
```
Finally, there is a related "thesaurus" feature, which collapses terms under dictionary keys but, unlike a dictionary, is not exclusive: features not matched by any key are retained in the dfm.
```{r}
mytexts <- c("British English tokenises differently, with more colour.",
             "American English tokenizes color as one word.")
mydict <- dictionary(list(color = "colo*r", tokenize = "tokeni?e*"))
dfm(mytexts, thesaurus = mydict)
```
### 5. Stemming
Stemming relies on the `SnowballC` package's implementation of the Snowball (Porter) stemmers, and is available for the following languages:
```{r}
SnowballC::getStemLanguages()
```
It's not perfect:
```{r}
wordstem(c("win", "winning", "wins", "won", "winner"))
```
but it's fast.
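A rough sketch of the speed claim (timings will vary by machine; the input is just an arbitrary repeated vector):
```{r}
# stem 300,000 tokens; this typically completes in well under a second
system.time(stemmed <- wordstem(rep(c("running", "jumps", "quickly"), 1e5)))
head(stemmed)
```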
Objects to be stemmed must be tokenized, but can be of many different quanteda classes:
```{r}
methods(wordstem)
wordstem(tokenize("This is a winning package, of many packages."))
head(wordstem(dfm(inaugTexts[1:2], verbose = FALSE)))
# same as
head(dfm(inaugTexts[1:2], stem = TRUE, verbose = FALSE))
```
### 6. `dfm()` and its many options
It operates on `character` vectors, `corpus` objects, or `tokenizedText` objects:
```{r, eval=FALSE}
## S3 method for class 'character'
dfm(x, verbose = TRUE, toLower = TRUE,
    removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
    removeTwitter = FALSE, stem = FALSE, ignoredFeatures = NULL,
    keptFeatures = NULL, matrixType = c("sparse", "dense"),
    language = "english", thesaurus = NULL, dictionary = NULL,
    valuetype = c("glob", "regex", "fixed"), dictionary_regex = FALSE, ...)
```
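Many of these options can be combined in a single call; a quick illustration (the argument choices here are arbitrary):
```{r}
# stem, drop stopwords, and keep numbers, all in one step
dfm(inaugTexts[1:2], stem = TRUE, ignoredFeatures = stopwords("english"),
    removeNumbers = FALSE, verbose = FALSE)
```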
### 7. Descriptive statistics

quanteda has a number of descriptive statistics available for reporting on texts. The **simplest of these** is through the `summary()` method:
```{r}
require(quanteda)
txt <- c(sent1 = "This is an example of the summary method for character objects.",
         sent2 = "The cat in the hat swung the bat.")
summary(txt)
```
This also works for corpus objects:
```{r}
summary(corpus(ukimmigTexts, notes = "Created as a demo."))
```
To access the **syllables** of a text, we use `syllables()`:
```{r}
syllables(c("Superman.", "supercalifragilisticexpialidocious", "The cat in the hat."))
```
We can even compute the **Scrabble value** of English words, using `scrabble()`:
```{r}
scrabble(c("cat", "quixotry", "zoo"))
```
We can analyze the **lexical diversity** of texts, using `lexdiv()` on a dfm:
```{r}
myDfm <- dfm(subset(inaugCorpus, Year > 1980), verbose = FALSE)
lexdiv(myDfm, "R")
dotchart(sort(lexdiv(myDfm, "R")))
```
We can analyze the **readability** of texts, using `readability()` on a vector of texts or a corpus:
```{r}
readab <- readability(subset(inaugCorpus, Year > 1980), measure = "Flesch.Kincaid")
dotchart(sort(readab))
```
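As noted, `readability()` also works directly on a plain character vector (a quick illustration; the example sentences are made up):
```{r}
# the same measure computed on a character vector rather than a corpus
readability(c(simple = "The cat sat on the mat.",
              complex = "Institutional accountability necessitates comprehensive organisational transparency."),
            measure = "Flesch.Kincaid")
```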
We can **identify documents and terms that are similar to one another**, using `similarity()`:
```{r}
## Presidential Inaugural Address Corpus
presDfm <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"))
# compute some document similarities
similarity(presDfm, "1985-Reagan", n = 5, margin = "documents")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents", method = "cosine")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents", method = "Hellinger")
similarity(presDfm, c("2009-Obama", "2013-Obama"), n = 5, margin = "documents", method = "eJaccard")
# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), method = "cosine")
```
And this can be used for **clustering documents**:
```{r, fig.height=6, fig.width=10}
data(SOTUCorpus, package="quantedaData")
presDfm <- dfm(subset(SOTUCorpus, as.numeric(year) > 1981), verbose = FALSE, stem = TRUE,
               ignoredFeatures = stopwords("english"))
presDfm <- trim(presDfm, minCount=5, minDoc=3)
# hierarchical clustering - get distances on normalized dfm
presDistMat <- dist(as.matrix(weight(presDfm, "relFreq")))
# hierarchical clustering on the distance object
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster)
```
Or we could look at **term clustering** instead:
```{r, fig.height=8, fig.width=12}
# word dendrogram with tf-idf weighting
wordDfm <- sort(weight(presDfm, "tfidf"))
wordDfm <- t(wordDfm)[1:100, ]  # features are rows after transposing; keep the top 100
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab="", main="tf-idf Frequency weighting")
```
Finally, there are a number of helper functions to extract information from quanteda objects:
```{r}
myCorpus <- subset(inaugCorpus, Year > 1980)
# return the number of documents
ndoc(myCorpus)
ndoc(dfm(myCorpus, verbose = FALSE))
# how many tokens (total words)
ntoken(myCorpus)
ntoken("How many words in this sentence?")
# arguments to tokenize can be passed
ntoken("How many words in this sentence?", removePunct = TRUE)
# how many types (unique words)
ntype(myCorpus)
ntype("Yada yada yada. (TADA.)")
ntype("Yada yada yada. (TADA.)", removePunct = TRUE)
ntype(toLower("Yada yada yada. (TADA.)"), removePunct = TRUE)
# can count documents and features
ndoc(inaugCorpus)
myDfm1 <- dfm(inaugCorpus, verbose = FALSE)
ndoc(myDfm1)
nfeature(myDfm1)
myDfm2 <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
nfeature(myDfm2)
# can extract feature labels and document names
head(features(myDfm1), 20)
head(docnames(myDfm1))
# and topfeatures
topfeatures(myDfm1)
topfeatures(myDfm2) # without stopwords
```