---
title: "Part 1: Getting Started and Basic Text Analysis"
author: "Kenneth Benoit and Paul Nulty"
date: "2nd December 2015"
output: html_document
---
#### Preliminaries: Installation
First, you need to have **quanteda** installed. You can do this from inside RStudio, from the Tools > Install Packages... menu, or simply by running:
```{r, eval = FALSE}
install.packages("quanteda", dependencies = TRUE)
```
(Optional) You can install some additional corpus data from **quantedaData** using
```{r, eval=FALSE}
## the devtools package is required to install quantedaData from GitHub
devtools::install_github("kbenoit/quantedaData")
```
Note that on **Windows platforms**, it is also recommended that you install the [RTools suite](https://cran.r-project.org/bin/windows/Rtools/), and for **OS X**, that you install [XCode](https://itunes.apple.com/gb/app/xcode/id497799835?mt=12) from the App Store.
#### Test your setup
Run the rest of this file to test your setup. You must have quanteda installed in order for this next step to succeed.
```{r}
require(quanteda)
```
Now summarize some texts in the Irish 2010 budget speech corpus:
```{r}
summary(ie2010Corpus)
```
Create a document-feature matrix from this corpus, removing stop words:
```{r}
ieDfm <- dfm(ie2010Corpus, ignoredFeatures = c(stopwords("english"), "will"), stem = TRUE)
```
Look at the top occurring features:
```{r}
topfeatures(ieDfm)
```
Make a word cloud:
```{r, fig.width=8, fig.height=8}
plot(ieDfm, min.freq=25, random.order=FALSE)
```
If you got this far, congratulations!
### Three ways to create a `corpus` object
**quanteda can construct a `corpus` object** from three types of input sources:
1. a character vector object
```{r}
require(quanteda)
myTinyCorpus <- corpus(inaugTexts[1:2], notes = "Just G.W.")
summary(myTinyCorpus)
```
2. a `VCorpus` object from the **tm** package, and
```{r}
require(tm)
# "crude" is a VCorpus of 20 Reuters articles supplied with the tm package
data(crude, package = "tm")
myTmCorpus <- corpus(crude)
summary(myTmCorpus, 5)
detach("package:tm")
```
3. a `corpusSource` object, created by `textfile()`.
In most cases you will need to load input files from outside of R, so you will use this third method. The remainder of this tutorial focuses on `textfile()`, which is designed to be a simple, powerful, and all-purpose method to load texts.
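As a minimal illustration of this third method, the sketch below loads a single plain text file into a one-document corpus; the file path is an assumed example, so substitute the path of a file that exists on your system.
```{r eval=FALSE}
# a minimal sketch: read one plain text file and build a one-document corpus
# (the path below is an assumed example; point it at a file you actually have)
oneFile <- textfile("inaugural/1789-Washington.txt")
oneDocCorpus <- corpus(oneFile)
summary(oneDocCorpus)
```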
### Using `textfile()` to import texts
In the simplest case, we would like to load a set of plain text files from a single directory. To do this, we use `textfile()`, with the glob wildcard `*` to indicate that we want to load multiple files:
```{r message=FALSE}
# load all .txt files from each directory into its own corpus
inaugFilesCorpus <- corpus(textfile(file = 'inaugural/*.txt'))
sotuFilesCorpus <- corpus(textfile(file = 'sotu/*.txt'))
```
Often, we have metadata encoded in the names of the files. For example, the inaugural addresses contain the year and the president's name in the name of the file. With the `docvarsfrom` argument, we can instruct the `textfile` command to consider these elements as document variables.
```{r}
mytf <- textfile("inaugural/*.txt", docvarsfrom = "filenames",
                 dvsep = "-", docvarnames = c("Year", "President"))
inaugCorpus <- corpus(mytf)
summary(inaugCorpus, 5)
```
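To confirm that the filename components were attached, we can inspect the document variables directly with `docvars()`:
```{r}
# the Year and President fields parsed from the filenames are now docvars
head(docvars(inaugCorpus))
```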
If the texts and document variables are stored separately, we can easily add document variables to the corpus, as long as the data frame containing them has one row per document, in the same order as the texts:
```{r}
# read the metadata and convert the date and categorical fields
SOTUdocvars <- read.csv("SOTU_metadata.csv", stringsAsFactors = FALSE)
SOTUdocvars$Date <- as.Date(SOTUdocvars$Date, "%B %d, %Y")
SOTUdocvars$delivery <- as.factor(SOTUdocvars$delivery)
SOTUdocvars$type <- as.factor(SOTUdocvars$type)
SOTUdocvars$party <- as.factor(SOTUdocvars$party)
SOTUdocvars$nwords <- NULL  # drop the word-count column, which we do not need

# attach the document variables to the corpus
sotuCorpus <- corpus(textfile(file = 'sotu/*.txt'), encodingFrom = "UTF-8-BOM")
docvars(sotuCorpus) <- SOTUdocvars
```
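A quick check that the variables were attached as expected:
```{r}
# the first few documents, now with their attached metadata
summary(sotuCorpus, 5)
```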
Another common case is that our texts are stored alongside the document variables in a structured file, such as a JSON, CSV, or Excel file. `textfile()` can read in the texts and document variables simultaneously from these files when the name of the field containing the texts is specified.
```{r}
# the texts are in the CSV field named "inaugSpeech"; other columns become docvars
tf1 <- textfile(file = 'inaugTexts.csv', textField = 'inaugSpeech')
inaugCsvCorpus <- corpus(tf1)

tf2 <- textfile("text_example.csv", textField = "Title")
exampleCorpus <- corpus(tf2)
head(docvars(exampleCorpus))
```
Once we have loaded a corpus with some document-level variables, we can subset the corpus using these variables, create document-feature matrices by aggregating on the variables, or extract the texts concatenated by variable (a sketch of aggregating on a variable follows the next chunk).
```{r}
recentCorpus <- subset(inaugCorpus, Year > 1980)
oldCorpus <- subset(inaugCorpus, Year < 1880)

require(dplyr)  # loaded for the %>% pipe operator
demCorpus <- subset(sotuCorpus, party == 'Democratic')
demFeatures <- dfm(demCorpus, ignoredFeatures = stopwords('english')) %>%
    trim(minDoc = 3, minCount = 5) %>% weight(type = 'tfidf') %>% topfeatures

repCorpus <- subset(sotuCorpus, party == 'Republican')
repFeatures <- dfm(repCorpus, ignoredFeatures = stopwords('english')) %>%
    trim(minDoc = 3, minCount = 5) %>% weight(type = 'tfidf') %>% topfeatures
```
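The chunk above subsets by party and builds a separate dfm for each subset. Here is a sketch of the aggregation approach mentioned earlier, assuming the `groups` argument of `dfm()` is available in your installed quanteda version:
```{r eval=FALSE}
# aggregate documents by the "party" docvar while forming a single dfm
partyDfm <- dfm(sotuCorpus, groups = "party",
                ignoredFeatures = stopwords("english"))
topfeatures(partyDfm)
```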
Corpus objects in **quanteda** can be combined using the `+` operator:
```{r}
demRepCorpus <- demCorpus + repCorpus
allFeatures <- dfm(demRepCorpus, ignoredFeatures = stopwords('english')) %>%
    trim(minDoc = 3, minCount = 5) %>% weight(type = 'tfidf') %>% topfeatures
```
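One way to verify the combination is to compare document counts with `ndoc()`:
```{r}
# the combined corpus should contain the documents of both inputs
ndoc(demRepCorpus) == ndoc(demCorpus) + ndoc(repCorpus)
```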
It should also be possible to load a zip file containing texts directly from a URL. Whether this operation succeeds can depend on the access permission settings on your particular system (for example, it may fail on Windows):
```{r eval=FALSE}
immigfiles <- textfile("https://github.com/kbenoit/ME114/raw/master/day8/UKimmigTexts.zip")
mycorpus <- corpus(immigfiles)
summary(mycorpus)
```