---
title: "Part 1: Getting Started and Basic Text Analysis"
author: "Kenneth Benoit and Paul Nulty"
date: "2nd December 2015"
output: html_document
---
#### Preliminaries: Installation
First, you need to have **quanteda** installed. You can do this from inside RStudio, from the Tools > Install Packages... menu, or simply by running:
```{r, eval = FALSE}
install.packages("quanteda", dependencies = TRUE)
```
(Optional) You can install some additional corpus data from **quantedaData** using
```{r, eval=FALSE}
## the devtools package is required to install quantedaData from GitHub
devtools::install_github("kbenoit/quantedaData")
```
Note that on **Windows platforms**, it is also recommended that you install the [RTools suite](https://cran.r-project.org/bin/windows/Rtools/), and for **OS X**, that you install [XCode](https://itunes.apple.com/gb/app/xcode/id497799835?mt=12) from the App Store.
#### Test your setup
Run the rest of this file to test your setup. You must have quanteda installed in order for this next step to succeed.
```{r}
require(quanteda)
```
Now summarize some texts in the Irish 2010 budget speech corpus:
```{r}
summary(ie2010Corpus)
```
Create a document-feature matrix from this corpus, removing stop words:
```{r}
ieDfm <- dfm(ie2010Corpus, ignoredFeatures = c(stopwords("english"), "will"), stem = TRUE)
```
Look at the top occurring features:
```{r}
topfeatures(ieDfm)
```
Make a word cloud:
```{r, fig.width=8, fig.height=8}
plot(ieDfm, min.freq=25, random.order=FALSE)
```
If you got this far, congratulations!
### Three ways to create a `corpus` object
**quanteda can construct a `corpus` object** from three types of input sources:
1. a character vector object
```{r}
require(quanteda)
myTinyCorpus <- corpus(inaugTexts[1:2], notes = "Just G.W.")
summary(myTinyCorpus)
```
2. a `VCorpus` object from the **tm** package, and
```{r}
require(tm)
# "crude" is a VCorpus of 20 Reuters articles supplied with the tm package
data(crude, package = "tm")
myTmCorpus <- corpus(crude)
summary(myTmCorpus, 5)
detach("package:tm")
```
3. a `corpusSource` object, created by `textfile()`.
In most cases you will need to load input files from outside of R, so you will use this third method. The remainder of this tutorial focuses on `textfile()`, which is designed to be a simple, powerful, and all-purpose method to load texts.
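As a minimal illustration of this third method, the sketch below loads a single plain text file into a one-document corpus; the file path is an assumed example, so substitute the path of a file that exists on your system.
```{r eval=FALSE}
# a minimal sketch: read one plain text file and build a one-document corpus
# (the path below is an assumed example; point it at a file you actually have)
oneFile <- textfile("inaugural/1789-Washington.txt")
oneDocCorpus <- corpus(oneFile)
summary(oneDocCorpus)
```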
### Using `textfile()` to import texts
In the simplest case, we would like to load a set of plain text files from a single directory. To do this, we use `textfile()`, with the glob wildcard `*` to indicate that we want to load multiple files:
```{r message=FALSE}
# load all .txt files from each directory into its own corpus
inaugFilesCorpus <- corpus(textfile(file = 'inaugural/*.txt'))
sotuFilesCorpus <- corpus(textfile(file = 'sotu/*.txt'))
```
Often, we have metadata encoded in the names of the files. For example, the inaugural addresses contain the year and the president's name in the name of the file. With the `docvarsfrom` argument, we can instruct the `textfile` command to consider these elements as document variables.
```{r}
mytf <- textfile("inaugural/*.txt", docvarsfrom = "filenames",
                 dvsep = "-", docvarnames = c("Year", "President"))
inaugCorpus <- corpus(mytf)
summary(inaugCorpus, 5)
```
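To confirm that the filename components were attached, we can inspect the document variables directly with `docvars()`:
```{r}
# the Year and President fields parsed from the filenames are now docvars
head(docvars(inaugCorpus))
```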
If the texts and document variables are stored separately, we can easily add document variables to the corpus, as long as the data frame containing them has one row per document, in the same order as the texts:
```{r}
# read the metadata and convert the date and categorical fields
SOTUdocvars <- read.csv("SOTU_metadata.csv", stringsAsFactors = FALSE)
SOTUdocvars$Date <- as.Date(SOTUdocvars$Date, "%B %d, %Y")
SOTUdocvars$delivery <- as.factor(SOTUdocvars$delivery)
SOTUdocvars$type <- as.factor(SOTUdocvars$type)
SOTUdocvars$party <- as.factor(SOTUdocvars$party)
SOTUdocvars$nwords <- NULL  # drop the word-count column, which we do not need

# attach the document variables to the corpus
sotuCorpus <- corpus(textfile(file = 'sotu/*.txt'), encodingFrom = "UTF-8-BOM")
docvars(sotuCorpus) <- SOTUdocvars
```
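A quick check that the variables were attached as expected:
```{r}
# the first few documents, now with their attached metadata
summary(sotuCorpus, 5)
```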
Another common case is that our texts are stored alongside the document variables in a structured file, such as a JSON, CSV, or Excel file. `textfile()` can read in the texts and document variables simultaneously from these files when the name of the field containing the texts is specified.
```{r}
# the texts are in the CSV field named "inaugSpeech"; other columns become docvars
tf1 <- textfile(file = 'inaugTexts.csv', textField = 'inaugSpeech')
inaugCsvCorpus <- corpus(tf1)

tf2 <- textfile("text_example.csv", textField = "Title")
exampleCorpus <- corpus(tf2)
head(docvars(exampleCorpus))
```
Once we have loaded a corpus with some document-level variables, we can subset the corpus using these variables, create document-feature matrices by aggregating on the variables, or extract the texts concatenated by variable (a sketch of aggregating on a variable follows the next chunk).
```{r}
recentCorpus <- subset(inaugCorpus, Year > 1980)
oldCorpus <- subset(inaugCorpus, Year < 1880)

require(dplyr)  # loaded for the %>% pipe operator
demCorpus <- subset(sotuCorpus, party == 'Democratic')
demFeatures <- dfm(demCorpus, ignoredFeatures = stopwords('english')) %>%
    trim(minDoc = 3, minCount = 5) %>% weight(type = 'tfidf') %>% topfeatures

repCorpus <- subset(sotuCorpus, party == 'Republican')
repFeatures <- dfm(repCorpus, ignoredFeatures = stopwords('english')) %>%
    trim(minDoc = 3, minCount = 5) %>% weight(type = 'tfidf') %>% topfeatures
```
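The chunk above subsets by party and builds a separate dfm for each subset. Here is a sketch of the aggregation approach mentioned earlier, assuming the `groups` argument of `dfm()` is available in your installed quanteda version:
```{r eval=FALSE}
# aggregate documents by the "party" docvar while forming a single dfm
partyDfm <- dfm(sotuCorpus, groups = "party",
                ignoredFeatures = stopwords("english"))
topfeatures(partyDfm)
```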
Corpus objects in **quanteda** can be combined using the `+` operator:
```{r}
demRepCorpus <- demCorpus + repCorpus
allFeatures <- dfm(demRepCorpus, ignoredFeatures = stopwords('english')) %>%
    trim(minDoc = 3, minCount = 5) %>% weight(type = 'tfidf') %>% topfeatures
```
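One way to verify the combination is to compare document counts with `ndoc()`:
```{r}
# the combined corpus should contain the documents of both inputs
ndoc(demRepCorpus) == ndoc(demCorpus) + ndoc(repCorpus)
```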
It should also be possible to load a zip file containing texts directly from a URL. Whether this operation succeeds can depend on the access permission settings on your particular system (for example, it may fail on Windows):
```{r eval=FALSE}
immigfiles <- textfile("https://github.com/kbenoit/ME114/raw/master/day8/UKimmigTexts.zip")
mycorpus <- corpus(immigfiles)
summary(mycorpus)
```