# Exercise 4: Scaling techniques
## Introduction
The hands-on exercise for this week focuses on: 1) scaling texts; 2) implementing scaling techniques using `quanteda`.
In this tutorial, you will learn how to:
* Scale texts using the "wordfish" algorithm
* Scale texts gathered from online sources
* Replicate analyses by @kaneko_estimating_2021
Before proceeding, we'll load the packages we will need for this tutorial.
```{r, echo=F}
library(kableExtra)
```
```{r, message=F}
library(dplyr)
library(quanteda) # includes functions to implement Lexicoder
library(quanteda.textmodels) # for estimating similarity and complexity measures
library(quanteda.textplots) # for visualizing text modelling results
```
In this exercise we'll be using the dataset we used for the sentiment analysis exercise. The data were collected from the Twitter accounts of the top eight newspapers in the UK by circulation. The tweets include any tweets by the news outlet from their main account.
## Importing data
We can read in the dataset with:
```{r}
tweets <- readRDS("data/sentanalysis/newstweets.rds")
```
If you're working on this document from your own computer ("locally") you can download the tweets data in the following way:
```{r, eval = F}
tweets <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/sentanalysis/newstweets.rds?raw=true")))
```
We first take a sample from these data to speed up the runtime of some of the analyses.
```{r}
tweets <- tweets %>%
sample_n(20000)
```
## Construct `dfm` object
Then, as in the previous exercise, we create a corpus object, specify the document-level variables by which we want to group, and generate our document feature matrix.
```{r, eval = F}
# make corpus object, specifying tweet as text field
tweets_corpus <- corpus(tweets, text_field = "text")

# add in username document-level information
docvars(tweets_corpus, "newspaper") <- tweets$user_username

# tokenize, removing punctuation, then remove stopwords and construct the dfm
dfm_tweets <- dfm(tokens(tweets_corpus, remove_punct = TRUE) %>%
                    tokens_remove(stopwords("english")))
```
```{r, echo = F}
dfm_tweets <- readRDS("data/wordscaling/dfm_tweets.rds")
```
We can then have a look at the number of documents (tweets) we have per newspaper Twitter account.
```{r}
## number of tweets per newspaper
table(docvars(dfm_tweets, "newspaper"))
```
And this is what our document-feature matrix looks like at this stage, with a count for each word in each tweet; we aggregate these counts to the newspaper level in the next step.
```{r}
dfm_tweets
```
## Estimate wordfish model
Once we have our data in this format, we are able to group and trim the document feature matrix before estimating the wordfish model.
```{r}
# compress the document-feature matrix at the newspaper level
dfm_newstweets <- dfm_group(dfm_tweets, groups = newspaper)
# remove words not used by two or more newspapers
dfm_newstweets <- dfm_trim(dfm_newstweets,
min_docfreq = 2, docfreq_type = "count")
## size of the document-feature matrix
dim(dfm_newstweets)
#### estimate the wordfish model ####
set.seed(123L)
dfm_newstweets_results <- textmodel_wordfish(dfm_newstweets)
```
And here are the results.
```{r}
summary(dfm_newstweets_results)
```
We can then plot our estimates of the $\theta$s, i.e., the estimates of the latent newspaper positions, like so.
```{r}
textplot_scale1d(dfm_newstweets_results)
```
Interestingly, we seem not to have captured ideology but some other tonal dimension. The tabloid newspapers are scored similarly and grouped toward the right-hand side of this latent dimension, whereas the broadsheet newspapers have estimated thetas further to the left.
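To see this ordering numerically rather than graphically, we can pull the document-level estimates out of the fitted object. This is a minimal sketch assuming the fitted wordfish object stores the document labels and positions in its `docs` and `theta` elements (mirroring how the word-level results are stored in `features` and `beta`, which we use below).
```{r}
# extract the estimated newspaper positions from the fitted wordfish model
doc_thetas <- data.frame(
  newspaper = dfm_newstweets_results[["docs"]],
  theta = dfm_newstweets_results[["theta"]]
)

# order newspapers along the latent dimension
doc_thetas %>%
  arrange(theta)
```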
Plotting the "features," i.e., the word-level betas, shows how words are positioned along this dimension and which words help discriminate between news outlets.
```{r}
textplot_scale1d(dfm_newstweets_results, margin = "features")
```
And we can also look at these features.
```{r}
features <- dfm_newstweets_results[["features"]]
betas <- dfm_newstweets_results[["beta"]]

# combine features and betas into a data frame, keeping betas numeric
feat_betas <- data.frame(features = features, betas = betas)

feat_betas %>%
  arrange(desc(betas)) %>%
  top_n(20, betas) %>%
  kbl() %>%
  kable_styling(bootstrap_options = "striped")
```
These words do seem to belong to more tabloid-style reportage: they include emojis relating to film, sports reporting on "cristiano," as well as more colloquial terms like "saucy."
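For comparison, we can inspect the other end of the dimension by sorting the betas in ascending order. This reuses the `feat_betas` data frame constructed above.
```{r}
# words with the lowest betas, i.e. at the opposite end of the latent dimension
feat_betas %>%
  arrange(betas) %>%
  top_n(-20, betas) %>%
  kbl() %>%
  kable_styling(bootstrap_options = "striped")
```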
## Replicating Kaneko et al.
This section adapts code from the replication data provided for @kaneko_estimating_2021 [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EL3KYD). We can access data from the first study by @kaneko_estimating_2021 in the following way.
```{r}
kaneko_dfm <- readRDS("data/wordscaling/study1_kaneko.rds")
```
If you're working locally, you can download the `dfm` data with:
```{r, eval = F}
kaneko_dfm <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/wordscaling/study1_kaneko.rds?raw=true")))
```
These data are in the form of a document-feature matrix. We can first manipulate them in the same way as @kaneko_estimating_2021 by grouping at the level of newspaper and removing infrequent words.
```{r}
table(docvars(kaneko_dfm, "Newspaper"))
## prepare the newspaper-level document-feature matrix
# compress the document-feature matrix at the newspaper level
kaneko_dfm_study1 <- dfm_group(kaneko_dfm, groups = Newspaper)
# remove words not used by two or more newspapers
kaneko_dfm_study1 <- dfm_trim(kaneko_dfm_study1, min_docfreq = 2, docfreq_type = "count")
## size of the document-feature matrix
dim(kaneko_dfm_study1)
```
## Exercises
1. Estimate a wordfish model for the @kaneko_estimating_2021 data
2. Visualize the results
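As a starting point, the workflow mirrors the newspaper tweets example above. A minimal sketch, assuming the grouped and trimmed `kaneko_dfm_study1` object from the previous chunk, might look like this:
```{r, eval = F}
# estimate a wordfish model on the newspaper-level Kaneko et al. dfm
set.seed(123L)
kaneko_results <- textmodel_wordfish(kaneko_dfm_study1)

# plot the estimated document (newspaper) positions
textplot_scale1d(kaneko_results)

# plot the word-level betas
textplot_scale1d(kaneko_results, margin = "features")
```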