# Case study: comparing Twitter archives {#twitter}
```{r echo = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, cache = TRUE, cache.lazy = FALSE)
options(width = 100, dplyr.width = 100)
library(ggplot2)
theme_set(theme_light())
```
One type of text that has gotten a lot of attention in recent years is text shared online via Twitter. In fact, several of the sentiment lexicons used in this book (and commonly used in general) were designed for use with and validated on tweets. Both of the authors of this book are on Twitter and are fairly regular users of it, so in this case study, let's compare the entire Twitter archives of [Julia](https://twitter.com/juliasilge) and [David](https://twitter.com/drob).
## Getting the data and distribution of tweets
An individual can download their own Twitter archive by following [directions available on Twitter's website](https://support.twitter.com/articles/20170160). We each downloaded ours and will now open them up. Let's use the lubridate package to convert the string timestamps to date-time objects and initially take a look at our tweeting patterns overall.
```{r setup, fig.width=8, fig.height=6}
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
tweets_julia <- read_csv("data/tweets_julia.csv")
tweets_dave <- read_csv("data/tweets_dave.csv")
tweets <- bind_rows(tweets_julia %>% 
                      mutate(person = "Julia"),
                    tweets_dave %>% 
                      mutate(person = "David")) %>%
  mutate(timestamp = ymd_hms(timestamp))

ggplot(tweets, aes(x = timestamp, fill = person)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 20)
```
David and Julia tweet at about the same rate currently and joined Twitter about a year apart from each other, but there were about 5 years where David was not active on Twitter and Julia was. In total, Julia has about 4 times as many tweets as David.
## Word frequencies
Let's use `unnest_tokens` to make a tidy dataframe of all the words in our tweets, and remove the common English stop words. There are certain conventions in how people use text on Twitter, so we will do a bit more work with our text here than, for example, we did with the narrative text from Project Gutenberg.
First, we will remove tweets from this dataset that are retweets so that we only have tweets that we wrote ourselves. Next, the `mutate` line removes links and cleans out some characters that we don't want like ampersands and such. In the call to `unnest_tokens`, we unnest using a regex pattern, instead of just looking for single unigrams (words). This regex pattern is very useful for dealing with Twitter text; it retains hashtags and mentions of usernames with the `@` symbol. Because we have kept these types of symbols in the text, we can't use a simple `anti_join` to remove stop words. Instead, we can take the approach shown in the `filter` line that uses `str_detect` from the stringr library.
```{r tidytweets, dependson = "setup"}
library(tidytext)
library(stringr)
reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
```
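To see concretely what this tokenization choice buys us, here is a small illustration with a made-up tweet (not part of either archive): the custom regex keeps the hashtag and @-mention intact, while the default word tokenizer strips the `#` and `@` symbols.
```{r, dependson = "tidytweets"}
# A made-up example tweet, only to illustrate the two tokenizers
example_tweet <- tibble(text = "Loving #rstats today, thanks @hadley!")

# The custom regex keeps "#rstats" and "@hadley" as tokens
example_tweet %>%
  unnest_tokens(word, text, token = "regex", pattern = reg)

# The default tokenizer strips the # and @ symbols
example_tweet %>%
  unnest_tokens(word, text)
```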
Now we can calculate word frequencies for each person. First, we group by person and count how many times each person used each word. Then we use `left_join` to add a column of the total number of words used by each person. (This total is higher for Julia since she has more tweets overall.) Finally, we calculate a frequency for each person and word.
```{r frequency, dependson = "tidytweets"}
frequency <- tidy_tweets %>% 
  group_by(person) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tidy_tweets %>% 
              group_by(person) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)
frequency
```
This is a nice, tidy data frame, but we would actually like to plot those frequencies on the x- and y-axes of a plot, so we will need to use `spread` from tidyr to make a differently shaped data frame.
```{r spread, dependson = "frequency"}
library(tidyr)
frequency <- frequency %>% 
  select(person, word, freq) %>% 
  spread(person, freq) %>%
  arrange(Julia, David)
frequency
```
Now this is ready for us to plot. Let's use `geom_jitter` so that we don't see the discreteness at the low end of frequency as much.
```{r dependson = "spread", fig.height=7, fig.width=7}
library(scales)
ggplot(frequency, aes(Julia, David)) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")
```
Words near the red line in this plot are used with about equal frequencies by David and Julia, while words far away from the line are used much more by one person compared to the other. Words, hashtags, and usernames that appear in this plot are ones that we have both used at least once in tweets.
This may not even need to be pointed out, but David and Julia have used their Twitter accounts rather differently over the course of the past several years. David has used his Twitter account almost exclusively for professional purposes since he became more active, while Julia used it for entirely personal purposes until late 2015 and still uses it more personally than David. We see these differences immediately in this plot exploring word frequencies, and they will continue to be obvious in the rest of this chapter.
## Comparing word usage
We just made a plot comparing raw word frequencies over our whole Twitter histories; now let's find which words are more or less likely to come from each person's account using the log odds ratio. First, let's restrict the analysis moving forward to tweets from David and Julia sent during 2016. David was consistently active on Twitter for all of 2016 and this was about when Julia transitioned into data science as a career.
```{r}
tidy_tweets <- tidy_tweets %>%
  filter(timestamp >= as.Date("2016-01-01"),
         timestamp < as.Date("2017-01-01"))
```
Next, let's use `str_detect` to remove Twitter usernames from the `word` column because otherwise, the results here are dominated by people whom one of us knows and the other does not. After removing these, we count how many times each person uses each word and keep only the words used at least 10 times. After a `spread` operation, we can calculate the log odds ratio for each word, using
$$\text{log odds ratio} = \ln{\left(\frac{\left[\frac{n+1}{\text{total}+1}\right]_\text{David}}{\left[\frac{n+1}{\text{total}+1}\right]_\text{Julia}}\right)}$$
where $n$ is the number of times the word in question is used by each person and the total indicates the total words for each person.
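To get a feel for the formula, here is a quick sanity check with made-up counts (not values from our data): suppose David used a word 20 times out of 10,000 total words and Julia used it 5 times out of 30,000.
```{r}
# Made-up counts: 20 of 10,000 words for David, 5 of 30,000 words for Julia
log(((20 + 1) / (10000 + 1)) / ((5 + 1) / (30001)))
# A positive value leans toward David's account; a negative value would lean toward Julia's
```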
```{r word_ratios, dependson = "tidytweets"}
word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, person) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(person, n, fill = 0) %>%
  mutate_each(funs((. + 1) / sum(. + 1)), -word) %>%
  mutate(logratio = log(David / Julia)) %>%
  arrange(desc(logratio))
```
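Note that `mutate_each()` and `funs()` have since been deprecated in dplyr. If you are running this code with a current release, a rough equivalent of the chunk above using `across()` (the same computation, not evaluated here) would be:
```{r, eval = FALSE}
word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, person) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(person, n, fill = 0) %>%
  mutate(across(c(David, Julia), ~ (.x + 1) / sum(.x + 1))) %>%
  mutate(logratio = log(David / Julia)) %>%
  arrange(desc(logratio))
```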
What are some words that have been about equally likely to come from David or Julia's account during 2016?
```{r, dependson = "word_ratios"}
word_ratios %>% 
  arrange(abs(logratio))
```
We are about equally likely to tweet about maps, email, APIs, and functions.
Which words are most likely to be from Julia's account or from David's account? Let's just take the top 15 most distinctive words for each account and plot them.
```{r plotratios, dependson = "word_ratios", fig.width=7, fig.height=6}
word_ratios %>%
  group_by(logratio < 0) %>%
  top_n(15, abs(logratio)) %>%
  ungroup() %>%
  mutate(word = reorder(word, logratio)) %>%
  ggplot(aes(word, logratio, fill = logratio < 0)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip() +
  ylab("log odds ratio (David/Julia)") +
  scale_fill_discrete(name = "", labels = c("David", "Julia"))
```
So David has tweeted about specific conferences he has gone to, genes, Stack Overflow, and matrices, while Julia has tweeted about Utah, physics, Census data, Christmas, and her family.
## Changes in word use
The section above looked at overall word use, but now let's ask a different question. Which words' frequencies have changed the fastest in our Twitter feeds? Or to state this another way, which words have we tweeted about at a higher or lower rate as time has passed? To do this, we will define a new time variable in the dataframe that defines which unit of time each tweet was posted in. We can use `floor_date()` from lubridate to do this, with a unit of our choosing; using 1 month seems to work well for this year of tweets from both of us.
After we have the time bins defined, we count how many times each of us used each word in each time bin. After that, we add columns to the dataframe for the total number of words each person used in each time bin and the total number of times each word was used overall. We can then `filter` to keep only words used at least some minimum number of times (more than 30, in this case).
```{r words_by_time, dependson = "tidytweets"}
words_by_time <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  mutate(time_floor = floor_date(timestamp, unit = "1 month")) %>%
  count(time_floor, person, word) %>%
  ungroup() %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 30)
words_by_time
```
Each row in this dataframe corresponds to one person using one word in a given time bin. The `count` column tells us how many times that person used that word in that time bin, the `time_total` column tells us how many words that person used during that time bin, and the `word_total` column tells us how many times that word was used over the whole year. This is the data set we can use for modeling.
We can use `nest` from tidyr to make a data frame with a list column that contains little miniature data frames for each word. Let's do that now and take a look at the resulting structure.
```{r nest, dependson = "words_by_time"}
nested_data <- words_by_time %>%
  nest(-word, -person)
nested_data
```
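(The column-selection interface of `nest()` has changed in newer versions of tidyr; with a current release, a roughly equivalent call, not evaluated here, would be the following.)
```{r, eval = FALSE}
nested_data <- words_by_time %>%
  nest(data = c(-word, -person))
```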
This data frame has one row for each person-word combination; the `data` column is a list column that contains data frames, one for each combination of person and word. Let's use `map` from the purrr library to apply our modeling procedure to each of those little data frames inside our big data frame. This is count data, so let's use `glm` with `family = "binomial"` for modeling. We can think about this modeling procedure as answering a question like, "Was a given word mentioned in a given time bin? Yes or no? How does the count of word mentions depend on time?"
```{r nested_models, dependson = "nest"}
library(purrr)
nested_models <- nested_data %>%
  mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, ., 
                                  family = "binomial")))
nested_models
```
Now notice that we have a new column for the modeling results; it is another list column and contains `glm` objects. The next step is to use `map` from purrr and `tidy` from the broom package to pull out the slopes for each of these models and find the important ones. We are comparing many slopes here and some of them are not statistically significant, so let's apply an adjustment to the p-values for multiple comparisons.
```{r slopes, dependson = "nested_models"}
library(broom)
slopes <- nested_models %>%
  unnest(map(models, tidy)) %>%
  filter(term == "time_floor") %>%
  mutate(adjusted.p.value = p.adjust(p.value))
```
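As with `nest()`, the `unnest()` interface has changed in newer versions of tidyr, which no longer accept an expression like `map(models, tidy)` directly. A rough equivalent of the chunk above for a current release (not evaluated here) is:
```{r, eval = FALSE}
slopes <- nested_models %>%
  mutate(coefs = map(models, tidy)) %>%  # tidy each glm into a data frame
  unnest(coefs) %>%
  filter(term == "time_floor") %>%
  mutate(adjusted.p.value = p.adjust(p.value))
```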
Now let's find the most important slopes. Which words have changed in frequency at a moderately significant level in our tweets?
```{r top_slopes2, dependson = "slopes"}
top_slopes <- slopes %>% 
  filter(adjusted.p.value < 0.1)
top_slopes
```
To visualize our results, we can plot these words' use for both David and Julia over this year of tweets.
```{r dependson = c("words_by_time", "top_slopes2"), fig.width=8, fig.height=5}
words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  filter(person == "David") %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(alpha = 0.8, size = 1.3) +
  labs(x = NULL, y = "Word frequency",
       title = "Trending words in David's tweets")
```
David tweeted a lot about the UseR conference while he was there and then quickly stopped. He has tweeted more about Stack Overflow toward the end of the year and less about ggplot2 as the year has progressed.
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Me: I'm so sick of data science wars. <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> vs Python, frequentist vs Bayesian...<br><br>Them: base vs ggplot2...<br><br>Me: WHY WHICH SIDE ARE YOU ON</p>— David Robinson (@drob) <a href="https://twitter.com/drob/status/712639593703542785">March 23, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
```{r dependson = c("words_by_time", "top_slopes2"), fig.width=8, fig.height=5}
words_by_time %>%
  inner_join(top_slopes, by = c("word", "person")) %>%
  filter(person == "Julia") %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(alpha = 0.8, size = 1.3) +
  labs(x = NULL, y = "Word frequency",
       title = "Trending words in Julia's tweets")
```
All the significant slopes for Julia are negative. This means she has not tweeted about any specific word at an increasing rate; instead, she has shifted to a variety of different words, and her tweets earlier in the year contained the words shown in this plot at higher proportions. Words she uses when publicizing a new blog post, like the #rstats hashtag and "post", have gone down in frequency, and she has tweeted less about reading.
## Favorites and retweets
Another important characteristic of tweets is how many times they are favorited or retweeted. Let's explore which words are more likely to be retweeted or favorited in Julia's and David's tweets. When a user downloads their own Twitter archive, favorites and retweets are not included, so we constructed another dataset of the authors' tweets that includes this information. We accessed our own tweets via the Twitter API and downloaded about 3200 tweets for each person. In both cases, that is about the last 18 months' worth of Twitter activity. This corresponds to a period of increasing activity and increasing numbers of followers for both of us.
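If you want to assemble a similar dataset for your own account, a minimal sketch using the rtweet package might look like the following; this is only an illustration under the assumption that you have your own Twitter API credentials set up, and it is not how the files read below were built.
```{r, eval = FALSE}
# A sketch only: pull a user's most recent tweets (up to about 3200)
# via the Twitter API with the rtweet package; requires authentication
library(rtweet)
my_tweets <- get_timeline("juliasilge", n = 3200)
```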
```{r setup2}
tweets_julia <- read_csv("data/juliasilge_tweets.csv")
tweets_dave <- read_csv("data/drob_tweets.csv")
tweets <- bind_rows(tweets_julia %>% 
                      mutate(person = "Julia"),
                    tweets_dave %>% 
                      mutate(person = "David")) %>%
  mutate(created_at = ymd_hms(created_at))
```
Now that we have this second, smaller set of only recent tweets, let's use `unnest_tokens` to transform these tweets to a tidy data set. (Let's again remove any retweets from this data set so we only look at tweets that David and Julia have posted directly.)
```{r tidy_tweets2, dependson = "setup2"}
tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  anti_join(stop_words)
tidy_tweets
```
To start with, let's look at retweets. Let's find the total number of retweets for each person.
```{r rt_totals, dependson = "tidy_tweets2"}
totals <- tidy_tweets %>% 
  group_by(person, id) %>% 
  summarise(rts = first(retweets)) %>% 
  group_by(person) %>% 
  summarise(total_rts = sum(rts))
totals
```
Now let's find the median number of retweets for each word and person; we probably want to count each tweet/word combination only once, so we will use `group_by` and `summarise` twice, one right after the other. Next, we can join this to the data frame of retweet totals.
```{r word_by_rts, dependson = c("rt_totals", "tidy_tweets2")}
word_by_rts <- tidy_tweets %>% 
  group_by(id, word, person) %>% 
  summarise(rts = first(retweets)) %>% 
  group_by(person, word) %>% 
  summarise(retweets = median(rts)) %>%
  left_join(totals) %>%
  filter(retweets != 0) %>%
  ungroup()

word_by_rts %>% 
  arrange(desc(retweets))
```
At the top of this sorted data frame, we see David's tweet about [his blog post on Donald Trump's own tweets](http://varianceexplained.org/r/trump-tweets/) that went viral. A search tells us that this is the only time David has ever used the word "angrier" in his tweets, so that word has an extremely high median retweet rate.
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">New post: Analysis of Trump tweets confirms he writes only the angrier Android half <a href="https://t.co/HRr4yj30hx">https://t.co/HRr4yj30hx</a> <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> <a href="https://t.co/cmwdNeYSE7">pic.twitter.com/cmwdNeYSE7</a></p>— David Robinson (@drob) <a href="https://twitter.com/drob/status/763048283531055104">August 9, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
Now we can plot the words that have contributed the most to each of our retweets.
```{r dependson = "word_by_rts", fig.width=10, fig.height=5}
word_by_rts %>%
  mutate(ratio = retweets / total_rts) %>%
  group_by(person) %>%
  top_n(10, ratio) %>%
  arrange(ratio) %>%
  mutate(word = factor(word, unique(word))) %>%
  ungroup() %>%
  ggplot(aes(word, ratio, fill = person)) +
  geom_bar(stat = "identity", alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ person, scales = "free", ncol = 2) +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  labs(x = NULL, 
       y = "proportion of total RTs due to each word",
       title = "Words with highest median retweets")
```
We see more words from David's tweet about his Trump blog post, and words from Julia making announcements about blog posts and new package releases. These are some pretty good tweets; we can see why people retweeted them.
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">NEW POST: Mapping ghost sightings in Kentucky using Leaflet 👻👻👻 <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> <a href="https://t.co/rRFTSsaKWQ">https://t.co/rRFTSsaKWQ</a> <a href="https://t.co/codPf3gy6O">pic.twitter.com/codPf3gy6O</a></p>— Julia Silge (@juliasilge) <a href="https://twitter.com/juliasilge/status/761667180148641793">August 5, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Me: Git makes it easy to revert your local changes<br><br>Them: Great! So what command do I use?<br><br>Me: I said it was easy not that I knew how</p>— David Robinson (@drob) <a href="https://twitter.com/drob/status/770706647585095680">August 30, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
We can follow a similar procedure to see which words led to more favorites. Are they different than the words that lead to more retweets?
```{r word_by_favs, dependson = "tidy_tweets2"}
totals <- tidy_tweets %>% 
  group_by(person, id) %>% 
  summarise(favs = first(favorites)) %>% 
  group_by(person) %>% 
  summarise(total_favs = sum(favs))

word_by_favs <- tidy_tweets %>% 
  group_by(id, word, person) %>% 
  summarise(favs = first(favorites)) %>% 
  group_by(person, word) %>% 
  summarise(favorites = median(favs)) %>%
  left_join(totals) %>%
  filter(favorites != 0) %>%
  ungroup()
```
```{r dependson = "word_by_favs", fig.width=10, fig.height=5}
word_by_favs %>%
  mutate(ratio = favorites / total_favs) %>%
  group_by(person) %>%
  top_n(10, ratio) %>%
  arrange(ratio) %>%
  mutate(word = factor(word, unique(word))) %>%
  ungroup() %>%
  ggplot(aes(word, ratio, fill = person)) +
  geom_bar(stat = "identity", alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ person, scales = "free", ncol = 2) +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  labs(x = NULL, 
       y = "proportion of total favorites due to each word",
       title = "Words with highest median favorites")
```
We see some minor differences, especially near the bottom of the top 10 list, but these are largely the same words as for retweets. In general, the same words that lead to retweets lead to favorites. There are some exceptions, though.
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">🎶 I am writing a Shiny app for my joooooooob 🎶🎶 I am living the dreeeeeeeeeam... 🎶🎶 <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a></p>— Julia Silge (@juliasilge) <a href="https://twitter.com/juliasilge/status/732645241610600448">May 17, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>