-
Notifications
You must be signed in to change notification settings - Fork 91
/
12-wt-social-network-analysis.Rmd
291 lines (203 loc) · 18.3 KB
/
12-wt-social-network-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
# Walkthrough 6: Exploring relationships using social network analysis with social media data {#c12}
**Abstract**
This chapter explores the use of social network analysis (commonly referred to as SNA). While much of the data that data scientists in education analyze pertains to variables for individuals, some data concerns the relationships between individuals, such as friendship between youth, or advice-seeking on the part of teachers. While such data is common---and often of interest to data scientists---it can be difficult to analyze, in part due to the multiple sources of data (about both individuals’ relations and individuals and the complexity of the relations between individuals. This chapter uses data from the Twitter #tidyuesday network, an R learning community, to demonstrate how such data can be accessed and imported, processed into forms necessary for social network analysis using the {tidygraph} R package, and visualized using the {ggraph} R package.
## Topics emphasized
- Transforming data
- Visualizing data
## Functions introduced
- `rtweet::search_tweets()`
- `randomNames::randomNames()`
- `tidyr::unnest()`
- `tidygraph::as_tbl_graph()`
- `ggraph::ggraph()`
## Vocabulary
- Application Programming Interface (API)
- edgelist
- edge
- influence model
- regex
- selection model
- social network analysis
- sociogram
- vertex
## Chapter overview
This chapter builds on [Walkthrough 5 in Chapter 11](#c11), where we worked with #tidytuesday data. In the previous chapter we focused on using text analysis to understand the *content* of tweets. In this, we chapter focus on the *interactions* between #tidytuesday participants using social network analysis (sometimes simply referred to as network analysis) techniques.
While social network analysis is increasingly common, it remains challenging to carry out. For one, cleaning and tidying the data can be even more challenging than for most other data sources because net data for social network analysis (or network data) often includes variables about both individuals (such as information students or teachers) and their relationships (whether they have a relationship at all, for example, or how strong or of what type their relationship is). This chapter is designed to take you from not having carried out social network analysis through visualizing network data.
Like the previous chapter, we've also included an appendix ([Appendix C](#c20c)) to introduce some social network-related ideas for further exploration; these focus on modeling social network processes, particularly, the processes of who chooses (or selects) to interact with whom, and of influence, or how relationships can impact individuals' behaviors.
You will need a Twitter account to complete *all* of the code outlined in this chapter in order to access your own Twitter data (through the Twitter Application Interface [API]; see [Chapter 11](#c11) for an in-depth description of what this means and how to access it).
If you do not have a Twitter account, you can create one and keep it private, or even delete the account once you're done with this walkthrough. Additionally, you can simply access data that has already been accessed from the Twitter API via the {dataedu} package (as we describe below).
### Background
There are a few reasons to be interested in social media. For example, if you work in a school district, you may want to know who is interacting with the content you share. If you are a researcher, you may want to investigate what teachers, administrators, and others do through state-based hashtags (e.g., @rosenberg2016). Social media-based data also provides new contexts for learning to take place, like in professional learning networks [@trust2016].
In the past, if a teacher wanted advice about how to plan a unit or to design a lesson, they would turn to a trusted peer in their building or district [@spillane2012]. Today they are as likely to turn to someone in a social media network. Social media interactions like the ones tagged with the #tidytuesday hashtag are increasingly common in education. Using data science tools to learn from these interactions is valuable for improving the student experience.
### Packages
In this chapter, we access data using the {rtweet} package [@kearney2016]. Through {rtweet} and a Twitter account, it is easy to access data from Twitter. We will load the {tidyverse} and {rtweet} packages to get started.
We will also load other packages that we will be using in this analysis, including two packages related to social network analysis [@R-tidygraph; @R-ggraph] as well as one that will help us to use not-anonymized names in a savvy way [@R-randomNames]. As always, if you have not installed any of these packages before (which may particularly be the case for the {rtweet}, {randomNames}, {tidygraph}, and {ggraph} packages, which we have not yet used in the book), do so using the `install.packages()` function. More on installing packages is included in the ["*Packages*"](#c06p) section of the ["*Foundational Skills*"](#c06) chapter.
Let's load the packages with the following calls to the `library()` function:
```{r, message = F, warning = F}
library(tidyverse)
library(rtweet)
library(dataedu)
library(randomNames)
library(tidygraph)
library(ggraph)
```
### Data sources and import
Here is an example of searching the most recent 1,000 tweets, which include the hashtag #rstats. When you run this code, you will be prompted to authenticate your access via Twitter.
```{r, eval = FALSE}
rstats_tweets <-
search_tweets("#rstats")
```
You can easily change the search term to other hashtags terms. For example, to search for #tidytuesday tweets, we can replace #rstats with #tidytuesday:
```{r, eval = FALSE}
tt_tweets <-
search_tweets("#tidytuesday")
```
You can find a greater number of tweets by adding a greater value to the `n` argument of the `search_tweets()` function, as follows, to collect the most recent 500 tweets:
```{r, eval = FALSE}
tt_tweets <-
search_tweets("#tidytuesday", n = 500)
```
You may notice that the most recent tweets containing the #tidytuesday hashtag are returned. What if you wanted to go further back in time? We’ll discuss this topic in the next section and in Appendix B.
### Using an Application Programming Interface (or API)
It's worth taking a short detour to talk about how you can obtain a dataset spanning a longer period of time. A common way to import data from websites, including social media platforms, is to use something called an Application Programming Interface (API). In fact, if you ran the code above, you just accessed an API!
Think of an API as a special door a home builder made for a house that has a lot of cool stuff in it. The home builder doesn’t want everyone to be able to walk right in and use a bunch of stuff in the house. But they also don’t want to make it too hard because, after all, sharing is caring! Imagine the home builder made a door just for folks who know how to use doors. In order to get through this door, users need to know where to find it along the outside of the house. Once they’re there, they have to know the code to open. And, once they’re through the door, they have to know how to use the stuff inside. An API for social media platforms like Twitter and Facebook are the same way. You can download datasets of social media information, like tweets, using some code and authentication credentials organized by the website.
There are some advantages to using an API to import data at the start of your education dataset analysis. Every time you run the code in your analysis, you’ll be using the API to contact the social media platform and download a fresh dataset. Now your analysis is not just a one-off product, but, rather, is one that can be updated with the most recent data (in this case, Tweets), every time you run it. By using an API to import new data every time you run your code, you create an analysis that can be used again and again on future datasets. However, a key point---and limitation---is that Twitter allows access to their data via their API only for (approximately) the seven most recent days. There are a number of *other* ways to access older data, though we focus on one way here: having access to the URLs of (or the status IDs for) tweets.
As a result, we used this technique---described in-depth in [Appendix B](#c20b)---to collect older (historical) data from Twitter about the #tidytuesday hashtag, using a different function than the one described above (`rtweet::lookup_statuses()` instead of `rtweet::search_tweets()`). This was important for this chapter because having access to a greater number of tweets allows us to better understand the interactions between a larger number of the individuals participating in #tidytuesday. The data that we prepared from accessing historical data for #tidytuesday is available in the {dataedu} R package as the `tt_tweets` dataset, as we describe next.
#### Accessing the data from {dataedu}
Don't have Twitter or don't wish to access the data via Twitter? Then, you can load the data from the {dataedu} package (just as we did in the last chapter, [Chapter 11](#c11)), as follows:
```{r}
tt_tweets <- dataedu::tt_tweets
```
## View data
We can see that there are *many* rows for the data:
```{r}
nrow(tt_tweets)
```
## Methods: process data
Network data requires some processing before it can be used in subsequent analyses. The network dataset needs a way to identify each participant's role in the interaction. We need to answer questions like: Did someone reach out to another for help? Was someone contacted by another for help? We can process the data by creating an "edgelist". An edgelist is a dataset where each row is a unique interaction between two parties. Each row (which represents a single relationship) in the edgelist is referred to as an "edge". We note that one challenge facing data scientists beginning to use network analysis is the different terms that are used for similar (or the same!) aspects of analyses: Edges are sometimes referred to as "ties" or "relations", but these generally refer to the same thing, though they may be used in different contexts.
An edgelist looks like the following, where the `sender` (sometimes called the "nominator") column identifies who is initiating the interaction and the `receiver` (sometimes called the "nominee") column identifies who is receiving the interaction:
```{r, include = FALSE}
names_d1 <-
randomNames(6) %>%
enframe(name = NULL) %>%
mutate(sender = 1:6) %>%
set_names(c("sender2", "sender"))
names_d2 <-
randomNames(6) %>%
enframe(name = NULL) %>%
mutate(receiver = 1:6) %>%
set_names(c("receiver2", "receiver"))
example_edgelist <-
tibble(
sender = c(2, 1, 3, 1, 2, 6, 3, 5, 6, 4, 3, 4),
receiver = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6)
)
example_edgelist <-
example_edgelist %>%
left_join(names_d1) %>%
left_join(names_d2) %>%
select(sender = receiver2, receiver = sender2)
```
```{r, echo = FALSE}
example_edgelist
```
In this edgelist, the `sender` column might identify someone who nominates another (the receiver) as someone they go to for help. The sender might also identify someone who interacts with the receiver in other ways, like "liking" or "mentioning" their tweets. In the following steps, we will work to create an edgelist from the data from #tidytuesday on Twitter.
### Extracting mentions
Let's extract the mentions. There is a lot going on in the code below; let's break it down line-by-line, starting with `mutate()`:
- `mutate(all_mentions = str_extract_all(text, regex))`: this line uses a regex, or regular expression, to identify all of the usernames in the tweet (*note*: the regex comes from [this Stack Overflow page](https://stackoverflow.com/questions/18164839/get-twitter-username-with-regex-in-r) (https[]()://stackoverflow.com/questions/18164839/get-twitter-username-with-regex-in-r))
- `unnest(all_mentions)` this line uses a {tidyr} function, `unnest()` to move every mention to its own line, while keeping all of the other information the same (see more about `unnest()` here: [https://tidyr.tidyverse.org/reference/unnest.html](https://tidyr.tidyverse.org/reference/unnest.html))).
Now let's use these functions to extract the mentions from the dataset. Here's how all the code looks in action:
```{r}
regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
tt_tweets <-
tt_tweets %>%
# Use regular expression to identify all the usernames in a tweet
mutate(all_mentions = str_extract_all(text, regex)) %>%
unnest(all_mentions)
```
Let's put these into their own data frame, called `mentions`.
```{r}
mentions <-
tt_tweets %>%
mutate(all_mentions = str_trim(all_mentions)) %>%
select(sender = screen_name, all_mentions)
```
### Putting the edgelist together
Recall that an edgelist is a data structure that has columns for the "sender" and "receiver" of interactions. Someone "sends" the mention to someone who is mentioned, who can be considered to "receive" it. To make the edgelist, we'll need to clean it up a little by removing the "@" symbol. Let's look at our data as it is now.
```{r}
mentions
```
Let's remove that "@" symbol from the columns we created and save the results to a new tibble, `edgelist`.
```{r}
edgelist <-
mentions %>%
# remove "@" from all_mentions column
mutate(all_mentions = str_sub(all_mentions, start = 2)) %>%
# rename all_mentions to receiver
select(sender, receiver = all_mentions)
```
## Analysis and results
Now that we have our edgelist, let's plot the network. We'll use the {tidygraph} and {ggraph} packages to visualize the data. We note that network visualizations are often referred to as "sociograms" or a representation of the relationships between individuals in a network. We use this term and the term network visualization interchangeably in this chapter.
### Plotting the network
Large networks like this one can be hard to work with because of their size. We can get around that problem by only including some individuals. Let's explore how many interactions each individual in the network sent by using `count()`:
```{r}
interactions_sent <- edgelist %>%
# this counts how many times each sender appears in the data frame, effectively counting how many interactions each individual sent
count(sender) %>%
# arranges the data frame in descending order of the number of interactions sent
arrange(desc(n))
interactions_sent
```
618 senders of interactions is a lot! What if we focused on only those who sent more than one interaction?
```{r}
interactions_sent <-
interactions_sent %>%
filter(n > 1)
```
That leaves us with only 349, which will be much easier to work with.
We now need to filter the edgelist to only include these 349 individuals. The following code uses the `filter()` function combined with the `%in%` operator to do this:
```{r}
edgelist <- edgelist %>%
# the first of the two lines below filters to include only senders in the interactions_sent data frame
# the second line does the same, for receivers
filter(sender %in% interactions_sent$sender,
receiver %in% interactions_sent$sender)
```
We'll use the `as_tbl_graph()` function, which identifies the first column as the "sender" and the second as the "receiver". Let's look at the object it creates:
```{r}
g <-
as_tbl_graph(edgelist)
g
```
We can see that the network now has 267 individuals, all of whom sent more than one interaction. The individuals in a network are often referred to as "nodes" (and this terminology is used in the {ggraph} functions for plotting the individuals---the nodes---in a network). We note that nodes are sometimes referred to as "vertices" or "actors"; like the different names for edges, these generally mean the same thing.
Next, we'll use the `ggraph()` function:
```{r fig12-1, fig.cap="Network Graph"}
g %>%
# we chose the kk layout as it created a graph which was easy-to-interpret, but others are available; see ?ggraph
ggraph(layout = "kk") +
# this adds the points to the graph
geom_node_point() +
# this adds the links, or the edges; alpha = .2 makes it so that the lines are partially transparent
geom_edge_link(alpha = .2) +
# this last line of code adds a ggplot2 theme suitable for network graphs
theme_graph()
```
Finally, let's size the points based on a measure of centrality. A common way to do this is to measure how influential an individual may be based on the interactions observed.
```{r, echo = FALSE, message=F, warning=F}
extrafont::loadfonts() # needed to add this because of the css sheet, otherwise graph below wouldn't run
```
```{r fig12-2, fig.cap="Network Graph with Centrality"}
g %>%
# this calculates the centrality of each individual using the built-in centrality_authority() function
mutate(centrality = centrality_authority()) %>%
ggraph(layout = "kk") +
geom_node_point(aes(size = centrality, color = centrality)) +
# this line colors the points based upon their centrality
scale_color_continuous(guide = 'legend') +
geom_edge_link(alpha = .2) +
theme_graph()
```
There is much more you can do with {ggraph} (and {tidygraph}); check out the {ggraph} tutorial here: [https://ggraph.data-imaginist.com/](https://ggraph.data-imaginist.com/)
## Conclusion
In this chapter, we used social media data from the #tidytuesday hashtag to prepare and visualize social network data. Sociograms are a useful visualization tool to reveal who is interacting with whom---and, in some cases, to suggest why. In our applications of data science, we have found that the individuals (such as teachers or students) who are represented in a network often like to see what the network (and the relationships in it) *look like*. It can be compelling to think about why networks are the way they are, and how changes could be made to---for example---foster more connections between individuals who have few opportunities to interact. In this way, social network analysis can be useful to the data scientist in education because it provides a technique to communicate with other educational stakeholders in a compelling way.
Social network analysis is a broad (and growing) domain, and this chapter was intended to present some of its foundation. Fortunately for R users, many recent developments are implemented first in R (e.g., @R-amen). If you are interested in some of the additional steps that you can take to model and analyze network data, consider the appendix on two types of models (for selection and influence processes), [Appendix C](#c20c).