-
Notifications
You must be signed in to change notification settings - Fork 66
/
03-dplyr.Rmd
257 lines (177 loc) · 9.66 KB
/
03-dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
# Manipulating data with dplyr and pipes
# Goals
- Become familiar with the 5 dplyr verbs of data manipulation
- Be comfortable chaining R commands with the `%>%` (pipe) symbol
# dplyr
dplyr is built around 5 verbs. These verbs make up the majority of the data manipulation you tend to do. You might need to:
*Select* certain columns of data.
*Filter* your data to select specific rows.
*Arrange* the rows of your data into an order.
*Mutate* your data frame to contain new columns.
*Summarise* chunks of you data in some way.
Let's look at how those work.
# The data
We're going to work with a dataset of mammal life-history, geography, and ecology traits from the PanTHERIA database:
Jones, K.E., *et al*. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90:2648. <http://esapubs.org/archive/ecol/E090/184/>
You can download the data from here:
```{r download-data, eval=FALSE}
pantheria <-
"http://esapubs.org/archive/ecol/E090/184/PanTHERIA_1-0_WR05_Aug2008.txt"
download.file(pantheria, destfile = "data/raw/mammals.txt")
```
I've cached the file in `data/raw/` as well.
We'll load the dplyr package:
```{r, message=FALSE}
library(dplyr)
```
Next we'll read it in and simplify it. This gets a bit ugly, but you can safely just run this code chunk and ignore the details. Later you might want to come back and work through the code if any looks new to you. This type of data cleaning is commonly required:
```{r, message=FALSE}
mammals <- readr::read_tsv("data/raw/mammals.txt")
names(mammals) <- sub("[0-9._-]+", "", names(mammals))
names(mammals) <- sub("MSW", "", names(mammals))
mammals <- select(mammals, Order, Binomial, AdultBodyMass_g,
AdultHeadBodyLen_mm, HomeRange_km2, LitterSize)
names(mammals) <- gsub("([A-Z])", "_\\L\\1", names(mammals), perl = TRUE)
names(mammals) <- gsub("^_", "", names(mammals), perl = TRUE)
mammals[mammals == -999] <- NA
names(mammals)[names(mammals) == "binomial"] <- "species"
mammals <- as_data_frame(mammals) # for prettier printing etc.
```
# Looking at the data
Data frames look a bit different in dplyr. Above, I called the `as_data_frame()` function on our data. This provides more useful printing of data frames in the console. Ever accidentally printed a massive data frame in the console before? Yeah... this avoids that. You don't need to change your data to a data frame tbl first — the dplyr functions will automatically convert your data when you call them. This is what the data look like on the console:
```{r}
mammals
```
dplyr also provides a function `glimpse()` that makes it easy to look at our data in a transposed view. It's similar to the `str()` (structure) function, but has a few advantages (see `?glimpse`).
```{r}
glimpse(mammals)
```
# Selecting columns
`select()` lets you subset by columns. This is similar to `subset()` in base R, but it also allows for some fancy use of helper functions such as `contains()`, `starts_with()` and, `ends_with()`. I think these examples are self explanatory, so I'll just include them here:
```{r}
select(mammals, adult_head_body_len_mm)
select(mammals, adult_head_body_len_mm, litter_size)
select(mammals, adult_head_body_len_mm:litter_size)
select(mammals, -adult_head_body_len_mm)
select(mammals, starts_with("adult"))
select(mammals, ends_with("g"))
select(mammals, 1:3)
```
# Filtering rows
`filter()` lets you subset by rows. You can use any valid logical statements:
```{r}
filter(mammals, adult_body_mass_g > 1e7)
filter(mammals, species == "Balaena mysticetus")
filter(mammals, order == "Carnivora", adult_body_mass_g < 200)
```
Challenge: filter `mammals` for all rows where `adult_body_mass_g` is `NA` and `adult_head_body_len_mm` is greater than 2000. Hint (use `is.na()` and `>`):
```{r}
filter(mammals, is.na(home_range_km2), adult_head_body_len_mm > 2000) # exercise
```
# Arranging rows
`arrange()` lets you order the rows by one or more columns in ascending or descending order. I'm selecting the first three columns only to make the output easier to read:
```{r}
arrange(mammals, adult_body_mass_g)
arrange(mammals, desc(adult_body_mass_g))
arrange(mammals, order, adult_body_mass_g)
```
# Mutating columns
`mutate()` lets you add new columns. Notice that the new columns you create can build on each other. I will wrap these in `glimpse()` to make the new columns easy to see:
```{r}
glimpse(mutate(mammals, adult_body_mass_kg = adult_body_mass_g / 1000))
glimpse(mutate(mammals,
g_per_mm = adult_body_mass_g / adult_head_body_len_mm))
glimpse(mutate(mammals,
g_per_mm = adult_body_mass_g / adult_head_body_len_mm,
kg_per_mm = g_per_mm / 1000))
```
# Summarising columns
Finally, `summarise()` lets you calculate summary statistics. On its own `summarise()` isn't that useful, but when combined with `group_by()` you can summarise by chunks of data. This is similar to what you might be familiar with through `ddply()` and `summarise()` from the plyr package:
```{r}
summarise(mammals, mean_mass = mean(adult_body_mass_g, na.rm = TRUE))
# summarise with group_by:
head(summarise(group_by(mammals, order),
mean_mass = mean(adult_body_mass_g, na.rm = TRUE)))
```
# Piping data
Pipes take the output from one function and feed it to the first argument of the next function. You may have encountered the Unix pipe `|` before.
The magrittr R package contains the pipe function `%>%`. Yes it might look bizarre at first but it makes more sense when you think about it. The R language allows symbols wrapped in `%` to be defined as functions, the `>` helps imply a chain, and you can hit these 2 characters one after the other very quickly on a keyboard by holding down the Shift key. Try it!
Try pronouncing `%>%` "then" whenever you see it. If you want to see the help page, you'll need to wrap it in back ticks like so:
```{r, eval=FALSE}
?magrittr::`%>%`
```
# A trivial pipe example
Pipes can work with nearly any functions. Let's start with a non-dplyr example:
```{r}
x <- rnorm(10)
x %>% max()
# is the same thing as:
max(x)
```
So, we took the value of `x` (what would have been printed on the console), captured it, and fed it to the first argument of `max()`. It's probably not clear why this is cool yet, but hang on.
# A silly dplyr example with pipes
Let's try a single-pipe dplyr example. We'll pipe the `mammals` data frame to the arrange function's first argument, and choose to arrange by the `adult_body_mass_g` column:
```{r}
mammals %>% arrange(adult_body_mass_g)
```
# A better example
Here's where it gets interesting. We can chain dplyr functions in succession. This lets us write data manipulation steps in the order we think of them and avoid creating temporary variables in the middle to capture the output. This works because the output from every dplyr function is a data frame and the first argument of every dplyr function is a data frame.
Say we wanted to find the species with the highest body-mass-to-length ratio:
```{r}
mammals %>%
mutate(mass_to_length = adult_body_mass_g / adult_head_body_len_mm) %>%
arrange(desc(mass_to_length)) %>%
select(species, mass_to_length)
```
So, we took `mammals`, fed it to `mutate()` to create a mass-length ratio column, arranged the resulting data frame in descending order by that ratio, and selected the columns we wanted to see. This is just the beginning. If you can imagine it, you can string it together. If you want to debug your code, just pull a pipe off the end and run the code down to that step. Or build your analysis up and add successive pipes.
The above is equivalent to:
```{r}
select(
arrange(
mutate(mammals,
mass_to_length = adult_body_mass_g / adult_head_body_len_mm),
desc(mass_to_length)),
species, mass_to_length)
```
But the problem here is that you have to read it inside out, it's easy to miss a bracket, and the arguments get separated from the function (e.g. see `mutate()` and `desc(mass_to_length))`). Plus, this is a rather trivial example. Chain together even more steps and it quickly gets out of hand.
Here's one more example. Let's ask what taxonomic orders have a median litter size less than 3. I'll start by grouping by `order` and calculating the median litter sizes:
```{r}
mammals %>% group_by(order) %>%
summarise(median_litter = median(litter_size, na.rm = TRUE))
```
## Challenge 1
Take what I started and add a line that keeps only the rows where `median_litter` is less than 3:
```{r}
mammals %>% group_by(order) %>%
summarise(median_litter = median(litter_size, na.rm = TRUE)) %>%
filter(median_litter < 3) # exercise
```
Bonus: also arrange the data frame by `median_litter` from largest to smallest and select only the columns `order` and `median_litter`:
```{r}
mammals %>% group_by(order) %>%
summarise(median_litter = median(litter_size, na.rm = TRUE)) %>%
filter(median_litter < 3) %>% # exercise
select(order, median_litter) %>% # exercise
arrange(-median_litter) # exercise
```
## Challenge 2
Again, your turn: make the following piece of code easier to read by converting it to use pipes (hint: work from the inside out starting with `select()`):
```{r}
summarise(
group_by(
select(mammals, order, adult_body_mass_g),
order),
mean_body_mass = mean(adult_body_mass_g, na.rm = TRUE))
```
Answer:
```{r}
mammals %>% select(order, adult_body_mass_g) %>% # exercise
group_by(order) %>% # exercise
summarise(mean_body_mass = mean(adult_body_mass_g, na.rm = TRUE)) # exercise
```
# Resources
Parts of this exercise were modified from: <http://seananderson.ca/2014/09/13/dplyr-intro.html>
<https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html>
<https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf>
<http://r4ds.had.co.nz/transform.html>
<http://r4ds.had.co.nz/pipes.html>