# Manipulating data with tidyr
# Goals
- Be able to identify wide- and long-format data and why we might need one or the other
- Be able to convert between the two formats with the tidyr package
# Introduction
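Wide data has a column for each variable. To see what that means, we'll start by building a small example data set from the `airquality` data that ships with R, keeping only the month, day, and temperature columns.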
```{r}
library(tidyverse)
air <- airquality # built into R
names(air) <- tolower(names(air))
air <- as_tibble(air) %>% # as_data_frame() is deprecated in favour of as_tibble()
  select(-ozone, -solar.r, -wind)
```
For example, this is long-format data:
```{r}
air
```
And this is wide-format data:
```{r}
spread(air, month, temp)
```
Long-format data has one or more columns that identify each observation and a single column that holds the measured values. In wide-format data, the values of one of those identifier columns become column names themselves.
It turns out that you need wide-format data for some types of data analysis and long-format data for others. In practice, you need long-format data much more often than wide-format data. For example, ggplot2 requires long-format data, `dplyr::group_by()` requires long-format data, and most modelling functions (such as `lm()`, `lmer()`, `glm()`, and `gam()`) require long-format data (except for the predictors themselves). But people often find it easier to record their data in wide format.
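
As a rough illustration of why long format is convenient for `dplyr::group_by()`, here is a minimal sketch. The names `temps_long` and `monthly_means` are made up for this example; the data are the same month, day, and temperature columns from `airquality` used above.

```{r}
library(dplyr)

# Hypothetical example data: one row per month/day observation (long format)
temps_long <- data.frame(
  month = airquality$Month,
  day = airquality$Day,
  temp = airquality$Temp
)

# Because `month` is an ordinary column, grouping by it is a single step
monthly_means <- temps_long %>%
  group_by(month) %>%
  summarise(mean_temp = mean(temp))

monthly_means
```

With the wide version of the same data, each month would be its own column, so the same summary would need a separate calculation per column.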
# tidyr
tidyr is a successor to the reshape2 package. It doesn't do everything that the reshape2 package does (and if you need that, see my [blog post](http://seananderson.ca/2013/10/19/reshape.html)). But it covers the majority of data reshaping and it does it more elegantly than reshape2 (read: it works nicely with the data pipe, `%>%`).
tidyr is based around two key functions: `gather()` and `spread()`.
`gather` takes wide-format data and *gathers* it into fewer columns.
`spread` takes long-format data and *spreads* it out wide.
We'll sometimes end up having to use these to get data formatted for fitting mixed effects models and for manipulating the output.
# tidyr::spread
`spread` takes, in order, a data frame, the name of the 'key' column (the column that gets 'swung' up to make the new identifying columns), and the name of the 'value' column (the column that fills the wide dataset with values).
The tidyr package functions take bare (unquoted) column names. This saves typing and makes the functions work well with pipes. E.g. `spread(data, x, y)` *not* `spread(data, "x", "y")`.
Let's take our `air` data and make it wide:
```{r}
air_wide <- spread(air, month, temp)
air_wide
```
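
One thing worth checking after spreading is the shape of the result. The sketch below rebuilds the same month, day, and temperature data under hypothetical names (`temps`, `temps_wide`) so that it runs on its own. Because June and September have only 30 days, the day-31 row has no value for those months, and `spread()` fills those cells with `NA` by default.

```{r}
library(tidyr)

# Hypothetical stand-alone copy of the month/day/temp data
temps <- data.frame(
  month = airquality$Month,
  day = airquality$Day,
  temp = airquality$Temp
)

temps_wide <- spread(temps, month, temp)

# One row per day of the month, one column per month plus `day`
dim(temps_wide)

# Day 31 only exists in months 5, 7, and 8; the other cells are NA
temps_wide[temps_wide$day == 31, ]
```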
# tidyr::gather
`gather` takes wide data and makes it long.
The first argument is the data, the second is the name we want to give the new 'key' column in the long dataset (it will hold the old column names), the third is the name we want to give the new 'value' column (it will hold the values), and the rest of the (unnamed) arguments use the syntax from the `dplyr::select` function to specify which columns to gather (i.e. all non-ID columns).
As an example: let's turn `air_wide` back into `air`.
```{r}
gather(air_wide, month, temp, -day)
```
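
The round trip is not quite exact: the gathered key column comes back as character (it was built from column names), and the `NA` cells created by spreading are kept as rows. The sketch below, which uses hypothetical object names and rebuilds the data so it runs on its own, shows how the `na.rm` and `convert` arguments of `gather()` can be used to get closer to the original.

```{r}
library(tidyr)

# Hypothetical stand-alone copy of the month/day/temp data
temps <- data.frame(
  month = airquality$Month,
  day = airquality$Day,
  temp = airquality$Temp
)
temps_wide <- spread(temps, month, temp)

# Default behaviour: character key column, NA rows kept
temps_back <- gather(temps_wide, month, temp, -day)
class(temps_back$month)
nrow(temps_back)

# na.rm = TRUE drops rows whose value is NA;
# convert = TRUE runs type conversion on the key column
temps_back2 <- gather(temps_wide, month, temp, -day,
  na.rm = TRUE, convert = TRUE)
class(temps_back2$month)
nrow(temps_back2)
```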
## Challenge 1
Try to answer the following questions before running the code:
What will the following do?
```{r}
gather(air_wide, zebra, aligator, -day) # exercise
```
Is this the same as above?
```{r}
gather(air_wide, month, temp, 2:6)
```
Why doesn't the following do what we want?
```{r, eval=FALSE}
gather(air_wide, month, temp)
```
## Challenge 2
Start by running the following code to create a data set:
```{r fake-data}
# from ?tidyr::spread
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
```
Now make the dataset (stock prices for stocks X, Y, and Z) long with columns for time, stock ID, and price. Save the output into `stocks_long`.
Answer:
```{r}
stocks_long <- stocks %>%
  gather(key = stock, value = price, -time) # exercise
```
Make `stocks_long` wide again with `spread()`:
```{r}
stocks_long %>%
  spread(key = stock, value = price) # exercise
```
Bonus: There's another possible wide format for this dataset. Can you figure it out and make it with `spread()`? Hint: there's a row for each stock.
```{r}
stocks_long %>%
  spread(key = time, value = price) # exercise
```
# Further information
Parts of this exercise were modified from <http://seananderson.ca/2013/10/19/reshape.html>, which uses the older, more powerful, but less pipe-friendly reshape2 package.
<http://r4ds.had.co.nz/tidy-data.html>
<https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf>