-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path07-tidyr.Rmd
333 lines (261 loc) · 11.5 KB
/
07-tidyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
---
title: "Create Tidy Data with tidyr"
teaching: 30
exercises: 15
questions:
- "How can I change the layout of a dataframe?"
objectives:
- "To understand the concepts of 'longer' and 'wider' dataframe formats and be able to convert between them with `tidyr`."
keypoints:
- "Use the `tidyr` package to change the layout of dataframes."
- "Use `pivot_longer()` to go from wide to longer layout."
- "Use `pivot_wider()` to go from long to wider layout."
source: Rmd
---
# Create Tidy Data with tidyr
```{r, include=FALSE}
source("bin/chunk-options.R")
knitr_fig_path("07-")
```
```{r include=FALSE, eval=FALSE}
# To create gapminder_wide.csv:
library("dplyr")
library("tidyr")
#transform the data to the wide format
gap_wide <- gapminder %>%
pivot_longer(cols = c(-continent,-country,-year),names_to = "obs_type",values_to = "obs_values") %>%
unite(ID_var,continent,country,sep = "_") %>%
unite(var_names,obs_type,year,sep = "_") %>%
pivot_wider(names_from = var_names,values_from = obs_values) %>%
separate(ID_var,into = c('continent','country'),sep = '_')
#write our the .csv so students can use it
write.csv(gap_wide,"data/gapminder_wide.csv",row.names = FALSE)
```
People who work with data often want to reshape their dataframes from 'wide' to 'longer'
layouts, or vice-versa. The 'long' layout or format is where:
- each column is a variable
- each row is an observation
In the purely 'long' (or 'longest') format, you usually have 1 column for the observed variable and the other columns are ID variables.
For the 'wide' format each row is often a site/subject/patient and you have
multiple observation variables containing the same type of data. These can be
either repeated observations over time, or observation of multiple variables (or
a mix of both). You may find data input may be simpler or some other
applications may prefer the 'wide' format. However, many of `R`'s functions have
been designed assuming you have 'longer' formatted data. This tutorial will help you
efficiently transform your data shape regardless of original format.
![](fig/08-tidyr-fig1.png){width=100%}
Long and wide dataframe layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due
to its shape. However, the long format is more machine readable and is closer
to the formatting of databases. The ID variables in our dataframes are similar to
the fields in a database and observed variables are like the database values.
## Getting started
First install the packages if you haven't already done so (you probably
installed dplyr in the previous lesson):
```{r,eval=FALSE}
install.packages("tidyr")
install.packages("dplyr")
```
Load the packages
```{r,message=FALSE}
library("tidyr")
library("dplyr")
library("readr")
```
First, lets look at the structure of our original gapminder dataframe:
```{r}
gapminder <- read_csv("data/gapminder_data.csv")
gapminder
```
### Challenge 1
>
> Is gapminder a purely long, purely wide, or some intermediate format?
>
>
> <details>
>
> <summary>
> Solution to challenge 1
> </summary>
>
> <br />
> The original gapminder data.frame is in an intermediate format. It is not
> purely long since it had multiple observation variables
> (`pop`,`lifeExp`,`gdpPercap`).
> </details>
Sometimes, as with the gapminder dataset, we have multiple types of observed
data. It is somewhere in between the purely 'long' and 'wide' data formats. We
have 3 "ID variables" (`continent`, `country`, `year`) and 3 "Observation
variables" (`pop`,`lifeExp`,`gdpPercap`). This intermediate format can be
preferred despite not having ALL observations in 1 column given that all 3
observation variables have different units. There are few operations that would
need us to make this dataframe any longer (i.e. 4 ID variables and 1
Observation variable).
While using many of the functions in R, which are often vector based, you
usually do not want to do mathematical operations on values with different
units. For example, using the purely long format, a single mean for all of the
values of population, life expectancy, and GDP would not be meaningful since it
would return the mean of values with 3 incompatible units. The solution is that
we first manipulate the data either by grouping (see the lesson on `dplyr`), or
we change the structure of the dataframe. **Note:** Some plotting functions in
R actually work better in the wide format data.
## From wide to long format with pivot_longer()
Until now, we've been using the nicely formatted original gapminder dataset, but
'real' data (i.e. our own research data) will never be so well organized. Here
let's start with the wide formatted version of the gapminder dataset.
We'll load the data file and look at it.
```{r}
gap_wide <- read_csv("data/gapminder_wide.csv")
gap_wide
```
![](fig/08-tidyr-fig2.png){width=100%}
To change this very wide dataframe layout back to our nice, intermediate (or longer) layout, we will use one of the two available `pivot` functions from the `tidyr` package. To convert from wide to a longer format, we will use the `pivot_longer()` function. `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns, or 'lengthening' your observation variables into a single variable.
![](fig/08-tidyr-fig3.png){width=100%}
```{r}
gap_long <- gap_wide %>%
pivot_longer(
cols = pop_1952:gdpPercap_2007,
names_to = "obstype_year", values_to = "obs_values"
)
gap_long
```
Here we have used piping syntax which is similar to what we were doing in the
previous lesson with dplyr. In fact, these are compatible and you can use a mix
of tidyr and dplyr functions by piping them together.
We first provide to `pivot_longer()` a vector of column names that will be
pivoted into longer format. We could type out all the observation variables, but
as in the `select()` function, we can use the `:` operator to select our desired variables.
`pivot_longer()` also allows the alternative syntax of using the `-` symbol to
identify which variables are not to be pivoted (i.e. ID variables).
The next arguments to `pivot_longer()` are `names_to` for naming the column that
will contain the new ID variable (`obstype_year`) and `values_to` for naming the
new amalgamated observation variable (`obs_value`). We supply these new column
names as strings.
![](fig/08-tidyr-fig4.png){width=100%}
```{r}
gap_long <- gap_wide %>%
pivot_longer(
cols = c(-continent, -country),
names_to = "obstype_year", values_to = "obs_values"
)
gap_long
```
That may seem trivial with this particular dataframe, but sometimes you have 1
ID variable and 40 observation variables with irregular variable names. The
flexibility is a huge time saver!
Now `obstype_year` actually contains 2 pieces of information, the observation
type (`pop`,`lifeExp`, or `gdpPercap`) and the `year`. We can use the
`separate()` function to split the character strings into multiple variables
```{r}
gap_long_split <- gap_long %>%
separate(obstype_year, into = c('obs_type', 'year'), sep = "_") %>%
mutate(year = as.numeric(year))
```
### Challenge 2 (5 minutes)
>
> Using `gap_long_split`, calculate the mean life expectancy, population, and gdpPercap for each continent.
>**Hint:** use the `group_by()` and `summarize()` functions we learned in the `dplyr` lesson
>
> <details>
>
> <summary>
> Solution to challenge 2
> </summary>
>
> <br />
>```{r}
>gap_long_split %>%
> group_by(continent, obs_type) %>%
> summarize(means=mean(obs_values))
>```
> </details>
## From long to intermediate format with pivot_wider()
It is always good to check work. So, let's use the second `pivot` function, `pivot_wider()`, to 'widen' our observation variables back out. `pivot_wider()` is the opposite of `pivot_longer()`, making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use `pivot_wider()` to pivot or reshape our `gap_long_split` to the original intermediate format or the widest format. Let's start with the intermediate format.
The `pivot_wider()` function takes `names_from` and `values_from` arguments.
To `names_from` we supply the column name whose contents will be pivoted into new
output columns in the widened dataframe. The corresponding values will be added
from the column named in the `values_from` argument.
```{r}
gap_normal <- gap_long_split %>%
pivot_wider(names_from = obs_type, values_from = obs_values)
dim(gap_normal)
dim(gapminder)
names(gap_normal)
names(gapminder)
```
Now we've got an intermediate dataframe `gap_normal` with the same dimensions as
the original `gapminder`, but the order of the variables is different. Let's fix
that before checking if they are `all.equal()`.
```{r}
gap_normal <- gap_normal %>%
select(country, year, pop, continent, lifeExp, gdpPercap)
all_equal(gap_normal, gapminder)
gap_normal
gapminder
```
We're almost there, the original was sorted by `country`, then
`year`.
```{r}
gap_normal <- gap_normal %>% arrange(country, year)
all_equal(gap_normal, gapminder)
```
That's great! We've gone from the longest format back to the intermediate and we
didn't introduce any errors in our code.
Now let's convert the long all the way back to the wide. In the wide format, we
will keep country and continent as ID variables and pivot the observations
across the 3 metrics (`pop`,`lifeExp`,`gdpPercap`) and time (`year`). First we
need to create appropriate labels for all our new variables (time*metric
combinations) and we also need to unify our ID variables to simplify the process
of defining `gap_wide`.
```{r}
gap_temp <- gap_long_split %>% unite(ID_var, continent, country, sep = "_")
gap_temp
gap_temp <- gap_long_split %>%
unite(ID_var, continent, country, sep = "_") %>%
unite(var_names, obs_type, year, sep = "_")
gap_temp
```
Using `unite()` we now have a single ID variable which is a combination of
`continent`,`country`,and we have defined variable names. We're now ready to
pipe in `pivot_wider()`
```{r}
gap_wide_new <- gap_long_split %>%
unite(ID_var, continent, country, sep = "_") %>%
unite(var_names, obs_type, year, sep = "_") %>%
pivot_wider(names_from = var_names, values_from = obs_values)
gap_wide_new
```
### Challenge 3 (10 minutes)
> Take this 1 step further and create a `gap_ludicrously_wide` format data by pivoting over countries, year and the 3 metrics?
>**Hint** this new dataframe should only have 5 rows.
>
> <details>
>
> <summary>
> Solution to challenge 3
> </summary>
>
> <br />
>```{r}
>gap_ludicrously_wide <- gap_long_split %>%
> unite(var_names, obs_type, year, country, sep = "_") %>%
> pivot_wider(names_from = var_names, values_from = obs_values)
>```
> </details>
Now we have a great 'wide' format dataframe, but the `ID_var` could be more
usable, let's separate it into 2 variables with `separate()`
```{r}
gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
gap_wide_betterID <- gap_long_split %>%
unite(ID_var, continent, country, sep = "_") %>%
unite(var_names, obs_type, year, sep = "_") %>%
pivot_wider(names_from = var_names, values_from = obs_values) %>%
separate(ID_var, c("continent","country"), sep = "_")
gap_wide_betterID
all_equal(gap_wide, gap_wide_betterID)
```
There and back again!
## Other great resources
* [R for Data Science](http://r4ds.had.co.nz/index.html)
* [Data Wrangling Cheat sheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
* [Introduction to tidyr](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
* [Data wrangling with R and RStudio](https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/)