-
Notifications
You must be signed in to change notification settings - Fork 24
/
group_by.qmd
177 lines (123 loc) · 6.56 KB
/
group_by.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
# Grouping and summarizing
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("dplyr"))
```
## Summarizing data by group
Let's first create a dataframe listing the average delay time in minutes, by day of the week and by quarter, for Logan airport's 2014 outbound flights.
```{r}
df <- data.frame(
Weekday = factor(rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), each = 4),
levels = c("Mon", "Tues", "Wed", "Thurs", "Fri")),
Quarter = paste0("Q", rep(1:4, each = 5)),
Delay = c(9.9, 5.4, 8.8, 6.9, 4.9, 9.7, 7.9, 5, 8.8, 11.1, 10.2, 9.3, 12.2,
10.2, 9.2, 9.7, 12.2, 8.1, 7.9, 5.6))
```
The goal will be to summarize the table by `Weekday` as shown in the following graphic.
<img src="img/Summarize_by_one_variable.png" style="width: 73%; height: auto;">
The data table has three variables: `Weekday`, `Quarter` and `Delay`. `Delay` is the value we will summarize which leaves us with one variable to *collapse*: `Quarter`. In doing so, we will compute the `Delay` statistics for all quarters associated with a unique `Weekday` value.
This workflow requires two operations: a grouping operation using the `group_by` function and a summary operation using the `summarise`/`summarize` function. Here, we'll compute two summary statistics: minimum delay time and maximum delay time.
```{r}
library(dplyr)
df %>%
group_by(Weekday) %>%
summarise(min_delay = min(Delay), max_delay = max(Delay))
```
Note that the weekday follows the chronological order as defined in the `Weekday` factor.
You'll also note that the output is a `tibble`. This data class is discussed at the end of this page.
### Grouping by multiple variables
You can group by more than one variable. For example, let's build another dataframe listing the average delay time in minutes, by quarter, by weekend/weekday and by inbound/outbound status for Logan airport's 2014 outbound flights.
```{r}
df2 <- data.frame(
Quarter = paste0("Q", rep(1:4, each = 4)),
Week = rep(c("Weekday", "Weekend"), each=2, times=4),
Direction = rep(c("Inbound", "Outbound"), times=8),
Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5,
3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))
```
The goal will be to summarize the delay time by `Quarter` and by `Week` type as shown in the following graphic.
<img src="img/Summarize_by_two_variable.png" style="width: 73%; height: auto;">
This time, the data table has four variables. We are wanting to summarize by `Quater` and `Week` which leaves one variable, `Direction`, that needs to be collapsed.
```{r}
df2 %>%
group_by(Quarter, Week) %>%
summarise(min_delay = min(Delay), max_delay = max(Delay))
```
The following section demonstrates other grouping/summarizing operations on a larger dataset.
## A working example
The data file *FAO_grains_NA.csv* will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014.
Run the following line to load the FAO data file into your current R session.
```{r}
dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)
```
Make sure to load the `dplyr` package before proceeding with the following examples.
```{r, message=FALSE}
library(dplyr)
```
### Summarizing by crop type
The `group_by` function will split any operations applied to the dataframe into groups defined by one or more columns. For example, if we wanted to get the minimum and maximum years from the `Year` column for which crop data are available *by crop type*, we would type the following:
```{r}
dat %>%
group_by(Crop) %>%
summarise(yr_min = min(Year), yr_max=max(Year))
```
### Count the number of records in each group
In this example, we are identifying the number of records by `Crop` type. There are two ways this can be accomplished:
```{r eval = FALSE}
dat %>%
filter(Information == "Yield (Hg/Ha)",
Year >= 2005 & Year <=2010,
Country=="United States of America") %>%
group_by(Crop) %>%
count()
```
Or,
```{r}
dat %>%
filter(Information == "Yield (Hg/Ha)",
Year >= 2005 & Year <=2010,
Country=="United States of America") %>%
group_by(Crop) %>%
summarise(Count = n())
```
The former uses the `count()` function and the latter uses the `summarise()` and `n()` functions.
### Summarize by mean yield and year range
Here's another example where *two* variables are summarized in a single pipe.
```{r}
dat.grp <- dat %>%
filter(Information == "Yield (Hg/Ha)",
Year >= 2005 & Year <=2010,
Country=="United States of America") %>%
group_by(Crop) %>%
summarise( Yield = mean(Value), `Number of Years` = max(Year) - min(Year))
dat.grp
```
### Normalizing each value in a group by the group median
In this example, we are subtracting each value in a group by that group's median. This can be useful in identifying which year yields are higher than or lower than the median yield value within each crop group. We will concern ourselves with US yields only and sort the output by crop type. We'll save the output dataframe as `dat2`.
```{r}
dat2 <- dat %>%
filter(Information == "Yield (Hg/Ha)",
Country == "United States of America") %>%
select(Crop, Year, Value) %>%
group_by(Crop) %>%
mutate(NormYield = Value - median(Value)) %>%
arrange(Crop)
```
Let's plot the normalized yields by year for `Barley` and add a `0` line representing the (normalized) central value.
```{r fig.width=5, fig.height=3,small.mar=TRUE}
plot( NormYield ~ Year, dat2[dat2$Crop == "Barley",] )
abline(h = 0, col="red")
```
The relative distribution of points does not change, but the values do (they are re-scaled) allowing us to compare values based on some localized (group) context. This technique will prove very useful later on in the course when EDA topics are explored.
### `dplyr`'s output data structure
Some of `dplyr`'s functions such as `group_by`/`summarise` generate a **tibble** data table. For example, the `dat.grp` object created in the last chunk of code is associated with a `tb_df` (a tibble).
```{r}
class(dat.grp)
```
A *tibble* table will behave a little differently than a *data frame* table when printing to a screen or subsetting its elements. In most cases, a tibble rendering of the table will not pose a problem in a workflow, however, this format may prove problematic with some older functions. To remedy this, you can force the `dat.grp` object to a standalone `dataframe` as follows:
```{r}
dat.df <- as.data.frame(dat.grp)
class(dat.df)
```