group_by.qmd

# Grouping and summarizing

```{r echo=FALSE}
source("libs/Common.R")
```


```{r echo = FALSE}
pkg_ver(c("dplyr"))
```


## Summarizing data by group


Let's first create a dataframe listing the average delay time in minutes, by day of the week and by quarter, for Logan airport's 2014 outbound flights.

```{r}
df <- data.frame(
  Weekday = factor(rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), each = 4), 
                   levels = c("Mon", "Tues", "Wed", "Thurs", "Fri")),
  Quarter = paste0("Q", rep(1:4, each = 5)), 
  Delay = c(9.9, 5.4, 8.8, 6.9, 4.9, 9.7, 7.9, 5, 8.8, 11.1, 10.2, 9.3, 12.2,
            10.2, 9.2, 9.7, 12.2, 8.1, 7.9, 5.6))
```

The goal will be to summarize the table by `Weekday` as shown in the following graphic. 

<img src="img/Summarize_by_one_variable.png" style="width: 73%; height: auto;">

The data table has three variables: `Weekday`, `Quarter` and `Delay`. `Delay` is the value we will summarize which leaves us with one variable to *collapse*: `Quarter`. In doing so, we will compute the `Delay` statistics for all quarters associated with a unique `Weekday` value.

This workflow requires two operations: a grouping operation using the `group_by` function and a summary operation using the `summarise`/`summarize` function. Here, we'll compute two summary statistics: minimum delay time and maximum delay time.

```{r}
library(dplyr)

df %>% 
  group_by(Weekday) %>% 
  summarise(min_delay = min(Delay), max_delay = max(Delay))
```

Note that the weekday follows the chronological order as defined in the `Weekday` factor.

You'll also note that the output is a `tibble`. This data class is discussed at the end of this page.

### Grouping by multiple variables

You can group by more than one variable. For example, let's build another dataframe listing the average delay time in minutes, by quarter, by weekend/weekday and by inbound/outbound status for Logan airport's 2014 outbound flights.

```{r}
df2 <- data.frame(
  Quarter = paste0("Q", rep(1:4, each = 4)), 
  Week = rep(c("Weekday", "Weekend"), each=2, times=4),
  Direction = rep(c("Inbound", "Outbound"), times=8),
  Delay = c(10.8, 9.7, 15.5, 10.4, 11.8, 8.9, 5.5, 
            3.3, 10.6, 8.8, 6.6, 5.2, 9.1, 7.3, 5.3, 4.4))
```

The goal will be to summarize the delay time by `Quarter` and by `Week` type as shown in the following graphic. 

<img src="img/Summarize_by_two_variable.png" style="width: 73%; height: auto;">

This time, the data table has four variables. We are wanting to summarize by `Quater` and `Week` which leaves one variable, `Direction`, that needs to be collapsed. 

```{r}
df2 %>% 
  group_by(Quarter, Week) %>% 
  summarise(min_delay = min(Delay), max_delay = max(Delay))
```

The following section demonstrates other  grouping/summarizing operations on a larger dataset.

## A working example

The data file  *FAO_grains_NA.csv* will be used in this exercise. This dataset consists of grain yield and harvest year by North American country. The dataset was downloaded from http://faostat3.fao.org/ in June of 2014. 

Run the following line to load the FAO data file into your current R session.

```{r}
dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)
```

Make sure to load the `dplyr` package before proceeding with the following examples. 

```{r, message=FALSE}
library(dplyr)
```

### Summarizing by crop type

The `group_by` function will split any operations applied to the dataframe into groups defined by one or more columns. For example, if we wanted to get the minimum and maximum years from the `Year` column for which crop data are available *by crop type*, we would type the following:

```{r}
dat %>% 
  group_by(Crop) %>% 
  summarise(yr_min = min(Year), yr_max=max(Year))
```

### Count the number of records in each group

In this example, we are identifying the number of records by `Crop` type. There are two ways this can be accomplished:

```{r eval = FALSE}
dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Year >= 2005 & Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  count()
```

Or,

```{r}
dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Year >= 2005 & Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  summarise(Count = n())
```

The former uses the `count()` function and the latter uses the `summarise()` and `n()` functions.

### Summarize by mean yield and year range

Here's another example where *two* variables are summarized in a single pipe.

```{r}
dat.grp <- dat %>%
  filter(Information == "Yield (Hg/Ha)", 
         Year >= 2005 & Year <=2010, 
         Country=="United States of America") %>%
  group_by(Crop) %>%
  summarise( Yield = mean(Value), `Number of Years` = max(Year) - min(Year)) 

dat.grp
```

### Normalizing each value in a group by the group median

In this example, we are subtracting each value in a group by that group's median. This can be useful in identifying which year yields are higher than or lower than the median yield value within each crop group. We will concern ourselves with US yields only and sort the output by crop type. We'll save the output dataframe as `dat2`.

```{r}
dat2 <- dat %>% 
  filter(Information == "Yield (Hg/Ha)",
         Country == "United States of America") %>%
  select(Crop, Year, Value)                     %>%
  group_by(Crop)                                %>%
  mutate(NormYield = Value - median(Value))     %>%
  arrange(Crop)
```

Let's plot the normalized yields by year for `Barley` and add a `0` line representing the (normalized) central value.

```{r fig.width=5, fig.height=3,small.mar=TRUE}
plot( NormYield ~ Year, dat2[dat2$Crop == "Barley",] )
abline(h = 0, col="red")
```

The relative distribution of points does not change, but the values do (they are re-scaled) allowing us to compare values based on some localized (group) context. This technique will prove very useful later on in the course when EDA topics are explored.

### `dplyr`'s output data structure

Some of `dplyr`'s functions such as `group_by`/`summarise` generate a **tibble** data table. For example, the `dat.grp` object created in the last chunk of code is associated with a `tb_df` (a tibble).

```{r}
class(dat.grp)
```

A *tibble* table will behave a little differently than a *data frame* table when printing to a screen or subsetting its elements. In most cases, a tibble rendering of the table will not pose a problem in a workflow, however, this format may prove problematic with some older functions. To remedy this, you can force the `dat.grp` object to a standalone `dataframe` as follows:

```{r}
dat.df <- as.data.frame(dat.grp)
class(dat.df)
```