---
title: "Data Wrangling with R"
subtitle: "An Introduction to the Tidyverse"
format: html
toc: true
editor:
  markdown:
    wrap: 72
---
## Learning Objectives
By the end of this session, students will be able to:
- Explain some benefits of learning R
- Understand the difference between R and RStudio
- Navigate RStudio
- Define key R concepts and terminology
- Identify sources of documentation about R packages and functions
- Apply commonly used **`tidyverse`** functions to a real data set
- Become familiar with a typical workflow for exploring and wrangling data
## Why learn R?
- **R is free, open-source, and cross-platform.** Anyone can inspect
the source code to see how R works. Because of this transparency,
there is less chance for mistakes, and if you (or someone else) find
some, you can report and fix bugs. Because R is open source and is
supported by a large community of developers and users, there is a
very large selection of third-party add-on packages which are freely
available to extend R's native capabilities.
- **R code is great for reproducibility**. Reproducibility is when
someone else (including your future self) can obtain the same
results from the same dataset when using the same analysis. R
integrates with other tools to generate manuscripts from your code.
If you collect more data, or fix a mistake in your dataset, the
figures and the statistical tests in your manuscript are updated
automatically.
- **R relies on a series of written commands, not on remembering a
succession of pointing and clicking.** If you want to redo your
analysis because you collected more data, you don't have to remember
which button you clicked in which order to obtain your results; you
just have to run your script again.
- **R is interdisciplinary and extensible.** With 10,000+ packages that
can be installed to extend its capabilities, R provides a framework
that allows you to combine statistical approaches from many
scientific disciplines to best suit the analytical framework you
need to analyze your data. For instance, R has packages for image
analysis, GIS, time series, population genetics, and a lot more.
- **R works on data of all shapes and sizes.** The skills you learn
with R scale easily with the size of your dataset. Whether your
dataset has hundreds or millions of lines, it won't make much
difference to you. R is designed for data analysis. It comes with
special data structures and data types that make handling of missing
data and statistical factors convenient. R can connect to
spreadsheets, databases, and many other data formats, on your
computer or on the web.
- **R produces high-quality graphics.** The plotting functionalities
in R are endless, and allow you to adjust any aspect of your graph
to convey most effectively the message from your data.
- **R has a large and welcoming community.** Thousands of people use R
daily. Many of them are willing to help you through mailing lists
and websites such as [Stack Overflow](https://stackoverflow.com/),
or on the [RStudio community](https://community.rstudio.com/).
Questions which are backed up with [short, reproducible code
snippets](https://www.tidyverse.org/help/) are more likely to
attract knowledgeable responses.
## Starting out in R
[R](https://cran.rstudio.com/) is both a programming language and an
interactive environment for data exploration and statistics.
Working with R is primarily text-based. The basic mode of use for R is
that the user provides commands in the R language and then R computes
and displays the result.
### Downloading, Installing and Running R
**Download**\
R can be downloaded from [CRAN (The Comprehensive R Archive
Network)](https://cran.rstudio.com/index.html) for Windows, Linux, or
Mac.
**Install**\
Installing R works like installing most other software: an installer will
guide you through the steps. Should you have any issues or need help, you can
refer to [R Installation and
Administration](https://cran.r-project.org/doc/manuals/r-release/R-admin.html).
**Running**\
R can be launched from your software or applications launcher. When
working at a command line on UNIX or Windows, typing the command `R`
starts the main R program.
You will see a console similar to this appear:
```{r echo=F}
knitr::include_graphics("images/console.png")
```
While it is possible to work solely through the console or using a
command line interface, the ideal environment to work in R is RStudio.
### RStudio
[RStudio](https://www.rstudio.com/products/rstudio/download/) is a user
interface for working with R. It is called an Integrated Development
Environment (IDE): a piece of software that provides tools to make
programming easier. RStudio acts as a sort of wrapper around the R
language. You can use R without RStudio, but it’s much more limiting.
RStudio makes it easier to import datasets, create and write scripts,
and makes using R much more effective. RStudio is also free and open
source. To function correctly, RStudio needs R and therefore both need
to be installed on your computer.
RStudio is divided into four "panes". The placement of these panes and
their content can be customized (see menu, Tools -\> Global Options -\>
Pane Layout).
The Default Layout is:
- Top Left - **Source**: your scripts and documents
- Bottom Left - **Console**: what R would look and be like without
RStudio
- Top Right - **Environment/History**: look here to see what you have
done
- Bottom Right - **Files** and more: see the contents of the
project/working directory here, like your Script.R file
```{r echo=F}
knitr::include_graphics("images/rstudio.png")
```
### RStudio Projects
RStudio provides a useful feature called `Projects` which act like a container for your work. As you use R more, you will find it useful to make sure your files and environment for one real-world project are kept together and separate from other projects.
Let's create a new project now.
1. Go to `File > New Project`
2. In `Create project from` menu choose `Existing Directory`
3. Browse to `Desktop > Session01_DataWrangling`
4. Select the check box that says `Open in New Session`
::: {.callout-tip appearance="minimal" collapse="true"}
### Follow Along at Home
Posit (RStudio) Cloud is a browser-based version of RStudio. It will
allow you to use RStudio without needing to download anything to your
computer. Posit Cloud automatically organizes things into Projects. You can also easily share your R projects with others.
Get Started:
1. Create your free RStudio Cloud account at
<https://posit.cloud/plans/free>.
2. Go to the class project <https://posit.cloud/content/8458222>
3. Note the text that marks this as a Temporary Copy. Select the
`Save a Permanent Copy` button to begin working!
:::
## R Scripts
A script is a text file in which you write your code. R scripts are generally recognized by the `.R` file extension. Scripts make it easy to re-run that code when you need to. In addition to code, your script can also have comments, which start with a `#` symbol. These comments make your script more human readable, but are ignored by the computer.
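For example, the first few lines of a simple script might look like this (a minimal sketch, not part of our lesson files):
```{r}
# Comments start with the # symbol and are ignored when the code runs
# Add two numbers
2 + 2
```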
To get started in this lesson - open up the script in your RStudio Project called `01_DataWrangling.R`
## Welcome to the Tidyverse
In this lesson, we will be using a group of packages which are part of what is known as the **`tidyverse`** - "an opinionated
collection of R packages designed for data science. All packages share
an underlying design philosophy, grammar, and data structures."[^1],
developed by Hadley Wickham.
[^1]: https://www.tidyverse.org/
::: {.callout-important}
## What is a package?
As mentioned above, R is *extensible* and packages are the way to *extend* the base functionality of R. Each package is a collection of **functions**, code, data, and documentation. Packages are specialized to accomplish a particular set of tasks. Users can easily install packages from package repositories, such as the central repository [CRAN](https://cran.r-project.org/) (Comprehensive R Archive Network) and [Bioconductor](https://www.bioconductor.org/), an important source of bioinformatics packages.
The sheer number of R packages can seem overwhelming to a beginner and a common question we hear is, "But how do I know what package to use?". One place to start is to take a look at [CRAN Task Views](https://cran.r-project.org/web/views/), which organizes packages by topic. You can also try an internet search like "How do I do X in R" and this will typically lead you to solutions that mention the packages you need to accomplish the task.
One reason we are focusing on the **tidyverse** packages in this class is because they are so versatile and might be the only packages you need for much of what you want to do in R.
:::
The **tidyverse** packages we will be using include:
- **`readr`** for importing data into R
- **`dplyr`** for handling common data wrangling tasks for tabular
data
- **`tidyr`** which enables you to swiftly convert between different
data formats (long vs. wide) for plotting and analysis
- **`lubridate`** for working with dates
- **`ggplot2`** for visualizing data (we'll explore this package in
the next session).
For the full list of tidyverse packages and documentation, visit
[tidyverse.org](https://www.tidyverse.org/). You can install these
packages individually, or you can install the entire tidyverse in one
go.
## Installing and loading packages
When you first install R on your computer, it comes with a set of built-in packages and functions collectively referred to as Base R. To add additional packages, you must first install that package, and then load it into your current session. If you are taking this workshop in person at the library, or using the [class Posit Cloud project](https://posit.cloud/content/8458222), the **`tidyverse`** has already been installed, so we just need to load it. You only need to install a package once on a system, but you will load it each time you start a new R session. If the package had not already been installed, we would install it with a function called `install.packages()`.
```{r}
#| warning: false
#install tidyverse if you haven't yet
#install.packages("tidyverse")
#load tidyverse
library(tidyverse)
```
## Functions
`install.packages()` and `library()` are two examples of **functions**.
Functions are one of the most important components of R code. A function
is like a canned script. It usually takes some inputs, called
**arguments** inside the parentheses that follow the name of the function,
performs one or more tasks, and often returns some kind of output. The
`library()` function takes the name of the package to load as its
argument.
How do you know what arguments a function takes? For that you need to
turn to the documentation of a particular package, or from within
RStudio you can look up a function with `?function-name`. Let's try it
with the `library()` function.
```{r}
#| eval: false
?library
```
This opens the help pane in the lower right corner of RStudio. The
documentation provides you with all the arguments and any default
values, along with explanations of the arguments. Here we see that the
`library()` function has the argument `package` with no default.
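For example, the base R function `round()` has a required argument (the number to round) and an optional `digits` argument that defaults to 0:
```{r}
# with only the required argument, the default digits = 0 is used
round(3.14159)
# supplying the optional argument by name changes the result
round(3.14159, digits = 2)
```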
## What is Tidy Data?
The **`tidyverse`** is so named from the concept of "tidy data". Data is considered "tidy" if it follows three rules:
1. Each column is a variable
2. Each row is an observation
3. Each cell is a single value[^2]
[^2]: read more about tidy data
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
Data "in the wild" often isn't tidy, but the tidyverse packages can help
you create and analyze tidy datasets.
```{r echo=F, fig.cap="tidy data structure^[image from R for Data Science https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure]"}
knitr::include_graphics("images/tidy-data.PNG")
```
## The Data for This Lesson
For this lesson we will be using data which comes from [Project Tycho](https://www.tycho.pitt.edu/) - an open data project from the University of Pittsburgh which provides standardized datasets on numerous diseases to aid global health research.
Throughout this lesson, we will be using a dataset from Project Tycho featuring historical counts of measles cases in the U.S. We want to clean and present this data in a way that makes it easy to see how measles cases fluctuated over time.
A useful feature of Project Tycho data is their use of a common set of variables. Read more about their [data format](https://www.tycho.pitt.edu/dataformat/ProjectTychoPreCompiledDataFormat.pdf).
## Importing data
Now that the **`tidyverse`** is loaded, we can use it to import some data
into our RStudio session. We are using a function from the **`readr`**
package called `read_csv()`. This function takes as an argument the path
to where the file is located. Let's start by reading in the `measles_us.csv` file in the `/data` folder.
```{r}
read_csv("data/measles_us.csv")
```
But doing this just gives us a preview of the data in the console. To
really use the data, we need to **assign** it to an **object**. An
object is like a container for a numerical value, string, data set,
image, and much more. Just about everything in R is an object. You might
liken them to variables in other programming languages or in math. We
create an object, by giving our data a name and use the **assignment
operator**, which looks like an arrow `<-`. You can manually type in the
lesser than sign `<` and hyphen `-`, or use the keyboard shortcut
`Alt + -`.
Let's call our new object `measles_us`. Object names should be short and
easy to understand. They can't have spaces, so you'll want to separate
multiple words with an underscore, or by using camel case. Object names
also need to start with a letter, not a number, and it's best to avoid
using names of common functions.
```{r}
#| message: false
measles_us <- read_csv("data/measles_us.csv")
```
When you create an object, it shows up in your environment pane (the
upper right panel). If we check our environment pane, we should now see an object called `measles_us`.
Let's do the same for the `states.csv` file.
```{r}
states <- read_csv(file = "data/states.csv")
```
## Exploring and Summarizing data
Data wrangling, also known as data cleaning or data munging, involves preparing raw data for analysis by transforming it into a more useful format. This process includes detecting and correcting errors, handling missing values, and reorganizing data for analysis. Using the tidyverse, we can streamline these tasks efficiently. After importing the data, you'll typically start by exploring it, identifying patterns, and making necessary adjustments to prepare it for visualization and further analysis. This foundational step ensures that your data is accurate, consistent, and ready for insightful exploration.
### Vectors and Data Frames and Tibbles oh my!
First, it's important to understand the type of object we just created. In R, tabular data like you find in a spreadsheet is stored in a **data frame**, one of the fundamental **data structures** in R. A data frame is a rectangular, two-dimensional data structure. That is, it has both columns and rows. Data frames can store multiple **data types**, such as numeric, character, and logical data, also known as classes.
A **tibble** is a tidyverse version of the standard R data frame. For our purposes, the differences are minor enough that we can generally use the terms interchangeably, but to be precise, we will be working with tibbles in this lesson. All tibbles are data frames, but not all data frames are tibbles.
Another important R data structure is a **vector**. A vector is a one-dimensional data structure. That is, it is simply a sequence of elements. A vector can have only one data type. Data frames are created from multiple vectors, that is, each column in a data frame is a vector of the same length.
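Here is a quick sketch of these structures using a few made-up values:
```{r}
# a vector: one-dimensional, all elements share one data type
state_names <- c("MARYLAND", "DELAWARE", "VIRGINIA")
class(state_names)
# a tibble: columns are vectors of equal length; rows are observations
example_tbl <- tibble(state = state_names, cases = c(10, 25, 3))
class(example_tbl)
```
Notice that the tibble's class includes `data.frame`, confirming that every tibble is also a data frame.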
### Base R functions for exploring data
`View()` opens the data in a viewer tab in the Source pane. This is a good way to see the data in a familiar spreadsheet-like format.
```{r}
#| eval: false
#| echo: true
View(measles_us)
```
Use `summary()` to look at each column: it reports the type of each column, plus summary statistics (minimum, quartiles, mean, and maximum) for numeric columns.
```{r}
#| results: false
summary(measles_us)
```
::: {.callout-note collapse="true" title="Missing data"}
Sometimes we get data with a large number of missing values. It can be helpful to know where data is missing before attempting to do any further analysis. R uses `NA` to indicate missing values. We can use the function `is.na()` to test for the presence of NAs in our data. `is.na()` will return a vector of values TRUE or FALSE: TRUE if the value is NA, FALSE if it is not. When we examined our data with the `View()` function, we might have noticed that the first several values in the Admin2Name column are missing (NA). We might want to know how many missing values total are in that column.
```{r}
#| results: false
is.na(measles_us$Admin2Name)
```
After running this code you should see `TRUE` printed out repeatedly in the console. R is running through that column and printing TRUE whenever it runs into a missing value. But this still does not help us get the total number of NAs. To do that we need to nest the above code in another function `sum()`.
```{r}
sum(is.na(measles_us$Admin2Name))
```
`sum()` treats each TRUE as a 1 and each FALSE as a 0. In that column there are `r sum(is.na(measles_us$Admin2Name))` missing values out of `r length(measles_us$Admin2Name)` total values.
But if you have a lot of variables (columns), it would be a pain to do this for each one. Instead we can use a similar function, `colSums()`.
```{r}
colSums(is.na(measles_us))
```
:::
### **`tidyverse`** functions for exploring data
The `glimpse()` function, which is part of the **`tidyverse`** package
**`tibble`**, lets you see the column names and data types clearly.
```{r glimpse}
#| results: false
glimpse(measles_us)
```
`distinct()` returns the distinct rows in a tibble. It can be used on a column to return the distinct values in that column. The first
argument you supply is the tibble object. Subsequent arguments
are the variables whose distinct values you want to see.
```{r}
distinct(measles_us, ConditionName)
```
```{r}
distinct(measles_us, Admin1Name)
```
`count()` is similar to `distinct()` but also returns the number of observations
(i.e. rows) for each of those distinct values. The first
argument you supply is the tibble object. Subsequent arguments
include the variables you want to count.
```{r}
count(measles_us, Admin1Name)
```
::: {.callout-tip title="Make code flow with the pipe %>%"}
Before we go any further - I want to introduce you to an important
time-saving symbol in R called the pipe `%>%` (`CTRL + SHIFT + M`). The
pipe allows you to take the output of the left-hand expression and make
it the input of the right-hand expression. It allows you to chain
together multiple functions and avoid nesting. With the pipe, we can
rewrite the above code as follows:
```{r}
measles_us %>%
count(Admin1Name)
```
In many tidyverse functions, the first argument is the name of the data
frame you're applying the function to. So when you use the pipe, you'll
generally start a line of code with the name of a tibble. One
benefit you might notice right away, is that when we use the pipe,
RStudio will supply the column names which helps to reduce typing and
typos.
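For instance, the nested code we wrote earlier to count missing values can be rewritten as a left-to-right pipeline (both versions below give the same answer):
```{r}
# nested form: read from the inside out
sum(is.na(measles_us$Admin2Name))
# piped form: read from left to right
measles_us$Admin2Name %>% is.na() %>% sum()
```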
:::
### Try it Yourself!
::: {.callout-tip title="CHALLENGE" }
Now let's try exploring the `states` tibble in our environment
1. Use `glimpse()` to inspect the columns and data types in the dataset.
1. Use `distinct()` to find out the distinct values in the `region` column.
1. Using `count()`, find out how many states are in each region.
1. Using `count()`, find out how many states are in each region AND division. HINT: You can add additional column names to `distinct()` and `count()` to look at combinations of columns.
:::: {.callout-tip title="Solution" icon=false collapse="true"}
1. `glimpse(states)`
1. `states %>% distinct(region)`
1. `states %>% count(region)`
1. `states %>% count(region, division)`
::::
:::
## Subsetting data with `select()` and `filter()`
Real data sets can be quite large. So, once you've explored your data, you may want to start trimming it down to just the variables and conditions you're interested in. In
this section, we'll look at two functions from the **`tidyverse`**
package called **`dplyr`**: `select()` which lets you choose columns
(variables) and `filter()` which lets you choose rows. (Note: `dplyr` is
known for using easy to understand verbs for its function names.)
### `select()`
`select()` lets you choose columns by name. The syntax of this function is similar to the ones we've already learned, `count()` and `distinct()`. We need to supply the function with the name of the tibble and the columns. This will create a new tibble with just those columns.
As with all tidyverse functions, we can use `%>%` to make this easier.
```{r}
measles_us %>%
select(Admin1Name, CountValue)
```
If you want to select several columns that are next to each other, you can use `:` to specify a range, rather than writing each name out separately.
```{r}
measles_us %>%
  select(ConditionName:Admin1ISO)
```
Now, let’s think through which columns we want for our analysis and save this to a new object called `measles_select`. It’s always a good idea to create new objects when you make major changes to your data.
For this exercise, we want to look at trends in the number of measles cases over time. To do that, we'll need to keep our `CountValue` variable and the date variables (`PeriodStartDate` and `PeriodEndDate`), as well as the `PartOfCumulativeCountSeries` variable, which will help us understand how to use the dates (more on this later). The first five columns each have only one value. So it might be redundant to keep those, although if we were combining them with other Project Tycho datasets they could be useful. It might be interesting to get a state-level view of the data, so let's keep `Admin1Name`. But we saw that there are a number of missing values in our `Admin2Name` and `CityName` variables, so they might not be very useful for our analysis.
```{r}
measles_select <-
measles_us %>%
select(
Admin1Name,
PeriodStartDate,
PeriodEndDate,
PartOfCumulativeCountSeries,
CountValue
)
```
Sometimes when you receive a data set or start working with data, you may find that the column names are overly long or not very descriptive or useful, and it may be necessary to rename them. For this, we can use the `rename()` function. Like naming objects, you should use a simple, descriptive, relatively short name without spaces for your column names. Let's rename `Admin1Name` to `state` to make it more meaningful to us. `rename()` has the syntax `rename(newColumnName = OldColumnName)`.
```{r}
measles_select <-
measles_select %>%
rename(state = Admin1Name)
```
Note that in this case, we are overwriting our original object with the new name instead of creating a new one!
### `filter()`
While `select()` acts on columns, `filter()` acts on rows. `filter()` takes the name of the tibble and one or more logical
conditions as arguments.
```{r}
measles_md <- measles_select %>%
filter(state == "MARYLAND")
```
Here we are saying keep all the rows where the
value in the `state` column is "MARYLAND". Note the use of the double equal sign `==` versus the single equal sign `=`. The double equal sign is a **logical operator**. The logical operators are:
| operator | meaning |
|-----------|--------------------------|
| == | exactly equal |
| != | not equal to |
| \< | less than |
| \<= | less than or equal to |
| \> | greater than |
| \>= | greater than or equal to |
| x\|y | x or y |
| x&y | x and y |
| !x | not x |
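Each comparison returns `TRUE` or `FALSE`. A few quick examples you can try in the console:
```{r}
5 > 3                      # greater than
"MARYLAND" == "maryland"   # string comparison is case-sensitive
10 >= 10 & 2 != 3          # combine conditions with & (and)
```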
Note that after running our code, our resulting tibble (our new object `measles_md`) has `r nrow(measles_md)` observations (rows) while our original tibble had `r nrow(measles_us)`.
::: {.callout-warning}
When matching strings you must be exact. R is case-sensitive. So `state == "Maryland"` or `state == "maryland"` would return 0 rows.
:::
You can add additional conditions to filter by, separated by other logical operators like `&`, `|`, and `>`.
Below we want just the rows for Maryland, and only include periods where the count was more than 500 reported cases. Note that while you need quotation marks around character data, you do not need them around numeric data.
```{r filter-multiple}
measles_select %>%
filter(state == "MARYLAND" & CountValue > 500)
```
Here, we joined together two conditions with the `&` logical operator. Note that we **piped** our `measles_select` tibble into `filter()`, which, like the other tidyverse functions we've seen, takes a tibble as its first argument.
What if we wanted to filter our tibble to include just the 50 states and no territories? We surely would not want to write out an expression for each state, or even for all of the territories.
```{r}
# we can avoid verbose code like this with %in%
measles_select %>%
  filter(state == "MARYLAND" | state == "DELAWARE" | state == "PENNSYLVANIA")
```
Luckily, we can filter based on a vector of values with the `%in%` operator (remember, we can think of a vector as a column of data). So, we can write some code to filter our data based on the list of state names in our `states` tibble.
```{r filter-vector}
measles_states_only <-
measles_select %>%
filter(state %in% states$name)
```
Here we saved the output to a new object, `measles_states_only`. If you check the environment pane, notice how it has fewer rows than our `measles_select` object.
We could alternatively have used negation with the names of the values we specifically wanted to exclude.
```{r}
measles_states_only <- measles_select %>%
filter(!state %in% c("PUERTO RICO", "GUAM", "AMERICAN SAMOA", "NORTHERN MARIANA ISLANDS", "VIRGIN ISLANDS, U.S.", "DISTRICT OF COLUMBIA"))
```
Great! Our dataset is really shaping up. Let's also take a closer look at our date columns. If you look at the first several rows, it looks like each row of our dataset represents roughly one discrete week of measles case counts. But (as you can read in the Tycho data documentation) there are actually two date series in this dataset - non-cumulative and cumulative. Which series a row belongs to is noted by the `PartOfCumulativeCountSeries` variable, which has the value `0` if a row is non-cumulative, and `1` if the row is part of a cumulative count.
To keep things consistent, let's filter our tibble so we only have the non-overlapping discrete weeks.
```{r filter-cumulative}
measles_non_cumulative <-
measles_states_only %>%
  filter(PartOfCumulativeCountSeries == 0)
```
Once again, we have fewer rows than we started with.
### Try it Yourself
:::{.callout-tip title="CHALLENGE"}
1. Use `select()` to create a new tibble with just the `name` and `division` columns from the `states` tibble. Assign this to an object called `us_divisions`.
2. Use `filter()` to keep just the rows in the `South Atlantic` division of the `us_divisions` tibble. Assign this to an object called `sa_division`.
3. Use `filter()` to keep just the rows in the `measles_non_cumulative` tibble where the `state` matches one of the states in the `name` column of the `sa_division` tibble and where the `CountValue` is greater than 1000. Assign this to an object called `measles_sa`.
:::: {.callout-tip title="Solution" icon=false collapse="true"}
1. `us_divisions <- states %>% select(name, division)`
1. `sa_division <- us_divisions %>% filter(division == "South Atlantic")`
1. `measles_sa <- measles_non_cumulative %>% filter(state %in% sa_division$name & CountValue > 1000)`
::::
:::
Now let's do some more with our date variables.
## Changing and creating variables with `mutate()`
Let's review the columns in our `measles_non_cumulative` tibble.
```{r}
glimpse(measles_non_cumulative)
```
We can see from this that the dates are being interpreted as *character* data. We want R to recognize them as dates. We can create new variables and adjust existing variables with the `mutate()` function.
`mutate()` takes as an argument the name and definition of the new column you're creating. Note that if you use the same variable name as an existing variable name it overwrites that column. Otherwise, it will add a column to your tibble.
To change the variable to a date, we are using a date parsing function from another package called **`lubridate`**. `mdy()` takes a character string or number in month-day-year format (as we have here) and returns a formal date object in YYYY-MM-DD format. There are similar functions if the input date is in year-month-day (`ymd()`) or day-month-year (`dmy()`) order.
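For instance, calling `mdy()` on a single made-up date string returns a proper `Date` object:
```{r}
# parse a month-day-year string into a Date (printed as YYYY-MM-DD)
mdy("12/01/1927")
```
Now let's apply it to both date columns in our tibble: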
```{r}
measles_non_cumulative <- measles_non_cumulative %>%
mutate(PeriodStartDate = mdy(PeriodStartDate),
PeriodEndDate = mdy(PeriodEndDate))
```
Note that you can mutate multiple columns at a time, separating each new column definition with a comma.
Now that R recognizes the date columns as Dates, we can do things like extract parts of the date, such as the year. Let's create a separate Year column. Later we'll be able to group our tibble by year for analysis.
```{r}
measles_year <-
measles_non_cumulative %>%
mutate(Year=year(PeriodStartDate))
```
## Grouping and Summarizing
Many data analysis tasks can be approached using the *split-apply-combine* paradigm: split the data into groups, apply some analysis to each group, and then combine the results. **`dplyr`** makes this very easy through the use of the `group_by()` function.
`group_by()` is often used together with `summarize()`, which collapses each group into a single-row summary of that group. `group_by()` takes as arguments the column names that contain the **categorical** variables for which you want to calculate the summary statistics.
How can we calculate the total number of measles cases for each year?
First we need to group our data by year using our new `Year` column.
```{r groupby}
yearly_count <-
measles_year %>%
group_by(Year)
yearly_count
```
When you inspect your new tibble, the data itself should look the same, though you may notice a `Groups:` line in the header. Grouping prepares your data for `summarize()`, but it does not change the underlying data.
Now let's try summarizing that data. `summarize()` condenses each group down to a single summary row. Like `mutate()`, we provide the function with the name of the new column that will hold the summary information. In this case, we will use the `sum()` function on the `CountValue` column and put this in a new column called `TotalCount`. `summarize()` will drop the columns that aren't being used.
```{r count-by-year}
yearly_count <-
measles_year %>%
group_by(Year) %>%
summarise(TotalCount = sum(CountValue))
yearly_count
```
A more useful view might be to look at yearly totals of case counts by state. We can group by two variables: first `Year`, then `state`.
```{r count-by-state}
yearly_count_state <-
measles_year %>%
group_by(Year, state) %>%
summarise(TotalCount = sum(CountValue))
yearly_count_state
```
Notice how the use of pipes really comes in handy here. It saved us from having to create and keep track of a number of intermediate objects.
## Sorting datasets with `arrange()`
Which state in which year had the highest case count? To easily find out, we can use the function `arrange()`, which sorts the rows by the column(s) you give it.
```{r arrange}
yearly_count_state %>% arrange(TotalCount)
```
By default, `arrange()` sorts in ascending order. To sort in descending order, we use it together with the `desc()` function.
```{r arrange-desc}
yearly_count_state %>% arrange(desc(TotalCount))
```
## Joining Datasets
Of course, looking at total counts in each state is not the most helpful metric without taking population into account. To rectify this, let's try joining some historical population data with our measles data.
First we need to import the population data^[population data retrieved from the FRED, the Federal Reserve Bank of St. Louis Economic Data, https://fred.stlouisfed.org/release/tables?rid=118&eid=259194].
```{r}
hist_pop <-
read_csv("data/Historical_Population_by_State.csv")
```
## Long vs Wide formats
Remember that for data to be considered "tidy", it should be in what is called "long" format. Each column is a variable, each row is an observation, and each cell is a value. Our state population data is in "wide" format, because State Name is being treated as a variable, when it is really a value. Wide data is often preferable for human-readability, but is less ideal for machine-readability. To be able to join this data to our measles dataset, it needs to have 3 columns - Year, State Name, and Population.
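Here is a tiny made-up example of the same information in wide and long form (the numbers are invented just for illustration):
```{r}
# "wide": each state is its own column, so state names live in the header
wide_example <- tibble(
  year    = c(1930, 1931),
  ALASKA  = c(59, 60),
  WYOMING = c(225, 230)
)
wide_example
# "long"/tidy: one row per year-state combination; state is a value in a column
long_example <- tibble(
  year    = c(1930, 1930, 1931, 1931),
  state   = c("ALASKA", "WYOMING", "ALASKA", "WYOMING"),
  pop1000 = c(59, 225, 60, 230)
)
long_example
```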
We will use the package **`tidyr`** and the function `pivot_longer` to convert our population data to a long format, thus making it easier to join with our measles data.
Each column in our population dataset represents a state. To make it tidy we are going to reduce those to one column called `state`, with the state names as the values of that column. We will then need to create a new column for population containing the current cell values. To remember that the population data is provided in 1000s of persons, we will call this new column `pop1000`.
`pivot_longer()` takes four principal arguments:
1. the data
2. *cols* are the names of the columns we use to fill the new values variable (or to drop).
3. the *names_to* column variable we wish to create from the *cols* provided.
4. the *values_to* column variable we wish to create and fill with values associated with the *cols* provided.
```{r}
library(tidyr)
hist_pop_long <- hist_pop %>%
pivot_longer(ALASKA:WYOMING,
names_to = "state",
values_to = "pop1000")
```
```{r eval=FALSE}
View(hist_pop_long)
```
Now our two datasets have similar structures: a column of state names, a column of years, and a column of values. Let's join these two datasets by the state and year columns. Note that if both datasets use exactly the same column names, you do not need to specify anything in the `by` argument; here the year column is called `Year` in one tibble and `DATE` in the other, so we spell out the match with `join_by()`. We use a left join here, which preserves all the rows in our measles dataset and adds the matching rows from the population dataset.
```{r}
measles_joined <- yearly_count_state %>%
left_join(hist_pop_long, by=join_by(state, Year == DATE))
measles_joined
```
::: {.callout-tip title="CHALLENGE"}
1. Use `mutate()` to calculate the rate of measles per 100,000 persons (remember population is given in 1000s).
1. Try joining `measles_yearly_rates` to `states`. What variable do you need to join by?
:::: {.callout-tip title="Solution" icon=false collapse="true"}
```{r}
# 1.
measles_yearly_rates <-
measles_joined %>%
mutate(epi_rate = (TotalCount / pop1000)*100)
# 2.
yearly_rates_joined <- measles_yearly_rates %>%
left_join(states, by = join_by(state == name))
```
::::
:::
Now our data is ready to be visualized!