---
title: "Data Wrangling with R"
subtitle: "An Introduction to the Tidyverse"
format: html
toc: true
editor:
  markdown:
    wrap: 72
---
## Learning Objectives
By the end of this session, students will be able to:
- Explain some benefits of learning R
- Understand the difference between R and RStudio
- Navigate RStudio
- Define key R concepts and terminology
- Identify sources of documentation about R packages and functions
- Apply commonly used **`tidyverse`** functions to a real data set
- Become familiar with a typical workflow for exploring and wrangling data
## Why learn R?
- **R is free, open-source, and cross-platform.** Anyone can inspect
the source code to see how R works. Because of this transparency,
there is less chance for mistakes, and if you (or someone else) find
some, you can report and fix bugs. Because R is open source and is
supported by a large community of developers and users, there is a
very large selection of third-party add-on packages which are freely
available to extend R's native capabilities.
- **R code is great for reproducibility**. Reproducibility is when
someone else (including your future self) can obtain the same
results from the same dataset when using the same analysis. R
integrates with other tools to generate manuscripts from your code.
If you collect more data, or fix a mistake in your dataset, the
figures and the statistical tests in your manuscript are updated
automatically.
- **R relies on a series of written commands, not on remembering a
succession of pointing and clicking.** If you want to redo your
analysis because you collected more data, you don't have to remember
which button you clicked in which order to obtain your results; you
just have to run your script again.
- **R is interdisciplinary and extensible.** With 10,000+ packages that
can be installed to extend its capabilities, R provides a framework
that allows you to combine statistical approaches from many
scientific disciplines to best suit the analytical framework you
need to analyze your data. For instance, R has packages for image
analysis, GIS, time series, population genetics, and a lot more.
- **R works on data of all shapes and sizes.** The skills you learn
with R scale easily with the size of your dataset. Whether your
dataset has hundreds or millions of lines, it won't make much
difference to you. R is designed for data analysis. It comes with
special data structures and data types that make handling of missing
data and statistical factors convenient. R can connect to
spreadsheets, databases, and many other data formats, on your
computer or on the web.
- **R produces high-quality graphics.** The plotting functionalities
in R are endless, and allow you to adjust any aspect of your graph
to convey most effectively the message from your data.
- **R has a large and welcoming community.** Thousands of people use R
daily. Many of them are willing to help you through mailing lists
and websites such as [Stack Overflow](https://stackoverflow.com/),
or on the [RStudio community](https://community.rstudio.com/).
Questions which are backed up with [short, reproducible code
snippets](https://www.tidyverse.org/help/) are more likely to
attract knowledgeable responses.
## Starting out in R
[R](https://cran.rstudio.com/) is both a programming language and an
interactive environment for data exploration and statistics.
Working with R is primarily text-based. The basic mode of use for R is
that the user provides commands in the R language and then R computes
and displays the result.
### Downloading, Installing and Running R
**Download**\
R can be downloaded from [CRAN (The Comprehensive R Archive
Network)](https://cran.rstudio.com/index.html) for Windows, Linux, or
Mac.
**Install**\
Installing R works like installing most other software: an installer will
guide you through the steps. Should you have any issues or need help, you can
refer to [R Installation and
Administration](https://cran.r-project.org/doc/manuals/r-release/R-admin.html).
**Running**\
R can be launched from your software or applications launcher. When
working at a command line on UNIX or Windows, typing the command `R`
starts the main R program.
You will see a console similar to this appear:
```{r echo=F}
knitr::include_graphics("images/console.png")
```
While it is possible to work solely through the console or using a
command line interface, the ideal environment to work in R is RStudio.
### RStudio
[RStudio](https://www.rstudio.com/products/rstudio/download/) is a user
interface for working with R. It is called an Integrated Development
Environment (IDE): a piece of software that provides tools to make
programming easier. RStudio acts as a sort of wrapper around the R
language. You can use R without RStudio, but it’s much more limiting.
RStudio makes it easier to import datasets, create and write scripts,
and makes using R much more effective. RStudio is also free and open
source. To function correctly, RStudio needs R and therefore both need
to be installed on your computer.
RStudio is divided into four "panes". The placement of these panes and
their content can be customized (see menu, Tools -\> Global Options -\>
Pane Layout).
The Default Layout is:
- Top Left - **Source**: your scripts and documents
- Bottom Left - **Console**: what R would look and be like without
RStudio
- Top Right - **Environment/History**: look here to see what you have
done
- Bottom Right - **Files** and more: see the contents of the
project/working directory here, like your Script.R file
```{r echo=F}
knitr::include_graphics("images/rstudio.png")
```
### RStudio Projects
RStudio provides a useful feature called `Projects` which act like a container for your work. As you use R more, you will find it useful to make sure your files and environment for one real-world project are kept together and separate from other projects.
Let's create a new project now.
1. Go to `File > New Project`
2. In `Create project from` menu choose `Existing Directory`
3. Browse to `Desktop > Session01_DataWrangling`
4. Select the check box that says `Open in New Session`
::: {.callout-tip appearance="minimal" collapse="true"}
### Follow Along at Home
Posit (RStudio) Cloud is a browser-based version of RStudio. It will
allow you to use RStudio without needing to download anything to your
computer. Posit Cloud automatically organizes things into Projects. You can also easily share your R projects with others.
Get Started:
1. Create your free RStudio Cloud account at
<https://posit.cloud/plans/free>.
2. Go to the class project <https://posit.cloud/content/8458222>
3. Note the text that marks this as a Temporary Copy. Select the
`Save a Permanent Copy` button to begin working!
:::
## R Scripts
A script is a text file in which you write your code. R scripts are generally recognized by the `.R` file extension. Scripts make it easy to re-run that code when you need to. In addition to code, your script can also have comments, which start with a `#` symbol. These comments make your script more human readable, but are ignored by the computer.
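For example, the first few lines of a simple script might look like this (a minimal sketch, not part of our lesson files):
```{r}
# Comments start with the # symbol and are ignored when the code runs
# Add two numbers
2 + 2
```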
To get started in this lesson - open up the script in your RStudio Project called `01_DataWrangling.R`
## Welcome to the Tidyverse
In this lesson, we will be using a group of packages which are part of what is known as the **`tidyverse`** - "an opinionated
collection of R packages designed for data science. All packages share
an underlying design philosophy, grammar, and data structures."[^1],
developed by Hadley Wickham.
[^1]: https://www.tidyverse.org/
::: {.callout-important}
## What is a package?
As mentioned above, R is *extensible* and packages are the way to *extend* the base functionality of R. Each package is a collection of **functions**, code, data, and documentation. Packages are specialized to accomplish a particular set of tasks. Users can easily install packages from package repositories, such as the central repository [CRAN](https://cran.r-project.org/) (Comprehensive R Archive Network) and [Bioconductor](https://www.bioconductor.org/), an important source of bioinformatics packages.
The sheer number of R packages can seem overwhelming to a beginner and a common question we hear is, "But how do I know what package to use?". One place to start is to take a look at [CRAN Task Views](https://cran.r-project.org/web/views/), which organizes packages by topic. You can also try an internet search like "How do I do X in R" and this will typically lead you to solutions that mention the packages you need to accomplish the task.
One reason we are focusing on the **tidyverse** packages in this class is because they are so versatile and might be the only packages you need for much of what you want to do in R.
:::
The **tidyverse** packages we will be using include:
- **`readr`** for importing data into R
- **`dplyr`** for handling common data wrangling tasks for tabular
data
- **`tidyr`** which enables you to swiftly convert between different
data formats (long vs. wide) for plotting and analysis
- **`lubridate`** for working with dates
- **`ggplot2`** for visualizing data (we'll explore this package in
the next session).
For the full list of tidyverse packages and documentation, visit
[tidyverse.org](https://www.tidyverse.org/). You can install these
packages individually, or you can install the entire tidyverse in one
go.
## Installing and loading packages
When you first install R on your computer, it comes with a set of built-in packages and functions collectively referred to as Base R. To add additional packages, you must first install that package, and then load it into your current session. If you are taking this workshop in person at the library, or using the [class Posit Cloud project](https://posit.cloud/content/8458222), the **`tidyverse`** has already been installed, so we just need to load it. You only need to install a package once on a system, but you will load it each time you start a new R session. If the package had not already been installed, we would install it with a function called `install.packages()`.
```{r}
#| warning: false
#install tidyverse if you haven't yet
#install.packages("tidyverse")
#load tidyverse
library(tidyverse)
```
## Functions
`install.packages()` and `library()` are two examples of **functions**.
Functions are one of the most important components of R code. A function
is like a canned script. It usually takes some inputs, called
**arguments** inside the parentheses that follow the name of the function,
performs one or more tasks, and often returns some kind of output. The
`library()` function takes the name of the package to load as its
argument.
How do you know what arguments a function takes? For that you need to
turn to the documentation of a particular package, or from within
RStudio you can look up a function with `?function-name`. Let's try it
with the `library()` function.
```{r}
#| eval: false
?library
```
This opens the help pane in the lower right corner of RStudio. The
documentation provides you with all the arguments and any default
values, along with explanations of the arguments. Here we see that the
`library()` function has the argument `package` with no default.
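For example, the base R function `round()` has a required argument (the number to round) and an optional `digits` argument that defaults to 0:
```{r}
# with only the required argument, the default digits = 0 is used
round(3.14159)
# supplying the optional argument by name changes the result
round(3.14159, digits = 2)
```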
## What is Tidy Data?
The **`tidyverse`** is so named from the concept of "tidy data". Data is considered "tidy" if it follows three rules:
1. Each column is a variable
2. Each row is an observation
3. Each cell is a single value[^2]
[^2]: read more about tidy data
https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
Data "in the wild" often isn't tidy, but the tidyverse packages can help
you create and analyze tidy datasets.
```{r echo=F, fig.cap="tidy data structure^[image from R for Data Science https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure]"}
knitr::include_graphics("images/tidy-data.PNG")
```
## The Data for This Lesson
For this lesson we will be using data which comes from [Project Tycho](https://www.tycho.pitt.edu/) - an open data project from the University of Pittsburgh which provides standardized datasets on numerous diseases to aid global health research.
Throughout this lesson, we will be using a dataset from Project Tycho featuring historical counts of measles cases in the U.S. We want to clean and present this data in a way that makes it easy to see how measles cases fluctuated over time.
A useful feature of Project Tycho data is their use of a common set of variables. Read more about their [data format](https://www.tycho.pitt.edu/dataformat/ProjectTychoPreCompiledDataFormat.pdf).
## Importing data
Now that the **`tidyverse`** is loaded, we can use it to import some data
into our RStudio session. We are using a function from the **`readr`**
package called `read_csv()`. This function takes as an argument the path
to where the file is located. Let's start by reading in the `measles_us.csv` file in the `/data` folder.
```{r}
read_csv("data/measles_us.csv")
```
But doing this just gives us a preview of the data in the console. To
really use the data, we need to **assign** it to an **object**. An
object is like a container for a numerical value, string, data set,
image, and much more. Just about everything in R is an object. You might
liken them to variables in other programming languages or in math. We
create an object, by giving our data a name and use the **assignment
operator**, which looks like an arrow `<-`. You can manually type in the
lesser than sign `<` and hyphen `-`, or use the keyboard shortcut
`Alt + -`.
Let's call our new object `measles_us`. Object names should be short and
easy to understand. They can't have spaces, so you'll want to separate
multiple words with an underscore, or by using camel case. Object names
also need to start with a letter, not a number, and it's best to avoid
using names of common functions.
```{r}
#| message: false
measles_us <- read_csv("data/measles_us.csv")
```
When you create an object, it shows up in your environment pane (the
upper right panel). If we check our environment pane, we should now see an object called `measles_us`.
Let's do the same for the `states.csv` file.
```{r}
states <- read_csv(file = "data/states.csv")
```
## Exploring and Summarizing data
Data wrangling, also known as data cleaning or data munging, involves preparing raw data for analysis by transforming it into a more useful format. This process includes detecting and correcting errors, handling missing values, and reorganizing data for analysis. Using the tidyverse, we can streamline these tasks efficiently. After importing the data, you'll typically start by exploring it, identifying patterns, and making necessary adjustments to prepare it for visualization and further analysis. This foundational step ensures that your data is accurate, consistent, and ready for insightful exploration.
### Vectors and Data Frames and Tibbles oh my!
First, it's important to understand the type of object we just created. In R, tabular data like you find in a spreadsheet is stored in a **data frame**, one of the fundamental **data structures** in R. A data frame is a rectangular, two-dimensional data structure. That is, it has both columns and rows. Data frames can store multiple **data types**, such as numeric, character, and logical data, also known as classes.
A **tibble** is a tidyverse version of the standard R data frame. For our purposes, the differences are minor enough that we can generally use the terms interchangeably, but to be precise, we will be working with tibbles in this lesson. All tibbles are data frames, but not all data frames are tibbles.
Another important R data structure is a **vector**. A vector is a one-dimensional data structure. That is, it is simply a sequence of elements. A vector can have only one data type. Data frames are created from multiple vectors, that is, each column in a data frame is a vector of the same length.
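Here is a quick sketch of these structures using a few made-up values:
```{r}
# a vector: one-dimensional, all elements share one data type
state_names <- c("MARYLAND", "DELAWARE", "VIRGINIA")
class(state_names)
# a tibble: columns are vectors of equal length; rows are observations
example_tbl <- tibble(state = state_names, cases = c(10, 25, 3))
class(example_tbl)
```
Notice that the tibble's class includes `data.frame`, confirming that every tibble is also a data frame.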
### Base R functions for exploring data
`View()` opens the data in a viewer tab in the Source pane. This is a good way to see the data in a familiar spreadsheet-like format.
```{r}
#| eval: false
#| echo: true
View(measles_us)
```
Use `summary()` to look at each column: it reports the type of each column, plus summary statistics (minimum, quartiles, mean, and maximum) for numeric columns.
```{r}
#| results: false
summary(measles_us)
```
::: {.callout-note collapse="true" title="Missing data"}
Sometimes we get data with a large number of missing values. It can be helpful to know where data is missing before attempting to do any further analysis. R uses `NA` to indicate missing values. We can use the function `is.na()` to test for the presence of NAs in our data. `is.na()` will return a vector of values TRUE or FALSE: TRUE if the value is NA, FALSE if it is not. When we examined our data with the `View()` function, we might have noticed that the first several values in the Admin2Name column are missing (NA). We might want to know how many missing values total are in that column.
```{r}
#| results: false
is.na(measles_us$Admin2Name)
```
After running this code you should see `TRUE` printed out repeatedly in the console. R is running through that column and printing TRUE whenever it runs into a missing value. But this still does not help us get the total number of NAs. To do that we need to nest the above code in another function `sum()`.
```{r}
sum(is.na(measles_us$Admin2Name))
```
`sum()` treats each TRUE as a 1 and each FALSE as a 0. In that column there are `r sum(is.na(measles_us$Admin2Name))` missing values out of `r length(measles_us$Admin2Name)` total values.
But if you have a lot of variables (columns), it would be a pain to do this for each one. Instead we can use a similar function, `colSums()`.
```{r}
colSums(is.na(measles_us))
```
:::
### **`tidyverse`** functions for exploring data
The `glimpse()` function, which is part of the **`tidyverse`** package
**`tibble`**, lets you see the column names and data types clearly.
```{r glimpse}
#| results: false
glimpse(measles_us)
```
`distinct()` returns the distinct rows in a tibble. It can be used on a column to return the distinct values in that column. The first
argument you supply is the tibble object. Subsequent arguments
are the variables whose distinct values you want to see.
```{r}
distinct(measles_us, ConditionName)
```
```{r}
distinct(measles_us, Admin1Name)
```
`count()` is similar to `distinct()` but also returns the number of observations
(i.e. rows) for each of those distinct values. The first
argument you supply is the tibble object. Subsequent arguments
include the variables you want to count.
```{r}
count(measles_us, Admin1Name)
```
::: {.callout-tip title="Make code flow with the pipe %>%"}
Before we go any further - I want to introduce you to an important
time-saving symbol in R called the pipe `%>%` (`CTRL + SHIFT + M`). The
pipe allows you to take the output of the left-hand expression and make
it the input of the right-hand expression. It allows you to chain
together multiple functions and avoid nesting. With the pipe, we can
rewrite the above code as follows:
```{r}
measles_us %>%
count(Admin1Name)
```
In many tidyverse functions, the first argument is the name of the data
frame you're applying the function to. So when you use the pipe, you'll
generally start a line of code with the name of a tibble. One
benefit you might notice right away, is that when we use the pipe,
RStudio will supply the column names which helps to reduce typing and
typos.
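For instance, the nested code we wrote earlier to count missing values can be rewritten as a left-to-right pipeline (both versions below give the same answer):
```{r}
# nested form: read from the inside out
sum(is.na(measles_us$Admin2Name))
# piped form: read from left to right
measles_us$Admin2Name %>% is.na() %>% sum()
```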
:::
### Try it Yourself!
::: {.callout-tip title="CHALLENGE" }
Now let's try exploring the `states` tibble in our environment
1. Use `glimpse()` to inspect the columns and data types in the dataset.
1. Use `distinct()` to find out the distinct values in the `region` column.
1. Using `count()`, find out how many states are in each region.
1. Using `count()`, find out how many states are in each region AND division. HINT: You can add additional column names to `distinct()` and `count()` to look at combinations of columns.
:::: {.callout-tip title="Solution" icon=false collapse="true"}
1. `glimpse(states)`
1. `states %>% distinct(region)`
1. `states %>% count(region)`
1. `states %>% count(region, division)`
::::
:::
## Subsetting data with `select()` and `filter()`
Real data sets can be quite large. So, once you've explored your data, you may want to start trimming it down to just the variables and conditions you're interested in. In
this section, we'll look at two functions from the **`tidyverse`**
package called **`dplyr`**: `select()` which lets you choose columns
(variables) and `filter()` which lets you choose rows. (Note: `dplyr` is
known for using easy to understand verbs for its function names.)
### `select()`
`select()` lets you choose columns by name. The syntax of this function is similar to the ones we've already learned, `count()` and `distinct()`. We need to supply the function with the name of the tibble and the columns. This will create a new tibble with just those columns.
As with all tidyverse functions, we can use `%>%` to make this easier.
```{r}
measles_us %>%
select(Admin1Name, CountValue)
```
If you want to select several columns that are next to each other, you can use `:` to specify a range, rather than writing each name out separately.
```{r}
measles_us %>%
  select(ConditionName:Admin1ISO)
```
Now, let’s think through which columns we want for our analysis and save this to a new object called `measles_select`. It’s always a good idea to create new objects when you make major changes to your data.
For this exercise, we want to look at trends in the number of measles cases over time. To do that, we'll need to keep our `CountValue` variable and the date variables (`PeriodStartDate` and `PeriodEndDate`), as well as the `PartOfCumulativeCountSeries` variable, which will help us understand how to use the dates (more on this later). The first five columns each have only one value. So it might be redundant to keep those, although if we were combining them with other Project Tycho datasets they could be useful. It might be interesting to get a state-level view of the data, so let's keep `Admin1Name`. But we saw that there are a number of missing values in our `Admin2Name` and `CityName` variables, so they might not be very useful for our analysis.
```{r}
measles_select <-
measles_us %>%
select(
Admin1Name,
PeriodStartDate,
PeriodEndDate,
PartOfCumulativeCountSeries,
CountValue
)
```
Sometimes when you receive a data set or start working with data, you may find that the column names are overly long or not very descriptive or useful, and it may be necessary to rename them. For this, we can use the `rename()` function. Like naming objects, you should use a simple, descriptive, relatively short name without spaces for your column names. Let's rename `Admin1Name` to `state` to make it more meaningful to us. `rename()` has the syntax `rename(newColumnName = OldColumnName)`.
```{r}
measles_select <-
measles_select %>%
rename(state = Admin1Name)
```
Note that in this case, we are overwriting our original object with the new name instead of creating a new one!
### `filter()`
While `select()` acts on columns, `filter()` acts on rows. `filter()` takes the name of the tibble and one or more logical
conditions as arguments.
```{r}
measles_md <- measles_select %>%
filter(state == "MARYLAND")
```
Here we are saying keep all the rows where the
value in the `state` column is "MARYLAND". Note the use of the double equal sign `==` versus the single equal sign `=`. The double equal sign is a **logical operator**. The logical operators are:
| operator | meaning |
|-----------|--------------------------|
| == | exactly equal |
| != | not equal to |
| \< | less than |
| \<= | less than or equal to |
| \> | greater than |
| \>= | greater than or equal to |
| x\|y | x or y |
| x&y | x and y |
| !x | not x |
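Each comparison returns `TRUE` or `FALSE`. A few quick examples you can try in the console:
```{r}
5 > 3                      # greater than
"MARYLAND" == "maryland"   # string comparison is case-sensitive
10 >= 10 & 2 != 3          # combine conditions with & (and)
```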
Note that after running our code, our resulting tibble (our new object `measles_md`) has `r nrow(measles_md)` observations (rows) while our original tibble had `r nrow(measles_us)`.
::: {.callout-warning}
When matching strings you must be exact. R is case-sensitive. So `state == "Maryland"` or `state == "maryland"` would return 0 rows.
:::
You can add additional conditions to filter by, separated by other logical operators like `&`, `|`, and `>`.
Below we want just the rows for Maryland, and only include periods where the count was more than 500 reported cases. Note that while you need quotation marks around character data, you do not need them around numeric data.
```{r filter-multiple}
measles_select %>%
filter(state == "MARYLAND" & CountValue > 500)
```
Here, we joined together two conditions with the `&` logical operator. Note that we **piped** our `measles_select` tibble into `filter()`, which, like the other tidyverse functions we've seen, takes a tibble as its first argument.
What if we wanted to filter our tibble to include just the 50 states and no territories? We surely would not want to write out an expression for each state, or even for all of the territories.
```{r}
# we can avoid verbose code like this with %in%
measles_select %>%
  filter(state == "MARYLAND" | state == "DELAWARE" | state == "PENNSYLVANIA")
```
Luckily, we can filter based on a vector of values with the `%in%` operator (remember, we can think of a vector as a column of data). So, we can write some code to filter our data based on the list of state names in our `states` tibble.
```{r filter-vector}
measles_states_only <-
measles_select %>%
filter(state %in% states$name)
```
Here we saved the output to a new object, `measles_states_only`. If you check the environment pane, notice how it has fewer rows than our `measles_select` object.
We could alternatively have used negation with the names of the values we specifically wanted to exclude.
```{r}
measles_states_only <- measles_select %>%
filter(!state %in% c("PUERTO RICO", "GUAM", "AMERICAN SAMOA", "NORTHERN MARIANA ISLANDS", "VIRGIN ISLANDS, U.S.", "DISTRICT OF COLUMBIA"))
```
Great! Our dataset is really shaping up. Let's also take a closer look at our date columns. If you look at the first several rows, it looks like each row of our dataset represents roughly one discrete week of measles case counts. But (as you can read in the Tycho data documentation) there are actually two date series in this dataset - non-cumulative and cumulative. Which series a row belongs to is noted by the `PartOfCumulativeCountSeries` variable, which has the value `0` if a row is non-cumulative, and `1` if the row is part of a cumulative count.
To keep things consistent, let's filter our tibble so we only have the non-overlapping discrete weeks.
```{r filter-cumulative}
measles_non_cumulative <-
measles_states_only %>%
  filter(PartOfCumulativeCountSeries == 0)
```
Once again, we have fewer rows than we started with.
### Try it Yourself
:::{.callout-tip title="CHALLENGE"}
1. Use `select()` to create a new tibble with just the `name` and `division` columns from the `states` tibble. Assign this to an object called `us_divisions`.
2. Use `filter()` to keep just the rows in the `South Atlantic` division of the `us_divisions` tibble. Assign this to an object called `sa_division`.
3. Use `filter()` to keep just the rows in the `measles_non_cumulative` tibble where the `state` matches one of the states in the `name` column of the `sa_division` tibble and where the `CountValue` is greater than 1000. Assign this to an object called `measles_sa`.
:::: {.callout-tip title="Solution" icon=false collapse="true"}
1. `us_divisions <- states %>% select(name, division)`
1. `sa_division <- us_divisions %>% filter(division == "South Atlantic")`
1. `measles_sa <- measles_non_cumulative %>% filter(state %in% sa_division$name & CountValue > 1000)`
::::
:::
Now let's do some more with our date variables.
## Changing and creating variables with `mutate()`
Let's review the columns in our `measles_non_cumulative` tibble.
```{r}
glimpse(measles_non_cumulative)
```
We can see from this that the dates are being interpreted as *character* data. We want R to recognize them as dates. We can create new variables and adjust existing variables with the `mutate()` function.
`mutate()` takes as an argument the name and definition of the new column you're creating. Note that if you use the same variable name as an existing variable name it overwrites that column. Otherwise, it will add a column to your tibble.
To change the variable to a date, we are using a date parsing function from another package called **`lubridate`**. `mdy()` takes a character string or number in month-day-year format (as we have here) and returns a formal date object in YYYY-MM-DD format. There are similar functions if the input date is in year-month-day (`ymd()`) or day-month-year (`dmy()`) order.
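For instance, calling `mdy()` on a single made-up date string returns a proper `Date` object:
```{r}
# parse a month-day-year string into a Date (printed as YYYY-MM-DD)
mdy("12/01/1927")
```
Now let's apply it to both date columns in our tibble: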
```{r}
measles_non_cumulative <- measles_non_cumulative %>%
mutate(PeriodStartDate = mdy(PeriodStartDate),
PeriodEndDate = mdy(PeriodEndDate))
```
Note that you can mutate multiple columns at a time, separating each new column definition with a comma.
Now that R recognizes the date columns as Dates, we can do things like extract parts of the date, such as the year. Let's create a separate Year column. Later we'll be able to group our tibble by year for analysis.
```{r}
measles_year <-
measles_non_cumulative %>%
mutate(Year=year(PeriodStartDate))
```
## Grouping and Summarizing
Many data analysis tasks can be approached using the *split-apply-combine* paradigm: split the data into groups, apply some analysis to each group, and then combine the results. **`dplyr`** makes this very easy through the use of the `group_by()` function.
`group_by()` is often used together with `summarize()`, which collapses each group into a single-row summary of that group. `group_by()` takes as arguments the column names that contain the **categorical** variables for which you want to calculate the summary statistics.
How can we calculate the total number of measles cases for each year?
First we need to group our data by year using our new `Year` column.
```{r groupby}
yearly_count <-
measles_year %>%
group_by(Year)
yearly_count
```
When you inspect your new tibble, the data itself should look the same, though you may notice a `Groups:` line in the header. Grouping prepares your data for `summarize()`, but it does not change the underlying data.
Now let's try summarizing that data. `summarize()` condenses each group down to a single summary row. Like `mutate()`, we provide the function with the name of the new column that will hold the summary information. In this case, we will use the `sum()` function on the `CountValue` column and put this in a new column called `TotalCount`. `summarize()` will drop the columns that aren't being used.
```{r count-by-year}
yearly_count <-
measles_year %>%
group_by(Year) %>%
summarise(TotalCount = sum(CountValue))
yearly_count
```
A more useful view might be to look at yearly totals of case counts by state. We can group by two variables: first `Year`, then `state`.
```{r count-by-state}
yearly_count_state <-
measles_year %>%
group_by(Year, state) %>%
summarise(TotalCount = sum(CountValue))
yearly_count_state
```
Notice how the use of pipes really comes in handy here. It saved us from having to create and keep track of a number of intermediate objects.
## Sorting datasets with `arrange()`
Which state in which year had the highest case count? To easily find out, we can use the function `arrange()`, which sorts the rows by the column(s) you give it.
```{r arrange}
yearly_count_state %>% arrange(TotalCount)
```
By default, `arrange()` sorts in ascending order. To sort in descending order, we use it together with the `desc()` function.
```{r arrange-desc}
yearly_count_state %>% arrange(desc(TotalCount))
```
## Joining Datasets
Of course, looking at total counts in each state is not the most helpful metric without taking population into account. To rectify this, let's try joining some historical population data with our measles data.
First we need to import the population data^[population data retrieved from the FRED, the Federal Reserve Bank of St. Louis Economic Data, https://fred.stlouisfed.org/release/tables?rid=118&eid=259194].
```{r}
hist_pop <-
read_csv("data/Historical_Population_by_State.csv")
```
## Long vs Wide formats
Remember that for data to be considered "tidy", it should be in what is called "long" format. Each column is a variable, each row is an observation, and each cell is a value. Our state population data is in "wide" format, because State Name is being treated as a variable, when it is really a value. Wide data is often preferable for human-readability, but is less ideal for machine-readability. To be able to join this data to our measles dataset, it needs to have 3 columns - Year, State Name, and Population.
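Here is a tiny made-up example of the same information in wide and long form (the numbers are invented just for illustration):
```{r}
# "wide": each state is its own column, so state names live in the header
wide_example <- tibble(
  year    = c(1930, 1931),
  ALASKA  = c(59, 60),
  WYOMING = c(225, 230)
)
wide_example
# "long"/tidy: one row per year-state combination; state is a value in a column
long_example <- tibble(
  year    = c(1930, 1930, 1931, 1931),
  state   = c("ALASKA", "WYOMING", "ALASKA", "WYOMING"),
  pop1000 = c(59, 225, 60, 230)
)
long_example
```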
We will use the package **`tidyr`** and the function `pivot_longer` to convert our population data to a long format, thus making it easier to join with our measles data.
Each column in our population dataset represents a state. To make it tidy we are going to reduce those to one column called `state`, with the state names as the values of that column. We will then need to create a new column for population containing the current cell values. To remember that the population data is provided in 1000s of persons, we will call this new column `pop1000`.
`pivot_longer()` takes four principal arguments:
1. the data
2. *cols* are the names of the columns we use to fill the new values variable (or to drop).
3. the *names_to* column variable we wish to create from the *cols* provided.
4. the *values_to* column variable we wish to create and fill with values associated with the *cols* provided.
```{r}
library(tidyr)
hist_pop_long <- hist_pop %>%
pivot_longer(ALASKA:WYOMING,
names_to = "state",
values_to = "pop1000")
```
```{r eval=FALSE}
View(hist_pop_long)
```
Now our two datasets have similar structures: a column of state names, a column of years, and a column of values. Let's join these two datasets by the state and year columns. Note that if both datasets use exactly the same column names, you do not need to specify anything in the `by` argument; here the year column is called `Year` in one tibble and `DATE` in the other, so we spell out the match with `join_by()`. We use a left join here, which preserves all the rows in our measles dataset and adds the matching rows from the population dataset.
```{r}
measles_joined <- yearly_count_state %>%
left_join(hist_pop_long, by=join_by(state, Year == DATE))
measles_joined
```
::: {.callout-tip title="CHALLENGE"}
1. Use `mutate()` to calculate the rate of measles per 100,000 persons (remember population is given in 1000s).
1. Try joining `measles_yearly_rates` to `states`. What variable do you need to join by?
:::: {.callout-tip title="Solution" icon=false collapse="true"}
```{r}
# 1.
measles_yearly_rates <-
measles_joined %>%
mutate(epi_rate = (TotalCount / pop1000)*100)
# 2.
yearly_rates_joined <- measles_yearly_rates %>%
left_join(states, by = join_by(state == name))
```
::::
:::
Now our data is ready to be visualized!