-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathindex.qmd
917 lines (692 loc) · 43.7 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
---
title: "Keeping it Tidy"
subtitle: "*Using the Tidyverse to Organize, Transform, and Visualize Data*<br>Workshop hosted by Meghan Hall, through R-Ladies Paris<br>September 8, 2022 | 1pm EST / 7pm CEST"
from: markdown+emoji
header-includes: |
<link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.1.1/css/all.min.css"
rel="stylesheet"/>
execute:
warning: false
---
<a href="https://www.meetup.com/rladies-paris/events/287856868/" class="btn btn-primary">**Register now**</a> <a href="https://meghan.rbind.io/slides/rladiesparis/rladiesparis.html#/title-slide" class="btn btn-primary">**Workshop slides**</a> <a href="https://github.com/meghall06/rladiesparis/blob/master/keeping-it-tidy.R" class="btn btn-primary">**Code file**</a>
## About the workshop
This 90-minute workshop is targeted toward beginners/advanced beginners and will serve as an introduction to the [tidyverse](https://www.tidyverse.org/), a collection of packages designed to aid in the cleaning and wrangling of data. In order to give you the tools to embark upon your own analysis on your own data, we will work through a sample data set and discuss methods for modifying, aggregating, reshaping, and visualizing data.
The event is free and will be held virtually over Zoom, though registration is required!
![](images/image.png)
### Technical details
For those attending: you are free to "code along" throughout the workshop, using the details in the slides and on this page, but you do not have to! All the materials are publicly available, so feel free to just watch and go through the code at your own pace at a later date. Unfortunately, due to time and other logistical constraints of this workshop, there won't be time for individual troubleshooting.
If you would like to be prepared to code along, please have R and RStudio installed on your computer and make sure that the tidyverse is installed with `install.packages("tidyverse")`.
### Website materials
The materials on this website are designed as a companion for the live workshop, not as a comprehensive guide to the entire tidyverse. All the code is here, along with some background on the concepts discussed. The workshop slides are linked above.
The code is interspersed throughout this page, but the full `keeping-it-tidy.R` file is available at the link above.
## R and the `tidyverse`
R is an open-source (that means free!) scripting language for working with data. It helps make your analysis wonderfully efficient and :sparkles: reproducible :sparkles:, and I would highly recommend it over something like Excel for anything beyond very basic data analysis, especially if you expect to repeat that analysis ever again (which you probably will).
Having a) your data separate from your analysis and b) that analysis codified in a script will help tremendously with reproducibility efforts. That makes it easier for anyone else (including yourself in three months once you have forgotten all the details you swore you'd remember) to rerun your code and make edits, deal with updated data, etc.
### Getting started
You need the R [language](https://cloud.r-project.org/), as well as a place to actually run that R code. I highly recommend [RStudio](https://www.rstudio.com/products/rstudio/download/#download). Both are free to download. There are some basic installation instructions [here](https://stat545.com/install.html).
You use R via *packages*, which hold *functions*, which are just *verbs*. `filter` is an example of a function. The functions we'll see today take the syntax `function(argument)`. Many functions can take multiple arguments, which are separated by a comma.
(You can also write your own functions, but that is beyond the scope of this workshop. See more [here](https://r4ds.had.co.nz/functions.html).)
### Why the tidyverse?
The tidyverse is an opinionated set of packages that work together and are designed for common data science tasks. (By *opinionated*, I mean that the packages have some thoughts on how your data should be structured. We'll discuss the concept of **tidy data** later on in this workshop.) The packages are well-maintained and beginner-friendly, plus they cover almost all of what any beginner and/or intermediate user needs in order to analyze data.
You install packages with the command `install.packages("tidyverse")` (need to do once per R installation on your computer) and then load them with the command `library(tidyverse)` (need to do every session; it is convention to put all of the necessary packages at the top of your coding script).
Since the tidyverse is actually a set of packages, the `library(tidyverse)` command loads nine of them---the eight listed below, in addition to `lubridate`, which is useful for working with dates (I clearly need to update my packages before I take screenshots!).
![](images/packages2.png){fig-alt="The packages loaded with the tidyverse: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, forcats"}
We'll be focusing on the three packages highlighted above, though there are several more packages within the tidyverse family (shown below). You would need to load any of those separately.
![](images/packages3.png){fig-alt="List of other optional tidyverse packages, including broom, rvest"}
#### Common `dplyr` verbs
`dplyr` is perhaps the most useful tidyverse package, just due to how often you'll use its common functions for data manipulation.
- `filter()` keeps or discards rows (aka observations)
- `select()` keeps or discards columns (aka variables)
- `arrange()` sorts data set by certain variable(s)
- `count()` tallies data set by certain variable(s)
- `mutate()` creates new variables
- `summarize()` aggregates data
Both `mutate()` and `summarize()` can be further modified by the addition of the `group_by()` function to specify grouping variables. We'll see examples of all of these during this workshop.
#### Common operators
`<-` is the assignment operator, used for assigning objects like data frames (how data sets are commonly stored in R). I like to think of `<-`, which can be added with the shortcut `option -` in RStudio, as "save as."
`|>` is the *pipe*, used to chain together multiple lines of code. It automatically sends the output of one line into the input of the next line and makes it much easier to read and write code. The keyboard shortcut is `cmd shift m`.
You might also be familiar with `%>%`, which is the original pipe that was native to the tidyverse. The new pipe `|>` works in base R and is therefore a bit more versatile, but for the purposes of this workshop feel free to use either! I am trying to get familiar with using `|>` instead of `%>%`, so that's what you'll see here.
## Workshop data
The data we're using for today's workshop comes from [#TidyTuesday](https://github.com/rfordatascience/tidytuesday), a weekly social data project based on the tidyverse ecosystem. The GitHub repo hosts many interesting data sets to practice with, and [this particular data set](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-02-01/readme.md) comes from the American Kennel Club.
`read_csv()` from the `readr` package (loaded as part of the tidyverse!) imports data of multiple file formats, and the two data sets for this workshop can be read in directly from this website's GitHub repo.
```{r readr}
#| output: false
library(tidyverse)
breed_rank <- read_csv("https://raw.githubusercontent.com/meghall06/rladiesparis/master/breed_rank.csv")
breed_traits <- read_csv("https://raw.githubusercontent.com/meghall06/rladiesparis/master/breed_traits.csv")
```
The first data set, `breed_rank`, lists the popularity rankings over time for 195 different dog breeds (many of the snippets shown throughout are truncated for the purposes of display).
::: {style="text-align: center"}
`breed_rank`
:::
::: {style="font-size: 0.75em"}
```{r breed-rank-kable}
#| echo: false
knitr::kable(breed_rank |> head(4)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
The second data set, `breed_traits`, has information on 16 different traits, classified from 1 to 5, for those 195 dog breeds.
::: {style="text-align: center"}
`breed_traits`
:::
::: {style="font-size: 0.75em"}
```{r breed-traits-kable}
#| echo: false
knitr::kable(breed_traits |> head(4) |> select(1:5)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
For today's workshop, since we'll be learning about tidy data, we'll investigate some very **un**tidy dogs. To do so, we'll be specifically focusing on three of these trait variables: shedding level, coat grooming frequency, and drooling level.
## Basic data cleaning
Per the [tidyverse style guide](https://style.tidyverse.org/syntax.html), variable names should use snake case---lower case with underscores between words. This helps with consistency and readability, but it's also technically easier, as any variable names that start with numbers and/or have spaces need to be referred to within `` `back ticks` ``. It's easier to refer to a variable with `shedding_level` instead of `` `Shedding Level` ``, and thankfully we have a function to easily rename all of those variables instead of doing it by hand.
Unfortunately...that function does not live within the tidyverse! It's the only such function we'll be highlighting during this workshop, but it is so helpful that it has to be included.
In the code below, after we've loaded the [janitor](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html) package, the first line uses the assignment operator `<-` to "save as" our `breed_traits` data set. We could give it another name and save it as something else, but for this purpose we're going to overwrite it. The second line applies the `clean_names()` function.
:::{.callout-note icon="false"}
## Note! :rotating_light:
A helpful tip on notation: once a package has been loaded with `library`, you can use the function by itself, like you see here with `clean_names()` and above with `read_csv()`. If you don't want to load the package, you can eliminate that line and instead refer to the function along with its package name, such as `janitor::clean_names()`. (The package still has to be installed, however.)
:::
```{r janitor}
library(janitor)
breed_traits <- breed_traits |>
clean_names()
```
The `clean_names()` function neatly converts all variable names to snake case, as shown below.
::: {style="font-size: 0.75em"}
```{r janitor-kable}
#| echo: false
knitr::kable(breed_traits |> head(2) |> select(1:3)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
### `count()`
If you want to *look* at your data, use `View(breed_traits)` and it will pop up in RStudio in a spreadsheet-like format. But there are ways to programatically look at your data, as well, and one of the most useful functions for doing so is `count()`. It allows you to count the unique values of a variable (or multiple).
Note that this piece of code does not have an assignment operator! We are applying the `count()` function to the `breed_traits` data set, but the results would appear in your console and would not be saved anywhere. This is useful whenever you don't need to save the output.
```{r count-1}
#| eval: false
breed_traits |>
count(shedding_level)
```
This output tallies up our data by the `shedding_level` variable. 109 breeds have a value of 3, 41 breeds have a value of 2, etc. Unfortunately, we have a value of zero, and we know that these variables should have a value of 1, 2, 3, 4, or 5. This is likely an error in the data that should be removed.
::: {style="font-size: 0.75em"}
```{r count-1-kable}
#| echo: false
knitr::kable(breed_traits |> count(shedding_level)) |>
kableExtra::row_spec(1, bold = T, color = "#6A395B", background = "#F8F2F3") |> kableExtra::kable_styling(full_width = FALSE)
```
:::
### `filter()`
To investigate this further, we can use the `filter()` function, which keeps or discards observations (rows). Here, the function argument says that we want to keep all records with a `shedding_level` value of zero (and we know from our `count()` output above that there should be only one such observation). I'm also adding the `select()` function (which keeps or discards columns) to keep only our columns of interest.
%hanks to the pipe `|>`, we start with our `breed_traits` data set, apply the `filter()` function with that input, apply the `select()` function to the given output, and then the results would be shown in the console.
```{r filter-1}
#| eval: false
breed_traits |>
filter(shedding_level == 0) |>
select(breed, shedding_level, coat_grooming_frequency,
drooling_level)
```
There is in fact only one observation matching this criteria---it looks like something is wrong with the record on Plott Hounds, so we should remove them from our data set (even though [they're quite cute](https://www.google.com/search?q=plott+hound&rlz=1C5GCEM_enUS1004US1004&source=lnms&tbm=isch&sa=X&ved=2ahUKEwi-vqSXsIH6AhVRrYkEHVlDD1EQ_AUoAXoECAIQAw&biw=1440&bih=712&dpr=2)).
::: {style="font-size: 0.75em"}
```{r filter-1-kable}
#| echo: false
knitr::kable(breed_traits |>
filter(shedding_level == 0) |>
select(breed, shedding_level, coat_grooming_frequency, drooling_level)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
We can use the `filter()` function once again to make this change. You'll notice in the code below that the assignment operator is back, which means we're making changes to our data set. We'll overwrite the `breed_traits` data frame once again and apply a `filter()` function to keep only the records in which `shedding_level` does *not* equal zero (i.e., all but the Plott Hounds observation). The `!=` operator means *not equal to*.
```{r filter-2}
breed_traits <- breed_traits |>
filter(shedding_level != 0)
```
If you were to once again run `breed_traits |> count(shedding_level)`, you'd see that zero is no longer a value of that variable.
### `mutate()`
I mentioned earlier that the focus of our analysis will be **un**tidy dogs and that we'll be focusing on three of the traits. Since those traits are classified from 1 to 5 for each breed, with a higher score denoting a higher level of untidiness, we can add up the scores for all three traits to create a new variable.
The code below uses the assignment operator `<-` again, but this time we're starting with the `breed_traits` data frame and creating a *new* data frame called `untidy_scores`. The `mutate()` function creates this new variable, `untidy_score`, that adds up our three traits of interest, and the `select()` function keeps only two columns from our original data set.
```{r mutate-1}
untidy_scores <- breed_traits |>
mutate(untidy_score = shedding_level +
coat_grooming_frequency + drooling_level) |>
select(breed, untidy_score)
```
Our entire data set now consists of the `untidy_score` for 194 breeds (we had 195 until we dropped the Plott Hounds).
::: {style="font-size: 0.75em"}
```{r mutate-1-kable}
#| echo: false
knitr::kable(untidy_scores |> head(5)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
### `arrange()`
The `arrange()` function can be useful to quickly sort your data set based on the value of any selected variable(s). `arrange()` defaults to ascending order, as you can see in the code and output on the left, but you can specify descending order by wrapping the variable name within `desc()`, as seen in the code on the right.
::: columns
::: {.column width="50%"}
```{r arrange-1}
#| eval: false
untidy_scores |>
arrange(untidy_score)
```
::: {style="font-size: 0.75em"}
::: {style="text-align: center"}
```{r arrange-1-kable}
#| echo: false
knitr::kable(untidy_scores |> arrange(untidy_score) |> head(6)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
:::
::: {.column width="50%"}
```{r arrange-2}
#| eval: false
untidy_scores |>
arrange(desc(untidy_score))
```
::: {style="font-size: 0.75em; text-align: center"}
```{r arrange-2-kable}
#| echo: false
knitr::kable(untidy_scores |> arrange(desc(untidy_score)) |> head(6)) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
:::
## Bar chart
:::{.callout-note icon="false"}
## Note! :rotating_light:
For all of the plots in this workshop, I have included code for the most basic, out-of-the-box version as well as code to make a more custom plot (toggle to the custom tab to see it). The former can be useful whenever you're a) learning and/or b) visualizing data for personal data exploration purposes. But if you're looking to present your plots in any way, you'll likely want to make some customizations. Thankfully, `ggplot` makes that possible! I use a special theme for the custom plots, the code for which is available below.
:::
```{r theme}
#| code-fold: true
#| code-summary: "bonus: custom ggplot theme"
#| output: false
library(showtext) # helps with custom fonts
font_add_google("Prompt", "prompt") # load preferred Google fonts
showtext_auto()
theme_tidy_dog <- function () {
theme_linedraw(base_size=13, base_family="prompt") %+replace%
theme(
# justify axis titles
axis.title = element_text(hjust = 0),
# backgrounds to match website
panel.background = element_rect(fill='#F9E0D9', color = NA),
plot.background = element_rect(fill='#F9E0D9', color=NA),
legend.background = element_rect(fill="transparent", color=NA),
legend.key = element_rect(fill="transparent", color=NA),
# I hate axis ticks and lines :shrug:
axis.ticks = element_blank(),
panel.grid.major = element_line(color = "grey90", size = 0.3),
panel.grid.minor = element_blank(),
# make tweaks to the title and subtitle
plot.title = element_text(size = 15, hjust = 0, vjust = 0.5, face = "bold",
margin = margin(b = 0.2, unit = "cm")),
plot.subtitle = element_text(size = 10, hjust = 0, vjust = 0.5,
margin = margin(b = 0.2, unit = "cm")),
)
}
```
Our previous output showed the six untidiest dogs, along with their untidy score. Let's make a simple bar chart to visualize that data. Our `untidy_scores` data set we created previously has all 194 breeds, but for this purpose we want to create a new data set, `untidy_dogs`, with only the six highest scores.
The `slice_max` function is quite handy for filtering and keeping only the highest values of a certain variable. This function has multiple arguments: the first specifies the variable to filter on (`untidy_score`), the second specifies the number of observations we want to keep, and the third clarifies our stance on ties (here we'll just ignore them).
```{r untidy-dogs}
untidy_dogs <- untidy_scores |>
slice_max(untidy_score, n = 6, with_ties = FALSE)
```
We are left with only six observations in this data set, which we can use to make our bar chart.
::: {style="font-size: 0.75em"}
```{r untidy-dogs-kable}
#| echo: false
knitr::kable(untidy_dogs) |> kableExtra::kable_styling(full_width = FALSE)
```
:::
`ggplot2` is the tidyverse package used for making plots---**gg** stands for the grammar of graphics. The base function is `ggplot()`, and there are many ([many, many](https://ggplot2.tidyverse.org/reference/index.html)) associated functions to help construct and customize your plots.
:::{.callout-note icon="false"}
## Note! :rotating_light:
When you're working with `ggplot` and its related functions, you need to pipe lines together with `+` instead of `|>`. Sometimes you will forget (I still do, and I've been using `ggplot` for years :shrug:), but the error message you get is helpful.
:::
The code for the bar chart below starts with the name of our data frame: `untidy_dogs`. The `ggplot` function starts things off by mapping elements to the **aes**thetic attributes. Here we just have `x` and `y`, but you'll see more later. We only have two variables in this data set, so `untidy_score` will go on the x-axis and `breed` will go on the y.
We specify the **geom**etric points with the following line: `geom_bar()` signifies a bar chart. The `stat = "identity"` argument is necessary here to communicate that we want the bar lengths to be determined by the variable that we specified. (The standard bar chart just takes a raw count of your data.)
::: {.panel-tabset}
### Basic
```{r bar-basic}
#| fig-height: 4
#| fig-align: center
untidy_dogs |>
ggplot(aes(x = untidy_score, y = breed)) +
geom_bar(stat = "identity")
```
### Custom
```{r bar-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
untidy_dogs |>
# can reorder the breeds based on the untidy score
ggplot(aes(x = untidy_score, y = reorder(breed, untidy_score),
label = untidy_score)) +
# fill colors the bars
geom_bar(stat = "identity", fill = "#6A395B") +
# add the data labels
geom_label(family = "prompt") +
# ensure the bars go all the way to the axis line
scale_x_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(title = "The untidiest dogs",
subtitle = "Based on drooling, shedding, and grooming frequency",
x = "Untidiness score", y = NULL) +
theme_tidy_dog()
```
:::
## Tidy data
From the previous section, we know that Bernese Mountain Dogs are among the untidiest of all: they have the highest score of 11. So far we've mostly been working with the `breed_traits` data, but we have a whole other data set, `breed_ranks`. Can we look at how the popularity ranking of Bernese Mountain Dogs has shifted over time?
::: {style="font-size: 0.75em"}
```{r bernese-kable-1}
#| echo: false
knitr::kable(breed_rank |> filter(str_detect(Breed, "Bernese"))) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
I am imagining some kind of dot plot, with the year on the x-axis and the ranking on the y-axis. That seems easy enough based on the data that we have above, but if you were to start constructing that plot, you'd pretty quickly run into a road block. What would your first function look like? `ggplot(aes(x = ??, y = ??))`
We discussed earlier that the tidyverse is opinionated on the topic of **tidy data**, and this particular question is tricky because this data set does not meet the criteria for tidy data:
>There are three interrelated rules which make a dataset tidy:<br>
1. Each variable must have its own column.<br>
2. Each observation must have its own row.<br>
3. Each value must have its own cell.<br>
- [**R for Data Science**](https://r4ds.had.co.nz/tidy-data.html)
How does `breed_ranks` violate these rules? We have a year variable, but that variable does not actually exist as such, in its own column. Shown above is one observation, by dog breed. But that "one" observation is actually eight separate observations: the rank in 2013, the rank in 2014, etc. Each observation needs to have its own row.
### `pivot_longer()`
Thankfully, the tidyverse gives us a handy solution from the `tidyr` package! The current structure of `breed_ranks` is in a *wide* format, and we need it to be in a *long* format. `pivot_longer()` to the rescue.
The code below uses the assignment operator to create a new data frame called `ranks_pivoted` based on `breed_rank`. The `pivot_longer()` function here has three arguments: the columns we want to pivot (remember we need to use back ticks because these variable names have spaces in them), the name for the new column consisting of the previous column headers, and the name for the new column consisting of the previous column *values*.
```{r pivot-longer}
ranks_pivoted <- breed_rank |>
pivot_longer(`2013 Rank`:`2020 Rank`,
names_to = "year",
values_to = "rank")
```
The output is below. This data is now tidy, with each observation (e.g., the rank in 2013) in its own row and separate columns for each variable.
::: {style="font-size: 0.75em"}
```{r bernese-kable-3}
#| echo: false
knitr::kable(ranks_pivoted |> filter(str_detect(Breed, "Bernese"))) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
We can do a bit more cleaning up of this data set. First, the `rename()` function makes it easy to change the names of variables (the new name comes first, followed by the original)---we can change `Breed` to `breed` to match our other data set. And the `parse_number()` function from `readr` allows us to pull out the integer from our `year` column. This is an example of how you can use `mutate()` to rewrite existing variables in addition to creating new ones.
```{r pivot-longer-2}
ranks_pivoted <- ranks_pivoted |>
rename(breed = Breed) |>
mutate(year = parse_number(year))
```
::: {style="font-size: 0.75em"}
```{r pivot-longer-2-kable}
#| echo: false
knitr::kable(ranks_pivoted |> filter(str_detect(breed, "Bernese"))) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
:::{.callout-note icon="false"}
## Note! :rotating_light:
There are more advanced pivoting examples [here](https://r4ds.had.co.nz/tidy-data.html#pivoting) and [here](https://dcl-wrangle.stanford.edu/pivot-advanced.html).
:::
Now that our data is appropriately tidy, we can return to the question from the beginning of this section: how has the popularity ranking of Bernese Mountain Dogs has shifted over time? Since I'm only interested in Bernese Mountain Dogs, and our new `ranks_pivoted` data frame contains all the breeds, the code below starts with a `filter()` statement.
I could use `filter(breed == "Bernese Mountain Dogs")`, but perhaps you don't want to type all that out. The `stringr` [package](https://stringr.tidyverse.org/) (part of the core tidyverse) has many functions for dealing with string (text) data, and `str_detect()` is among the most useful. It returns `TRUE` or `FALSE` as to whether the variable you select (the first argument of the function) contains the string you provide (the second argument of the function). This function within a function will `filter` to only the observations for which our `str_detect` expression is `TRUE`.
Our `ggplot` function assigns `year` to the x-axis and `rank` to the `y-axis`. I've also added `label = rank` to indicate that I also want my `rank` variable to determine the label. Instead of adding a `geom_bar` label like last time, here we've added a `geom_point` layer for dots as well as a `geom_text` layer to add a label. The `vjust` argument within that function specifies the desired vertical justification for the labels to place them below the dots.
## Dot plot
::: {.panel-tabset}
### Basic
```{r dot-basic}
#| fig-height: 3.5
#| fig-align: center
ranks_pivoted |>
filter(str_detect(breed, "Bernese")) |>
ggplot(aes(x = year, y = rank, label = rank)) +
geom_point(size = 3) +
geom_text(vjust = 2)
```
### Custom
```{r dot-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
ranks_pivoted |>
filter(str_detect(breed, "Bernese")) |>
ggplot(aes(x = year, y = rank, label = rank)) +
geom_point(size = 3) +
geom_text(vjust = 2) +
# flip the y-axis
scale_y_reverse(limits = c(50, 1)) +
# specify the breaks on the x-axis
scale_x_continuous(breaks = seq(2013, 2020, 1)) +
labs(x = NULL, y = "Popularity Rank",
title = "Popularity of Bernese Mountain Dogs") +
theme_tidy_dog()
```
:::
## Line graph
In the previous section we plotted the ranking over time for the Bernese Mountain Dogs. Can we do the same for all six of the untidiest dogs we identified previously in the `untidy_dogs` data set?
These are the two data sets we have:
::: columns
::: {.column width="50%"}
::: {style="text-align: center"}
`untidy_dogs`
:::
::: {style="font-size: 0.75em"}
```{r untidy-dogs-kable}
#| echo: false
```
:::
:::
::: {.column width="50%"}
::: {style="text-align: center"}
`ranks_pivoted`
:::
::: {style="font-size: 0.75em"}
```{r ranks-pivoted-kable}
#| echo: false
knitr::kable(ranks_pivoted |> head(6)) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
:::
The data we want to plot (the rankings by year) is in the `ranks_pivoted` data set, but that holds all the dog breeds and we only want to include the six breeds we have in the `untidy_dogs` data set.
Thankfully, you can also use the `filter()` function to filter *across* data sets! The code below creates a new data set `untidy_popularity` based on the `ranks_pivoted` data set that keeps only the observations for which the `breed` is included among the list of `breed`s within the `untidy_dogs` data set. There are two more common operators used here: `%in%` allows for multiple options within a list, and `$` refers to a specific variable within another data set.
```{r line-graph}
untidy_popularity <- ranks_pivoted |>
filter(breed %in% untidy_dogs$breed)
```
:::{.callout-note icon="false"}
## Note! :rotating_light:
You could get the same results with `filter(breed %in% c("Bernese Mountain Dogs", "Leonbergers", "Newfoundlands", "Bloodhounds", "St. Bernards", "Old English Sheepdogs"))`, but that is a lot of typing (i.e., many opportunities for typos).
:::
To check that this filter worked as we wanted it to, we can apply a `count()` function to our new `untidy_popularity` data set.
```{r count-check}
#| eval: false
untidy_popularity |>
count(breed)
```
There are six breeds included (correct), with eight separate observations each (one for each year; correct).
::: {style="font-size: 0.75em"}
```{r count-check-kable}
#| echo: false
knitr::kable(untidy_popularity |> count(breed)) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
We can create a line graph off of this `untidy_popularity` data set with a `geom_line` layer. (I've also included a `geom_point` layer to add points to those lines.) The `year` variable is on the x-axis with `rank` on the y-axis, similar to what we've used before, but in order to get multiple lines for the multiple breeds, I've added the `breed` variable onto both the `group` and `color` aesthetics. This will group the data points by breed and also create a color legend.
::: {.panel-tabset}
### Basic
```{r line-basic}
#| fig-height: 4
#| fig-align: center
untidy_popularity |>
ggplot(aes(x = year, y = rank, group = breed, color = breed)) +
geom_line() +
geom_point(size = 3)
```
### Custom
```{r line-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
untidy_popularity |>
# create a label that applies for the last point only
mutate(label = ifelse(year == 2020, breed, NA)) |>
ggplot(aes(x = year, y = rank, group = breed, color = breed,
label = label)) +
geom_line() +
geom_point(size = 3) +
# apply the label so I don't have to use a legend
geom_text(hjust = -0.1, family = "prompt") +
# change the color scale
scale_color_viridis_d(option = "A") +
# control the width of the x-axis (so I have room for labels)
# and specify breaks
scale_x_continuous(expand = expansion(mult = c(0.025, 0.5)),
breaks = seq(2013, 2020, 1)) +
# flip the y-axis so better rankings are "higher"
scale_y_reverse() +
labs(title = "Popularity over time of the untidiest dogs",
subtitle = "Based on drooling, shedding, and grooming frequency",
x = NULL,
y = "Popularity Rank") +
theme_tidy_dog() +
# remove the legend since I used labels instead
theme(legend.position = "none")
```
:::
## Relational data
We have data on the `untidy_scores` for each breed as well as their popularity over time---can we create a scatter plot of the *average* popularity ranking against the `untidy_scores` for all breeds?
We already have the `untidy_scores`, so let's focus on finding the average popularity ranking. Our `ranks_pivoted` data set holds that information in a tidy way, but there is one observation per year per breed and we need to find the overall average per breed.
The `group_by()`/`summarize()` combination of `dplyr` functions allows you to calculate aggregations by a specified group (or multiple!). Here, the code below groups by `breed` and creates a new variable, `avg_rank`, within a `summarize()` function by taking the mean of the existing `rank` variable.
```{r summ-1}
avg_ranks <- ranks_pivoted |>
group_by(breed) |>
summarize(avg_rank = mean(rank))
```
The first nine rows of our new `avg_ranks` data set are below. But we have an `NA` value! This is a quirk of the aggregation calculations available within `summarize()`. If there is any missing data among the values being summarized, `NA` will be returned. (From the raw data, you can find out that the American Hairless Terriers were only ranked in 2020.)
::: {style="font-size: 0.75em"}
```{r summ-1-kable}
#| echo: false
knitr::kable(avg_ranks |> head(9)) |> kableExtra::row_spec(9, bold = T, color = "#6A395B", background = "#F8F2F3") |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
Depending on your data, this could be very useful information! But if you are aware of the missing data and still want to calculate the aggregations irrespective of any possible missing values, add `na.rm = TRUE` as an additional argument to your aggregations within `summarize()`.
```{r summ-2}
avg_ranks <- ranks_pivoted |>
group_by(breed) |>
summarize(avg_rank = mean(rank, na.rm = TRUE))
```
::: {style="font-size: 0.75em"}
```{r summ-1-kable}
#| echo: false
```
:::
We now have all of the data that we need. Our new `avg_ranks` data set holds the average popularity ranking per breed, and the `untidy_scores` data set we created earlier holds each breed's untidy score.
How do we put them together?
::: columns
::: {.column width="50%"}
::: {style="text-align: center"}
`avg_ranks`
:::
::: {style="font-size: 0.75em"}
```{r avg-ranks-kable}
#| echo: false
knitr::kable(avg_ranks |> head(5)) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
::: {.column width="50%"}
::: {style="text-align: center"}
`untidy_scores`
:::
::: {style="font-size: 0.75em"}
```{r untidy_scores-kable}
#| echo: false
knitr::kable(untidy_scores |> head(5)) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
:::
:::
Relational data---tables of data that share one or more common elements---is extremely common, and it's worth becoming familiar with the tidyverse functions that allow you to join data tables.
To solve this particular question, we'll use `left_join()`, which is a type of *mutating* join: it matches up observations across tables and moves variables from one to another.
:::{.callout-note icon="false"}
## Note! :rotating_light:
We're focusing on a simple `left_join()` because it is the most common type of join. Other join types are available, and you can read more [here](https://r4ds.had.co.nz/relational-data.html).
:::
The syntax of this example is fairly simple: we're starting with the `avg_ranks` data set we just created and making a new one titled `tidy_and_rank`. The two arguments to the `left_join()` function are the name of the data set we want to get information from (`untidy_scores`) as well as the *key*, or the variable that connects the two data frames. In this case, it's `breed`.
```{r left-join-1}
tidy_and_rank <- avg_ranks |>
left_join(untidy_scores, by = "breed")
```
:::{.callout-note icon="false"}
## Note! :rotating_light:
You can have multiple keys: `by = c("var1","var2")` and even keys that have different variable names: `by = c("var_left" = "var_right")`. If you *don't* specify a key, R will default to any variable(s) that have the same name across tables. It is good practice to specify a key no matter what, even in a simple example like this, to avoid mistakes and make it clear what you're intending to do.
:::
A quick peek at our new `tidy_and_rank` data set looks okay...
::: {style="font-size: 0.75em"}
```{r tidy-and-rank-kable}
#| echo: false
knitr::kable(tidy_and_rank |> head(5)) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
...but we can use `count()` again just to make sure that this worked the way we wanted it to.
```{r count-check-2}
#| eval: false
tidy_and_rank |>
count(untidy_score)
```
And it looks like we have an `NA` value.
::: {style="font-size: 0.75em"}
```{r count-check-2-kable}
#| echo: false
knitr::kable(tidy_and_rank |>
count(untidy_score))|> kableExtra::row_spec(10, bold = T, color = "#6A395B", background = "#F8F2F3") |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
Let's further investigate with a simple `filter()` to see where the issue is. `is.na()` is a handy function to use within `filter()` that will return any observations with an NA value for the specified variable.
```{r na-1}
#| eval: false
tidy_and_rank |>
filter(is.na(untidy_score))
```
It's the darn Plott Hounds again. We removed them from the original `breed_traits` data set, the basis of `untidy_scores`, but they were still present in the original `breed_ranks` data set, which is what this is based on.
::: {style="font-size: 0.75em"}
```{r na-1-kable}
#| echo: false
knitr::kable(tidy_and_rank |>
filter(is.na(untidy_score))) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
This is essentially the opposite of the operation we just did. We are filtering to keep any observations for which `untidy_score` is *not* (`!`) `NA`.
```{r na-2}
tidy_and_rank <- tidy_and_rank |>
filter(!is.na(untidy_score))
```
You can run the `count()` code again if you wish to double-check, but we are good to go.
## Jitter plot
You can create a simple scatter plot in `ggplot` with numeric variables on the x- and y-axes and a `geom_point` layer. But a case like this, where one of our variables is an interval like `untidy_score`, sometimes benefits from adding a bit of *jitter* to help showcase data points that might otherwise overlap in a scatter plot.
That's possible by using a `geom_jitter` layer, where the width of the strip can be controlled by the `width` argument.
::: {.panel-tabset}
### Basic
```{r jitter-basic}
#| fig-height: 4
#| fig-align: center
tidy_and_rank |>
ggplot(aes(x = untidy_score, y = avg_rank)) +
geom_jitter(width = 0.1)
```
### Custom
```{r jitter-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
tidy_and_rank |>
ggplot(aes(x = untidy_score, y = avg_rank)) +
# specify the labels of the x-axis
scale_x_continuous(breaks = seq(3, 11, 1)) +
# control the size, transparency, and color of the points
geom_jitter(size = 3, width = 0.1, alpha = 0.7, color = "#6A395B") +
# flip the y-axis and control the labels
scale_y_reverse(breaks = c(200, 150, 100, 50, 1)) +
labs(title = "Dog breed popularity compared to their untidiness score",
subtitle = "Ranking based on 2013-2020 data; tidy score based on drooling, shedding, grooming frequency",
x = "Untidiness Score",
y = "Average Popularity Rank") +
theme_tidy_dog()
```
:::
## Dumbbell plot
The last question we'll answer from this data set is: which breeds have had the biggest jump in popularity from 2013 to 2020?
To answer that question, we're only interested in certain observations from `ranks_pivoted`, those with a `year` of 2013 or 2020.
::: {style="font-size: 0.75em"}
```{r ranks-pivoted-3-kable}
#| echo: false
knitr::kable(ranks_pivoted |> head(9)) |> kableExtra::row_spec(c(1,8,9), bold = T, color = "#6A395B", background = "#F8F2F3") |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
All three of the following functions would deliver the same results on this data set:
1. `filter(year %in% c(2013, 2020))`
2. `filter(year == 2013 | year == 2020)`
3. `filter(year == min(year) | year == max(year))`
The last option :point_up: is the most *robust*, meaning that it would still deliver the intended result if, say, you got an updated version of this data set with more years of data.
### `pivot_wider()`
The code below creates a new data set `rank_change` from `ranks_pivoted` with five different functions applied. After the `filter()` we discussed above, we would be left with two observations per breed, one for 2013 and 2020. But the desired structure for the dumbbell plot would have *one* observation per breed so that the 2013 and 2020 ranks could be plotted separately. That means it's time to pivot again---but this time in the *opposite* direction with `pivot_wider()` instead of `pivot_longer()`.
This `pivot_wider()` function has two arguments to determine where the variable names come from and where the values are from (`year` and `rank`, respectively). `mutate()` will give us a new `change` variable that calculates the difference, and we'll `filter` to only those breeds that were ranked within the top 50 in 2020. And lastly, another application of `slice_max()` will result in the top six breeds with the highest values of `change`.
```{r pivot-wider}
rank_change <- ranks_pivoted |>
filter(year == min(year) | year == max(year)) |>
pivot_wider(names_from = "year",
values_from = "rank") |>
mutate(change = `2013` - `2020`) |>
filter(`2020` <= 50) |>
slice_max(change, n = 6)
```
That leaves us with six observations, one per breed, with separate variables for the rank in 2013, the rank in 2020, and the change.
::: {style="font-size: 0.75em"}
```{r rank-change-kable}
#| echo: false
knitr::kable(rank_change) |>
kableExtra::kable_styling(full_width = FALSE)
```
:::
We can plot this data with a dumbbell plot, which are useful for showing the change between two points.
:::{.callout-note icon="false"}
## Note! :rotating_light:
To be honest, it's easier to make dumbbell plots with `ggalt::geom_dumbbell()`. But we can use `ggplot` to learn a) how to combine multiple geoms and b) how inherited `aes` works.
:::
We can build these in `ggplot` by adding multiple `geom_` layers. The first `ggplot()` function will contain only the mapped aesthetics that apply to *all* the `geom_` layers. Here, that's only `breed` on the y-axis. Individual `geom_` layers can also have their own aesthetics, in addition to what's inherited from the initial `ggplot(aes())`, as you can see below. One `geom_point` maps the `2013` variable, another maps the `2020` variable, and a `geom_segment` layer creates the horizontal line that connects the two dots.
::: {.panel-tabset}
### Basic
```{r dumbbell-basic}
#| fig-height: 3
#| fig-align: center
rank_change |>
ggplot(aes(y = breed)) +
geom_segment(aes(yend = breed, x = `2013`, xend = `2020`)) +
geom_point(aes(x = `2013`), color = "#c991b8", size = 3) +
geom_point(aes(x = `2020`), color = "#6A395B", size = 3) +
geom_label(aes(x = `2020`, label = `2020`), vjust = -0.5) +
geom_label(aes(x = `2013`, label = `2013`), vjust = -0.5)
```
### Custom
```{r dumbbell-custom}
#| code-fold: true
#| code-summary: "expand for full code"
#| fig-height: 4.5
#| fig-align: center
rank_change |>
# create a new variable to find the value of the middle
mutate(middle = `2020` + (change / 2)) |>
# order the breeds by their 2020 rank
ggplot(aes(y = reorder(breed, -`2020`))) +
geom_segment(aes(yend = reorder(breed, -`2020`), x = `2013`, xend = `2020`),
color = "grey20") +
geom_point(aes(x = `2013`), color = "#c991b8", size = 3) +
geom_point(aes(x = `2020`), color = "#6A395B", size = 3) +
geom_label(aes(x = `2020`, label = `2020`), family = "prompt", vjust = -0.5) +
geom_label(aes(x = `2013`, label = `2013`), family = "prompt", vjust = -0.5) +
# label the value of the change at the middle of the line
geom_text(aes(x = middle, label = str_c("+", change)), family = "prompt",
vjust = -0.75, size = 3.5) +
# flip the x-axis
scale_x_reverse() +
labs(x = "Popularity Ranking",
y = NULL,
title = "Dog breeds with the biggest jump in popularity from 2013",
subtitle = "Among the top 50 in 2020") +
theme_tidy_dog() +
theme(panel.grid.major.y = element_blank(),
plot.title.position = "plot")
```
:::
## Resources
I hope this workshop has been a helpful introduction to the tidyverse. (If you found it useful and/or you think someone *else* would find it useful, I'd appreciate a Twitter share---I'm [@MeghanMHall](https://twitter.com/meghanmhall)!) But this intro just scratches the surface of the tidyverse, which is a very powerful set of packages that can help you solve lots of problems. These are some of my favorite resources to learn more:
- [**R for Data Science**](https://r4ds.had.co.nz/)
- [**R for Excel users**](https://rstudio-conf-2020.github.io/r-for-excel/)
- [**STAT 545**](https://stat545.com/)
- [**RStudio Education**](https://education.rstudio.com/)
- [**Learn `tidyverse`**](https://www.tidyverse.org/learn/)