In our previous classes we worked with small, cleaned-up datasets to go through the steps of creating some of the most common visualization types.
In your workflow you are going to need data visualization at two points: during exploratory data analysis, when you get to know your dataset, and during report preparation, when you try to communicate what you have found. This is not a two-stop trip; it's more like a roundabout, an iterative process in which you pass these two points multiple times after you have done some "tweaking" of your data. By "tweaking" I mean data transformation and/or modeling.
You need to transform your data during analysis, because in real life you rarely start with a dataset that is in the right form for visualization and modeling. So, often you will need to:
- summarise your data,
- create new variables,
- rename variables, or
- reorder the observations.
We are going to use the dplyr package from the tidyverse to learn how to carry out these tasks.
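To make the four tasks listed above concrete before we turn to the real data, here is a minimal sketch that maps each one onto a dplyr verb. The `toy_tests` table and its values are made up for illustration; only the column names mirror the dataset introduced later.

```r
library(dplyr)

# Toy data, made up for illustration only (not real test results)
toy_tests <- tibble(
  Gender = c("M", "N", "N", "M"),
  ResultValue = c("N", "P", "N", "N"),
  StatisticsDate = as.Date(c("2021-02-13", "2020-09-16", "2021-07-12", "2020-12-22"))
)

# Summarise: collapse all rows into a single summary row
toy_summary <- summarise(
  toy_tests,
  n_tests = n(),
  n_positive = sum(ResultValue == "P")
)

# Create a new variable from an existing one
toy_flagged <- mutate(toy_tests, positive = ResultValue == "P")

# Rename a variable (new_name = old_name)
toy_renamed <- rename(toy_tests, result = ResultValue)

# Reorder the observations by date, oldest first
toy_sorted <- arrange(toy_tests, StatisticsDate)
```

Each verb takes a data frame as its first argument and returns a new data frame, so the original `toy_tests` is left unchanged.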
Again, we closely follow the R4DS book, chapter "Data transformation", available from http://r4ds.had.co.nz/transform.html. More examples are available from https://rstats-tartu.github.io/lectures/tidyverse.html#dplyr-ja-selle-viis-verbi
The Estonian COVID-19 test data was downloaded from the Estonian Health Board open data portal https://www.terviseamet.ee/et/koroonaviirus/avaandmed and contains positive and negative test results with test dates, including metadata about subject gender, age group, country, and county. The whole dataset was sampled down, so the file used here includes 5% of the original data. The dataset was downloaded and prepared using the get_data.R script.
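The contents of get_data.R are not shown here, so the following is only a sketch of how a 5% random sample could be drawn with dplyr; it is not necessarily what the script does. The `toy_full` table is a hypothetical stand-in for the full downloaded data.

```r
library(dplyr)

# Hypothetical stand-in for the full downloaded table (200 made-up rows)
toy_full <- tibble(
  id = 1:200,
  ResultValue = rep(c("N", "N", "N", "P"), times = 50)
)

set.seed(1)  # fix the random seed so the sample is reproducible

# Keep a random 5% of the rows
toy_sample <- slice_sample(toy_full, prop = 0.05)
nrow(toy_sample)
```

Sampling rows at random keeps the proportions of groups roughly intact, but rare categories can end up with very few rows or none at all.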
(covid_tests <- readr::read_csv("https://raw.githubusercontent.com/rstats-tartu/transform-data-with-dplyr/main/data/covid_tests.csv"))
#> Rows: 91087 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (5): Gender, AgeGroup, Country, County, ResultValue
#> dbl (2): wk, yr
#> date (1): StatisticsDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 91,087 × 8
#> Gender AgeGroup Country County ResultValue StatisticsDate wk yr
#> <chr> <chr> <chr> <chr> <chr> <date> <dbl> <dbl>
#> 1 M 50-54 Eesti Harju maakond N 2020-09-16 38 2020
#> 2 M 45-49 Eesti Põlva maakond N 2021-02-13 7 2021
#> 3 N 40-44 Eesti Jõgeva maako… N 2020-12-22 51 2020
#> 4 M 80-84 Eesti Rapla maakond N 2021-03-15 11 2021
#> 5 M <NA> Tundmatu <NA> N 2021-07-11 28 2021
#> 6 N 30-34 Eesti Harju maakond N 2021-01-05 1 2021
#> 7 N 10-14 Eesti Harju maakond N 2021-07-01 26 2021
#> 8 N 70-74 Eesti Harju maakond P 2021-07-12 28 2021
#> 9 M 60-64 Eesti Tartu maakond N 2021-06-09 23 2021
#> 10 N 35-39 Eesti Harju maakond N 2020-11-18 47 2020
#> # … with 91,077 more rows
Created on 2021-09-14 by the reprex package (v2.0.1)