---
title: "Data transformation using dplyr (aka five verbs)"
author: "Taavi Päll"
date: "2021-09-16"
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

In our previous classes we worked with the small, cleaned-up **mpg** dataset to go through the steps of creating some of the most common visualisation types. 

In a data analysis workflow you need to plot your data at two points:

1. during **exploratory data analysis**, when you get to know your dataset, and 
2. during **reporting**, when you communicate what you have found. 

Importantly, data analysis is not a linear process but an iterative cycle of data transformation, modeling and visualisation. 


Here we add another step to the data analysis process: data transformation.

In most cases you need to transform your data during analysis, because in real life you rarely start with a dataset that is in the right form for visualisation and modeling. 


Usually you will need to:

- summarise your data, 
- create new variables, 
- rename variables, 
- reorder the observations. 


We are going to use the dplyr package from the tidyverse to learn how to carry out these tasks. 

## Sources

Again, we closely follow the "Data transformation" chapter of the R4DS book:

- The chapter is available from http://r4ds.had.co.nz/transform.html.
- More examples are available from https://rstats-tartu.github.io/lectures/tidyverse.html#dplyr-ja-selle-viis-verbi


Load tidyverse library and dataset:

```{r}
library(tidyverse)
library(lubridate) # work with dates and times
library(here) # locate files relative to the project folder; load after lubridate, which also had a (now deprecated) here() function
```

### COVID19 data 

Estonian COVID-19 test data were downloaded from the Estonian Health Board open data portal <https://www.terviseamet.ee/et/koroonaviirus/avaandmed> and contain positive and negative test results with test dates, including metadata about subject gender, age group, country, and county.
The whole dataset was downsampled and includes 5% of the original data.
The dataset was downloaded and prepared using the [get_data.R](scripts/get_data.R) script.

Let's import the covid_tests.csv file from the data subfolder.
As we gave our dataset a short, descriptive name during preprocessing, we can use the same name for the imported object to keep the code readable. 
Code is very difficult to follow when all objects are named "df1", "df2" etc. or "m", "m1" etc. 

```{r}
(covid_tests <- read_csv(here("data", "covid_tests.csv")))
```


`here()` uses heuristics to identify your project root directory and builds file paths relative to it:

```{r}
here("data", "covid_tests.csv")
```

To set up here, change (cd) to your project directory and run:
```{r}
here::set_here()
```


You could also just run:
```{r, eval=FALSE}
covid_tests <- read_csv("data/covid_tests.csv")
```


## dplyr basics

Most of the data transformation tasks can be carried out using five verbs from dplyr library:

- Pick observations by their values (filter()).
- Reorder the rows (arrange()).
- Pick variables by their names (select()).
- Create new variables with functions of existing variables (mutate()).
- Collapse many values down to a single summary (summarise()).

These can all be used in conjunction with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. 


These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:

- The first argument is a data frame.

- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

- The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result. Let's dive in and see how these verbs work.
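
To make the shared calling pattern concrete, here is a minimal sketch on a small made-up tibble. The object `toy_tests` and its values are hypothetical and only stand in for real data; each step takes a data frame as its first argument and returns a new one.

```{r}
library(dplyr)

# Hypothetical toy data, only for illustration
toy_tests <- tibble(
  county = c("Tartu", "Tartu", "Harju", "Harju", "Harju"),
  result = c("P", "N", "N", "P", "P"),
  age    = c(34, 51, 27, 68, 45)
)

# Every verb takes a data frame first and returns a new data frame
positives   <- filter(toy_tests, result == "P")
by_age      <- arrange(positives, age)
by_county   <- group_by(by_age, county)
pos_summary <- summarise(by_county, n_pos = n(), mean_age = mean(age))
pos_summary
```

Because the output of one step is the input of the next, simple steps can be stacked to build a more complex result.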


## Filter rows with filter()

filter() allows you to subset observations based on their values.

The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. 

For example, we can keep only positive test results (P):

```{r}
filter(covid_tests, ResultValue == "P")
```

dplyr runs the filtering operation and returns a new data frame. 

dplyr functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, `<-`, like so:

Assign positive tests to object pos_tests:
```{r}
pos_tests <- filter(covid_tests, ResultValue == "P")
pos_tests
```

### Comparisons

What is this == operator? Why not use = to check equality:

```{r, eval=FALSE}
filter(covid_tests, ResultValue = "P")
```

It turns out that = is another assignment operator besides `<-`, so inside filter() it is read as naming an argument and dplyr returns an error.

There's another common problem you might encounter when using ==: floating point numbers. 

Although these comparisons should theoretically be TRUE, they return FALSE!

```{r}
sqrt(2) ^ 2 == 2
1/49 * 49 == 1
```

This is because computers and R use finite precision arithmetic and cannot store an infinite number of digits.

This can be overcome by using near() function instead of ==:
```{r}
near(sqrt(2) ^ 2,  2)
near(1 / 49 * 49, 1)
```
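
The same issue shows up inside filter(). A small sketch with made-up numbers (the tibble `measurements` is hypothetical) compares exact matching with near():

```{r}
library(dplyr)

# Hypothetical values; the first one carries a tiny rounding error
measurements <- tibble(id = 1:3, value = c(sqrt(2)^2, 2, 1 / 49 * 49))

# Exact comparison misses the row where rounding error crept in
exact <- filter(measurements, value == 2)

# near() compares within a small tolerance, so both rows are kept
tolerant <- filter(measurements, near(value, 2))

exact$id
tolerant$id
```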

### Logical operators

Multiple comparisons within filter() are separated by a comma ",", which means "and" (&). 

With "and", all comparisons must evaluate to TRUE for an observation to be returned.

The logical (boolean) operators are:

- & is AND, 
- | is OR, 
- ! is NOT


The following code finds all tests from "Tartu maakond" OR "Harju maakond":

```{r}
filter(covid_tests, County == "Tartu maakond" | County == "Harju maakond")
```

You can't write something like `filter(covid_tests, County == "Tartu maakond" | "Harju maakond")`. With a character variable this fails with an error, but with numeric variables it silently gives you a wrong answer instead of an error, so be careful:

```{r, eval=FALSE}
filter(covid_tests, County == "Tartu maakond" | "Harju maakond")
```
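
To see why the numeric case is dangerous, here is a minimal sketch with a made-up vector `x`. R evaluates `x == 1` first and then combines the result with `2`, and any non-zero number counts as TRUE:

```{r}
x <- c(1, 2, 3)

# Intended: "x is 1 or 2". Actual: (x == 1) | 2, which is TRUE everywhere
wrong <- x == 1 | 2

# What was meant: repeat the comparison on both sides of |
right <- x == 1 | x == 2

wrong
right
```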

A useful shorthand for this problem is x %in% y. This will select every row where x is one of the values in y.

Filter observations where County is either "Tartu maakond" or "Harju maakond":
```{r}
filter(covid_tests, County %in% c("Tartu maakond", "Harju maakond"))
```

What happens under the hood:
```{r, eval=FALSE}
head(covid_tests$County) %in% c("Tartu maakond", "Harju maakond")
```

Remember that:

- !(x & y) is the same as !x | !y  
- !(x | y) is the same as !x & !y  
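
These two rules can be checked directly on all four combinations of TRUE and FALSE. The vectors `x` and `y` below are made up for the check:

```{r}
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

# Both comparisons cover every combination of x and y
identical(!(x & y), !x | !y)
identical(!(x | y), !x & !y)
```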


For example, if you wanted to find tests that are not from the two oldest age groups, "80-84" and "üle 85", you could use any of the following filters:

```{r}
age_groups <- filter(covid_tests, !(AgeGroup == "80-84" | AgeGroup == "üle 85"))
```


```{r}
unique(age_groups$AgeGroup)
```

```{r}
# TRUE if neither of the excluded age groups is left in the result
!any(c("80-84", "üle 85") %in% age_groups$AgeGroup)
```


The previous filter can be rewritten like so:
```{r}
filter(covid_tests, !(AgeGroup == "80-84") , !(AgeGroup == "üle 85"))
```

OR

```{r}
filter(covid_tests, AgeGroup != "80-84" , AgeGroup != "üle 85")
```

Which of these three is the most explicit? 

Note that a comma between logical expressions in filter() means "&".

"Greater than or equal to", "less than or equal to" and "not equal to":
```{r}
3 >= c(2, 3, 4)
3 <= c(2, 3, 4)
3 != c(2, 3, 4)
```


### Missing values

One important feature of R that can make comparisons tricky is missing values, or "NA"s ("not availables"). 

NA represents an unknown value, so missing values are "contagious": almost any operation involving an unknown value will also be unknown.

```{r}
NA > 5
10 == NA
NA + 10
NA / 2
```

As RStudio might already suggest, if you want to determine whether a value is missing, use is.na():
```{r}
x <- NA
is.na(x)
```

Let's use is.na() within filter to filter rows with missing "AgeGroup":
```{r}
filter(covid_tests, is.na(AgeGroup))
```

OK, now we have all rows with missing "AgeGroup". How would you change this code to exclude the rows with missing data instead (Hint: !FALSE)?
```{r}
filter(covid_tests, !is.na(AgeGroup))
```
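
Missing values also matter inside filter() itself: it keeps only rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped together with the FALSE ones. A small sketch on a made-up tibble (`df_na` is hypothetical):

```{r}
library(dplyr)

df_na <- tibble(x = c(1, NA, 3))

# The NA row is dropped because NA > 1 is NA, not TRUE
gt_one <- filter(df_na, x > 1)

# Ask for missing values explicitly if you want to keep them
gt_one_or_na <- filter(df_na, is.na(x) | x > 1)

gt_one
gt_one_or_na
```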


There are other functions that remove rows with NA in any of the columns from your data frame and keep only "complete cases":

```{r}
na.exclude(covid_tests)
na.omit(covid_tests)
drop_na(covid_tests)
```


```{r}
drop_na(covid_tests)
```


```{r}
drop_na(covid_tests, AgeGroup)
```


### Finding non-exact matches

Often you need to filter categorical variables using non-exact matching, for example when 

- values are too long,   
- there are too many unique unknown values,   
- some observations that belong to the same category have slightly different values (`foo bar` and `foo-bar` and `foobar`). 

To solve this problem more elegantly, you can use regular expressions (for help see ?`regular expression`).

In the *tidyverse*, regular expressions are handled by the **stringr** package. 


The stringr function that is useful within filter() is str_detect().


```{r}
library(stringr)
fruits <- c("banana", "foo TRUMP bar", "foo bar", "foo-bar", "foobar", "foo")
str_detect(fruits, "foo(.+)?bar")
```

```{r}
str_which(fruits, "foo(.*)?bar")
```

```{r}
fruits[str_detect(fruits, "foo(.*)?bar")]
```


As you can see, str_detect() returns a logical vector.

So how does this work when you run str_detect() inside filter()?

Tests from "Harju" county:
```{r}
filter(covid_tests, str_detect(County, "Harju"))
```
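
The "slightly different values" case from the list above can be sketched the same way. The tibble `toy_labels` is made up, and the pattern is just one possible way to write it: "foo", then an optional space or hyphen, then "bar".

```{r}
library(dplyr)
library(stringr)

# Hypothetical labels with inconsistent spelling
toy_labels <- tibble(label = c("foo bar", "foo-bar", "foobar", "foo", "banana"))

# filter() keeps the rows where str_detect() returns TRUE
foo_bar <- filter(toy_labels, str_detect(label, "^foo[ -]?bar$"))
foo_bar
```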


### Exercises

1. Find all tests that 

- were done on 1 Aug 2021 or later:

```{r}
aug <- filter(covid_tests, StatisticsDate >= "2021-08-01")
min(aug$StatisticsDate)
max(aug$StatisticsDate)
```

- belong to AgeGroup *0-4*

```{r}
ageg <- filter(covid_tests, AgeGroup == "0-4")
unique(ageg$AgeGroup)
sum(is.na(ageg$AgeGroup))
```


- were released between 1 Jan 2021 and 1 Mar 2021, inclusive:

```{r}
betw <- filter(covid_tests, StatisticsDate >= "2021-01-01", StatisticsDate <= "2021-03-01")
min(betw$StatisticsDate)
max(betw$StatisticsDate)
n_distinct(betw$StatisticsDate)
```


2. There is also a between() function in dplyr. What does it do? How can you use it to find tests done in 2021 between weeks 8 and 10? Find the "wk" and "yr" variables in the data.
```{r}
filter(covid_tests, yr == 2021, between(wk, 8, 10))
```
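
To check what between() does, compare it with the two explicit comparisons on a made-up vector of week numbers (`toy_wk` is hypothetical). Both boundaries are included:

```{r}
library(dplyr)

toy_wk <- c(7, 8, 9, 10, 11)

# between() is a shortcut for x >= left & x <= right
between(toy_wk, 8, 10)
toy_wk >= 8 & toy_wk <= 10
```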


## Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order.

It takes a data frame and a set of column names to order by. 

If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

```{r}
arrange(covid_tests, StatisticsDate)
```

Use desc() to re-order by a column in descending order. You can combine variables:
```{r}
arrange(covid_tests, AgeGroup, desc(StatisticsDate))
```
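
Tie-breaking is easier to see on a tiny made-up table (`toy_order` is hypothetical). Rows are sorted by `group` first, and only rows that tie on `group` are ordered by descending `value`:

```{r}
library(dplyr)

toy_order <- tibble(
  group = c("b", "a", "b", "a"),
  value = c(1, 2, 3, 4)
)

# The second column only matters within equal values of the first
sorted <- arrange(toy_order, group, desc(value))
sorted
```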

Missing values are always sorted at the end, even with desc() function:
```{r}
df <- tibble(x = c(5, NA, 2))
arrange(df, x)
```

```{r}
arrange(df, desc(x))
```


### Get unique rows with distinct()

Sometimes observations become duplicated during data wrangling, and sometimes you need the unique combinations of some variables. 
dplyr has the distinct() function to retain only unique rows from the input table.


The covid_tests data has `r nrow(covid_tests)` rows, but what if we would like to get only the unique wk and yr combinations?  

First, let's select the yr and wk columns:
```{r}
yrwk <- select(covid_tests, yr, wk)
yrwk
```

Here we are, with the distinct yr and wk combinations:
```{r}
distinct(yrwk)
```

distinct() works on tables
```{r}
?distinct
```

unique() function from base R works on vectors
```{r}
x <- rep(letters[1:3], each = 3) # simulate character vector x with repetitions
x
unique(x) # select unique set of values from x
```
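
distinct() can also take column names directly, which skips the separate select() step. A minimal sketch on made-up data (`toy_weeks` is hypothetical):

```{r}
library(dplyr)

toy_weeks <- tibble(
  yr     = c(2020, 2020, 2021, 2021, 2021),
  wk     = c(52, 52, 1, 1, 2),
  result = c("P", "N", "N", "N", "P")
)

# Unique yr and wk combinations; other columns are dropped
uniq_weeks <- distinct(toy_weeks, yr, wk)

# .keep_all = TRUE keeps the other columns, using the first row of each combination
first_rows <- distinct(toy_weeks, yr, wk, .keep_all = TRUE)

uniq_weeks
first_rows
```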


### Exercises

1. How could you use arrange() to sort covid_tests with missing County values to the start? (Hint: use is.na()).

```{r}

```


2. Sort covid_tests to find most recent tests:

```{r}

```


## Select columns with select()

select() allows you to rapidly zoom in on a useful subset of columns using operations based on the names of the variables.

Select first three columns:
```{r}
select(covid_tests, 1:3)
```

Select columns from Gender to ResultValue and other way around:
```{r}
select(covid_tests, Gender:ResultValue)
select(covid_tests, ResultValue:Gender)
```

Note that select() ranges work in both directions: left to right and right to left!


Exclude column Country:
```{r}
select(covid_tests, -Country)
```

Another way is to exclude a character vector of quoted variable names, wrapped in all_of():
```{r}
vars_out <- c("Country", "yr", "wk")
select(covid_tests, -all_of(vars_out))
```


> Use the minus sign to exclude variables! 
> Wrap a character vector of variable names in all_of() when passing it to select()!
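
When column names are stored in a character vector, the tidyselect helpers all_of() and any_of() differ in how they treat names that are missing from the data. A small sketch on a made-up one-row tibble (`toy_cols`, `drop_these` and the column "not_a_column" are hypothetical):

```{r}
library(dplyr)

toy_cols   <- tibble(Country = "EE", yr = 2021, wk = 8, ResultValue = "P")
drop_these <- c("Country", "yr", "not_a_column")

# any_of() silently ignores names that are not in the data
kept <- names(select(toy_cols, -any_of(drop_these)))
kept

# all_of() is strict and fails if any name is missing
strict <- try(select(toy_cols, -all_of(drop_these)), silent = TRUE)
inherits(strict, "try-error")
```

The strict version is useful when a missing column signals a mistake; the lenient one is handy when the same vector is reused across tables with slightly different columns.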


There are a number of __helper functions you can use within select()__:

- starts_with("abc"): matches names that begin with "abc".

Select columns that start with "bill"
```{r}
library(palmerpenguins)
select(penguins, starts_with("bill"))
```


- ends_with("xyz"): matches names that end with "xyz".


Select columns that end with "mm"
```{r}
select(penguins, ends_with("mm"))
```


- contains("ijk"): matches names that contain "ijk".

All columns that contain word "length"
```{r}
select(penguins, contains("length"))
```


- matches("(.)\\1"): selects variables that match a regular expression. 

This one matches any variables that contain repeated characters. You'll learn more about regular expressions in the strings chapter.

```{r, eval = FALSE}
matches("^abc") # same as starts_with("abc")
matches("xyz$") # same as ends_with("xyz")
matches("ijk") # same as contains("ijk")
```

Select columns/variables with bill measurements: 
```{r}
select(penguins, matches("bill.*mm"))
```


- num_range("V", 1:3) matches V1, V2 and V3.


- everything() is useful if you have a handful of variables you'd like to move to the start of the data frame.

You can rearrange the order of columns. Let's move the date and test result to the front and keep all other columns:
```{r}
select(covid_tests, StatisticsDate, ResultValue, everything())
```
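
Neither example dataset has numbered columns, so here is a small made-up table to try the num_range() helper from the list above (`toy_wide` and its column names are hypothetical):

```{r}
library(dplyr)

toy_wide <- tibble(id = 1, V1 = 10, V2 = 20, V3 = 30, V10 = 100)

# Matches the columns V1, V2 and V3, but not V10
v_cols <- names(select(toy_wide, num_range("V", 1:3)))
v_cols
```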


See ?select for more details.


### Exercises

1. What happens if you include the name of a variable (e.g. ResultValue) multiple times in a select() call?

```{r}

```


2. What does the one_of() function do? 

Why might it be helpful in conjunction with this vector?
```{r}
(vars <- c("StatisticsDate", "ResultValue", "flipper_length_mm"))
select(covid_tests, one_of(vars))
```

What happens if you try to select columns just by using this vector:

```{r, eval=FALSE}
select(covid_tests, c("StatisticsDate", "ResultValue", "flipper_length_mm"))
```

3. Select all variables from 'penguins' dataset that contain string 'FLIPPER' (note case!). 

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

```{r}
?select
select(penguins, contains("FLIPPER", ignore.case = TRUE))
```