tidyverse_beauty_of_across4.Rmd

# tidyverse中的across()之美4 {#tidyverse-beauty-of-across4}

本章讲讲`mutate()`中的`across()`与`c_acorss()`、`map_df()`、`map2_dfc()`系列的纠缠。

内容涉及迭代、泛函、返回数据框、**数据框并入**等概念。


```{r, message=FALSE, warning=FALSE}
library(tidyverse)

penguins <- palmerpenguins::penguins %>% drop_na()
```


## `mutate()` 数据框并入

在以往的学习中，我们了解到`mutate()`的功能是新增一列，

```{r}
d <- tibble(
  x = 3:5
)

d %>% 
  mutate(x1 = x*2)
```


有时候我在想同时新增多列呢？


```{r}
d %>% 
  mutate(
    x1 = x*2,
    x2 = x^2
  )
```

从形式上看，相当于在原来d的基础上**并入**了一个新的数据框。或许`mutate()`和我们认识的不一样， 于是我试试
```{r}
t <- tibble(
  y = 4:6
)

d %>% 
  mutate(t)
```

奥利给。我们再看看，让数据框是以函数的形式返回


```{r}
my_fun <- function(x) {
  tibble(
    x1 = x * 2, 
    x2 = x^2)
}

d %>%
  mutate(
    my_fun(x)
  )
```


是不是很惊喜。 我们再看看`across()`函数，在`mutate()` 中 `across()` 返回的就是数据框，正好并入原来d，道理是一样的。

```{r}
d %>%
  mutate(
    across(x, list(f1 = ~ .x * 2, f2 = ~ .x^2))
  )
```


事实上，`across()`的**返回数据框**的特性，结合`mutate()`的**并入数据框**功能，让数据处理如鱼得水、如虎添翼。


## 从一个问题开始

计算每天水分和食物的所占比例，比如第一天water和food都是10.0，那么各自比例都是50%.
```{r beauty-of-across4-2}
d <- tibble::tribble(
  ~water, ~food,
  10.0,   10.0,
  12.1,   10.3,
  13.5,   19.1,
  17.4,   16.0,
  25.8,   15.6,
  27.4,   19.8
)
d
```

## 传统的方法

传统的方法是，把数据框旋转成长表格，计算所占比例后，再旋转回来
```{r beauty-of-across4-3}
d %>%
  rownames_to_column() %>%
  pivot_longer(
    cols = !rowname
  ) %>%
  group_by(rowname) %>%
  mutate(
    percent = 100 * value / sum(value)
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = name,
    values_from = c(value, percent),
    names_glue = "{name}_{.value}"
  )
```

## across()的方法

传统的方法，用到基本的dplyr函数，思路很清晰，但有点周折。下面，我列出几个比较新颖的方法，当然这些方法都来源于强大`across()`函数

### 方法1
```{r beauty-of-across4-4}
d %>%
  mutate(100 * across(.names = "%{.col}") / rowSums(across())) %>%
  ungroup()
```


### 方法2
```{r beauty-of-across4-5}
rowPercent <- function(df) {
  df / rowSums(df) * 100
}

d %>%
  mutate(rowPercent(across(.names = "%{.col}")))
```


### 方法3

```{r, eval=FALSE}
d %>% 
  rowwise() %>% 
  mutate(
    across(everything(), ~ .x / sum(c_across()) )
  )


df %>% 
  rowwise() %>% 
  mutate(
    across(everything(), .names = "prop_{.col}", ~ .x / sum(c_across())  )
  )


df %>% 
  rowwise() %>% 
  mutate(
    across(.names = "prop_{.col}", .fns =  ~ .x / sum(c_across())  )
  )
```


```{r beauty-of-across4-6}
d %>%
  rowwise() %>%
  mutate(100 * across(.names = "%{.col}") / sum(c_across())) %>%
  ungroup()
```


### 方法4
```{r beauty-of-across4-7}
scale <- function(x) {
  100 * x / sum(x, na.rm = TRUE)
}

d %>%
  rowwise() %>%
  mutate(
    scale(across(.names = "%{.col}"))
  )
```


### 方法5
```{r beauty-of-across4-8}
d %>%
  rowwise() %>% 
  mutate(100 * proportions(across(.names = "%{.col}")))
```

### 方法6
更好的呈现比例

```{r beauty-of-across4-8-0}
d %>%
  rowwise() %>% 
  mutate(
   across(c(water, food), ~scales::label_percent(scale = 100)(.x /sum(c_across())), .names = "{.col}_%")
  )
```


上面的方法虽然很多，但基本思路是一样的。


## 纠缠不清的迭代

我们先弄清楚迭代方向：

- `rowwise()` 一行一行的处理
- `across()`  一列一列的处理
- `rowwise() + across()` 这种组合，双重迭代，(一行一行 `+` 一列一列)就变成了一个一个的处理
- `across() + purrr::map_dbl()`这种组合分两种情形： 
    - `purrr::map_dbl()` 作为`across( .fns = )` 中的函数，即`across(.cols = , .fns = map_dbl() )`。`across()`一列一列的迭代，每一列又传入`purrr::map_dbl()`再次迭代，因此这里是双重迭代
    - `across()`作为`purrr::map_df(.x = )`的数据，即`purrr::map_df(.x = across(), .f =  )`。因为在`mutate()`中`across()`返回数据框，因此可以把`across()`整体视为**数据框**，然后这个数据框传入`purrr::map_df(.x = )`进行迭代，因此这种情形可以认为只有`purrr::map_*()`一次迭代。


```{r}
# rowwise() + across()
# rowwise() 设定行方向后，接着across() 就行方向上的元素一个一个的执行.fns
# 循环模式：第一层，一行一行的，第二层在每一行里，一个元素到一个元素


penguins %>% 
  group_by(species, year) %>% 
  summarise(flipper_length_mm = list(flipper_length_mm)) %>% 
  ungroup() %>% 
  pivot_wider(
    names_from = year,
    values_from = flipper_length_mm
  ) %>% 
  rowwise() %>% 
  mutate(
    across(where(is.list), .fns = length)
  )
```


```{r}
# across(.cols = ,     .fns = purrr::map_dbl()  )
# 用across()就是一列一列的处理，
# 此时的一列是vector or list，又可以进入purrr::map_dbl()再次迭代，对这一列的每个元素，执行.f
# 然后across()到下一列
# 循环模式：第一层，一列一列，第二层在每一列里，一个元素到一个元素

penguins %>% 
  group_by(species, year) %>% 
  summarise(flipper_length_mm = list(flipper_length_mm)) %>% 
  ungroup() %>% 
  pivot_wider(
    names_from = year,
    values_from = flipper_length_mm
  ) %>% 
  mutate(
    across(where(is.list), ~ purrr::map_dbl(.x, length))
  )
```

```{r}
# `purrr::map_df(.x = across(),  .f = )`
# mutate()中的`across()`整体被视为**数据框**，传入purrr::map_df(.x = across(),    .f = )，然后迭代，返回数据框最后并入最初的df


penguins %>%
  select(species, starts_with("bill_")) %>% 
  head(5) %>% 
  
  mutate(
    map_dfc(
      .x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
      .f = ~.x^2
    )
  )
```


写成分步的形式，可能更好理解
```{r}
penguins %>%
  select(species, starts_with("bill_")) %>% 
  head(5) %>% 
  
  mutate({
    data <- across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}')
    out <- map_dfc(data, .f = ~.x^2)
    out
  })
```

再回头看`across()`的常规用法，是否对它有了新的认识？
```{r}
penguins %>%
  select(species, starts_with("bill_")) %>% 
  head(5) %>% 
  
  mutate(
    across(ends_with("_mm"), .fns = ~.x^2, .names = '{sub("_mm", "", .col)}')
  )
```


### 案例1
觉得不过瘾，我们看下面复杂点的例子

```{r}
tt <- penguins %>% 
  group_by(species, year) %>% 
  summarise(
    across(c(bill_length_mm, bill_depth_mm), list) 
  ) %>% 
  ungroup() 
tt
```

```{r}
tt %>%
  mutate(
    map_dfc(
      .x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
      .f = ~ map_dbl(.x, length)
    )
  )
```


分步写法

```{r}
tt %>%
  mutate({
    data <- across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}')
    out <- map_dfc(data, .f = ~ map_dbl(.x, length))
    out
  })
```

回到常规写法
```{r}
tt %>%
  mutate(
    across(ends_with("_mm"), .fns = ~ map_dbl(.x, length), .names = '{sub("_mm", "", .col)}')
  )
```

### 案例2

更变态的案例

```{r}
tt %>%
  mutate(
    purrr::map2_dfr(
      .x = across(bill_length_mm, .names = "cor"),
      .y = across(bill_depth_mm),
      .f = ~ map2_dbl(.x, .y, cor)
    )
  )
```


```{r}
tt %>%
  mutate(
    purrr::map2_dfr(
      .x = across(bill_length_mm, .names = "cor"),
      .y = across(bill_depth_mm),
      .f = ~ map2_dbl(.x, .y, cor)
    )
  )
```

分步写法
```{r}
tt %>%
  mutate({
    data1 <- across(bill_length_mm, .names = "cor")
    data2 <- across(bill_depth_mm)
    out <- purrr::map2_dfc(data1, data2, .f = ~ map2_dbl(.x, .y, cor))
    out
  })

```


常规方法
```{r}
tt %>% 
  rowwise() %>% 
  mutate(
    cor = cor(bill_length_mm, bill_depth_mm)
  )
```
我们这样折腾只是为了展示各种迭代.


## 习题

### 习题1

对于[数据](https://github.com/tidyverse/dplyr/issues/6109)
```{r}
df <- tibble(
  id = 1:10,
  sex = c("m", "m", "m", "f", "f", "f", "m", "f", "f", "m"),
  lds1.x = c(NA, 1, 0, 1, NA, 0, 0, NA, 0, 1),
  lds1.y = c(1, NA, 1, 1, 0, NA, 0, 3, NA, 1),
  lds2.x = c(2, 1, NA, 0, 0, NA, 1, NA, NA, 1),
  lds2.y = c(0, 2, 2, NA, NA, 0, 0, 3, 0, NA)
)
df
```

希望两两`coalesce`，比如，

```{r, eval=FALSE}
df %>%
  mutate(
    lds1 = coalesce(lds1.x, lds1.y),
    lds2 = coalesce(lds2.x, lds2.y)
  )
```

但要求是用`across()`写。


解题思路：

- 在`mutate()`中，把`across()`整体当作**数据框**用，比如
```{r, eval=FALSE}
df %>%
  mutate(
    across(ends_with(".x"))
  )


df %>%
  mutate(
    across(ends_with(".x"), .names = '{sub(".x","",.col)}')
  )
```

- 在`mutate()`中`across()`视为数据框，传递给`map_dfc()`后，`map_dfc()`将其转换成一个新的数据框，这个新的数据框最后并入`df` 
```{r, eval=FALSE}
df %>%
  mutate(
    map_dfc(
      .x = across(ends_with(".x"), .names = '{sub(".x","", .col)}'),
      .f = ~is.na(.x)
    )
  )
```


- 两个`across()`对应两个**数据框**，传递给`map2_dfc()`函数
```{r}
df %>%
  mutate(
    map2_dfr(
      .x = across(ends_with(".x"), .names = '{sub(".x","",.col)}'),
      .y = across(ends_with(".y")),
      .f = coalesce               # Vectors coalesce
    )
  )
```

- 分步写，更清晰和优雅。迭代过程：数据框df1的第一列和数据框data2的第一列coalesce，然后数据框df1的第二列和数据框df2的第二列coalesce.

```{r}
df %>%
  mutate({
    df1 <- across(ends_with(".x"), .names = '{sub(".x","",.col)}')
    df2 <- across(ends_with(".y"))
    out <- purrr::map2_dfc(df1, df2, ~ coalesce(.x, .y)) 
    out
  })

```


### 习题2

题目：如果符合某个条件，就让指定的列反号。比如，如果x小于4，x和y两列就反号。

事实上，完成这个任务的方法很多，我们只是演示`across()`的某些特征。

```{r}
d <- tibble( x = 1:4, y = 1:4)
d
```


```{r}
# using data frame returns
d %>% 
  mutate({
    test <- x < 4
    x[test] <- -x[test]
    y[test] <- -y[test]
    data.frame(x = x, y = y)
  })
```


```{r}
# using across()
d %>% 
  mutate({
    test <- x < 4
    across(c(x, y), ~ {.x[test] <- -.x[test]; .x })
  })
```


```{r}
# further abstract
negate_if <- function(condition, cols) {
  across({{ cols }}, ~ {
    .x[condition] <- -.x[condition]
    .x
  })
}
d %>% 
  mutate(negate_if(x < 4, c(x, y)))
```


```{r beauty-of-across4-98, echo = F}
# remove the objects
# ls() %>% stringr::str_flatten(collapse = ", ")

rm(d, scale, rowPercent, penguins, negate_if, df, tt)
```


```{r beauty-of-across4-99, echo = F, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```