Update collapse categories
Cghlewis committed Apr 10, 2024
1 parent 2abffe4 commit 0c94ea7
Showing 2 changed files with 306 additions and 235 deletions.
254 changes: 147 additions & 107 deletions recode-values/collapse-categories.Rmd
@@ -40,25 +40,36 @@ d6

Collapse all spellings of Google Meet into one uniform category

* Note: We are recoding into a new variable using `dplyr::mutate()` and giving it a different name from the original. This keeps both the old and new versions of the variable.

* Note: Using `stringr::str_detect()` we can detect different spellings of Google Meet. The pattern "google|meet" matches any value in online_platform that contains "google" OR "meet", and those values are recoded to "Google Meet".

```{r}
d6 %>%
dplyr::mutate(online_platform_new=dplyr::if_else(
stringr::str_detect(online_platform, "google|meet"), "Google Meet", online_platform))
d6 %>%
dplyr::mutate(
online_platform_new =
dplyr::if_else(
stringr::str_detect(online_platform, "google|meet"),
"Google Meet",
online_platform
)
)
```
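
To see which values the pattern catches, here is a minimal sketch on a made-up vector (`platforms` is hypothetical and is not the contents of `d6`). It suggests that `stringr::str_detect()` matches substrings, is case sensitive by default, and returns `NA` for missing values:

```r
library(stringr)

# Hypothetical values for illustration only; not the d6 data
platforms <- c("google", "meet", "google meet", "Google Meet", "zoom", NA)

# "google|meet" matches any string that contains "google" OR "meet"
detected <- str_detect(platforms, "google|meet")
detected
# TRUE TRUE TRUE FALSE FALSE NA
```

"Google Meet" is not detected because the match is case sensitive, so `dplyr::if_else()` leaves it unchanged, which happens to be the desired spelling already.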

We can also do this using `%in%` instead of `stringr::str_detect()`. Note that `%in%` matches whole values only, so every spelling has to be listed.

```{r}
d6 %>%
dplyr::mutate(online_platform_new=dplyr::if_else(
online_platform %in% c("google","meet", "Google Meet"), "Google Meet", online_platform))
d6 %>%
dplyr::mutate(
online_platform_new =
dplyr::if_else(
online_platform %in% c("google", "meet", "Google Meet"),
"Google Meet",
online_platform
)
)
```
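
The difference between the two approaches can be sketched on a made-up vector (`platforms` is hypothetical, not `d6`). Because `%in%` compares whole values, a spelling that is not listed, such as "google meet", is left as it is:

```r
library(dplyr)

# Hypothetical values for illustration only; not the d6 data
platforms <- c("google", "meet", "google meet", "zoom", NA)

# %in% is an exact, whole-value comparison and returns FALSE (not NA) for NA
exact_match <- platforms %in% c("google", "meet", "Google Meet")
exact_match
# TRUE TRUE FALSE FALSE FALSE

# Unlisted spellings and NA fall through to the original value
recoded <- if_else(exact_match, "Google Meet", platforms)
recoded
# "Google Meet" "Google Meet" "google meet" "zoom" NA
```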

@@ -81,62 +92,38 @@ Collapse all spellings of Google Meet into one uniform category

```{r}
d12 %>%
dplyr::mutate(online_platform_new=dplyr::if_else(
stringr::str_detect(online_platform, stringr::regex("google|meet", ignore_case = TRUE)), "Google Meet", online_platform))
d12 %>%
dplyr::mutate(
online_platform_new =
dplyr::if_else(
stringr::str_detect(
online_platform,
stringr::regex("google|meet", ignore_case = TRUE)
),
"Google Meet",
online_platform
)
)
```

We can also do this by adding (?i) to the start of the pattern, which makes the match case-insensitive.

```{r}
d12 %>%
dplyr::mutate(online_platform_new=dplyr::if_else(
stringr::str_detect(online_platform, "(?i)google|meet"), "Google Meet", online_platform))
```


**3\. Collapse a numeric variable (Var2) into discrete character categories**

Review the data (d8)

```{r, echo=FALSE}
source("data.R")
d8
d12 %>%
dplyr::mutate(
online_platform_new = dplyr::if_else(
stringr::str_detect(online_platform, "(?i)google|meet"),
"Google Meet",
online_platform
)
)
```
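
As a quick check, the two case-insensitive approaches can be compared side by side. This is a sketch on a hypothetical vector (`platforms` is made up, not `d12`), and both forms are expected to flag the same values:

```r
library(stringr)

# Hypothetical values with mixed capitalization; not the d12 data
platforms <- c("Google", "MEET", "google meet", "zoom", NA)

# Option 1: wrap the pattern in regex() and set ignore_case = TRUE
with_regex <- str_detect(platforms, regex("google|meet", ignore_case = TRUE))

# Option 2: put the inline flag (?i) at the start of the pattern
with_inline_flag <- str_detect(platforms, "(?i)google|meet")

# Both approaches should agree on every element
identical(with_regex, with_inline_flag)
```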

Collapse Var2 into low, medium, high categories.

* Note: We are using a nested if-else statement here to meet multiple criteria and multiple outputs

* Note: We use the & operator to say if a value meets both conditions

* Note: We cannot use `dplyr::between` because our left side is not >=

```{r}
d8 %>%
dplyr::mutate(Var2_new = dplyr::if_else(Var2 >14, "high",
dplyr::if_else((Var2 >6 & Var2 <=14), "medium", "low")))
```

* Note: Because `dplyr::if_else()` evaluates in order, we don't actually have to add the right side of our "medium" logic because all of the highs have been assigned and will not be written over by the mediums

```{r}
d8 %>%
dplyr::mutate(Var2_new = dplyr::if_else(Var2 >14, "high",
dplyr::if_else((Var2 >6), "medium", "low")))
```


**4\. Collapse a numeric variable (Var2) into discrete numeric categories**
**3\. Collapse a numeric variable (Var2) into discrete numeric categories**

Review the data (d10)

@@ -153,7 +140,7 @@ Create an indicator variable. 1 = 8 or lower, 0 = 9 or higher

```{r}
d10 %>%
d10 %>%
dplyr::mutate(Var2_new = dplyr::if_else(Var2 <= 8, 1, 0))
```
@@ -162,17 +149,19 @@ d10 %>%

```{r}
d10 %>%
dplyr::mutate(Var2_new = dplyr::if_else(Var2 <= 8 | is.na(Var2), 1, 0))
d10 %>%
dplyr::mutate(Var2_new = dplyr::if_else(Var2 <= 8 |
is.na(Var2), 1, 0))
```

You could also handle NA explicitly from the other direction, by flipping the condition:

```{r}
d10 %>%
dplyr::mutate(Var2_new = dplyr::if_else(Var2 >8 & !is.na(Var2), 0, 1))
d10 %>%
dplyr::mutate(Var2_new = dplyr::if_else(Var2 > 8 &
!is.na(Var2), 0, 1))
```
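
Both versions rely on how R combines `NA` with `|` and `&`. A minimal sketch on a made-up vector (`var2` is hypothetical, not the `Var2` column of `d10`) shows the three cases:

```r
library(dplyr)

# Hypothetical values for illustration only; not the d10 data
var2 <- c(3, 8, 12, NA)

# No NA handling: NA <= 8 is NA, so the result stays NA
plain <- if_else(var2 <= 8, 1, 0)
plain
# 1 1 0 NA

# NA | TRUE is TRUE, so the missing value takes the "true" branch
na_with_or <- if_else(var2 <= 8 | is.na(var2), 1, 0)
na_with_or
# 1 1 0 1

# NA & FALSE is FALSE, so the missing value takes the "false" branch
na_with_and <- if_else(var2 > 8 & !is.na(var2), 0, 1)
na_with_and
# 1 1 0 1
```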

@@ -199,55 +188,61 @@ d6

Collapse all spellings of Google Meet into one uniform category

* Note: We are recoding into a new variable using `dplyr::mutate()` and giving it a different name from the original. This keeps both the old and new versions of the variable.

* Note: Using `stringr::str_detect()` we can detect different spellings of google meet

* Note: You can also combine the pattern "google|meet" like we did above in `dplyr::if_else()` rather than separate the patterns out.

```{r}
d6 %>%
dplyr::mutate(online_platform_new = dplyr::case_when(
stringr::str_detect(online_platform,"google") ~ "Google Meet",
stringr::str_detect(online_platform,"meet") ~ "Google Meet",
TRUE ~ online_platform))
d6 %>%
dplyr::mutate(
online_platform_new =
dplyr::case_when(
stringr::str_detect(online_platform, "google|meet") ~ "Google Meet",
TRUE ~ online_platform
)
)
```

* Note: If you do not add *TRUE ~ online_platform*, you will get NA for all categories that are not specifically mentioned in the statement.

```{r}
d6 %>%
dplyr::mutate(online_platform_new = dplyr::case_when(
stringr::str_detect(online_platform,"google") ~ "Google Meet",
stringr::str_detect(online_platform,"meet") ~ "Google Meet"))
d6 %>%
dplyr::mutate(online_platform_new =
dplyr::case_when(
stringr::str_detect(online_platform, "google|meet") ~ "Google Meet"
))
```

* Note: If you don't care about the other platforms, you can collapse the remaining categories into a single name, such as "Other". Notice that our NA value will also be recoded to "Other".

```{r}
d6 %>%
dplyr::mutate(online_platform_new = dplyr::case_when(
stringr::str_detect(online_platform,"google") ~ "Google Meet",
stringr::str_detect(online_platform,"meet") ~ "Google Meet",
TRUE ~ "Other"))
d6 %>%
dplyr::mutate(
online_platform_new =
dplyr::case_when(
stringr::str_detect(online_platform, "google|meet") ~ "Google Meet",
TRUE ~ "Other"
)
)
```

* Note: If you want to keep NA as NA, you would need to specifically call out NA. For `dplyr::case_when()` you need to call out the specific type of NA. In this case *NA_character_*.
* Note: If you want to keep NA as NA while recoding the rest to "Other", you need to handle NA explicitly. In `dplyr::case_when()` the NA has to match the type of the other outputs, in this case *NA_character_*.

```{r}
d6 %>%
dplyr::mutate(online_platform_new = dplyr::case_when(
is.na(online_platform) ~ NA_character_,
stringr::str_detect(online_platform,"google") ~ "Google Meet",
stringr::str_detect(online_platform,"meet") ~ "Google Meet",
TRUE ~ "Other"))
d6 %>%
dplyr::mutate(
online_platform_new =
dplyr::case_when(
is.na(online_platform) ~ NA_character_,
stringr::str_detect(online_platform, "google|meet") ~ "Google Meet",
TRUE ~ "Other"
)
)
```
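
As a hedged aside: assuming dplyr 1.1.0 or later is installed, `case_when()` also accepts a plain `NA` on the right-hand side and has a `.default` argument that plays the role of `TRUE ~`. A minimal sketch on a made-up vector (`platforms` is hypothetical, not `d6`):

```r
library(dplyr)
library(stringr)

# Hypothetical values for illustration only; not the d6 data
platforms <- c("google", "meet", "zoom", NA)

# Assumes dplyr >= 1.1.0: plain NA is cast to character, .default replaces TRUE ~
recoded <- case_when(
  is.na(platforms) ~ NA,
  str_detect(platforms, "google|meet") ~ "Google Meet",
  .default = "Other"
)
recoded
# "Google Meet" "Google Meet" "Other" NA
```

On older dplyr versions, the `NA_character_` and `TRUE ~` form used in this document is the one to rely on.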

@@ -265,40 +260,84 @@ d9

Collapse all spellings of Google Meet into one uniform category and clarify that google means Google Hangouts

* Note: Using `stringr::str_detect()` we can detect different spellings of google meet

* Note: Here I am adding the `base::tolower()` function to deal with the varying capitalization. You can also use the solutions we showed above in `dplyr::if_else()` where we add either (?i) to the pattern or add `stringr::regex()` around the pattern with the argument *ignore_case = TRUE*.

* Note: I am not putting "google" at the top because, if I did, every value containing "google" (including the Google Meet spellings) would be recoded to google hangouts.

```{r}
d9 %>%
dplyr::mutate(online_platform=tolower(online_platform)) %>%
dplyr::mutate(online_platform_new = dplyr::case_when(
stringr::str_detect(online_platform,"met") ~ "google meet",
stringr::str_detect(online_platform,"meet") ~ "google meet",
stringr::str_detect(online_platform, "google") ~ "google hangouts",
stringr::str_detect(online_platform, "hangouts") ~ "google hangouts",
TRUE ~ online_platform))
d9 %>%
dplyr::mutate(online_platform =
tolower(online_platform)) %>%
dplyr::mutate(
online_platform_new =
dplyr::case_when(
stringr::str_detect(online_platform, "met|meet") ~ "google meet",
stringr::str_detect(online_platform, "google|hangouts") ~ "google hangouts",
TRUE ~ online_platform
)
)
```

* Note: See how the result would be different if I put "google" at the top. The ordering matters. Once a statement is evaluated, it will not be rewritten.

```{r}
d9 %>%
dplyr::mutate(online_platform=tolower(online_platform)) %>%
dplyr::mutate(online_platform_new = dplyr::case_when(
stringr::str_detect(online_platform, "google") ~ "google hangouts",
stringr::str_detect(online_platform,"met") ~ "google meet",
stringr::str_detect(online_platform,"meet") ~ "google meet",
stringr::str_detect(online_platform, "hangouts") ~ "google hangouts",
TRUE ~ online_platform))
d9 %>%
dplyr::mutate(online_platform =
tolower(online_platform)) %>%
dplyr::mutate(
online_platform_new =
dplyr::case_when(
stringr::str_detect(online_platform, "google|hangouts") ~ "google hangouts",
stringr::str_detect(online_platform, "met|meet") ~ "google meet",
TRUE ~ online_platform
)
)
```
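
The effect of ordering can be made visible with a small sketch on a made-up, already lowercased vector (`platforms` is hypothetical, not `d9`). Each value is assigned by the first condition it matches:

```r
library(dplyr)
library(stringr)

# Hypothetical lowercased values for illustration only; not the d9 data
platforms <- c("google meet", "google met", "google", "hangouts")

# "met|meet" is checked first, so the Google Meet spellings are caught early
meet_first <- case_when(
  str_detect(platforms, "met|meet") ~ "google meet",
  str_detect(platforms, "google|hangouts") ~ "google hangouts",
  TRUE ~ platforms
)
meet_first
# "google meet" "google meet" "google hangouts" "google hangouts"

# "google|hangouts" is checked first, so it claims every value containing "google"
google_first <- case_when(
  str_detect(platforms, "google|hangouts") ~ "google hangouts",
  str_detect(platforms, "met|meet") ~ "google meet",
  TRUE ~ platforms
)
google_first
# "google hangouts" "google hangouts" "google hangouts" "google hangouts"
```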


**3\. Collapse a numeric variable (Var2) into discrete character categories**

Review the data (d8)

```{r, echo=FALSE}
source("data.R")
d8
```

Collapse Var2 into low, medium, high categories.

* Note: We use the & operator to say if a value meets both conditions

* Note: We cannot use `dplyr::between()` because it is inclusive on both ends (>= and <=), while our lower bound is a strict >.

```{r}
d8 %>%
dplyr::mutate(Var2_new =
dplyr::case_when(Var2 > 14 ~ "high",
Var2 > 6 & Var2 <= 14 ~ "medium",
TRUE ~ "low"))
```

* Note: Because `dplyr::case_when()` evaluates in order, we don't actually have to add the right side of our "medium" logic because all of the highs have been assigned and will not be written over by the mediums

```{r}
d8 %>%
dplyr::mutate(Var2_new =
dplyr::case_when(Var2 > 14 ~ "high",
Var2 > 6 ~ "medium",
TRUE ~ "low"))
```
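
One thing to watch for is that `TRUE ~ "low"` also catches missing values. The sketch below uses a made-up vector (`var2` is hypothetical, not the `Var2` column of `d8`) and, as an alternative technique that is not part of this document's approach, compares the result with base R's `cut()`, which keeps `NA` as `NA`:

```r
library(dplyr)

# Hypothetical values for illustration only; not the d8 data
var2 <- c(2, 6, 7, 14, 15, NA)

# case_when(): the NA fails both comparisons and falls through to "low"
with_case_when <- case_when(
  var2 > 14 ~ "high",
  var2 > 6 ~ "medium",
  TRUE ~ "low"
)
with_case_when
# "low" "low" "medium" "medium" "high" "low"

# cut(): right-closed intervals (-Inf, 6], (6, 14], (14, Inf]; NA stays NA
with_cut <- as.character(
  cut(var2, breaks = c(-Inf, 6, 14, Inf), labels = c("low", "medium", "high"))
)
with_cut
# "low" "low" "medium" "medium" "high" NA
```

If missing values should stay missing with `case_when()`, an `is.na(var2) ~ NA_character_` condition can be placed first.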

---

@@ -323,10 +362,11 @@ Collapse all spellings of Google Meet into one uniform category

```{r}
d6 %>%
dplyr::mutate(online_platform_new =
dplyr::recode(
online_platform, google = "Google Meet", meet = "Google Meet"))
d6 %>%
dplyr::mutate(
online_platform_new =
dplyr::recode(online_platform, google = "Google Meet", meet = "Google Meet")
)
```
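
`dplyr::recode()` is marked as superseded in recent dplyr releases. Assuming dplyr 1.1.0 or later, a similar exact-value recode could be sketched with `dplyr::case_match()`; the vector `platforms` below is hypothetical and stands in for the `online_platform` column:

```r
library(dplyr)

# Hypothetical values for illustration only; not the d6 data
platforms <- c("google", "meet", "zoom", NA)

# Assumes dplyr >= 1.1.0: exact values on the left, replacement on the right;
# .default keeps every unmatched value (including NA) as it was
recoded <- case_match(
  platforms,
  c("google", "meet") ~ "Google Meet",
  .default = platforms
)
recoded
# "Google Meet" "Google Meet" "zoom" NA
```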
