Some tips and tricks to guide/help you - calculating proportions on a dataset with multiple grouping variables #31
Replies: 9 comments
-
Pure base R (no package dependencies) solution## All base R code to process and report indicators ----
### Read data ----
nut_dat <- read.csv("https://github.com/OxfordIHTM/oxford-ihtm-forum/files/14256522/nut_data.csv")
### Recode variables to be able to perform proportion calculation ----
nut_dat$muac_gam <- with(nut_dat, ifelse(muac < 12.5, 1, 0))
nut_dat$muac_mam <- with(nut_dat, ifelse(muac < 12.5 & muac >= 11.5, 1, 0))
nut_dat$muac_sam <- with(nut_dat, ifelse(muac < 11.5, 1, 0))
nut_dat$oedema <- with(nut_dat, ifelse(oedema == 2, 0, 1))
nut_dat$dia <- with(nut_dat, ifelse(dia == 2, 0, 1))
nut_dat$fev <- with(nut_dat, ifelse(fev == 2, 0, 1))
nut_dat$bf <- with(nut_dat, ifelse(bf == 2, 0, 1))
### Get proportion of each indicator by state ----
state_results <- lapply(
X = nut_dat[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")],
FUN = function(x) tapply(X = x, INDEX = list(nut_dat$state_name), FUN = function(x) sum(x, na.rm = TRUE) / sum(!is.na(x)))
)
#### Alternative approach using mean function ----
state_results <- lapply(
X = nut_dat[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")],
FUN = function(x) tapply(X = x, INDEX = list(nut_dat$state_name), FUN = function(x) mean(x, na.rm = TRUE))
)
state_results <- do.call(cbind, state_results)
state_results <- data.frame(state = row.names(state_results), state_results)
row.names(state_results) <- NULL
### Get proportion of each indicator by district ----
district_results <- lapply(
X = nut_dat[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")],
FUN = function(x) tapply(X = x, INDEX = list(nut_dat$district_name), FUN = function(x) sum(x, na.rm = TRUE) / sum(!is.na(x)))
)
#### Alternative approach using mean function ----
district_results <- lapply(
X = nut_dat[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")],
FUN = function(x) tapply(X = x, INDEX = list(nut_dat$district_name), FUN = function(x) mean(x, na.rm = TRUE))
)
district_results <- do.call(cbind, district_results)
district_results <- data.frame(district = row.names(district_results), district_results)
row.names(district_results) <- NULL
### Create "pretty" version of results tables for reporting purposes ----
#### State results - prettiefied ----
state_results_pretty <- data.frame(
state = state_results$state,
do.call(
cbind,
lapply(
X = round(
state_results[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")] * 100,
digits = 1
),
FUN = function(x) paste0(x, "%")
)
)
)
names(state_results_pretty) <- c("State", "GAM", "MAM", "SAM", "Oedema", "Diarrhoea", "Fever", "Breastfeeding")
#### District results - prettified ----
district_results_pretty <- data.frame(
district = district_results$district,
do.call(
cbind,
lapply(
X = round(
district_results[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")] * 100,
digits = 1
),
FUN = function(x) paste0(x, "%")
)
)
)
names(district_results_pretty) <- c("District", "GAM", "MAM", "SAM", "Oedema", "Diarrhoea", "Fever", "Breastfeeding") |
Beta Was this translation helpful? Give feedback.
-
The code above produces the following results at the state level: state muac_gam muac_mam muac_sam oedema dia fev bf
1 State A 0.1379310 0.09195402 0.04597701 0.05747126 0.3678161 0.43678161 0.8045977
2 State B 0.1839080 0.14942529 0.03448276 0.01149425 0.1149425 0.06896552 0.7471264
3 State C 0.1954023 0.12643678 0.06896552 0.00000000 0.1609195 0.20689655 0.7701149
4 State D 0.1379310 0.11494253 0.02298851 0.00000000 0.1149425 0.12643678 0.8850575
5 State E 0.2183908 0.17241379 0.04597701 0.00000000 0.2068966 0.31034483 0.8965517 The "prettified" table of results at the state level looks like this: State GAM MAM SAM Oedema Diarrhoea Fever Breastfeeding
1 State A 13.8% 9.2% 4.6% 5.7% 36.8% 43.7% 80.5%
2 State B 18.4% 14.9% 3.4% 1.1% 11.5% 6.9% 74.7%
3 State C 19.5% 12.6% 6.9% 0% 16.1% 20.7% 77%
4 State D 13.8% 11.5% 2.3% 0% 11.5% 12.6% 88.5%
5 State E 21.8% 17.2% 4.6% 0% 20.7% 31% 89.7% |
Beta Was this translation helpful? Give feedback.
-
The code above produces the following results at the district level: district muac_gam muac_mam muac_sam oedema dia fev bf
1 District A 0.20689655 0.13793103 0.06896552 0.00000000 0.48275862 0.58620690 0.8965517
2 District B 0.06896552 0.06896552 0.00000000 0.00000000 0.37931034 0.51724138 0.8275862
3 District C 0.13793103 0.06896552 0.06896552 0.17241379 0.24137931 0.20689655 0.6896552
4 District D 0.13793103 0.10344828 0.03448276 0.03448276 0.17241379 0.06896552 0.5862069
5 District E 0.24137931 0.17241379 0.06896552 0.00000000 0.03448276 0.00000000 0.8620690
6 District F 0.17241379 0.17241379 0.00000000 0.00000000 0.13793103 0.13793103 0.7931034
7 District G 0.17241379 0.13793103 0.03448276 0.00000000 0.06896552 0.20689655 0.8275862
8 District H 0.13793103 0.13793103 0.00000000 0.00000000 0.24137931 0.27586207 0.5862069
9 District I 0.27586207 0.10344828 0.17241379 0.00000000 0.17241379 0.13793103 0.8965517
10 District J 0.17241379 0.17241379 0.00000000 0.00000000 0.10344828 0.06896552 0.8965517
11 District K 0.13793103 0.13793103 0.00000000 0.00000000 0.06896552 0.06896552 0.8965517
12 District L 0.10344828 0.03448276 0.06896552 0.00000000 0.17241379 0.24137931 0.8620690
13 District M 0.24137931 0.20689655 0.03448276 0.00000000 0.20689655 0.24137931 0.8965517
14 District N 0.20689655 0.13793103 0.06896552 0.00000000 0.27586207 0.48275862 0.9655172
15 District O 0.20689655 0.17241379 0.03448276 0.00000000 0.13793103 0.20689655 0.8275862 The "prettified" table of results at the district level looks like this: District GAM MAM SAM Oedema Diarrhoea Fever Breastfeeding
1 District A 20.7% 13.8% 6.9% 0% 48.3% 58.6% 89.7%
2 District B 6.9% 6.9% 0% 0% 37.9% 51.7% 82.8%
3 District C 13.8% 6.9% 6.9% 17.2% 24.1% 20.7% 69%
4 District D 13.8% 10.3% 3.4% 3.4% 17.2% 6.9% 58.6%
5 District E 24.1% 17.2% 6.9% 0% 3.4% 0% 86.2%
6 District F 17.2% 17.2% 0% 0% 13.8% 13.8% 79.3%
7 District G 17.2% 13.8% 3.4% 0% 6.9% 20.7% 82.8%
8 District H 13.8% 13.8% 0% 0% 24.1% 27.6% 58.6%
9 District I 27.6% 10.3% 17.2% 0% 17.2% 13.8% 89.7%
10 District J 17.2% 17.2% 0% 0% 10.3% 6.9% 89.7%
11 District K 13.8% 13.8% 0% 0% 6.9% 6.9% 89.7%
12 District L 10.3% 3.4% 6.9% 0% 17.2% 24.1% 86.2%
13 District M 24.1% 20.7% 3.4% 0% 20.7% 24.1% 89.7%
14 District N 20.7% 13.8% 6.9% 0% 27.6% 48.3% 96.6%
15 District O 20.7% 17.2% 3.4% 0% 13.8% 20.7% 82.8% |
Beta Was this translation helpful? Give feedback.
-
Notes on getting the proportion
in R, I express this as follows: ## A vector of values indicating whether a child has GAM or not: 1 = Yes; 0 = No; NA = no data/not applicable ----
gam <- c(1, 1, 1, 0, 0, 1, 0, 1, NA, 1)
## Calculate proportion of children with GAM ---
sum(gam, na.rm = TRUE) / sum(!is.na(gam)) which gives me: ## Numerator is 6 (there are 6 1s in the data)
## Denominator is 9 (there are 9 records that have data)
0.6666667 Now, you might notice also that result of the calculation we are performing above on the 1s and 0s can also be achieved by simply getting the mean of the values: mean(c(1, 1, 1, 0, 0, 1, 0, 1, NA, 1), na.rm = TRUE)
## or
mean(gam, na.rm = TRUE) which gives us: 0.6666667 |
Beta Was this translation helpful? Give feedback.
-
Notes on getting the proportion at every grouping in the dataset
I know you guys use Google searches and ChatGPT to look up solutions and this option for doing iterative operations will rarely come out in your results. Even the Nothing wrong with |
Beta Was this translation helpful? Give feedback.
-
Notes on output tables
|
Beta Was this translation helpful? Give feedback.
-
Dependency heavy approach/technique to producing the same results as aboveNow, for most of you, you are getting exposed to the ## Dependency heavy approach - using dplyr and friends
### Load package dependencies ----
library(dplyr)
library(scales)
### Read data ----
nut_dat <- read.table("https://raw.githubusercontent.com/OxfordIHTM/teaching_datasets/main/nut.dat", header = TRUE)
### Get state level results ----
state_results_alt <- nut_dat %>%
mutate(
muac_gam = ifelse(muac < 12.5, 1, 0),
muac_mam = ifelse(muac < 12.5 & muac >= 11.5, 1, 0),
muac_sam = ifelse(muac < 11.5, 1, 0),
oedema = ifelse(oedema == 1, 1, 0),
dia = ifelse(dia == 1, 1, 0),
fev = ifelse(fev == 1, 1, 0),
bf = ifelse(bf == 1, 1, 0)
) %>%
group_by(state_id, state_name) %>%
summarise(
muac_gam = sum(muac_gam, na.rm = TRUE) / sum(!is.na(muac_gam)),
muac_mam = sum(muac_mam, na.rm = TRUE) / sum(!is.na(muac_mam)),
muac_sam = sum(muac_sam, na.rm = TRUE) / sum(!is.na(muac_sam)),
oedema = sum(oedema, na.rm = TRUE) / sum(!is.na(oedema)),
dia = sum(dia, na.rm = TRUE) / sum(!is.na(dia)),
fev = sum(fev, na.rm = TRUE) / sum(!is.na(fev)),
bf = sum(bf, na.rm = TRUE) / sum(!is.na(bf)),
.groups = "drop"
)
#### Alternative approach using mean function to simplify further ----
state_results_alt <- nut_dat %>%
mutate(
muac_gam = ifelse(muac < 12.5, 1, 0),
muac_mam = ifelse(muac < 12.5 & muac >= 11.5, 1, 0),
muac_sam = ifelse(muac < 11.5, 1, 0),
oedema = ifelse(oedema == 1, 1, 0),
dia = ifelse(dia == 1, 1, 0),
fev = ifelse(fev == 1, 1, 0),
bf = ifelse(bf == 1, 1, 0)
) %>%
group_by(state_id, state_name) %>%
summarise(
muac_gam = mean(muac_gam, na.rm = TRUE),
muac_mam = mean(muac_mam, na.rm = TRUE),
muac_sam = mean(muac_sam, na.rm = TRUE),
oedema = mean(oedema, na.rm = TRUE),
dia = mean(dia, na.rm = TRUE),
fev = mean(fev, na.rm = TRUE),
bf = mean(bf, na.rm = TRUE),
.groups = "drop"
)
#### Further simplify by using full power of the dplyr functions ----
state_results_alt <- nut_dat %>%
mutate(
across(
.cols = muac,
.fns = list(
gam = ~ ifelse(.x < 12.5, 1, 0),
mam = ~ ifelse(.x < 12.5 & .x >= 11.5, 1, 0),
sam = ~ ifelse(.x < 11.5, 1, 0)
)
),
across(
.cols = oedema:bf,
.fns = function(x) ifelse(x == 1, 1, 0)
)
) %>%
group_by(state_id, state_name) %>%
summarise(
across(
.cols = starts_with("muac_"),
.fns = mean
),
across(
.cols = oedema:bf,
.fns = mean
),
.groups = "drop"
)
### Get district level results ----
district_results_alt <- nut_dat %>%
mutate(
muac_gam = ifelse(muac < 12.5, 1, 0),
muac_mam = ifelse(muac < 12.5 & muac >= 11.5, 1, 0),
muac_sam = ifelse(muac < 11.5, 1, 0),
oedema = ifelse(oedema == 1, 1, 0),
dia = ifelse(dia == 1, 1, 0),
fev = ifelse(fev == 1, 1, 0),
bf = ifelse(bf == 1, 1, 0)
) %>%
group_by(district_id, district_name) %>%
summarise(
muac_gam = sum(muac_gam, na.rm = TRUE) / sum(!is.na(muac_gam)),
muac_mam = sum(muac_mam, na.rm = TRUE) / sum(!is.na(muac_mam)),
muac_sam = sum(muac_sam, na.rm = TRUE) / sum(!is.na(muac_sam)),
oedema = sum(oedema, na.rm = TRUE) / sum(!is.na(oedema)),
dia = sum(dia, na.rm = TRUE) / sum(!is.na(dia)),
fev = sum(fev, na.rm = TRUE) / sum(!is.na(fev)),
bf = sum(bf, na.rm = TRUE) / sum(!is.na(bf)),
.groups = "drop"
)
#### Alternative approach using mean function to simplify further ----
district_results_alt <- nut_dat %>%
mutate(
muac_gam = ifelse(muac < 12.5, 1, 0),
muac_mam = ifelse(muac < 12.5 & muac >= 11.5, 1, 0),
muac_sam = ifelse(muac < 11.5, 1, 0),
oedema = ifelse(oedema == 1, 1, 0),
dia = ifelse(dia == 1, 1, 0),
fev = ifelse(fev == 1, 1, 0),
bf = ifelse(bf == 1, 1, 0)
) %>%
group_by(district_id, district_name) %>%
summarise(
muac_gam = mean(muac_gam, na.rm = TRUE),
muac_mam = mean(muac_mam, na.rm = TRUE),
muac_sam = mean(muac_sam, na.rm = TRUE),
oedema = mean(oedema, na.rm = TRUE),
dia = mean(dia, na.rm = TRUE),
fev = mean(fev, na.rm = TRUE),
bf = mean(bf, na.rm = TRUE),
.groups = "drop"
)
#### Further simplify by using full power of the dplyr functions ----
district_results_alt <- nut_dat %>%
mutate(
across(
.cols = muac,
.fns = list(
gam = ~ ifelse(.x < 12.5, 1, 0),
mam = ~ ifelse(.x < 12.5 & .x >= 11.5, 1, 0),
sam = ~ ifelse(.x < 11.5, 1, 0)
)
),
across(
.cols = oedema:bf,
.fns = function(x) ifelse(x == 1, 1, 0)
)
) %>%
group_by(district_id, district_name) %>%
summarise(
across(
.cols = starts_with("muac_"),
.fns = mean
),
across(
.cols = oedema:bf,
.fns = mean
),
.groups = "drop"
)
### Prettify the output tables
#### State results ----
state_results_alt_pretty <- state_results_alt %>%
mutate(
across(
.cols = muac_gam:bf,
.fns = scales::label_percent()
)
) %>%
rename(
`State Identifier` = state_id,
State = state_name,
GAM = muac_gam,
MAM = muac_mam,
SAM = muac_sam,
Oedema = oedema,
Diarrhoea = dia,
Fever = fev,
Breastfeeding = bf
)
#### District results ----
district_results_alt_pretty <- district_results_alt %>%
mutate(
across(
.cols = muac_gam:bf,
.fns = scales::label_percent()
)
) %>%
rename(
`District Identifier` = district_id,
District = district_name,
GAM = muac_gam,
MAM = muac_mam,
SAM = muac_sam,
Oedema = oedema,
Diarrhoea = dia,
Fever = fev,
Breastfeeding = bf
) |
Beta Was this translation helpful? Give feedback.
-
The output of the state level results is: # A tibble: 5 × 9
state_id state_name muac_gam muac_mam muac_sam oedema dia fev bf
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 State A 0.138 0.0920 0.0460 0.0575 0.368 0.437 0.805
2 2 State B 0.184 0.149 0.0345 0.0115 0.115 0.0690 0.747
3 3 State C 0.195 0.126 0.0690 0 0.161 0.207 0.770
4 4 State D 0.138 0.115 0.0230 0 0.115 0.126 0.885
5 5 State E 0.218 0.172 0.0460 0 0.207 0.310 0.897 The output of the district level results is: # A tibble: 15 × 9
district_id district_name muac_gam muac_mam muac_sam oedema dia fev bf
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 District A 0.207 0.138 0.0690 0 0.483 0.586 0.897
2 2 District B 0.0690 0.0690 0 0 0.379 0.517 0.828
3 3 District C 0.138 0.0690 0.0690 0.172 0.241 0.207 0.690
4 4 District D 0.138 0.103 0.0345 0.0345 0.172 0.0690 0.586
5 5 District E 0.241 0.172 0.0690 0 0.0345 0 0.862
6 6 District F 0.172 0.172 0 0 0.138 0.138 0.793
7 7 District G 0.172 0.138 0.0345 0 0.0690 0.207 0.828
8 8 District H 0.138 0.138 0 0 0.241 0.276 0.586
9 9 District I 0.276 0.103 0.172 0 0.172 0.138 0.897
10 10 District J 0.172 0.172 0 0 0.103 0.0690 0.897
11 11 District K 0.138 0.138 0 0 0.0690 0.0690 0.897
12 12 District L 0.103 0.0345 0.0690 0 0.172 0.241 0.862
13 13 District M 0.241 0.207 0.0345 0 0.207 0.241 0.897
14 14 District N 0.207 0.138 0.0690 0 0.276 0.483 0.966
15 15 District O 0.207 0.172 0.0345 0 0.138 0.207 0.828 The output is almost exactly the same as the previous approach. The main difference being that it is easier to keep the state identifier variable in the resulting output and the resulting output is a |
Beta Was this translation helpful? Give feedback.
-
Notes on using dependencies
What I want you to take away from here is the judicious use of dependencies. What I am seeing (and this is because of all the things you are seeing in the results of your Google search and what ChatGPT is telling you) is that some of you are declaring dependencies to at least 10 or so packages right away. But when I look at your code, you are only using 1 or 2 at most of those that you are declaring dependency on. This is not necessarily wrong, but I don't think it is good practice because you are requiring others to install these 10 or so package unnecessarily. Also, the more packages you declare a dependency on, the more fragile your workflow is. Another point here is that I am seeing some of you declaring a dependency to |
Beta Was this translation helpful? Give feedback.
-
Using an example make believe data: nut_data.csv
We read the data as follows:
This is what
nut_dat
looks like (showing first 10 rows of data only)This pretend nutrition dataset is from a country with 5 states ("State A", "State B", "State C", "State D", "State E") and with 3 districts per state ("District A", "District B", "District C", "District D", "District E", "District F", "District G", "District H", "District I", "District J", "District K", "District L", "District M", "District N", "District O").
The nutrition survey was done throughout the country (national survey) but with a sample that is representative down to the district level. Hence, results can be reported at the district level.
The variables in the dataset are:
In this example, I would like to answer the following:
What are the prevalences of acute wasting based on MUAC at the district, state, and national level?
global acute malnutrition (GAM): MUAC < 12.5
moderate acute malnutrition (MAM): MUAC < 12.5 & MUAC >= 11.5
severe acute malnutrition (SAM): MUAC < 11.5
What is the prevalence of oedema at the district, state, and national level?
What is the period prevalence of diarrhoea and fever at the district, state, and national level?
What is the prevalence of a child ever having been breastfed at the district, state, and national level?
Beta Was this translation helpful? Give feedback.
All reactions