Some tips and tricks to guide/help you - calculating proportions on a dataset with multiple grouping variables #31

ernestguevarra · 2024-02-13T05:31:53Z

ernestguevarra
Feb 13, 2024
Maintainer

Using an example make believe data: nut_data.csv

We read the data as follows:

nut_dat <- read.csv("https://github.com/OxfordIHTM/oxford-ihtm-forum/files/14256522/nut_data.csv")

This is what nut_dat looks like (showing first 10 rows of data only)

   state_id state_name district_id district_name age sex    ht   wt muac oedema dia fev bf
1         1    State A           1    District A   7   1  69.5  7.3 13.3      2   2   2  2
2         1    State A           1    District A  42   1  79.9 10.4 13.3      2   1   2  1
3         1    State A           1    District A  60   1  78.4 14.0 14.6      2   2   2  1
4         1    State A           1    District A  60   2  91.2 12.2 15.6      2   2   2  1
5         1    State A           1    District A  48   1  73.3  8.4 13.3      2   1   1  3
6         1    State A           1    District A  60   2 105.2 15.2 14.3      2   2   1  1
7         1    State A           1    District A  18   1  78.6  8.3 12.4      2   1   1  3
8         1    State A           1    District A  48   1  99.5 15.9 14.2      2   2   2  1
9         1    State A           1    District A  60   2  97.9 13.3 12.8      2   1   1  1
10        1    State A           1    District A  36   1  82.2  9.8 13.4      2   2   1  1

This pretend nutrition dataset is from a country with 5 states ("State A", "State B", "State C", "State D", "State E") and with 3 districts per state ("District A", "District B", "District C", "District D", "District E", "District F", "District G", "District H", "District I", "District J", "District K", "District L", "District M", "District N", "District O").

The nutrition survey was done throughout the country (national survey) but with a sample that is representative down to the district level. Hence, results can be reported at the district level.

The variables in the dataset are:

Variable Name	Variable Description
state_id	Unique identifier for the state
state_name	Name of the state
district_id	Unique identifier for the district
district_name	Name of the district
age	Age of child in months
sex	Sex of child: 1 = Male; 2 = Female
ht	Height of child in centimetres
wt	Weight of child in kilograms
muac	Mid-upper arm circumference of child in centimetres
oedema	Does the child have oedema? 1 = Yes; 2 = No
dia	In the past 2 weeks (period), has the child had diarrhoea? 1 = Yes; 2 = No
fev	In the past 2 weeks (period), has the child had fever? 1 = Yes; 2 = No
bf	Has the child ever breastfed? 1 = Yes; 2 = No

In this example, I would like to answer the following:

What are the prevalences of acute wasting based on MUAC at the district, state, and national level?

global acute malnutrition (GAM): MUAC < 12.5
moderate acute malnutrition (MAM): MUAC < 12.5 & MUAC >= 11.5
severe acute malnutrition (SAM): MUAC < 11.5
What is the prevalence of oedema at the district, state, and national level?
What is the period prevalence of diarrhoea and fever at the district, state, and national level?
What is the prevalence of a child ever having been breastfed at the district, state, and national level?

ernestguevarra · 2024-02-13T05:36:50Z

ernestguevarra
Feb 13, 2024
Maintainer Author

Pure base R (no package dependencies) solution

## All base R code to process and report indicators ----

### Read data ----
nut_dat <- read.csv("https://github.com/OxfordIHTM/oxford-ihtm-forum/files/14256522/nut_data.csv")

### Recode variables to be able to perform proportion calculation ----
nut_dat$muac_gam <- with(nut_dat, ifelse(muac < 12.5, 1, 0))
nut_dat$muac_mam <- with(nut_dat, ifelse(muac < 12.5 & muac >= 11.5, 1, 0))
nut_dat$muac_sam <- with(nut_dat, ifelse(muac < 11.5, 1, 0))
nut_dat$oedema <- with(nut_dat, ifelse(oedema == 2, 0, 1))
nut_dat$dia <- with(nut_dat, ifelse(dia == 2, 0, 1))
nut_dat$fev <- with(nut_dat, ifelse(fev == 2, 0, 1))
nut_dat$bf <- with(nut_dat, ifelse(bf == 2, 0, 1))

### Get proportion of each indicator by state ----
state_results <- lapply(
  X = nut_dat[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")],
  FUN = function(x) tapply(X = x, INDEX = list(nut_dat$state_name), FUN = function(x) sum(x, na.rm = TRUE) / sum(!is.na(x)))
)

#### Alternative approach using mean function ----
state_results <- lapply(
  X = nut_dat[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")],
  FUN = function(x) tapply(X = x, INDEX = list(nut_dat$state_name), FUN = function(x) mean(x, na.rm = TRUE))
)

state_results <- do.call(cbind, state_results)
state_results <- data.frame(state = row.names(state_results), state_results)
row.names(state_results) <- NULL


### Get proportion of each indicator by district ----
district_results <- lapply(
  X = nut_dat[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")],
  FUN = function(x) tapply(X = x, INDEX = list(nut_dat$district_name), FUN = function(x) sum(x, na.rm = TRUE) / sum(!is.na(x)))
)

#### Alternative approach using mean function ----
district_results <- lapply(
  X = nut_dat[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")],
  FUN = function(x) tapply(X = x, INDEX = list(nut_dat$district_name), FUN = function(x) mean(x, na.rm = TRUE))
)

district_results <- do.call(cbind, district_results)
district_results <- data.frame(district = row.names(district_results), district_results)
row.names(district_results) <- NULL

### Create "pretty" version of results tables for reporting purposes ----

#### State results - prettiefied ----
state_results_pretty <- data.frame(
  state = state_results$state,
  do.call(
    cbind,
    lapply(
      X = round(
        state_results[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")] * 100, 
        digits = 1
      ),
      FUN = function(x) paste0(x, "%")
    )
  )
)

names(state_results_pretty) <- c("State", "GAM", "MAM", "SAM", "Oedema", "Diarrhoea", "Fever", "Breastfeeding")

#### District results - prettified ----
district_results_pretty <- data.frame(
  district = district_results$district,
  do.call(
    cbind,
    lapply(
      X = round(
        district_results[ , c("muac_gam", "muac_mam", "muac_sam", "oedema", "dia", "fev", "bf")] * 100, 
        digits = 1
      ),
      FUN = function(x) paste0(x, "%")
    )
  )
)

names(district_results_pretty) <- c("District", "GAM", "MAM", "SAM", "Oedema", "Diarrhoea", "Fever", "Breastfeeding")

0 replies

ernestguevarra · 2024-02-13T05:41:42Z

ernestguevarra
Feb 13, 2024
Maintainer Author

The code above produces the following results at the state level:

    state  muac_gam   muac_mam   muac_sam     oedema       dia        fev        bf
1 State A 0.1379310 0.09195402 0.04597701 0.05747126 0.3678161 0.43678161 0.8045977
2 State B 0.1839080 0.14942529 0.03448276 0.01149425 0.1149425 0.06896552 0.7471264
3 State C 0.1954023 0.12643678 0.06896552 0.00000000 0.1609195 0.20689655 0.7701149
4 State D 0.1379310 0.11494253 0.02298851 0.00000000 0.1149425 0.12643678 0.8850575
5 State E 0.2183908 0.17241379 0.04597701 0.00000000 0.2068966 0.31034483 0.8965517

The "prettified" table of results at the state level looks like this:

    State   GAM   MAM  SAM Oedema Diarrhoea Fever Breastfeeding
1 State A 13.8%  9.2% 4.6%   5.7%     36.8% 43.7%         80.5%
2 State B 18.4% 14.9% 3.4%   1.1%     11.5%  6.9%         74.7%
3 State C 19.5% 12.6% 6.9%     0%     16.1% 20.7%           77%
4 State D 13.8% 11.5% 2.3%     0%     11.5% 12.6%         88.5%
5 State E 21.8% 17.2% 4.6%     0%     20.7%   31%         89.7%

0 replies

ernestguevarra · 2024-02-13T05:45:05Z

ernestguevarra
Feb 13, 2024
Maintainer Author

The code above produces the following results at the district level:

     district   muac_gam   muac_mam   muac_sam     oedema        dia        fev        bf
1  District A 0.20689655 0.13793103 0.06896552 0.00000000 0.48275862 0.58620690 0.8965517
2  District B 0.06896552 0.06896552 0.00000000 0.00000000 0.37931034 0.51724138 0.8275862
3  District C 0.13793103 0.06896552 0.06896552 0.17241379 0.24137931 0.20689655 0.6896552
4  District D 0.13793103 0.10344828 0.03448276 0.03448276 0.17241379 0.06896552 0.5862069
5  District E 0.24137931 0.17241379 0.06896552 0.00000000 0.03448276 0.00000000 0.8620690
6  District F 0.17241379 0.17241379 0.00000000 0.00000000 0.13793103 0.13793103 0.7931034
7  District G 0.17241379 0.13793103 0.03448276 0.00000000 0.06896552 0.20689655 0.8275862
8  District H 0.13793103 0.13793103 0.00000000 0.00000000 0.24137931 0.27586207 0.5862069
9  District I 0.27586207 0.10344828 0.17241379 0.00000000 0.17241379 0.13793103 0.8965517
10 District J 0.17241379 0.17241379 0.00000000 0.00000000 0.10344828 0.06896552 0.8965517
11 District K 0.13793103 0.13793103 0.00000000 0.00000000 0.06896552 0.06896552 0.8965517
12 District L 0.10344828 0.03448276 0.06896552 0.00000000 0.17241379 0.24137931 0.8620690
13 District M 0.24137931 0.20689655 0.03448276 0.00000000 0.20689655 0.24137931 0.8965517
14 District N 0.20689655 0.13793103 0.06896552 0.00000000 0.27586207 0.48275862 0.9655172
15 District O 0.20689655 0.17241379 0.03448276 0.00000000 0.13793103 0.20689655 0.8275862

The "prettified" table of results at the district level looks like this:

     District   GAM   MAM   SAM Oedema Diarrhoea Fever Breastfeeding
1  District A 20.7% 13.8%  6.9%     0%     48.3% 58.6%         89.7%
2  District B  6.9%  6.9%    0%     0%     37.9% 51.7%         82.8%
3  District C 13.8%  6.9%  6.9%  17.2%     24.1% 20.7%           69%
4  District D 13.8% 10.3%  3.4%   3.4%     17.2%  6.9%         58.6%
5  District E 24.1% 17.2%  6.9%     0%      3.4%    0%         86.2%
6  District F 17.2% 17.2%    0%     0%     13.8% 13.8%         79.3%
7  District G 17.2% 13.8%  3.4%     0%      6.9% 20.7%         82.8%
8  District H 13.8% 13.8%    0%     0%     24.1% 27.6%         58.6%
9  District I 27.6% 10.3% 17.2%     0%     17.2% 13.8%         89.7%
10 District J 17.2% 17.2%    0%     0%     10.3%  6.9%         89.7%
11 District K 13.8% 13.8%    0%     0%      6.9%  6.9%         89.7%
12 District L 10.3%  3.4%  6.9%     0%     17.2% 24.1%         86.2%
13 District M 24.1% 20.7%  3.4%     0%     20.7% 24.1%         89.7%
14 District N 20.7% 13.8%  6.9%     0%     27.6% 48.3%         96.6%
15 District O 20.7% 17.2%  3.4%     0%     13.8% 20.7%         82.8%

0 replies

ernestguevarra · 2024-02-13T06:00:27Z

ernestguevarra
Feb 13, 2024
Maintainer Author

Notes on getting the proportion

This is an approach I use all the time in my work analysing survey data. I see each indicator that is reporting a proportion as a set of 1s and 0s with 1 indicating that the specific row of data has the condition or the indicator definition that is of interest to me and 0 as not. So, from a calculation perspective, my equation for any proportion type of indicator looks like this:

$$ \frac{\text{sum of all values}}{\text{sum of all that has values}} $$

in R, I express this as follows:

## A vector of values indicating whether a child has GAM or not: 1 = Yes; 0 = No; NA = no data/not applicable ----
gam <- c(1, 1, 1, 0, 0, 1, 0, 1, NA, 1)

## Calculate proportion of children with GAM ---
sum(gam, na.rm = TRUE) / sum(!is.na(gam))

which gives me:

## Numerator is 6 (there are 6 1s in the data)
## Denominator is 9 (there are 9 records that have data)

0.6666667

Now, you might notice also that result of the calculation we are performing above on the 1s and 0s can also be achieved by simply getting the mean of the values:

mean(c(1, 1, 1, 0, 0, 1, 0, 1, NA, 1), na.rm = TRUE)
## or
mean(gam, na.rm = TRUE)

which gives us:

0.6666667

0 replies

ernestguevarra · 2024-02-13T06:08:43Z

ernestguevarra
Feb 13, 2024
Maintainer Author

Notes on getting the proportion at every grouping in the dataset

I use here a combination of the base functions called lapply() and tapply(). You can see the help files for these functions using ?lapply and ?tapply. These functions allow you to perform functions/operations iteratively across a vector or list of values (for lapply()) or across a grouping of values (tapply()). This is an alternative to using a for() loop function for iteration.

I know you guys use Google searches and ChatGPT to look up solutions and this option for doing iterative operations will rarely come out in your results. Even the for() loop, which is another pure Base R solution to this, will rarely (if at all) come out. But I want to expose you to these options because it is important to learn R from its roots/base approach without dependencies to the common tidyverse approach which is what internet search or AI results will consistently show.

Nothing wrong with tidyverse but R is not just tidyverse and important to be aware of this.

0 replies

ernestguevarra · 2024-02-13T06:13:43Z

ernestguevarra
Feb 13, 2024
Maintainer Author

Notes on output tables

You will notice that in my initial output tables, I don't do any rounding off or any "beautification" of the output values. I do this because sometimes, the output table is just an intermediate output and that you will have another step wherein you will use the values of this table to do another calculation or to feed into a model or for mapping or for plotting. Anytime that you will have another step of operations to perform, use the full actual results rather than something that has been rounded off or whatever else kind of truncation that you like performing.
Then, in order to make nice reports and outputs, I create a "pretty" version of the table. I round off to the nearest 1 decimal point (using the round() function) and if a proportion, I usually put a % sign. These "pretty" tables you can then output/save as CSV files or XLSX files so that you can use them later to cut and paste onto your Word document. If you are using Rmarkdown, then you can use these pretty outputs to show as tables in R code chunks in Rmarkdown.

0 replies

ernestguevarra · 2024-02-13T06:51:00Z

ernestguevarra
Feb 13, 2024
Maintainer Author

Dependency heavy approach/technique to producing the same results as above

Now, for most of you, you are getting exposed to the tidyverse side of things in R whenever you do a Google search or ask ChatGPT for some help. This is fine. And I want to show an example of how you can get the same results as above with all the dependencies that you can use.

## Dependency heavy approach - using dplyr and friends

### Load package dependencies ----
library(dplyr)
library(scales)

### Read data ----
nut_dat <- read.table("https://raw.githubusercontent.com/OxfordIHTM/teaching_datasets/main/nut.dat", header = TRUE)


### Get state level results ----
state_results_alt <- nut_dat %>%
  mutate(
    muac_gam = ifelse(muac < 12.5, 1, 0),
    muac_mam = ifelse(muac < 12.5 & muac >= 11.5, 1, 0),
    muac_sam = ifelse(muac < 11.5, 1, 0),
    oedema = ifelse(oedema == 1, 1, 0),
    dia = ifelse(dia == 1, 1, 0),
    fev = ifelse(fev == 1, 1, 0),
    bf = ifelse(bf == 1, 1, 0)
  ) %>%
  group_by(state_id, state_name) %>%
  summarise(
    muac_gam = sum(muac_gam, na.rm = TRUE) / sum(!is.na(muac_gam)),
    muac_mam = sum(muac_mam, na.rm = TRUE) / sum(!is.na(muac_mam)),
    muac_sam = sum(muac_sam, na.rm = TRUE) / sum(!is.na(muac_sam)),
    oedema = sum(oedema, na.rm = TRUE) / sum(!is.na(oedema)),
    dia = sum(dia, na.rm = TRUE) / sum(!is.na(dia)),
    fev = sum(fev, na.rm = TRUE) / sum(!is.na(fev)),
    bf = sum(bf, na.rm = TRUE) / sum(!is.na(bf)),
    .groups = "drop"
  )

#### Alternative approach using mean function to simplify further ----
state_results_alt <- nut_dat %>%
  mutate(
    muac_gam = ifelse(muac < 12.5, 1, 0),
    muac_mam = ifelse(muac < 12.5 & muac >= 11.5, 1, 0),
    muac_sam = ifelse(muac < 11.5, 1, 0),
    oedema = ifelse(oedema == 1, 1, 0),
    dia = ifelse(dia == 1, 1, 0),
    fev = ifelse(fev == 1, 1, 0),
    bf = ifelse(bf == 1, 1, 0)
  ) %>%
  group_by(state_id, state_name) %>%
  summarise(
    muac_gam = mean(muac_gam,  na.rm = TRUE),
    muac_mam = mean(muac_mam, na.rm = TRUE),
    muac_sam = mean(muac_sam, na.rm = TRUE),
    oedema = mean(oedema, na.rm = TRUE),
    dia = mean(dia, na.rm = TRUE),
    fev = mean(fev, na.rm = TRUE),
    bf = mean(bf, na.rm = TRUE),
    .groups = "drop"
  )

#### Further simplify by using full power of the dplyr functions ----
state_results_alt <- nut_dat %>%
  mutate(
    across(
      .cols = muac,
      .fns = list(
        gam = ~ ifelse(.x < 12.5, 1, 0),
        mam = ~ ifelse(.x < 12.5 & .x >= 11.5, 1, 0),
        sam = ~ ifelse(.x < 11.5, 1, 0)
      )
    ),
    across(
      .cols = oedema:bf,
      .fns = function(x) ifelse(x == 1, 1, 0)
    )
  ) %>% 
  group_by(state_id, state_name) %>%
  summarise(
    across(
      .cols = starts_with("muac_"),
      .fns = mean
    ),
    across(
      .cols = oedema:bf,
      .fns = mean
    ),
    .groups = "drop"
  )


### Get district level results ----
district_results_alt <- nut_dat %>%
  mutate(
    muac_gam = ifelse(muac < 12.5, 1, 0),
    muac_mam = ifelse(muac < 12.5 & muac >= 11.5, 1, 0),
    muac_sam = ifelse(muac < 11.5, 1, 0),
    oedema = ifelse(oedema == 1, 1, 0),
    dia = ifelse(dia == 1, 1, 0),
    fev = ifelse(fev == 1, 1, 0),
    bf = ifelse(bf == 1, 1, 0)
  ) %>%
  group_by(district_id, district_name) %>%
  summarise(
    muac_gam = sum(muac_gam, na.rm = TRUE) / sum(!is.na(muac_gam)),
    muac_mam = sum(muac_mam, na.rm = TRUE) / sum(!is.na(muac_mam)),
    muac_sam = sum(muac_sam, na.rm = TRUE) / sum(!is.na(muac_sam)),
    oedema = sum(oedema, na.rm = TRUE) / sum(!is.na(oedema)),
    dia = sum(dia, na.rm = TRUE) / sum(!is.na(dia)),
    fev = sum(fev, na.rm = TRUE) / sum(!is.na(fev)),
    bf = sum(bf, na.rm = TRUE) / sum(!is.na(bf)),
    .groups = "drop"
  )

#### Alternative approach using mean function to simplify further ----
district_results_alt <- nut_dat %>%
  mutate(
    muac_gam = ifelse(muac < 12.5, 1, 0),
    muac_mam = ifelse(muac < 12.5 & muac >= 11.5, 1, 0),
    muac_sam = ifelse(muac < 11.5, 1, 0),
    oedema = ifelse(oedema == 1, 1, 0),
    dia = ifelse(dia == 1, 1, 0),
    fev = ifelse(fev == 1, 1, 0),
    bf = ifelse(bf == 1, 1, 0)
  ) %>%
  group_by(district_id, district_name) %>%
  summarise(
    muac_gam = mean(muac_gam,  na.rm = TRUE),
    muac_mam = mean(muac_mam, na.rm = TRUE),
    muac_sam = mean(muac_sam, na.rm = TRUE),
    oedema = mean(oedema, na.rm = TRUE),
    dia = mean(dia, na.rm = TRUE),
    fev = mean(fev, na.rm = TRUE),
    bf = mean(bf, na.rm = TRUE),
    .groups = "drop"
  )

#### Further simplify by using full power of the dplyr functions ----
district_results_alt <- nut_dat %>%
  mutate(
    across(
      .cols = muac,
      .fns = list(
        gam = ~ ifelse(.x < 12.5, 1, 0),
        mam = ~ ifelse(.x < 12.5 & .x >= 11.5, 1, 0),
        sam = ~ ifelse(.x < 11.5, 1, 0)
      )
    ),
    across(
      .cols = oedema:bf,
      .fns = function(x) ifelse(x == 1, 1, 0)
    )
  ) %>% 
  group_by(district_id, district_name) %>%
  summarise(
    across(
      .cols = starts_with("muac_"),
      .fns = mean
    ),
    across(
      .cols = oedema:bf,
      .fns = mean
    ),
    .groups = "drop"
  )


### Prettify the output tables

#### State results ----
state_results_alt_pretty <- state_results_alt %>%
  mutate(
    across(
      .cols = muac_gam:bf,
      .fns = scales::label_percent()
    )
  ) %>%
  rename(
    `State Identifier` = state_id,
    State = state_name,
    GAM = muac_gam,
    MAM = muac_mam,
    SAM = muac_sam,
    Oedema = oedema,
    Diarrhoea = dia,
    Fever = fev,
    Breastfeeding = bf
  )

#### District results ----
district_results_alt_pretty <- district_results_alt %>%
  mutate(
    across(
      .cols = muac_gam:bf,
      .fns = scales::label_percent()
    )
  ) %>%
  rename(
    `District Identifier` = district_id,
    District = district_name,
    GAM = muac_gam,
    MAM = muac_mam,
    SAM = muac_sam,
    Oedema = oedema,
    Diarrhoea = dia,
    Fever = fev,
    Breastfeeding = bf
  )

0 replies

ernestguevarra · 2024-02-13T06:55:53Z

ernestguevarra
Feb 13, 2024
Maintainer Author

The output of the state level results is:

# A tibble: 5 × 9
  state_id state_name muac_gam muac_mam muac_sam oedema   dia    fev    bf
     <dbl> <chr>         <dbl>    <dbl>    <dbl>  <dbl> <dbl>  <dbl> <dbl>
1        1 State A       0.138   0.0920   0.0460 0.0575 0.368 0.437  0.805
2        2 State B       0.184   0.149    0.0345 0.0115 0.115 0.0690 0.747
3        3 State C       0.195   0.126    0.0690 0      0.161 0.207  0.770
4        4 State D       0.138   0.115    0.0230 0      0.115 0.126  0.885
5        5 State E       0.218   0.172    0.0460 0      0.207 0.310  0.897

The output of the district level results is:

# A tibble: 15 × 9
   district_id district_name muac_gam muac_mam muac_sam oedema    dia    fev    bf
         <dbl> <chr>            <dbl>    <dbl>    <dbl>  <dbl>  <dbl>  <dbl> <dbl>
 1           1 District A      0.207    0.138    0.0690 0      0.483  0.586  0.897
 2           2 District B      0.0690   0.0690   0      0      0.379  0.517  0.828
 3           3 District C      0.138    0.0690   0.0690 0.172  0.241  0.207  0.690
 4           4 District D      0.138    0.103    0.0345 0.0345 0.172  0.0690 0.586
 5           5 District E      0.241    0.172    0.0690 0      0.0345 0      0.862
 6           6 District F      0.172    0.172    0      0      0.138  0.138  0.793
 7           7 District G      0.172    0.138    0.0345 0      0.0690 0.207  0.828
 8           8 District H      0.138    0.138    0      0      0.241  0.276  0.586
 9           9 District I      0.276    0.103    0.172  0      0.172  0.138  0.897
10          10 District J      0.172    0.172    0      0      0.103  0.0690 0.897
11          11 District K      0.138    0.138    0      0      0.0690 0.0690 0.897
12          12 District L      0.103    0.0345   0.0690 0      0.172  0.241  0.862
13          13 District M      0.241    0.207    0.0345 0      0.207  0.241  0.897
14          14 District N      0.207    0.138    0.0690 0      0.276  0.483  0.966
15          15 District O      0.207    0.172    0.0345 0      0.138  0.207  0.828

The output is almost exactly the same as the previous approach. The main difference being that it is easier to keep the state identifier variable in the resulting output and the resulting output is a tibble which is a tidyverse interpretation of a data.frame. It still looks and feels and works like a data.frame so you can use it how you would use a data.frame.

0 replies

ernestguevarra · 2024-02-13T07:05:54Z

ernestguevarra
Feb 13, 2024
Maintainer Author

Notes on using dependencies

I used two dependencies in this alternative code - dplyr and scales packages.
The dplyr functions I used are: mutate, group_by, summarise or summarize, across, and rename
The scales function I used is: label_percent
I could have potentially not used a dependency for the scales package as I only used that for one purpose - prettifying the output

What I want you to take away from here is the judicious use of dependencies. What I am seeing (and this is because of all the things you are seeing in the results of your Google search and what ChatGPT is telling you) is that some of you are declaring dependencies to at least 10 or so packages right away. But when I look at your code, you are only using 1 or 2 at most of those that you are declaring dependency on. This is not necessarily wrong, but I don't think it is good practice because you are requiring others to install these 10 or so package unnecessarily. Also, the more packages you declare a dependency on, the more fragile your workflow is.

Another point here is that I am seeing some of you declaring a dependency to tidyverse. I am sort of OK with this but I want you to realise that tidyverse is a conglomeration of 25 individual packages!!! So, to declare a dependency to tidyverse means to declare dependency to at least 25 packages. And to be able to install each of these packages, you require at least another 200 co-dependent packages on top of that. So, as you can see above, it would seem an overkill to have to install at least 225 packages (for tidyverse) just to perform the operations above which just needs 1 package (dplyr) and maybe another one (scales) just for convenience.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Oxford iHealth

Some tips and tricks to guide/help you - calculating proportions on a dataset with multiple grouping variables #31

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 9 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Oxford iHealth

Some tips and tricks to guide/help you - calculating proportions on a dataset with multiple grouping variables #31

ernestguevarra Feb 13, 2024 Maintainer

Replies: 9 comments

ernestguevarra Feb 13, 2024 Maintainer Author

Pure base R (no package dependencies) solution

ernestguevarra Feb 13, 2024 Maintainer Author

ernestguevarra Feb 13, 2024 Maintainer Author

ernestguevarra Feb 13, 2024 Maintainer Author

Notes on getting the proportion

ernestguevarra Feb 13, 2024 Maintainer Author

Notes on getting the proportion at every grouping in the dataset

ernestguevarra Feb 13, 2024 Maintainer Author

Notes on output tables

ernestguevarra Feb 13, 2024 Maintainer Author

Dependency heavy approach/technique to producing the same results as above

ernestguevarra Feb 13, 2024 Maintainer Author

ernestguevarra Feb 13, 2024 Maintainer Author

Notes on using dependencies

ernestguevarra
Feb 13, 2024
Maintainer

ernestguevarra
Feb 13, 2024
Maintainer Author

ernestguevarra
Feb 13, 2024
Maintainer Author

ernestguevarra
Feb 13, 2024
Maintainer Author

ernestguevarra
Feb 13, 2024
Maintainer Author

ernestguevarra
Feb 13, 2024
Maintainer Author

ernestguevarra
Feb 13, 2024
Maintainer Author

ernestguevarra
Feb 13, 2024
Maintainer Author

ernestguevarra
Feb 13, 2024
Maintainer Author

ernestguevarra
Feb 13, 2024
Maintainer Author