Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Kish-method to rescale_weights #575

Merged
merged 47 commits into from
Dec 31, 2024
Merged
Show file tree
Hide file tree
Changes from 22 commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
a55625d
Add Kish-method to `rescale_weights`
strengejacke Dec 17, 2024
6da102a
fix
strengejacke Dec 17, 2024
469a881
fix
strengejacke Dec 17, 2024
c08b07a
lintr
strengejacke Dec 17, 2024
3cfc128
fix
strengejacke Dec 17, 2024
46e7b6c
fix
strengejacke Dec 17, 2024
ff72739
fix
strengejacke Dec 17, 2024
80d9746
docs
strengejacke Dec 18, 2024
5463eb3
docs
strengejacke Dec 18, 2024
456166b
remove experimental code for now
strengejacke Dec 18, 2024
dd8ca23
docs
strengejacke Dec 18, 2024
33f8482
fix, add tests
strengejacke Dec 18, 2024
8bd8ebe
fix, add tests
strengejacke Dec 18, 2024
9b093ff
add tests
strengejacke Dec 18, 2024
80b7e3a
add tests
strengejacke Dec 18, 2024
2b24120
styler
strengejacke Dec 18, 2024
3c0af2c
lintr, wordlist
strengejacke Dec 18, 2024
f466642
docs
strengejacke Dec 18, 2024
8214df0
fix
strengejacke Dec 18, 2024
5ad1bcb
docs, tests, rename into rescaled_weights
strengejacke Dec 18, 2024
6415437
docs
strengejacke Dec 18, 2024
75843e6
Merge branch 'main' into rescale_weights_kish
strengejacke Dec 18, 2024
6778a51
Update R/rescale_weights.R
strengejacke Dec 18, 2024
749cb2f
Update NEWS.md
strengejacke Dec 18, 2024
d085d8c
Update R/rescale_weights.R
strengejacke Dec 18, 2024
98695f2
address comments
strengejacke Dec 18, 2024
164aea8
fix test
strengejacke Dec 19, 2024
f54a939
Merge branch 'main' into rescale_weights_kish
strengejacke Dec 23, 2024
6e14888
Merge branch 'main' into rescale_weights_kish
strengejacke Dec 23, 2024
39136e8
Merge branch 'main' into rescale_weights_kish
strengejacke Dec 31, 2024
c055e3b
address comments (docs)
strengejacke Dec 31, 2024
dcfc246
implement by
strengejacke Dec 31, 2024
f56e84c
implement `by`
strengejacke Dec 31, 2024
67b9ddc
typo
strengejacke Dec 31, 2024
4512769
update examples
strengejacke Dec 31, 2024
02283ee
docs
strengejacke Dec 31, 2024
71795b5
fix
strengejacke Dec 31, 2024
725b631
fix
strengejacke Dec 31, 2024
5fc7686
fix
strengejacke Dec 31, 2024
cff6780
desc
strengejacke Dec 31, 2024
2d9d111
fix, add tests
strengejacke Dec 31, 2024
05efe33
examples
strengejacke Dec 31, 2024
126ee64
docs
strengejacke Dec 31, 2024
edce1bf
docs
strengejacke Dec 31, 2024
1f9b8b7
tests
strengejacke Dec 31, 2024
475117e
add tests
strengejacke Dec 31, 2024
227c99b
typo
strengejacke Dec 31, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
Version: 0.13.0.19
Version: 0.13.0.20
Authors@R: c(
person("Indrajeet", "Patil", , "patilindrajeet.science@gmail.com", role = "aut",
comment = c(ORCID = "0000-0003-1995-6531")),
Expand Down
12 changes: 10 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,20 +4,28 @@ BREAKING CHANGES AND DEPRECATIONS

* *datawizard* now requires R >= 4.0 (#515).

* Argument `drop_na` in `data_match()` is deprecated now. Please use
* Argument `drop_na` in `data_match()` is deprecated now. Please use
`remove_na` instead.

* In `data_rename()` (#567):
- argument `pattern` is deprecated. Use `select` instead.
- argument `safe` is deprecated. The function now errors when `select`
- argument `safe` is deprecated. The function now errors when `select`
contains unknown column names.
- when `replacement` is `NULL`, an error is now thrown (previously, column
indices were used as new names).
- if `select` (previously `pattern`) is a named vector, then all elements
must be named, e.g. `c(length = "Sepal.Length", "Sepal.Width")` errors.

* The name of the rescaled weights variables in `rescale_weights()` have been
renamed. `pweights_a` and `pweights_b` are now named `rescaled_weights_a`
and `rescaled_weights_b`.

CHANGES

* `rescale_weights()` gets a `method` argument, to choose method to rescale
weights. Options are `"carle"` (the default) and `"kish"`, a newly added
method to rescale weights.
strengejacke marked this conversation as resolved.
Show resolved Hide resolved

* The `select` argument, which is available in different functions to select
variables, can now also be a character vector with quoted variable names,
including a colon to indicate a range of several variables (e.g. `"cyl:gear"`).
Expand Down
279 changes: 202 additions & 77 deletions R/rescale_weights.R
Original file line number Diff line number Diff line change
Expand Up @@ -2,96 +2,159 @@
#' @name rescale_weights
#'
#' @description Most functions to fit multilevel and mixed effects models only
#' allow to specify frequency weights, but not design (i.e. sampling or
#' probability) weights, which should be used when analyzing complex samples
#' and survey data. `rescale_weights()` implements an algorithm proposed
#' by \cite{Asparouhov (2006)} and \cite{Carle (2009)} to rescale design
#' weights in survey data to account for the grouping structure of multilevel
#' models, which then can be used for multilevel modelling.
#' allow to specify frequency weights, but not design (i.e. sampling or
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
#' probability) weights, which should be used when analyzing complex samples
#' and survey data. `rescale_weights()` implements two algorithms, one proposed
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
#' by \cite{Asparouhov (2006)} and \cite{Carle (2009)} and one proposed by
#' \cite{Kish 1965}, to rescale design weights in survey data to account for the
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
#' grouping structure of multilevel models, which then can be used for
#' multilevel modelling.
#'
#' @param data A data frame.
#' @param by Variable names (as character vector, or as formula), indicating
#' the grouping structure (strata) of the survey data (level-2-cluster
#' variable). It is also possible to create weights for multiple group
#' variables; in such cases, each created weighting variable will be suffixed
#' by the name of the group variable.
#' the grouping structure (strata) of the survey data (level-2-cluster
#' variable). It is also possible to create weights for multiple group
#' variables; in such cases, each created weighting variable will be suffixed
#' by the name of the group variable. Argument `by` only applies to the default
#' rescaling-method (`method = "carle"`), not to `method = "kish"`.
#' @param probability_weights Variable indicating the probability (design or
#' sampling) weights of the survey data (level-1-weight).
#' sampling) weights of the survey data (level-1-weight).
#' @param nest Logical, if `TRUE` and `by` indicates at least two
#' group variables, then groups are "nested", i.e. groups are now a
#' combination from each group level of the variables in `by`.
#' group variables, then groups are "nested", i.e. groups are now a
#' combination from each group level of the variables in `by`.
#' @param method String, indicating which rescale-method is used for rescaling
#' weights. Can be either `"carle"` (default) or `"kish"`. See 'Details'.
#'
#' @return `data`, including the new weighting variables: `pweights_a`
#' and `pweights_b`, which represent the rescaled design weights to use
#' in multilevel models (use these variables for the `weights` argument).
#' @return `data`, including the new weighting variable(s). For
#' `method = "carle"`, new columns `rescaled_weights_a` and `rescaled_weights_b`
#' are returned, and for `method = "klish"`, the returned data contains a column
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
#' `rescaled_weights`. These represent the rescaled design weights to use in
#' multilevel models (use these variables for the `weights` argument).
#'
#' @details
#' - `method = "carle"`
#'
#' Rescaling is based on two methods: For `pweights_a`, the sample weights
#' `probability_weights` are adjusted by a factor that represents the proportion
#' of group size divided by the sum of sampling weights within each group. The
#' adjustment factor for `pweights_b` is the sum of sample weights within each
#' group divided by the sum of squared sample weights within each group (see
#' Carle (2009), Appendix B). In other words, `pweights_a` "scales the weights
#' so that the new weights sum to the cluster sample size" while `pweights_b`
#' "scales the weights so that the new weights sum to the effective cluster
#' size".
#'
#' Regarding the choice between scaling methods A and B, Carle suggests that
#' "analysts who wish to discuss point estimates should report results based on
#' weighting method A. For analysts more interested in residual between-group
#' variance, method B may generally provide the least biased estimates". In
#' general, it is recommended to fit a non-weighted model and weighted models
#' with both scaling methods and when comparing the models, see whether the
#' "inferential decisions converge", to gain confidence in the results.
#'
#' Though the bias of scaled weights decreases with increasing group size,
#' method A is preferred when insufficient or low group size is a concern.
#'
#' The group ID and probably PSU may be used as random effects (e.g. nested
#' design, or group and PSU as varying intercepts), depending on the survey
#' design that should be mimicked.
#' Rescaling is based on two methods: For `rescaled_weights_a`, the sample
#' weights `probability_weights` are adjusted by a factor that represents the
#' proportion of group size divided by the sum of sampling weights within each
#' group. The adjustment factor for `rescaled_weights_b` is the sum of sample
#' weights within each group divided by the sum of squared sample weights
#' within each group (see Carle (2009), Appendix B). In other words,
#' `rescaled_weights_a` "scales the weights so that the new weights sum to the
#' cluster sample size" while `rescaled_weights_b` "scales the weights so that
#' the new weights sum to the effective cluster size".
#'
#' Regarding the choice between scaling methods A and B, Carle suggests that
#' "analysts who wish to discuss point estimates should report results based
#' on weighting method A. For analysts more interested in residual
#' between-group variance, method B may generally provide the least biased
#' estimates". In general, it is recommended to fit a non-weighted model and
#' weighted models with both scaling methods and when comparing the models,
#' see whether the "inferential decisions converge", to gain confidence in the
#' results.
#'
#' Though the bias of scaled weights decreases with increasing group size,
#' method A is preferred when insufficient or low group size is a concern.
#'
#' The group ID and probably PSU may be used as random effects (e.g. nested
#' design, or group and PSU as varying intercepts), depending on the survey
#' design that should be mimicked.
#'
#' - `method = "kish"`
#'
#' Rescaling is based on scaling the sample weights so the mean value is 1,
#' which means the sum of all weights equals the sample size. Next, the design
#' effect (_Kish 1965_) is calculated, which is the mean of the squared weights
#' divided by the squared mean of the weights. The scales sample weights are
#' then divided by the design effect.
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
#'
#' Some tests on real-world survey-data suggest that, in comparison to the
#' Carle-method, the Kish-method comes closer to estimates from a regular
#' survey-design using the **survey** package. Note that these tests are not
#' representative and it is recommended to check your results against a
#' standard survey-design.
#'
#' @references
#' - Asparouhov T. (2006). General Multi-Level Modeling with Sampling
#' Weights. Communications in Statistics - Theory and Methods 35: 439-460
#'
#' - Carle A.C. (2009). Fitting multilevel models in complex survey data
#' with design weights: Recommendations. BMC Medical Research Methodology
#' 9(49): 1-13
#'
#' - Asparouhov T. (2006). General Multi-Level Modeling with Sampling
#' Weights. Communications in Statistics - Theory and Methods 35: 439-460
#' - Kish, L. (1965) Survey Sampling. London: Wiley.
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
#'
#' @examplesIf all(insight::check_if_installed(c("lme4", "parameters"), quietly = TRUE))
#' data(nhanes_sample)
#' head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR"))
#'
#' # also works with multiple group-variables
#' head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR"))
#'
#' # or nested structures.
#' x <- rescale_weights(
#' data = nhanes_sample,
#' by = c("SDMVSTRA", "SDMVPSU"),
#' probability_weights = "WTINT2YR",
#' nest = TRUE
#' )
#' head(x)
#'
#' \donttest{
etiennebacher marked this conversation as resolved.
Show resolved Hide resolved
#' # compare different methods, using multilevel-Poisson regression
#'
#' @examples
#' if (require("lme4")) {
#' data(nhanes_sample)
#' head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR"))
#'
#' # also works with multiple group-variables
#' head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR"))
#'
#' # or nested structures.
#' x <- rescale_weights(
#' data = nhanes_sample,
#' by = c("SDMVSTRA", "SDMVPSU"),
#' probability_weights = "WTINT2YR",
#' nest = TRUE
#' )
#' head(x)
#'
#' nhanes_sample <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")
#'
#' glmer(
#' total ~ factor(RIAGENDR) * (log(age) + factor(RIDRETH1)) + (1 | SDMVPSU),
#' family = poisson(),
#' data = nhanes_sample,
#' weights = pweights_a
#' )
#' d <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")
#' result1 <- lme4::glmer(
#' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU),
#' family = poisson(),
#' data = d,
#' weights = rescaled_weights_a
#' )
#' result2 <- lme4::glmer(
#' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU),
#' family = poisson(),
#' data = d,
#' weights = rescaled_weights_b
#' )
#'
#' d <- rescale_weights(
#' nhanes_sample,
#' probability_weights = "WTINT2YR",
#' method = "kish"
#' )
#' result3 <- lme4::glmer(
#' total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU),
#' family = poisson(),
#' data = d,
#' weights = rescaled_weights
#' )
#' parameters::compare_parameters(
#' list(result1, result2, result3),
#' exponentiate = TRUE,
#' column_names = c("Carle (A)", "Carle (B)", "Kish")
#' )
#' }
#' @export
rescale_weights <- function(data, by, probability_weights, nest = FALSE) {
rescale_weights <- function(data,
by = NULL,
probability_weights = NULL,
nest = FALSE,
method = "carle") {
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
if (inherits(by, "formula")) {
by <- all.vars(by)
}

# check for existing variable names
if ((method == "carle" && any(c("rescaled_weights_a", "rescaled_weights_b") %in% colnames(data))) ||
(method == "kish" && "rescaled_weights" %in% colnames(data))) {
insight::format_warning("The variable name for the rescaled weights already exists in the data. Returned columns will be renamed into unique names.") # nolint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an error? No particular pref, but it seems like it could be a footgun

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no particular preference either. Since we "decide" on the name of the new column, I thought it would be fair to just warn, and give the rescaled weights column a different name than the default one, if it already exists in the data

}

# need probability_weights
if (is.null(probability_weights)) {
insight::format_error("The argument `probability_weights` is missing, but required to rescale weights.")
}

# check if weight has missings. we need to remove them first,
# and add back weights to correct cases later

Expand All @@ -104,9 +167,63 @@ rescale_weights <- function(data, by, probability_weights, nest = FALSE) {
data_tmp <- data
}

fun_args <- list(
nest = nest,
probability_weights = probability_weights,
data_tmp = data_tmp,
data = data,
by = by,
weight_non_na = weight_non_na
)

switch(method,
carle = do.call(.rescale_weights_carle, fun_args),
do.call(.rescale_weights_kish, fun_args)
)
}


# rescale weights, method Kish ----------------------------

.rescale_weights_kish <- function(nest, probability_weights, data_tmp, data, by, weight_non_na) {
# check argument
if (!is.null(by)) {
insight::format_warning("The `by` argument is not used for `method = \"kish\" and will be ignored.")
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
}
p_weights <- data_tmp[[probability_weights]]
# design effect according to Kish
deff <- mean(p_weights^2) / (mean(p_weights)^2)
# rescale weights, so their mean is 1
z_weights <- p_weights * (1 / mean(p_weights))
# divide weights by design effect
data$rescaled_weights <- NA_real_
data$rescaled_weights[weight_non_na] <- z_weights / deff
# return result
data
}


# rescale weights, method Carle ----------------------------

.rescale_weights_carle <- function(nest, probability_weights, data_tmp, data, by, weight_non_na) {
# sort id
data_tmp$.bamboozled <- seq_len(nrow(data_tmp))

if (is.null(by)) {
insight::format_error("Argument `by` must be specified. Please provide one or more variable names in `by` that indicate the grouping structure (strata) of the survey data (level-2-cluster variable).") # nolint
}

if (!all(by %in% colnames(data_tmp))) {
dont_exist <- by[which(!by %in% colnames(data_tmp))]
strengejacke marked this conversation as resolved.
Show resolved Hide resolved
insight::format_error(
paste0(
"The following variable(s) specified in `by` don't exist in the dataset: ",
text_concatenate(dont_exist), "."
),
.misspelled_string(colnames(data_tmp), dont_exist, "Possibly misspelled?")
)
}

if (nest && length(by) < 2) {
insight::format_warning(
sprintf(
Expand All @@ -129,7 +246,15 @@ rescale_weights <- function(data, by, probability_weights, nest = FALSE) {
})
}

do.call(cbind, list(data, out))
make_unique_names <- any(vapply(out, function(i) any(colnames(i) %in% colnames(data)), logical(1)))
# add weights to data frame
out <- do.call(cbind, list(data, out))
# check if we have to rename columns
if (make_unique_names) {
colnames(out) <- make.unique(colnames(out), sep = "_")
}

out
}


Expand Down Expand Up @@ -158,12 +283,12 @@ rescale_weights <- function(data, by, probability_weights, nest = FALSE) {
w_b <- x[[probability_weights]] * x$sum_weights_by_group / x$sum_squared_weights_by_group

out <- data.frame(
pweights_a = rep(NA_real_, times = n),
pweights_b = rep(NA_real_, times = n)
rescaled_weights_a = rep(NA_real_, times = n),
rescaled_weights_b = rep(NA_real_, times = n)
)

out$pweights_a[weight_non_na] <- w_a
out$pweights_b[weight_non_na] <- w_b
out$rescaled_weights_a[weight_non_na] <- w_a
out$rescaled_weights_b[weight_non_na] <- w_b

out
}
Expand Down Expand Up @@ -206,12 +331,12 @@ rescale_weights <- function(data, by, probability_weights, nest = FALSE) {
w_b <- x[[probability_weights]] * x$sum_weights_by_group / x$sum_squared_weights_by_group

out <- data.frame(
pweights_a = rep(NA_real_, times = n),
pweights_b = rep(NA_real_, times = n)
rescaled_weights_a = rep(NA_real_, times = n),
rescaled_weights_b = rep(NA_real_, times = n)
)

out$pweights_a[weight_non_na] <- w_a
out$pweights_b[weight_non_na] <- w_b
out$rescaled_weights_a[weight_non_na] <- w_a
out$rescaled_weights_b[weight_non_na] <- w_b

out
}
Loading
Loading