diff --git a/DESCRIPTION b/DESCRIPTION
index 3f970c585..fadd5fee9 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,7 +1,7 @@
 Type: Package
 Package: performance
 Title: Assessment of Regression Models Performance
-Version: 0.12.4.9
+Version: 0.12.4.14
 Authors@R:
     c(person(given = "Daniel",
              family = "Lüdecke",
@@ -74,7 +74,7 @@ Depends:
     R (>= 3.6)
 Imports:
     bayestestR (>= 0.15.0),
-    insight (>= 0.20.5),
+    insight (>= 1.0.0),
     datawizard (>= 0.13.0),
     stats,
     utils
@@ -160,4 +160,4 @@ Config/Needs/website:
     r-lib/pkgdown,
     easystats/easystatstemplate
 Config/rcmdcheck/ignore-inconsequential-notes: true
-Remotes: easystats/insight
+Remotes: easystats/datawizard, easystats/see
diff --git a/NEWS.md b/NEWS.md
index be90d0056..d4f14ab62 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -2,6 +2,9 @@
 
 ## Breaking changes
 
+* `check_outliers()` with `method = "optics"` now returns a more refined
+  cluster selection by passing the `optics_xi` argument to `dbscan::extractXi()`.
+
 * Deprecated arguments and alias-function-names have been removed.
 
 * Argument names in `check_model()` that refer to plot-aesthetics (like
diff --git a/R/check_collinearity.R b/R/check_collinearity.R
index 1cb0b6913..b5019c5c3 100644
--- a/R/check_collinearity.R
+++ b/R/check_collinearity.R
@@ -73,7 +73,12 @@
 #' This portion of multicollinearity among the component terms of an
 #' interaction is also called "inessential ill-conditioning", which leads to
 #' inflated VIF values that are typically seen for models with interaction
-#' terms _(Francoeur 2013)_.
+#' terms _(Francoeur 2013)_. Centering the variables involved in the
+#' interaction can resolve this issue _(Kim and Jung 2024)_.
+#'
+#' @section Multicollinearity and Polynomial Terms:
+#' Polynomial transformations are treated as a single term, and thus no VIFs
+#' are calculated among their components.
 #'
 #' @section Concurvity for Smooth Terms in Generalized Additive Models:
 #' `check_concurvity()` is a wrapper around `mgcv::concurvity()`, and can be
@@ -91,26 +96,30 @@
 #' @references
 #'
 #' - Francoeur, R. B. (2013). Could Sequential Residual Centering Resolve
-#' Low Sensitivity in Moderated Regression? Simulations and Cancer Symptom
-#' Clusters. Open Journal of Statistics, 03(06), 24-44.
+#'   Low Sensitivity in Moderated Regression? Simulations and Cancer Symptom
+#'   Clusters. Open Journal of Statistics, 03(06), 24-44.
+#'
+#' - James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). (2013). An
+#'   introduction to statistical learning: with applications in R. New York:
+#'   Springer.
 #'
-#' - James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). (2013).
-#' An introduction to statistical learning: with applications in R. New York:
-#' Springer.
+#' - Kim, Y., & Jung, G. (2024). Understanding linear interaction analysis with
+#'   causal graphs. British Journal of Mathematical and Statistical Psychology,
+#'   00, 1–14.
 #'
 #' - Marcoulides, K. M., and Raykov, T. (2019). Evaluation of Variance
-#' Inflation Factors in Regression Models Using Latent Variable Modeling
-#' Methods. Educational and Psychological Measurement, 79(5), 874–882.
+#'   Inflation Factors in Regression Models Using Latent Variable Modeling
+#'   Methods. Educational and Psychological Measurement, 79(5), 874–882.
 #'
 #' - McElreath, R. (2020). Statistical rethinking: A Bayesian course with
-#' examples in R and Stan. 2nd edition. Chapman and Hall/CRC.
+#'   examples in R and Stan. 2nd edition. Chapman and Hall/CRC.
 #'
 #' - Vanhove, J. (2019). Collinearity isn't a disease that needs curing.
-#' [webpage](https://janhove.github.io/posts/2019-09-11-collinearity/)
+#'   [webpage](https://janhove.github.io/posts/2019-09-11-collinearity/)
 #'
 #' - Zuur AF, Ieno EN, Elphick CS. A protocol for data exploration to avoid
-#' common statistical problems: Data exploration. Methods in Ecology and
-#' Evolution (2010) 1:3–14.
+#'   common statistical problems: Data exploration. Methods in Ecology and
+#'   Evolution (2010) 1:3–14.
 #'
 #' @family functions to check model assumptions and and assess model quality
 #'
@@ -193,7 +202,7 @@ plot.check_collinearity <- function(x, ...) {
   x <- insight::format_table(x)
   x <- datawizard::data_rename(
     x,
-    pattern = "SE_factor",
+    select = "SE_factor",
     replacement = "Increased SE",
     verbose = FALSE
   )
@@ -514,7 +523,7 @@ check_collinearity.zerocount <- function(x,
   if (!is.null(insight::find_interactions(x)) && any(result > 10) && isTRUE(verbose)) {
     insight::format_alert(
       "Model has interaction terms. VIFs might be inflated.",
-      "You may check multicollinearity among predictors of a model without interaction terms."
+      "Try centering the variables used in the interaction, or check multicollinearity among predictors of a model without interaction terms." # nolint
     )
diff --git a/R/check_model_diagnostics.R b/R/check_model_diagnostics.R
index 9595ed968..df16f252c 100644
--- a/R/check_model_diagnostics.R
+++ b/R/check_model_diagnostics.R
@@ -10,16 +10,9 @@
   dat$group[dat$VIF >= 5 & dat$VIF < 10] <- "moderate"
   dat$group[dat$VIF >= 10] <- "high"
 
-  dat <- datawizard::data_rename(
-    dat,
-    c("Term", "VIF", "SE_factor", "Component"),
-    c("x", "y", "se", "facet"),
-    verbose = FALSE
-  )
-
   dat <- datawizard::data_select(
     dat,
-    c("x", "y", "facet", "group"),
+    select = c(x = "Term", y = "VIF", facet = "Component", group = "group"),
     verbose = FALSE
   )
diff --git a/R/check_outliers.R b/R/check_outliers.R
index ffb2d633f..0e68d1b18 100644
--- a/R/check_outliers.R
+++ b/R/check_outliers.R
@@ -198,7 +198,8 @@
 #' extreme values), this algorithm functions in a different manner and won't
 #' always detect outliers. Note that `method = "optics"` requires the
 #' **dbscan** package to be installed, and that it takes some time to compute
-#' the results.
+#' the results. Additionally, the `optics_xi` argument (which defaults to 0.05)
+#' is passed to [dbscan::extractXi()] to further refine the cluster selection.
 #'
 #' - **Local Outlier Factor**:
 #' Based on a K nearest neighbors algorithm, LOF compares the local density of
@@ -242,6 +243,7 @@
 #'   mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
 #'   ics = 0.001,
 #'   optics = 2 * ncol(x),
+#'   optics_xi = 0.05,
 #'   lof = 0.001
 #' )
 #' ```
@@ -881,6 +883,13 @@ check_outliers.data.frame <- function(x,
   } else if (is.numeric(threshold)) {
     thresholds <- .check_outliers_thresholds(x)
     thresholds <- lapply(thresholds, function(x) threshold)
+    # need to fix this manually - "optics" automatically includes the
+    # "optics_xi" threshold, which must range between 0 and 1 - since values
+    # for "optics" can be > 1, the lapply() above might overwrite "optics_xi"
+    # with an invalid value...
+    if (thresholds$optics_xi > 1) {
+      thresholds$optics_xi <- 0.05
+    }
   } else {
     insight::format_error(
       paste(
@@ -890,7 +899,13 @@ check_outliers.data.frame <- function(x,
     )
   }
 
-  thresholds <- thresholds[names(thresholds) %in% method]
+  # Keep only the relevant thresholds
+  valid <- method
+  if ("optics" %in% valid) {
+    valid <- c(valid, "optics_xi")
+    method <- c(method, "optics_xi")
+  }
+  thresholds <- thresholds[names(thresholds) %in% valid]
 
   out.meta <- .check_outliers.data.frame_method(x, method, thresholds, ID, ID.names, ...)
   out <- out.meta$out
@@ -1207,7 +1222,8 @@ check_outliers.data.frame <- function(x,
     out <- c(out, .check_outliers_optics(
       x,
       threshold = thresholds$optics,
-      ID.names = ID.names
+      ID.names = ID.names,
+      xi = thresholds$optics_xi
     ))
 
     count.table <- datawizard::data_filter(
@@ -1506,38 +1522,23 @@ check_outliers.DHARMa <- check_outliers.performance_simres
 }
 
 .check_outliers_thresholds_nowarn <- function(x) {
-  zscore <- stats::qnorm(p = 1 - 0.001 / 2)
-  zscore_robust <- stats::qnorm(p = 1 - 0.001 / 2)
-  iqr <- 1.7
-  ci <- 1 - 0.001
-  eti <- 1 - 0.001
-  hdi <- 1 - 0.001
-  bci <- 1 - 0.001
-  cook <- stats::qf(0.5, ncol(x), nrow(x) - ncol(x))
-  pareto <- 0.7
-  mahalanobis_value <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
-  mahalanobis_robust <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
-  mcd <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
-  ics <- 0.001
-  optics <- 2 * ncol(x)
-  lof <- 0.001
-
   list(
-    zscore = zscore,
-    zscore_robust = zscore_robust,
-    iqr = iqr,
-    ci = ci,
-    hdi = hdi,
-    eti = eti,
-    bci = bci,
-    cook = cook,
-    pareto = pareto,
-    mahalanobis = mahalanobis_value,
-    mahalanobis_robust = mahalanobis_robust,
-    mcd = mcd,
-    ics = ics,
-    optics = optics,
-    lof = lof
+    zscore = stats::qnorm(p = 1 - 0.001 / 2),
+    zscore_robust = stats::qnorm(p = 1 - 0.001 / 2),
+    iqr = 1.7,
+    ci = 1 - 0.001,
+    hdi = 1 - 0.001,
+    eti = 1 - 0.001,
+    bci = 1 - 0.001,
+    cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
+    pareto = 0.7,
+    mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
+    mahalanobis_robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
+    mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
+    ics = 0.001,
+    optics = 2 * ncol(x),
+    optics_xi = 0.05,
+    lof = 0.001
   )
 }
@@ -1929,7 +1930,8 @@ check_outliers.DHARMa <- check_outliers.performance_simres
 
 .check_outliers_optics <- function(x,
                                    threshold = NULL,
-                                   ID.names = NULL) {
+                                   ID.names = NULL,
+                                   xi = 0.05) {
   out <- data.frame(Row = seq_len(nrow(x)))
 
   if (!is.null(ID.names)) {
@@ -1940,7 +1942,7 @@ check_outliers.DHARMa <- check_outliers.performance_simres
 
   # Compute
   rez <- dbscan::optics(x, minPts = threshold)
-  rez <- dbscan::extractXi(rez, xi = 0.05) # TODO: find automatic way of setting xi
+  rez <- dbscan::extractXi(rez, xi = xi) # TODO: find automatic way of setting xi
 
   out$Distance_OPTICS <- rez$coredist
diff --git a/man/check_collinearity.Rd b/man/check_collinearity.Rd
index 847ff110d..699695a06 100644
--- a/man/check_collinearity.Rd
+++ b/man/check_collinearity.Rd
@@ -111,7 +111,14 @@ If interaction terms are included in a model, high VIF values are expected.
 This portion of multicollinearity among the component terms of an
 interaction is also called "inessential ill-conditioning", which leads to
 inflated VIF values that are typically seen for models with interaction
-terms \emph{(Francoeur 2013)}.
+terms \emph{(Francoeur 2013)}. Centering the variables involved in the
+interaction can resolve this issue \emph{(Kim and Jung 2024)}.
+}
+
+\section{Multicollinearity and Polynomial Terms}{
+
+Polynomial transformations are treated as a single term, and thus no VIFs
+are calculated among their components.
 }
 
 \section{Concurvity for Smooth Terms in Generalized Additive Models}{
@@ -144,9 +151,12 @@ plot(x)
 \item Francoeur, R. B. (2013). Could Sequential Residual Centering Resolve
 Low Sensitivity in Moderated Regression? Simulations and Cancer Symptom
 Clusters. Open Journal of Statistics, 03(06), 24-44.
-\item James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). (2013).
-An introduction to statistical learning: with applications in R. New York:
+\item James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). (2013). An
+introduction to statistical learning: with applications in R. New York:
 Springer.
+\item Kim, Y., & Jung, G. (2024). Understanding linear interaction analysis with
+causal graphs. British Journal of Mathematical and Statistical Psychology,
+00, 1–14.
 \item Marcoulides, K. M., and Raykov, T. (2019). Evaluation of Variance
 Inflation Factors in Regression Models Using Latent Variable Modeling
 Methods. Educational and Psychological Measurement, 79(5), 874–882.
diff --git a/man/check_outliers.Rd b/man/check_outliers.Rd
index 623eae4b2..489dbafc3 100644
--- a/man/check_outliers.Rd
+++ b/man/check_outliers.Rd
@@ -236,7 +236,8 @@ detect several outliers (as these are usually defined as a percentage of
 extreme values), this algorithm functions in a different manner and won't
 always detect outliers. Note that \code{method = "optics"} requires the
 \strong{dbscan} package to be installed, and that it takes some time to compute
-the results.
+the results. Additionally, the \code{optics_xi} argument (which defaults to 0.05)
+is passed to \code{\link[dbscan:optics]{dbscan::extractXi()}} to further refine the cluster selection.
 \item \strong{Local Outlier Factor}:
 Based on a K nearest neighbors algorithm, LOF compares the local density of
 a point to the local densities of its neighbors instead of computing a
@@ -283,6 +284,7 @@ Default thresholds are currently specified as follows:
   mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
   ics = 0.001,
   optics = 2 * ncol(x),
+  optics_xi = 0.05,
   lof = 0.001
 )
 }\if{html}{\out{</div>}}
diff --git a/tests/testthat/test-check_collinearity.R b/tests/testthat/test-check_collinearity.R
index ea3dd56cd..8bab16003 100644
--- a/tests/testthat/test-check_collinearity.R
+++ b/tests/testthat/test-check_collinearity.R
@@ -23,6 +23,12 @@ test_that("check_collinearity, correct order in print", {
 })
 
 
+test_that("check_collinearity, interaction", {
+  m <- lm(mpg ~ wt * cyl, data = mtcars)
+  expect_message(check_collinearity(m), regexp = "Model has interaction terms")
+})
+
+
 test_that("check_collinearity", {
   skip_if_not_installed("glmmTMB")
   skip_if_not(getRversion() >= "4.0.0")
diff --git a/vignettes/check_model.Rmd b/vignettes/check_model.Rmd
index bf933e13c..3e52ef797 100644
--- a/vignettes/check_model.Rmd
+++ b/vignettes/check_model.Rmd
@@ -250,7 +250,7 @@ Our model clearly suffers from multicollinearity, as all predictors have high VI
 
 ### How to fix this?
 
-Usually, predictors with (very) high VIF values should be removed from the model to fix multicollinearity. Some caution is needed for interaction terms. If interaction terms are included in a model, high VIF values are expected. This portion of multicollinearity among the component terms of an interaction is also called "inessential ill-conditioning", which leads to inflated VIF values that are typically seen for models with interaction terms _(Francoeur 2013)_. In such cases, re-fit your model without interaction terms and check this model for collinearity among predictors.
+Usually, predictors with (very) high VIF values should be removed from the model to fix multicollinearity. Some caution is needed for interaction terms. If interaction terms are included in a model, high VIF values are expected. This portion of multicollinearity among the component terms of an interaction is also called "inessential ill-conditioning", which leads to inflated VIF values that are typically seen for models with interaction terms _(Francoeur 2013)_. In such cases, try centering the variables involved in the interaction, which can reduce multicollinearity _(Kim and Jung 2024)_, or re-fit your model without interaction terms and check this model for collinearity among predictors.
 
 ## Normality of residuals
@@ -293,6 +293,8 @@ Gelman A, and Hill J. Data analysis using regression and multilevel/hierarchical
 
 James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.).An introduction to statistical learning: with applications in R. New York: Springer, 2013
 
+Kim, Y., & Jung, G. (2024). Understanding linear interaction analysis with causal graphs. British Journal of Mathematical and Statistical Psychology, 00, 1–14.
+
 Leys C, Delacre M, Mora YL, Lakens D, Ley C. How to Classify, Detect, and Manage Univariate and Multivariate Outliers, With Emphasis on Pre-Registration. International Review of Social Psychology, 2019
 
 McElreath, R. Statistical rethinking: A Bayesian course with examples in R and Stan. 2nd edition. Chapman and Hall/CRC, 2020
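The new `optics_xi` threshold can be exercised directly. A minimal sketch, assuming the **dbscan** package is installed and that the patched `check_outliers()` accepts `optics_xi` alongside `optics` in the `threshold` list; `optics = 2 * ncol(data)` mirrors the documented default:

```r
library(performance)

# pass per-method thresholds as a named list; "optics_xi" is forwarded
# to dbscan::extractXi() to refine the cluster selection
data <- mtcars[c("mpg", "hp", "wt")]
check_outliers(
  data,
  method = "optics",
  threshold = list(optics = 2 * ncol(data), optics_xi = 0.05)
)
```

The centering advice added to the alert message and vignette can be sketched with the same model the new test uses. After centering, the inflated VIFs should drop markedly; any remaining collinearity reflects the real correlation between `wt` and `cyl`:

```r
library(performance)

# uncentered interaction: inflated VIFs ("inessential ill-conditioning")
m <- lm(mpg ~ wt * cyl, data = mtcars)
check_collinearity(m)

# center the variables involved in the interaction, then re-check
mtcars_c <- transform(mtcars, wt = wt - mean(wt), cyl = cyl - mean(cyl))
m_c <- lm(mpg ~ wt * cyl, data = mtcars_c)
check_collinearity(m_c)
```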