
Commit

Merge branch 'main' into strengejacke/issue781
strengejacke authored Dec 22, 2024
2 parents 966094c + 14a121f commit 68f2db4
Showing 9 changed files with 93 additions and 66 deletions.
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: performance
Title: Assessment of Regression Models Performance
Version: 0.12.4.9
Version: 0.12.4.14
Authors@R:
c(person(given = "Daniel",
family = "Lüdecke",
@@ -74,7 +74,7 @@ Depends:
R (>= 3.6)
Imports:
bayestestR (>= 0.15.0),
insight (>= 0.20.5),
insight (>= 1.0.0),
datawizard (>= 0.13.0),
stats,
utils
@@ -160,4 +160,4 @@ Config/Needs/website:
r-lib/pkgdown,
easystats/easystatstemplate
Config/rcmdcheck/ignore-inconsequential-notes: true
Remotes: easystats/insight
Remotes: easystats/datawizard, easystats/see
3 changes: 3 additions & 0 deletions NEWS.md
@@ -2,6 +2,9 @@

## Breaking changes

* `check_outliers()` with `method = "optics"` now returns a further refined
  cluster selection by passing the new `optics_xi` argument to `dbscan::extractXi()`.

* Deprecated arguments and alias-function-names have been removed.

* Argument names in `check_model()` that refer to plot-aesthetics (like
37 changes: 23 additions & 14 deletions R/check_collinearity.R
@@ -73,7 +73,12 @@
#' This portion of multicollinearity among the component terms of an
#' interaction is also called "inessential ill-conditioning", which leads to
#' inflated VIF values that are typically seen for models with interaction
#' terms _(Francoeur 2013)_.
#' terms _(Francoeur 2013)_. Centering interaction terms can resolve this
#' issue _(Kim and Jung 2024)_.
#'
#' @section Multicollinearity and Polynomial Terms:
#' Polynomial transformations are treated as a single term, and thus VIFs are
#' not calculated between their components.
#'
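A minimal usage sketch of the polynomial case documented above (model and data are illustrative only):

```r
library(performance)

# poly(hp, 2) enters the model as a single term, so check_collinearity()
# should report one VIF for it instead of separate VIFs for the linear and
# quadratic components.
m_poly <- lm(mpg ~ poly(hp, 2) + wt, data = mtcars)
check_collinearity(m_poly)
```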
#' @section Concurvity for Smooth Terms in Generalized Additive Models:
#' `check_concurvity()` is a wrapper around `mgcv::concurvity()`, and can be
@@ -91,26 +96,30 @@
#' @references
#'
#' - Francoeur, R. B. (2013). Could Sequential Residual Centering Resolve
#' Low Sensitivity in Moderated Regression? Simulations and Cancer Symptom
#' Clusters. Open Journal of Statistics, 03(06), 24-44.
#' Low Sensitivity in Moderated Regression? Simulations and Cancer Symptom
#' Clusters. Open Journal of Statistics, 03(06), 24-44.
#'
#' - James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). (2013). An
#' introduction to statistical learning: with applications in R. New York:
#' Springer.
#'
#' - James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). (2013).
#' An introduction to statistical learning: with applications in R. New York:
#' Springer.
#' - Kim, Y., & Jung, G. (2024). Understanding linear interaction analysis with
#' causal graphs. British Journal of Mathematical and Statistical Psychology,
#' 00, 1–14.
#'
#' - Marcoulides, K. M., and Raykov, T. (2019). Evaluation of Variance
#' Inflation Factors in Regression Models Using Latent Variable Modeling
#' Methods. Educational and Psychological Measurement, 79(5), 874–882.
#' Inflation Factors in Regression Models Using Latent Variable Modeling
#' Methods. Educational and Psychological Measurement, 79(5), 874–882.
#'
#' - McElreath, R. (2020). Statistical rethinking: A Bayesian course with
#' examples in R and Stan. 2nd edition. Chapman and Hall/CRC.
#' examples in R and Stan. 2nd edition. Chapman and Hall/CRC.
#'
#' - Vanhove, J. (2019). Collinearity isn't a disease that needs curing.
#' [webpage](https://janhove.github.io/posts/2019-09-11-collinearity/)
#' [webpage](https://janhove.github.io/posts/2019-09-11-collinearity/)
#'
#' - Zuur AF, Ieno EN, Elphick CS. A protocol for data exploration to avoid
#' common statistical problems: Data exploration. Methods in Ecology and
#' Evolution (2010) 1:3–14.
#' common statistical problems: Data exploration. Methods in Ecology and
#' Evolution (2010) 1:3–14.
#'
#' @family functions to check model assumptions and assess model quality
#'
@@ -193,7 +202,7 @@ plot.check_collinearity <- function(x, ...) {
x <- insight::format_table(x)
x <- datawizard::data_rename(
x,
pattern = "SE_factor",
select = "SE_factor",
replacement = "Increased SE",
verbose = FALSE
)
@@ -514,7 +523,7 @@ check_collinearity.zerocount <- function(x,
if (!is.null(insight::find_interactions(x)) && any(result > 10) && isTRUE(verbose)) {
insight::format_alert(
"Model has interaction terms. VIFs might be inflated.",
"You may check multicollinearity among predictors of a model without interaction terms."
"Try to center the variables used for the interaction, or check multicollinearity among predictors of a model without interaction terms." # nolint
)
}

9 changes: 1 addition & 8 deletions R/check_model_diagnostics.R
@@ -10,16 +10,9 @@
dat$group[dat$VIF >= 5 & dat$VIF < 10] <- "moderate"
dat$group[dat$VIF >= 10] <- "high"

dat <- datawizard::data_rename(
dat,
c("Term", "VIF", "SE_factor", "Component"),
c("x", "y", "se", "facet"),
verbose = FALSE
)

dat <- datawizard::data_select(
dat,
c("x", "y", "facet", "group"),
select = c(x = "Term", y = "VIF", facet = "Component", group = "group"),
verbose = FALSE
)

74 changes: 38 additions & 36 deletions R/check_outliers.R
@@ -198,7 +198,8 @@
#' extreme values), this algorithm functions in a different manner and won't
#' always detect outliers. Note that `method = "optics"` requires the
#' **dbscan** package to be installed, and that it takes some time to compute
#' the results.
#' the results. Additionally, the `optics_xi` argument (which defaults to 0.05)
#' is passed to the [dbscan::extractXi()] function to further refine the cluster selection.
#'
#' - **Local Outlier Factor**:
#' Based on a K nearest neighbors algorithm, LOF compares the local density of
@@ -242,6 +243,7 @@
#' mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
#' ics = 0.001,
#' optics = 2 * ncol(x),
#' optics_xi = 0.05,
#' lof = 0.001
#' )
#' ```
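A hedged usage sketch of the refined OPTICS method (requires the **dbscan** package; the data and threshold values are illustrative, and a partial `threshold` list is assumed to be merged with the defaults shown above):

```r
library(performance)

# Default: optics_xi = 0.05 is forwarded to dbscan::extractXi()
check_outliers(iris[1:4], method = "optics")

# Custom minPts for OPTICS together with a stricter xi for cluster extraction
check_outliers(
  iris[1:4],
  method = "optics",
  threshold = list(optics = 10, optics_xi = 0.1)
)
```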
@@ -881,6 +883,13 @@ check_outliers.data.frame <- function(x,
} else if (is.numeric(threshold)) {
thresholds <- .check_outliers_thresholds(x)
thresholds <- lapply(thresholds, function(x) threshold)
# Fix this manually: method "optics" automatically includes "optics_xi",
# which must lie between 0 and 1. A user-supplied numeric threshold for
# "optics" can be > 1 and would otherwise overwrite "optics_xi" with an
# invalid value.
if (thresholds$optics_xi > 1) {
thresholds$optics_xi <- 0.05
}
} else {
insight::format_error(
paste(
@@ -890,7 +899,13 @@
)
}

thresholds <- thresholds[names(thresholds) %in% method]
# Keep only relevant threshold
valid <- method
if ("optics" %in% valid) {
valid <- c(valid, "optics_xi")
method <- c(method, "optics_xi")
}
thresholds <- thresholds[names(thresholds) %in% valid]

out.meta <- .check_outliers.data.frame_method(x, method, thresholds, ID, ID.names, ...)
out <- out.meta$out
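A hedged illustration of the situation the `optics_xi` guard above protects against (data and threshold are illustrative; requires **dbscan**):

```r
library(performance)

# A single numeric threshold is recycled across all requested methods; with
# method = "optics" it would also land in `optics_xi`, which must lie in
# (0, 1), so the guard falls back to the 0.05 default.
check_outliers(iris[1:4], method = "optics", threshold = 20)
```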
@@ -1207,7 +1222,8 @@ check_outliers.data.frame <- function(x,
out <- c(out, .check_outliers_optics(
x,
threshold = thresholds$optics,
ID.names = ID.names
ID.names = ID.names,
xi = thresholds$optics_xi
))

count.table <- datawizard::data_filter(
@@ -1506,38 +1522,23 @@ check_outliers.DHARMa <- check_outliers.performance_simres
}

.check_outliers_thresholds_nowarn <- function(x) {
zscore <- stats::qnorm(p = 1 - 0.001 / 2)
zscore_robust <- stats::qnorm(p = 1 - 0.001 / 2)
iqr <- 1.7
ci <- 1 - 0.001
eti <- 1 - 0.001
hdi <- 1 - 0.001
bci <- 1 - 0.001
cook <- stats::qf(0.5, ncol(x), nrow(x) - ncol(x))
pareto <- 0.7
mahalanobis_value <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
mahalanobis_robust <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
mcd <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
ics <- 0.001
optics <- 2 * ncol(x)
lof <- 0.001

list(
zscore = zscore,
zscore_robust = zscore_robust,
iqr = iqr,
ci = ci,
hdi = hdi,
eti = eti,
bci = bci,
cook = cook,
pareto = pareto,
mahalanobis = mahalanobis_value,
mahalanobis_robust = mahalanobis_robust,
mcd = mcd,
ics = ics,
optics = optics,
lof = lof
zscore = stats::qnorm(p = 1 - 0.001 / 2),
zscore_robust = stats::qnorm(p = 1 - 0.001 / 2),
iqr = 1.7,
ci = 1 - 0.001,
hdi = 1 - 0.001,
eti = 1 - 0.001,
bci = 1 - 0.001,
cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
pareto = 0.7,
mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
mahalanobis_robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
ics = 0.001,
optics = 2 * ncol(x),
optics_xi = 0.05,
lof = 0.001
)
}

@@ -1929,7 +1930,8 @@ check_outliers.DHARMa <- check_outliers.performance_simres

.check_outliers_optics <- function(x,
threshold = NULL,
ID.names = NULL) {
ID.names = NULL,
xi = 0.05) {
out <- data.frame(Row = seq_len(nrow(x)))

if (!is.null(ID.names)) {
Expand All @@ -1940,7 +1942,7 @@ check_outliers.DHARMa <- check_outliers.performance_simres

# Compute
rez <- dbscan::optics(x, minPts = threshold)
rez <- dbscan::extractXi(rez, xi = 0.05) # TODO: find automatic way of setting xi
rez <- dbscan::extractXi(rez, xi = xi) # TODO: find automatic way of setting xi

out$Distance_OPTICS <- rez$coredist

16 changes: 13 additions & 3 deletions man/check_collinearity.Rd

Some generated files are not rendered by default.

4 changes: 3 additions & 1 deletion man/check_outliers.Rd

Some generated files are not rendered by default.

6 changes: 6 additions & 0 deletions tests/testthat/test-check_collinearity.R
@@ -23,6 +23,12 @@ test_that("check_collinearity, correct order in print", {
})


test_that("check_collinearity, interaction", {
m <- lm(mpg ~ wt * cyl, data = mtcars)
expect_message(check_collinearity(m), regex = "Model has interaction terms")
})


test_that("check_collinearity", {
skip_if_not_installed("glmmTMB")
skip_if_not(getRversion() >= "4.0.0")
4 changes: 3 additions & 1 deletion vignettes/check_model.Rmd
@@ -250,7 +250,7 @@ Our model clearly suffers from multicollinearity, as all predictors have high VI

### How to fix this?

Usually, predictors with (very) high VIF values should be removed from the model to fix multicollinearity. Some caution is needed for interaction terms. If interaction terms are included in a model, high VIF values are expected. This portion of multicollinearity among the component terms of an interaction is also called "inessential ill-conditioning", which leads to inflated VIF values that are typically seen for models with interaction terms _(Francoeur 2013)_. In such cases, re-fit your model without interaction terms and check this model for collinearity among predictors.
Usually, predictors with (very) high VIF values should be removed from the model to fix multicollinearity. Some caution is needed for interaction terms. If interaction terms are included in a model, high VIF values are expected. This portion of multicollinearity among the component terms of an interaction is also called "inessential ill-conditioning", which leads to inflated VIF values that are typically seen for models with interaction terms _(Francoeur 2013)_. In such cases, try centering the involved interaction terms, which can reduce multicollinearity _(Kim and Jung 2024)_, or re-fit your model without interaction terms and check this model for collinearity among predictors.
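
A brief sketch of the centering remedy described above (illustrative code, not one of the vignette's executed chunks; base R centering is used here, though `datawizard::center()` would work as well):

```r
library(performance)

d <- mtcars
d$wt_c  <- d$wt  - mean(d$wt)   # center the components of the interaction
d$cyl_c <- d$cyl - mean(d$cyl)

m_raw      <- lm(mpg ~ wt * cyl, data = d)      # inflated VIFs expected
m_centered <- lm(mpg ~ wt_c * cyl_c, data = d)  # collinearity should drop markedly

check_collinearity(m_raw)
check_collinearity(m_centered)
```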

## Normality of residuals

@@ -293,6 +293,8 @@ Gelman A, and Hill J. Data analysis using regression and multilevel/hierarchical

James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). An introduction to statistical learning: with applications in R. New York: Springer, 2013

Kim, Y., & Jung, G. (2024). Understanding linear interaction analysis with causal graphs. British Journal of Mathematical and Statistical Psychology, 00, 1–14.

Leys C, Delacre M, Mora YL, Lakens D, Ley C. How to Classify, Detect, and Manage Univariate and Multivariate Outliers, With Emphasis on Pre-Registration. International Review of Social Psychology, 2019

McElreath, R. Statistical rethinking: A Bayesian course with examples in R and Stan. 2nd edition. Chapman and Hall/CRC, 2020
