
Commit

Merge branch 'main' into strengejacke/issue781
strengejacke authored Dec 22, 2024
2 parents 966094c + 14a121f commit 68f2db4
Showing 9 changed files with 93 additions and 66 deletions.
6 changes: 3 additions & 3 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: performance
Title: Assessment of Regression Models Performance
Version: 0.12.4.9
Version: 0.12.4.14
Authors@R:
c(person(given = "Daniel",
family = "Lüdecke",
@@ -74,7 +74,7 @@ Depends:
R (>= 3.6)
Imports:
bayestestR (>= 0.15.0),
insight (>= 0.20.5),
insight (>= 1.0.0),
datawizard (>= 0.13.0),
stats,
utils
@@ -160,4 +160,4 @@ Config/Needs/website:
r-lib/pkgdown,
easystats/easystatstemplate
Config/rcmdcheck/ignore-inconsequential-notes: true
Remotes: easystats/insight
Remotes: easystats/datawizard, easystats/see
3 changes: 3 additions & 0 deletions NEWS.md
@@ -2,6 +2,9 @@

## Breaking changes

* `check_outliers()` with `method = "optics"` now returns a further refined
  cluster selection by passing the new `optics_xi` argument to `dbscan::extractXi()`.

* Deprecated arguments and alias-function-names have been removed.

* Argument names in `check_model()` that refer to plot-aesthetics (like
37 changes: 23 additions & 14 deletions R/check_collinearity.R
@@ -73,7 +73,12 @@
#' This portion of multicollinearity among the component terms of an
#' interaction is also called "inessential ill-conditioning", which leads to
#' inflated VIF values that are typically seen for models with interaction
#' terms _(Francoeur 2013)_.
#' terms _(Francoeur 2013)_. Centering interaction terms can resolve this
#' issue _(Kim and Jung 2024)_.
#'
#' @section Multicollinearity and Polynomial Terms:
#' Polynomial transformations are treated as a single term, and thus VIFs are
#' not calculated between their components.
#'
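A minimal usage sketch of the polynomial case documented above (model and data are illustrative only):

```r
library(performance)

# poly(hp, 2) enters the model as a single term, so check_collinearity()
# should report one VIF for it instead of separate VIFs for the linear and
# quadratic components.
m_poly <- lm(mpg ~ poly(hp, 2) + wt, data = mtcars)
check_collinearity(m_poly)
```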
#' @section Concurvity for Smooth Terms in Generalized Additive Models:
#' `check_concurvity()` is a wrapper around `mgcv::concurvity()`, and can be
@@ -91,26 +96,30 @@
#' @references
#'
#' - Francoeur, R. B. (2013). Could Sequential Residual Centering Resolve
#' Low Sensitivity in Moderated Regression? Simulations and Cancer Symptom
#' Clusters. Open Journal of Statistics, 03(06), 24-44.
#' Low Sensitivity in Moderated Regression? Simulations and Cancer Symptom
#' Clusters. Open Journal of Statistics, 03(06), 24-44.
#'
#' - James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). (2013). An
#' introduction to statistical learning: with applications in R. New York:
#' Springer.
#'
#' - James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). (2013).
#' An introduction to statistical learning: with applications in R. New York:
#' Springer.
#' - Kim, Y., & Jung, G. (2024). Understanding linear interaction analysis with
#' causal graphs. British Journal of Mathematical and Statistical Psychology,
#' 00, 1–14.
#'
#' - Marcoulides, K. M., and Raykov, T. (2019). Evaluation of Variance
#' Inflation Factors in Regression Models Using Latent Variable Modeling
#' Methods. Educational and Psychological Measurement, 79(5), 874–882.
#' Inflation Factors in Regression Models Using Latent Variable Modeling
#' Methods. Educational and Psychological Measurement, 79(5), 874–882.
#'
#' - McElreath, R. (2020). Statistical rethinking: A Bayesian course with
#' examples in R and Stan. 2nd edition. Chapman and Hall/CRC.
#' examples in R and Stan. 2nd edition. Chapman and Hall/CRC.
#'
#' - Vanhove, J. (2019). Collinearity isn't a disease that needs curing.
#' [webpage](https://janhove.github.io/posts/2019-09-11-collinearity/)
#' [webpage](https://janhove.github.io/posts/2019-09-11-collinearity/)
#'
#' - Zuur AF, Ieno EN, Elphick CS. A protocol for data exploration to avoid
#' common statistical problems: Data exploration. Methods in Ecology and
#' Evolution (2010) 1:3–14.
#' common statistical problems: Data exploration. Methods in Ecology and
#' Evolution (2010) 1:3–14.
#'
#' @family functions to check model assumptions and assess model quality
#'
@@ -193,7 +202,7 @@ plot.check_collinearity <- function(x, ...) {
x <- insight::format_table(x)
x <- datawizard::data_rename(
x,
pattern = "SE_factor",
select = "SE_factor",
replacement = "Increased SE",
verbose = FALSE
)
@@ -514,7 +523,7 @@ check_collinearity.zerocount <- function(x,
if (!is.null(insight::find_interactions(x)) && any(result > 10) && isTRUE(verbose)) {
insight::format_alert(
"Model has interaction terms. VIFs might be inflated.",
"You may check multicollinearity among predictors of a model without interaction terms."
"Try to center the variables used for the interaction, or check multicollinearity among predictors of a model without interaction terms." # nolint
)
}

9 changes: 1 addition & 8 deletions R/check_model_diagnostics.R
@@ -10,16 +10,9 @@
dat$group[dat$VIF >= 5 & dat$VIF < 10] <- "moderate"
dat$group[dat$VIF >= 10] <- "high"

dat <- datawizard::data_rename(
dat,
c("Term", "VIF", "SE_factor", "Component"),
c("x", "y", "se", "facet"),
verbose = FALSE
)

dat <- datawizard::data_select(
dat,
c("x", "y", "facet", "group"),
select = c(x = "Term", y = "VIF", facet = "Component", group = "group"),
verbose = FALSE
)

74 changes: 38 additions & 36 deletions R/check_outliers.R
@@ -198,7 +198,8 @@
#' extreme values), this algorithm functions in a different manner and won't
#' always detect outliers. Note that `method = "optics"` requires the
#' **dbscan** package to be installed, and that it takes some time to compute
#' the results.
#' the results. Additionally, the `optics_xi` argument (which defaults to 0.05)
#' is passed to the [dbscan::extractXi()] function to further refine the cluster selection.
#'
#' - **Local Outlier Factor**:
#' Based on a K nearest neighbors algorithm, LOF compares the local density of
@@ -242,6 +243,7 @@
#' mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
#' ics = 0.001,
#' optics = 2 * ncol(x),
#' optics_xi = 0.05,
#' lof = 0.001
#' )
#' ```
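A hedged usage sketch of the refined OPTICS method (requires the **dbscan** package; the data and threshold values are illustrative, and a partial `threshold` list is assumed to be merged with the defaults shown above):

```r
library(performance)

# Default: optics_xi = 0.05 is forwarded to dbscan::extractXi()
check_outliers(iris[1:4], method = "optics")

# Custom minPts for OPTICS together with a stricter xi for cluster extraction
check_outliers(
  iris[1:4],
  method = "optics",
  threshold = list(optics = 10, optics_xi = 0.1)
)
```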
@@ -881,6 +883,13 @@ check_outliers.data.frame <- function(x,
} else if (is.numeric(threshold)) {
thresholds <- .check_outliers_thresholds(x)
thresholds <- lapply(thresholds, function(x) threshold)
# Fix this manually: method "optics" automatically includes "optics_xi",
# which must lie between 0 and 1. A user-supplied numeric threshold for
# "optics" can be > 1 and would otherwise overwrite "optics_xi" with an
# invalid value.
if (thresholds$optics_xi > 1) {
thresholds$optics_xi <- 0.05
}
} else {
insight::format_error(
paste(
@@ -890,7 +899,13 @@
)
}

thresholds <- thresholds[names(thresholds) %in% method]
# Keep only relevant threshold
valid <- method
if ("optics" %in% valid) {
valid <- c(valid, "optics_xi")
method <- c(method, "optics_xi")
}
thresholds <- thresholds[names(thresholds) %in% valid]

out.meta <- .check_outliers.data.frame_method(x, method, thresholds, ID, ID.names, ...)
out <- out.meta$out
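A hedged illustration of the situation the `optics_xi` guard above protects against (data and threshold are illustrative; requires **dbscan**):

```r
library(performance)

# A single numeric threshold is recycled across all requested methods; with
# method = "optics" it would also land in `optics_xi`, which must lie in
# (0, 1), so the guard falls back to the 0.05 default.
check_outliers(iris[1:4], method = "optics", threshold = 20)
```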
@@ -1207,7 +1222,8 @@ check_outliers.data.frame <- function(x,
out <- c(out, .check_outliers_optics(
x,
threshold = thresholds$optics,
ID.names = ID.names
ID.names = ID.names,
xi = thresholds$optics_xi
))

count.table <- datawizard::data_filter(
@@ -1506,38 +1522,23 @@ check_outliers.DHARMa <- check_outliers.performance_simres
}

.check_outliers_thresholds_nowarn <- function(x) {
zscore <- stats::qnorm(p = 1 - 0.001 / 2)
zscore_robust <- stats::qnorm(p = 1 - 0.001 / 2)
iqr <- 1.7
ci <- 1 - 0.001
eti <- 1 - 0.001
hdi <- 1 - 0.001
bci <- 1 - 0.001
cook <- stats::qf(0.5, ncol(x), nrow(x) - ncol(x))
pareto <- 0.7
mahalanobis_value <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
mahalanobis_robust <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
mcd <- stats::qchisq(p = 1 - 0.001, df = ncol(x))
ics <- 0.001
optics <- 2 * ncol(x)
lof <- 0.001

list(
zscore = zscore,
zscore_robust = zscore_robust,
iqr = iqr,
ci = ci,
hdi = hdi,
eti = eti,
bci = bci,
cook = cook,
pareto = pareto,
mahalanobis = mahalanobis_value,
mahalanobis_robust = mahalanobis_robust,
mcd = mcd,
ics = ics,
optics = optics,
lof = lof
zscore = stats::qnorm(p = 1 - 0.001 / 2),
zscore_robust = stats::qnorm(p = 1 - 0.001 / 2),
iqr = 1.7,
ci = 1 - 0.001,
hdi = 1 - 0.001,
eti = 1 - 0.001,
bci = 1 - 0.001,
cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
pareto = 0.7,
mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
mahalanobis_robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
ics = 0.001,
optics = 2 * ncol(x),
optics_xi = 0.05,
lof = 0.001
)
}

@@ -1929,7 +1930,8 @@ check_outliers.DHARMa <- check_outliers.performance_simres

.check_outliers_optics <- function(x,
threshold = NULL,
ID.names = NULL) {
ID.names = NULL,
xi = 0.05) {
out <- data.frame(Row = seq_len(nrow(x)))

if (!is.null(ID.names)) {
Expand All @@ -1940,7 +1942,7 @@ check_outliers.DHARMa <- check_outliers.performance_simres

# Compute
rez <- dbscan::optics(x, minPts = threshold)
rez <- dbscan::extractXi(rez, xi = 0.05) # TODO: find automatic way of setting xi
rez <- dbscan::extractXi(rez, xi = xi) # TODO: find automatic way of setting xi

out$Distance_OPTICS <- rez$coredist

16 changes: 13 additions & 3 deletions man/check_collinearity.Rd

Some generated files are not rendered by default.

4 changes: 3 additions & 1 deletion man/check_outliers.Rd

Some generated files are not rendered by default.

6 changes: 6 additions & 0 deletions tests/testthat/test-check_collinearity.R
@@ -23,6 +23,12 @@ test_that("check_collinearity, correct order in print", {
})


test_that("check_collinearity, interaction", {
m <- lm(mpg ~ wt * cyl, data = mtcars)
expect_message(check_collinearity(m), regex = "Model has interaction terms")
})


test_that("check_collinearity", {
skip_if_not_installed("glmmTMB")
skip_if_not(getRversion() >= "4.0.0")
4 changes: 3 additions & 1 deletion vignettes/check_model.Rmd
@@ -250,7 +250,7 @@ Our model clearly suffers from multicollinearity, as all predictors have high VI

### How to fix this?

Usually, predictors with (very) high VIF values should be removed from the model to fix multicollinearity. Some caution is needed for interaction terms. If interaction terms are included in a model, high VIF values are expected. This portion of multicollinearity among the component terms of an interaction is also called "inessential ill-conditioning", which leads to inflated VIF values that are typically seen for models with interaction terms _(Francoeur 2013)_. In such cases, re-fit your model without interaction terms and check this model for collinearity among predictors.
Usually, predictors with (very) high VIF values should be removed from the model to fix multicollinearity. Some caution is needed for interaction terms. If interaction terms are included in a model, high VIF values are expected. This portion of multicollinearity among the component terms of an interaction is also called "inessential ill-conditioning", which leads to inflated VIF values that are typically seen for models with interaction terms _(Francoeur 2013)_. In such cases, try centering the involved interaction terms, which can reduce multicollinearity _(Kim and Jung 2024)_, or re-fit your model without interaction terms and check this model for collinearity among predictors.
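
A brief sketch of the centering remedy described above (illustrative code, not one of the vignette's executed chunks; base R centering is used here, though `datawizard::center()` would work as well):

```r
library(performance)

d <- mtcars
d$wt_c  <- d$wt  - mean(d$wt)   # center the components of the interaction
d$cyl_c <- d$cyl - mean(d$cyl)

m_raw      <- lm(mpg ~ wt * cyl, data = d)      # inflated VIFs expected
m_centered <- lm(mpg ~ wt_c * cyl_c, data = d)  # collinearity should drop markedly

check_collinearity(m_raw)
check_collinearity(m_centered)
```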

## Normality of residuals

@@ -293,6 +293,8 @@ Gelman A, and Hill J. Data analysis using regression and multilevel/hierarchical

James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.). An introduction to statistical learning: with applications in R. New York: Springer, 2013

Kim, Y., & Jung, G. (2024). Understanding linear interaction analysis with causal graphs. British Journal of Mathematical and Statistical Psychology, 00, 1–14.

Leys C, Delacre M, Mora YL, Lakens D, Ley C. How to Classify, Detect, and Manage Univariate and Multivariate Outliers, With Emphasis on Pre-Registration. International Review of Social Psychology, 2019

McElreath, R. Statistical rethinking: A Bayesian course with examples in R and Stan. 2nd edition. Chapman and Hall/CRC, 2020
