
@ANAMASGARD
Fixes #828

This PR updates the multicollinearity section based on @bbolker's feedback and recent research showing that automatically removing high VIF predictors isn't usually the right approach.

What changed:

  • Removed the recommendation to drop predictors with high VIF values
  • Added context about when multicollinearity actually matters (interpretation vs prediction)
  • Explained that removing variables can introduce omitted variable bias
  • Kept the existing interaction terms guidance intact
  • Added 6 references (Graham 2003, O'Brien 2007, Morrissey & Ruxton 2018, Feng et al. 2019, Gregorich et al. 2021, Vanhove 2021)

Why:

The old advice was too prescriptive. Multicollinearity mostly affects coefficient precision, not bias, and whether it's a problem depends on your research goals. The vignette now reflects this nuance instead of suggesting a one-size-fits-all solution.

Checked that the vignette builds without errors and the new text flows well with the existing content.
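The central statistical claim here, that multicollinearity inflates coefficient variance rather than biasing the estimates, can be illustrated with a small simulation sketch (not part of the PR; all names and numbers below are invented for illustration):

```python
# Minimal sketch: collinear predictors inflate the *variance* of OLS
# coefficient estimates without biasing them.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 500
beta = np.array([1.0, 2.0, -1.0])  # true coefficients: intercept, b1, b2

def simulate(rho):
    """Mean and spread of the OLS estimate of b1 when cor(x1, x2) = rho."""
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        y = X @ beta + rng.normal(size=n)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(b[1])
    estimates = np.array(estimates)
    return estimates.mean(), estimates.std()

mean_lo, sd_lo = simulate(rho=0.0)   # uncorrelated predictors
mean_hi, sd_hi = simulate(rho=0.95)  # strongly collinear predictors

# Both averages recover the true b1 = 2.0; only the spread differs.
print(f"rho=0.00: mean={mean_lo:.2f} sd={sd_lo:.3f}")
print(f"rho=0.95: mean={mean_hi:.2f} sd={sd_hi:.3f}")
```

Note that dropping `x2` in the collinear case would shrink the standard error of `b1`, but since `x2` is part of the data-generating process, its estimate would then be biased — exactly the omitted variable bias the revised vignette text warns about.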

- Remove blanket recommendation to drop high VIF predictors
- Add nuanced, research-backed guidance on when multicollinearity matters
- Distinguish between prediction vs. interpretation goals
- Add 6 academic references (2003-2021) supporting new approach
- Improve prose clarity and structure
@gemini-code-assist

Summary of Changes

Hello @ANAMASGARD, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly revises the multicollinearity section within the check_model vignette. The update moves away from a prescriptive approach of automatically removing high VIF predictors, instead offering a more nuanced understanding of multicollinearity's implications based on research goals (prediction vs. interpretation) and highlighting the risks of omitted variable bias. The aim is to provide more robust and context-aware guidance for users.

Highlights

  • Multicollinearity Guidance: Removed the recommendation to automatically drop predictors with high VIF values, advocating for a more nuanced approach.
  • Contextual Understanding: Added detailed context on when multicollinearity truly matters, differentiating its impact on model interpretation versus prediction.
  • Omitted Variable Bias: Explained the potential for introducing omitted variable bias if variables are removed solely based on high VIFs.
  • Interaction Terms: Maintained the existing guidance for handling interaction terms, acknowledging their inherent multicollinearity.
  • New References: Incorporated six new academic references to support the updated statistical guidance on multicollinearity.

@gemini-code-assist (bot) left a comment


Code Review

This pull request provides a much-needed and well-executed update to the guidance on multicollinearity in the check_model vignette. The new text correctly emphasizes that multicollinearity is primarily a problem for interpretation, not prediction, and wisely advises against the automatic removal of predictors with high VIFs. The inclusion of recent literature strengthens the recommendations.

However, I've noticed a significant inconsistency. The documentation for the check_collinearity() function itself (in R/check_collinearity.R) still contains the old, more prescriptive advice that has been removed from the vignette. This will lead to conflicting information for users depending on whether they read the vignette or the function's help page. To ensure consistency across the package, I strongly recommend updating the roxygen documentation in R/check_collinearity.R to align with the excellent new guidance in this PR.

I have one suggestion for the vignette text to further improve its clarity for beginners.

## Multicollinearity

Before:

> This plot checks for potential collinearity among predictors. In a nutshell, multicollinearity means that once you know the effect of one predictor, the value of knowing the other predictor is rather low. Multicollinearity might arise when a third, unobserved variable has a causal effect on each of the two predictors that are associated with the outcome. In such cases, the actual relationship that matters would be the association between the unobserved variable and the outcome.

After:

> This plot checks for potential collinearity among predictors. Multicollinearity occurs when predictor variables are highly correlated with each other, conditional on the other variables in the model. This should not be confused with simple pairwise correlation between predictors; what matters is the association between predictors *after accounting for all other variables in the model*.

Severity: medium

The revised definition of multicollinearity is much more precise, which is great. However, the previous version included an intuitive "in a nutshell" explanation that made the concept very accessible. Consider blending the intuitive and formal definitions to help readers who are new to this topic:

> This plot checks for potential collinearity among predictors. Multicollinearity occurs when predictor variables are highly correlated with each other, conditional on the other variables in the model. In other words, the information one predictor provides about the outcome is redundant in the presence of the other predictors. This should not be confused with simple pairwise correlation between predictors; what matters is the association between predictors *after accounting for all other variables in the model*.
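The distinction the suggested text draws — association conditional on the other predictors, rather than simple pairwise correlation — can be made concrete with a small numeric sketch (hypothetical data, not from the vignette). Here `x3` is nearly the sum of `x1` and `x2`, so no pairwise correlation is extreme, yet every VIF is large:

```python
# Sketch: VIF measures how well each predictor is explained by *all* the
# others, which is not the same as any single pairwise correlation.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)  # near-linear combination
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the remaining columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_squared = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r_squared)

vifs = [vif(X, j) for j in range(X.shape[1])]
max_pairwise = max(abs(np.corrcoef(X.T)[i, j])
                   for i in range(3) for j in range(3) if i != j)

print(f"largest pairwise |r|: {max_pairwise:.2f}")  # roughly 0.7
print("VIFs:", [round(v, 1) for v in vifs])         # all far above 10
```

The quantity computed by hand here is the same VIF that `check_collinearity()` reports per model term, which is what the vignette's plot visualizes.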


Development

Successfully merging this pull request may close these issues.

advice in multicollinearity section of vignette
