Revise multicollinearity guidance in check_model vignette #871
Conversation
- Remove blanket recommendation to drop high-VIF predictors
- Add nuanced, research-backed guidance on when multicollinearity matters
- Distinguish between prediction and interpretation goals
- Add 6 academic references (2003–2021) supporting the new approach
- Improve prose clarity and structure
Summary of Changes

Hello @ANAMASGARD, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly revises the multicollinearity section within the check_model vignette.
Code Review
This pull request provides a much-needed and well-executed update to the guidance on multicollinearity in the check_model vignette. The new text correctly emphasizes that multicollinearity is primarily a problem for interpretation, not prediction, and wisely advises against the automatic removal of predictors with high VIFs. The inclusion of recent literature strengthens the recommendations.
However, I've noticed a significant inconsistency. The documentation for the check_collinearity() function itself (in R/check_collinearity.R) still contains the old, more prescriptive advice that has been removed from the vignette. This will lead to conflicting information for users depending on whether they read the vignette or the function's help page. To ensure consistency across the package, I strongly recommend updating the roxygen documentation in R/check_collinearity.R to align with the excellent new guidance in this PR.
I have one suggestion for the vignette text to further improve its clarity for beginners.
vignettes/check_model.Rmd
## Multicollinearity

Removed:

> This plot checks for potential collinearity among predictors. In a nutshell multicollinearity means that once you know the effect of one predictor, the value of knowing the other predictor is rather low. Multicollinearity might arise when a third, unobserved variable has a causal effect on each of the two predictors that are associated with the outcome. In such cases, the actual relationship that matters would be the association between the unobserved variable and the outcome.

Added:

> This plot checks for potential collinearity among predictors. Multicollinearity occurs when predictor variables are highly correlated with each other, conditional on the other variables in the model. This should not be confused with simple pairwise correlation between predictors; what matters is the association between predictors *after accounting for all other variables in the model*.
The revised definition of multicollinearity is much more precise, which is great. However, the previous version's intuitive "in a nutshell" explanation was more accessible to newcomers. Consider blending the intuitive and formal definitions to help readers who are new to this topic.
Suggested wording:

> This plot checks for potential collinearity among predictors. Multicollinearity occurs when predictor variables are highly correlated with each other, conditional on the other variables in the model. In other words, the information one predictor provides about the outcome is redundant in the presence of the other predictors. This should not be confused with simple pairwise correlation between predictors; what matters is the association between predictors *after accounting for all other variables in the model*.
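For readers unfamiliar with how the "conditional on the other variables" definition relates to VIF: the variance inflation factor for predictor *j* is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing that predictor on all the remaining predictors. Below is a minimal illustrative sketch (in Python with NumPy rather than R, with made-up variable names) showing that a predictor that is nearly redundant given the others gets a large VIF, while an independent predictor stays near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # nearly redundant given x1
x3 = rng.normal(size=n)                   # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2_j), R^2_j from regressing x_j on the rest."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

print(vif(X, 0), vif(X, 2))  # x1's VIF is large; x3's is close to 1
```

This mirrors the conditional definition in the revised text: x3 could be strongly *pairwise*-correlated with the outcome and still have a low VIF, because VIF only measures redundancy among the predictors themselves.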
Fixes #828
This PR updates the multicollinearity section based on @bbolker's feedback and recent research showing that automatically removing high VIF predictors isn't usually the right approach.
What changed:
Why:
The old advice was too prescriptive. Multicollinearity mostly affects coefficient precision, not bias, and whether it's a problem depends on your research goals. The vignette now reflects this nuance instead of suggesting a one-size-fits-all solution.
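The "precision, not bias" point can be checked with a small simulation (again a Python/NumPy sketch rather than R code; the true coefficients and the 0.95 correlation are arbitrary choices for illustration). Across repeated samples, the OLS estimate of a coefficient on a collinear predictor centers on the true value but scatters widely:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, true_b1 = 200, 500, 1.0
estimates = []
for _ in range(reps):
    x1 = rng.normal(size=n)
    # x2 correlates ~0.95 with x1, giving a VIF around 10 for each
    x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
    y = true_b1 * x1 + 1.0 * x2 + rng.normal(size=n)
    A = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    estimates.append(coef[1])

estimates = np.array(estimates)
# Unbiased (mean near the true 1.0) but imprecise (wide spread)
print(estimates.mean(), estimates.std())
```

Predictions from the full model are unaffected by this: the fitted values stay stable even when the individual coefficients trade off against each other, which is exactly the prediction-vs-interpretation distinction the revised vignette draws.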
Checked that the vignette builds without errors and the new text flows well with the existing content.