[Tests] Add recovery-based validation to LM-Eval tests #1750
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant enhancement to the LM-Eval testing framework by implementing recovery-based validation. Instead of relying on fixed performance thresholds, tests now dynamically compare the accuracy of quantized models against their original, unquantized base models. This strategic shift aims to make the test suite more robust and less susceptible to external regressions in base models, thereby reducing false positives in continuous integration pipelines. The change ensures that the quality of quantization is maintained relative to the base model's performance, providing a more reliable and adaptive testing environment.
Highlights
- Recovery-Based Validation for LM-Eval Tests: LM-Eval tests now employ a recovery-based validation strategy, comparing the performance of quantized models directly against their base (unquantized) counterparts. This ensures that tests pass if the quantized model maintains a specified percentage (defaulting to 98%) of the base model's performance.
- Improved Test Robustness and Reduced False Positives: This change significantly enhances test resilience against external factors, such as regressions in upstream base models (e.g., due to `transformers` library updates). By focusing on relative performance recovery, the system avoids false positives that previously blocked CI, streamlining the development workflow.
- Backward Compatibility with Absolute Thresholds: The new validation mechanism is backward compatible. Existing test configurations that specify absolute performance thresholds will continue to function, but these thresholds will now serve as warnings rather than strict assertions, allowing for a smooth transition.
- Minimal Performance Impact: While the new approach introduces an additional evaluation of the base model per test, adding approximately 2-3 minutes, the results are cached within the test session. This minimizes the overall performance impact relative to the time required for quantization.
Code Review
This pull request introduces a robust recovery-based validation system for LM-Eval tests, which is a great improvement for test stability against upstream model changes. The implementation is well-structured, with the evaluation logic nicely refactored into a reusable `_evaluate_model` function. I've identified a couple of areas for improvement: there's an inconsistency in the default recovery threshold value that should be aligned, and a bug in the recovery calculation when the base model's score is zero, which could lead to false negatives in tests. Addressing these points will make the new validation logic even more reliable.
Force-pushed from bbc7069 to cfaf403
Introduce recovery_threshold field with default value of 0.95, making recovery-based testing the default behavior. Support both global float thresholds and per-metric dict thresholds. Make metrics field optional since it's now used only for warnings rather than test validation. Signed-off-by: Rahul Tuli <rtuli@redhat.com> Signed-off-by: rahul-tuli <rtuli@redhat.com>
Update class docstring to document recovery testing behavior and configuration options. Add logging to display recovery threshold and metrics configuration at test startup. Signed-off-by: Rahul Tuli <rtuli@redhat.com> Signed-off-by: rahul-tuli <rtuli@redhat.com>
Implement _eval_base_model() to evaluate the uncompressed model using lm_eval. This provides baseline metrics for recovery testing. Signed-off-by: Rahul Tuli <rtuli@redhat.com> Signed-off-by: rahul-tuli <rtuli@redhat.com>
Modify test_lm_eval() to always call _eval_base_model() and store results in self.base_results for recovery testing. Update log messages to distinguish between base and compressed model evaluation phases. Signed-off-by: Rahul Tuli <rtuli@redhat.com> Signed-off-by: rahul-tuli <rtuli@redhat.com>
Implement _validate_recovery() to compare compressed model metrics against base model metrics. Features:
- Computes recovery ratio (compressed/base, or inverted for lower-is-better)
- Supports both global and per-metric thresholds
- Direction-aware handling (higher/lower is better)
- Handles edge cases (zero values, missing metrics)
- Fails the test if any metric doesn't meet its threshold
Call recovery validation from _run_lm_eval() as the primary test validation mechanism. Signed-off-by: Rahul Tuli <rtuli@redhat.com> Signed-off-by: rahul-tuli <rtuli@redhat.com>
Replace assertion-based metrics validation with warning-only checks. The metrics dict is now optional and used for informational purposes rather than test failures. This provides backward compatibility while making recovery testing the primary validation mechanism. Changes:
- Remove stderr-based tolerance logic (unused)
- Standardize on ±5% relative tolerance
- Log warnings instead of raising assertions
- Skip missing metrics gracefully
- Direction-aware warnings (higher vs lower is better)
Signed-off-by: Rahul Tuli <rtuli@redhat.com> Signed-off-by: rahul-tuli <rtuli@redhat.com>
Force-pushed from cfaf403 to 023ccf1
Thanks for adding this in. As a follow-up, we should get rid of the exact match values altogether.
@gemini-code-assist review
Pull Request Overview
This PR implements recovery-based testing as the default validation mechanism for LM-Eval tests, making them robust to upstream changes while ensuring quantization quality is maintained. Instead of relying on absolute thresholds that can become stale, tests now compare compressed model performance against base model performance.
- Introduces recovery-based validation comparing compressed vs base model performance
- Adds configurable recovery thresholds (default 95%) with per-metric override support
- Converts absolute metrics to optional warnings instead of test failures
Code Review
This pull request introduces a significant improvement by implementing recovery-based testing for lm-eval, making the tests more robust to upstream changes. The overall implementation is well-structured, with clear logic and enhanced logging. I have identified one high-severity issue where model arguments could be incorrectly overwritten, potentially leading to invalid test results, and one medium-severity issue related to a hardcoded value that could be improved for better maintainability. My feedback focuses on making the new logic more robust and maintainable.
NVFP4 scheme uses 4-bit weights and activations, resulting in lower recovery than FP8/INT8. Set per-metric thresholds based on observed values to allow ~89-92% recovery. Signed-off-by: rahul-tuli <rtuli@redhat.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
I think these changes make sense, but we should prune the config yaml files to remove the keys we're no longer using
Just nits, can address later if need be.
Changes based on kylesayrs review:
- Remove is_dict variable, use isinstance() inline
- Replace hardcoded 0.95 with model_fields default value
- Add safety checks for higher_is_better_map existence
- Refactor to iterate over compressed_metrics instead of base_metrics
- Add warnings when config has keys not found in results
- Validate recovery_threshold dict keys against actual results
This eliminates code duplication, improves robustness, and ensures configs are validated against actual test results. Signed-off-by: Rahul Tuli <rtuli@redhat.com>
0922265
Ok, so my thought is that we should keep the exact match values (strict / flexible) as a reference in the config, but we can remove their use in the test.
They proved useful just now when we were considering the two cases that were < 95%.
Good stuff! Looks like we can lower the threshold for the failing AWQ tests, after release
Recovery-Based Testing for LM-Eval
This PR implements recovery-based testing as the default validation mechanism for all lm-eval tests. Tests now compare compressed model performance against base model performance, making them robust to upstream changes while ensuring quantization quality.
Current Problem: Absolute metric thresholds go stale when upstream base models change, so tests fail even though quantization quality is unchanged.
Example:
Qwen2.5-VL tests fail with transformers ≥ 4.54.0 due to a ~10% base model accuracy drop, despite quantization maintaining the same relative performance.
Solution:
Recovery testing validates that compressed models retain ≥95% (configurable) of base model performance, regardless of absolute score changes.
🚀 New Behavior
Default Behavior (Zero Config Required)
All lm-eval tests now automatically:
- Evaluate the base (uncompressed) model to establish baseline metrics
- Evaluate the compressed model as before
- Compute per-metric recovery of the compressed model relative to the base model
- Fail if any metric's recovery falls below the configured threshold (default 0.95)
Recovery Formula: recovery = compressed_score / base_score for higher-is-better metrics, and base_score / compressed_score for lower-is-better metrics.
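To make the formula concrete, here is a minimal, hypothetical sketch (not the PR's actual code) of how recovery can be computed for a single metric:

```python
def recovery_ratio(base: float, compressed: float, higher_is_better: bool = True) -> float:
    """Fraction of the base model's performance retained by the compressed model."""
    if higher_is_better:
        if base == 0:
            # A zero baseline counts as full recovery only if the compressed score is also zero.
            return 1.0 if compressed == 0 else 0.0
        return compressed / base
    # Lower-is-better metrics: invert the ratio so that >= 1.0 still means "no degradation".
    if compressed == 0:
        return 1.0  # the compressed model is at least as good as the base
    return base / compressed
```

For example, a compressed exact-match score of 0.72 against a base score of 0.75 yields a recovery of 0.96, which passes the default 0.95 threshold regardless of how the absolute scores drift over time.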
Recovery Interpretation:
- `1.00` = Perfect (0% degradation)
- `0.96` = 96% retained (4% degradation) ✅
- `0.93` = 93% retained (7% degradation) ❌ (with default threshold)
📝 Configuration Options
Option 1: Use Default (Recommended)
No configuration needed - the default 95% recovery threshold (`recovery_threshold: 0.95`) is used.
Option 2: Override Global Threshold
Set a different threshold for all metrics with a single value, e.g. `recovery_threshold: 0.93`.
Option 3: Per-Metric Thresholds
Set different thresholds for different metrics with a mapping, e.g. `recovery_threshold: {metric1: 0.95, metric2: 0.90}`.
Option 4: With Absolute Metric Warnings
Keep the absolute `metrics` dict in the config for informational warnings (not failures).
Example Output
✅ Recovery Validation (Always Shown)
Absolute Metric Warnings (If Configured)
Note: The warning above doesn't fail the test - recovery validation already passed!
🔄 Migration Guide
Existing Configs with Absolute Metrics
Before (absolute thresholds cause failures):
After (minimal - uses recovery testing):
After (keep warnings):
No Breaking Changes
- The `metrics` dict now shows warnings instead of failing
Implementation Details
Files Changed
- `tests/lmeval/test_lmeval.py` (+151/-31 lines)
- New `recovery_threshold` config field (default: 0.95)
- `metrics` field is now optional
- New `_eval_base_model()` method
- New `_validate_recovery()` method
- `_check_absolute_warnings()` now only warns, not fails (sketched below)
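For the warning-only absolute checks, a rough sketch of the idea (a standalone stand-in for `_check_absolute_warnings()`; direction-aware handling is omitted for brevity, and all names here are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def check_absolute_warnings(
    results: dict[str, float],
    expected: dict[str, float],
    rel_tol: float = 0.05,  # ±5% relative tolerance
) -> None:
    """Log warnings, never assert, when a metric drifts from its configured reference."""
    for name, reference in expected.items():
        if name not in results:
            logger.warning("Metric %s not found in results; skipping", name)
            continue
        actual = results[name]
        if reference != 0 and abs(actual - reference) / abs(reference) > rel_tol:
            logger.warning(
                "Metric %s = %.4f deviates more than %.0f%% from reference %.4f (informational only)",
                name, actual, rel_tol * 100, reference,
            )
```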
Key Features
- Direction-Aware Recovery: recovery is compressed / base for higher-is-better metrics and base / compressed for lower-is-better metrics.
- Edge Case Handling: when the base metric is 0, `recovery = 1.0 if compressed == 0 else 0.0`; metrics missing from the results are skipped with a warning.
- Flexible Thresholds: either a single global value (`recovery_threshold: 0.93`) or per-metric values (`recovery_threshold: {metric1: 0.95, metric2: 0.90}`); see the validation sketch after this list.
- Comprehensive Logging: the recovery threshold and metrics configuration are logged at test startup, and per-metric recovery results are reported in the test output.
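Putting these features together, a minimal sketch of the recovery validation loop (illustrative only; it reuses the `recovery_ratio` helper sketched earlier, and the fallback default of 0.95 for metrics missing from a threshold dict is an assumption):

```python
# Assumes recovery_ratio() from the earlier sketch is in scope.
def validate_recovery(
    base: dict[str, float],
    compressed: dict[str, float],
    recovery_threshold: float | dict[str, float] = 0.95,
    higher_is_better: dict[str, bool] | None = None,
) -> None:
    """Fail if any metric's recovery falls below its threshold."""
    failures = []
    for name, compressed_score in compressed.items():
        if name not in base:
            continue  # nothing to compare against
        # Resolve the threshold: a global float applies to every metric,
        # a dict supplies per-metric overrides.
        if isinstance(recovery_threshold, dict):
            threshold = recovery_threshold.get(name, 0.95)
        else:
            threshold = recovery_threshold
        direction = (higher_is_better or {}).get(name, True)
        recovery = recovery_ratio(base[name], compressed_score, direction)
        if recovery < threshold:
            failures.append(f"{name}: recovery {recovery:.3f} < threshold {threshold:.3f}")
    assert not failures, "Recovery validation failed:\n" + "\n".join(failures)
```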
Performance Impact
Additional Runtime: roughly 2-3 minutes per test for the extra base-model evaluation; results are cached within the test session.
Trade-off: Doubled evaluation time for robust, meaningful metrics that don't break from upstream changes.
Mitigation: Tests run on weekly cadence, making the additional time acceptable.
✅ Benefits
Testing
To test recovery-based validation:
# Uses default recovery threshold (0.95)
CADENCE=weekly TEST_DATA_FILE=tests/lmeval/configs/fp8_dynamic_per_token.yaml \
  pytest tests/lmeval/test_lmeval.py -v