
Conversation

Collaborator

@rahul-tuli rahul-tuli commented Aug 18, 2025

Recovery-Based Testing for LM-Eval

This PR implements recovery-based testing as the default validation mechanism for all lm-eval tests. Tests now compare compressed model performance against base model performance, making them robust to upstream changes while ensuring quantization quality.

Current Problem:

  • Tests fail when base models regress due to external changes (e.g., transformers updates, lm-eval changes)
  • False positives block CI even when quantization maintains expected recovery
  • Absolute thresholds become stale as models/libraries evolve

Example:
Qwen2.5-VL tests fail with transformers ≥ 4.54.0 due to ~10% base model accuracy drop, despite quantization maintaining the same relative performance.

Solution:
Recovery testing validates that compressed models retain ≥95% (configurable) of base model performance, regardless of absolute score changes.


🚀 New Behavior

Default Behavior (Zero Config Required)

All lm-eval tests now automatically:

  1. ✅ Evaluate the base (uncompressed) model
  2. ✅ Quantize the model using configured scheme
  3. ✅ Evaluate the compressed model
  4. ✅ Validate recovery ≥ 95% (default threshold)
  5. ✅ Show optional warnings for absolute metrics

Recovery Formula:

# For "higher is better" metrics (accuracy, F1, etc.)
recovery = compressed_score / base_score

# For "lower is better" metrics (perplexity, loss)
recovery = base_score / compressed_score  # Inverted!

# Validation
assert recovery >= threshold  # Default: 0.95

Recovery Interpretation:

  • 1.00 = Perfect (0% degradation)
  • 0.96 = 96% retained (4% degradation) ✅
  • 0.93 = 93% retained (7% degradation) ❌ (with default threshold)
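
A minimal runnable sketch of the recovery calculation above; the function name compute_recovery and its signature are illustrative, not the exact helper in tests/lmeval/test_lmeval.py:

def compute_recovery(base: float, compressed: float, higher_is_better: bool = True) -> float:
    """Fraction of base-model performance retained after compression."""
    if higher_is_better:
        return compressed / base
    # Lower-is-better metrics (perplexity, loss) invert the ratio
    return base / compressed

# Mirrors the interpretation above: ~96% retained passes, 93% does not
assert compute_recovery(0.7890, 0.7601) >= 0.95        # ~0.9634 -> pass
assert not (compute_recovery(0.8000, 0.7440) >= 0.95)  # 0.93    -> fail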

📝 Configuration Options

Option 1: Use Default (Recommended)

No configuration needed - uses 95% recovery threshold:

cadence: "weekly"
model: meta-llama/Meta-Llama-3-8B-Instruct
scheme: FP8_DYNAMIC
lmeval:
  # That's it! Uses recovery_threshold: 0.95 by default

Option 2: Override Global Threshold

Set a different threshold for all metrics:

lmeval:
  recovery_threshold: 0.93  # All metrics need ≥93% recovery

Option 3: Per-Metric Thresholds

Set different thresholds for different metrics:

lmeval:
  recovery_threshold:
    exact_match,flexible-extract: 0.95  # Strict threshold
    exact_match,strict-match: 0.90      # Relaxed threshold
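
A sketch of how the per-metric lookup might resolve a threshold (resolve_threshold is an illustrative name; the actual field parsing lives in tests/lmeval/test_lmeval.py):

from typing import Union

def resolve_threshold(
    recovery_threshold: Union[float, dict], metric: str, default: float = 0.95
) -> float:
    """Pick the threshold for one metric: per-metric dict entry, else fallback."""
    if isinstance(recovery_threshold, dict):
        return recovery_threshold.get(metric, default)
    return recovery_threshold

resolve_threshold({"exact_match,strict-match": 0.90}, "exact_match,strict-match")      # 0.90
resolve_threshold({"exact_match,strict-match": 0.90}, "exact_match,flexible-extract")  # 0.95 fallback
resolve_threshold(0.93, "exact_match,strict-match")                                    # 0.93 global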

Option 4: With Absolute Metric Warnings

Keep absolute metrics for informational warnings (not failures):

lmeval:
  recovery_threshold: 0.95  # Required - TEST FAILS if not met
  metrics:                   # Optional - warnings only, no failures
    exact_match,flexible-extract: 0.75
    exact_match,strict-match: 0.72
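
A sketch of the warning-only absolute check, assuming the ±5% relative tolerance described in the commits (check_absolute is a hypothetical name; the real method only logs, it never raises):

import logging

logger = logging.getLogger(__name__)

def check_absolute(metric: str, expected: float, got: float, rel_tol: float = 0.05) -> None:
    """Warn (never fail the test) when a higher-is-better metric falls outside expected - 5%."""
    if got < expected * (1 - rel_tol):
        logger.warning("%s | Expected: %.4f (±%.0f%%) | Got: %.4f | Below expected range",
                       metric, expected, rel_tol * 100, got)

# Matches the warning in the example output below; the test itself still passes
check_absolute("exact_match,strict-match", expected=0.8000, got=0.7262)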

Example Output

✅ Recovery Validation (Always Shown)

================================================================================
RECOVERY TESTING COMPARISON
================================================================================
✓ exact_match,flexible-extract   | Base: 0.7890 | Compressed: 0.7601 |
  Recovery: 96.34% ↑ | Threshold: ≥95.00%

✓ exact_match,strict-match       | Base: 0.7564 | Compressed: 0.7262 |
  Recovery: 96.01% ↑ | Threshold: ≥95.00%
================================================================================
✓ ALL METRICS PASSED RECOVERY THRESHOLDS
================================================================================

Absolute Metric Warnings (If Configured)

================================================================================
ABSOLUTE METRICS CHECK (warnings only, not failures)
================================================================================
✓ exact_match,flexible-extract   | Expected: 0.7500 (±5%) | Got: 0.7601 |
  Within expected range

⚠ exact_match,strict-match       | Expected: 0.8000 (±5%) | Got: 0.7262 |
  Below expected range
================================================================================

Note: The warning above doesn't fail the test - recovery validation already passed!


🔄 Migration Guide

Existing Configs with Absolute Metrics

Before (absolute thresholds cause failures):

lmeval:
  metrics:
    exact_match: 0.75  # TEST FAILS if not met

After (minimal - uses recovery testing):

lmeval:
  # Uses default recovery_threshold: 0.95
  # No other config needed!

After (keep warnings):

lmeval:
  # recovery_threshold: 0.95 is implicit (default)
  metrics:  # Now just warnings, won't fail tests
    exact_match: 0.75

No Breaking Changes

  • ✅ All existing configs continue to work
  • ✅ metrics dict now shows warnings instead of failing
  • ✅ Recovery testing automatically enabled with sensible default
  • ✅ Backward compatible with all test infrastructure

Implementation Details

Files Changed

  • tests/lmeval/test_lmeval.py (+151/-31 lines)
    • Added recovery_threshold config field (default: 0.95)
    • Made metrics field optional
    • Added _eval_base_model() method
    • Added _validate_recovery() method
    • Modified _check_absolute_warnings() to only warn, not fail
    • Updated test flow to always evaluate base model first
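
As a rough, self-contained illustration of the recovery gate that runs after both evaluations (numbers copied from the example output above; the function name run_recovery_gate is illustrative):

def run_recovery_gate(base_scores: dict, compressed_scores: dict, threshold: float = 0.95) -> None:
    """Compare compressed vs. base scores and fail if any metric falls short (simplified)."""
    for metric, base in base_scores.items():
        compressed = compressed_scores.get(metric)
        if compressed is None:
            continue  # metrics missing from the compressed results are skipped
        recovery = compressed / base  # higher-is-better metrics only, for brevity
        assert recovery >= threshold, f"{metric}: recovery {recovery:.2%} < {threshold:.2%}"

run_recovery_gate(
    {"exact_match,flexible-extract": 0.7890, "exact_match,strict-match": 0.7564},
    {"exact_match,flexible-extract": 0.7601, "exact_match,strict-match": 0.7262},
)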

Key Features

  1. Direction-Aware Recovery

    • Automatically detects "higher is better" vs "lower is better" metrics
    • Inverts ratio for perplexity-style metrics
  2. Edge Case Handling (see the sketch after this list)

    • Zero base values: recovery = 1.0 if compressed == 0 else 0.0
    • Missing metrics: Skipped gracefully
    • Metadata filtering: Skips stderr and alias keys
  3. Flexible Thresholds

    • Global float: recovery_threshold: 0.93
    • Per-metric dict: recovery_threshold: {metric1: 0.95, metric2: 0.90}
    • Fallback to 0.95 for unlisted metrics when using dict
  4. Comprehensive Logging

    • Recovery threshold displayed at test start
    • Detailed comparison table with base/compressed/recovery values
    • Clear pass/fail indicators with direction arrows (↑/↓)
    • Separate section for optional absolute warnings
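
Illustrative handling of the edge cases in item 2 (the helper names below are made up for this sketch, not the real implementation):

def is_metadata_key(key: str) -> bool:
    """Result keys that are not real metrics (stderr, alias) get skipped."""
    return "stderr" in key or key == "alias"

def safe_recovery(base: float, compressed: float) -> float:
    """Zero base scores cannot be divided; apply the rule stated above."""
    if base == 0:
        return 1.0 if compressed == 0 else 0.0
    return compressed / base

assert is_metadata_key("exact_match_stderr,strict-match")
assert safe_recovery(0.0, 0.0) == 1.0  # both zero: treated as full recovery
assert safe_recovery(0.0, 0.1) == 0.0  # nonzero compressed against a zero base: flagged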

Performance Impact

Additional Runtime:

  • Base model evaluation: ~2-10 minutes
  • Compressed model evaluation: ~2-10 minutes (unchanged)
  • Total: ~2x single evaluation time

Trade-off: Doubled evaluation time for robust, meaningful metrics that don't break from upstream changes.

Mitigation: Tests run on weekly cadence, making the additional time acceptable.


✅ Benefits

  • 🛡️ Robustness: tests never break from lm-eval or transformers updates
  • 📊 Meaningful: measures actual compression degradation, not arbitrary thresholds
  • 🎯 Automatic: works out of the box, no config needed
  • 🔧 Flexible: override the threshold globally or per-metric
  • ↔️ Compatible: zero breaking changes, existing configs work
  • 🧹 Simple: ~150 lines in a single file, no new dependencies

Testing

To test recovery-based validation:

# Uses default recovery threshold (0.95)
CADENCE=weekly TEST_DATA_FILE=tests/lmeval/configs/fp8_dynamic_per_token.yaml \
  pytest tests/lmeval/test_lmeval.py -v


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the LM-Eval testing framework by implementing recovery-based validation. Instead of relying on fixed performance thresholds, tests now dynamically compare the accuracy of quantized models against their original, unquantized base models. This strategic shift aims to make the test suite more robust and less susceptible to external regressions in base models, thereby reducing false positives in continuous integration pipelines. The change ensures that the quality of quantization is maintained relative to the base model's performance, providing a more reliable and adaptive testing environment.

Highlights

  • Recovery-Based Validation for LM-Eval Tests: LM-Eval tests now employ a recovery-based validation strategy, comparing the performance of quantized models directly against their base (unquantized) counterparts. This ensures that tests pass if the quantized model maintains a specified percentage (defaulting to 98%) of the base model's performance.
  • Improved Test Robustness and Reduced False Positives: This change significantly enhances test resilience against external factors, such as regressions in upstream base models (e.g., due to transformers library updates). By focusing on relative performance recovery, the system avoids false positives that previously blocked CI, streamlining the development workflow.
  • Backward Compatibility with Absolute Thresholds: The new validation mechanism is backward compatible. Existing test configurations that specify absolute performance thresholds will continue to function, but these thresholds will now serve as warnings rather than strict assertions, allowing for a smooth transition.
  • Minimal Performance Impact: While the new approach introduces an additional evaluation of the base model per test, adding approximately 2-3 minutes, the results are cached within the test session. This minimizes the overall performance impact relative to the time required for quantization.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust recovery-based validation system for LM-Eval tests, which is a great improvement for test stability against upstream model changes. The implementation is well-structured, with the evaluation logic nicely refactored into a reusable _evaluate_model function. I've identified a couple of areas for improvement: there's an inconsistency in the default recovery threshold value that should be aligned, and a bug in the recovery calculation when the base model's score is zero, which could lead to false negatives in tests. Addressing these points will make the new validation logic even more reliable.

@dsikka dsikka self-requested a review August 18, 2025 17:13
@rahul-tuli rahul-tuli force-pushed the feat/recovery-based-testing branch 2 times, most recently from bbc7069 to cfaf403 on October 1, 2025 09:52
Introduce recovery_threshold field with default value of 0.95,
making recovery-based testing the default behavior. Support both
global float thresholds and per-metric dict thresholds.

Make metrics field optional since it's now used only for warnings
rather than test validation.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Update class docstring to document recovery testing behavior
and configuration options. Add logging to display recovery
threshold and metrics configuration at test startup.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Implement _eval_base_model() to evaluate the uncompressed model
using lm_eval. This provides baseline metrics for recovery testing.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Modify test_lm_eval() to always call _eval_base_model() and
store results in self.base_results for recovery testing.

Update log messages to distinguish between base and compressed
model evaluation phases.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Implement _validate_recovery() to compare compressed model metrics
against base model metrics. Features:
- Computes recovery ratio (compressed/base or inverted for lower-is-better)
- Supports both global and per-metric thresholds
- Direction-aware handling (higher/lower is better)
- Handles edge cases (zero values, missing metrics)
- Fails test if any metric doesn't meet threshold

Call recovery validation from _run_lm_eval() as the primary
test validation mechanism.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Replace assertion-based metrics validation with warning-only checks.
The metrics dict is now optional and used for informational purposes
rather than test failures. This provides backward compatibility while
making recovery testing the primary validation mechanism.

Changes:
- Remove stderr-based tolerance logic (unused)
- Standardize on ±5% relative tolerance
- Log warnings instead of raising assertions
- Skip missing metrics gracefully
- Direction-aware warnings (higher vs lower is better)

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
@rahul-tuli rahul-tuli force-pushed the feat/recovery-based-testing branch from cfaf403 to 023ccf1 on October 1, 2025 09:53
@rahul-tuli rahul-tuli marked this pull request as ready for review October 1, 2025 10:05
@rahul-tuli rahul-tuli requested a review from shanjiaz October 1, 2025 10:06
@rahul-tuli rahul-tuli self-assigned this Oct 1, 2025
Collaborator

@dsikka dsikka left a comment


Thanks for adding this in. As a follow up, we should get rid of the exact match values altogether

@rahul-tuli
Collaborator Author

@gemini-code-assist review

@rahul-tuli rahul-tuli requested a review from Copilot October 1, 2025 10:45

@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements recovery-based testing as the default validation mechanism for LM-Eval tests, making them robust to upstream changes while ensuring quantization quality is maintained. Instead of relying on absolute thresholds that can become stale, tests now compare compressed model performance against base model performance.

  • Introduces recovery-based validation comparing compressed vs base model performance
  • Adds configurable recovery thresholds (default 95%) with per-metric override support
  • Converts absolute metrics to optional warnings instead of test failures


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement by implementing recovery-based testing for lm-eval, making the tests more robust to upstream changes. The overall implementation is well-structured, with clear logic and enhanced logging. I have identified one high-severity issue where model arguments could be incorrectly overwritten, potentially leading to invalid test results, and one medium-severity issue related to a hardcoded value that could be improved for better maintainability. My feedback focuses on making the new logic more robust and maintainable.

  NVFP4 scheme uses 4-bit weights and activations, resulting in
  lower recovery than FP8/INT8. Set per-metric thresholds based
  on observed values to allow ~89-92% recovery.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
dsikka previously approved these changes Oct 1, 2025
@dsikka dsikka added the ready label Oct 1, 2025
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Collaborator

@brian-dellabetta brian-dellabetta left a comment


I think these changes make sense, but we should prune the config yaml files to remove the keys we're no longer using

kylesayrs previously approved these changes Oct 1, 2025
@kylesayrs
Collaborator

Just nits, can address later if need be

Changes based on kylesayrs review:
- Remove is_dict variable, use isinstance() inline
- Replace hardcoded 0.95 with model_fields default value
- Add safety checks for higher_is_better_map existence
- Refactor to iterate over compressed_metrics instead of base_metrics
- Add warnings when config has keys not found in results
- Validate recovery_threshold dict keys against actual results

This eliminates code duplication, improves robustness, and ensures
configs are validated against actual test results.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@rahul-tuli rahul-tuli dismissed stale reviews from kylesayrs and brian-dellabetta via 0922265 October 1, 2025 16:04
Collaborator

@dsikka dsikka left a comment


Ok so my thought is that we should keep the exact match values (strict / flexible) as a reference in the config, but we can remove their use in the test.

They proved useful just now when we were considering the two cases that were < 95%.

Collaborator

@brian-dellabetta brian-dellabetta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! Looks like we can lower the threshold for the failing AWQ tests, after release

@dsikka dsikka merged commit cb8d775 into main Oct 1, 2025
8 checks passed
@dsikka dsikka deleted the feat/recovery-based-testing branch October 1, 2025 17:58