
Conversation

Collaborator

@rahul-tuli rahul-tuli commented Aug 18, 2025

Recovery-Based Testing for LM-Eval

This PR implements recovery-based testing as the default validation mechanism for all lm-eval tests. Tests now compare compressed model performance against base model performance, making them robust to upstream changes while ensuring quantization quality.

Current Problem:

  • Tests fail when base models regress due to external changes (e.g., transformers updates, lm-eval changes)
  • False positives block CI even when quantization maintains expected recovery
  • Absolute thresholds become stale as models/libraries evolve

Example:
Qwen2.5-VL tests fail with transformers ≥ 4.54.0 due to ~10% base model accuracy drop, despite quantization maintaining the same relative performance.

Solution:
Recovery testing validates that compressed models retain ≥95% (configurable) of base model performance, regardless of absolute score changes.


🚀 New Behavior

Default Behavior (Zero Config Required)

All lm-eval tests now automatically:

  1. ✅ Evaluate the base (uncompressed) model
  2. ✅ Quantize the model using configured scheme
  3. ✅ Evaluate the compressed model
  4. ✅ Validate recovery ≥ 95% (default threshold)
  5. ✅ Show optional warnings for absolute metrics

Recovery Formula:

# For "higher is better" metrics (accuracy, F1, etc.)
recovery = compressed_score / base_score

# For "lower is better" metrics (perplexity, loss)
recovery = base_score / compressed_score  # Inverted!

# Validation
assert recovery >= threshold  # Default: 0.95

Recovery Interpretation:

  • 1.00 = Perfect (0% degradation)
  • 0.96 = 96% retained (4% degradation) ✅
  • 0.93 = 93% retained (7% degradation) ❌ (with default threshold)
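
A minimal runnable sketch of the recovery calculation above; the function name compute_recovery and its signature are illustrative, not the exact helper in tests/lmeval/test_lmeval.py:

def compute_recovery(base: float, compressed: float, higher_is_better: bool = True) -> float:
    """Fraction of base-model performance retained after compression."""
    if higher_is_better:
        return compressed / base
    # Lower-is-better metrics (perplexity, loss) invert the ratio
    return base / compressed

# Mirrors the interpretation above: ~96% retained passes, 93% does not
assert compute_recovery(0.7890, 0.7601) >= 0.95        # ~0.9634 -> pass
assert not (compute_recovery(0.8000, 0.7440) >= 0.95)  # 0.93    -> fail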

📝 Configuration Options

Option 1: Use Default (Recommended)

No configuration needed - uses 95% recovery threshold:

cadence: "weekly"
model: meta-llama/Meta-Llama-3-8B-Instruct
scheme: FP8_DYNAMIC
lmeval:
  # That's it! Uses recovery_threshold: 0.95 by default

Option 2: Override Global Threshold

Set a different threshold for all metrics:

lmeval:
  recovery_threshold: 0.93  # All metrics need ≥93% recovery

Option 3: Per-Metric Thresholds

Set different thresholds for different metrics:

lmeval:
  recovery_threshold:
    exact_match,flexible-extract: 0.95  # Strict threshold
    exact_match,strict-match: 0.90      # Relaxed threshold
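
A sketch of how the per-metric lookup might resolve a threshold (resolve_threshold is an illustrative name; the actual field parsing lives in tests/lmeval/test_lmeval.py):

from typing import Union

def resolve_threshold(
    recovery_threshold: Union[float, dict], metric: str, default: float = 0.95
) -> float:
    """Pick the threshold for one metric: per-metric dict entry, else fallback."""
    if isinstance(recovery_threshold, dict):
        return recovery_threshold.get(metric, default)
    return recovery_threshold

resolve_threshold({"exact_match,strict-match": 0.90}, "exact_match,strict-match")      # 0.90
resolve_threshold({"exact_match,strict-match": 0.90}, "exact_match,flexible-extract")  # 0.95 fallback
resolve_threshold(0.93, "exact_match,strict-match")                                    # 0.93 global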

Option 4: With Absolute Metric Warnings

Keep absolute metrics for informational warnings (not failures):

lmeval:
  recovery_threshold: 0.95  # Required - TEST FAILS if not met
  metrics:                   # Optional - warnings only, no failures
    exact_match,flexible-extract: 0.75
    exact_match,strict-match: 0.72
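
A sketch of the warning-only absolute check, assuming the ±5% relative tolerance described in the commits (check_absolute is a hypothetical name; the real method only logs, it never raises):

import logging

logger = logging.getLogger(__name__)

def check_absolute(metric: str, expected: float, got: float, rel_tol: float = 0.05) -> None:
    """Warn (never fail the test) when a higher-is-better metric falls outside expected - 5%."""
    if got < expected * (1 - rel_tol):
        logger.warning("%s | Expected: %.4f (±%.0f%%) | Got: %.4f | Below expected range",
                       metric, expected, rel_tol * 100, got)

# Matches the warning in the example output below; the test itself still passes
check_absolute("exact_match,strict-match", expected=0.8000, got=0.7262)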

Example Output

✅ Recovery Validation (Always Shown)

================================================================================
RECOVERY TESTING COMPARISON
================================================================================
✓ exact_match,flexible-extract   | Base: 0.7890 | Compressed: 0.7601 |
  Recovery: 96.34% ↑ | Threshold: ≥95.00%

✓ exact_match,strict-match       | Base: 0.7564 | Compressed: 0.7262 |
  Recovery: 96.01% ↑ | Threshold: ≥95.00%
================================================================================
✓ ALL METRICS PASSED RECOVERY THRESHOLDS
================================================================================

Absolute Metric Warnings (If Configured)

================================================================================
ABSOLUTE METRICS CHECK (warnings only, not failures)
================================================================================
✓ exact_match,flexible-extract   | Expected: 0.7500 (±5%) | Got: 0.7601 |
  Within expected range

⚠ exact_match,strict-match       | Expected: 0.8000 (±5%) | Got: 0.7262 |
  Below expected range
================================================================================

Note: The warning above doesn't fail the test - recovery validation already passed!


🔄 Migration Guide

Existing Configs with Absolute Metrics

Before (absolute thresholds cause failures):

lmeval:
  metrics:
    exact_match: 0.75  # TEST FAILS if not met

After (minimal - uses recovery testing):

lmeval:
  # Uses default recovery_threshold: 0.95
  # No other config needed!

After (keep warnings):

lmeval:
  # recovery_threshold: 0.95 is implicit (default)
  metrics:  # Now just warnings, won't fail tests
    exact_match: 0.75

No Breaking Changes

  • ✅ All existing configs continue to work
  • ✅ metrics dict now shows warnings instead of failing
  • ✅ Recovery testing automatically enabled with sensible default
  • ✅ Backward compatible with all test infrastructure

Implementation Details

Files Changed

  • tests/lmeval/test_lmeval.py (+151/-31 lines)
    • Added recovery_threshold config field (default: 0.95)
    • Made metrics field optional
    • Added _eval_base_model() method
    • Added _validate_recovery() method
    • Modified _check_absolute_warnings() to only warn, not fail
    • Updated test flow to always evaluate base model first
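
As a rough, self-contained illustration of the recovery gate that runs after both evaluations (numbers copied from the example output above; the function name run_recovery_gate is illustrative):

def run_recovery_gate(base_scores: dict, compressed_scores: dict, threshold: float = 0.95) -> None:
    """Compare compressed vs. base scores and fail if any metric falls short (simplified)."""
    for metric, base in base_scores.items():
        compressed = compressed_scores.get(metric)
        if compressed is None:
            continue  # metrics missing from the compressed results are skipped
        recovery = compressed / base  # higher-is-better metrics only, for brevity
        assert recovery >= threshold, f"{metric}: recovery {recovery:.2%} < {threshold:.2%}"

run_recovery_gate(
    {"exact_match,flexible-extract": 0.7890, "exact_match,strict-match": 0.7564},
    {"exact_match,flexible-extract": 0.7601, "exact_match,strict-match": 0.7262},
)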

Key Features

  1. Direction-Aware Recovery

    • Automatically detects "higher is better" vs "lower is better" metrics
    • Inverts ratio for perplexity-style metrics
  2. Edge Case Handling (see the sketch after this list)

    • Zero base values: recovery = 1.0 if compressed == 0 else 0.0
    • Missing metrics: Skipped gracefully
    • Metadata filtering: Skips stderr and alias keys
  3. Flexible Thresholds

    • Global float: recovery_threshold: 0.93
    • Per-metric dict: recovery_threshold: {metric1: 0.95, metric2: 0.90}
    • Fallback to 0.95 for unlisted metrics when using dict
  4. Comprehensive Logging

    • Recovery threshold displayed at test start
    • Detailed comparison table with base/compressed/recovery values
    • Clear pass/fail indicators with direction arrows (↑/↓)
    • Separate section for optional absolute warnings
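
Illustrative handling of the edge cases in item 2 (the helper names below are made up for this sketch, not the real implementation):

def is_metadata_key(key: str) -> bool:
    """Result keys that are not real metrics (stderr, alias) get skipped."""
    return "stderr" in key or key == "alias"

def safe_recovery(base: float, compressed: float) -> float:
    """Zero base scores cannot be divided; apply the rule stated above."""
    if base == 0:
        return 1.0 if compressed == 0 else 0.0
    return compressed / base

assert is_metadata_key("exact_match_stderr,strict-match")
assert safe_recovery(0.0, 0.0) == 1.0  # both zero: treated as full recovery
assert safe_recovery(0.0, 0.1) == 0.0  # nonzero compressed against a zero base: flagged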

Performance Impact

Additional Runtime:

  • Base model evaluation: ~2-10 minutes
  • Compressed model evaluation: ~2-10 minutes (unchanged)
  • Total: ~2x single evaluation time

Trade-off: Doubled evaluation time for robust, meaningful metrics that don't break from upstream changes.

Mitigation: Tests run on weekly cadence, making the additional time acceptable.


✅ Benefits

  • 🛡️ Robustness: tests never break from lm-eval or transformers updates
  • 📊 Meaningful: measures actual compression degradation, not arbitrary thresholds
  • 🎯 Automatic: works out of the box, no config needed
  • 🔧 Flexible: override the threshold globally or per-metric
  • ↔️ Compatible: zero breaking changes, existing configs work
  • 🧹 Simple: ~150 lines in a single file, no new dependencies

Testing

To test recovery-based validation:

# Uses default recovery threshold (0.95)
CADENCE=weekly TEST_DATA_FILE=tests/lmeval/configs/fp8_dynamic_per_token.yaml \
  pytest tests/lmeval/test_lmeval.py -v


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the LM-Eval testing framework by implementing recovery-based validation. Instead of relying on fixed performance thresholds, tests now dynamically compare the accuracy of quantized models against their original, unquantized base models. This strategic shift aims to make the test suite more robust and less susceptible to external regressions in base models, thereby reducing false positives in continuous integration pipelines. The change ensures that the quality of quantization is maintained relative to the base model's performance, providing a more reliable and adaptive testing environment.

Highlights

  • Recovery-Based Validation for LM-Eval Tests: LM-Eval tests now employ a recovery-based validation strategy, comparing the performance of quantized models directly against their base (unquantized) counterparts. This ensures that tests pass if the quantized model maintains a specified percentage (defaulting to 98%) of the base model's performance.
  • Improved Test Robustness and Reduced False Positives: This change significantly enhances test resilience against external factors, such as regressions in upstream base models (e.g., due to transformers library updates). By focusing on relative performance recovery, the system avoids false positives that previously blocked CI, streamlining the development workflow.
  • Backward Compatibility with Absolute Thresholds: The new validation mechanism is backward compatible. Existing test configurations that specify absolute performance thresholds will continue to function, but these thresholds will now serve as warnings rather than strict assertions, allowing for a smooth transition.
  • Minimal Performance Impact: While the new approach introduces an additional evaluation of the base model per test, adding approximately 2-3 minutes, the results are cached within the test session. This minimizes the overall performance impact relative to the time required for quantization.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust recovery-based validation system for LM-Eval tests, which is a great improvement for test stability against upstream model changes. The implementation is well-structured, with the evaluation logic nicely refactored into a reusable _evaluate_model function. I've identified a couple of areas for improvement: there's an inconsistency in the default recovery threshold value that should be aligned, and a bug in the recovery calculation when the base model's score is zero, which could lead to false negatives in tests. Addressing these points will make the new validation logic even more reliable.

@dsikka dsikka self-requested a review August 18, 2025 17:13
@rahul-tuli rahul-tuli force-pushed the feat/recovery-based-testing branch 2 times, most recently from bbc7069 to cfaf403 on October 1, 2025 09:52
Introduce recovery_threshold field with default value of 0.95,
making recovery-based testing the default behavior. Support both
global float thresholds and per-metric dict thresholds.

Make metrics field optional since it's now used only for warnings
rather than test validation.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Update class docstring to document recovery testing behavior
and configuration options. Add logging to display recovery
threshold and metrics configuration at test startup.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Implement _eval_base_model() to evaluate the uncompressed model
using lm_eval. This provides baseline metrics for recovery testing.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Modify test_lm_eval() to always call _eval_base_model() and
store results in self.base_results for recovery testing.

Update log messages to distinguish between base and compressed
model evaluation phases.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Implement _validate_recovery() to compare compressed model metrics
against base model metrics. Features:
- Computes recovery ratio (compressed/base or inverted for lower-is-better)
- Supports both global and per-metric thresholds
- Direction-aware handling (higher/lower is better)
- Handles edge cases (zero values, missing metrics)
- Fails test if any metric doesn't meet threshold

Call recovery validation from _run_lm_eval() as the primary
test validation mechanism.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Replace assertion-based metrics validation with warning-only checks.
The metrics dict is now optional and used for informational purposes
rather than test failures. This provides backward compatibility while
making recovery testing the primary validation mechanism.

Changes:
- Remove stderr-based tolerance logic (unused)
- Standardize on ±5% relative tolerance
- Log warnings instead of raising assertions
- Skip missing metrics gracefully
- Direction-aware warnings (higher vs lower is better)

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
@rahul-tuli rahul-tuli force-pushed the feat/recovery-based-testing branch from cfaf403 to 023ccf1 on October 1, 2025 09:53
@rahul-tuli rahul-tuli marked this pull request as ready for review October 1, 2025 10:05
@rahul-tuli rahul-tuli requested a review from shanjiaz October 1, 2025 10:06
@rahul-tuli rahul-tuli self-assigned this Oct 1, 2025
Collaborator

@dsikka dsikka left a comment


Thanks for adding this in. As a follow up, we should get rid of the exact match values altogether

@rahul-tuli
Collaborator Author

@gemini-code-assist review

@rahul-tuli rahul-tuli requested a review from Copilot October 1, 2025 10:45

@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements recovery-based testing as the default validation mechanism for LM-Eval tests, making them robust to upstream changes while ensuring quantization quality is maintained. Instead of relying on absolute thresholds that can become stale, tests now compare compressed model performance against base model performance.

  • Introduces recovery-based validation comparing compressed vs base model performance
  • Adds configurable recovery thresholds (default 95%) with per-metric override support
  • Converts absolute metrics to optional warnings instead of test failures


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement by implementing recovery-based testing for lm-eval, making the tests more robust to upstream changes. The overall implementation is well-structured, with clear logic and enhanced logging. I have identified one high-severity issue where model arguments could be incorrectly overwritten, potentially leading to invalid test results, and one medium-severity issue related to a hardcoded value that could be improved for better maintainability. My feedback focuses on making the new logic more robust and maintainable.

  NVFP4 scheme uses 4-bit weights and activations, resulting in
  lower recovery than FP8/INT8. Set per-metric thresholds based
  on observed values to allow ~89-92% recovery.

Signed-off-by: rahul-tuli <rtuli@redhat.com>
dsikka previously approved these changes Oct 1, 2025
@dsikka dsikka added the ready label Oct 1, 2025
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Collaborator

@brian-dellabetta brian-dellabetta left a comment


I think these changes make sense, but we should prune the config yaml files to remove the keys we're no longer using

kylesayrs previously approved these changes Oct 1, 2025
@kylesayrs
Collaborator

Just nits, can address later if need be

Changes based on kylesayrs review:
- Remove is_dict variable, use isinstance() inline
- Replace hardcoded 0.95 with model_fields default value
- Add safety checks for higher_is_better_map existence
- Refactor to iterate over compressed_metrics instead of base_metrics
- Add warnings when config has keys not found in results
- Validate recovery_threshold dict keys against actual results

This eliminates code duplication, improves robustness, and ensures
configs are validated against actual test results.

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@rahul-tuli rahul-tuli dismissed stale reviews from kylesayrs and brian-dellabetta via 0922265 October 1, 2025 16:04
Collaborator

@dsikka dsikka left a comment


Ok so my thought is that we should keep the exact match values (strict / flexible) as a reference in the config, but we can remove their use in the test.

They proved useful just now when we were considering the two cases that were < 95%.

Collaborator

@brian-dellabetta brian-dellabetta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! Looks like we can lower the threshold for the failing AWQ tests, after release

@dsikka dsikka merged commit cb8d775 into main Oct 1, 2025
8 checks passed
@dsikka dsikka deleted the feat/recovery-based-testing branch October 1, 2025 17:58