
Commit cb8d775

Authored by rahul-tuli, gemini-code-assist[bot], and dsikka
[Tests] Add recovery-based validation to LM-Eval tests (#1750)
# Recovery-Based Testing for LM-Eval

This PR implements **recovery-based testing** as the default validation mechanism for all lm-eval tests. Tests now compare compressed model performance against base model performance, making them robust to upstream changes while still ensuring quantization quality.

**Current Problem:**
- Tests fail when base models regress due to external changes (e.g., transformers updates, lm-eval changes)
- False positives block CI even when quantization maintains the expected recovery
- Absolute thresholds become stale as models/libraries evolve

**Example:** Qwen2.5-VL tests fail with transformers ≥ 4.54.0 due to a ~10% base-model accuracy drop, despite quantization maintaining the same relative performance.

**Solution:** Recovery testing validates that compressed models retain ≥95% (configurable) of base model performance, regardless of absolute score changes.

---

## 🚀 New Behavior

### Default Behavior (Zero Config Required)

All lm-eval tests now **automatically**:
1. ✅ Evaluate the base (uncompressed) model
2. ✅ Quantize the model using the configured scheme
3. ✅ Evaluate the compressed model
4. ✅ Validate recovery ≥ 95% (default threshold)
5. ✅ Show optional warnings for absolute metrics

**Recovery Formula:**
```python
# For "higher is better" metrics (accuracy, F1, etc.)
recovery = compressed_score / base_score

# For "lower is better" metrics (perplexity, loss)
recovery = base_score / compressed_score  # Inverted!

# Validation
assert recovery >= threshold  # Default: 0.95
```

**Recovery Interpretation:**
- `1.00` = Perfect (0% degradation)
- `0.96` = 96% retained (4% degradation) ✅
- `0.93` = 93% retained (7% degradation) ❌ (with default threshold)

---

## 📝 Configuration Options

### Option 1: Use Default (Recommended)
No configuration needed - uses the 95% recovery threshold:
```yaml
cadence: "weekly"
model: meta-llama/Meta-Llama-3-8B-Instruct
scheme: FP8_DYNAMIC
lmeval: # That's it! Uses recovery_threshold: 0.95 by default
```

### Option 2: Override Global Threshold
Set a different threshold for all metrics:
```yaml
lmeval:
  recovery_threshold: 0.93  # All metrics need ≥93% recovery
```

### Option 3: Per-Metric Thresholds
Set different thresholds for different metrics:
```yaml
lmeval:
  recovery_threshold:
    exact_match,flexible-extract: 0.95  # Strict threshold
    exact_match,strict-match: 0.90  # Relaxed threshold
```

### Option 4: With Absolute Metric Warnings
Keep absolute metrics for informational warnings (not failures):
```yaml
lmeval:
  recovery_threshold: 0.95  # Required - TEST FAILS if not met
  metrics:  # Optional - warnings only, no failures
    exact_match,flexible-extract: 0.75
    exact_match,strict-match: 0.72
```

---

## Example Output

### ✅ Recovery Validation (Always Shown)
```
================================================================================
RECOVERY TESTING COMPARISON
================================================================================
✓ exact_match,flexible-extract | Base: 0.7890 | Compressed: 0.7601 | Recovery: 96.34% ↑ | Threshold: ≥95.00%
✓ exact_match,strict-match     | Base: 0.7564 | Compressed: 0.7262 | Recovery: 96.01% ↑ | Threshold: ≥95.00%
================================================================================
✓ ALL METRICS PASSED RECOVERY THRESHOLDS
================================================================================
```

### Absolute Metric Warnings (If Configured)
```
================================================================================
ABSOLUTE METRICS CHECK (warnings only, not failures)
================================================================================
✓ exact_match,flexible-extract | Expected: 0.7500 (±5%) | Got: 0.7601 | Within expected range
⚠ exact_match,strict-match     | Expected: 0.8000 (±5%) | Got: 0.7262 | Below expected range
================================================================================
```
**Note:** The warning above doesn't fail the test - recovery validation already passed!

---

## 🔄 Migration Guide

### Existing Configs with Absolute Metrics

**Before (absolute thresholds cause failures):**
```yaml
lmeval:
  metrics:
    exact_match: 0.75  # TEST FAILS if not met
```

**After (minimal - uses recovery testing):**
```yaml
lmeval:
  # Uses default recovery_threshold: 0.95
  # No other config needed!
```

**After (keep warnings):**
```yaml
lmeval:
  # recovery_threshold: 0.95 is implicit (default)
  metrics:  # Now just warnings, won't fail tests
    exact_match: 0.75
```

### No Breaking Changes
- ✅ All existing configs continue to work
- ✅ `metrics` dict now shows warnings instead of failing
- ✅ Recovery testing automatically enabled with a sensible default
- ✅ Backward compatible with all test infrastructure

---

## Implementation Details

### Files Changed
- **`tests/lmeval/test_lmeval.py`** (+151/-31 lines)
  - Added `recovery_threshold` config field (default: 0.95)
  - Made `metrics` field optional
  - Added `_eval_base_model()` method
  - Added `_validate_recovery()` method
  - Modified `_check_absolute_warnings()` to only warn, not fail
  - Updated test flow to always evaluate the base model first

### Key Features
1. **Direction-Aware Recovery**
   - Automatically detects "higher is better" vs "lower is better" metrics
   - Inverts the ratio for perplexity-style metrics
2. **Edge Case Handling**
   - Zero base values: `recovery = 1.0 if compressed == 0 else 0.0`
   - Missing metrics: skipped gracefully
   - Metadata filtering: skips stderr and alias keys
3. **Flexible Thresholds**
   - Global float: `recovery_threshold: 0.93`
   - Per-metric dict: `recovery_threshold: {metric1: 0.95, metric2: 0.90}`
   - Fallback to 0.95 for unlisted metrics when using a dict
4. **Comprehensive Logging**
   - Recovery threshold displayed at test start
   - Detailed comparison table with base/compressed/recovery values
   - Clear pass/fail indicators with direction arrows (↑/↓)
   - Separate section for optional absolute warnings

---

## Performance Impact

**Additional Runtime:**
- Base model evaluation: ~2-10 minutes
- Compressed model evaluation: ~2-10 minutes (unchanged)
- **Total: ~2x single evaluation time**

**Trade-off:** Doubled evaluation time for robust, meaningful metrics that don't break from upstream changes.

**Mitigation:** Tests run on a weekly cadence, making the additional time acceptable.

---

## ✅ Benefits

| Benefit | Description |
|---------|-------------|
| 🛡️ **Robustness** | Tests don't break when lm-eval or transformers update |
| 📊 **Meaningful** | Measures actual compression degradation, not arbitrary thresholds |
| 🎯 **Automatic** | Works out of the box, no config needed |
| 🔧 **Flexible** | Override the threshold globally or per-metric |
| ↔️ **Compatible** | Zero breaking changes, existing configs work |
| 🧹 **Simple** | ~150 lines in a single file, no new dependencies |

---

## Testing

To test recovery-based validation:
```bash
# Uses default recovery threshold (0.95)
CADENCE=weekly TEST_DATA_FILE=tests/lmeval/configs/fp8_dynamic_per_token.yaml \
pytest tests/lmeval/test_lmeval.py -v
```

---

---------

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
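As a sanity check on the recovery formula in the description above, here is a minimal, self-contained sketch; the `compute_recovery` helper and the sample scores are illustrative, not code from this PR:

```python
# Hypothetical helper mirroring the direction-aware recovery formula above.
def compute_recovery(base: float, compressed: float, higher_is_better: bool = True) -> float:
    """Fraction of base-model performance the compressed model retains."""
    if base == 0:
        # Edge case noted in the PR: perfect recovery only if compressed is also 0
        return 1.0 if compressed == 0 else 0.0
    if higher_is_better:
        return compressed / base  # accuracy, F1, ...
    return base / compressed      # perplexity, loss: lower is better, so invert


# 0.7601 / 0.7890 ≈ 96.3% -> passes the default 0.95 threshold
assert compute_recovery(0.7890, 0.7601) >= 0.95
# Perplexity rising from 8.0 to 8.6 -> 8.0 / 8.6 ≈ 93.0% -> fails 0.95
assert compute_recovery(8.0, 8.6, higher_is_better=False) < 0.95
```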
1 parent d180851 · commit cb8d775

File tree

2 files changed: +182 -34 lines changed


tests/lmeval/configs/w4a4_nvfp4.yaml

Lines changed: 6 additions & 0 deletions
@@ -5,6 +5,12 @@ dataset_id: HuggingFaceH4/ultrachat_200k
 dataset_split: train_sft
 num_calibration_samples: 20
 lmeval:
+  # NVFP4 (4-bit weights + 4-bit activations) has lower recovery than FP8/INT8
+  # Observed: strict-match ~92.81%, flexible-extract ~89.59%
+  recovery_threshold:
+    exact_match,strict-match: 0.92
+    exact_match,flexible-extract: 0.89
+  # Absolute metrics for warnings only
   metrics:
     exact_match,flexible-extract: 0.70
     exact_match,strict-match: 0.65

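For readers cross-checking the config above: a short sketch of how a per-metric `recovery_threshold` mapping would resolve for each metric, assuming the fallback-to-default behavior described in the PR (the `resolve_threshold` helper is hypothetical):

```python
from typing import Union

DEFAULT_THRESHOLD = 0.95  # schema default for recovery_threshold


def resolve_threshold(recovery_threshold: Union[float, dict], metric_key: str) -> float:
    """Per-metric threshold if a dict is configured, else the global float."""
    if isinstance(recovery_threshold, dict):
        # Metrics not listed in the dict fall back to the 0.95 default
        return recovery_threshold.get(metric_key, DEFAULT_THRESHOLD)
    return recovery_threshold


# Mirrors the w4a4_nvfp4.yaml config above
config = {"exact_match,strict-match": 0.92, "exact_match,flexible-extract": 0.89}
assert resolve_threshold(config, "exact_match,strict-match") == 0.92
assert resolve_threshold(config, "other_metric") == DEFAULT_THRESHOLD  # fallback
assert resolve_threshold(0.93, "exact_match,strict-match") == 0.93  # global float
```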
tests/lmeval/test_lmeval.py

Lines changed: 176 additions & 34 deletions
@@ -2,6 +2,7 @@
 import random
 import shutil
 from pathlib import Path
+from typing import Optional, Union

 import numpy
 import pandas as pd
@@ -23,8 +24,12 @@ class LmEvalConfig(BaseModel):
     task: str = "gsm8k"
     num_fewshot: int = 5
     limit: int = 1000
-    metrics: dict
     batch_size: int = 100
+    # Recovery testing (default): compare against base model performance
+    # Default threshold is 0.95 (retain ≥95% of base), can be overridden
+    recovery_threshold: Union[float, dict] = 0.95
+    # Optional absolute metrics for warnings (not failures)
+    metrics: Optional[dict] = None


 try:
@@ -62,6 +67,16 @@ class TestLMEval:
     or another identifier which can be used for the particular test case. If a recipe
     is not provided, it is assumed that the scheme provided is a preset scheme and will
     be used for quantization. Otherwise, the recipe will always be used if given.
+
+    Recovery Testing (DEFAULT):
+    Tests now use recovery-based validation by default, comparing compressed model
+    performance against the base model. Default threshold is 0.95 (≥95% recovery).
+
+    Config options:
+    - recovery_threshold: 0.95 (default if not specified)
+    - recovery_threshold: 0.93 (override default globally)
+    - recovery_threshold: {"metric1": 0.95, "metric2": 0.90} (per-metric)
+    - metrics: {...} (optional - used for warnings only, not failures)
     """  # noqa: E501

     def set_up(self, test_data_file: str):
@@ -89,6 +104,11 @@ def set_up(self, test_data_file: str):

         logger.info("========== RUNNING ==============")
         logger.info(self.scheme)
+        logger.info(
+            f"Recovery threshold: {self.lmeval.recovery_threshold} (default: 0.95)"
+        )
+        if self.lmeval.metrics:
+            logger.info("Absolute metrics provided - will show warnings if outside ±5%")

         self.num_calibration_samples = eval_config.get("num_calibration_samples", 512)
         self.max_seq_length = 2048
@@ -97,6 +117,10 @@ def test_lm_eval(self, test_data_file: str):
         # Run vLLM with saved model
         self.set_up(test_data_file)

+        # Always evaluate base model for recovery testing
+        logger.info("================= Evaluating BASE model ======================")
+        self.base_results = self._eval_base_model()
+
         if not self.save_dir:
             self.save_dir = self.model.split("/")[1] + f"-{self.scheme}"
         oneshot_model, processor = run_oneshot_for_e2e_testing(
@@ -119,11 +143,28 @@ def test_lm_eval(self, test_data_file: str):
         # Reset session for next test case
         self._handle_recipe()

-        logger.info("================= Running LM Eval ======================")
+        logger.info("================= Running LM Eval on COMPRESSED model ==========")
         self._run_lm_eval()

         self.tear_down()

+    @log_time
+    def _eval_base_model(self):
+        """Evaluate the base (uncompressed) model."""
+        model_args = {**self.lmeval.model_args, "pretrained": self.model}
+
+        results = lm_eval.simple_evaluate(
+            model=self.lmeval.model,
+            model_args=model_args,
+            tasks=[self.lmeval.task],
+            num_fewshot=self.lmeval.num_fewshot,
+            limit=self.lmeval.limit,
+            device="cuda:0",
+            batch_size=self.lmeval.batch_size,
+        )
+
+        return results
+
     @log_time
     def _save_compressed_model(self, oneshot_model, processor):
         oneshot_model.save_pretrained(self.save_dir)
@@ -152,46 +193,147 @@ def _run_lm_eval(self):
             batch_size=self.lmeval.batch_size,
         )

+        # Always use recovery testing
+        self._validate_recovery(results)
+
+        # If absolute metrics provided, show warnings (not failures)
+        if self.lmeval.metrics:
+            self._check_absolute_warnings(results)
+
+    def _validate_recovery(self, compressed_results):
+        """Validate using recovery testing - compare against base model."""
+        base_metrics = self.base_results["results"][self.lmeval.task]
+        compressed_metrics = compressed_results["results"][self.lmeval.task]
+        higher_is_better_map = compressed_results.get("higher_is_better", {}).get(
+            self.lmeval.task, {}
+        )
+
+        logger.info("=" * 80)
+        logger.info("RECOVERY TESTING COMPARISON")
+        logger.info("=" * 80)
+
+        # Get default threshold from config schema
+        default_threshold = self.lmeval.model_fields["recovery_threshold"].default
+
+        failures = []
+        # Iterate over compressed metrics (what we actually got)
+        for metric_key, compressed_val in compressed_metrics.items():
+            # Skip stderr and other metadata
+            if "stderr" in metric_key or metric_key.startswith("alias"):
+                continue
+
+            base_val = base_metrics.get(metric_key)
+            if base_val is None:
+                logger.warning(
+                    f"Metric {metric_key} in compressed results "
+                    f"not found in base results, skipping"
+                )
+                continue
+
+            # Get threshold for this metric
+            if isinstance(self.lmeval.recovery_threshold, dict):
+                threshold = self.lmeval.recovery_threshold.get(
+                    metric_key, default_threshold
+                )
+            else:
+                threshold = self.lmeval.recovery_threshold

+            # Get direction
+            base_metric_name = metric_key.split(",")[0]
+            higher_is_better = higher_is_better_map.get(base_metric_name, True)
+
+            # Compute recovery
+            if base_val == 0:
+                recovery = 1.0 if compressed_val == 0 else 0.0
+            elif higher_is_better:
+                recovery = compressed_val / base_val
+            else:
+                # For "lower is better", invert ratio
+                recovery = base_val / compressed_val
+
+            # Check threshold
+            passed = recovery >= threshold
+            direction = "↑" if higher_is_better else "↓"
+
+            msg = (
+                f"{metric_key:40} | Base: {base_val:.4f} | "
+                f"Compressed: {compressed_val:.4f} | "
+                f"Recovery: {recovery:6.2%} {direction} | Threshold: ≥{threshold:.2%}"
+            )
+
+            if passed:
+                logger.info(f"✓ {msg}")
+            else:
+                logger.error(f"✗ {msg}")
+                failures.append(
+                    f"{metric_key}: {recovery:.2%} < {threshold:.2%} "
+                    f"(base={base_val:.4f}, compressed={compressed_val:.4f})"
+                )
+
+        # Validate that config thresholds match actual results
+        if isinstance(self.lmeval.recovery_threshold, dict):
+            for config_metric_key in self.lmeval.recovery_threshold.keys():
+                if config_metric_key not in compressed_metrics:
+                    logger.warning(
+                        f"Metric {config_metric_key} in recovery_threshold config "
+                        f"not found in results"
+                    )
+
+        logger.info("=" * 80)
+
+        if failures:
+            failure_msg = "\n".join(failures)
+            raise AssertionError(f"Recovery testing failed:\n{failure_msg}")
+
+        logger.info("✓ ALL METRICS PASSED RECOVERY THRESHOLDS")
+        logger.info("=" * 80)
+
+    def _check_absolute_warnings(self, results):
+        """Check absolute metrics and warn if outside ±5% tolerance (not a failure)."""
+        logger.info("=" * 80)
+        logger.info("ABSOLUTE METRICS CHECK (warnings only, not failures)")
+        logger.info("=" * 80)
+
         metrics: dict = results["results"][self.lmeval.task]
         for metric_key, expected_val in self.lmeval.metrics.items():
-            # stderr metrics are only used as absolute tolerance
-            # checks for actual values
+            # Skip stderr metrics
             if "stderr" in metric_key:
                 continue
+
             actual_val = metrics.get(metric_key)
-            higher_is_better = results["higher_is_better"][self.lmeval.task].get(
-                metric_key.split(",")[0], True
-            )
-            stderr_key = metric_key.replace(",", "_stderr,")
-            std_err = self.lmeval.metrics.get(stderr_key)
-
-            # If stderr is provided, use it as absolute tolerance
-            # Otherwise, default to a 5% relative tolerance
-            if std_err is None:
-                logger.info(
-                    f"Comparing {metric_key}: Expecting {expected_val} "
-                    f"relative tolerance ±5%, Got {actual_val}. "
-                    f"Higher is better: {higher_is_better}"
+            if actual_val is None:
+                logger.warning(
+                    f"Metric {metric_key} in config not found in results, "
+                    f"skipping warning check"
                 )
-                # If higher is better, assert actual val >= expected val * (1 - stderr)
-                if higher_is_better:
-                    assert actual_val >= expected_val * (0.95)
-                # If higher is worse, assert actual val <= expected val * (1 + stderr)
-                else:
-                    assert actual_val <= expected_val * (1.05)
+                continue
+
+            higher_is_better = (
+                results.get("higher_is_better", {})
+                .get(self.lmeval.task, {})
+                .get(metric_key.split(",")[0], True)
+            )

+            # Check if within ±5% relative tolerance
+            lower_bound = expected_val * 0.95
+            upper_bound = expected_val * 1.05
+
+            if higher_is_better:
+                # For higher is better, we care about lower bound
+                if actual_val < lower_bound:
+                    logger.warning(
+                        f"⚠ {metric_key:40} | Expected: {expected_val:.4f} (±5%) | "
+                        f"Got: {actual_val:.4f} | Below expected range"
+                    )
             else:
-                logger.info(
-                    f"Comparing {metric_key}: Expecting {expected_val} "
-                    f"absolute tolerance ±{std_err*100}%, Got {actual_val}. "
-                    f"Higher is better: {higher_is_better}"
-                )
-                # If higher is better, assert actual val >= expected val - stderr
-                if higher_is_better:
-                    assert actual_val >= expected_val - std_err
-                # If higher is worse, assert actual val <= expected val + stderr
-                else:
-                    assert actual_val <= expected_val + std_err
+                # For lower is better, we care about upper bound
+                if actual_val > upper_bound:
+                    logger.warning(
+                        f"⚠ {metric_key:40} | Expected: {expected_val:.4f} (±5%) | "
+                        f"Got: {actual_val:.4f} | Above expected range"
+                    )
+
+        logger.info("=" * 80)

     def tear_down(self):
         timer = get_singleton_manager()

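To see the validation flow from the diff end to end, here is a self-contained sketch that runs the same direction-aware recovery loop over fabricated results in lm-eval's results layout; the task name, scores, and threshold are illustrative only:

```python
# Fabricated base/compressed results in the layout _validate_recovery consumes.
base = {"results": {"gsm8k": {
    "exact_match,strict-match": 0.7564,
    "exact_match,flexible-extract": 0.7890,
    "alias": "gsm8k",
}}}
compressed = {
    "results": {"gsm8k": {
        "exact_match,strict-match": 0.7262,
        "exact_match,flexible-extract": 0.7601,
        "alias": "gsm8k",
    }},
    "higher_is_better": {"gsm8k": {"exact_match": True}},
}

task, threshold, failures = "gsm8k", 0.95, []
hib_map = compressed.get("higher_is_better", {}).get(task, {})
for key, comp_val in compressed["results"][task].items():
    if "stderr" in key or key.startswith("alias"):
        continue  # skip metadata keys, as _validate_recovery does
    base_val = base["results"][task].get(key)
    if base_val is None:
        continue  # metric missing from base results: skipped gracefully
    hib = hib_map.get(key.split(",")[0], True)
    recovery = comp_val / base_val if hib else base_val / comp_val
    if recovery < threshold:
        failures.append(f"{key}: {recovery:.2%} < {threshold:.2%}")

assert not failures  # both metrics here recover ≥ 95% of the base scores
```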