feat: integrate KVPress for KV cache compression (#366) #623

Open
kschwethelm wants to merge 7 commits into PrunaAI:main from kschwethelm:feat/kvpress

Conversation

@kschwethelm

Description

Integrate KVPress into Pruna, making 20 KV cache compression strategies available for causal language models. KVPress compresses the key-value cache during the prefill phase, reducing memory usage for long-context inference.
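As a rough illustration of what score-based cache compression means here, the sketch below prunes a toy cache by keeping only its highest-scoring positions. It is a simplification under stated assumptions, not KVPress's implementation: `prune_kv_cache` is a hypothetical helper, the scores are passed in rather than derived from attention statistics, and the kept fraction is assumed to be `1 - compression_ratio`.

```python
from typing import List, Sequence, Tuple


def prune_kv_cache(
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    scores: Sequence[float],
    compression_ratio: float,
) -> Tuple[List[Sequence[float]], List[Sequence[float]]]:
    """Keep the highest-scoring cache positions (hypothetical helper).

    Assumption: a compression_ratio of r removes a fraction r of the
    positions, so int(seq_len * (1 - r)) entries survive.
    """
    if not 0.0 <= compression_ratio < 1.0:
        raise ValueError("compression_ratio must be in [0, 1)")
    if not (len(keys) == len(values) == len(scores)):
        raise ValueError("keys, values and scores must have the same length")

    seq_len = len(scores)
    n_kept = int(seq_len * (1.0 - compression_ratio))

    # Rank positions by score, highest first, then restore the original
    # order so the surviving entries stay in sequence order.
    ranked = sorted(range(seq_len), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_kept])

    return [keys[i] for i in kept], [values[i] for i in kept]


# Toy cache: 4 positions with 2-dimensional keys and values.
toy_keys = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
toy_values = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
toy_scores = [0.9, 0.1, 0.5, 0.7]

pruned_keys, pruned_values = prune_kv_cache(
    toy_keys, toy_values, toy_scores, compression_ratio=0.5
)
```

With a ratio of 0.5 the toy cache shrinks from four positions to two, which is where the memory saving for long contexts comes from. The presses differ mainly in how they compute the scores.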

Key implementation details:

  • New kvpress algorithm module following the PrunaAlgorithmBase pattern
  • Supports 20 scorer presses (ExpectedAttention, SnapKV, StreamingLLM, TOVA, KVzip, etc.)
  • Configurable compression_ratio and press_kwargs for press-specific parameters
  • New KV_CACHER algorithm tag for the cache compression category
  • Compatibility defined with quantization algorithms (before) and torch_compile (after)
  • Uses the reapply save strategy: the press is re-applied on model load
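The interplay between the press type, `compression_ratio`, and `press_kwargs` could look roughly like the sketch below. Everything in it is illustrative: the `Toy*` classes are stand-ins rather than the kvpress classes, their default values are made up, and `TOY_PRESS_TYPES` and `build_press` are hypothetical names, not the ones used in this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Type


@dataclass
class ToySnapKVPress:
    """Stand-in for a scorer press; defaults are placeholders."""

    compression_ratio: float = 0.0
    window_size: int = 64


@dataclass
class ToyStreamingLLMPress:
    """Stand-in for a scorer press; defaults are placeholders."""

    compression_ratio: float = 0.0
    n_sink: int = 4


# Hypothetical registry mapping a press-type string to its constructor.
TOY_PRESS_TYPES: Dict[str, Type[Any]] = {
    "snapkv": ToySnapKVPress,
    "streaming_llm": ToyStreamingLLMPress,
}


def build_press(
    press_type: str,
    compression_ratio: float,
    press_kwargs: Optional[Dict[str, Any]] = None,
) -> Any:
    """Validate the press type, then forward the extra kwargs unchanged."""
    if press_type not in TOY_PRESS_TYPES:
        raise ValueError(f"unknown press type: {press_type!r}")

    kwargs = dict(press_kwargs or {})
    if "compression_ratio" in kwargs:
        # Keep a single source of truth for the ratio.
        raise ValueError("set compression_ratio via its own hyperparameter")

    # An unsupported key surfaces as a TypeError from the constructor.
    return TOY_PRESS_TYPES[press_type](compression_ratio=compression_ratio, **kwargs)


press = build_press("snapkv", 0.5, {"window_size": 32})
```

Forwarding the dictionary as-is keeps the wrapper thin: a press-specific option such as `window_size` or `n_sink` needs no dedicated hyperparameter, and a misspelled key fails at construction time rather than being silently ignored.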

Excluded press types: Wrapper presses (ChunkPress, AdaKVPress, PerLayerCompressionPress, DMSPress, etc.) are not included in this initial integration. These require a nested ScorerPress instance as a constructor argument, which doesn't fit the current single-class design. Similarly, ThinKPress is excluded as it compresses along the channel dimension with a different parameter interface. These could be added in a follow-up if needed.

Some downstream evaluation results are available in the kschwethelm/pruna-kvpress-eval repo.

Related Issue

Fixes #366

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor (no functional change)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Testing

  • I added or updated tests covering my changes
  • Existing tests pass locally (uv run pytest -m "cpu and not slow")

Unit tests were added in tests/algorithms/test_kvpress.py, with a dedicated tester in tests/algorithms/testers/kvpress.py. Integration was evaluated in a separate repo; see the evaluation report.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code, especially for agent-assisted changes
  • I updated the documentation where necessary

Add NVIDIA KVPress as an optional dependency, enabling 31 KV cache
compression strategies for causal language models. Includes algorithm
class, test tester, and compatibility updates across existing LLM
algorithms.

kvpress 0.5.2 relaxes the datasets<3 constraint and reverts to
transformers>=4.56, resolving the dependency conflict. uv sync --extra
kvpress now works without workarounds.

Allow passing additional keyword arguments to the press constructor
via the press_kwargs hyperparameter, enabling fine-grained control
over press-specific settings like window_size, n_sink, etc.

- Replace tags.QUANTIZER with explicit LLM algorithm names to avoid
  false symmetry matches with diffuser algorithms
- Fix SmashConfig.add() dict flattening: only flatten when key is a
  registered algorithm name, not for dict-valued hyperparameters
- Remove wrapper/special presses from PRESS_TYPES (CriticalKVPress
  and others that don't accept compression_ratio directly)
- Add unit tests for press type validation and kwargs forwarding
- Add SnapKV integration test with press_kwargs

Add a new KV_CACHER algorithm tag for KV cache compression algorithms,
separate from CACHER (used by diffuser cachers). Use the tag in all
LLM algorithm compatibility lists instead of explicit "kvpress" strings.
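The SmashConfig.add() fix mentioned in the commit notes above can be pictured with the sketch below. It is not Pruna's code: `flatten_settings` and `REGISTERED_ALGORITHMS` are hypothetical, and the `<algorithm>_<hyperparameter>` naming of flattened keys is an assumption made for illustration. The point is only that a dict is expanded when its key names a registered algorithm, while a dict-valued hyperparameter such as press_kwargs is left intact.

```python
from typing import Any, Dict, Set

# Hypothetical registry contents; the real set of algorithm names lives in Pruna.
REGISTERED_ALGORITHMS: Set[str] = {"kvpress"}


def flatten_settings(
    settings: Dict[str, Any], algorithm_names: Set[str]
) -> Dict[str, Any]:
    """Expand per-algorithm blocks one level; keep other dict values whole."""
    flat: Dict[str, Any] = {}
    for key, value in settings.items():
        if key in algorithm_names and isinstance(value, dict):
            # Nested block for a registered algorithm: prefix each entry.
            for name, inner in value.items():
                flat[f"{key}_{name}"] = inner
        else:
            # Anything else, including dict-valued hyperparameters, is
            # stored as-is instead of being flattened further.
            flat[key] = value
    return flat


nested = {
    "kvpress": {
        "compression_ratio": 0.5,
        "press_kwargs": {"window_size": 32},
    }
}
flat = flatten_settings(nested, REGISTERED_ALGORITHMS)

# A hyperparameter that is itself a dict passes through untouched.
direct = flatten_settings(
    {"kvpress_press_kwargs": {"n_sink": 4}}, REGISTERED_ALGORITHMS
)
```

Flattening only one level, and only under registered names, is what keeps the inner `press_kwargs` dictionary from being split into separate entries.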
@codacy-production

Not up to standards ⛔

🔴 Issues 9 high · 5 minor

Alerts:
⚠ 14 issues (≤ 0 issues of at least minor severity)

Results:
14 new issues

Category Results
Documentation 5 minor
Security 9 high

🟢 Metrics 15 complexity · 0 duplication

Metric Results
Complexity 15
Duplication 0


Development

Successfully merging this pull request may close these issues.

[FEATURE] Integrate KVPress

1 participant