
KV Caching Inference Evaluation with GPT-2

Overview

A simple evaluation pipeline measuring the impact of Key-Value (KV) caching on the inference performance of a lightweight language model (GPT-2) running entirely on local CPUs. The evaluation compares runtime and memory usage across input texts of different lengths:

  • Short (single token or word)
  • Medium (paragraph)
  • Long (multi-paragraph text)
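A minimal sketch of inputs in these three length buckets (the texts and the `load_inputs` name are illustrative placeholders; the repo's `dataloaders.py` defines its own):

```python
def load_inputs():
    """Return example input texts in the three length buckets."""
    sentence = "The quick brown fox jumps over the lazy dog. "
    return {
        "short": "Hello",                          # single word
        "medium": sentence * 8,                    # roughly a paragraph
        "long": "\n\n".join([sentence * 8] * 6),   # multiple paragraphs
    }

if __name__ == "__main__":
    for name, text in load_inputs().items():
        print(f"{name}: {len(text.split())} words")
```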

Repository Structure

kv-caching-inference-evaluation/
│-- evaluate.py           # Main entry point for the evaluation pipeline
│-- README.md             # Project overview and instructions
│-- src/
│    │-- models.py        # GPT-2 model and tokenizer setup
│    │-- dataloaders.py   # Input text loader (short, medium, long)
│    │-- benchmarks.py    # Benchmarking routines for runtime and memory
│    │-- res_handlers.py  # Logging and visualization of results
│-- outputs/
│    │-- metrics/         # Benchmark CSV outputs
│    │-- plots/           # Visualizations of runtime and memory
│-- tests/
     │-- dataloaders_test.py
     │-- models_test.py

Setup Instructions

# Clone this repository
git clone <repo_url>
cd kv-caching-inference-evaluation

# Install dependencies
pip install torch transformers pandas matplotlib

# Run evaluation
python evaluate.py

The evaluation script:
  • Loads GPT-2 and its tokenizer
  • Runs benchmarks for each input text
  • Saves results to CSV (outputs/metrics/benchmarks.csv)
  • Generates runtime and memory comparison plots (outputs/plots/)

Benchmarking Methodology

  1. Inputs: short, medium, long text examples
  2. Inference Modes:
    • use_cache=False (no KV caching)
    • use_cache=True (with KV caching)
  3. Metrics Collected:
    • Inference runtime (seconds)
    • Cache memory usage (MB)

Results

The results show that KV caching significantly reduces generation latency. Without caching, every decoding step recomputes the attention keys and values for the entire sequence so far, giving roughly O(n²) work per step; with caching, those keys and values are stored once and reused, so each new token only attends over the cached entries, bringing per-step cost down to roughly O(n).

However, this efficiency comes with a classic space vs. time tradeoff: the key-value storage grows with input size, and as shown below, memory usage increases substantially as the text length increases.
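As a rough sanity check on that memory growth, the KV cache size for GPT-2 small can be estimated from its architecture (12 layers, 12 attention heads, head dimension 64), assuming one fp32 key and one fp32 value vector are cached per layer per token:

```python
def kv_cache_bytes(seq_len, n_layers=12, n_heads=12, head_dim=64, dtype_bytes=4):
    """Estimated KV cache size: 2 tensors (K and V) per layer,
    each of shape (n_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

if __name__ == "__main__":
    for n in (16, 256, 1024):
        print(f"{n:>5} tokens: {kv_cache_bytes(n) / 2**20:.1f} MB")
    # 1024 tokens → 72.0 MB in fp32
```

The estimate grows linearly in sequence length (about 72 KB per token for GPT-2 small in fp32), which matches the trend of increasing memory usage with longer inputs.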

Runtime Comparison

GPT-2 runtime comparison plot (generated in outputs/plots/)

Memory Comparison

GPT-2 memory comparison plot (generated in outputs/plots/)
