A simple evaluation pipeline for measuring the impact of Key-Value (KV) caching on the inference performance of a lightweight language model (GPT-2) running entirely on local CPUs. The evaluation compares runtime and memory usage across three input text lengths:
- Short (single token or word)
- Medium (paragraph)
- Long (multi-paragraph text)
```
kv-caching-inference-evaluation/
│-- evaluate.py          # Main entry point for the evaluation pipeline
│-- README.md            # Project overview and instructions
│-- src/
│   │-- models.py        # GPT-2 model and tokenizer setup
│   │-- dataloaders.py   # Input text loader (short, medium, long)
│   │-- benchmarks.py    # Benchmarking routines for runtime and memory
│   │-- res_handlers.py  # Logging and visualization of results
│-- outputs/
│   │-- metrics/         # Benchmark CSV outputs
│   │-- plots/           # Visualizations of runtime and memory
│-- tests/
    │-- dataloaders_test.py
    │-- models_test.py
```
```bash
# Clone this repository
git clone <repo_url>
cd kv-caching-inference-evaluation

# Install dependencies
pip install torch transformers pandas matplotlib

# Run evaluation
python evaluate.py
```

Running `evaluate.py`:
- Loads GPT-2 and its tokenizer
- Runs benchmarks for each input text
- Saves results to CSV (outputs/metrics/benchmarks.csv)
- Generates runtime and memory comparison plots (outputs/plots/)
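The exact benchmarking code in `benchmarks.py` is not shown here, but the timing step can be sketched with a small harness (the helper name `bench` and the repeat count are illustrative, not taken from the project):

```python
import statistics
import time

def bench(fn, repeats=5):
    """Time a zero-argument callable and return the median runtime in seconds.

    Taking the median over several repeats smooths out CPU scheduling noise,
    which matters for short runs on a local machine.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # in the real pipeline: a GPT-2 generate call, with or without caching
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Usage sketch: wrap one benchmark configuration in a lambda and time it.
runtime = bench(lambda: sum(i * i for i in range(10_000)))
print(f"median runtime: {runtime:.6f} s")
```

In the actual pipeline, `fn` would wrap one (input length, `use_cache` setting) combination so each configuration is timed under identical conditions.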
- Inputs: short, medium, long text examples
- Inference Modes:
- use_cache=False (no KV caching)
- use_cache=True (with KV caching)
- Metrics Collected:
- Inference runtime (seconds)
- Cache memory usage (MB)
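The measured cache memory can be cross-checked analytically. For GPT-2 small (12 layers, hidden size 768, float32), each layer caches one key and one value vector per token; a back-of-the-envelope estimator (the function name is mine, not from the project) is:

```python
def kv_cache_mb(seq_len, n_layers=12, hidden_size=768, bytes_per_elem=4):
    """Estimate KV cache size in MB for a GPT-2-style decoder.

    Per token, each layer stores one key and one value vector of length
    hidden_size, so total bytes = 2 * n_layers * seq_len * hidden_size * dtype size.
    Defaults match GPT-2 small in float32.
    """
    total_bytes = 2 * n_layers * seq_len * hidden_size * bytes_per_elem
    return total_bytes / (1024 ** 2)

print(kv_cache_mb(64))    # a short input: 4.5 MB
print(kv_cache_mb(1024))  # GPT-2's maximum context: 72.0 MB
```

This linear-in-`seq_len` estimate is what the memory measurements should track, up to framework overhead.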
The results show that enabling KV caching for a generative model significantly reduces latency: with caching, each decoding step computes attention only for the newest token against the cached keys and values of earlier tokens, instead of re-running the forward pass over the entire prefix. This brings the generation cost from roughly O(n²) down to near-linear in sequence length.
However, this efficiency comes with a classic space vs. time tradeoff: the key-value cache grows linearly with sequence length, and as shown below, memory usage increases substantially as the text length increases.
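The quadratic-vs-linear behaviour can be illustrated by counting key/value computations in a toy decoding loop (no model involved; this counting scheme is a deliberate simplification):

```python
def kv_computations(n_tokens, use_cache):
    """Count key/value projections performed while generating n_tokens.

    Without a cache, every step re-projects K/V for the whole prefix, so the
    total is 1 + 2 + ... + n = n(n + 1) / 2. With a cache, each token's K/V
    is computed exactly once, so the total is just n.
    """
    count = 0
    for step in range(1, n_tokens + 1):
        count += 1 if use_cache else step
    return count

for n in (10, 100, 1000):
    print(n, kv_computations(n, use_cache=False), kv_computations(n, use_cache=True))
```

At 1000 tokens the uncached loop performs 500,500 projections versus 1000 with caching, which is the gap the runtime plots should reflect.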

