A simple evaluation pipeline for measuring the impact of Key-Value (KV) caching on the inference performance of a lightweight language model (GPT-2) running entirely on local CPUs. The evaluation compares runtime and memory usage across three input text lengths:
- Short (single token or word)
- Medium (paragraph)
- Long (multi-paragraph text)
```
kv-caching-inference-evaluation/
│-- evaluate.py          # Main entry point for the evaluation pipeline
│-- README.md            # Project overview and instructions
│-- src/
│   │-- models.py        # GPT-2 model and tokenizer setup
│   │-- dataloaders.py   # Input text loader (short, medium, long)
│   │-- benchmarks.py    # Benchmarking routines for runtime and memory
│   │-- res_handlers.py  # Logging and visualization of results
│-- outputs/
│   │-- metrics/         # Benchmark CSV outputs
│   │-- plots/           # Visualizations of runtime and memory
│-- tests/
    │-- dataloaders_test.py
    │-- models_test.py
```
```bash
# Clone this repository
git clone <repo_url>
cd kv-caching-inference-evaluation

# Install dependencies
pip install torch transformers pandas matplotlib

# Run evaluation
python evaluate.py
```

Running `evaluate.py`:
- Loads GPT-2 and its tokenizer
- Runs benchmarks for each input text
- Saves results to CSV (outputs/metrics/benchmarks.csv)
- Generates runtime and memory comparison plots (outputs/plots/)
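The exact benchmarking code in `benchmarks.py` is not shown here, but the timing step can be sketched with a small harness (the helper name `bench` and the repeat count are illustrative, not taken from the project):

```python
import statistics
import time

def bench(fn, repeats=5):
    """Time a zero-argument callable and return the median runtime in seconds.

    Taking the median over several repeats smooths out CPU scheduling noise,
    which matters for short runs on a local machine.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # in the real pipeline: a GPT-2 generate call, with or without caching
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Usage sketch: wrap one benchmark configuration in a lambda and time it.
runtime = bench(lambda: sum(i * i for i in range(10_000)))
print(f"median runtime: {runtime:.6f} s")
```

In the actual pipeline, `fn` would wrap one (input length, `use_cache` setting) combination so each configuration is timed under identical conditions.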
- Inputs: short, medium, long text examples
- Inference Modes:
- use_cache=False (no KV caching)
- use_cache=True (with KV caching)
- Metrics Collected:
- Inference runtime (seconds)
- Cache memory usage (MB)
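The measured cache memory can be cross-checked analytically. For GPT-2 small (12 layers, hidden size 768, float32), each layer caches one key and one value vector per token; a back-of-the-envelope estimator (the function name is mine, not from the project) is:

```python
def kv_cache_mb(seq_len, n_layers=12, hidden_size=768, bytes_per_elem=4):
    """Estimate KV cache size in MB for a GPT-2-style decoder.

    Per token, each layer stores one key and one value vector of length
    hidden_size, so total bytes = 2 * n_layers * seq_len * hidden_size * dtype size.
    Defaults match GPT-2 small in float32.
    """
    total_bytes = 2 * n_layers * seq_len * hidden_size * bytes_per_elem
    return total_bytes / (1024 ** 2)

print(kv_cache_mb(64))    # a short input: 4.5 MB
print(kv_cache_mb(1024))  # GPT-2's maximum context: 72.0 MB
```

This linear-in-`seq_len` estimate is what the memory measurements should track, up to framework overhead.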
The results show that enabling KV caching for a generative model significantly reduces latency: with caching, each decoding step computes attention only for the newest token against the cached keys and values of earlier tokens, instead of re-running the forward pass over the entire prefix. This brings the generation cost from roughly O(n²) down to near-linear in sequence length.
However, this efficiency comes with a classic space vs. time tradeoff: the key-value cache grows linearly with sequence length, and as shown below, memory usage increases substantially as the text length increases.
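The quadratic-vs-linear behaviour can be illustrated by counting key/value computations in a toy decoding loop (no model involved; this counting scheme is a deliberate simplification):

```python
def kv_computations(n_tokens, use_cache):
    """Count key/value projections performed while generating n_tokens.

    Without a cache, every step re-projects K/V for the whole prefix, so the
    total is 1 + 2 + ... + n = n(n + 1) / 2. With a cache, each token's K/V
    is computed exactly once, so the total is just n.
    """
    count = 0
    for step in range(1, n_tokens + 1):
        count += 1 if use_cache else step
    return count

for n in (10, 100, 1000):
    print(n, kv_computations(n, use_cache=False), kv_computations(n, use_cache=True))
```

At 1000 tokens the uncached loop performs 500,500 projections versus 1000 with caching, which is the gap the runtime plots should reflect.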

