PageBrain

PageBrain is a lightweight LLM serving system that implements PagedAttention and page-level KV cache management from scratch — enabling efficient memory usage, minimized fragmentation, and scalable multi-request inference on a single GPU.

This project was developed as a graduation capstone and continues to evolve into a personal research/engineering project. It aims to provide an educational reference implementation of modern LLM serving techniques, inspired by systems like vLLM but written to be readable, hackable, and extensible.

✨ Key Features

  • 🚀 Page-based KV Cache Memory System
    Fully custom logical–physical KV layout with block tables, mapping, and allocation.

  • 🧩 PagedAttention Module (Custom Implementation)
    No full-context reconstruction. Attends only to necessary blocks, accumulating scores across pages.

  • 🧠 HuggingFace Model Integration
    Works with GPT-2 (others planned). Drop-in replacement for standard attention.

  • 🔄 Unified Prefill + Decode Forward Loop
    Ensures identical results to vanilla HF models under matching conditions.

  • 📦 Multi-Request Scheduler
    Request queue → Sequence → SequenceGroup → Forward execution pipeline.

🏗 System Architecture

Fig.1 - System architecture of PageBrain

Core Components

  • Engine – Central loop that polls requests, schedules sequences, and triggers model forward passes.

  • Scheduler – Converts user requests into sequences, groups them, and decides execution order.

  • SequenceGroup – Batched group of active sequences for a single forward call.

  • Executor – Runs the model with the provided SequenceGroup + CacheManager.

  • CacheManager / BlockManager – Handles KV cache allocation, updates, lookup, and physical memory layouts.

  • PagedAttention module – Computes attention using logical block indices instead of full KV tensors.
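
As a rough sketch, the Engine loop ties these pieces together roughly as below. This is toy Python with stand-in classes; the class and method names are illustrative assumptions, not PageBrain's actual API.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Sequence:
    request_id: int
    tokens: list = field(default_factory=list)
    done: bool = False

class Scheduler:
    def __init__(self):
        self.active = deque()

    def add_request(self, request_id, prompt_tokens):
        # Convert a raw user request into a Sequence and queue it.
        self.active.append(Sequence(request_id, list(prompt_tokens)))

    def schedule(self, max_batch=8):
        # Batch up to max_batch unfinished sequences into one "SequenceGroup".
        group = [s for s in self.active if not s.done][:max_batch]
        return group or None

class Executor:
    def execute(self, group):
        # Stand-in for the real model forward pass over the whole group:
        # returns one new token id per sequence.
        return {s.request_id: 42 for s in group}

scheduler, executor = Scheduler(), Executor()
scheduler.add_request(0, [1, 2, 3])
group = scheduler.schedule()
if group:
    new_tokens = executor.execute(group)
    for seq in group:
        seq.tokens.append(new_tokens[seq.request_id])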

🔍 PagedAttention

Fig.2 - Illustration of attention

Modern decoder-only Transformers rely on a repeated attention loop during text generation. At every generation step, the model:
  1. receives a new input sequence,
  2. computes new q, k, v vectors for the final token,
  3. attends to all previous key/value vectors,
  4. and produces the next output token.

During this process, one important observation is that previous K/V vectors never change. Thus, these intermediate results are typically cached and reused at every step to avoid recomputation.
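
In plain PyTorch, that cached loop looks roughly like the following single-head sketch. It is illustrative code, not PageBrain's implementation; the identity q/k/v projections are placeholders for the real learned projections.

import torch

d = 64                                        # head dimension

def decode_step(x, k_cache, v_cache):
    """One generation step with a KV cache (single head, illustrative)."""
    q = k = v = x                             # stand-ins for real q/k/v projections
    k_cache = torch.cat([k_cache, k[None]])   # past K/V never change, so we
    v_cache = torch.cat([v_cache, v[None]])   # only append the new token's entry
    scores = (k_cache @ q) / d ** 0.5         # attend to all cached keys
    weights = torch.softmax(scores, dim=0)
    return weights @ v_cache, k_cache, v_cache

k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)
for _ in range(5):                            # generate 5 tokens
    out, k_cache, v_cache = decode_step(torch.randn(d), k_cache, v_cache)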

The Problem: Repeated Computation & Memory Fragmentation

Although KV caching is conceptually simple, implementing it efficiently on GPUs is not.

  • Each layer stores massive K/V tensors.
  • PyTorch allocates them contiguously on GPU memory.
  • As generation progresses, the cache grows token-by-token, and GPU memory eventually fills with small, non-contiguous free regions.

Even if the GPU has enough total free memory, it can still fail to allocate a contiguous block large enough → the classic memory fragmentation problem.
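
A quick back-of-the-envelope calculation for GPT-2 small (12 layers, hidden size 768, fp16 assumed) shows the scale involved:

layers, hidden, dtype_bytes = 12, 768, 2
per_token = 2 * layers * hidden * dtype_bytes   # K and V, for every layer
print(per_token)            # 36864 bytes, i.e. ~36 KB per token
print(per_token * 1024)     # ~36 MB of cache for a single 1024-token sequence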

OS-Inspired Solution: Page-based Cache Management

PageBrain adopts the same idea used in operating systems:

Introduce a virtual memory abstraction for KV cache → logical token indices are mapped to physical GPU blocks.

  • GPU memory is divided into fixed-size blocks (“pages”).
  • Each logical K/V position (layer × token index) maps to a page entry.
  • A BlockTable manages all logical → physical mappings.
  • New K/V entries no longer require contiguous GPU space.

This allows PageBrain to sidestep external fragmentation entirely while keeping allocation fast.
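
A toy version of this mapping is sketched below; the class and field names are illustrative, not PageBrain's actual BlockTable.

BLOCK_SIZE = 16                      # tokens per page

class ToyBlockTable:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # pool of free physical pages
        self.table = {}                               # seq_id -> [physical page ids]

    def slot_for(self, seq_id, token_idx):
        """Return (physical_block, offset) for a logical token position."""
        logical_block, offset = divmod(token_idx, BLOCK_SIZE)
        blocks = self.table.setdefault(seq_id, [])
        while len(blocks) <= logical_block:           # allocate on demand:
            blocks.append(self.free.pop())            # any free page will do,
        return blocks[logical_block], offset          # contiguity not required

bt = ToyBlockTable(num_physical_blocks=64)
print(bt.slot_for(seq_id=0, token_idx=37))   # e.g. (61, 5): some free page, offset 5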

The Core: Page-by-Page Attention Accumulation

Fig.3 - Illustration of PagedAttention

With the page-based KV layout established, PageBrain implements a custom PagedAttention operator:
  • K/V vectors are stored in fixed-size pages.
  • During attention computation, the system iterates over pages, not the full dense tensor.
  • Only the blocks relevant for the current sequence are read.
  • Attention scores are accumulated incrementally as the pages are scanned.

This means:

  • No reconstruction of the full KV tensor
  • No need for contiguous memory
  • Consistent attention scores thanks to correct masking + block indexing
  • Efficient scaling for long sequences

The math shown in Fig.3 describes exactly this process: accumulating a_ij · v_j page by page until the full context has been consumed.
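
A simplified single-query version of this accumulation is sketched below, using a numerically stable online softmax so no full KV tensor is ever materialized. It assumes keys/values are already gathered per page; this is an illustrative sketch, not PageBrain's kernel.

import torch

def paged_attention(q, k_pages, v_pages):
    """Single-query attention accumulated page by page (illustrative).

    q: (d,) query; k_pages/v_pages: lists of (page_len, d) tensors, one per page.
    """
    d = q.shape[0]
    m = torch.tensor(float("-inf"))   # running max of scores
    l = torch.tensor(0.0)             # running softmax denominator
    acc = torch.zeros(d)              # running weighted sum of values
    for k, v in zip(k_pages, v_pages):
        s = (k @ q) / d ** 0.5                 # scores for this page only
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)           # rescale earlier partial sums
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

# Example: 3 pages of 16 tokens each, head dim 64
d = 64
q = torch.randn(d)
k_pages = [torch.randn(16, d) for _ in range(3)]
v_pages = [torch.randn(16, d) for _ in range(3)]
out = paged_attention(q, k_pages, v_pages)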

🚀 Installation

Python 3.10+ recommended

git clone https://github.com/jiseokson/PageBrain.git
cd PageBrain
pip install -r requirements.txt
pip install -e .

▶️ Basic Usage

PageBrain can be used in two ways:

  • (CLI Server) Launch the full PageBrain inference server with a single command.
  • (Python API) Load the model and execute generation directly inside your own Python code.

1. Run the PageBrain Server (CLI)

PageBrain provides a command-line interface for running a full LLM serving backend. This mode launches:

  • the HTTP API server
  • the request scheduler
  • the KV cache manager
  • the PagedAttention-enabled model

... so you can send requests via the REST API or with curl.

pagebrain --model openai-community/gpt2
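
A request might then look like the following. Note that the endpoint path and payload fields shown here are assumptions for illustration; check the server's actual route definitions.

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, PageBrain!", "max_tokens": 32}'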

2. Programmatic Usage (Python API)

You can load a HuggingFace model, wrap it with PageBrain’s paged KV cache system, and run inference directly in Python. This approach is ideal for research, experimentation, or embedding PageBrain into a larger application.

from transformers import AutoTokenizer, AutoModelForCausalLM
from pagebrain.models.gpt2 import PagedGPT2LMHeadModel
from pagebrain.block import BlockManager
from pagebrain.cache import CacheManager

# Load tokenizer & base model
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

# Initialize cache manager for paged KV memory
# (`config` is a PageBrain configuration object holding block size, block
#  count, model dimensions, etc.; construct it to match your setup)
block_manager = BlockManager(config)
cache_manager = CacheManager(block_manager, config)

# Wrap the model with PageBrain's PagedAttention-enabled module
model = PagedGPT2LMHeadModel(base, cache_manager).eval()

# Example inference
# (input_pos / cache_pos tell the paged cache which token positions and
#  cache slots this call covers; fill the `...` placeholders accordingly)
inputs = tokenizer("Hello, PageBrain!", return_tensors="pt")
outputs = model(inputs['input_ids'], input_pos=..., cache_pos=...)
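
Assuming the wrapper returns HuggingFace-style logits (an assumption; check the model's actual output type), the next token can then be picked greedily:

# Assumes `outputs` carries logits shaped (batch, seq_len, vocab), as in HF models
next_id = outputs.logits[0, -1].argmax().item()
print(tokenizer.decode([next_id]))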
