
pg_gpt2

pg_gpt2 is a complete implementation of the GPT-2 architecture entirely inside PostgreSQL. It extends the database with tensor algebra, automatic differentiation, AdamW optimization, checkpointing, and a Byte-Pair Encoding tokenizer — allowing end-to-end training and text generation purely through SQL and C extensions.


Overview

PostgreSQL is used as both the storage and execution environment for a large-scale transformer model. Each layer, weight, and intermediate activation lives in relational tables; tensor operations are implemented as C functions returning BYTEA buffers. Every forward pass, gradient computation, and parameter update is a deterministic SQL transaction.
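
To make the BYTEA convention concrete, here is a minimal sketch. The exact C-function names and signatures are defined by the extension itself; pg_llm_add and its (bytea, bytea) argument list are assumptions for illustration, based on the "add" kernel listed in the Architecture Summary.

-- Illustrative only: assumes pg_llm_add(a bytea, b bytea) adds two equal-length
-- float32 buffers element-wise and returns a new BYTEA of the same length.
WITH t AS (
  SELECT decode('0000803f0000003f', 'hex') AS a,  -- [1.0, 0.5] as little-endian float32
         decode('000000400000803f', 'hex') AS b   -- [2.0, 1.0]
)
SELECT pg_llm_add(a, b) FROM t;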

The project demonstrates that a relational database can serve as a full numerical engine, state store, and model runtime — no Python, PyTorch, or external ML stack required.


Prerequisites

Building the extension requires the PostgreSQL server development headers and build tooling so that pg_config --pgxs resolves to the pgxs.mk makefile. On Debian/Ubuntu systems install the package:

sudo apt-get install postgresql-server-dev-16

If PostgreSQL is installed somewhere custom, set the PG_CONFIG environment variable to point at the desired pg_config binary before running make.
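
Once built and installed, the extension is enabled per database. A minimal sketch, assuming the extension is registered as pg_llm (the name implied by the shipped pg_llm--0.1.0.sql script); adjust if the control file uses a different name:

-- Enable the extension in the target database. The name pg_llm is assumed from
-- the pg_llm--0.1.0.sql install script referenced elsewhere in this README.
CREATE EXTENSION pg_llm;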


Core Design Principles

  1. Postgres as OS — All computation and persistence live in SQL schemas and C extensions.
  2. Full Reproducibility — Every step, gradient, and checkpoint is a logged transaction.
  3. Numerical Fidelity — Bit-level parity with PyTorch’s GPT-2 (float32, row-major, GELU, LayerNorm, AdamW).
  4. Composability — Every tensor op is an SQL function; model architectures are relational graphs.
  5. Auditable Learning — Because gradients and weights are rows, the entire training process is queryable and replayable.
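
As a small illustration of the last point, training history can be audited with ordinary queries. The column names below (step, lr, loss) are assumptions for illustration; llm_train_log is described only as storing per-step learning rate and loss history.

-- Inspect the most recent training steps straight from SQL (illustrative column names).
SELECT step, lr, loss
FROM llm_train_log
ORDER BY step DESC
LIMIT 10;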

Architecture Summary

| Component | Description |
| --- | --- |
| Tensor Engine | C implementations of matmul, add, gelu, softmax, layernorm, cross_entropy over contiguous float32 blobs (BYTEA). |
| Autodiff Engine | Reverse-mode differentiation recorded in a relational tape (llm_tape, llm_tensor_rt), supporting backpropagation of all GPT-2 ops. |
| Optimizer | AdamW with bias correction, decoupled weight decay, gradient clipping, and a cosine learning-rate schedule. |
| Checkpointing | Import/export of weights as .npz or .safetensors archives; every snapshot is versioned in llm_checkpoint. |
| Tokenizer | Native Byte-Pair Encoding (BPE) tokenizer/decoder built from vocab.json + merges.txt. |
| Sampling Engine | Temperature, top-k, and top-p (nucleus) sampling for autoregressive generation. |
| Training Loop | SQL functions (llm_train, llm_train_step, llm_loss) orchestrate forward, backward, optimizer updates, and logging. |
| Inference | llm_generate(prompt) runs encoding → forward → sampling → decoding, returning coherent text completions. |

Key Tables

| Table | Purpose |
| --- | --- |
| llm_param | Model parameters, gradients, and optimizer state. |
| llm_dataset | Tokenized training sequences. |
| llm_tape / llm_tensor_rt | Computational graph and runtime tensors for autograd. |
| llm_autograd_mode | Single-row toggle that signals when forward passes should record autograd tape entries. |
| llm_checkpoint | Versioned checkpoint metadata and file paths. |
| llm_bpe_vocab / llm_bpe_merges | GPT-2 tokenizer vocabulary and merge ranks. |
| llm_train_log | Per-step learning rate and loss history. |
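
Because these are ordinary tables, model state can be inspected directly. A small sketch: the model and name columns of llm_param follow the (model, name, token_id) mapping described in the Autograd Workflow below, while the data column name is an assumption for illustration.

-- Check the autograd toggle table.
SELECT * FROM llm_autograd_mode;

-- Spot-check imported parameters (the "data" column name is assumed).
SELECT name, octet_length(data) AS bytes
FROM llm_param
WHERE model = 'gpt2-small'
ORDER BY name
LIMIT 10;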

Autograd Workflow

End-to-end training relies on a thin runtime that records every forward op in SQL so that gradients can be replayed later. The key moving pieces are:

  1. Parameter materialization. llm_materialize_params copies each row in llm_param into the temporary llm_tensor cache and creates a matching row in llm_tensor_rt. During that copy the helper pg_llm_autograd_map_param (or the equivalent SQL INSERT inside the function) must be invoked so the runtime tensor id is associated with the original (model, name, token_id) tuple. Any new C routine that constructs parameter views needs to perform the same mapping, or gradients will not flow back into llm_param. (See sql/pg_llm--0.1.0.sql and src/pg_llm_autograd.c.)
  2. Forward tape recording. Every C kernel checks pg_llm_autograd_enabled(); when the flag is set, the inputs and outputs are registered with pg_llm_autograd_track_tensor and the op is appended to llm_tape along with any metadata (shape, constants, etc.). This produces an ordered tape of every op in the forward pass. (See src/pg_llm.c.)
  3. Reverse traversal. llm_backprop walks the tape from the newest node back to the seed, dispatching gradients based on the recorded name field and writing results into llm_tensor_rt.grad. Once complete, llm_accumulate_grads copies those buffers back into llm_param.grad using the mapping created in step 1. (See sql/llm_backprop.sql and sql/pg_llm--0.1.0.sql.)
  4. Tied embeddings. GPT-2 reuses the token embedding (wte) for the final logits projection. After flattening the embedding table into a single matrix for pg_llm_matmul, ensure that the buffer is still mapped to the original embedding rows (via pg_llm_autograd_map_param) so the logits gradient is accumulated back into wte rather than into a detached copy. (See sql/pg_llm--0.1.0.sql and src/pg_llm_autograd.c.)
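
Between a forward pass and llm_backprop, the tape itself is queryable. The column names below (id, name) are inferred from the description above and may differ from the actual schema.

-- Most recent ops recorded on the tape (illustrative column names).
SELECT id, name
FROM llm_tape
ORDER BY id DESC
LIMIT 10;

-- After llm_backprop + llm_accumulate_grads, gradients should be back in llm_param.grad.
SELECT model, name, octet_length(grad) AS grad_bytes
FROM llm_param
WHERE grad IS NOT NULL
LIMIT 10;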

SQL API Reference

Model Initialization

SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz', 'gpt2-small');

Imports all pretrained GPT-2 weights into the llm_param table.

Forward Pass and Inference

-- Generate text directly in SQL
SELECT llm_generate('Once upon a time', 80, 0.9, 40, 0.92);
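
Because llm_generate is an ordinary SQL function, completions can be written straight into a table. A sketch reusing the argument pattern shown above; the completions table is a hypothetical name for illustration.

-- Batch prompts through the model and persist the results.
CREATE TABLE IF NOT EXISTS completions(prompt text, completion text);

INSERT INTO completions
SELECT p, llm_generate(p, 40, 0.8, 40, 0.95)
FROM (VALUES ('Once upon a time'),
             ('The database that dreamed of language')) AS v(p);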

Training

-- Train for 10,000 steps on tokenized text dataset
SELECT llm_train(
  'gpt2-small',
  10000,        -- steps
  12, 12, 768,  -- layers, heads, hidden size
  50257,        -- vocab size
  0.9, 0.999, 1e-8, 0.01, 2.5e-4, 2000  -- AdamW β1, β2, ε, weight decay; peak LR, warmup steps
);

Every step performs:

  1. Forward pass → loss (llm_loss)
  2. Reverse pass (llm_backprop)
  3. Gradient accumulation
  4. AdamW parameter updates
  5. Logging to llm_train_log
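
The learning-rate schedule itself is plain arithmetic, so it can be previewed in SQL before training. A sketch of a cosine schedule with linear warmup, assuming a 2.5e-4 peak rate, 2000 warmup steps, and 10,000 total steps; the extension computes the actual schedule internally, so this query is purely illustrative.

-- Preview the warmup + cosine decay curve (assumed shape; values are illustrative).
SELECT step,
       CASE
         WHEN step < 2000 THEN 2.5e-4 * step / 2000.0
         ELSE 2.5e-4 * 0.5 * (1 + cos(pi() * (step - 2000) / (10000 - 2000)))
       END AS lr
FROM generate_series(0, 10000, 1000) AS step;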

Checkpointing

-- Save a new checkpoint
SELECT llm_checkpoint_save('gpt2-small','after warmup 2k');

-- Restore a checkpoint
SELECT llm_checkpoint_load('gpt2-small',1);
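
Saved checkpoints can then be listed from the metadata table; the query stays generic because the exact column layout is whatever llm_checkpoint defines.

-- List saved checkpoints and their metadata.
SELECT * FROM llm_checkpoint;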

Tokenizer Utilities

-- Load GPT-2 BPE vocab and merges
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json','gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt','gpt2-small');

-- Encode and decode text
SELECT llm_encode('Hello world!','gpt2-small');
SELECT llm_decode(ARRAY[15496,2159,0],'gpt2-small');
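
A quick round-trip check, assuming llm_encode returns the same integer token array that llm_decode accepts:

-- Encoding then decoding should reproduce the original text.
SELECT llm_decode(llm_encode('Hello world!', 'gpt2-small'), 'gpt2-small');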

Utility Scripts

The repository includes Python helpers for preparing external assets before calling the SQL functions above. All scripts live under scripts/.

| Script | Purpose |
| --- | --- |
| convert_gpt2_checkpoint.py | Download/convert a HuggingFace GPT-2 checkpoint into the gzip-based .npz container expected by pg_llm_import_npz. |
| ingest_tokenizer.py | Load vocab.json and merges.txt tokenizer assets into llm_bpe_vocab/llm_bpe_merges over a PostgreSQL connection. |
| prepare_dataset.py | Tokenize raw text files with the GPT-2 tokenizer and populate llm_dataset with fixed-length (tokens, target) arrays. |

Install the optional Python dependencies with:

pip install transformers torch psycopg[binary]

Examples:

# 1. Convert HuggingFace weights to /mnt/models/gpt2-small.npz
python scripts/convert_gpt2_checkpoint.py --source gpt2 --output /mnt/models/gpt2-small.npz

# 2. Load tokenizer assets into PostgreSQL
python scripts/ingest_tokenizer.py \
  --dsn postgresql://postgres@localhost:5432/postgres \
  --model gpt2-small \
  --vocab /mnt/gpt2/vocab.json \
  --merges /mnt/gpt2/merges.txt --truncate

# 3. Tokenize a corpus and fill llm_dataset
python scripts/prepare_dataset.py \
  --dsn postgresql://postgres@localhost:5432/postgres \
  --tokenizer gpt2 \
  --input /mnt/corpus/*.txt \
  --block-size 1024 --truncate

An end-to-end walkthrough that stitches the helper scripts together is available in docs/python_workflow.md.


Mathematical Fidelity

All core operations follow the official GPT-2 equations:

Attention [ \mathrm{Attn}(x) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V ] with causal masking and learned positional embeddings.

Feed-Forward [ \mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2 ]

LayerNorm [ y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\gamma + \beta ]

Loss [ L = -\log \frac{e^{z_t}}{\sum_j e^{z_j}} ]

Optimizer (AdamW) [ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\ \hat{m}_t &= m_t / (1-\beta_1^t), \quad \hat{v}_t = v_t / (1-\beta_2^t) \\ \theta_t &= \theta_{t-1} - \eta\left(\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) + \lambda\theta_{t-1}\right) \end{aligned} ]


Example: End-to-End Flow

-- 1. Load model + tokenizer
SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz','gpt2-small');
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json','gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt','gpt2-small');

-- 2. Encode text
SELECT llm_encode('The database that dreamed of language.','gpt2-small');

-- 3. Generate continuation
SELECT llm_generate('The database that dreamed of language', 40, 0.8, 40, 0.95);

-- 4. Train or fine-tune
SELECT llm_train('gpt2-small', 5000, 12, 12, 768, 50257);

-- 5. Save checkpoint
SELECT llm_checkpoint_save('gpt2-small','finetuned on corpus X');

Performance Notes

  • All tensors are stored as raw BYTEA blobs and processed in-memory.
  • Core kernels (pg_llm_matmul, attention) use a tiled AVX2-aware micro-kernel that falls back to scalar math when SIMD is unavailable, delivering BLAS-class throughput without external dependencies.
  • Attention is evaluated in configurable row chunks (default 64 tokens) so that context matrices never exceed a manageable working set, enabling GPT-2 scale sequence lengths inside Postgres.
  • For large models, raise work_mem/maintenance_work_mem and consider chunking your training data via windowed queries so each step fits inside the executor's memory context (a settings sketch follows this list).
  • Store activations and optimizer scratch data in UNLOGGED tables (e.g., CREATE UNLOGGED TABLE llm_activations (...)) to avoid WAL amplification when materializing large tensors.
  • Autograd tape pruning and gradient accumulation can be parallelized safely within a transaction.
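
A minimal sketch of the session-level settings mentioned in the work_mem bullet above; the values are illustrative starting points, not tuned recommendations.

-- Raise per-operation memory for large tensor materializations (illustrative values).
SET work_mem = '1GB';
SET maintenance_work_mem = '2GB';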

Why Do This?

  • Proof of Concept: show that gradient-based learning can be expressed purely as relational algebra and transaction semantics.
  • Determinism: every computation is replayable and version-controlled.
  • Integration: unifies data, model, and training loop under a single ACID engine.
  • Pedagogy: transparent view into transformer internals, queryable step-by-step.