pg_gpt2 is a complete implementation of the GPT-2 architecture entirely inside PostgreSQL. It extends the database with tensor algebra, automatic differentiation, AdamW optimization, checkpointing, and a Byte-Pair Encoding tokenizer — allowing end-to-end training and text generation purely through SQL and C extensions.
PostgreSQL serves as both the storage and the execution environment for a full transformer model. Each layer, weight, and intermediate activation lives in relational tables; tensor operations are implemented as C functions returning `BYTEA` buffers. Every forward pass, gradient computation, and parameter update is a deterministic SQL transaction.
The project demonstrates that a relational database can serve as a full numerical engine, state store, and model runtime — no Python, PyTorch, or external ML stack required.
Building the extension requires the PostgreSQL server development headers and build tooling so that `pg_config --pgxs` resolves to the `pgxs.mk` makefile. On Debian/Ubuntu systems, install the package:

```bash
sudo apt-get install postgresql-server-dev-16
```

If PostgreSQL is installed in a custom location, set the `PG_CONFIG` environment variable to the desired `pg_config` binary before running `make`.
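Once the headers are in place, the standard PGXS flow (`make`, then `make install`) builds and installs the extension, after which it can be created in a database. The extension name `pg_llm` is inferred from the bundled `sql/pg_llm--0.1.0.sql` script and is an assumption here:

```sql
-- Extension name assumed from sql/pg_llm--0.1.0.sql; adjust if it differs.
CREATE EXTENSION IF NOT EXISTS pg_llm;
```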
- Postgres as OS — All computation and persistence live in SQL schemas and C extensions.
- Full Reproducibility — Every step, gradient, and checkpoint is a logged transaction.
- Numerical Fidelity — Bit-level parity with PyTorch’s GPT-2 (`float32`, row-major, GELU, LayerNorm, AdamW).
- Composability — Every tensor op is an SQL function; model architectures are relational graphs.
- Auditable Learning — Because gradients and weights are rows, the entire training process is queryable and replayable.
| Component | Description |
|---|---|
| Tensor Engine | C implementations of `matmul`, `add`, `gelu`, `softmax`, `layernorm`, `cross_entropy` over contiguous `float32` blobs (`BYTEA`). |
| Autodiff Engine | Reverse-mode differentiation recorded in a relational tape (`llm_tape`, `llm_tensor_rt`), supporting backpropagation of all GPT-2 ops. |
| Optimizer | AdamW with bias correction, decoupled weight decay, gradient clipping, and cosine learning-rate schedule. |
| Checkpointing | Import/export weights as `.npz` or `.safetensors` archives. Every snapshot is versioned in `llm_checkpoint`. |
| Tokenizer | Native Byte-Pair Encoding (BPE) tokenizer/decoder built from `vocab.json` + `merges.txt`. |
| Sampling Engine | Temperature, top-k, and top-p (nucleus) sampling for autoregressive generation. |
| Training Loop | SQL functions (`llm_train`, `llm_train_step`, `llm_loss`) orchestrate forward, backward, optimizer updates, and logging. |
| Inference | `llm_generate(prompt)` runs encoding → forward → sampling → decoding, returning coherent text completions. |
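Because every kernel is exposed as an SQL function, ops can be composed directly in queries. The sketch below is illustrative only: the argument lists of `pg_llm_matmul` and `pg_llm_gelu` (shapes as integers, tensors as `BYTEA`) and the `data`/`name` columns of `llm_param` are assumptions, not documented signatures.

```sql
-- Hypothetical signatures:
--   pg_llm_matmul(a BYTEA, b BYTEA, m INT, k INT, n INT) RETURNS BYTEA
--   pg_llm_gelu(x BYTEA)                                 RETURNS BYTEA
SELECT pg_llm_gelu(
         pg_llm_matmul(x.data, w.data, 1, 768, 3072)
       ) AS ffn_hidden
FROM   llm_param x,
       llm_param w
WHERE  x.name = 'hypothetical_input'
AND    w.name = 'h.0.mlp.c_fc.weight';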
| Table | Purpose |
|---|---|
| `llm_param` | Model parameters, gradients, optimizer state. |
| `llm_dataset` | Tokenized training sequences. |
| `llm_tape` / `llm_tensor_rt` | Computational graph and runtime tensors for autograd. |
| `llm_autograd_mode` | Single-row toggle that signals when forward passes should record autograd tape entries. |
| `llm_checkpoint` | Versioned checkpoint metadata and file paths. |
| `llm_bpe_vocab` / `llm_bpe_merges` | GPT-2 tokenizer vocabulary and merge ranks. |
| `llm_train_log` | Per-step learning rate and loss history. |
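Because weights and gradients are ordinary rows, model state can be audited with plain SQL. A minimal sketch, assuming `llm_param` stores its tensors in `data` and `grad` `BYTEA` columns keyed by `model` and `name` (column names are assumptions):

```sql
-- Largest parameter tensors for a model, with their gradient buffer sizes.
SELECT name,
       octet_length(data) AS param_bytes,
       octet_length(grad) AS grad_bytes
FROM   llm_param
WHERE  model = 'gpt2-small'
ORDER  BY octet_length(data) DESC
LIMIT  10;
```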
End-to-end training relies on a thin runtime that records every forward op in SQL so that gradients can be replayed later. The key moving pieces are:
- Parameter materialization. `llm_materialize_params` copies each row in `llm_param` into the temporary `llm_tensor` cache and creates a matching row in `llm_tensor_rt`. During that copy the helper `pg_llm_autograd_map_param` (or its SQL-equivalent `INSERT` in the function) must be invoked so the runtime tensor id is associated with the original `(model, name, token_id)` tuple. Any new C routine that constructs parameter views needs to perform the same mapping, or gradients will not flow back into `llm_param`. 【F:sql/pg_llm--0.1.0.sql†L403-L438】【F:src/pg_llm_autograd.c†L216-L246】
- Forward tape recording. Every C kernel checks `pg_llm_autograd_enabled()`; when the flag is set, the inputs and outputs are registered with `pg_llm_autograd_track_tensor` and the op is appended to `llm_tape` with any metadata (shape, constants, etc.). This produces an ordered tape of all ops in the forward pass. 【F:src/pg_llm.c†L19-L210】
- Reverse traversal. `llm_backprop` walks the tape from the newest node back to the seed, dispatching gradients based on the recorded `name` field and writing results into `llm_tensor_rt.grad`. Once complete, `llm_accumulate_grads` copies those buffers back into `llm_param.grad` using the mapping created in step 1; a tape-inspection sketch follows this list. 【F:sql/llm_backprop.sql†L1-L78】【F:sql/pg_llm--0.1.0.sql†L439-L456】
- Tied embeddings. GPT-2 reuses the token embedding (`wte`) for the final logits projection. After flattening the embedding table into a single matrix for `pg_llm_matmul`, ensure that buffer is still mapped to the original embedding rows (via `pg_llm_autograd_map_param`) so the logits gradient is accumulated back into `wte` rather than into a detached copy. 【F:sql/pg_llm--0.1.0.sql†L173-L205】【F:src/pg_llm_autograd.c†L216-L246】
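The tape itself is just a table, so the recorded forward graph can be inspected directly. A minimal sketch, assuming `llm_tape` exposes an ordinal `id`, an op `name`, and input/output tensor id columns (column names are assumptions):

```sql
-- Most recently recorded forward ops, newest first
-- (roughly the order in which llm_backprop will visit them).
SELECT id, name, inputs, output
FROM   llm_tape
ORDER  BY id DESC
LIMIT  10;
```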
```sql
SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz', 'gpt2-small');
```

Imports all pretrained GPT-2 weights into the `llm_param` table.
```sql
-- Generate text directly in SQL
SELECT llm_generate('Once upon a time', 80, 0.9, 40, 0.92);

-- Train for 10,000 steps on the tokenized text dataset
SELECT llm_train(
    'gpt2-small',
    10000,          -- steps
    12, 12, 768,    -- layers, heads, hidden size
    50257,          -- vocab size
    0.9, 0.999, 1e-8, 0.01, 2.5e-4, 2000  -- Adam β1, β2, ε, weight decay, LR, warmup steps (order assumed)
);
```
Every step performs:
- Forward pass → loss (`llm_loss`)
- Reverse pass (`llm_backprop`)
- Gradient accumulation
- AdamW parameter updates
- Logging to `llm_train_log`
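Because every step logs to `llm_train_log`, training can be monitored from another session with plain SQL. A minimal sketch, assuming `step` and `loss` columns (names are assumptions):

```sql
-- Average loss over the most recent 100 logged steps.
SELECT min(step)                    AS from_step,
       max(step)                    AS to_step,
       round(avg(loss)::numeric, 4) AS mean_loss
FROM  (SELECT step, loss
       FROM   llm_train_log
       ORDER  BY step DESC
       LIMIT  100) AS recent;
```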
```sql
-- Save a new checkpoint
SELECT llm_checkpoint_save('gpt2-small', 'after warmup 2k');

-- Restore a checkpoint
SELECT llm_checkpoint_load('gpt2-small', 1);
```
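Since snapshots are versioned in `llm_checkpoint`, the available restore points can be listed before choosing an id to load. The `model` filter column below is an assumption about the table's schema:

```sql
-- Browse saved checkpoints for a model before restoring one.
SELECT *
FROM   llm_checkpoint
WHERE  model = 'gpt2-small'
ORDER  BY 1;
```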
```sql
-- Load GPT-2 BPE vocab and merges
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json', 'gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt', 'gpt2-small');

-- Encode and decode text
SELECT llm_encode('Hello world!', 'gpt2-small');
SELECT llm_decode(ARRAY[15496, 2159, 0], 'gpt2-small');
```
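A quick sanity check for the tokenizer tables is a round trip through both functions. This assumes `llm_encode` returns an integer array that `llm_decode` accepts directly:

```sql
-- decode(encode(x)) should reproduce the original text.
SELECT llm_decode(llm_encode('Hello world!', 'gpt2-small'), 'gpt2-small') AS round_trip;
```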
The repository includes Python helpers for preparing external assets before calling the SQL functions above. All scripts live under `scripts/`.

| Script | Purpose |
|---|---|
| `convert_gpt2_checkpoint.py` | Download/convert a HuggingFace GPT-2 checkpoint into the gzip-based `.npz` container expected by `pg_llm_import_npz`. |
| `ingest_tokenizer.py` | Load `vocab.json` and `merges.txt` tokenizer assets into `llm_bpe_vocab`/`llm_bpe_merges` using a PostgreSQL connection. |
| `prepare_dataset.py` | Tokenize raw text files with the GPT-2 tokenizer and populate `llm_dataset` with fixed-length (tokens, target) arrays. |
Install the optional Python dependencies with:

```bash
pip install transformers torch psycopg[binary]
```
Examples:

```bash
# 1. Convert HuggingFace weights to /mnt/models/gpt2-small.npz
python scripts/convert_gpt2_checkpoint.py --source gpt2 --output /mnt/models/gpt2-small.npz

# 2. Load tokenizer assets into PostgreSQL
python scripts/ingest_tokenizer.py \
    --dsn postgresql://postgres@localhost:5432/postgres \
    --model gpt2-small \
    --vocab /mnt/gpt2/vocab.json \
    --merges /mnt/gpt2/merges.txt --truncate

# 3. Tokenize a corpus and fill llm_dataset
python scripts/prepare_dataset.py \
    --dsn postgresql://postgres@localhost:5432/postgres \
    --tokenizer gpt2 \
    --input /mnt/corpus/*.txt \
    --block-size 1024 --truncate
```
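After `prepare_dataset.py` finishes, a quick count confirms that fixed-length sequences landed in `llm_dataset`. The `tokens` array column used for the length check is an assumption about the table's layout:

```sql
-- How many training sequences were ingested, and how long are they?
SELECT count(*)                     AS sequences,
       min(array_length(tokens, 1)) AS min_len,
       max(array_length(tokens, 1)) AS max_len
FROM   llm_dataset;
```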
An end-to-end walkthrough that stitches the helper scripts together is available in docs/python_workflow.md.
All core operations follow the official GPT-2 equations:
Attention, with causal masking and learned positional embeddings:

$$
\mathrm{Attn}(x) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V
$$

Feed-Forward:

$$
\mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2
$$

LayerNorm:

$$
y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\,\gamma + \beta
$$

Loss:

$$
L = -\log \frac{e^{z_t}}{\sum_j e^{z_j}}
$$

Optimizer (AdamW):

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \eta\left(\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) + \lambda\,\theta_{t-1}\right)
\end{aligned}
$$
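To show how the AdamW recurrence maps onto rows, the sketch below writes one update step as a single SQL statement over a hypothetical `params(theta, m, v, grad)` table of scalar `float8` values. This is purely conceptual: the extension applies the same recurrence inside a C kernel over `BYTEA` buffers, not via row-wise updates.

```sql
-- Conceptual AdamW step over a hypothetical table of scalar parameters:
--   params(theta float8, m float8, v float8, grad float8)
-- with beta1 = 0.9, beta2 = 0.999, eps = 1e-8, lr = 2.5e-4, lambda = 0.01, and step t = 1
-- (so the bias corrections are 1 - 0.9 and 1 - 0.999).
-- NOTE: every SET expression sees the pre-update row, so m_t and v_t are
-- recomputed inline when updating theta.
UPDATE params SET
    m     = 0.9   * m + 0.1   * grad,
    v     = 0.999 * v + 0.001 * grad * grad,
    theta = theta
          - 2.5e-4 * (
                ((0.9 * m + 0.1 * grad) / (1 - 0.9))
              / (sqrt((0.999 * v + 0.001 * grad * grad) / (1 - 0.999)) + 1e-8)
              + 0.01 * theta
            );
```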
```sql
-- 1. Load model + tokenizer
SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz', 'gpt2-small');
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json', 'gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt', 'gpt2-small');

-- 2. Encode text
SELECT llm_encode('The database that dreamed of language.', 'gpt2-small');

-- 3. Generate a continuation
SELECT llm_generate('The database that dreamed of language', 40, 0.8, 40, 0.95);

-- 4. Train or fine-tune
SELECT llm_train('gpt2-small', 5000, 12, 12, 768, 50257);

-- 5. Save a checkpoint
SELECT llm_checkpoint_save('gpt2-small', 'finetuned on corpus X');
```
- All tensors are stored as raw `BYTEA` blobs and processed in memory.
- Core kernels (`pg_llm_matmul`, attention) use a tiled AVX2-aware micro-kernel that falls back to scalar math when SIMD is unavailable, delivering BLAS-class throughput without external dependencies.
- Attention is evaluated in configurable row chunks (default 64 tokens) so that context matrices never exceed a manageable working set, enabling GPT-2-scale sequence lengths inside Postgres.
- For large models, raise `work_mem`/`maintenance_work_mem` and consider chunking your training data via windowed queries so each step fits inside the executor's memory context.
- Store activations and optimizer scratch data in `UNLOGGED` tables (e.g., `CREATE UNLOGGED TABLE llm_activations (...)`) to avoid WAL amplification when materializing large tensors; see the sketch after this list.
- Autograd tape pruning and gradient accumulation can be parallelized safely within a transaction.
- Proof of Concept: show that gradient-based learning can be expressed purely as relational algebra and transaction semantics.
- Determinism: every computation is replayable and version-controlled.
- Integration: unifies data, model, and training loop under a single ACID engine.
- Pedagogy: transparent view into transformer internals, queryable step-by-step.