
pg_gpt2

pg_gpt2 is a complete implementation of the GPT-2 architecture entirely inside PostgreSQL. It extends the database with tensor algebra, automatic differentiation, AdamW optimization, checkpointing, and a Byte-Pair Encoding tokenizer — allowing end-to-end training and text generation purely through SQL and C extensions.


Overview

PostgreSQL is used as both the storage and execution environment for a large-scale transformer model. Each layer, weight, and intermediate activation lives in relational tables; tensor operations are implemented as C functions returning BYTEA buffers. Every forward pass, gradient computation, and parameter update is a deterministic SQL transaction.
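
To make the BYTEA convention concrete, here is a minimal sketch. The exact C-function names and signatures are defined by the extension itself; pg_llm_add and its (bytea, bytea) argument list are assumptions for illustration, based on the "add" kernel listed in the Architecture Summary.

-- Illustrative only: assumes pg_llm_add(a bytea, b bytea) adds two equal-length
-- float32 buffers element-wise and returns a new BYTEA of the same length.
WITH t AS (
  SELECT decode('0000803f0000003f', 'hex') AS a,  -- [1.0, 0.5] as little-endian float32
         decode('000000400000803f', 'hex') AS b   -- [2.0, 1.0]
)
SELECT pg_llm_add(a, b) FROM t;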

The project demonstrates that a relational database can serve as a full numerical engine, state store, and model runtime — no Python, PyTorch, or external ML stack required.


Prerequisites

Building the extension requires the PostgreSQL server development headers and build tooling so that pg_config --pgxs resolves to the pgxs.mk makefile. On Debian/Ubuntu systems install the package:

sudo apt-get install postgresql-server-dev-16

If PostgreSQL is installed somewhere custom, set the PG_CONFIG environment variable to point at the desired pg_config binary before running make.
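
Once built and installed, the extension is enabled per database. A minimal sketch, assuming the extension is registered as pg_llm (the name implied by the shipped pg_llm--0.1.0.sql script); adjust if the control file uses a different name:

-- Enable the extension in the target database. The name pg_llm is assumed from
-- the pg_llm--0.1.0.sql install script referenced elsewhere in this README.
CREATE EXTENSION pg_llm;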


Core Design Principles

  1. Postgres as OS — All computation and persistence live in SQL schemas and C extensions.
  2. Full Reproducibility — Every step, gradient, and checkpoint is a logged transaction.
  3. Numerical Fidelity — Bit-level parity with PyTorch’s GPT-2 (float32, row-major, GELU, LayerNorm, AdamW).
  4. Composability — Every tensor op is an SQL function; model architectures are relational graphs.
  5. Auditable Learning — Because gradients and weights are rows, the entire training process is queryable and replayable.
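
As a small illustration of the last point, training history can be audited with ordinary queries. The column names below (step, lr, loss) are assumptions for illustration; llm_train_log is described only as storing per-step learning rate and loss history.

-- Inspect the most recent training steps straight from SQL (illustrative column names).
SELECT step, lr, loss
FROM llm_train_log
ORDER BY step DESC
LIMIT 10;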

Architecture Summary

| Component | Description |
| --- | --- |
| Tensor Engine | C implementations of matmul, add, gelu, softmax, layernorm, cross_entropy over contiguous float32 blobs (BYTEA). |
| Autodiff Engine | Reverse-mode differentiation recorded in a relational tape (llm_tape, llm_tensor_rt), supporting backpropagation of all GPT-2 ops. |
| Optimizer | AdamW with bias correction, decoupled weight decay, gradient clipping, and a cosine learning-rate schedule. |
| Checkpointing | Import/export of weights as .npz or .safetensors archives; every snapshot is versioned in llm_checkpoint. |
| Tokenizer | Native Byte-Pair Encoding (BPE) tokenizer/decoder built from vocab.json + merges.txt. |
| Sampling Engine | Temperature, top-k, and top-p (nucleus) sampling for autoregressive generation. |
| Training Loop | SQL functions (llm_train, llm_train_step, llm_loss) orchestrate forward, backward, optimizer updates, and logging. |
| Inference | llm_generate(prompt) runs encoding → forward → sampling → decoding, returning coherent text completions. |

Key Tables

| Table | Purpose |
| --- | --- |
| llm_param | Model parameters, gradients, and optimizer state. |
| llm_dataset | Tokenized training sequences. |
| llm_tape / llm_tensor_rt | Computational graph and runtime tensors for autograd. |
| llm_autograd_mode | Single-row toggle that signals when forward passes should record autograd tape entries. |
| llm_checkpoint | Versioned checkpoint metadata and file paths. |
| llm_bpe_vocab / llm_bpe_merges | GPT-2 tokenizer vocabulary and merge ranks. |
| llm_train_log | Per-step learning rate and loss history. |
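
Because these are ordinary tables, model state can be inspected directly. A small sketch: the model and name columns of llm_param follow the (model, name, token_id) mapping described in the Autograd Workflow below, while the data column name is an assumption for illustration.

-- Check the autograd toggle table.
SELECT * FROM llm_autograd_mode;

-- Spot-check imported parameters (the "data" column name is assumed).
SELECT name, octet_length(data) AS bytes
FROM llm_param
WHERE model = 'gpt2-small'
ORDER BY name
LIMIT 10;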

Autograd Workflow

End-to-end training relies on a thin runtime that records every forward op in SQL so that gradients can be replayed later. The key moving pieces are:

  1. Parameter materialization. llm_materialize_params copies each row in llm_param into the temporary llm_tensor cache and creates a matching row in llm_tensor_rt. During that copy the helper pg_llm_autograd_map_param (or the equivalent SQL INSERT inside the function) must be invoked so the runtime tensor id is associated with the original (model, name, token_id) tuple. Any new C routine that constructs parameter views needs to perform the same mapping, or gradients will not flow back into llm_param. (See sql/pg_llm--0.1.0.sql and src/pg_llm_autograd.c.)
  2. Forward tape recording. Every C kernel checks pg_llm_autograd_enabled(); when the flag is set, the inputs and outputs are registered with pg_llm_autograd_track_tensor and the op is appended to llm_tape along with any metadata (shape, constants, etc.). This produces an ordered tape of every op in the forward pass. (See src/pg_llm.c.)
  3. Reverse traversal. llm_backprop walks the tape from the newest node back to the seed, dispatching gradients based on the recorded name field and writing results into llm_tensor_rt.grad. Once complete, llm_accumulate_grads copies those buffers back into llm_param.grad using the mapping created in step 1. (See sql/llm_backprop.sql and sql/pg_llm--0.1.0.sql.)
  4. Tied embeddings. GPT-2 reuses the token embedding (wte) for the final logits projection. After flattening the embedding table into a single matrix for pg_llm_matmul, ensure that the buffer is still mapped to the original embedding rows (via pg_llm_autograd_map_param) so the logits gradient is accumulated back into wte rather than into a detached copy. (See sql/pg_llm--0.1.0.sql and src/pg_llm_autograd.c.)
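
Between a forward pass and llm_backprop, the tape itself is queryable. The column names below (id, name) are inferred from the description above and may differ from the actual schema.

-- Most recent ops recorded on the tape (illustrative column names).
SELECT id, name
FROM llm_tape
ORDER BY id DESC
LIMIT 10;

-- After llm_backprop + llm_accumulate_grads, gradients should be back in llm_param.grad.
SELECT model, name, octet_length(grad) AS grad_bytes
FROM llm_param
WHERE grad IS NOT NULL
LIMIT 10;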

SQL API Reference

Model Initialization

SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz', 'gpt2-small');

Imports all pretrained GPT-2 weights into the llm_param table.

Forward Pass and Inference

-- Generate text directly in SQL
SELECT llm_generate('Once upon a time', 80, 0.9, 40, 0.92);
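
Because llm_generate is an ordinary SQL function, completions can be written straight into a table. A sketch reusing the argument pattern shown above; the completions table is a hypothetical name for illustration.

-- Batch prompts through the model and persist the results.
CREATE TABLE IF NOT EXISTS completions(prompt text, completion text);

INSERT INTO completions
SELECT p, llm_generate(p, 40, 0.8, 40, 0.95)
FROM (VALUES ('Once upon a time'),
             ('The database that dreamed of language')) AS v(p);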

Training

-- Train for 10,000 steps on tokenized text dataset
SELECT llm_train(
  'gpt2-small',
  10000,        -- steps
  12, 12, 768,  -- layers, heads, hidden size
  50257,        -- vocab size
  0.9, 0.999, 1e-8, 0.01, 2.5e-4, 2000  -- AdamW β1, β2, ε, weight decay; peak LR, warmup steps
);

Every step performs:

  1. Forward pass → loss (llm_loss)
  2. Reverse pass (llm_backprop)
  3. Gradient accumulation
  4. AdamW parameter updates
  5. Logging to llm_train_log
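
The learning-rate schedule itself is plain arithmetic, so it can be previewed in SQL before training. A sketch of a cosine schedule with linear warmup, assuming a 2.5e-4 peak rate, 2000 warmup steps, and 10,000 total steps; the extension computes the actual schedule internally, so this query is purely illustrative.

-- Preview the warmup + cosine decay curve (assumed shape; values are illustrative).
SELECT step,
       CASE
         WHEN step < 2000 THEN 2.5e-4 * step / 2000.0
         ELSE 2.5e-4 * 0.5 * (1 + cos(pi() * (step - 2000) / (10000 - 2000)))
       END AS lr
FROM generate_series(0, 10000, 1000) AS step;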

Checkpointing

-- Save a new checkpoint
SELECT llm_checkpoint_save('gpt2-small','after warmup 2k');

-- Restore a checkpoint
SELECT llm_checkpoint_load('gpt2-small',1);
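
Saved checkpoints can then be listed from the metadata table; the query stays generic because the exact column layout is whatever llm_checkpoint defines.

-- List saved checkpoints and their metadata.
SELECT * FROM llm_checkpoint;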

Tokenizer Utilities

-- Load GPT-2 BPE vocab and merges
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json','gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt','gpt2-small');

-- Encode and decode text
SELECT llm_encode('Hello world!','gpt2-small');
SELECT llm_decode(ARRAY[15496,2159,0],'gpt2-small');
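
A quick round-trip check, assuming llm_encode returns the same integer token array that llm_decode accepts:

-- Encoding then decoding should reproduce the original text.
SELECT llm_decode(llm_encode('Hello world!', 'gpt2-small'), 'gpt2-small');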

Utility Scripts

The repository includes Python helpers for preparing external assets before calling the SQL functions above. All scripts live under scripts/.

| Script | Purpose |
| --- | --- |
| convert_gpt2_checkpoint.py | Download/convert a HuggingFace GPT-2 checkpoint into the gzip-based .npz container expected by pg_llm_import_npz. |
| ingest_tokenizer.py | Load vocab.json and merges.txt tokenizer assets into llm_bpe_vocab/llm_bpe_merges over a PostgreSQL connection. |
| prepare_dataset.py | Tokenize raw text files with the GPT-2 tokenizer and populate llm_dataset with fixed-length (tokens, target) arrays. |

Install the optional Python dependencies with:

pip install transformers torch psycopg[binary]

Examples:

# 1. Convert HuggingFace weights to /mnt/models/gpt2-small.npz
python scripts/convert_gpt2_checkpoint.py --source gpt2 --output /mnt/models/gpt2-small.npz

# 2. Load tokenizer assets into PostgreSQL
python scripts/ingest_tokenizer.py \
  --dsn postgresql://postgres@localhost:5432/postgres \
  --model gpt2-small \
  --vocab /mnt/gpt2/vocab.json \
  --merges /mnt/gpt2/merges.txt --truncate

# 3. Tokenize a corpus and fill llm_dataset
python scripts/prepare_dataset.py \
  --dsn postgresql://postgres@localhost:5432/postgres \
  --tokenizer gpt2 \
  --input /mnt/corpus/*.txt \
  --block-size 1024 --truncate

An end-to-end walkthrough that stitches the helper scripts together is available in docs/python_workflow.md.


Mathematical Fidelity

All core operations follow the official GPT-2 equations:

Attention [ \mathrm{Attn}(x) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V ] with causal masking and learned positional embeddings.

Feed-Forward [ \mathrm{FFN}(x) = \mathrm{GELU}(xW_1 + b_1)W_2 + b_2 ]

LayerNorm [ y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\gamma + \beta ]

Loss [ L = -\log \frac{e^{z_t}}{\sum_j e^{z_j}} ]

Optimizer (AdamW) [ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\ \hat{m}_t &= m_t / (1-\beta_1^t), \quad \hat{v}_t = v_t / (1-\beta_2^t) \\ \theta_t &= \theta_{t-1} - \eta\left(\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon) + \lambda\theta_{t-1}\right) \end{aligned} ]


Example: End-to-End Flow

-- 1. Load model + tokenizer
SELECT pg_llm_import_npz('/mnt/models/gpt2-small.npz','gpt2-small');
SELECT pg_llm_load_bpe_vocab('/mnt/gpt2/vocab.json','gpt2-small');
SELECT pg_llm_load_bpe_merges('/mnt/gpt2/merges.txt','gpt2-small');

-- 2. Encode text
SELECT llm_encode('The database that dreamed of language.','gpt2-small');

-- 3. Generate continuation
SELECT llm_generate('The database that dreamed of language', 40, 0.8, 40, 0.95);

-- 4. Train or fine-tune
SELECT llm_train('gpt2-small', 5000, 12, 12, 768, 50257);

-- 5. Save checkpoint
SELECT llm_checkpoint_save('gpt2-small','finetuned on corpus X');

Performance Notes

  • All tensors are stored as raw BYTEA blobs and processed in-memory.
  • Core kernels (pg_llm_matmul, attention) use a tiled AVX2-aware micro-kernel that falls back to scalar math when SIMD is unavailable, delivering BLAS-class throughput without external dependencies.
  • Attention is evaluated in configurable row chunks (default 64 tokens) so that context matrices never exceed a manageable working set, enabling GPT-2 scale sequence lengths inside Postgres.
  • For large models, raise work_mem/maintenance_work_mem and consider chunking your training data via windowed queries so each step fits inside the executor's memory context (a settings sketch follows this list).
  • Store activations and optimizer scratch data in UNLOGGED tables (e.g., CREATE UNLOGGED TABLE llm_activations (...)) to avoid WAL amplification when materializing large tensors.
  • Autograd tape pruning and gradient accumulation can be parallelized safely within a transaction.
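
A minimal sketch of the session-level settings mentioned in the work_mem bullet above; the values are illustrative starting points, not tuned recommendations.

-- Raise per-operation memory for large tensor materializations (illustrative values).
SET work_mem = '1GB';
SET maintenance_work_mem = '2GB';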

Why Do This?

  • Proof of Concept: show that gradient-based learning can be expressed purely as relational algebra and transaction semantics.
  • Determinism: every computation is replayable and version-controlled.
  • Integration: unifies data, model, and training loop under a single ACID engine.
  • Pedagogy: transparent view into transformer internals, queryable step-by-step.