A comprehensive tutorial implementation of a GPT (Generative Pre-trained Transformer) language model written entirely in pure Rust using only the ndarray crate for tensor operations. This project demonstrates how to build a working transformer architecture from first principles without relying on deep learning frameworks.
- Introduction
- Architecture Overview
- Module Deep Dive
- Configuration
- Building and Running
- Testing
- Code Coverage
- License
This project implements a miniature version of the GPT architecture, the same fundamental design behind models like ChatGPT. The goal is educational: to understand every component of a transformer language model by implementing it from scratch in Rust.
Rust provides memory safety without garbage collection, zero-cost abstractions, and excellent performance. By implementing a neural network in Rust without a framework, we gain deep insight into:
- How tensor operations work at a low level
- Memory management in neural networks
- The actual mathematics behind attention mechanisms
- Gradient computation and backpropagation
By studying this codebase, you will understand:
- Embeddings: How discrete tokens become continuous vectors
- Attention Mechanisms: The core innovation that powers transformers
- Layer Normalization: How to stabilize training
- Feed-Forward Networks: Position-wise transformations
- Autoregressive Generation: How language models generate text token by token
```
┌──────────────────────────────────────────────────┐
│                     TinyGPT                      │
├──────────────────────────────────────────────────┤
│  Input Tokens: [t1, t2, ..., tn]                 │
│         │                                        │
│         ▼                                        │
│  ┌───────────┐       ┌───────────┐               │
│  │   Token   │   +   │ Position  │               │
│  │ Embedding │       │ Embedding │               │
│  └───────────┘       └───────────┘               │
│         │                                        │
│         ▼                                        │
│  ┌─────────────────────────────┐                 │
│  │    Transformer Block ×N     │                 │
│  │  ┌───────────────────────┐  │                 │
│  │  │ LayerNorm             │  │                 │
│  │  │   ↓                   │  │                 │
│  │  │ Multi-Head Attention  │  │                 │
│  │  │   ↓                   │  │                 │
│  │  │ + Residual Connection │  │                 │
│  │  │   ↓                   │  │                 │
│  │  │ LayerNorm             │  │                 │
│  │  │   ↓                   │  │                 │
│  │  │ Feed-Forward Network  │  │                 │
│  │  │   ↓                   │  │                 │
│  │  │ + Residual Connection │  │                 │
│  │  └───────────────────────┘  │                 │
│  └─────────────────────────────┘                 │
│         │                                        │
│         ▼                                        │
│  ┌───────────┐                                   │
│  │ LayerNorm │                                   │
│  └───────────┘                                   │
│         │                                        │
│         ▼                                        │
│  ┌─────────────┐                                 │
│  │ Output Head │ → Logits [vocab_size]           │
│  └─────────────┘                                 │
│         │                                        │
│         ▼                                        │
│  Softmax → Probabilities → Sample → Next Token   │
└──────────────────────────────────────────────────┘
```
This module defines the hyperparameters for the model using a JSON configuration file.
```rust
#[derive(Deserialize)]
pub struct Config {
    pub block_size: usize, // Maximum context length (8)
    pub embed_dim: usize,  // Embedding dimension (128)
    pub n_heads: usize,    // Number of attention heads (4)
    pub n_layers: usize,   // Number of transformer blocks (4)
    pub lr: f32,           // Learning rate (0.01)
    pub epochs: usize,     // Training iterations (5000)
    pub batch_size: usize, // Batch size (16)
}
```

The configuration is embedded at compile time using `include_str!` and parsed with `serde_json`. The `once_cell::Lazy` pattern ensures the configuration is parsed exactly once and shared globally.
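The parse-once, share-globally pattern can be sketched with only the standard library; this simplified stand-in uses `std::sync::OnceLock` instead of `once_cell::Lazy` and hard-codes two fields rather than parsing JSON (field names mirror `config.json`):

```rust
use std::sync::OnceLock;

// Simplified stand-in for the project's Config struct (two fields only).
#[derive(Debug)]
pub struct Config {
    pub block_size: usize,
    pub embed_dim: usize,
}

static CFG: OnceLock<Config> = OnceLock::new();

// The first call initializes the global; every later call returns the same reference.
pub fn cfg() -> &'static Config {
    CFG.get_or_init(|| Config { block_size: 8, embed_dim: 128 })
}
```

`once_cell::Lazy` adds `Deref` sugar on top of the same idea, which is why the project can write `CFG.block_size` directly.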
This module provides core mathematical operations:
Random Initialization (randn): Generates weights from a scaled normal distribution. The scaling factor of 0.02 prevents exploding gradients at initialization.
```rust
pub fn randn(r: usize, c: usize) -> Array2<f32> {
    Array2::from_shape_fn((r, c), |_| rng().sample::<f32, _>(StandardNormal) * 0.02)
}
```

Softmax (`softmax`, `softmax1d`): Converts logits to probabilities, with numerical stability ensured by subtracting the maximum value before exponentiation.
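The max-subtraction trick can be shown on a plain slice (a simplified sketch; the project's versions operate on ndarray arrays):

```rust
// Numerically stable softmax over a slice of logits.
// Subtracting the max makes every exponent <= 0, so exp() cannot overflow.
pub fn softmax_slice(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

Because softmax is invariant to adding a constant to all logits, the shift changes nothing mathematically while keeping `exp` in range even for logits in the thousands.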
Layer Normalization (`layer_norm`): Normalizes activations across the feature dimension, with a learnable scale (gamma) and shift (beta) applied after normalization.
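A simplified per-row sketch of the same computation on plain slices (the gamma/beta argument shapes here are assumptions, not the project's exact signature):

```rust
// Normalize one feature vector to zero mean and unit variance,
// then apply the learnable scale (gamma) and shift (beta).
pub fn layer_norm_row(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|&v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(&v, (&g, &b))| (v - mean) * inv_std * g + b)
        .collect()
}
```

The `eps` term guards against division by zero when a row has (near-)constant features.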
Implements categorical sampling for text generation:
```rust
pub fn sample(probs: &Array1<f32>, rng: &mut ThreadRng) -> usize {
    let (r, mut c) = (rng.random::<f32>(), 0.0);
    probs.iter().position(|&p| { c += p; r < c }).unwrap_or(0)
}
```

This walks the cumulative distribution until it exceeds the random value, effectively sampling an index from the probability distribution.
Handles tokenization at the word level:
- `from_corpus`: Builds word-to-index and index-to-word mappings from training text
- `encode`: Converts a word to its vocabulary index
- `decode`: Converts a sequence of indices back to text

Each sentence in the corpus is appended with an `<END>` token to mark sentence boundaries.
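A minimal sketch of this word-level scheme (a hypothetical stand-alone `Vocab`, not the project's exact API):

```rust
use std::collections::HashMap;

// Word-level vocabulary: bidirectional word <-> index mapping.
pub struct Vocab {
    word_to_idx: HashMap<String, usize>,
    idx_to_word: Vec<String>,
}

impl Vocab {
    // Build from sentences, appending "<END>" after each one.
    pub fn from_corpus(sentences: &[&str]) -> (Self, Vec<usize>) {
        let mut v = Vocab { word_to_idx: HashMap::new(), idx_to_word: Vec::new() };
        let mut tokens = Vec::new();
        for s in sentences {
            for w in s.split_whitespace().chain(std::iter::once("<END>")) {
                tokens.push(v.encode(w));
            }
        }
        (v, tokens)
    }

    // Look up a word, assigning the next free index on first sight.
    pub fn encode(&mut self, w: &str) -> usize {
        if let Some(&i) = self.word_to_idx.get(w) {
            return i;
        }
        let i = self.idx_to_word.len();
        self.word_to_idx.insert(w.to_string(), i);
        self.idx_to_word.push(w.to_string());
        i
    }

    pub fn decode(&self, ids: &[usize]) -> String {
        ids.iter().map(|&i| self.idx_to_word[i].as_str()).collect::<Vec<_>>().join(" ")
    }
}
```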
The DataLoader generates random training batches:
```rust
pub fn get_batch(&self, rng: &mut ThreadRng) -> (Array2<usize>, Array2<usize>) {
    // For each batch item, sample a random starting position
    // xb contains input tokens, yb contains target tokens (shifted by 1)
}
```

The target for each position is the next token in the sequence, enabling the model to learn next-token prediction.
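The shifted-by-one target scheme can be sketched on a plain token vector (a simplified stand-in for a single batch item, not the project's `DataLoader`):

```rust
// Build one (input, target) pair of length `t` starting at `start`.
// The target sequence is the input shifted one token to the right,
// so position i of `x` is trained to predict position i of `y`.
pub fn make_pair(tokens: &[usize], start: usize, t: usize) -> (Vec<usize>, Vec<usize>) {
    let x = tokens[start..start + t].to_vec();
    let y = tokens[start + 1..start + t + 1].to_vec();
    (x, y)
}
```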
Implements a fully connected layer with:
- Forward pass: $y = xW + b$
- Gradient storage: `dw` and `db` accumulate gradients during backpropagation
- Parameter update: `step(lr)` applies gradient descent
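The forward pass $y = xW + b$ can be sketched on plain nested `Vec`s for one input row (the project uses ndarray's `dot`; names and shapes here are assumptions):

```rust
// y = xW + b for a single input row.
// x: [in_dim], w: [in_dim][out_dim] (row-major), b: [out_dim]
pub fn linear_forward(x: &[f32], w: &[Vec<f32>], b: &[f32]) -> Vec<f32> {
    let out_dim = b.len();
    let mut y = b.to_vec(); // start from the bias
    for (xi, wrow) in x.iter().zip(w) {
        for j in 0..out_dim {
            y[j] += xi * wrow[j];
        }
    }
    y
}
```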
Provides learnable token embeddings:
```rust
pub fn forward(&self, idx: &[usize]) -> Array2<f32> {
    Array2::from_shape_fn((idx.len(), self.w.shape()[1]), |(i, j)| self.w[[idx[i], j]])
}
```

This gathers rows from the embedding matrix corresponding to the input token indices.
A thin wrapper around the `layer_norm` math function with learnable scale and shift parameters.
Implements scaled dot-product attention with causal masking:
```rust
pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
    let (k, q, v) = (self.key.forward(x), self.query.forward(x), self.value.forward(x));
    let mut w = q.dot(&k.t()) / (self.hs as f32).sqrt();
    // Apply causal mask over the t×t attention matrix
    let t = x.shape()[0];
    for i in 0..t {
        for j in i + 1..t {
            w[[i, j]] = f32::NEG_INFINITY;
        }
    }
    softmax(&w).dot(&v)
}
```

The causal mask ensures position *i* can only attend to positions *j ≤ i*, so the model cannot peek at future tokens.
Runs multiple attention heads in parallel and projects their concatenated outputs:
```rust
pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
    let o: Vec<_> = self.heads.iter().map(|h| h.forward(x)).collect();
    let v: Vec<_> = o.iter().map(|a| a.view()).collect();
    self.proj.forward(&ndarray::concatenate(Axis(1), &v).unwrap())
}
```

Position-wise feed-forward network with ReLU activation:
```rust
pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
    self.l2.forward(&self.l1.forward(x).mapv(|v| v.max(0.0)))
}
```

The hidden dimension is 4× the embedding dimension, following the original transformer paper.
Combines attention and feed-forward with residual connections (pre-norm architecture):
```rust
pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
    let x = x + &self.sa.forward(&self.ln1.forward(x));
    &x + &self.ffn.forward(&self.ln2.forward(&x))
}
```

The full GPT model combining all components:
- Token + Position Embeddings: Sum of token and positional embeddings
- Transformer Blocks: Stack of N transformer blocks
- Final LayerNorm: Stabilizes outputs before projection
- Output Head: Projects to vocabulary size for next-token prediction
The `backward` method implements simplified backpropagation through the output head and embeddings, computing the gradients used for gradient descent. The `generate` method implements autoregressive generation:
```rust
pub fn generate(&self, start: usize, n: usize, rng: &mut ThreadRng) -> Vec<usize> {
    let mut out = vec![start];
    for _ in 0..n {
        let ctx: Vec<_> = out.iter().rev().take(CFG.block_size).rev().copied().collect();
        let logits = self.forward(&ctx);
        out.push(sample(&softmax1d(&logits.row(logits.shape()[0] - 1)), rng));
    }
    out
}
```

Orchestrates the training process:
```rust
pub fn train_steps(&mut self, steps: usize, rng: &mut ThreadRng) {
    for step in 0..steps {
        let (xb, yb) = self.loader.get_batch(rng);
        self.model.zero_grad();
        self.model.backward(&xb, &yb);
        self.model.step(CFG.lr);
    }
}
```

Each step: get batch → zero gradients → compute gradients → update parameters.
Ties everything together:
- Load corpus from
corpus.json - Build vocabulary
- Create trainer with model and data
- Train for configured epochs
- Generate sample text
```json
{
  "block_size": 8,
  "embed_dim": 128,
  "n_heads": 4,
  "n_layers": 4,
  "lr": 0.01,
  "epochs": 5000,
  "batch_size": 16
}
```

| Parameter | Description | Value |
|---|---|---|
| `block_size` | Maximum context window for attention | 8 tokens |
| `embed_dim` | Dimension of token/position embeddings | 128 |
| `n_heads` | Number of parallel attention heads | 4 |
| `n_layers` | Number of transformer blocks | 4 |
| `lr` | Learning rate for gradient descent | 0.01 |
| `epochs` | Number of training iterations | 5000 |
| `batch_size` | Number of sequences per batch | 16 |
Contains 50 sentences focused on Reverse Engineering concepts. Each sentence is tokenized at the word level, with <END> tokens appended.
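The corpus file is not reproduced here; a plausible shape, assuming a simple JSON array of sentences (hypothetical entries, and the actual schema may differ):

```json
[
  "static analysis examines a binary without executing it",
  "the debugger sets a breakpoint at the function entry"
]
```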
- Rust 1.70+ (install via rustup)
- Cargo (included with Rust)
```sh
# Debug build
cargo build

# Release build (optimized, recommended)
cargo build --release
```

```sh
# Run debug build
cargo run

# Run release build (faster)
cargo run --release
```

Expected output:

```
TinyGPT
Step 0, loss=5.1234
Step 300, loss=4.2345
Step 600, loss=3.5678
...
Step 4800, loss=2.1234
generated text:
the binary analysis requires understanding of assembly code <END> reverse engineering
```

The loss should decrease over training as the model learns patterns in the corpus.
This project includes comprehensive unit tests for all modules, achieving 95%+ code coverage.
```sh
# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run a single test
cargo test test_linear_forward

# Run tests for specific modules
cargo test math::tests
cargo test linear::tests
cargo test tiny_gpt::tests
```

| Module | Tests | Coverage |
|---|---|---|
| `math.rs` | 13 | 100% |
| `sampling.rs` | 5 | 100% |
| `linear.rs` | 7 | 100% |
| `embedding.rs` | 8 | 100% |
| `layer_norm.rs` | 8 | 100% |
| `head.rs` | 7 | 100% |
| `attention.rs` | 8 | 100% |
| `feed_forward.rs` | 8 | 100% |
| `block.rs` | 8 | 100% |
| `tiny_gpt.rs` | 12 | 100% |
| `vocab.rs` | 10 | 100% |
| `data.rs` | 5 | 100% |
| `config.rs` | 9 | 100% |
| `trainer.rs` | 11 | 89% |
| **Total** | **123** | **95.30%** |
```sh
cargo install cargo-tarpaulin
```

```sh
# Output to terminal
cargo tarpaulin --out Stdout

# Generate HTML report
cargo tarpaulin --out Html
```

```
|| Tested/Total Lines:
|| src/attention.rs: 14/14
|| src/block.rs: 14/14
|| src/config.rs: 1/1
|| src/data.rs: 11/11
|| src/embedding.rs: 9/9
|| src/feed_forward.rs: 11/11
|| src/head.rs: 23/23
|| src/layer_norm.rs: 5/5
|| src/linear.rs: 13/13
|| src/math.rs: 18/18
|| src/sampling.rs: 6/6
|| src/tiny_gpt.rs: 67/67
|| src/vocab.rs: 15/15
|| src/trainer.rs: 16/18
||
|| 95.30% coverage, 223/234 lines covered
```
The uncovered lines are:
- `main.rs`: Entry point (standard practice not to unit test)
- `trainer.rs`: The `train()` wrapper that runs for 5000 epochs
```
rust_gpt/
├── Cargo.toml            # Dependencies and project metadata
├── config.json           # Model hyperparameters
├── corpus.json           # Training data (50 RE-focused sentences)
├── README.md             # This file
└── src/
    ├── main.rs           # Application entry point
    ├── config.rs         # Configuration loading
    ├── math.rs           # Mathematical utilities (randn, softmax, layer_norm)
    ├── sampling.rs       # Probability sampling for generation
    ├── vocab.rs          # Vocabulary management (encode/decode)
    ├── data.rs           # Batch data loading
    ├── linear.rs         # Linear (fully connected) layer
    ├── embedding.rs      # Token/position embedding layer
    ├── layer_norm.rs     # Layer normalization module
    ├── head.rs           # Single attention head
    ├── attention.rs      # Multi-head attention
    ├── feed_forward.rs   # Position-wise feed-forward network
    ├── block.rs          # Complete transformer block
    ├── tiny_gpt.rs       # Full TinyGPT model
    └── trainer.rs        # Training loop and generation
```
```toml
[dependencies]
ndarray = { version = "0.17.1", features = ["rayon"] }
rand = "0.9.2"
rand_distr = "0.5.1"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
once_cell = "1.19"
```

| Crate | Purpose |
|---|---|
| `ndarray` | N-dimensional arrays for tensor operations |
| `rand` | Random number generation |
| `rand_distr` | Statistical distributions (Normal) |
| `serde` | Serialization framework |
| `serde_json` | JSON parsing |
| `once_cell` | Lazy static initialization |
- Attention Is All You Need - Original Transformer paper
- Language Models are Unsupervised Multitask Learners - GPT-2 paper
- The Illustrated Transformer - Visual explanation
- minGPT - Andrej Karpathy's minimal GPT implementation
