FREE Reverse Engineering Self-Study Course HERE


TinyGPT

A Pure Rust GPT Implementation From Scratch

A comprehensive tutorial implementation of a GPT (Generative Pre-trained Transformer) language model written entirely in pure Rust using only the ndarray crate for tensor operations. This project demonstrates how to build a working transformer architecture from first principles without relying on deep learning frameworks.


Table of Contents

  1. Introduction
  2. Architecture Overview
  3. Module Deep Dive
  4. Configuration
  5. Building and Running
  6. Testing
  7. Code Coverage
  8. Project Structure
  9. Dependencies
  10. License

Introduction

This project implements a miniature version of the GPT architecture, the same fundamental design behind models like ChatGPT. The goal is educational: to understand every component of a transformer language model by implementing it from scratch in Rust.

Why Rust?

Rust provides memory safety without garbage collection, zero-cost abstractions, and excellent performance. By implementing a neural network in Rust without a framework, we gain deep insight into:

  • How tensor operations work at a low level
  • Memory management in neural networks
  • The actual mathematics behind attention mechanisms
  • Gradient computation and backpropagation

What This Project Teaches

By studying this codebase, you will understand:

  • Embeddings: How discrete tokens become continuous vectors
  • Attention Mechanisms: The core innovation that powers transformers
  • Layer Normalization: How to stabilize training
  • Feed-Forward Networks: Position-wise transformations
  • Autoregressive Generation: How language models generate text token by token

Architecture Overview

┌──────────────────────────────────────────────────────────────┐
│                          TinyGPT                             │
├──────────────────────────────────────────────────────────────┤
│  Input Tokens: [t1, t2, ..., tn]                             │
│         │                                                    │
│         ▼                                                    │
│  ┌─────────────┐   ┌─────────────┐                           │
│  │   Token     │ + │  Position   │                           │
│  │  Embedding  │   │  Embedding  │                           │
│  └─────────────┘   └─────────────┘                           │
│         │                                                    │
│         ▼                                                    │
│  ┌───────────────────────────────────┐                       │
│  │       Transformer Block ×N        │                       │
│  │  ┌─────────────────────────────┐  │                       │
│  │  │  LayerNorm                  │  │                       │
│  │  │      ↓                      │  │                       │
│  │  │  Multi-Head Attention       │  │                       │
│  │  │      ↓                      │  │                       │
│  │  │  + Residual Connection      │  │                       │
│  │  │      ↓                      │  │                       │
│  │  │  LayerNorm                  │  │                       │
│  │  │      ↓                      │  │                       │
│  │  │  Feed-Forward Network       │  │                       │
│  │  │      ↓                      │  │                       │
│  │  │  + Residual Connection      │  │                       │
│  │  └─────────────────────────────┘  │                       │
│  └───────────────────────────────────┘                       │
│         │                                                    │
│         ▼                                                    │
│  ┌─────────────┐                                             │
│  │  LayerNorm  │                                             │
│  └─────────────┘                                             │
│         │                                                    │
│         ▼                                                    │
│  ┌─────────────┐                                             │
│  │ Output Head │  →  Logits [vocab_size]                     │
│  └─────────────┘                                             │
│         │                                                    │
│         ▼                                                    │
│     Softmax → Probabilities → Sample → Next Token            │
└──────────────────────────────────────────────────────────────┘

Module Deep Dive

config.rs - Configuration Management

This module defines the hyperparameters for the model using a JSON configuration file.

#[derive(Deserialize)]
pub struct Config {
    pub block_size: usize,    // Maximum context length (8)
    pub embed_dim: usize,     // Embedding dimension (128)
    pub n_heads: usize,       // Number of attention heads (4)
    pub n_layers: usize,      // Number of transformer blocks (4)
    pub lr: f32,              // Learning rate (0.01)
    pub epochs: usize,        // Training iterations (5000)
    pub batch_size: usize,    // Batch size (16)
}

The configuration file is embedded into the binary at compile time with include_str! and parsed at startup with serde_json. The once_cell::Lazy pattern ensures the configuration is parsed exactly once and shared globally.
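
A minimal sketch of this loading pattern (the exact file path and error handling in the project may differ):

use once_cell::sync::Lazy;

// Embed config.json into the binary at compile time; parse it once on first access.
pub static CFG: Lazy<Config> = Lazy::new(|| {
    serde_json::from_str(include_str!("../config.json")).expect("invalid config.json")
});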

math.rs - Mathematical Utilities

This module provides the core mathematical operations.

Random initialization (randn): generates weights from a scaled normal distribution; the scaling factor of 0.02 keeps initial weights small and helps prevent exploding gradients.

pub fn randn(r: usize, c: usize) -> Array2<f32> {
    Array2::from_shape_fn((r, c), |_| rng().sample::<f32, _>(StandardNormal) * 0.02)
}

Softmax (softmax, softmax1d): converts logits to probabilities, subtracting the maximum value before exponentiation for numerical stability.

Layer normalization (layer_norm): normalizes activations across the feature dimension with learnable scale ($\gamma$) and shift ($\beta$) parameters.
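
For reference, a numerically stable 1-D softmax along these lines could look as follows (a sketch, not necessarily the exact code in math.rs):

use ndarray::Array1;

pub fn softmax1d(x: &Array1<f32>) -> Array1<f32> {
    // Subtract the maximum before exponentiating so large logits cannot overflow.
    let m = x.fold(f32::NEG_INFINITY, |a, &b| a.max(b));
    let e = x.mapv(|v| (v - m).exp());
    let s = e.sum();
    e / s
}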

sampling.rs - Token Sampling

Implements categorical sampling for text generation:

pub fn sample(probs: &Array1<f32>, rng: &mut ThreadRng) -> usize {
    let (r, mut c) = (rng.random::<f32>(), 0.0);
    probs.iter().position(|&p| { c += p; r < c }).unwrap_or(0)
}

This walks through the cumulative distribution and returns the first index at which the running sum exceeds the random value, effectively sampling an index in proportion to its probability.
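
For example, with an illustrative three-token distribution (assumes the sample function above is in scope):

use ndarray::arr1;

let probs = arr1(&[0.1f32, 0.7, 0.2]);
let idx = sample(&probs, &mut rand::rng()); // idx is usually 1, the most probable token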

vocab.rs - Vocabulary Management

Handles tokenization at the word level:

  • from_corpus: Builds word-to-index and index-to-word mappings from training text
  • encode: Converts a word to its vocabulary index
  • decode: Converts a sequence of indices back to text

Each sentence in the corpus has an <END> token appended to mark sentence boundaries.
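
One plausible shape for this type is sketched below (the field and method layout here is an assumption, not necessarily the project's exact API):

use std::collections::HashMap;

pub struct Vocab {
    pub stoi: HashMap<String, usize>, // word -> index
    pub itos: Vec<String>,            // index -> word
}

impl Vocab {
    pub fn from_corpus(sentences: &[String]) -> Self {
        let (mut stoi, mut itos) = (HashMap::new(), Vec::new());
        for s in sentences {
            // Word-level tokens, with <END> appended to mark the sentence boundary.
            for w in s.split_whitespace().chain(std::iter::once("<END>")) {
                if !stoi.contains_key(w) {
                    stoi.insert(w.to_string(), itos.len());
                    itos.push(w.to_string());
                }
            }
        }
        Self { stoi, itos }
    }
}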

data.rs - Data Loading

The DataLoader generates random training batches:

pub fn get_batch(&self, rng: &mut ThreadRng) -> (Array2<usize>, Array2<usize>) {
    // For each batch item, sample a random starting position
    // xb contains input tokens, yb contains target tokens (shifted by 1)
}

The target for each position is the next token in the sequence, enabling the model to learn next-token prediction.
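
A free-function sketch of this batching logic (assuming the corpus has been flattened into a single slice of token ids; names are illustrative):

use ndarray::Array2;
use rand::{rngs::ThreadRng, Rng};

pub fn get_batch(
    data: &[usize],
    block_size: usize,
    batch_size: usize,
    rng: &mut ThreadRng,
) -> (Array2<usize>, Array2<usize>) {
    let mut xb = Array2::zeros((batch_size, block_size));
    let mut yb = Array2::zeros((batch_size, block_size));
    for b in 0..batch_size {
        // Random starting position, leaving room for the input window plus one target.
        let start = rng.random_range(0..data.len() - block_size);
        for t in 0..block_size {
            xb[[b, t]] = data[start + t];
            yb[[b, t]] = data[start + t + 1]; // target is the next token
        }
    }
    (xb, yb)
}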

linear.rs - Linear Layer

Implements a fully connected layer with:

  • Forward pass: $y = xW + b$
  • Gradient storage: dw and db accumulate gradients during backpropagation
  • Parameter update: step(lr) applies gradient descent
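
A minimal sketch of such a layer (field names and the exact update code are assumptions; the backward pass is omitted here):

use ndarray::{Array1, Array2};

pub struct Linear {
    pub w: Array2<f32>,  // weights
    pub b: Array1<f32>,  // bias
    pub dw: Array2<f32>, // gradient w.r.t. w
    pub db: Array1<f32>, // gradient w.r.t. b
}

impl Linear {
    pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
        x.dot(&self.w) + &self.b // y = xW + b, bias broadcast across rows
    }

    pub fn step(&mut self, lr: f32) {
        // Plain gradient descent on the accumulated gradients.
        self.w = &self.w - &self.dw.mapv(|g| lr * g);
        self.b = &self.b - &self.db.mapv(|g| lr * g);
    }
}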

embedding.rs - Embedding Layer

Provides learnable token embeddings:

pub fn forward(&self, idx: &[usize]) -> Array2<f32> {
    Array2::from_shape_fn((idx.len(), self.w.shape()[1]), |(i, j)| self.w[[idx[i], j]])
}

This gathers rows from the embedding matrix corresponding to the input token indices.

layer_norm.rs - Layer Normalization

A thin wrapper around the layer_norm math function with learnable $\gamma$ (initialized to 1) and $\beta$ (initialized to 0) parameters.
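
As a sketch (the module path and signature of the layer_norm helper are assumed):

use ndarray::{Array1, Array2};
use crate::math::layer_norm; // assumed path to the math helper described earlier

pub struct LayerNorm {
    pub gamma: Array1<f32>, // scale, initialized to ones
    pub beta: Array1<f32>,  // shift, initialized to zeros
}

impl LayerNorm {
    pub fn new(dim: usize) -> Self {
        Self { gamma: Array1::ones(dim), beta: Array1::zeros(dim) }
    }

    pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
        layer_norm(x, &self.gamma, &self.beta)
    }
}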

head.rs - Single Attention Head

Implements scaled dot-product attention with causal masking:

pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
    let (k, q, v) = (self.key.forward(x), self.query.forward(x), self.value.forward(x));
    let mut w = q.dot(&k.t()) / (self.hs as f32).sqrt();
    // Apply causal mask over the t x t attention scores (t = sequence length)
    let t = x.shape()[0];
    for i in 0..t {
        for j in i + 1..t {
            w[[i, j]] = f32::NEG_INFINITY;
        }
    }
    softmax(&w).dot(&v)
}

The causal mask ensures position $i$ can only attend to positions $\leq i$.

attention.rs - Multi-Head Attention

Runs multiple attention heads in parallel and projects their concatenated outputs:

pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
    let o: Vec<_> = self.heads.iter().map(|h| h.forward(x)).collect();
    let v: Vec<_> = o.iter().map(|a| a.view()).collect();
    self.proj.forward(&ndarray::concatenate(Axis(1), &v).unwrap())
}

feed_forward.rs - Feed-Forward Network

Position-wise feed-forward network with ReLU activation:

pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
    self.l2.forward(&self.l1.forward(x).mapv(|v| v.max(0.0)))
}

The hidden dimension is 4× the embedding dimension, following the original transformer paper.
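
For instance, construction might look like this (Linear::new is assumed here to take input and output dimensions):

// Hidden layer is 4x the embedding width, as in the original transformer paper.
pub fn new(embed_dim: usize) -> Self {
    Self {
        l1: Linear::new(embed_dim, 4 * embed_dim),
        l2: Linear::new(4 * embed_dim, embed_dim),
    }
}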

block.rs - Transformer Block

Combines attention and feed-forward with residual connections (pre-norm architecture):

pub fn forward(&self, x: &Array2<f32>) -> Array2<f32> {
    let x = x + &self.sa.forward(&self.ln1.forward(x));
    &x + &self.ffn.forward(&self.ln2.forward(&x))
}

tiny_gpt.rs - Complete Model

The full GPT model combining all components:

  1. Token + Position Embeddings: Sum of token and positional embeddings
  2. Transformer Blocks: Stack of N transformer blocks
  3. Final LayerNorm: Stabilizes outputs before projection
  4. Output Head: Projects to vocabulary size for next-token prediction

The backward method implements simplified backpropagation through the output head and embeddings, computing the gradients used for gradient descent. The generate method implements autoregressive generation:

pub fn generate(&self, start: usize, n: usize, rng: &mut ThreadRng) -> Vec<usize> {
    let mut out = vec![start];
    for _ in 0..n {
        let ctx: Vec<_> = out.iter().rev().take(CFG.block_size).rev().copied().collect();
        let logits = self.forward(&ctx);
        out.push(sample(&softmax1d(&logits.row(logits.shape()[0] - 1)), rng));
    }
    out
}

trainer.rs - Training Loop

Orchestrates the training process:

pub fn train_steps(&mut self, steps: usize, rng: &mut ThreadRng) {
    for step in 0..steps {
        let (xb, yb) = self.loader.get_batch(rng);
        self.model.zero_grad();
        self.model.backward(&xb, &yb);
        self.model.step(CFG.lr);
    }
}

Each step: get batch → zero gradients → compute gradients → update parameters.

main.rs - Entry Point

Ties everything together:

  1. Load corpus from corpus.json
  2. Build vocabulary
  3. Create trainer with model and data
  4. Train for configured epochs
  5. Generate sample text
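
For example, the first two steps might look roughly like this (assuming corpus.json is a JSON array of sentences and a Vocab type like the sketch above):

let corpus: Vec<String> =
    serde_json::from_str(include_str!("../corpus.json")).expect("invalid corpus.json");
let vocab = Vocab::from_corpus(&corpus);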

Configuration

config.json

{
  "block_size": 8,
  "embed_dim": 128,
  "n_heads": 4,
  "n_layers": 4,
  "lr": 0.01,
  "epochs": 5000,
  "batch_size": 16
}

Parameter    Description                               Value
block_size   Maximum context window for attention      8 tokens
embed_dim    Dimension of token/position embeddings    128
n_heads      Number of parallel attention heads        4
n_layers     Number of transformer blocks              4
lr           Learning rate for gradient descent        0.01
epochs       Number of training iterations             5000
batch_size   Number of sequences per batch             16

corpus.json

Contains 50 sentences focused on Reverse Engineering concepts. Each sentence is tokenized at the word level, with <END> tokens appended.

Building and Running

Prerequisites

  • Rust 1.70+ (install via rustup)
  • Cargo (included with Rust)

Build

# Debug build
cargo build

# Release build (optimized, recommended)
cargo build --release

Run

# Run debug build
cargo run

# Run release build (faster)
cargo run --release

Expected Output

TinyGPT
Step 0, loss=5.1234
Step 300, loss=4.2345
Step 600, loss=3.5678
...
Step 4800, loss=2.1234

generated text:
the binary analysis requires understanding of assembly code <END> reverse engineering

The loss should decrease over training as the model learns patterns in the corpus.


Testing

This project includes comprehensive unit tests for all modules, achieving 95%+ code coverage.

Run All Tests

cargo test

Run Tests with Output

cargo test -- --nocapture

Run Specific Test

cargo test test_linear_forward

Run Tests for Specific Module

cargo test math::tests
cargo test linear::tests
cargo test tiny_gpt::tests

Test Summary

Module           Tests   Coverage
math.rs          13      100%
sampling.rs      5       100%
linear.rs        7       100%
embedding.rs     8       100%
layer_norm.rs    8       100%
head.rs          7       100%
attention.rs     8       100%
feed_forward.rs  8       100%
block.rs         8       100%
tiny_gpt.rs      12      100%
vocab.rs         10      100%
data.rs          5       100%
config.rs        9       100%
trainer.rs       11      89%
Total            123     95.30%

Code Coverage

Install Tarpaulin

cargo install cargo-tarpaulin

Run Coverage

# Output to terminal
cargo tarpaulin --out Stdout

# Generate HTML report
cargo tarpaulin --out Html

Coverage Report

|| Tested/Total Lines:
|| src/attention.rs: 14/14
|| src/block.rs: 14/14
|| src/config.rs: 1/1
|| src/data.rs: 11/11
|| src/embedding.rs: 9/9
|| src/feed_forward.rs: 11/11
|| src/head.rs: 23/23
|| src/layer_norm.rs: 5/5
|| src/linear.rs: 13/13
|| src/math.rs: 18/18
|| src/sampling.rs: 6/6
|| src/tiny_gpt.rs: 67/67
|| src/vocab.rs: 15/15
|| src/trainer.rs: 16/18
|| 
|| 95.30% coverage, 223/234 lines covered

The uncovered lines are:

  • main.rs: Entry point (standard practice not to unit test)
  • trainer.rs: The train() wrapper that runs for 5000 epochs

Project Structure

rust_gpt/
├── Cargo.toml          # Dependencies and project metadata
├── config.json         # Model hyperparameters
├── corpus.json         # Training data (50 RE-focused sentences)
├── README.md           # This file
└── src/
    ├── main.rs         # Application entry point
    ├── config.rs       # Configuration loading
    ├── math.rs         # Mathematical utilities (randn, softmax, layer_norm)
    ├── sampling.rs     # Probability sampling for generation
    ├── vocab.rs        # Vocabulary management (encode/decode)
    ├── data.rs         # Batch data loading
    ├── linear.rs       # Linear (fully connected) layer
    ├── embedding.rs    # Token/position embedding layer
    ├── layer_norm.rs   # Layer normalization module
    ├── head.rs         # Single attention head
    ├── attention.rs    # Multi-head attention
    ├── feed_forward.rs # Position-wise feed-forward network
    ├── block.rs        # Complete transformer block
    ├── tiny_gpt.rs     # Full TinyGPT model
    └── trainer.rs      # Training loop and generation

Dependencies

[dependencies]
ndarray = { version = "0.17.1", features = ["rayon"] }
rand = "0.9.2"
rand_distr = "0.5.1"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
once_cell = "1.19"

Crate        Purpose
ndarray      N-dimensional arrays for tensor operations
rand         Random number generation
rand_distr   Statistical distributions (Normal)
serde        Serialization framework
serde_json   JSON parsing
once_cell    Lazy static initialization

License

MIT