This project implements a Bigram Language Model built on a Transformer architecture in PyTorch, trained on the TinyStories dataset (https://arxiv.org/abs/2305.07759). The model predicts the next character in a sequence based on the context provided by the preceding characters, leveraging multi-head self-attention and feedforward neural networks.
- Token and position embeddings
- Multi-head self-attention mechanism
- Feedforward neural network layers
- Layer normalization
- Dropout for regularization
- Character-level text generation
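As a rough illustration of how the token and position embeddings listed above are typically combined in this kind of model, here is a minimal sketch; the tensor names and example values are illustrative assumptions, not code copied from the script:

```python
import torch
import torch.nn as nn

# Illustrative values; the actual defaults are listed under Hyperparameters below.
vocab_size, n_embd, block_size = 65, 128, 128

token_embedding = nn.Embedding(vocab_size, n_embd)        # one vector per character id
position_embedding = nn.Embedding(block_size, n_embd)     # one vector per position in the context

idx = torch.randint(0, vocab_size, (4, block_size))       # (batch, time) integer character ids
tok_emb = token_embedding(idx)                            # (batch, time, n_embd)
pos_emb = position_embedding(torch.arange(block_size))    # (time, n_embd), broadcast over batch
x = tok_emb + pos_emb                                     # input fed to the Transformer blocks
```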
- Python 3.x
- PyTorch 1.7 or later
- CUDA (optional, for GPU acceleration)
- `input.txt`: The input text file used for training the language model (from TinyStories: https://arxiv.org/abs/2305.07759).
- `main.py`: The main script containing the implementation of the model and training loop.
The following hyperparameters can be adjusted to tune the model:
- `batch_size`: Number of sequences processed in parallel (default: 2048)
- `block_size`: Maximum context length for predictions (default: 128)
- `max_iters`: Number of training iterations (default: 1000)
- `eval_interval`: Interval for evaluating the model on validation data (default: 100)
- `learning_rate`: Learning rate for the optimizer (default: 1e-3)
- `eval_iters`: Number of iterations for evaluation (default: 200)
- `n_embd`: Dimensionality of the embeddings (default: 128)
- `n_head`: Number of attention heads (default: 4)
- `n_layer`: Number of Transformer blocks (default: 4)
- `dropout`: Dropout rate (default: 0.0)
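In scripts of this style the hyperparameters above are usually defined as module-level constants near the top of the file. A minimal sketch, using the names and defaults from the list; the `device` selection line is an assumption for optional GPU use:

```python
import torch

batch_size = 2048      # number of sequences processed in parallel
block_size = 128       # maximum context length for predictions
max_iters = 1000       # number of training iterations
eval_interval = 100    # evaluate on validation data every this many steps
learning_rate = 1e-3   # learning rate for the optimizer
eval_iters = 200       # number of batches averaged per evaluation
n_embd = 128           # dimensionality of the embeddings
n_head = 4             # number of attention heads
n_layer = 4            # number of Transformer blocks
dropout = 0.0          # dropout rate

device = 'cuda' if torch.cuda.is_available() else 'cpu'  # CUDA is optional
```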
- Place your training text in a file named `input.txt`.
- Ensure that `input.txt` is in the same directory as `train.py`.
To train the model and generate text, simply run:
`python train.py`
The script will:
- Read the input text from `input.txt`.
- Encode the text into integer sequences.
- Split the data into training and validation sets.
- Train the model for the specified number of iterations.
- Periodically evaluate the model on the validation set and print the losses.
- Generate text samples at regular intervals during training.
- Print a final text sample after training is complete.
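The steps above correspond to a fairly standard character-level training loop. A minimal sketch of what such a loop might look like, assuming the `model`, `get_batch`, `estimate_loss`, and `decode` names described below along with the hyperparameter constants sketched earlier; this is illustrative, not the exact code from the script:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):
    # Periodically estimate train/val loss and print a progress line.
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch, compute the cross-entropy loss, and take an optimizer step.
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# Generate a final sample, starting from a single zero token as context.
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))
```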
- The input text is read from `input.txt`.
- Unique characters are extracted to create a vocabulary.
- Characters are mapped to integers for model processing.
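A minimal sketch of this character-level preprocessing; the helper names `stoi`, `itos`, `encode`, and `decode`, as well as the 90/10 train/validation split ratio, are conventional assumptions rather than details taken from the script:

```python
import torch

with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(set(text))            # unique characters form the vocabulary
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}    # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}    # integer -> character
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))             # assumed 90/10 split into train and validation
train_data, val_data = data[:n], data[n:]
```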
- `get_batch(split)`: Generates batches of input and target sequences for training and validation.
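A sketch of how such a batching function typically works, assuming the `train_data`/`val_data` tensors and the hyperparameter constants sketched above; the exact implementation in the script may differ:

```python
import torch

def get_batch(split):
    """Sample a random batch of contexts (x) and next-character targets (y)."""
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # (batch, block_size)
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets shifted by one character
    return x.to(device), y.to(device)
```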
- `Head`: Implements a single head of self-attention.
- `MultiHeadAttention`: Combines multiple heads of self-attention.
- `FeedFoward`: A feedforward neural network layer.
- `Block`: A single Transformer block, combining self-attention and feedforward layers.
- `BigramLanguageModel`: The main language model combining embedding layers, Transformer blocks, and an output layer.
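For orientation, here is a minimal sketch of what a single causal self-attention head (`Head`) usually looks like in this style of model, assuming the hyperparameter constants above; the exact implementation in the script may differ:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal (masked) self-attention."""

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask so each position attends only to earlier positions.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5          # scaled attention scores (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                                # weighted sum of values
```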
- The model is trained using the AdamW optimizer.
- Loss is computed using cross-entropy.
- The `estimate_loss()` function evaluates the model on both training and validation sets.
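A typical shape for such an evaluation helper, averaging the loss over `eval_iters` batches with gradients disabled; a sketch under the assumptions above, not the verbatim function:

```python
import torch

@torch.no_grad()
def estimate_loss():
    """Average the loss over eval_iters batches for both splits."""
    out = {}
    model.eval()                      # disable dropout during evaluation
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()                     # switch back to training mode
    return out
```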
- The `generate()` method of `BigramLanguageModel` generates text by sampling from the learned distribution of next characters.
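The sampling loop typically crops the context to the last `block_size` tokens, takes the logits for the final position, and samples the next character from the resulting distribution. A sketch of such a method, consistent with the description above but not necessarily the exact code:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    # ... embedding layers, Transformer blocks, and forward() as described above ...

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        """Append max_new_tokens sampled characters to the context idx of shape (B, T)."""
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]          # crop to the maximum context length
            logits, _ = self(idx_cond)               # forward pass; the loss output is ignored
            logits = logits[:, -1, :]                # logits for the last time step
            probs = F.softmax(logits, dim=-1)        # distribution over next characters
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)  # append and continue
        return idx
```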
During training, the script will periodically print text samples generated by the model. Here is an example of what you might see:
...
step 0: train loss 4.1234, val loss 4.5678
Generated text: "Sample text generated by the model..."
...