This repository contains an implementation of a Transformer-based Automatic Speech Recognition (ASR) system. The model uses an encoder-decoder architecture with attention mechanisms to convert speech input into text.
- Full encoder-decoder Transformer architecture
- Character and subword tokenization support
- SpecAugment data augmentation (illustrated after this list)
- Mixed precision training
- Configurable model architecture and training parameters
- Wandb integration for experiment tracking
- CTC loss support for better alignment learning
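The repository's SpecAugment implementation lives in `src/augmentation.py`. As a rough, self-contained illustration of the technique (not the project's actual code; the mask widths and counts here are arbitrary placeholders), the same effect can be sketched with `torchaudio`'s stock masking transforms:

```python
import torch
import torchaudio.transforms as T

# Two frequency masks and two time masks, applied to a batch of
# (batch, mel_bins, frames) filterbank features. Mask widths are
# illustrative, not the repository's configured values.
freq_mask = T.FrequencyMasking(freq_mask_param=15)
time_mask = T.TimeMasking(time_mask_param=35)

def spec_augment(features: torch.Tensor) -> torch.Tensor:
    """Zero out random frequency bands and time spans."""
    for _ in range(2):
        features = freq_mask(features)
        features = time_mask(features)
    return features

batch = torch.randn(8, 80, 400)   # dummy batch: 80 mel bins, 400 frames
augmented = spec_augment(batch)   # same shape, with masked regions zeroed
```

Masking of this kind is applied only during training; at evaluation time the features pass through unchanged.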
```
transformer/
├── configs/
│   └── config.yaml       # Configuration file
├── src/
│   ├── tokenizers.py     # Tokenization implementations
│   ├── dataset.py        # Data loading and preprocessing
│   ├── model.py          # Transformer model architecture
│   ├── augmentation.py   # SpecAugment implementation
│   ├── metrics.py        # Evaluation metrics (WER, CER)
│   ├── utils.py          # Helper functions
│   └── trainer.py        # Training logic
├── train.ipynb           # Training notebook
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
- Clone the repository:

  ```bash
  git clone https://github.com/realjules/transformer.git
  cd transformer
  ```

- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # on Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The model is designed to work with the LibriSpeech dataset. The data should be organized in the following structure:
```
data_root/
├── train-clean-100/
│   ├── data.npy
│   └── ...
├── dev-clean/
│   ├── data.npy
│   └── ...
└── test-clean/
    ├── data.npy
    └── ...
```
Each `data.npy` file should contain entries with (a loading sketch follows this list):
- Audio file path (relative to partition directory)
- Transcription text
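How these entries are produced and parsed is defined by `src/dataset.py`. Assuming each `data.npy` holds an object array of `(relative_audio_path, transcript)` pairs (an assumption for illustration; the actual layout may differ), a partition can be inspected like this:

```python
import os
import numpy as np

partition = "data_root/train-clean-100"

# allow_pickle=True is required if the entries are Python objects
# such as (path, transcript) tuples rather than plain numeric arrays.
entries = np.load(os.path.join(partition, "data.npy"), allow_pickle=True)

for rel_path, transcript in entries[:3]:
    audio_path = os.path.join(partition, rel_path)  # resolve relative path
    print(audio_path, "->", transcript)
```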
Download the dataset with the Kaggle CLI:

```bash
kaggle competitions download -c 11785-hw1p2-f24
```
The model and training parameters can be configured in `configs/config.yaml`. Key configuration options include (a loading sketch follows this list):
- Dataset settings (paths, features, normalization)
- Model architecture (dimensions, layers, heads)
- Training parameters (optimizer, learning rate, batch size)
- Data augmentation settings
- Tokenization strategy
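A minimal sketch of reading that file with PyYAML; the `model` and `training` section names below are illustrative placeholders, since the real schema is whatever `configs/config.yaml` defines:

```python
import yaml

# Load the experiment configuration from the repository's YAML file.
with open("configs/config.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical sections; check configs/config.yaml for the actual keys.
model_cfg = config.get("model", {})     # e.g. dimensions, layers, heads
train_cfg = config.get("training", {})  # e.g. optimizer, lr, batch size
print(model_cfg, train_cfg)
```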
1. Update the configuration in `configs/config.yaml` to match your setup.
2. Open and run `train.ipynb` in Jupyter:

   ```bash
   jupyter notebook train.ipynb
   ```
The notebook provides a step-by-step guide for:
- Loading and preprocessing data
- Creating the model
- Training and validation
- Saving checkpoints (a mixed-precision training and checkpointing sketch follows this list)
- Generating predictions
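As a rough illustration of the training and checkpointing steps (not the actual code in `src/trainer.py`; the tiny linear model and random batch below are placeholders for the real pipeline), one mixed-precision update with `torch.cuda.amp` looks like this:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(80, 10).to(device)          # stand-in for the ASR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

features = torch.randn(8, 80, device=device)             # dummy batch
targets = torch.randint(0, 10, (8,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    loss = criterion(model(features), targets)  # forward pass in mixed precision
scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then calls optimizer.step()
scaler.update()

# Checkpointing: save everything needed to resume training later.
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "scaler_state_dict": scaler.state_dict(),
}, "checkpoint.pt")
```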
The ASR system consists of:
- Frontend Processing:
  - Feature extraction (MFCC or filterbank features)
  - Optional SpecAugment data augmentation
  - Downsampling through convolutional layers
- Transformer (a model sketch follows this list):
  - Multi-head self-attention encoder
  - Multi-head cross-attention decoder
  - Position-wise feed-forward networks
  - Layer normalization and residual connections
- Output Processing:
  - Character/subword-level tokenization
  - Optional CTC loss for alignment learning
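The real architecture is defined in `src/model.py` and configured through `configs/config.yaml`. The following is only a minimal sketch of the components listed above, built from PyTorch's stock `nn.Transformer`; all dimensions are placeholders, and positional encodings and padding masks are omitted for brevity:

```python
import torch
import torch.nn as nn

class ASRTransformerSketch(nn.Module):
    """Conv downsampling frontend + encoder-decoder Transformer."""

    def __init__(self, n_mels=80, d_model=256, nhead=4,
                 num_layers=4, vocab_size=1000):
        super().__init__()
        # Frontend: two stride-2 convolutions give 4x time downsampling.
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, tokens):
        # feats: (batch, time, n_mels); tokens: (batch, target_len)
        x = self.frontend(feats.transpose(1, 2)).transpose(1, 2)
        tgt = self.embed(tokens)
        # Causal mask so each output position only attends to the past.
        mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.transformer(x, tgt, tgt_mask=mask)
        return self.out(h)  # per-token logits over the vocabulary

model = ASRTransformerSketch()
logits = model(torch.randn(2, 200, 80), torch.randint(0, 1000, (2, 30)))
print(logits.shape)  # torch.Size([2, 30, 1000])
```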
The model is evaluated using:
- Word Error Rate (WER)
- Character Error Rate (CER)
- Levenshtein Distance (a metrics sketch follows this list)
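All three reduce to edit distance: WER is the Levenshtein distance over word sequences divided by the reference word count, and CER is the same over characters. The repository's implementations live in `src/metrics.py`; a self-contained reference version could look like this:

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences, O(len(hyp)) memory."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(ref: str, hyp: str) -> float:
    words = ref.split()
    return levenshtein(words, hyp.split()) / max(1, len(words))

def cer(ref: str, hyp: str) -> float:
    return levenshtein(ref, hyp) / max(1, len(ref))

print(wer("the cat sat", "the cat sit"))  # 0.333... (1 of 3 words wrong)
print(cer("the cat sat", "the cat sit"))  # 0.0909... (1 of 11 characters)
```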
Training progress can be monitored through:
- Wandb dashboard (if enabled; a logging sketch follows this list)
- Attention visualization plots
- Training/validation metrics logging
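A minimal sketch of the Wandb side of this (the project name, config values, and metric keys are placeholders; the real run configuration comes from `configs/config.yaml`):

```python
import wandb

# Start a tracked run; config values are logged alongside the metrics.
run = wandb.init(project="transformer-asr", config={"lr": 1e-4, "d_model": 256})

for epoch in range(3):
    # Dummy numbers standing in for real training/validation results.
    train_loss, val_wer = 1.0 / (epoch + 1), 0.5 / (epoch + 1)
    wandb.log({"train/loss": train_loss, "val/wer": val_wer, "epoch": epoch})

run.finish()
```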
Contributions are welcome! Please feel free to submit a Pull Request.
If you use this code in your research, please cite:
```bibtex
@misc{transformer_asr,
  author    = {Jules Udahemuka},
  title     = {Transformer-based Speech Recognition},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/realjules/transformer}
}
```
- The Transformer architecture is based on "Attention Is All You Need" (Vaswani et al., 2017)
- The SpecAugment implementation follows "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition" (Park et al., 2019)