
Transformer-based Speech Recognition

This repository contains an implementation of a Transformer-based Automatic Speech Recognition (ASR) system. The model uses an encoder-decoder architecture with attention mechanisms to convert speech input into text.

Features

  • Full encoder-decoder Transformer architecture
  • Character and subword tokenization support
  • SpecAugment data augmentation (sketched after this list)
  • Mixed precision training
  • Configurable model architecture and training parameters
  • Wandb integration for experiment tracking
  • CTC loss support for better alignment learning
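
Of these features, SpecAugment is the easiest to illustrate in isolation. Below is a minimal sketch of the masking idea; the repository's actual implementation lives in src/augmentation.py, and the mask widths here are illustrative:

import torch

def spec_augment(spec, max_freq_mask=10, max_time_mask=20):
    # Minimal SpecAugment sketch: one frequency mask and one time mask.
    # spec: (n_mels, n_frames) log-mel spectrogram; parameters are illustrative.
    spec = spec.clone()
    n_mels, n_frames = spec.shape
    # Frequency mask: zero out a random band of mel channels.
    f = torch.randint(0, max_freq_mask + 1, (1,)).item()
    f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
    spec[f0:f0 + f, :] = 0.0
    # Time mask: zero out a random span of frames.
    t = torch.randint(0, max_time_mask + 1, (1,)).item()
    t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
    spec[:, t0:t0 + t] = 0.0
    return spec

augmented = spec_augment(torch.randn(80, 300))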

Project Structure

speech-transformer/
├── configs/
│   └── config.yaml          # Configuration file
├── src/
│   ├── tokenizers.py        # Tokenization implementations
│   ├── dataset.py           # Data loading and preprocessing
│   ├── model.py             # Transformer model architecture
│   ├── augmentation.py      # SpecAugment implementation
│   ├── metrics.py           # Evaluation metrics (WER, CER)
│   ├── utils.py             # Helper functions
│   └── trainer.py           # Training logic
├── train.ipynb              # Training notebook
├── requirements.txt         # Python dependencies
└── README.md               # This file

Installation

  1. Clone the repository:
git clone https://github.com/realjules/speech-transformer.git
cd speech-transformer
  2. Create and activate a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate          # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Data Preparation

The model is designed to work with the LibriSpeech dataset. The data should be organized in the following structure:

data_root/
├── train-clean-100/
│   ├── data.npy
│   └── ...
├── dev-clean/
│   ├── data.npy
│   └── ...
└── test-clean/
    ├── data.npy
    └── ...

Each data.npy file should contain entries with:

  • Audio file path (relative to the partition directory)
  • Transcription text

The dataset can be downloaded with the Kaggle CLI:

kaggle competitions download -c 11785-hw1p2-f24
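
The exact record layout inside data.npy is not pinned down above, so the following is only a hedged sketch of inspecting one partition's index (the two-field order is an assumption):

import numpy as np

# Assumed layout: each entry holds (audio_path, transcript).
entries = np.load("data_root/train-clean-100/data.npy", allow_pickle=True)
for audio_path, transcript in entries[:3]:
    print(audio_path, "->", transcript)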

Configuration

The model and training parameters can be configured in configs/config.yaml. Key configuration options include:

  • Dataset settings (paths, features, normalization)
  • Model architecture (dimensions, layers, heads)
  • Training parameters (optimizer, learning rate, batch size)
  • Data augmentation settings
  • Tokenization strategy
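
As an illustration only (these key names are hypothetical; consult configs/config.yaml for the real schema), a config might look like:

data:
  root: /path/to/data_root
  features: fbank
  normalize: true
model:
  d_model: 256
  num_layers: 6
  num_heads: 4
training:
  optimizer: adam
  lr: 0.0003
  batch_size: 32
  mixed_precision: true
augmentation:
  specaugment: true
tokenizer:
  type: char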

Training

  1. Update the configuration in configs/config.yaml to match your setup.

  2. Open and run train.ipynb in Jupyter:

jupyter notebook train.ipynb

The notebook provides a step-by-step guide for:

  • Loading and preprocessing data
  • Creating the model
  • Training and validation
  • Saving checkpoints
  • Generating predictions
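
Because the trainer supports mixed precision, the core update step follows the standard torch.cuda.amp pattern. Below is a self-contained sketch with a stand-in model; the real loop lives in src/trainer.py, all shapes and hyperparameters are illustrative, and a CUDA GPU is assumed:

import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

model = nn.Linear(80, 30).cuda()      # stand-in for the real ASR model
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
scaler = GradScaler()

feats = torch.randn(8, 80, device="cuda")           # dummy feature batch
labels = torch.randint(0, 30, (8,), device="cuda")  # dummy targets

optimizer.zero_grad()
with autocast():                      # forward pass runs in float16 where safe
    loss = criterion(model(feats), labels)
scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
scaler.step(optimizer)                # unscale gradients, then optimizer step
scaler.update()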

Model Architecture

The ASR system consists of:

  1. Frontend Processing:

    • Feature extraction (MFCC or filterbank features)
    • Optional SpecAugment data augmentation
    • Downsampling through convolutional layers
  2. Transformer:

    • Multi-head self-attention encoder
    • Multi-head cross-attention decoder
    • Position-wise feed-forward networks
    • Layer normalization and residual connections
  3. Output Processing:

    • Character/subword level tokenization
    • Optional CTC loss for alignment learning
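
These pieces map onto standard PyTorch modules. The skeleton below is a hedged sketch, not the repository's actual src/model.py: all dimensions and names are illustrative, positional encodings are omitted for brevity, and the CTC head is shown but not wired up:

import torch
from torch import nn

class TinyASRTransformer(nn.Module):
    # Illustrative skeleton: conv downsampling -> Transformer -> token logits.
    def __init__(self, n_feats=80, d_model=256, vocab=1000):
        super().__init__()
        # Frontend: project features and downsample 2x in time.
        self.frontend = nn.Conv1d(n_feats, d_model, kernel_size=3, stride=2, padding=1)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=6, num_decoder_layers=6,
            batch_first=True,
        )
        self.embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)      # decoder token logits
        self.ctc_out = nn.Linear(d_model, vocab)  # optional CTC head (unused in this sketch)

    def forward(self, feats, tokens):
        # feats: (batch, time, n_feats); tokens: (batch, text_len)
        src = self.frontend(feats.transpose(1, 2)).transpose(1, 2)
        tgt = self.embed(tokens)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        dec = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(dec)

logits = TinyASRTransformer()(torch.randn(2, 100, 80), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])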

Evaluation Metrics

The model is evaluated using:

  • Word Error Rate (WER)
  • Character Error Rate (CER)
  • Levenshtein Distance
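
All three reduce to Levenshtein (edit) distance: CER applies it over characters and WER over whitespace-separated words. A minimal sketch (the repository's actual implementations live in src/metrics.py):

def levenshtein(ref, hyp):
    # Edit distance between two sequences via dynamic programming.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    return levenshtein(ref, hyp) / max(1, len(ref))

def wer(ref, hyp):
    return levenshtein(ref.split(), hyp.split()) / max(1, len(ref.split()))

print(wer("the cat sat", "the cat sat down"))  # ~0.333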

Results Visualization

Training progress can be monitored through:

  • Wandb dashboard (if enabled)
  • Attention visualization plots
  • Training/validation metrics logging
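
If Wandb is enabled, logging follows the usual wandb.init / wandb.log pattern. A minimal sketch (the project name and metric keys are illustrative, and the values below are dummies):

import wandb

wandb.init(project="speech-transformer", config={"lr": 3e-4})
for epoch in range(3):
    wandb.log({"epoch": epoch, "val_wer": 0.30 - 0.01 * epoch})  # dummy values
wandb.finish()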

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

If you use this code in your research, please cite:

@misc{transformer_asr,
  author = {Jules Udahemuka},
  title = {Transformer-based Speech Recognition},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/realjules/speech-transformer}
}

Acknowledgments
