
A Framework for Self-Supervised Automatic Speech Recognition

Python 3.9+ · PyTorch 2.0+ · License: MIT

This repository presents a complete, end-to-end framework for building high-performance Automatic Speech Recognition (ASR) models. The core of the project is a two-stage pipeline designed to overcome data scarcity in speech: the model first learns the fundamental structure of audio through self-supervision, then fine-tunes that representation for transcription.

The entire system, from data processing to distributed training and final evaluation, was built from scratch in PyTorch.


Architectural Overview

The architecture is designed to effectively learn powerful representations directly from raw audio. It consists of two main phases: self-supervised pre-training and supervised fine-tuning.

A high-level overview of the data flow and learning process is as follows:

1. PRE-TRAINING (Learning from Unlabeled Audio)
   Raw Audio -> [CNN Feature Extractor] -> Latent Speech Representations
                                |
             (Masking) -> [Transformer Encoder] -> Contextual Representations
                                |
      [Contrastive Loss (InfoNCE)] <-- [Quantization Module]

2. FINE-TUNING (Learning from Labeled Audio + Text)
   Pre-trained Encoder -> [Linear CTC Head] -> Character Probabilities
                                |
                        [CTC Loss] <-- Ground Truth Transcript
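
The contrastive loss in step 1 can be made concrete with a short sketch. This is a generic InfoNCE formulation, not the repository's exact code; the tensor shapes, the number of distractors K, and the temperature value are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def info_nce(context, target, distractors, temperature=0.1):
        """Contrastive loss over one masked position.
        context:     (batch, dim)    encoder output at the masked step
        target:      (batch, dim)    quantized latent of the true step
        distractors: (batch, K, dim) quantized latents from other steps
        """
        # Candidates: the true target at index 0, followed by K distractors.
        candidates = torch.cat([target.unsqueeze(1), distractors], dim=1)
        # Cosine similarity between the context vector and every candidate.
        logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)
        # The correct answer is always index 0.
        labels = torch.zeros(context.size(0), dtype=torch.long, device=context.device)
        return F.cross_entropy(logits / temperature, labels)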

Core Concepts of My Implementation:

  • CNN Feature Extractor: A stack of temporal convolutions processes the raw waveform to produce a sequence of high-level acoustic features.
  • Contextual Encoder: A Transformer network learns deep contextual relationships between the audio features across the entire sequence.
  • Self-Supervised Objective: During pre-training, the model is shown masked audio features and learns to identify the correct original feature from a set of distractors (contrastive learning). This forces the encoder to learn robust, generalizable representations of speech without needing any transcripts.
  • Fine-tuning Head: For the ASR task, the pre-trained encoder is frozen and a simple linear layer is added on top. This new "head" is trained with the CTC loss to map the learned speech representations to character predictions (see the sketch after this list).
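
To make these components concrete, here is a minimal PyTorch sketch of the overall architecture. It is an illustration only: the layer sizes, convolution strides, encoder depth, and vocabulary size are assumed values, not the ones used in the asr package.

    import torch
    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        """Stack of temporal convolutions over the raw waveform."""
        def __init__(self, dim=512):
            super().__init__()
            layers, in_ch = [], 1
            # (kernel, stride) pairs; each layer downsamples in time.
            for k, s in [(10, 5), (3, 2), (3, 2), (3, 2), (2, 2)]:
                layers += [nn.Conv1d(in_ch, dim, k, stride=s), nn.GELU()]
                in_ch = dim
            self.conv = nn.Sequential(*layers)

        def forward(self, wav):                 # wav: (batch, samples)
            x = self.conv(wav.unsqueeze(1))     # (batch, dim, frames)
            return x.transpose(1, 2)            # (batch, frames, dim)

    class ASRModel(nn.Module):
        """CNN extractor -> Transformer encoder -> linear CTC head."""
        def __init__(self, dim=512, heads=8, depth=6, vocab_size=32):
            super().__init__()
            self.extractor = FeatureExtractor(dim)
            block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(block, num_layers=depth)
            self.ctc_head = nn.Linear(dim, vocab_size)   # added for fine-tuning

        def forward(self, wav):
            feats = self.extractor(wav)         # latent speech representations
            context = self.encoder(feats)       # contextual representations
            return self.ctc_head(context)       # per-frame character logits

During fine-tuning, the extractor and encoder parameters would be frozen (requires_grad = False) so that only the CTC head is updated.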

Features

  • End-to-End Pipeline: A complete solution covering data preparation, multi-GPU pre-training, fine-tuning, and evaluation.
  • Fully Configurable: All model hyperparameters, data paths, and training settings are managed through simple YAML files.
  • Efficient Distributed Training: Leverages PyTorch's native Distributed Data Parallel (DDP) for fast and scalable pre-training on multiple GPUs.
  • Modular & Reusable Codebase: The project is structured as a Python package (asr) to make components like the model layers and data loaders easy to reuse or extend.
  • Included Tooling: Comes with a helper script to automatically generate the necessary data manifests from audio/text directories.

Project Structure

ASR-SSL/
├── configs/              # YAML configuration files
├── scripts/              # Helper scripts (e.g., data preparation)
├── asr/                  # Core source code for the model library
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
├── pretrain.py           # Executable for pre-training
└── finetune.py           # Executable for fine-tuning

Setup ⚙️

  1. Clone the Repository

    git clone https://github.com/iamindrayudh/ASR-SSL.git
    cd ASR-SSL
  2. Create a Virtual Environment (Recommended)

    python -m venv venv
    source venv/bin/activate
    # On Windows, use: venv\Scripts\activate
  3. Install Dependencies

    pip install -r requirements.txt

Usage Guide 🛠️

Step 1: Prepare Your Data

Generate the required .tsv manifest files for your datasets using the provided script.

  • For Pre-training (Unlabeled Audio):

    python scripts/create_manifest.py /path/to/unlabeled_audio/ pretrain.tsv --mode pretrain
  • For Fine-tuning (Labeled Audio & Text): Ensure your directory contains audio (e.g., sample1.wav) and matching transcript files (e.g., sample1.txt).

    python scripts/create_manifest.py /path/to/labeled_data/ finetune.tsv --mode finetune
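
The exact manifest columns are defined by scripts/create_manifest.py. As a rough illustration (not the repository's actual code), such a script typically walks the data directory and writes one tab-separated row per audio file:

    import csv
    from pathlib import Path

    def create_manifest(data_dir, out_tsv, mode="pretrain"):
        """Illustrative manifest writer: one row per .wav file.
        The real script may use different columns or formats."""
        with open(out_tsv, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            for wav in sorted(Path(data_dir).rglob("*.wav")):
                if mode == "pretrain":
                    writer.writerow([str(wav)])
                else:
                    txt = wav.with_suffix(".txt")  # matching transcript
                    if txt.exists():
                        writer.writerow([str(wav), txt.read_text().strip()])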

Step 2: Configure Your Training

Edit the YAML files in the configs/ directory to point to your manifest files and set training parameters.
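
As one illustration, a pre-training config might contain entries along these lines. The field names below are placeholders invented for this example; consult the actual files in configs/ for the real schema.

    # configs/pretrain_config.yaml (illustrative keys only)
    data:
      manifest: /path/to/pretrain.tsv
    training:
      batch_size: 8
      learning_rate: 5.0e-4
      max_steps: 100000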

Step 3: Run Pre-training

Launch the pre-training script using torchrun. Adjust --nproc_per_node to match the number of GPUs available.

# Example for 4 GPUs
torchrun --standalone --nproc_per_node=4 pretrain.py --config configs/pretrain_config.yaml
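
Under the hood, torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE environment variables that pretrain.py can use to initialize DDP. A minimal sketch of that setup (not necessarily the script's exact code):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp(model):
        """Wrap a model for multi-GPU training under torchrun."""
        dist.init_process_group(backend="nccl")       # one process per GPU
        local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
        torch.cuda.set_device(local_rank)
        return DDP(model.cuda(local_rank), device_ids=[local_rank])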

Step 4: Run Fine-tuning

After pre-training, use the generated checkpoint to fine-tune the model for speech recognition.

python finetune.py --config configs/finetune_config.yaml

The script saves the checkpoint that achieves the lowest Word Error Rate (WER) on your validation set.
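
WER is computed from decoded hypotheses. The simplest CTC decoder is greedy: take the argmax symbol per frame, collapse repeats, and drop blanks. A small self-contained sketch with a worked example (the repository's decoder may differ):

    import torch

    def greedy_ctc_decode(logits, vocab, blank=0):
        """Argmax per frame, collapse repeats, drop CTC blanks."""
        out, prev = [], blank
        for i in logits.argmax(dim=-1).tolist():
            if i != blank and i != prev:
                out.append(vocab[i])
            prev = i
        return "".join(out)

    # Worked example: index 0 is the CTC blank symbol.
    vocab = ["_", "a", "b", "c"]
    frames = torch.tensor([[0.1, 2.0, 0.0, 0.0],   # 'a'
                           [0.1, 2.0, 0.0, 0.0],   # 'a' again -> collapsed
                           [3.0, 0.0, 0.0, 0.0],   # blank
                           [0.0, 0.0, 0.0, 2.0]])  # 'c'
    print(greedy_ctc_decode(frames, vocab))         # prints "ac"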


Acknowledgments

The self-supervised learning approach implemented in this framework is inspired by the principles outlined in the wav2vec 2.0 paper by Baevski et al.
