
A Framework for Self-Supervised Automatic Speech Recognition

Python 3.9+ · PyTorch 2.0+ · License: MIT

This repository presents a complete, end-to-end framework for building high-performance Automatic Speech Recognition (ASR) models. The core of the project is a two-stage pipeline designed to overcome data scarcity in speech: the model first learns the fundamental structure of audio through self-supervision, then fine-tunes that representation for transcription.

The entire system, from data processing to distributed training and final evaluation, was built from scratch in PyTorch.


Architectural Overview

The architecture is designed to effectively learn powerful representations directly from raw audio. It consists of two main phases: self-supervised pre-training and supervised fine-tuning.

A high-level overview of the data flow and learning process is as follows:

1. PRE-TRAINING (Learning from Unlabeled Audio)
   Raw Audio -> [CNN Feature Extractor] -> Latent Speech Representations
                                |
             (Masking) -> [Transformer Encoder] -> Contextual Representations
                                |
      [Contrastive Loss (InfoNCE)] <-- [Quantization Module]

2. FINE-TUNING (Learning from Labeled Audio + Text)
   Pre-trained Encoder -> [Linear CTC Head] -> Character Probabilities
                                |
                        [CTC Loss] <-- Ground Truth Transcript
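
The contrastive loss in step 1 can be made concrete with a short sketch. This is a generic InfoNCE formulation, not the repository's exact code; the tensor shapes, the number of distractors K, and the temperature value are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def info_nce(context, target, distractors, temperature=0.1):
        """Contrastive loss over one masked position.
        context:     (batch, dim)    encoder output at the masked step
        target:      (batch, dim)    quantized latent of the true step
        distractors: (batch, K, dim) quantized latents from other steps
        """
        # Candidates: the true target at index 0, followed by K distractors.
        candidates = torch.cat([target.unsqueeze(1), distractors], dim=1)
        # Cosine similarity between the context vector and every candidate.
        logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)
        # The correct answer is always index 0.
        labels = torch.zeros(context.size(0), dtype=torch.long, device=context.device)
        return F.cross_entropy(logits / temperature, labels)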

Core Concepts of My Implementation:

  • CNN Feature Extractor: A stack of temporal convolutions processes the raw waveform to produce a sequence of high-level acoustic features.
  • Contextual Encoder: A Transformer network learns deep contextual relationships between the audio features across the entire sequence.
  • Self-Supervised Objective: During pre-training, the model is shown masked audio features and learns to identify the correct original feature from a set of distractors (contrastive learning). This forces the encoder to learn robust, generalizable representations of speech without needing any transcripts.
  • Fine-tuning Head: For the ASR task, the pre-trained encoder is frozen and a simple linear layer is added on top. This new "head" is trained with the CTC loss to map the learned speech representations to character predictions (see the sketch after this list).
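
To make these components concrete, here is a minimal PyTorch sketch of the overall architecture. It is an illustration only: the layer sizes, convolution strides, encoder depth, and vocabulary size are assumed values, not the ones used in the asr package.

    import torch
    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        """Stack of temporal convolutions over the raw waveform."""
        def __init__(self, dim=512):
            super().__init__()
            layers, in_ch = [], 1
            # (kernel, stride) pairs; each layer downsamples in time.
            for k, s in [(10, 5), (3, 2), (3, 2), (3, 2), (2, 2)]:
                layers += [nn.Conv1d(in_ch, dim, k, stride=s), nn.GELU()]
                in_ch = dim
            self.conv = nn.Sequential(*layers)

        def forward(self, wav):                 # wav: (batch, samples)
            x = self.conv(wav.unsqueeze(1))     # (batch, dim, frames)
            return x.transpose(1, 2)            # (batch, frames, dim)

    class ASRModel(nn.Module):
        """CNN extractor -> Transformer encoder -> linear CTC head."""
        def __init__(self, dim=512, heads=8, depth=6, vocab_size=32):
            super().__init__()
            self.extractor = FeatureExtractor(dim)
            block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(block, num_layers=depth)
            self.ctc_head = nn.Linear(dim, vocab_size)   # added for fine-tuning

        def forward(self, wav):
            feats = self.extractor(wav)         # latent speech representations
            context = self.encoder(feats)       # contextual representations
            return self.ctc_head(context)       # per-frame character logits

During fine-tuning, the extractor and encoder parameters would be frozen (requires_grad = False) so that only the CTC head is updated.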

Features

  • End-to-End Pipeline: A complete solution covering data preparation, multi-GPU pre-training, fine-tuning, and evaluation.
  • Fully Configurable: All model hyperparameters, data paths, and training settings are managed through simple YAML files.
  • Efficient Distributed Training: Leverages PyTorch's native Distributed Data Parallel (DDP) for fast and scalable pre-training on multiple GPUs.
  • Modular & Reusable Codebase: The project is structured as a Python package (asr) to make components like the model layers and data loaders easy to reuse or extend.
  • Included Tooling: Comes with a helper script to automatically generate the necessary data manifests from audio/text directories.

Project Structure

ASR-SSL/
├── configs/              # YAML configuration files
├── scripts/              # Helper scripts (e.g., data preparation)
├── asr/                  # Core source code for the model library
├── .gitignore
├── LICENSE
├── README.md
├── requirements.txt
├── pretrain.py           # Executable for pre-training
└── finetune.py           # Executable for fine-tuning

Setup ⚙️

  1. Clone the Repository

    git clone https://github.com/iamindrayudh/ASR-SSL.git
    cd ASR-SSL
  2. Create a Virtual Environment (Recommended)

    python -m venv venv
    source venv/bin/activate
    # On Windows, use: venv\Scripts\activate
  3. Install Dependencies

    pip install -r requirements.txt

Usage Guide 🛠️

Step 1: Prepare Your Data

Generate the required .tsv manifest files for your datasets using the provided script.

  • For Pre-training (Unlabeled Audio):

    python scripts/create_manifest.py /path/to/unlabeled_audio/ pretrain.tsv --mode pretrain
  • For Fine-tuning (Labeled Audio & Text): Ensure your directory contains audio (e.g., sample1.wav) and matching transcript files (e.g., sample1.txt).

    python scripts/create_manifest.py /path/to/labeled_data/ finetune.tsv --mode finetune
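
The exact manifest columns are defined by scripts/create_manifest.py. As a rough illustration (not the repository's actual code), such a script typically walks the data directory and writes one tab-separated row per audio file:

    import csv
    from pathlib import Path

    def create_manifest(data_dir, out_tsv, mode="pretrain"):
        """Illustrative manifest writer: one row per .wav file.
        The real script may use different columns or formats."""
        with open(out_tsv, "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t")
            for wav in sorted(Path(data_dir).rglob("*.wav")):
                if mode == "pretrain":
                    writer.writerow([str(wav)])
                else:
                    txt = wav.with_suffix(".txt")  # matching transcript
                    if txt.exists():
                        writer.writerow([str(wav), txt.read_text().strip()])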

Step 2: Configure Your Training

Edit the YAML files in the configs/ directory to point to your manifest files and set training parameters.
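
As one illustration, a pre-training config might contain entries along these lines. The field names below are placeholders invented for this example; consult the actual files in configs/ for the real schema.

    # configs/pretrain_config.yaml (illustrative keys only)
    data:
      manifest: /path/to/pretrain.tsv
    training:
      batch_size: 8
      learning_rate: 5.0e-4
      max_steps: 100000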

Step 3: Run Pre-training

Launch the pre-training script using torchrun. Adjust --nproc_per_node to match the number of GPUs available.

# Example for 4 GPUs
torchrun --standalone --nproc_per_node=4 pretrain.py --config configs/pretrain_config.yaml
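
Under the hood, torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE environment variables that pretrain.py can use to initialize DDP. A minimal sketch of that setup (not necessarily the script's exact code):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp(model):
        """Wrap a model for multi-GPU training under torchrun."""
        dist.init_process_group(backend="nccl")       # one process per GPU
        local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
        torch.cuda.set_device(local_rank)
        return DDP(model.cuda(local_rank), device_ids=[local_rank])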

Step 4: Run Fine-tuning

After pre-training, use the generated checkpoint to fine-tune the model for speech recognition.

python finetune.py --config configs/finetune_config.yaml

The script saves the checkpoint that achieves the lowest Word Error Rate (WER) on your validation set.
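
WER is computed from decoded hypotheses. The simplest CTC decoder is greedy: take the argmax symbol per frame, collapse repeats, and drop blanks. A small self-contained sketch with a worked example (the repository's decoder may differ):

    import torch

    def greedy_ctc_decode(logits, vocab, blank=0):
        """Argmax per frame, collapse repeats, drop CTC blanks."""
        out, prev = [], blank
        for i in logits.argmax(dim=-1).tolist():
            if i != blank and i != prev:
                out.append(vocab[i])
            prev = i
        return "".join(out)

    # Worked example: index 0 is the CTC blank symbol.
    vocab = ["_", "a", "b", "c"]
    frames = torch.tensor([[0.1, 2.0, 0.0, 0.0],   # 'a'
                           [0.1, 2.0, 0.0, 0.0],   # 'a' again -> collapsed
                           [3.0, 0.0, 0.0, 0.0],   # blank
                           [0.0, 0.0, 0.0, 2.0]])  # 'c'
    print(greedy_ctc_decode(frames, vocab))         # prints "ac"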


Acknowledgments

The self-supervised learning approach implemented in this framework is inspired by the principles outlined in the wav2vec 2.0 paper by Baevski et al.
