This guide will walk you through the process of training ChunkFormer models from scratch.
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
```

If conda is not yet installed, please see https://docs.conda.io/en/latest/miniconda.html.

```bash
conda create -n chunkformer python=3.11
conda activate chunkformer
conda install conda-forge::sox
```

It's recommended to use PyTorch 2.5.1 with CUDA 12.1, though newer versions work fine.

```bash
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -e .
```

This installs ChunkFormer with all required dependencies for training and development.
For training RNN-T models with the k2 pruned loss, refer to this PAGE to find the compatible k2 version.
First, create a data/ directory in your training example folder (e.g., examples/asr/ctc/ or examples/asr/rnnt/):
```bash
cd examples/asr/ctc   # or examples/asr/rnnt
mkdir -p data
```

Your training directory must follow this structure:
```
examples/asr/ctc/                # or rnnt
├── data/                        # MUST CREATE THIS FOLDER YOURSELF
│   ├── train_set_name/          # Your training set folder
│   │   └── data.tsv             # REQUIRED: training data file
│   ├── dev_set_name/            # Your validation set folder
│   │   └── data.tsv             # REQUIRED: validation data file
│   └── test_set_name/           # Your test set folder
│       └── data.tsv             # REQUIRED: test data file
├── conf/
│   └── your_config.yaml         # Training configuration
├── exp/                         # Experiment outputs (auto-created)
├── tensorboard/                 # TensorBoard logs (auto-created)
├── tools/                       # Training tools
├── run.sh                       # Training script
└── path.sh                      # Environment setup
```
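If you prefer to script the setup, the skeleton above can be created with a few lines of Python. This is only a convenience sketch: it assumes you are inside the training example folder, and the set names (`train_set_name`, `dev_set_name`, `test_set_name`) are placeholders you should replace with your own.

```python
import os

# Placeholder set names -- replace with your actual dataset folder names.
sets = ["train_set_name", "dev_set_name", "test_set_name"]

for name in sets:
    folder = os.path.join("data", name)
    os.makedirs(folder, exist_ok=True)
    tsv = os.path.join(folder, "data.tsv")
    if not os.path.exists(tsv):
        # Write only the required header row; fill in utterances afterwards.
        with open(tsv, "w", encoding="utf-8") as f:
            f.write("key\twav\ttxt\n")

print(sorted(os.listdir("data")))
```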
Each data.tsv file must contain exactly 3 columns with tab-separated values:

```
key	wav	txt
utterance_001	/path/to/audio1.wav	transcription text here
utterance_002	/path/to/audio2.wav	another transcription
utterance_003	/path/to/audio3.wav	more transcription text
```

Column specifications:
- key: Unique identifier for each utterance
- wav: Absolute path to the audio file (.wav, .flac, .mp3, etc.)
- txt: Ground-truth transcription text
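Malformed rows in data.tsv are easier to catch before training starts. A small validation sketch (the helper name `check_tsv` and its strictness choices are my own, not part of ChunkFormer):

```python
def check_tsv(path):
    """Validate a data.tsv file: every row needs exactly 3 tab-separated
    fields (key, wav, txt) and keys must be unique."""
    errors, seen = [], set()
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                errors.append(f"line {lineno}: expected 3 columns, got {len(fields)}")
                continue
            key, wav, txt = fields
            if key in seen:
                errors.append(f"line {lineno}: duplicate key {key!r}")
            seen.add(key)
    return errors

# Example: one well-formed row, one with a missing column.
with open("data.tsv", "w", encoding="utf-8") as f:
    f.write("utt_001\t/path/to/a.wav\thello world\n")
    f.write("utt_002\t/path/to/b.wav\n")

print(check_tsv("data.tsv"))  # ['line 2: expected 3 columns, got 2']
```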
Edit the run.sh script and modify these key variables:
```bash
...
# For multi-GPU training, set CUDA_VISIBLE_DEVICES:
export CUDA_VISIBLE_DEVICES="0,1,2,3"   # use 4 GPUs

# To resume training from a checkpoint, set the checkpoint path:
checkpoint=/path/to/your/checkpoint.pt

# Set your dataset names (must match folder names in data/):
train_set=train_set_name     # your training folder name
dev_set=dev_set_name         # your validation folder name
recog_set=test_set_name      # your test folder name

# Training configuration:
train_config=conf/v0.yaml    # your model config file
dir=exp/v0                   # experiment output directory

# Mixed-precision training is enabled by default via the --use_amp flag:
chunkformer/bin/train.py \
    --use_amp \
    ...

# To enable streaming decoding during recognition (Stage 4),
# add the --simulate_streaming flag to the recognize.py command:
chunkformer/bin/recognize.py \
    --simulate_streaming \
    ...
...
```

For model configuration, refer to the example configuration file in conf/v0.yaml. It contains all the necessary parameters for model architecture, training settings, and data processing.
ChunkFormer training follows a 7-stage process:
Stage 0: `./run.sh --stage 0 --stop-stage 0`
- Converts `data.tsv` files to the required `.list`, `text`, and `wav.scp` formats
- Automatic: no manual intervention needed

Stage 1: `./run.sh --stage 1 --stop-stage 1`
- Computes CMVN (Cepstral Mean and Variance Normalization) statistics
- Generates the `global_cmvn` file for feature normalization

Stage 2: `./run.sh --stage 2 --stop-stage 2`
- Creates the BPE/character-level vocabulary
- Generates the `*_units.txt` dictionary file
- Builds the subword model if using BPE

Stage 3: `./run.sh --stage 3 --stop-stage 3`
- Main training stage: trains the neural network model
- Saves checkpoints in `$dir`
- Monitor training via TensorBoard logs

Stage 4: `./run.sh --stage 4 --stop-stage 4`
- Averages multiple checkpoints for better performance
- Runs inference on the test sets
- Computes Word Error Rate (WER) metrics

Stage 5: `./run.sh --stage 5 --stop-stage 5`
- Packages the model for ChunkFormer inference
- Creates a `model_checkpoint_*` directory with all required files

Stage 6: `./run.sh --stage 6 --stop-stage 6`
- Uploads the prepared model directory to the Hugging Face Hub if `hf_token` and `hf_repo_id` are set in the script
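The `--stage`/`--stop-stage` flags select an inclusive range of stages, so the whole pipeline runs with `--stage 0 --stop-stage 6`. Assuming run.sh follows this common Kaldi-style convention, the selection logic amounts to:

```python
def stages_to_run(stage: int, stop_stage: int, all_stages=range(7)):
    """Return the inclusive subset of pipeline stages selected by
    --stage/--stop-stage (Kaldi-style convention, stages 0..6 here)."""
    return [s for s in all_stages if stage <= s <= stop_stage]

# ./run.sh --stage 0 --stop-stage 0 runs only data preparation:
print(stages_to_run(0, 0))   # [0]
# ./run.sh --stage 3 --stop-stage 4 trains, then averages and decodes:
print(stages_to_run(3, 4))   # [3, 4]
# ./run.sh --stage 0 --stop-stage 6 runs the full pipeline:
print(stages_to_run(0, 6))   # [0, 1, 2, 3, 4, 5, 6]
```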
Monitor training progress with TensorBoard:
```bash
tensorboard --logdir=tensorboard/{your_experiment_name} --port=6006
```

Training logs are organized as:

```
tensorboard/
├── v0
├── v1
└── ...
```
After successful training, you'll find:
```
$dir
├── epoch_{epoch}.pt            # Model checkpoint at each epoch
├── epoch_{epoch}.yaml          # Config and logs at each epoch
├── final.pt                    # Final model checkpoint
├── avg_5.pt                    # Averaged checkpoint
├── train.yaml                  # Training config
└── model_checkpoint_avg_5/     # Ready for inference
    ├── tokenizer               # Tokenizer folder
    ├── pytorch_model.pt        # Model weights
    ├── config.yaml             # Model config
    ├── global_cmvn             # Normalization stats
    └── vocab.txt               # Vocabulary file
```
The model_checkpoint_* directory can be used directly with ChunkFormer's inference API:
```python
import chunkformer

model = chunkformer.ChunkFormerModel.from_pretrained('$dir/model_checkpoint_avg_5')
```
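Before loading, it can be worth verifying that a packaged `model_checkpoint_*` directory contains every file from the layout above. A pure-stdlib sketch (the helper `missing_files` is my own, not a ChunkFormer API):

```python
import os

# File names taken from the packaged-model layout above.
REQUIRED = ["pytorch_model.pt", "config.yaml", "global_cmvn", "vocab.txt", "tokenizer"]

def missing_files(ckpt_dir):
    """Return the entries from the expected layout that are absent."""
    return [name for name in REQUIRED
            if not os.path.exists(os.path.join(ckpt_dir, name))]

# Example with a stub directory containing only two of the entries.
os.makedirs("model_checkpoint_avg_5/tokenizer", exist_ok=True)
open("model_checkpoint_avg_5/vocab.txt", "w").close()
print(missing_files("model_checkpoint_avg_5"))
# ['pytorch_model.pt', 'config.yaml', 'global_cmvn']
```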