# Results

## SPGISpeech BPE training results (Zipformer Transducer)

### 2024-01-05

#### Zipformer encoder + embedding decoder

Transducer: Zipformer encoder + stateless decoder.
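
In a transducer of this kind, the encoder consumes acoustic frames, the stateless decoder consumes previously emitted tokens, and a joiner combines the two to score the next token. A minimal sketch of that wiring (module names and dimensions are illustrative, not the recipe's actual code):

```python
import torch
import torch.nn as nn

class TransducerSketch(nn.Module):
    """Transducer wiring: encoder + stateless decoder + joiner.
    Names and dimensions are illustrative only."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 joint_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder  # e.g. a Zipformer over acoustic features
        self.decoder = decoder  # stateless decoder over the label history
        self.joiner = nn.Linear(joint_dim, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        enc = self.encoder(feats)   # (batch, T, joint_dim)
        dec = self.decoder(tokens)  # (batch, U, joint_dim)
        # Combine every acoustic frame with every label position,
        # producing the (T, U) lattice of next-token logits.
        return self.joiner(torch.tanh(enc.unsqueeze(2) + dec.unsqueeze(1)))
```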

The WERs are:

| decoding method      | dev  | val  | comment |
|----------------------|------|------|---------|
| greedy search        | 2.08 | 2.14 | --epoch 30 --avg 10 |
| modified beam search | 2.05 | 2.09 | --epoch 30 --avg 10 --beam-size 4 |
| fast beam search     | 2.07 | 2.17 | --epoch 30 --avg 10 --beam 20 --max-contexts 8 --max-states 64 |

NOTE: SPGISpeech transcripts can be prepared in `ortho` (orthographic) or `norm` (normalized) form. The WERs above are for the normalized transcripts.

The training command to reproduce these results is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

python zipformer/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --num-workers 2 \
  --max-duration 1000
```

The decoding commands are:

```bash
# greedy search
python ./zipformer/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./zipformer/exp \
  --max-duration 1000 \
  --decoding-method greedy_search

# modified beam search
python ./zipformer/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./zipformer/exp \
  --max-duration 1000 \
  --decoding-method modified_beam_search

# fast beam search
python ./zipformer/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./zipformer/exp \
  --max-duration 1000 \
  --decoding-method fast_beam_search \
  --beam 4 \
  --max-contexts 4 \
  --max-states 8
```
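
In the commands above, `$epoch` and `$avg` select which checkpoints to average before decoding; the results in the table correspond to `epoch=30` and `avg=10`.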

## SPGISpeech BPE training results (Pruned Transducer)

### 2022-05-11

#### Conformer encoder + embedding decoder

Conformer encoder + non-recurrent decoder. The decoder contains only an embedding layer, a Conv1d (with kernel size 2), and a linear layer (to transform the tensor dimension).
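
A minimal sketch of such a stateless decoder, with illustrative dimensions (not the recipe's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatelessDecoder(nn.Module):
    """Embedding -> Conv1d(kernel_size=2) -> Linear, as described above.
    "Stateless": the only label context is a fixed two-token window."""

    def __init__(self, vocab_size: int, embed_dim: int = 512, out_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=2)
        self.linear = nn.Linear(embed_dim, out_dim)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, U) token ids
        emb = self.embedding(y).permute(0, 2, 1)   # (batch, embed_dim, U)
        emb = F.pad(emb, (1, 0))                   # left-pad so the conv stays causal
        out = self.conv(emb)                       # each frame sees current + previous token
        return self.linear(out.permute(0, 2, 1))   # (batch, U, out_dim)
```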

The WERs are:

| decoding method      | dev  | val  | comment |
|----------------------|------|------|---------|
| greedy search        | 2.46 | 2.40 | --avg-last-n 10 --max-duration 500 |
| modified beam search | 2.28 | 2.24 | --avg-last-n 10 --max-duration 500 --beam-size 4 |
| fast beam search     | 2.38 | 2.35 | --avg-last-n 10 --max-duration 500 --beam-size 4 --max-contexts 4 --max-states 8 |

NOTE: SPGISpeech transcripts can be prepared in `ortho` (orthographic) or `norm` (normalized) form. The WERs above are for the normalized transcripts.

The training command to reproduce these results is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./pruned_transducer_stateless2/train.py \
  --world-size 8 \
  --num-epochs 20 \
  --start-epoch 0 \
  --exp-dir pruned_transducer_stateless2/exp \
  --max-duration 200 \
  --prune-range 5 \
  --lr-factor 5 \
  --lm-scale 0.25 \
  --use-fp16 True
```

The decoding commands are:

```bash
# greedy search
./pruned_transducer_stateless2/decode.py \
  --iter 696000 --avg 10 \
  --exp-dir ./pruned_transducer_stateless2/exp \
  --max-duration 100 \
  --decoding-method greedy_search

# modified beam search
./pruned_transducer_stateless2/decode.py \
  --iter 696000 --avg 10 \
  --exp-dir ./pruned_transducer_stateless2/exp \
  --max-duration 100 \
  --decoding-method modified_beam_search \
  --beam-size 4

# fast beam search
./pruned_transducer_stateless2/decode.py \
  --iter 696000 --avg 10 \
  --exp-dir ./pruned_transducer_stateless2/exp \
  --max-duration 1500 \
  --decoding-method fast_beam_search \
  --beam 4 \
  --max-contexts 4 \
  --max-states 8
```
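
Checkpoint averaging (the `--avg` / `--avg-last-n` options above) is an elementwise average of model parameters across the selected checkpoints. A minimal sketch, assuming each checkpoint file stores its state dict under the key `"model"` (file names are hypothetical):

```python
import torch

def average_checkpoints(paths: list[str]) -> dict:
    """Elementwise average of parameters across checkpoint files.
    Assumes each file stores a state_dict under the key "model"."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. averaging the 10 checkpoints ending at iteration 696000 (hypothetical names):
# avg_state = average_checkpoints([f"exp/checkpoint-{it}.pt" for it in iters])
```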

A pretrained model is available at https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2

The TensorBoard training log can be found at https://tensorboard.dev/experiment/ExSoBmrPRx6liMTGLu0Tgw/#scalars