We are training a streaming model on our own dataset, which has been manually reviewed and corrected; we estimate the audio and transcripts are about 99.9% accurate.
For training, we use the script from the LibriSpeech recipe (egs/librispeech/ASR/zipformer/train.py).
All parameters are left at their defaults, and we have trained for both 20 and 30 epochs.
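For reference, our training invocation follows the recipe's documented streaming setup and looks roughly like the sketch below. The exp dir, world size, and max duration are placeholders for our values; `--causal 1` selects the streaming (causal) variant, and everything else is left at its default:

```shell
# Assumed invocation, run from egs/librispeech/ASR; flag values are placeholders.
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --causal 1 \
  --exp-dir zipformer/exp-causal \
  --max-duration 1000
```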
When we decode with the pretrained script (egs/librispeech/ASR/zipformer/pretrained.py), some words are missing from the output. The deletions most often occur after long pauses (longer than about 3 seconds), but not after every such pause; it appears to happen quite randomly.
We also tried the sherpa-onnx decoder, and it makes the same errors.
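The sherpa-onnx run used our exported streaming model with the command-line decoder, roughly as below; the model and token file names are placeholders for our exported files:

```shell
# Assumed invocation of the sherpa-onnx streaming decoder; file names are placeholders.
sherpa-onnx \
  --tokens=tokens.txt \
  --encoder=encoder.onnx \
  --decoder=decoder.onnx \
  --joiner=joiner.onnx \
  test.wav
```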
Is this a problem with the training or with the two decoders?
What can we try? Do we need different training parameters, or additional methods for training or decoding?