ASR project for Bengali, exploring Automatic Speech Recognition (ASR) techniques.
ASR system transcribes speech into text, emphasizing the probabilistic sequence of words and sounds.
Two approaches: statistical and end-to-end. Acoustic Model and Lexicon play key roles.
I focus here on the end-to-end approach.
- 09 vowels, 39 consonants
- Official language of Bangladesh
- 2nd most spoken language in India
Spoken by 200M+ people, featuring diverse dialects.
Handling out-of-distribution data and linguistic nuance, improve WER.
- 1180 hours of audio
- Data cleaning: maintain WER < 70%
- Preprocessing: Mel-Log Spectrograms, tokenized embeddings
PyTorch implementation with transducer approach.
40% WER on Common Voice, 10% in French.
- Baseline vs. Whisper vs. BenglaASR
- Whisper-small fine-tuned: WER 67% achieve lower WER then the baseline.
More details presentation slides in French here
Sources: