A Kaldi recipe for training a hybrid DNN-HMM speech recognition model

alifarrokh/kaldi-dnn-hmm-asr


Kaldi DNN-HMM Speech Recognition

This repository contains a simple recipe for training a hybrid DNN-HMM (Deep Neural Network - Hidden Markov Model) speech recognition model using Kaldi.

Prerequisites

1. Installing Kaldi

Follow the Kaldi installation guide to set up Kaldi. After installing Kaldi, replace run.sh with the one available in the egs/librispeech recipe.

2. Dataset Metadata

Before using this recipe, prepare two metadata files: dataset/train.txt and dataset/test.txt. Each line in these files describes one utterance as seven tab-separated fields:

file_id    utt_id    spk_id    start    end    path/to/file.wav    transcription

An audio file may contain more than one utterance, so each utterance gets its own line. The start and end fields give the utterance's start and end times within the audio file, in seconds.
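As an illustration of how this metadata relates to Kaldi's standard data-directory files (wav.scp, segments, text, utt2spk), here is a minimal Python sketch. It is not taken from the recipe's own preparation scripts; the names Utterance, parse_metadata_line and kaldi_tables are hypothetical, and the sketch assumes exactly seven tab-separated fields per line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    file_id: str
    utt_id: str
    spk_id: str
    start: float
    end: float
    wav_path: str
    text: str


def parse_metadata_line(line: str) -> Utterance:
    """Parse one tab-separated metadata line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
    file_id, utt_id, spk_id, start, end, wav_path, text = fields
    utt = Utterance(file_id, utt_id, spk_id, float(start), float(end),
                    wav_path, text.strip())
    if utt.end <= utt.start:
        raise ValueError(f"{utt_id}: end time must be greater than start time")
    return utt


def kaldi_tables(utts):
    """Build the lines of four standard Kaldi data-directory files.

    Kaldi expects these files to be sorted by their first column, so each
    table is sorted by its key here.
    """
    by_utt = sorted(utts, key=lambda u: u.utt_id)
    recordings = {u.file_id: u.wav_path for u in utts}
    return {
        # wav.scp: <recording-id> <path-to-audio>
        "wav.scp": [f"{rid} {recordings[rid]}" for rid in sorted(recordings)],
        # segments: <utterance-id> <recording-id> <start> <end>
        "segments": [f"{u.utt_id} {u.file_id} {u.start:.3f} {u.end:.3f}"
                     for u in by_utt],
        # text: <utterance-id> <transcription>
        "text": [f"{u.utt_id} {u.text}" for u in by_utt],
        # utt2spk: <utterance-id> <speaker-id>
        "utt2spk": [f"{u.utt_id} {u.spk_id}" for u in by_utt],
    }
```

Kaldi's documentation recommends making the speaker ID a prefix of the utterance ID so that utt2spk sorts consistently by both utterance and speaker; the sketch leaves utt_id untouched and assumes the metadata already follows that convention.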

3. Lexicon

A lexicon is a pronunciation dictionary: it maps each word in the dataset to the sequence of phonemes that describes how the word is pronounced. Each line of dataset/lexicon.txt contains a word from your dataset followed by its space-separated phonemes. The file lexicon.txt.example contains an example subset of a Persian lexicon.

word1 p1 p2 p3
word2 p5 p1 p2 p4
...
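Before training, it can help to check that every word in the transcriptions has a lexicon entry. The following Python sketch shows one way to do that; load_lexicon and find_oov_words are hypothetical helper names, not part of the recipe, and the sketch assumes transcriptions are whitespace-tokenized in the same way as the lexicon's words.

```python
from collections import Counter


def load_lexicon(lines):
    """Map each word to a list of pronunciations (each a list of phonemes).

    Lines with fewer than two tokens (blank lines, placeholders) are skipped.
    A word may appear on several lines if it has several pronunciations.
    """
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        word, phonemes = parts[0], parts[1:]
        lexicon.setdefault(word, []).append(phonemes)
    return lexicon


def find_oov_words(transcriptions, lexicon):
    """Count transcription words that have no entry in the lexicon."""
    return Counter(
        word
        for text in transcriptions
        for word in text.split()
        if word not in lexicon
    )
```

A typical use would read dataset/lexicon.txt with open(..., encoding="utf-8"), pass the file object to load_lexicon, and then call find_oov_words on the transcription column of dataset/train.txt and dataset/test.txt.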

4. Language Model

A language model in ARPA (.arpa) format is required. It can be built with the KenLM toolkit.
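As a rough sketch of that step, the Python snippet below wraps KenLM's lmplz estimator. It assumes that lmplz is on your PATH and that you have a plain-text corpus with one whitespace-tokenized sentence per line; the function names and any file paths you pass in are hypothetical. The default order of 2 mirrors the bigram model reported in the Results section.

```python
import shutil
import subprocess


def build_lmplz_command(corpus_path, arpa_path, order=2):
    """Return the argument list for KenLM's lmplz.

    -o sets the n-gram order, --text names the input corpus and
    --arpa names the output file.
    """
    if order < 1:
        raise ValueError("n-gram order must be at least 1")
    return ["lmplz", "-o", str(order),
            "--text", str(corpus_path),
            "--arpa", str(arpa_path)]


def train_arpa(corpus_path, arpa_path, order=2):
    """Run lmplz and raise if it is missing or exits with an error."""
    if shutil.which("lmplz") is None:
        raise FileNotFoundError("lmplz not found on PATH; build KenLM first")
    subprocess.run(build_lmplz_command(corpus_path, arpa_path, order),
                   check=True)
```

On very small corpora lmplz can stop with a discount-estimation error; KenLM provides a --discount_fallback option for that case, which you could append to the command if needed.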

Pipeline Overview

The following steps are included in the training pipeline:

  1. Dataset preparation (stages 1-2)
  2. MFCC feature extraction (stage 3)
  3. Training a context-independent (monophone) HMM (stage 4)
  4. Aligning monophone states to training samples (stage 5)
  5. Training a context-dependent (triphone) HMM (stage 5)
  6. Extracting mel features (stage 6)
  7. Aligning triphone states to training samples (stage 7)
  8. Training a hybrid DNN-HMM model (stage 7)

Usage

To run only a range of stages, pass the first and last stage as arguments:

./run.sh 1 3 # stage=1 stop_stage=3
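To make the effect of the two arguments concrete, the Python sketch below maps a stage range to the pipeline steps listed in the overview. It assumes that both stage and stop_stage are inclusive, which is a common convention in Kaldi-style recipes but is not stated explicitly here; the PIPELINE table and the steps_in_range function are illustrative and not part of run.sh.

```python
# Pipeline steps and the stage numbers given in the overview.
PIPELINE = [
    ("Dataset preparation", (1, 2)),
    ("MFCC feature extraction", (3,)),
    ("Training a context-independent (monophone) HMM", (4,)),
    ("Aligning monophone states to training samples", (5,)),
    ("Training a context-dependent (triphone) HMM", (5,)),
    ("Extracting mel features", (6,)),
    ("Aligning triphone states to training samples", (7,)),
    ("Training a hybrid DNN-HMM model", (7,)),
]


def steps_in_range(stage, stop_stage):
    """List the steps with at least one stage in [stage, stop_stage].

    Assumption: both bounds are inclusive.
    """
    if stage > stop_stage:
        raise ValueError("stage must not exceed stop_stage")
    return [
        name
        for name, stages in PIPELINE
        if any(stage <= s <= stop_stage for s in stages)
    ]
```

Under that assumption, steps_in_range(1, 3) selects dataset preparation and MFCC feature extraction, matching the command above.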

Results

The models were trained on a Persian dataset containing approximately 40 hours of speech, and decoding used a bigram language model.

Model                      WER (%)
Context-independent HMM    56.44
Context-dependent HMM      33.30
Hybrid DNN-HMM             26.56
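For readers unfamiliar with the metric: word error rate (WER) is the word-level edit distance (substitutions, deletions and insertions) between hypotheses and references, divided by the number of reference words. The Python sketch below illustrates that definition. It is not the scoring code used to produce the numbers above, which presumably come from Kaldi's own scoring scripts, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent over paired transcriptions."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must be paired")
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    n_words = sum(len(r.split()) for r in references)
    if n_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / n_words


def relative_reduction(baseline, improved):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - improved) / baseline
```

Applied to the table, relative_reduction(56.44, 33.30) is about 41.0 and relative_reduction(33.30, 26.56) is about 20.2, so the triphone HMM removes roughly 41% of the monophone system's errors and the DNN-HMM removes roughly a further 20% of the triphone system's errors.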
