Automatic Speech Recognition for Regional Indian Languages

INTRODUCTION

This project implements automatic speech recognition (ASR) algorithms using Hidden Markov Models (HMMs) for regional Indian languages. We recorded our own data for Tamil digits, Telugu digits and words, and English continuous speech, and used external datasets for Hindi continuous speech and English digits. The HMM-based systems are built with hmmlearn (a Python library) and HTK (a toolkit). We also implemented a Deep Neural Network (DNN) based system for comparison and present our analysis. The implementations are:

hmmlearn for Tamil, Telugu and English Digits, and Telugu Words Recognition
HTK for Hindi Continuous Speech and Telugu Words Recognition
Deep Neural Network (DNN) for Tamil and Telugu Digits Recognition

LITERATURE REVIEW

Automatic Speech Recognition (ASR) is a well-researched field. The use of HMMs for ASR is covered thoroughly in The Application of Hidden Markov Models in Speech Recognition. The paper presents the core architecture of an HMM-based Large Vocabulary Continuous Speech Recognition (LVCSR) system and then describes ways to achieve state-of-the-art performance. There is also a recent seminar report, Hidden Markov Model and Speech Recognition, which concisely explains the Forward algorithm, the Viterbi algorithm and the Baum-Welch algorithm in the context of speech recognition with HMMs.
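As a rough illustration of two of the algorithms named above, the following sketch runs the Forward algorithm and the Viterbi algorithm on a toy discrete HMM. The two-state model and its probabilities are invented for this example and are not taken from the cited papers or from this project. Practical systems work in log-space to avoid underflow on long sequences.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Forward algorithm: log P(obs | model), summed over all state paths."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return math.log(sum(alpha))

def viterbi(obs, start, trans, emit):
    """Viterbi algorithm: the single most probable state path."""
    n_states = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    backptr = []
    for o in obs[1:]:
        step, new_delta = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: delta[p] * trans[p][s])
            step.append(best_prev)
            new_delta.append(delta[best_prev] * trans[best_prev][s] * emit[s][o])
        backptr.append(step)
        delta = new_delta
    # Trace the best path backwards from the most probable final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for step in reversed(backptr):
        state = step[state]
        path.append(state)
    return path[::-1]

# Toy model (hypothetical numbers): 2 hidden states, 2 observation symbols.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1]

log_likelihood = forward_log_likelihood(obs, start, trans, emit)
best_path = viterbi(obs, start, trans, emit)
```

Baum-Welch, the third algorithm, re-estimates `start`, `trans` and `emit` from data and is what a library call such as hmmlearn's `fit` performs internally.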

In the past few years, there has been significant work on developing speech recognition systems using HMMs for regional Indian languages. Syllable Based Continuous Speech Recognition for Tamil Language uses MFCC feature vectors and an acoustic HMM to develop a recognition system for Tamil. We used a similar methodology to develop a recognition system for Telugu words using HTK, a toolkit for building HMMs. Grapheme Gaussian Model and Prosodic Syllable Based Tamil Speech Recognition System builds upon this system and reports an accuracy of 77% on a dataset of 20 Tamil words, with 2 speakers and 2 utterances each. Implementing it was, however, beyond the scope of this project. HTK Based Speech Recognition Systems for Indian Regional languages: A Review summarizes HTK-based speech recognition systems developed for 13 regional languages, including Tamil, Telugu, Hindi and English, along with the best accuracies obtained.

Automatic Speech Recognition Systems for Regional Languages in India argues that Deep Neural Networks (DNNs) must be more efficient and accurate for speech recognition.

DATASETS

Tamil Digits

The Tamil digits dataset consists of audio files recorded in ‘.wav’ format. Each file contains the utterance of one Tamil digit from 0-9. The length of each file is approximately 1 second. A total of 230-250 samples are present with each digit having around 13-15 samples. The dataset can be accessed here.

The digit-label-utterance mapping is given in the following table.

Digit	Label	Utterance
Zero	0	Poojyam
One	1	Onnu
Two	2	Rendu
Three	3	Munnu
Four	4	Naalu
Five	5	Anju
Six	6	Aaru
Seven	7	Yezhu
Eight	8	Yettu
Nine	9	Ombodu

Telugu Digits

The Telugu digits dataset consists of audio files recorded in ‘.wav’ format. Each file contains the utterance of one Telugu digit from 1-10. A total of ~60 samples are present with each digit having around 6-7 samples. The dataset can be accessed here.

The digit-label-utterance mapping is given in the following table.

Digit	Label	Utterance
One	1	Okati
Two	2	Rendu
Three	3	Mudu
Four	4	Nalugu
Five	5	Aidu
Six	6	Aaru
Seven	7	Edu
Eight	8	Enimidi
Nine	9	Tommidi
Ten	10	Padi

Telugu Words

The Telugu words dataset consists of audio files recorded in ‘.wav’ format. Each file contains the utterance of one Telugu word. A total of 80 samples are present with each word having 4 samples. The dataset can be accessed here.

The Telugu-English word mapping is given in the following table.

Word	Meaning
abbayi	boy
amma	mother
ammayi	girl
andarum	all
batuku	everyone
bojanum	meal
chudama	check
cinnema	movie
dhairyam	courage
kalisi	together
kannu	eye
kodatanu	beats
konchum	slightly
manum	us
meeru	you
nanna	father
nenu	I
pinni	aunt
sonthum	ourselves
yevaru	who

English (Indian Accent) Continuous Speech

Three speakers recorded data for English continuous speech in an Indian accent.

Speaker 1 (Male, 20 yrs):

File	Duration
rec1.wav	6:49
rec2.wav	7:04
rec3.wav	13:34
rec4.wav	4:07
rec5.wav	7:53

Speaker 2 (Male, 21 yrs):

File	Duration
rec1.wav	18:14
rec2.wav	30:59

Speaker 3 (Male, 21 yrs):

File	Duration
rec1.wav	8:36
rec2.wav	10:29
rec3.wav	11:25
rec4.wav	9:54

Externally Obtained Datasets:

Hindi Continuous Speech:

The dataset consists of 150 sentences in Hindi with 7 different speakers for each. It can be accessed here.

English Digits:

The dataset can be accessed here.

IMPLEMENTATIONS

HMMLEARN FOR TAMIL, TELUGU AND ENGLISH DIGITS, AND TELUGU WORDS RECOGNITION

Dependencies

Python (version 2.7.*)
hmmlearn
python_speech_features

Tamil Digits

Note: To reproduce the results, refer to this section for the script and weight file.

Training Results:

Accuracy: 61.48%

Testing Results:

Accuracy: 60.97%

Summary:

Telugu Digits

Note: To reproduce the results, refer to this section for the script and weight file.

Training Results:

Accuracy: 50.24%

Testing Results:

Accuracy: 58.06%

Telugu Words

Note: To reproduce the results, refer to this section for the script and weight file.

Training Results:

Accuracy: 65%

Testing Results:

Accuracy: 60%

Summary:

English Digits

Note: To reproduce the results, refer to this section for the script and weight file.

Entire Dataset

Training Results:

Accuracy: 96.75%

Testing Results:

Accuracy: 94%

Limited Dataset

We used 20% of the original test data and took 15 samples per digit.
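One reading of this setup is a fixed cap of 15 recordings per digit. A minimal sketch of such a cap follows; the file names, the helper `limit_per_label` and the 50-recordings-per-digit starting point are hypothetical and only serve to make the example runnable.

```python
import random
from collections import defaultdict

def limit_per_label(samples, per_label=15, seed=0):
    """Keep at most `per_label` randomly chosen samples for each label."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    limited = []
    for label in sorted(by_label):
        paths = by_label[label]
        chosen = rng.sample(paths, min(per_label, len(paths)))
        limited.extend((path, label) for path in chosen)
    return limited

# Hypothetical file list: 50 recordings for each digit 0-9.
full = [("digit{}_{}.wav".format(d, i), d) for d in range(10) for i in range(50)]
limited = limit_per_label(full)
```

Capping every digit at the same count keeps the reduced set balanced, which makes it comparable in size to the self-recorded Tamil and Telugu sets.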

Training Results:

Accuracy: 60%

Testing Results:

Accuracy: 60%

HTK FOR HINDI CONTINUOUS SPEECH AND TELUGU WORDS RECOGNITION

HTK Installation (Linux)

Follow the installation steps mentioned here: https://github.com/conbitin/htk3.5-install

Hindi Continuous Speech

Create a fork of this repository: https://github.com/KunalDhawan/ASR-System-for-Hindi-Language/tree/master/HTK and clone it.
Download the Hindi dataset mentioned above.
Store the downloaded 'data' directory in the HTK folder of the cloned repository.
Follow the steps mentioned here.

Telugu Words

Note: Our forked repository can be found here.

Place the data in the corresponding train and test directories under ./data.
Prepare a transliteration file and a phone-level lexicon file (hindiSentences150.txt and lexicon.txt respectively in the original repository) for all the words present in the speech samples, and place them in ./doc and ./lm respectively.
Now go to the scripts_sh_pl_py folder and set the HTK_home variable in master.sh to the absolute path of your HTK directory.
Also give read and execute permissions to all the scripts there: chmod a+rx *.sh *.pl *.py
Now cd into the parent directory and run the commands listed below, in order. Together they:
1. Generate env var and mfcc features
2. Write the transcription
3. Initialize PDF of each phone model
4. Fit the data
5. Evaluate the output

scripts_sh_pl_py/master.sh HCOPY
scripts_sh_pl_py/master.sh LEXICON
scripts_sh_pl_py/master.sh HCOMPV
scripts_sh_pl_py/master.sh HEREST
scripts_sh_pl_py/master.sh ALIGN
scripts_sh_pl_py/master.sh HVITE_MONO
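Since the stages must run in this order and a later stage is of little use if an earlier one fails, the sequence can be wrapped in a small driver. The following Python 3 sketch is a hypothetical convenience wrapper and not part of the repository; it only assumes the script path and stage names shown above, and defaults to a dry run that prints the commands.

```python
import subprocess

# Stage names exactly as passed to master.sh above, in execution order.
STAGES = ["HCOPY", "LEXICON", "HCOMPV", "HEREST", "ALIGN", "HVITE_MONO"]

def stage_commands(script="scripts_sh_pl_py/master.sh", stages=STAGES):
    """Build one command (as an argument list) per stage."""
    return [[script, stage] for stage in stages]

def run_pipeline(dry_run=True):
    """Run every stage in order; stop at the first one that fails."""
    for command in stage_commands():
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)  # raises CalledProcessError on failure

run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the parent directory would execute the stages for real.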

DEEP NEURAL NETWORK (DNN) FOR TAMIL AND TELUGU DIGITS RECOGNITION

Dependencies

Numpy
Pandas
Librosa
Pytorch
Sklearn

Tamil Digits

Note: To reproduce the results, refer to this section for the script and weight file.

To compare against the HMM model on the Tamil digits dataset, we trained a deep learning architecture on the same dataset and compared its performance with that of the HMM.

The deep learning model chosen is a Long Short-Term Memory (LSTM) network. LSTMs are a special member of the Recurrent Neural Network (RNN) family and can model each input in the context of earlier inputs. A non-recurrent neural network has no memory, whereas a plain RNN has only limited memory and tends to perform badly on data with long-term temporal dependencies. An LSTM can also decide how much information to keep in its memory, because it has input gates, forget gates and output gates.
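To make the role of the three gates concrete, here is a single LSTM time step written out for a scalar input and a scalar state, following the standard LSTM equations. The weight names in the dictionary `w` are illustrative; a real layer uses weight matrices, and this is not the project's code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for a scalar input and a scalar hidden/cell state."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate memory
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the output
    return h, c

# With all weights at zero every gate is 0.5 and the candidate is 0,
# so half of the previous memory is kept and nothing new is written.
zero_weights = {name: 0.0 for name in
                ["wf", "uf", "bf", "wi", "ui", "bi",
                 "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=zero_weights)
```

A large positive forget-gate bias pushes `f` towards 1, so the cell state passes through almost unchanged; this is the mechanism that lets an LSTM hold information over many frames.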

The LSTM architecture and the other hyper-parameters and functions used are given below:

Architecture

LSTM(
  (rnn): LSTM(input = 81, hidden_neurons = 10, num_layers=2, dropout=0.1)
  (fc): Sequential(Linear(in_features=10, out_features=10, bias=True))
  (output): Softmax(input = 10, output = 1)
)

Hyper-Parameters
- Learning Rate = 0.01
- Loss function used = MSE (Mean Squared Error) Loss
- Optimizer used = Adam Optimizer
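The architecture and hyper-parameters above could be expressed in PyTorch roughly as follows. This is a sketch under several assumptions that the summary does not settle: that 81 is the number of features per frame, that the prediction is read from the last time step, that targets are one-hot vectors for the MSE loss, and that a batch holds 20 utterances of 40 frames each. The class name `DigitLSTM` and the random tensors are placeholders, not the contents of the project notebook.

```python
import torch
from torch import nn

class DigitLSTM(nn.Module):
    def __init__(self, n_features=81, hidden=10, n_classes=10):
        super().__init__()
        # Two stacked LSTM layers with dropout between them.
        self.rnn = nn.LSTM(n_features, hidden, num_layers=2, dropout=0.1,
                           batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, frames, n_features)
        outputs, _ = self.rnn(x)
        last = outputs[:, -1, :]          # top-layer output at the final frame
        return torch.softmax(self.fc(last), dim=1)

model = DigitLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# One optimisation step on random stand-in data.
x = torch.randn(20, 40, 81)
labels = torch.randint(0, 10, (20,))
targets = nn.functional.one_hot(labels, num_classes=10).float()

optimizer.zero_grad()
probs = model(x)
loss = loss_fn(probs, targets)
loss.backward()
optimizer.step()
```

MSE on softmax outputs matches the loss listed above; cross-entropy on the raw linear outputs is the more common choice for classification and may be worth comparing.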

Training Results

We shuffled the samples and trained the model on 220 of them for 100 epochs, using mini-batch gradient descent with a batch size of 20. The results on the training samples are as follows:

Total number of training samples evaluated = 220
Correct predictions = 192
Accuracy = 87.27%
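For reference, the shuffling and batching described for this run can be sketched with the standard library alone. The helper `minibatches` is illustrative, and the count of parameter updates assumes one update per batch.

```python
import random

def minibatches(samples, batch_size, seed=None):
    """Shuffle the samples, then yield consecutive batches."""
    order = list(samples)
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

samples = list(range(220))                 # stand-ins for the 220 training samples
batches = list(minibatches(samples, batch_size=20, seed=0))

updates_per_epoch = len(batches)           # 220 / 20 = 11
total_updates = 100 * updates_per_epoch    # 100 epochs
```

With 11 updates per epoch, 100 epochs amount to 1100 parameter updates on this dataset.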

Confusion Matrix:

Loss Plot for Training:

Testing Results

After training the model, we tested it on a small set of unseen samples.

Total number of test samples = 20
Correct predictions = 13
Accuracy = 65.0%

Confusion Matrix:
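A matrix of this kind, together with the accuracy reported above, can be computed from lists of true and predicted labels. The sketch below uses made-up labels for three classes purely for illustration; they are not the project's predictions.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for three classes.
true_labels = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(true_labels, predicted, n_classes=3)
toy_accuracy = accuracy(matrix)

# The reported test accuracy follows the same formula: 13 correct out of 20.
reported_accuracy = 13 / 20
```

Off-diagonal entries show which digits are mistaken for which, something a single accuracy number hides.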

Telugu Digits

Note: To reproduce the results, refer to this section for the script and weight file.

To compare against the HMM model on the Telugu digits dataset, we trained a deep learning architecture on the same dataset and compared its performance with that of the HMM.

The deep learning model chosen is a Long Short-Term Memory (LSTM) network. LSTMs are a special member of the Recurrent Neural Network (RNN) family and can model each input in the context of earlier inputs. A non-recurrent neural network has no memory, whereas a plain RNN has only limited memory and tends to perform badly on data with long-term temporal dependencies. An LSTM can also decide how much information to keep in its memory, because it has input gates, forget gates and output gates.

The LSTM architecture and the other hyper-parameters and functions used are given below:

Architecture

LSTM(
  (rnn): LSTM(81, 10, num_layers=2, dropout=0.1)
  (fc): Sequential(
    (0): Linear(in_features=10, out_features=10, bias=True)
  )
)

Hyper-Parameters

- Learning Rate = 0.01
- Loss function used = MSE (Mean Squared Error) Loss
- Optimizer used = Adam Optimizer

Training Results

We shuffled the samples and trained the model on 50 of them for 20 epochs, using mini-batch gradient descent with a batch size of 5. The results on the training samples are as follows:

Total number of training samples evaluated = 50
Correct predictions = 45
Accuracy = 90.0%

Confusion Matrix:

Loss Plot for Training:

Testing Results

After training the model, we tested it on a small set of unseen samples.

Total number of test samples = 16
Correct predictions = 10
Accuracy = 62.5%

Confusion Matrix:

COMPARISON

DATASET	IMPLEMENTATION	TRAINING ACCURACY	TESTING ACCURACY	SCRIPT	WEIGHT FILE
Tamil Digits	hmmlearn	61.48%	60.97%	hmm_digits.py	tamil_digits.pkl
Tamil Digits	DNN	87.27%	65%	DNN_Tamil.ipynb	link
Telugu Digits	hmmlearn	50.24%	58.06%	hmm_digits.py	telugu_digits.pkl
Telugu Digits	DNN	90%	62.5%	DNN_TELUGU.ipynb	link
English Digits (entire dataset)	hmmlearn	96.75%	94%	hmm_digits.py	english_digits.pkl
English Digits (limited dataset)	hmmlearn	60%	60%	hmm_digits.py	english_digits_limited.pkl
Telugu Words	hmmlearn	65%	60%	hmm_words.py	telugu_words.pkl
Telugu Words	HTK ❗	➖	➖	➖	➖
Hindi Continuous Speech	HTK	➖	67.35%	➖	➖

CONCLUSION AND FUTURE WORK

The DNN outperforms hmmlearn in training accuracy by a large margin (87.27% vs. 61.48% on Tamil digits, 90% vs. 50.24% on Telugu digits).
The DNN outperforms hmmlearn in testing accuracy by a smaller margin (65% vs. 60.97% on Tamil digits, 62.5% vs. 58.06% on Telugu digits).
Training and testing on the entire English digits dataset gave ~95% accuracy, as opposed to 60% for the limited dataset. This suggests that more data would significantly improve the accuracies on our self-recorded regional language datasets.
The HTK implementation works successfully for continuous speech data. The next step would be to apply it to continuous speech datasets for regional languages.
