GitHub - HarshitSamani/Spoken-Keyword-spotting: In this repository, we explore using a hybrid system consisting of a Convolutional Neural Network and a Support Vector Machine for the Keyword Spotting task.

Spoken Keyword Spotting

Small footprint Spoken Keyword Spotting (KWS)

tags: spoken keyword spotting, kws, continuous speech kws, speech commands, cnn, svm, deep learning, sklearn, tensorflow

About The Project

Spoken Keyword Spotting is the task of identifying predefined words (called keywords) in speech. Rapid research and development in voice-based interaction with machines has driven the adoption of these technologies in everyday life. With devices such as Google Home, Amazon Alexa, and smartphones, speech is increasingly becoming a natural way to interact with devices. However, always-on speech recognition is generally not preferred because it is energy-inefficient and because continuous audio streams from millions of devices to the cloud cause network congestion. Processing that much audio also takes more time, which adds latency, and it can raise privacy issues.

Keyword Spotting (KWS) provides an efficient solution to all of the above issues. Modern voice-based devices first detect predefined keywords, such as "OK Google" or "Alexa", locally on the device. Once such a word is detected, full-scale speech recognition is triggered in the cloud (or on the device). Since the KWS system is always on, it should have a low memory footprint and low computational complexity while maintaining high accuracy and low latency. We explore a hybrid system consisting of a Convolutional Neural Network and a Support Vector Machine for the KWS task.

Built With

This project was built with:

Python v3.8
TensorFlow v2.2

The Google Speech Commands dataset is downloaded and set up automatically if it is not already present, so no manual setup is necessary.
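
The repository's own download code is not reproduced here. As a minimal sketch of the "download only if missing" behaviour described above, the hypothetical helper below (`ensure_dataset`, its parameters, and the `archive.tar.gz` file name are all assumptions for illustration) skips the download when the target directory already holds files, and otherwise fetches and extracts a `.tar.gz` archive from a URL supplied by the caller.

```python
import tarfile
import urllib.request
from pathlib import Path


def ensure_dataset(data_dir: Path, url: str) -> Path:
    """Return data_dir, downloading and extracting the archive at url only if needed.

    Hypothetical helper: the name, arguments, and archive file name are
    illustrative assumptions, not the repository's actual API.
    """
    data_dir = Path(data_dir)

    # A non-empty directory is treated as "dataset already present".
    if data_dir.is_dir() and any(data_dir.iterdir()):
        return data_dir

    data_dir.mkdir(parents=True, exist_ok=True)
    archive_path = data_dir / "archive.tar.gz"

    # Fetch the archive, unpack it next to itself, then remove the archive.
    urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(data_dir)
    archive_path.unlink()

    return data_dir
```

Calling `ensure_dataset(Path("data/speech_commands"), url)` a second time returns immediately, because the directory is no longer empty. The sketch does not verify checksums or recover from a partially extracted archive.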

Model overview

To provide a suitable solution for the KWS setting, we look at a hybrid system consisting of a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM). We train the CNN as a feature extractor that embeds the input into a representation capturing the relevant information. We take the output of the 256-dimensional penultimate dense layer (marked with an arrow in the model figure) as the embedding of the input features. We then train a one-class SVM (OCSVM) with these embeddings as input. The performance of the OCSVM depends heavily on its hyperparameter values, so we tune them using the scikit-optimize library to obtain the best-performing OCSVM.
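
The second stage of this pipeline can be sketched with scikit-learn's `OneClassSVM`. This is an illustration under stated assumptions, not the repository's training code: the 256-dimensional embeddings are synthetic random vectors standing in for the CNN's penultimate-layer output, and the `nu` and `gamma` values are arbitrary placeholders rather than tuned hyperparameters.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
EMBEDDING_DIM = 256  # size of the penultimate dense layer described above

# Assumption: synthetic stand-ins for CNN embeddings. In the real system
# these would come from the trained CNN's 256-dimensional layer.
keyword_train = rng.normal(loc=0.0, scale=1.0, size=(500, EMBEDDING_DIM))
keyword_test = rng.normal(loc=0.0, scale=1.0, size=(100, EMBEDDING_DIM))
non_keyword_test = rng.normal(loc=3.0, scale=1.0, size=(100, EMBEDDING_DIM))

# The one-class SVM is fitted on keyword embeddings only.
# nu and gamma are placeholder values, not tuned hyperparameters.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(keyword_train)

# predict() returns +1 for inliers (keyword-like) and -1 for outliers.
keyword_pred = ocsvm.predict(keyword_test)
non_keyword_pred = ocsvm.predict(non_keyword_test)

keyword_accept_rate = float(np.mean(keyword_pred == 1))
non_keyword_reject_rate = float(np.mean(non_keyword_pred == -1))

print(f"held-out keyword embeddings accepted: {keyword_accept_rate:.2f}")
print(f"non-keyword embeddings rejected:      {non_keyword_reject_rate:.2f}")
```

Because the data here is synthetic, the printed rates say nothing about the real system. The sketch only shows the division of labour: the CNN produces embeddings, and the OCSVM decides whether an embedding looks like a keyword. In a tuning loop, `nu` and `gamma` would be the natural values to search over.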

Results

The key performance metrics of the developed KWS system are listed below.

Specification	Value
Model size	11.4 MB
Model size (Quantized)	978 KB
Accuracy	0.9995
Precision	0.9942
Recall (True Detection Rate)	0.9770
F1 Score	0.9855
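
For reference, the classification metrics in the table follow the standard definitions. The sketch below (function names are illustrative, not taken from the repository) computes them from confusion-matrix counts and checks that the reported F1 score is consistent with the reported precision and recall.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard detection metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called the true detection rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Consistency check using the precision and recall reported in the table.
reported_f1 = f1_from_precision_recall(0.9942, 0.9770)
print(f"{reported_f1:.4f}")  # matches the reported F1 score of 0.9855
```

The confusion-matrix counts behind the table are not given in the source, so `metrics_from_counts` is shown only to make the definitions concrete.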
