CrafyWakeWord is a library focused on AI-based wake word recognition.
We have launched CrafyWakeWord2: the new version of CrafyWakeWord that offers better results.
Link: https://github.com/chijete/CrafyWakeWord2
- Custom wake word recognition.
- Multiple language support.
- Models portable to other platforms.
- TensorFlow and TensorFlow.js supported.
- Step by step explanation.
You can see an online demo here:
- English demo: https://chijete.github.io/CrafyWakeWord_demo/en/
- Spanish demo: https://chijete.github.io/CrafyWakeWord_demo/es/
You can download pre-trained models in multiple languages from this repository: https://github.com/chijete/CrafyWakeWord_models
Tip: If you want to download a single model and not clone the entire repository, you can use this tool to download a single folder from a git repository: https://download-directory.github.io/
With this tool you can create your custom wake word detection model. For example, you can create a model to detect when the user says the word "banana", and then run your own code accordingly.
- Have Python 3 installed.
- Have Miniconda or Anaconda installed.
- Have a verified Google Cloud account (we will use the Google Cloud Text-to-Speech API to improve the dataset, more information below; the free plan is enough).
The first step is to obtain a dataset of transcribed audios. In this library we will use Mozilla Common Voice to obtain the dataset.
Follow these steps:
- Go to https://commonvoice.mozilla.org/en/datasets
- Select the target language from the Language selector.
- Select the latest "Common Voice Corpus" version (do not select "Delta Segment").
- Enter an email, accept the terms and download the file.
- Clone this repository to a folder on your computer using git, or download and unzip this repository using Github's "Code > Download ZIP" option.
- Unzip the downloaded Mozilla Common Voice file and copy the "cv-corpus-..." folder to the folder where you cloned the repository.
Run these commands in your terminal (run conda activate first) or in the Anaconda terminal:
pip install librosa textgrid torchsummary ffmpeg-python pocketsphinx fastprogress chardet PyAudio clang pgvector hdbscan initdb speechbrain
pip install --upgrade google-cloud-texttospeech
pip install --only-binary :all: pynini
or
conda install conda-forge::pynini
conda install -c conda-forge kalpy
pip install montreal-forced-aligner
conda install -c conda-forge sox
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
(This installs the CPU build of PyTorch; full instructions, including GPU acceleration, are at https://pytorch.org/get-started/locally/)
pip install ffmpeg onnx tensorflow numpy onnx_tf tensorflow_probability ipython
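As an optional sanity check (not part of the original setup steps), you can confirm that the main Python packages installed correctly by importing them from the same conda environment:

```python
# Optional sanity check: print the versions of the core packages used later in the pipeline.
import torch
import tensorflow as tf
import librosa
import onnx

print("torch:", torch.__version__)
print("tensorflow:", tf.__version__)
print("librosa:", librosa.__version__)
print("onnx:", onnx.__version__)
```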
Install PostgreSQL from https://www.enterprisedb.com/downloads/postgres-postgresql-downloads . When the installation is finished, add PostgreSQL to the system Path:
In Windows:
- Open the Windows Control Panel.
- Click on "System and Security".
- Select "System".
- In the "Advanced system settings" window, click on the "Environment Variables" button under the "Advanced" tab.
- In the "System variables" section, look for the "Path" variable and click "Edit...".
- Add the path to the PostgreSQL directory to the end of the list. For example, the path might be something like
"C:\Program Files\PostgreSQL\version\bin"
(replace "version" with the version of PostgreSQL you installed).
When finished, close the terminal and reopen it to apply the changes.
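To confirm that the Path change took effect, you can open a new terminal and run, for example:
psql --version
If the PostgreSQL version is printed, the bin directory was added correctly.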
We will use Montreal Forced Aligner to align the audio files from the Mozilla Common Voice dataset. Follow these steps:
- Search for an Acoustic model for your model's target language here: https://mfa-models.readthedocs.io/en/latest/acoustic/index.html
- On the Acoustic model details page, in the Installation section, click "download from the release page".
- At the bottom of the page on Github, in the Assets section, click on the zip file (the first one in the list) to download it.
- Return to the Acoustic model page, and in the Pronunciation dictionaries section, click on the first one in the list.
- On the Pronunciation dictionary details page, in the Installation section, click "download from the release page".
- At the bottom of the page on Github, in the Assets section, click on the dict file (the first one in the list) to download it.
- Copy the two downloaded files to the `mfa` folder within the directory where you cloned the repository.
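If you want to confirm that Montreal Forced Aligner itself was installed correctly by the commands above, you can run the following from your conda environment (this assumes the mfa command is on your PATH):
mfa version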
Edit the `your_config.json` file (a complete example follows the list below):
- `"common_voice_datapath"` is the path, relative to the root directory, where the downloaded Mozilla Common Voice files are located. Example: `"common_voice_datapath": "corpus/cv-corpus-15.0-2023-09-08/en/"`
- `"wake_words"` is the list of words that your model will learn to recognize.
- `"google_credentials_file"` is the path, relative to the root directory, where your Google Cloud access credentials file is located. You can learn how to get your account credentials JSON file in this help article: https://cloud.google.com/iam/docs/keys-create-delete#creating . You can paste the credentials file into the root directory where you cloned the repository.
- `"mfa_DICTIONARY_PATH"` is the path, relative to the root directory, where your downloaded Montreal Forced Aligner pronunciation dictionary file is located.
- `"mfa_ACOUSTIC_MODEL_PATH"` is the path, relative to the root directory, where your downloaded Montreal Forced Aligner acoustic model file is located.
- `"dataset_language"` is the ISO 639-1 code of the target language. Example: `"en"`
- `"window_size_ms"` is the length, in milliseconds, of the model's listening window.
- `"train_epochs"` is the number of epochs for which the model will be trained.
- `"add_vanilla_noise_to_negative_dataset"` determines whether to add the base noise to the negative dataset.
- `"voices_generation_with_google"` determines whether or not to generate synthetic voices with Google Cloud.
- `"custom_dataset_path"` (string or empty string) is the path to the directory of your custom dataset. See the "Custom datasets" section for more information.
- `"tts_generated_clips"` configures clip generation with the Google Cloud Text-to-Speech API:
  - `"rate"`: speaking-rate range of the voices of the generated audios (start, stop and step for np.arange). Min 0.25, max 4.0.
  - `"pitch"`: pitch range of the voices of the generated audios (start, stop and step for np.arange). Min -20.0, max 20.0.
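For reference, a filled-in `your_config.json` might look like the sketch below. The key names follow the list above, but the concrete values are only illustrative, and the exact shape of the nested `"rate"` and `"pitch"` entries (shown here as `[start, stop, step]` arrays fed to `np.arange`) is an assumption — check the `your_config.json` shipped with the repository for the authoritative format.

```json
{
  "common_voice_datapath": "corpus/cv-corpus-15.0-2023-09-08/en/",
  "wake_words": ["banana"],
  "google_credentials_file": "google_credentials.json",
  "mfa_DICTIONARY_PATH": "mfa/english_us_arpa.dict",
  "mfa_ACOUSTIC_MODEL_PATH": "mfa/english_us_arpa.zip",
  "dataset_language": "en",
  "window_size_ms": 750,
  "train_epochs": 70,
  "add_vanilla_noise_to_negative_dataset": true,
  "voices_generation_with_google": true,
  "custom_dataset_path": "",
  "tts_generated_clips": {
    "rate": [0.75, 1.5, 0.25],
    "pitch": [-4.0, 4.0, 2.0]
  }
}
```

Note that np.arange excludes the stop value: for example, `np.arange(0.75, 1.5, 0.25)` yields speaking rates 0.75, 1.0 and 1.25.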
Run these commands, in order, within your conda environment:
- `python dataset_generation.py` (sorts the Mozilla Common Voice audio files)
- `python align.py` (runs Montreal Forced Aligner to align the audios)
- `python align_manage.py` (organizes the results of Montreal Forced Aligner)
- `python train_model.py` (generates additional training data using the Google Cloud Text-to-Speech API and by applying noise, then prepares and trains the final model)
You can test wake word detection by running the `use_model.py` file and saying the words in order, as shown in the console. (A microphone must be connected.)
Note: This tutorial is primarily designed for Windows.
The resulting file when creating a model with CrafyWakeWord is a PyTorch model file (.pt). You can port this model to other platforms such as ONNX or TensorFlow.
- Verify that the PyTorch model is located at `dataset/model_trained.pt`
- Run this command in the root of the directory where you cloned the repository: `python convert_to_onnx.py`
- The ONNX model will be saved to `dataset/onnx_model_trained.onnx`
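As an optional check (a minimal sketch, not part of `convert_to_onnx.py`), you can verify that the exported file is a structurally valid ONNX graph using the `onnx` package installed earlier:

```python
# Optional: structurally validate the exported ONNX model.
import onnx

model = onnx.load("dataset/onnx_model_trained.onnx")
onnx.checker.check_model(model)  # raises an exception if the graph is malformed

print("Inputs:", [i.name for i in model.graph.input])
print("Outputs:", [o.name for o in model.graph.output])
```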
- Verify that the PyTorch model is located at `dataset/model_trained.pt`
- Run this command in the root of the directory where you cloned the repository: `python convert_to_onnx.py`
- Then run this command: `python convert_onnx_to_tf.py`
- The TensorFlow model will be saved in `dataset/tf_model_trained`, and the TensorFlow Lite model will be saved in `dataset/tf_model_trained.tflite`
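If you want to confirm that the TensorFlow Lite export loads correctly and to inspect its tensors, a minimal sketch with the standard `tf.lite.Interpreter` looks like this (the exact input/output shapes depend on your model and `window_size_ms`, so none are assumed here):

```python
# Optional: load the exported TFLite model and list its input/output tensors.
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="dataset/tf_model_trained.tflite")
interpreter.allocate_tensors()

for detail in interpreter.get_input_details():
    print("Input:", detail["name"], detail["shape"], detail["dtype"])
for detail in interpreter.get_output_details():
    print("Output:", detail["name"], detail["shape"], detail["dtype"])
```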
After porting the model to TensorFlow, run the following commands:
- `conda create -n tfjsconverter python=3.6.8` (only on the first run)
- `conda activate tfjsconverter`
- `pip install tensorflowjs[wizard]` (only on the first run)
- `tensorflowjs_wizard`

Then answer the wizard's prompts as follows:
- ? Please provide the path of model file or the directory that contains model files.
dataset/tf_model_trained
- ? What is your input model format?
Tensorflow Saved Model *
- ? What is tags for the saved model?
serve
- ? What is signature name of the model?
serving_default
- ? Do you want to compress the model?
No compression (Higher accuracy)
- ? Please enter shard size (in bytes) of the weight files?
4194304
- ? Do you want to skip op validation?
No
- ? Do you want to strip debug ops?
Yes
- ? Do you want to enable Control Flow V2 ops?
Yes
- ? Do you want to provide metadata?
Press ENTER.
- ? Which directory do you want to save the converted model in?
dataset/web_model
- The TensorFlow.js model will be saved in `dataset/web_model`
Before training a new model for the same wake words, make a backup copy of, and then delete, the following files and directories (an optional helper sketch follows the list):
dataset/tf_model_trained/
dataset/web_model/
dataset/model_data.json
dataset/model_trained.pt
dataset/onnx_model_trained.onnx
dataset/tf_model_trained.tflite
dataset/zmuv.pt.bin
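If you prefer not to move and delete these by hand, the following sketch (an illustrative helper, not a script included in the repository) moves the listed artifacts into a timestamped backup folder:

```python
# Move previous training artifacts into a timestamped backup folder before retraining.
import shutil
from datetime import datetime
from pathlib import Path

artifacts = [
    "dataset/tf_model_trained",
    "dataset/web_model",
    "dataset/model_data.json",
    "dataset/model_trained.pt",
    "dataset/onnx_model_trained.onnx",
    "dataset/tf_model_trained.tflite",
    "dataset/zmuv.pt.bin",
]

backup_dir = Path("backups") / datetime.now().strftime("%Y%m%d_%H%M%S")
backup_dir.mkdir(parents=True, exist_ok=True)

for item in artifacts:
    path = Path(item)
    if path.exists():
        shutil.move(str(path), str(backup_dir / path.name))
        print("Backed up", path)
```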
Before training a completely new model, make a backup copy of, and then delete, the following directory:
dataset/
You may need to update the contents of the following folders if you change the language:
corpus/
mfa/
To improve model training you can add a custom dataset.
The dataset must have a format similar to Mozilla Common Voice: an audio dataset with its corresponding transcription.
To add a custom dataset you must create a directory in the root of the project with the following structure:
- `clips/` (mandatory): a directory containing all the audio clips in the dataset, in MP3 format.
- `train.csv` (mandatory): a table in CSV format with the columns "path" and "sentence". The "path" column must contain the full file name of the audio clip (example: "audio_123.mp3"); the audio clips themselves must be saved inside the `clips/` folder. The "sentence" column must contain the complete transcription of the audio clip (example: "Ducks can fly"). The audio clips that will be used for training must be listed in this file.
- `dev.csv` (optional): same structure as `train.csv`. The audio clips that will be used for validation (dev) must be listed in this file.
- `test.csv` (optional): same structure as `train.csv`. The audio clips that will be used for testing must be listed in this file.
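For illustration only (the directory name, file names and sentences below are made up), a minimal custom dataset could look like this:

```
custom_dataset/
├── clips/
│   ├── audio_001.mp3
│   └── audio_002.mp3
└── train.csv
```

with `train.csv` containing one row per clip:

```csv
path,sentence
audio_001.mp3,Ducks can fly
audio_002.mp3,I never said the word banana
```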
To use the custom dataset, before performing the training steps, set the value of `"custom_dataset_path"` in `your_config.json` to the path of the directory where the custom dataset is located (relative to the root directory). Example: "custom_dataset/". If you do not want to use a custom dataset, set the value of `"custom_dataset_path"` to an empty string.
This library was developed following these instructions: https://github.com/rajashekar/WakeWordDetector/
We thank Rajashekar very much for his excellent work and explanation captured here: https://www.rajashekar.org/wake-word/
Additional thanks to: