
speech-assistant


Welcome

Welcome to Speech-Assistant!

Speech-Assistant is an offline personal PC assistant that combines AI-powered speech-to-text dictation with conversational voice interactions driven by large language models.

It is a desktop application for both Linux and Windows that provides GUI interfaces for communicating with the assistant and for controlling speech detection. It also works as a real-time, offline speech-to-text dictation (or translation) program. It uses the distil-whisper models from HuggingFace, which produce accurate, fully structured transcriptions complete with proper punctuation and syntax. Distil-Whisper is a distilled version of OpenAI's Whisper model: it is 6 times faster, 49% smaller, and performs within 1% word error rate of Whisper on speech it has never seen before. The research is based on this repo.

The speech-to-text assistant writes spoken words directly at the keyboard cursor. To use it, hold down the hotkey combination of the Windows key (Super) and Shift to begin recording, and release it to stop. Your speech is transcribed (or translated) in real time and the result is typed for you at the cursor. The Whisper models are most effective at transcribing a full sequence of speech rather than word by word; for word-by-word approaches, check out nerd-dictation, which implements speech-to-text with the vosk models, as well as WhisperLive.
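
The hold-to-record flow can be pictured with the minimal sketch below. It is illustrative only, not the project's actual implementation; it assumes the transformers, sounddevice, numpy, and pynput packages are available and, for simplicity, uses Shift alone as the trigger instead of Super + Shift.

    # Sketch of the hold-to-record dictation loop (illustrative, not the repo's code).
    import numpy as np
    import sounddevice as sd
    from pynput.keyboard import Controller, Key, Listener
    from transformers import pipeline

    SAMPLE_RATE = 16000  # Whisper models expect 16 kHz mono audio
    asr = pipeline("automatic-speech-recognition", model="distil-whisper/distil-small.en")
    typer = Controller()
    frames, recording = [], False

    def on_audio(indata, frame_count, time_info, status):
        # Collect microphone chunks only while the hotkey is held down.
        if recording:
            frames.append(indata.copy())

    def on_press(key):
        global recording
        if key == Key.shift:  # simplified trigger; the app uses Super + Shift
            recording = True

    def on_release(key):
        global recording
        if key == Key.shift and recording:
            recording = False
            if frames:
                audio = np.concatenate(frames).flatten().astype(np.float32)
                frames.clear()
                text = asr({"raw": audio, "sampling_rate": SAMPLE_RATE})["text"]
                typer.type(text)  # type the transcription at the current cursor

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=on_audio):
        with Listener(on_press=on_press, on_release=on_release) as listener:
            listener.join()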

demo-preview.mp4

How to Use


You can get started on whichever operating system you like. The program was tested on Pop!_OS (based on Ubuntu 22.04) and on Windows 10 and 11. Here are Anaconda's installation instructions. If you are on Windows, make sure you have access to the conda command, either through the Anaconda cmd terminal or by sourcing it directly. Nvidia and AMD GPUs require different packages to run PyTorch, so please follow the instructions appropriate to your hardware to ensure smooth compatibility.
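
Once the environment from the steps below has been created and activated, a quick way to confirm that PyTorch can see a CUDA- or ROCm-capable GPU is the following one-liner (an illustrative check, not part of the repository):

    python -c "import torch; print(torch.cuda.is_available())"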

Steps

  1. Install Ollama from here

  2. Navigate to the speech-assistant repo in a terminal (use the Anaconda CMD on Windows).

  3. Install the dependencies. Please use the command for your GPU type and operating system. Depending on your internet connection, this will take roughly 5-15 minutes (type y and press Enter when asked to confirm the package downloads).

    • Nvidia GPU:

      conda env create -f env-cuda.yml
    • AMD GPU or any CPU Integrated Graphics on Windows:

      conda env create -f env-general-win.yml
    • AMD GPU on Linux:

      conda env create -f env-amd-linux.yml
  4. Activate the conda environment.

    conda activate speech-assistant
  5. Start running the program.

    python main.py
  6. The program is now ready to use!

  7. (Optional) Pull the assistant model of your choice to enable the assistant capabilities; llama3 is recommended. A quick way to test the pulled model is shown after these steps.

    ollama pull llama3:8b
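
As a quick test that the pulled model responds, you can query Ollama's local HTTP API directly. This is a hypothetical snippet, not part of the repository, and assumes the requests package is available in your environment:

    # Hypothetical smoke test for the pulled Ollama model (not part of this repo).
    import requests

    reply = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3:8b", "prompt": "Say hello in one sentence.", "stream": False},
        timeout=60,
    )
    print(reply.json()["response"])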

Speech-to-text Configurations

The program downloads the distil-whisper/distil-small.en model by default and caches it locally in a folder named 'model'. This model consumes ~600 MB of GPU memory; to improve accuracy, you can choose a bigger model in the Options menu. The available model choices are shown below.

Model              Params / M   Rel. Latency   Short-Form WER   Long-Form WER
whisper-tiny.en    39           -              ~15              ~15
distil-small.en    166          5.6            12.1             12.8
distil-medium.en   394          6.8            11.1             12.4
distil-large-v2    756          5.8            10.1             11.6
whisper-large-v2   1550         1.0            9.1              11.7
whisper-large-v3   1550         -              -                -
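
For reference, loading one of these checkpoints with the HuggingFace transformers pipeline and caching it in a local 'model' folder looks roughly like the sketch below; this is illustrative and not the repository's actual loading code:

    # Sketch: load a distil-whisper checkpoint and cache it in a local 'model' folder.
    import torch
    from transformers import pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    asr = pipeline(
        "automatic-speech-recognition",
        model="distil-whisper/distil-small.en",  # swap for distil-medium.en, distil-large-v2, ...
        device=device,
        model_kwargs={"cache_dir": "model"},     # keep the weights next to the program
    )
    print(asr("sample.wav")["text"])             # requires ffmpeg for audio decoding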

Please note that the distil models are currently English-only; whisper-large, by contrast, supports transcription and translation for multiple languages.

Notes and Suggestions

  • You can translate your speech to English in real time using Whisper-Large by going to Options and checking Translate to English (a code-level sketch of this is shown after these notes).
  • Users with dedicated graphics cards will have a better experience running the big models.
  • Make sure to locate your primary sound input device!
  • There is a known issue with PowerShell; use cmd instead and activate the conda environment there.
  • If you install with requirement.txt, the ffmpeg package will be missing at model inference time. It can be installed with Anaconda via conda install ffmpeg -c pytorch.
  • For transcribing on Windows you can also use its built-in dictation service with left Windows + H. However, the Whisper models can be useful for expressive punctuation and formatting, and this implementation allows for private and quick dictation.
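
The translation option mentioned above corresponds roughly to the following use of the Whisper pipeline (an illustrative sketch; the GUI option wraps an equivalent setting):

    # Sketch: translate non-English speech to English with a multilingual Whisper model.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
    result = asr("non_english_sample.wav", generate_kwargs={"task": "translate"})
    print(result["text"])  # English translation of the spoken audio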

Future contributions

Features

  • Add more tools for the agent
  • Add sequential inference for transcription as you talk (WhisperLive)
  • Add the option to choose a custom key binding (GUI)
  • Add a Dockerfile for containerized deployment

Acknowledgements

Distil-whisper paper:

  • Title: Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
  • Authors: Sanchit Gandhi, Patrick von Platen, Alexander M. Rush
  • Year: 2023
  • Link: ArXiv

Sound effect: soundsforyou on pixabay

User icon: iconpacks
