This project is about building an application which synthesises speech from user-provided text. The application is written in Python and uses the Kivy framework for the user interface.
Encoding intonation and emotion remains a significant challenge in the assistive-technology text-to-speech field; overcoming it could enhance the communication experience for people with speech impairments. The aim of Speech Jokey is therefore to let people with communication difficulties express themselves with more intonation, emotion, emphasis and pauses. In addition, the application is specifically designed to be used with eye tracking systems, facilitating the positioning of the cursor between lines and words of a text.
We envision the application as a way to become the DJ of your preferred voice, hence the name Speech Jokey. With the application you create synthesized speech from text that you provide.
The designed logo for the application is currently:
A video showcase of the current state of the running application can be found in the /doc folder.
Speech synthesis is delegated to external speech synthesis engines. The application currently supports the following engines:
- ElevenLabs API
The project is based on Python 3.11, but it also supports lower versions down to 3.9. To install Python, follow the instructions on the Python website.
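As a quick sanity check before installing dependencies, the version range stated above can be tested from the interpreter itself. This is a minimal sketch, not part of the project: the function name and the "untested" label for versions newer than 3.11 are assumptions made for illustration.

```python
import sys

# Version bounds taken from the text above: based on 3.11, supported down to 3.9.
MIN_VERSION = (3, 9)
TARGET_VERSION = (3, 11)


def python_support_status(version=None) -> str:
    """Classify a (major, minor, ...) version tuple against the stated range.

    Hypothetical helper: defaults to the running interpreter's version.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor < MIN_VERSION:
        return "too old"
    if major_minor > TARGET_VERSION:
        # Assumption: newer versions are not mentioned, so treat them as untested.
        return "untested"
    return "supported"


print(f"Python {sys.version_info[0]}.{sys.version_info[1]}: {python_support_status()}")
```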
We use poetry for dependency management. To install poetry, run:
pip install poetry
Make sure to configure poetry to install the virtual environment in the project root. This can be done by running:
poetry config virtualenvs.in-project true
On Debian/Ubuntu-based Linux systems, please install the following packages first:
sudo apt-get install xsel xclip
Installing the virtual environment is done by running:
poetry install --no-root
The dependencies are listed in the pyproject.toml file. To add a new dependency, run:
poetry add <dependency>
The following procedures assume that you have installed the dependencies and that you are working inside the virtual environment.
To run the application, execute the following command in the root of the project:
poetry run python src/main.py
To build the application, execute the following command in the root of the project:
(You might wanna grab a coffee while running this)
poetry run pyinstaller src/main.py --onefile --name SpeechJokey
The generated build specification SpeechJokey.spec can now be found in the root of the project.
This file needs to be modified according to the following steps:
- Import kivy dependencies at the top of the file:
from kivy_deps import sdl2, glew
- Add the source tree directly after COLLECT(exe, by inserting:
Tree('src\\'),
- Add the source dependencies directly after a.datas, by inserting:
*[Tree(p) for p in (sdl2.dep_bins + glew.dep_bins)],
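The three manual edits listed above could also be applied by a small script. The following is only a sketch under stated assumptions: the helper name patch_spec is hypothetical, and it assumes the generated spec file contains COLLECT(exe, and a.datas, literally on their own lines, as the steps above describe. PyInstaller versions that lay out the spec differently would need a different match.

```python
# Text fragments taken from the manual steps above.
KIVY_IMPORT = "from kivy_deps import sdl2, glew"
SOURCE_TREE = r"Tree('src\\'),"
DEP_BINS = "*[Tree(p) for p in (sdl2.dep_bins + glew.dep_bins)],"


def patch_spec(spec_text: str) -> str:
    """Return spec_text with the three edits applied (hypothetical helper).

    Each edit is skipped if its text is already present, so running the
    function twice gives the same result as running it once.
    """
    out = []
    if KIVY_IMPORT not in spec_text:
        out.append(KIVY_IMPORT)  # step 1: import at the top of the file
    need_tree = SOURCE_TREE not in spec_text
    need_deps = DEP_BINS not in spec_text
    for line in spec_text.splitlines():
        out.append(line)
        indent = " " * (len(line) - len(line.lstrip()))
        if need_tree and "COLLECT(exe," in line:
            # step 2: source tree right after the COLLECT(exe, line
            out.append(indent + "    " + SOURCE_TREE)
            need_tree = False
        elif need_deps and line.strip() == "a.datas,":
            # step 3: Kivy binary dependencies right after the first a.datas,
            out.append(indent + DEP_BINS)
            need_deps = False
    return "\n".join(out) + "\n"


# Possible usage from the project root (not executed here):
#   from pathlib import Path
#   spec = Path("SpeechJokey.spec")
#   spec.write_text(patch_spec(spec.read_text()))
```

Reviewing the patched file by hand before rebuilding is still advisable, since the script only matches text and does not parse the spec.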
After these modifications, the application can be finalized by running:
(Should be very quick after the initial build)
poetry run pyinstaller SpeechJokey.spec
Inside the dist output folder, a folder named SpeechJokey can be found. This folder contains the final .exe build of the application.
For a detailed step-by-step guide on how to build a Kivy application, see this written tutorial.
(Keep in mind that the tutorial doesn't use poetry, so every command should be preceded by poetry run.)
To build the application the same way the CI builds it, copy SpeechJokey.spec from .github\static to the project root and then execute the following command in the root of the project:
poetry run pyinstaller SpeechJokey.spec
This is what the application currently looks like.
Some of the screenshots following this are a little different, but hopefully they get the concept across for others to contribute.
The idea of Speech Jokey is to let users edit text they have previously written by loading it into the application's text input.
Editing is made easier by "zooming" the line that contains the cursor: that line is displayed with more line spacing above and below it and with wider gaps between its words.
The editing feature is aimed especially at people who need eye tracking devices to move the cursor.
The buttons (Add Break, Change Pitch, Emphasize) insert SSML tags into the text to control the speech synthesis.
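To illustrate what the three buttons might produce, here is a minimal sketch that builds the corresponding standard SSML elements (break, prosody, emphasis). The function names, default values and the exact tags are assumptions for illustration. The application's actual output, and which tags a given speech engine honours, may differ.

```python
from xml.sax.saxutils import escape, quoteattr


def add_break(duration_ms: int = 500) -> str:
    """SSML pause of the given length (hypothetical 'Add Break' output)."""
    return f'<break time="{duration_ms}ms"/>'


def change_pitch(text: str, pitch: str = "+10%") -> str:
    """Wrap text in a prosody element (hypothetical 'Change Pitch' output)."""
    return f"<prosody pitch={quoteattr(pitch)}>{escape(text)}</prosody>"


def emphasize(text: str, level: str = "strong") -> str:
    """Wrap text in an emphasis element (hypothetical 'Emphasize' output)."""
    return f"<emphasis level={quoteattr(level)}>{escape(text)}</emphasis>"


def speak(*parts: str) -> str:
    """Join fragments into one SSML document.

    Plain-text parts are inserted as-is, so callers should escape them first.
    """
    return "<speak>" + "".join(parts) + "</speak>"


print(speak("Hello", add_break(300), emphasize("world")))
```

Escaping the wrapped text keeps characters such as & and < from breaking the markup.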
An audio file is generated using the selected Text-To-Speech API voice.
The user can play and pause the generated audio.
The final version of the edited text can be saved as a file in audio format.
Git is a version control system. It allows you to keep track of changes made to your code and to collaborate with others. To learn more about Git, see this fundamental beginner tutorial.
Alternatively, you can play the Git game to learn git interactively.
GitHub is a platform for hosting Git repositories. It allows you to collaborate with others on your code. To learn more about GitHub, see this crash course.
VS Code is a code editor. It allows you to write code and to collaborate with others. To learn more about VS Code, see this crash course.
Kivy is a framework for building user interfaces. It allows you to build user interfaces for your application. To learn more about Kivy, watch this playlist for a beginner friendly introduction to the framework.
Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you. For a short introduction to poetry, see this tutorial.