A simple API for speech-to-text transcription with Whisper.
It offloads the heavy lifting of running Whisper to a central GPU server that multiple people can access.
- Transcribes audio files to text using OpenAI Whisper
- Includes a simple static frontend to transcribe audio files (`/`)
- Includes interactive API documentation using the Swagger UI (`/docs`)
- Implements a task queue to handle multiple requests (first in, first out)
- Uses GPU acceleration if available
- Supports loading the model into VRAM on startup OR on first request
- Supports unloading the model after a certain time of inactivity
- Stateless: to prioritize data privacy, the API only stores data in RAM. Audio files are stored using tempfile and are deleted after processing.
- Logs don't contain any transcribed text, and transcription IDs are obfuscated
- Results are deleted from RAM after a given time
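
To illustrate the last point, here is a minimal sketch of an in-memory result store whose entries expire after a fixed lifetime. It is not the project's actual implementation: the class `ExpiringResultStore`, its methods, and the injectable `clock` are hypothetical names chosen for this example. It shows both lazy expiry (an entry is dropped when someone asks for it too late) and a periodic purge.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ExpiringResultStore:
    """In-memory result store with a fixed lifetime per entry.

    Hypothetical helper for illustration; not the project's real class.
    """

    def __init__(
        self,
        lifetime_s: float,
        refresh_on_use: bool = True,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lifetime_s = lifetime_s
        self._refresh_on_use = refresh_on_use
        self._clock = clock
        # task id -> (expiry timestamp, result)
        self._items: Dict[str, Tuple[float, Any]] = {}

    def put(self, task_id: str, result: Any) -> None:
        """Store a result; it expires lifetime_s seconds from now."""
        self._items[task_id] = (self._clock() + self._lifetime_s, result)

    def get(self, task_id: str) -> Optional[Any]:
        """Return the result, or None if it is unknown or expired."""
        entry = self._items.get(task_id)
        if entry is None:
            return None
        expires_at, result = entry
        now = self._clock()
        if now >= expires_at:
            # Lazy expiry: drop the entry as soon as it is requested too late.
            del self._items[task_id]
            return None
        if self._refresh_on_use:
            # Using a result extends its lifetime.
            self._items[task_id] = (now + self._lifetime_s, result)
        return result

    def purge_expired(self) -> int:
        """Remove all expired entries and return how many were removed."""
        now = self._clock()
        expired = [key for key, (t, _) in self._items.items() if now >= t]
        for key in expired:
            del self._items[key]
        return len(expired)
```

A background job could call `purge_expired()` at a fixed interval, while `get()` alone already guarantees that an expired result is never returned.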
This service performs best when it runs on a server with a GPU. For the high-quality models, I recommend a GPU with at least 12GB of VRAM. The RTX 3060 12GB is most likely the cheapest option for this task.
This service is optimized for a multi-user environment. I will discuss two setups:
When you are the only user of this service, you can run it on your local network. This way you can access the service from any device in your network. Use a VPN to access the service from outside your network.
When hosting this service in a more professional environment, we should consider the following:
- should the service be accessible from outside the network?
- who should be able to access the service?
If only users on your local network should be able to access the service and everyone in your network should be able to access it, you can run the service on a server in your network without any further configuration.
If you need to implement access control, I suggest the following:
- use a reverse proxy to terminate SSL
- use OAuth2 to allow access only to users who belong to a certain group
My setup uses the following software:
- NGINX as a reverse proxy
- Keycloak as an identity provider
- oauth2_proxy to handle oauth2 authentication and session tokens
In case you have some questions about the setup or software, feel free to reach out!
Pre-requisites:
- have Docker installed
- install NVIDIA CUDA (if you want to use GPU acceleration)
- install NVIDIA Container Toolkit (if you want to use GPU acceleration)
Create the following `compose.yaml` file:
services:
  whisperAPI:
    image: ghcr.io/mayniklas/whisper_api:latest
    ports:
      - "3001:3001"
    environment:
      - PORT=3001
      - LOAD_MODEL_ON_STARTUP=1
      # - UNLOAD_MODEL_AFTER_S=300
      # - DEVELOP_MODE=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
When not using GPU acceleration, remove the `deploy` section from the `compose.yaml` file.
Run the following command:
docker compose up -d
You can also use `docker` directly:
docker run -d -p 3001:3001 --gpus all ghcr.io/mayniklas/whisper_api:latest
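
Once the container is running, a quick way to confirm that the service answers is to request the Swagger UI at `/docs`. The following is a minimal sketch using only the Python standard library. The function name `docs_reachable` and the injectable `opener` parameter are hypothetical names introduced for this example. The base URL assumes the port mapping `3001:3001` shown above.

```python
import urllib.error
import urllib.request
from typing import Callable


def docs_reachable(
    base_url: str = "http://localhost:3001",
    opener: Callable = urllib.request.urlopen,
    timeout: float = 5.0,
) -> bool:
    """Return True if GET <base_url>/docs answers with HTTP 200."""
    url = base_url.rstrip("/") + "/docs"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout or an HTTP error status.
        return False
```

Calling `docs_reachable()` right after `docker compose up -d` may return `False` for a short while if the service is still starting up.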
Pre-requisites:
- install NVIDIA CUDA
- install ffmpeg (e.g. `sudo apt install ffmpeg`)
Since this is a well-packaged Python project, you don't have to worry about any project-specific installation steps.
- Create a virtual environment
- Install this project in the virtual environment
- Create a systemd service that runs the server
Since I'm personally using NixOS, I created a module that is available through this `flake.nix`.
Add the following input to your `flake.nix`:
{
  inputs = {
    whisper_api.url = "github:MayNiklas/whisper_api";
  };
}
Import the module in your `configuration.nix` and use it:
{ pkgs, config, lib, whisper_api, ... }: {
  imports = [ whisper_api.nixosModules.whisper_api ];
  services.whisper_api = {
    enable = true;
    withCUDA = true;
    loadModelOnStartup = true;
    # unloadModelAfterSeconds = 300;
    listen = "0.0.0.0";
    openFirewall = true;
    environment = { };
  };
}
Pre-requisites:
- install NVIDIA CUDA
- install ffmpeg (e.g. `sudo apt install ffmpeg`)
# clone the repository
git clone https://github.com/MayNiklas/whisper_api.git
# change into the directory
cd whisper_api
# create a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# prepare the environment
pip3 install -e '.[dev]'
# run the server from within the virtual environment
cd src/
uvicorn whisper_api:app --reload --host 127.0.0.1 --port 3001
# alternatively, you can use the following command to run the server
export PORT=3001
export LISTEN=127.0.0.1
whisper_api
# clone the repository
git clone https://github.com/MayNiklas/whisper_api.git
# change into the directory
cd whisper_api
# run the server via nix (using CUDA)
nix run .#whisper_api_withCUDA
# enter the development shell providing the necessary environment
nix develop .#withCUDA
| parameter | description | possible values | default |
|---|---|---|---|
| `PORT` | Port the API is available under | any valid port number | 3001 |
| `LISTEN` | Address the API is available under | any IP or domain you own | 127.0.0.1 |
| `LOAD_MODEL_ON_STARTUP` | Whether the model is loaded on startup | 1 (yes) or 0 (no) | 1 |
| `DEVELOP_MODE` | Develop mode defaults to the smallest model to save time | 1 (yes) or 0 (no) | 0 |
| `UNLOAD_MODEL_AFTER_S` | If set, the model is unloaded after t seconds of inactivity; unset means no unload | any int (0 for instant unload) | 'unset' |
| `DELETE_RESULTS_AFTER_M` | Time after which results are deleted from internal storage | any int | 60 |
| `REFRESH_EXPIRATION_TIME_ON_USAGE` | Whether using a result extends its lifetime | 1 (yes) or 0 (no) | 1 |
| `RUN_RESULT_EXPIRY_CHECK_M` | Interval at which timeout checks are executed | any int (0 enables lazy timeout) | 5 |
| `USE_GPU_IF_AVAILABLE` | Whether the GPU is used when available | 1 (yes) or 0 (no) | 1 |
| `MAX_MODEL` | Largest model to be used for decoding; unset means best possible | name of official model | 'unset' |
| `MAX_TASK_QUEUE_SIZE` | Maximum number of tasks that can be queued in the decoder at the same time before new ones are rejected | any int | 128 |
| `CPU_FALLBACK_MODEL` | The fallback when `MAX_MODEL` is not set and CPU mode is needed | name of official model | medium |
| `LOG_DIR` | Directory to store log file(s) in; "" means 'this directory'; the directory is created if needed | wanted directory name or empty str | "data/" |
| `LOG_FILE` | The name of the log file | arbitrary filename | whisper_api.log |
| `LOG_LEVEL_CONSOLE` | Level of logging for the console | "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL" | |
| `LOG_PRIVACY_MODE` | Don't display full task UUIDs and other sensitive data in the logs | 1 (yes) or 0 (no) | 1 |
| `LOG_LEVEL_FILE` | Level of logging for the file | "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL" | "INFO" |
| `LOG_FORMAT` | Format of the log messages | any valid log message format | *see below* |
| `LOG_DATE_FORMAT` | Format of the date in log messages | any valid date format | "%d.%m. %H:%M:%S" |
| `LOG_ROTATION_WHEN` | Specifies when log rotation should occur | "S", "M", "H", "D", "W0"-"W6", "midnight" | "H" |
| `LOG_ROTATION_INTERVAL` | Interval at which log rotation should occur | any int | 2 |
| `LOG_ROTATION_BACKUP_COUNT` | Number of backup log files to keep | any int | 48 |
| `AUTHORIZED_MAILS` | Mail addresses that are authorized to access special routes (whitespace separated) | whitespace-separated mail addresses | |
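
As an illustration of how such environment variables can be read, here is a minimal sketch that parses a handful of them with the defaults from the table. It is not the project's actual configuration code: `Settings`, `load_settings`, and the helper functions are hypothetical names, and only a subset of the parameters is covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Interpret "1" as yes and anything else as no; unset uses the default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip() == "1"


def _uint(env: Mapping[str, str], name: str, default: Optional[int]) -> Optional[int]:
    """Parse an unsigned int; unset or empty yields the default."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)
    if value < 0:
        raise ValueError(f"{name} must be a non-negative integer")
    return value


@dataclass(frozen=True)
class Settings:
    port: int
    listen: str
    load_model_on_startup: bool
    develop_mode: bool
    unload_model_after_s: Optional[int]  # None means "never unload"
    delete_results_after_m: int
    max_task_queue_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, using the documented defaults."""
    return Settings(
        port=_uint(env, "PORT", 3001),
        listen=env.get("LISTEN", "127.0.0.1"),
        load_model_on_startup=_flag(env, "LOAD_MODEL_ON_STARTUP", True),
        develop_mode=_flag(env, "DEVELOP_MODE", False),
        unload_model_after_s=_uint(env, "UNLOAD_MODEL_AFTER_S", None),
        delete_results_after_m=_uint(env, "DELETE_RESULTS_AFTER_M", 60),
        max_task_queue_size=_uint(env, "MAX_TASK_QUEUE_SIZE", 128),
    )
```

Passing a plain dictionary instead of `os.environ`, for example `load_settings({"PORT": "8080"})`, makes the parsing easy to try out without touching the real environment.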
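The log format is: `"[{asctime}] [{levelname}][{processName}][{threadName}][{module}.{funcName}] {message}"`, using `{` as the format specifier.
All logging parameters follow Python's `logging` module and the `RotatingFileHandler` specification (the `LOG_ROTATION_WHEN` values listed above are those of `logging.handlers.TimedRotatingFileHandler`).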
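
For readers who want to see how these values map onto Python's `logging` module, here is a minimal sketch that builds a file handler from the documented defaults. It is an illustration, not the project's code: `build_file_handler` is a hypothetical name, and the use of `TimedRotatingFileHandler` is an assumption inferred from the accepted `LOG_ROTATION_WHEN` values.

```python
import logging
import logging.handlers
import os

LOG_FORMAT = "[{asctime}] [{levelname}][{processName}][{threadName}][{module}.{funcName}] {message}"
LOG_DATE_FORMAT = "%d.%m. %H:%M:%S"


def build_file_handler(
    log_dir: str = "data/",
    log_file: str = "whisper_api.log",
    when: str = "H",
    interval: int = 2,
    backup_count: int = 48,
    level: int = logging.INFO,
) -> logging.Handler:
    """Create a time-based rotating file handler using the documented defaults."""
    if log_dir:
        # An empty string means "this directory", so there is nothing to create.
        os.makedirs(log_dir, exist_ok=True)
    handler = logging.handlers.TimedRotatingFileHandler(
        os.path.join(log_dir, log_file),
        when=when,
        interval=interval,
        backupCount=backup_count,
    )
    handler.setLevel(level)
    # style="{" selects str.format-style placeholders such as {asctime}.
    handler.setFormatter(
        logging.Formatter(LOG_FORMAT, datefmt=LOG_DATE_FORMAT, style="{")
    )
    return handler
```

With the defaults, the log file rotates every two hours and 48 backups are kept, which corresponds to four days of logs.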
The API provides a `/logs` route, which offers all logs for download.
Access is verified based on the `X-Email` field in the request headers.
A valid input would be: `LOG_AUTHORIZED_MAILS="nik@example.com chris@example.com"`.
Requests from localhost are currently always permitted (want an env option to disable this? Open an issue).
Other privileged routes may come in the future.
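
The check described above can be sketched roughly as follows. This is not the project's implementation: the function names and the set of localhost addresses are assumptions made for this example, and so is the case-insensitive comparison of mail addresses.

```python
from typing import Mapping, Optional, Set

# Assumption: these client hosts count as "localhost".
LOCALHOST_ADDRESSES = {"127.0.0.1", "::1", "localhost"}


def parse_authorized_mails(raw: Optional[str]) -> Set[str]:
    """Split a whitespace-separated list of mail addresses into a set."""
    return {mail.lower() for mail in (raw or "").split()}


def may_access_logs(
    headers: Mapping[str, str],
    client_host: str,
    authorized_mails: Set[str],
) -> bool:
    """Allow localhost unconditionally, otherwise require a known X-Email."""
    if client_host in LOCALHOST_ADDRESSES:
        return True
    # A plain dict is case-sensitive; web frameworks usually provide a
    # case-insensitive header mapping instead.
    mail = headers.get("X-Email", "").strip().lower()
    return bool(mail) and mail in authorized_mails
```

Since the decision rests on a request header, a setup like this only makes sense behind a reverse proxy that sets `X-Email` itself and strips any value supplied by the client.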
The system will automatically try to use the GPU and the best possible model when `USE_GPU_IF_AVAILABLE` and `MAX_MODEL` are not set.
`MAX_MODEL` must be set when CUDA is not available or explicitly disabled via `USE_GPU_IF_AVAILABLE`.
`CPU_FALLBACK_MODEL` is the fallback for the case where GPU mode should use the best possible model but CPU mode should be limited because of its reduced performance.
If `UNLOAD_MODEL_AFTER_S` is set to `0`, the model is not only unloaded almost instantly; internally this also results in busy waiting!
All ints are assumed to be unsigned.
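
The interplay of `USE_GPU_IF_AVAILABLE`, `MAX_MODEL`, and `CPU_FALLBACK_MODEL` can be sketched as a small decision function. This is a simplified reading of the parameter table, not the project's actual logic. `choose_model` and `best_gpu_model` are hypothetical names, the real service presumably picks the best model based on available VRAM, and this sketch treats `MAX_MODEL` as the model to use rather than as an upper bound.

```python
from typing import Optional, Tuple


def choose_model(
    cuda_available: bool,
    use_gpu_if_available: bool = True,
    max_model: Optional[str] = None,
    cpu_fallback_model: str = "medium",
    best_gpu_model: str = "large",  # assumption: chosen elsewhere, e.g. by VRAM
) -> Tuple[str, str]:
    """Return (device, model name) for the given configuration."""
    use_gpu = cuda_available and use_gpu_if_available
    device = "cuda" if use_gpu else "cpu"
    if max_model is not None:
        # An explicitly configured model wins on both devices.
        return device, max_model
    if use_gpu:
        return device, best_gpu_model
    # CPU mode without MAX_MODEL: fall back to the smaller CPU model.
    return device, cpu_fallback_model
```

For example, `choose_model(cuda_available=False)` selects the CPU with the `medium` fallback, while `choose_model(cuda_available=True, max_model="small")` keeps the GPU but uses the `small` model.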
# enable development mode -> use small models
export DEVELOP_MODE=1