The whisperX API is a tool for enhancing and analyzing audio content. This API provides a suite of services for processing audio and video files, including transcription, alignment, diarization, and combining transcript with diarization results.
Swagger UI is available at /docs for all the services. A dump of the OpenAPI definition is also available in the folder app/docs; you can explore it directly in Swagger Editor.
See the WhisperX Documentation for details on whisperX functions.
- In .env you can define the default language using DEFAULT_LANG; if not defined, en is used (you can also set it in the request).
- .env contains the definition of the Whisper model using WHISPER_MODEL (you can also set it in the request).
- .env contains the definition of the logging level using LOG_LEVEL; if not defined, DEBUG is used in development and INFO in production.
- .env contains the definition of the environment using ENVIRONMENT; if not defined, production is used.
- .env contains a boolean DEV to indicate whether the environment is development; if not defined, true is used.
- .env contains a boolean FILTER_WARNING to enable or disable filtering of specific warnings; if not defined, true is used.
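As a rough illustration of the defaults listed above, the following sketch resolves each setting from a mapping of environment variables. The names load_settings and _as_bool, the dictionary keys, and the rule that the fallback log level follows ENVIRONMENT are assumptions made for this example; the application's own configuration code may differ.

```python
import os
from typing import Mapping, Optional


def _as_bool(value: Optional[str], default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve settings using the defaults described in this README."""
    environment = env.get("ENVIRONMENT", "production").lower()
    # Assumption: the fallback log level is chosen from ENVIRONMENT.
    fallback_level = "DEBUG" if environment == "development" else "INFO"
    return {
        "default_lang": env.get("DEFAULT_LANG", "en"),
        # The README states tiny as the default model.
        "whisper_model": env.get("WHISPER_MODEL", "tiny"),
        "log_level": env.get("LOG_LEVEL", fallback_level).upper(),
        "environment": environment,
        "dev": _as_bool(env.get("DEV"), True),
        "filter_warning": _as_bool(env.get("FILTER_WARNING"), True),
    }
```

Calling load_settings({}) shows the values that apply when nothing is set in .env.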
Supported audio formats: .oga, .m4a, .aac, .wav, .amr, .wma, .awb, .mp3, .ogg
Supported video formats: .wmv, .mkv, .avi, .mov, .mp4
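A client can check a file name against these extensions before uploading. The sketch below is only an illustration: the helper name classify_media, the audio/video grouping, and the case-insensitive comparison are assumptions, not the service's own validation logic.

```python
from pathlib import Path

# Extensions copied from the lists above.
AUDIO_EXTENSIONS = {".oga", ".m4a", ".aac", ".wav", ".amr", ".wma", ".awb", ".mp3", ".ogg"}
VIDEO_EXTENSIONS = {".wmv", ".mkv", ".avi", ".mov", ".mp4"}


def classify_media(filename: str) -> str:
    """Return "audio", "video", or "unsupported" based on the file extension."""
    suffix = Path(filename).suffix.lower()  # assumption: matching ignores case
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return "unsupported"
```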
- Speech-to-Text (/speech-to-text)
  - Upload audio/video files for transcription
  - Supports multiple languages and Whisper models
- Speech-to-Text URL (/speech-to-text-url)
  - Transcribe audio/video from URLs
  - Same features as direct upload
- Individual Services:
  - Transcribe (/service/transcribe): Convert speech to text
  - Align (/service/align): Align transcript with audio
  - Diarize (/service/diarize): Speaker diarization
  - Combine (/service/combine): Merge transcript with diarization
- Task Management:
  - Get all tasks (/task/all)
  - Get task status (/task/{identifier})
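The task endpoints above can be polled from a small client. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and because the response schema is not described here, the caller supplies the is_done check (consult the Swagger UI at /docs for the actual fields).

```python
import json
import time
import urllib.request
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:8000"  # local address used in this README


def task_url(identifier: str, base_url: str = BASE_URL) -> str:
    """Build the URL for /task/{identifier}."""
    return f"{base_url.rstrip('/')}/task/{quote(identifier, safe='')}"


def get_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_task(identifier, is_done, interval=2.0, max_polls=30, fetch=get_json):
    """Poll the task endpoint until is_done(payload) is true or polls run out.

    Returns the last payload received, whether or not is_done accepted it.
    """
    payload = None
    for _ in range(max_polls):
        payload = fetch(task_url(identifier))
        if is_done(payload):
            break
        time.sleep(interval)
    return payload
```

The fetch parameter exists so the polling loop can be exercised without a running server.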
The status and result of each task are stored in a database through the SQLAlchemy ORM. The database connection is defined by the environment variable DB_URL; if no value is specified, db.py sets the default database to sqlite:///records.db.
See the SQLAlchemy Engine Configuration documentation for driver definitions if you want to connect to a database other than SQLite.
The structure of the database is described in DB Schema.
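The fallback behaviour described above can be sketched as follows. This is not the content of db.py; resolve_db_url and make_engine are hypothetical names, and treating an empty DB_URL like an unset one is an assumption.

```python
import os
from typing import Mapping

DEFAULT_DB_URL = "sqlite:///records.db"  # default stated in this README


def resolve_db_url(env: Mapping[str, str] = os.environ) -> str:
    """Return DB_URL, or the SQLite default when it is unset or empty."""
    return env.get("DB_URL") or DEFAULT_DB_URL


def make_engine(env: Mapping[str, str] = os.environ):
    """Create a SQLAlchemy engine for the resolved URL."""
    # Imported lazily so the URL logic can be used without SQLAlchemy installed.
    from sqlalchemy import create_engine

    return create_engine(resolve_db_url(env))
```

SQLAlchemy database URLs follow the pattern dialect+driver://username:password@host:port/database, so a PostgreSQL connection would look like postgresql+psycopg2://user:password@host:5432/dbname with your own values substituted.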
Configure compute options in .env:
- DEVICE: Device for inference (cuda or cpu, default: cuda)
- COMPUTE_TYPE: Computation type (float16, float32, int8, default: float16)
Note: When using CPU, COMPUTE_TYPE must be set to int8.
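The CPU constraint is easy to miss because the default COMPUTE_TYPE is float16. A startup check along the following lines would catch it early; resolve_compute is a hypothetical helper written for this example, not part of the application.

```python
import os
from typing import Mapping, Tuple

DEVICES = {"cuda", "cpu"}
COMPUTE_TYPES = {"float16", "float32", "int8"}


def resolve_compute(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read DEVICE and COMPUTE_TYPE and enforce the documented CPU rule."""
    device = env.get("DEVICE", "cuda").lower()
    compute_type = env.get("COMPUTE_TYPE", "float16").lower()
    if device not in DEVICES:
        raise ValueError(f"DEVICE must be one of {sorted(DEVICES)}, got {device!r}")
    if compute_type not in COMPUTE_TYPES:
        raise ValueError(
            f"COMPUTE_TYPE must be one of {sorted(COMPUTE_TYPES)}, got {compute_type!r}"
        )
    # Documented rule: CPU inference requires int8.
    if device == "cpu" and compute_type != "int8":
        raise ValueError("COMPUTE_TYPE must be int8 when DEVICE is cpu")
    return device, compute_type
```

Note that setting only DEVICE=cpu fails this check, since the float16 default still applies to COMPUTE_TYPE.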
WhisperX supports these model sizes:
- tiny, tiny.en
- base, base.en
- small, small.en
- medium, medium.en
- large, large-v1, large-v2, large-v3, large-v3-turbo
- Distilled models: distil-large-v2, distil-medium.en, distil-small.en, distil-large-v3
- Custom models: nyrahealth/faster_CrisperWhisper
Set the default model in .env using WHISPER_MODEL= (default: tiny)
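Since the model can be set both in .env and in the request, a small resolver makes the fallback order explicit. In this sketch the precedence (request value, then WHISPER_MODEL, then tiny) and the name resolve_model are assumptions; the model names are copied from the list above.

```python
import os
from typing import Mapping, Optional

SUPPORTED_MODELS = {
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en",
    "large", "large-v1", "large-v2", "large-v3", "large-v3-turbo",
    "distil-large-v2", "distil-medium.en", "distil-small.en", "distil-large-v3",
    "nyrahealth/faster_CrisperWhisper",
}


def resolve_model(requested: Optional[str] = None,
                  env: Mapping[str, str] = os.environ) -> str:
    """Pick the model name and reject names that are not in the list."""
    # Assumed precedence: request parameter, then WHISPER_MODEL, then tiny.
    name = requested or env.get("WHISPER_MODEL") or "tiny"
    if name not in SUPPORTED_MODELS:
        raise ValueError(f"Unsupported Whisper model: {name!r}")
    return name
```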
- Docker with GPU support (nvidia-docker)
- NVIDIA GPU with CUDA support
- At least 8GB RAM (16GB+ recommended for large models)
- Storage space for models (varies by model size):
  - tiny/base: ~1GB
  - small: ~2GB
  - medium: ~5GB
  - large: ~10GB
To get started with the API, follow these steps:
- Create virtual environment
- Install PyTorch. See the PyTorch website for more details.
- Install whisperX
pip install git+https://github.com/m-bain/whisperx.git
- Install the required dependencies:
pip install -r requirements.txt
The application uses two logging configuration files:
- uvicorn_log_conf.yaml: Used by Uvicorn for logging configuration.
- gunicorn_logging.conf: Used by Gunicorn for logging configuration.
Ensure these files are correctly configured and placed in the app directory.
- Create a .env file and define your Whisper model and Hugging Face token:
HF_TOKEN=<<YOUR HUGGINGFACE TOKEN>>
WHISPER_MODEL=<<WHISPER MODEL SIZE>>
LOG_LEVEL=<<LOG LEVEL>>
- Run the FastAPI application:
uvicorn app.main:app --reload --log-config uvicorn_log_conf.yaml --log-level $LOG_LEVEL
The API will be accessible at http://127.0.0.1:8000.
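The same start command can be assembled from Python, which helps when LOG_LEVEL is stored in upper case: Uvicorn documents its level names in lower case (critical, error, warning, info, debug, trace). The sketch below is an illustration only; uvicorn_command and run_server are hypothetical helpers, and falling back to info when LOG_LEVEL is unset is an assumption.

```python
import os
import subprocess
import sys
from typing import List, Mapping


def uvicorn_command(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the argument list for the uvicorn command shown above."""
    # Lower-case the level because Uvicorn lists its level names in lower case.
    level = env.get("LOG_LEVEL", "info").lower()
    return [
        sys.executable, "-m", "uvicorn", "app.main:app",
        "--reload",
        "--log-config", "uvicorn_log_conf.yaml",
        "--log-level", level,
    ]


def run_server() -> None:
    """Start the development server; blocks until the process exits."""
    subprocess.run(uvicorn_command(), check=True)
```

Call run_server() from the directory that contains the app package and uvicorn_log_conf.yaml.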
- Create a .env file and define your Whisper model and Hugging Face token:
HF_TOKEN=<<YOUR HUGGINGFACE TOKEN>>
WHISPER_MODEL=<<WHISPER MODEL SIZE>>
LOG_LEVEL=<<LOG LEVEL>>
- Build the image using docker-compose.yaml:
# Build and start the image using the compose file
docker-compose up
Alternative approach:
# Build the image
docker build -t whisperx-service .
# Run the container
docker run -d --gpus all -p 8000:8000 --env-file .env whisperx-service
The API will be accessible at http://127.0.0.1:8000.
The models used by whisperX are stored in root/.cache. To avoid downloading the models each time the container starts, you can store the cache in persistent storage. docker-compose.yaml defines a volume, whisperx-models-cache, to store this cache.
- faster-whisper cache: root/.cache/huggingface/hub
- pyannote and other models cache: root/.cache/torch
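To confirm that a persistent cache is actually being reused, you can measure the two cache directories from inside the container. This is a minimal sketch with hypothetical helper names; the cache root is passed in by the caller, and symlinks are skipped so that linked files are not counted twice.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root; 0 if root does not exist."""
    if not root.is_dir():
        return 0
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def cache_report(cache_root: Path) -> dict:
    """Report the size in bytes of the two cache locations named above."""
    return {
        "huggingface/hub": directory_size_bytes(cache_root / "huggingface" / "hub"),
        "torch": directory_size_bytes(cache_root / "torch"),
    }
```

For example, cache_report(Path("/root/.cache")) would apply if the container runs as root with its home at /root, which is an assumption about the image. A report of zeros after a restart suggests the volume is not mounted where the models are written.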
- ctranslate2 Compatibility
  - Only ctranslate2==4.4.0 is supported due to CUDA compatibility issues with CTranslate2, as newer versions have different CUDA requirements (see SYSTRAN/faster-whisper#1086).
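The installed ctranslate2 version can be verified with the standard library before starting the service. The following sketch uses importlib.metadata; the function name check_ctranslate2 and the wording of its messages are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

SUPPORTED_CTRANSLATE2 = "4.4.0"  # version pinned in this README


def check_ctranslate2(get_version=version) -> str:
    """Compare the installed ctranslate2 version with the supported one."""
    try:
        installed = get_version("ctranslate2")
    except PackageNotFoundError:
        return "ctranslate2 is not installed"
    if installed != SUPPORTED_CTRANSLATE2:
        return (
            f"ctranslate2 {installed} found; "
            f"{SUPPORTED_CTRANSLATE2} is the supported version"
        )
    return "ctranslate2 version OK"
```

The get_version parameter defaults to the real lookup and only exists so the check can be tested without the package installed.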
- Environment Variables Not Loaded
  - Ensure your .env file is correctly formatted and placed in the root directory.
  - Verify that all required environment variables are defined.
- Database Connection Issues
  - Check the DB_URL environment variable for correctness.
  - Ensure the database server is running and accessible.
- Model Download Failures
  - Verify your internet connection.
  - Ensure the HF_TOKEN is correctly set in the .env file.
- GPU Not Detected
  - Ensure NVIDIA drivers and CUDA are correctly installed.
  - Verify that Docker is configured to use the GPU (nvidia-docker).
- Warnings Not Filtered
  - Ensure the FILTER_WARNING environment variable is set to true in the .env file.
- Check the logs for detailed error messages.
- Use the LOG_LEVEL environment variable to set the appropriate logging level (DEBUG, INFO, WARNING, ERROR).
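For the "Environment Variables Not Loaded" case, a quick script can confirm that .env parses and that the expected keys have values. This sketch handles only simple KEY=VALUE lines; the helper names are hypothetical, and treating HF_TOKEN and WHISPER_MODEL as the required keys is an assumption based on the example .env shown earlier.

```python
from typing import Dict, Iterable, List

REQUIRED_KEYS = ("HF_TOKEN", "WHISPER_MODEL")  # assumption, see lead-in


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_keys(text: str, required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

Reading the file with open(".env").read() and passing the text to missing_keys returns an empty list when every required key has a value.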
For further assistance, please open an issue on the GitHub repository.