29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -99,6 +99,35 @@ jobs:
push: true
tags: ghcr.io/collabora/whisperlive-cpu:latest

build-and-push-docker-tensorrt:
needs: [run-tests, check-code-format]
timeout-minutes: 20
runs-on: ubuntu-22.04
if: github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/tags/'))
steps:
- uses: actions/checkout@v2

- name: Log in to GitHub Container Registry
uses: docker/login-action@v1
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GHCR_TOKEN }}

- name: Docker Prune
run: docker system prune -af

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1

- name: Build and push Docker GPU image
uses: docker/build-push-action@v2
with:
context: .
file: docker/Dockerfile.tensorrt
push: true
tags: ghcr.io/collabora/whisperlive-tensorrt:latest

build-and-push-docker-gpu:
needs: [run-tests, check-code-format, build-and-push-docker-cpu]
timeout-minutes: 20
59 changes: 53 additions & 6 deletions README.md
@@ -36,7 +36,7 @@ python3 run_server.py --port 9090 \

# running with custom model
python3 run_server.py --port 9090 \
--backend faster_whisper \
-fw "/path/to/custom/faster/whisper/model"
```

@@ -53,24 +53,51 @@ python3 run_server.py -p 9090 \
-trt /home/TensorRT-LLM/examples/whisper/whisper_small \
-m
```
#### Controlling OpenMP Threads
To control the number of threads used by OpenMP, set the `OMP_NUM_THREADS` environment variable. This is useful for managing CPU resources and ensuring consistent performance. If not specified, `OMP_NUM_THREADS` defaults to `1`. You can change this with the `--omp_num_threads` argument:
```bash
python3 run_server.py --port 9090 \
--backend faster_whisper \
--omp_num_threads 4
```
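The resolution order (an exported `OMP_NUM_THREADS` wins over `--omp_num_threads`) can be sketched as follows — `resolve_omp_threads` is a hypothetical helper for illustration, mirroring what `run_server.py` does before importing the server:

```python
import os

def resolve_omp_threads(cli_value: int, env=os.environ) -> str:
    # An explicitly exported OMP_NUM_THREADS takes precedence;
    # otherwise the CLI value (default 1) is written into the environment.
    if "OMP_NUM_THREADS" not in env:
        env["OMP_NUM_THREADS"] = str(cli_value)
    return env["OMP_NUM_THREADS"]
```

Note that the variable must be set before importing libraries that read it at import time, which is why the server sets it before importing `whisper_live.server`.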

#### Single model mode
By default, when running the server without specifying a model, the server instantiates a new Whisper model for every client connection. This has the advantage that the server can serve different model sizes, based on each client's requested size. On the other hand, it also means you have to wait for the model to load when a client connects, and (V)RAM usage increases.

When serving a custom TensorRT model using the `-trt` or a custom faster_whisper model using the `-fw` option, the server will instead only instantiate the custom model once and then reuse it for all client connections.

If you don't want this, set `--no_single_model`.
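The difference between the two modes boils down to a caching pattern; a minimal illustrative sketch (this is not WhisperLive's actual code — `MockModel` stands in for a loaded Whisper backend):

```python
from functools import lru_cache

class MockModel:
    """Stand-in for a loaded Whisper model."""
    def __init__(self, size):
        self.size = size

def per_client_model(size):
    # Default mode: every connection pays the load cost and holds its own copy.
    return MockModel(size)

@lru_cache(maxsize=1)
def single_model(size):
    # Single-model mode (custom -trt / -fw models): load once, reuse everywhere.
    return MockModel(size)
```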


### Running the Client
- Initializing the client with the following parameters:
- `lang`: Language of the input audio, applicable only if using a multilingual model.
- `translate`: If set to `True`, translates from the source language to English (`en`).
- `model`: Whisper model size.
- `use_vad`: Whether to use `Voice Activity Detection` on the server.
- `save_output_recording`: Set to True to save the microphone input as a `.wav` file during live transcription. This option is helpful for recording sessions for later playback or analysis. Defaults to `False`.
- `output_recording_filename`: Specifies the `.wav` file path where the microphone input will be saved if `save_output_recording` is set to `True`.
- `max_clients`: Specifies the maximum number of clients the server should allow. Defaults to 4.
- `max_connection_time`: Maximum connection time for each client in seconds. Defaults to 600.

```python
from whisper_live.client import TranscriptionClient
client = TranscriptionClient(
"localhost",
9090,
lang="en",
translate=False,
model="small",  # also supports HuggingFace model ids, e.g. "Systran/faster-whisper-small"
use_vad=False,
save_output_recording=True,  # Only used for microphone input; defaults to False
output_recording_filename="./output_recording.wav", # Only used for microphone input
max_clients=4,
max_connection_time=600
)
```
It connects to the server running on localhost at port 9090. With a multilingual model, the language of the transcription is detected automatically; you can also use the `lang` option to specify a target language, in this case English ("en"). Set `translate` to `True` to translate from the source language to English, or `False` to transcribe in the source language.

- Transcribe an audio file:
```python
client("tests/jfk.wav")
```
@@ -80,9 +107,14 @@ client("tests/jfk.wav")
client()
```

- To transcribe from an RTSP stream:
```python
client(rtsp_url="rtsp://admin:admin@192.168.0.1/rtsp")
```

- To transcribe from an HLS stream:
```python
client(hls_url="http://as-hls-ww-live.akamaized.net/pool_904/live/ww/bbc_1xtra/bbc_1xtra.isml/bbc_1xtra-audio%3d96000.norewind.m3u8")
```

## Browser Extensions
@@ -96,7 +128,22 @@ client(hls_url="http://as-hls-ww-live.akamaized.net/pool_904/live/ww/bbc_1xtra/b
docker run -it --gpus all -p 9090:9090 ghcr.io/collabora/whisperlive-gpu:latest
```

- TensorRT. See the [TensorRT_whisper readme](https://github.com/collabora/WhisperLive/blob/main/TensorRT_whisper.md) for full setup details. We provide a pre-built docker image with TensorRT-LLM built and ready to use.
```bash
docker run -p 9090:9090 --runtime=nvidia --gpus all --entrypoint /bin/bash -it ghcr.io/collabora/whisperlive-tensorrt

# Build small.en engine
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en # float16
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int8 # int8 weight only quantization
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int4 # int4 weight only quantization

# Run server with small.en
python3 run_server.py --port 9090 \
--backend tensorrt \
--trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_float16"
# to use the int8 / int4 engines built above, pass their paths instead:
#   --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_int8"
#   --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_int4"
```

- CPU
```bash
51 changes: 11 additions & 40 deletions TensorRT_whisper.md
@@ -1,67 +1,38 @@
# WhisperLive-TensorRT
We have only tested the TensorRT backend in docker, so we recommend docker for a smooth TensorRT backend setup.
**Note**: We use `tensorrt_llm==0.15.0.dev2024111200`

## Installation
- Install [docker](https://docs.docker.com/engine/install/)
- Install [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

- Run WhisperLive TensorRT in docker
```bash
docker run -p 9090:9090 --runtime=nvidia --gpus all --entrypoint /bin/bash -it ghcr.io/collabora/whisperlive-tensorrt:latest
```


## Whisper TensorRT Engine
- We build `small.en` and `small` (multilingual) TensorRT engines in the examples below. The script logs the path of the directory containing the Whisper TensorRT engine; that model_path is needed to run the server.
```bash
# convert small.en
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en # float16
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int8 # int8 weight only quantization
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int4 # int4 weight only quantization

# convert small multilingual model
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small
```

## Run WhisperLive Server with TensorRT Backend
```bash

# Run English only model
python3 run_server.py --port 9090 \
--backend tensorrt \
--trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_float16"

# Run Multilingual model
python3 run_server.py --port 9090 \
--backend tensorrt \
--trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_float16" \
--trt_multilingual
```
25 changes: 13 additions & 12 deletions docker/Dockerfile.cpu
@@ -1,22 +1,23 @@
FROM python:3.10-bookworm

ARG DEBIAN_FRONTEND=noninteractive

# install lib required for pyaudio
RUN apt update && apt install -y portaudio19-dev && apt-get clean && rm -rf /var/lib/apt/lists/*

# update pip to support whl.metadata -> less downloading
RUN pip install --no-cache-dir -U "pip>=24"

# create a working directory
RUN mkdir /app
WORKDIR /app

# install pytorch, but without the nvidia-libs that are only necessary for gpu
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu

# install the requirements for running the whisper-live server
COPY requirements/server.txt /app/
RUN pip install --no-cache-dir -r server.txt && rm server.txt

COPY whisper_live /app/whisper_live
COPY run_server.py /app
39 changes: 16 additions & 23 deletions docker/Dockerfile.gpu
@@ -1,33 +1,26 @@
FROM python:3.10-bookworm

ARG DEBIAN_FRONTEND=noninteractive


# install lib required for pyaudio
RUN apt update && apt install -y portaudio19-dev && apt-get clean && rm -rf /var/lib/apt/lists/*

# update pip to support whl.metadata -> less downloading
RUN pip install --no-cache-dir -U "pip>=24"

# create a working directory
RUN mkdir /app
WORKDIR /app

# install the requirements for running the whisper-live server
COPY requirements/server.txt /app/
RUN pip install --no-cache-dir -r server.txt && rm server.txt

# make the paths of the nvidia libs installed as wheels visible. equivalent to:
# export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
ENV LD_LIBRARY_PATH="/usr/local/lib/python3.10/site-packages/nvidia/cublas/lib:/usr/local/lib/python3.10/site-packages/nvidia/cudnn/lib"

COPY whisper_live /app/whisper_live

COPY run_server.py /app

CMD ["python", "run_server.py"]
30 changes: 30 additions & 0 deletions docker/Dockerfile.tensorrt
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
FROM nvidia/cuda:12.5.1-runtime-ubuntu22.04 AS base

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs wget \
&& rm -rf /var/lib/apt/lists/*

FROM base AS devel
RUN pip3 install --no-cache-dir -U tensorrt_llm==0.15.0.dev2024111200 --extra-index-url https://pypi.nvidia.com
WORKDIR /app
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && cd TensorRT-LLM && \
git checkout c629546ce429623c8a163633095230154a6f0574 && cd ../ && \
mv TensorRT-LLM/examples ./TensorRT-LLM-examples && \
rm -rf TensorRT-LLM


FROM devel AS release
WORKDIR /app
COPY assets/ ./assets
RUN wget -nc -P assets/ https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/mel_filters.npz

COPY scripts/setup.sh ./
RUN apt update && bash setup.sh && rm setup.sh

COPY requirements/server.txt .
RUN pip install --no-cache-dir -r server.txt && rm server.txt
COPY whisper_live ./whisper_live
COPY scripts/build_whisper_tensorrt.sh .
COPY run_server.py .
9 changes: 5 additions & 4 deletions requirements/server.txt
@@ -1,12 +1,13 @@
faster-whisper==1.1.0
websockets
onnxruntime==1.16.0
numba
kaldialign
soundfile
ffmpeg-python
scipy
jiwer
evaluate
numpy<2
openai-whisper==20240930
tokenizers==0.20.3
16 changes: 14 additions & 2 deletions run_server.py
@@ -1,5 +1,5 @@
import argparse
import os

if __name__ == "__main__":
parser = argparse.ArgumentParser()
@@ -21,18 +21,30 @@
parser.add_argument('--trt_multilingual', '-m',
action="store_true",
help='Boolean only for TensorRT model. True if multilingual.')
parser.add_argument('--omp_num_threads', '-omp',
type=int,
default=1,
help="Number of threads to use for OpenMP")
parser.add_argument('--no_single_model', '-nsm',
action='store_true',
help='Set this if every connection should instantiate its own model. Only relevant for custom models passed using -trt or -fw.')
args = parser.parse_args()

if args.backend == "tensorrt":
if args.trt_model_path is None:
raise ValueError("Please provide a valid TensorRT model path")

if "OMP_NUM_THREADS" not in os.environ:
os.environ["OMP_NUM_THREADS"] = str(args.omp_num_threads)

from whisper_live.server import TranscriptionServer
server = TranscriptionServer()
server.run(
"0.0.0.0",
port=args.port,
backend=args.backend,
faster_whisper_custom_model_path=args.faster_whisper_custom_model_path,
whisper_tensorrt_path=args.trt_model_path,
trt_multilingual=args.trt_multilingual,
single_model=not args.no_single_model,
)