29 changes: 29 additions & 0 deletions .github/workflows/ci.yml
@@ -99,6 +99,35 @@ jobs:
push: true
tags: ghcr.io/collabora/whisperlive-cpu:latest

build-and-push-docker-tensorrt:
needs: [run-tests, check-code-format]
timeout-minutes: 20
runs-on: ubuntu-22.04
if: github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/tags/'))
steps:
- uses: actions/checkout@v2

- name: Log in to GitHub Container Registry
uses: docker/login-action@v1
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GHCR_TOKEN }}

- name: Docker Prune
run: docker system prune -af

- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v1

- name: Build and push Docker GPU image
uses: docker/build-push-action@v2
with:
context: .
file: docker/Dockerfile.tensorrt
push: true
tags: ghcr.io/collabora/whisperlive-tensorrt:latest

build-and-push-docker-gpu:
needs: [run-tests, check-code-format, build-and-push-docker-cpu]
timeout-minutes: 20
59 changes: 53 additions & 6 deletions README.md
@@ -36,7 +36,7 @@ python3 run_server.py --port 9090 \

# running with custom model
python3 run_server.py --port 9090 \
--backend faster_whisper \
-fw "/path/to/custom/faster/whisper/model"
```

@@ -53,24 +53,51 @@ python3 run_server.py -p 9090 \
-trt /home/TensorRT-LLM/examples/whisper/whisper_small \
-m
```
#### Controlling OpenMP Threads
To control the number of threads used by OpenMP, set the `OMP_NUM_THREADS` environment variable. This is useful for managing CPU resources and ensuring consistent performance. If not specified, `OMP_NUM_THREADS` defaults to `1`. You can change this with the `--omp_num_threads` argument:
```bash
python3 run_server.py --port 9090 \
--backend faster_whisper \
--omp_num_threads 4
```
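The resolution order (an exported `OMP_NUM_THREADS` wins over `--omp_num_threads`) can be sketched as follows — `resolve_omp_threads` is a hypothetical helper for illustration, mirroring what `run_server.py` does before importing the server:

```python
import os

def resolve_omp_threads(cli_value: int, env=os.environ) -> str:
    # An explicitly exported OMP_NUM_THREADS takes precedence;
    # otherwise the CLI value (default 1) is written into the environment.
    if "OMP_NUM_THREADS" not in env:
        env["OMP_NUM_THREADS"] = str(cli_value)
    return env["OMP_NUM_THREADS"]
```

Note that the variable must be set before importing libraries that read it at import time, which is why the server sets it before importing `whisper_live.server`.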

#### Single model mode
By default, when running the server without specifying a model, the server instantiates a new Whisper model for every client connection. This has the advantage that the server can serve different model sizes, based on each client's requested size. On the other hand, it also means you have to wait for the model to load when a client connects, and (V)RAM usage increases.

When serving a custom TensorRT model using the `-trt` or a custom faster_whisper model using the `-fw` option, the server will instead only instantiate the custom model once and then reuse it for all client connections.

If you don't want this, set `--no_single_model`.
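The difference between the two modes boils down to a caching pattern; a minimal illustrative sketch (this is not WhisperLive's actual code — `MockModel` stands in for a loaded Whisper backend):

```python
from functools import lru_cache

class MockModel:
    """Stand-in for a loaded Whisper model."""
    def __init__(self, size):
        self.size = size

def per_client_model(size):
    # Default mode: every connection pays the load cost and holds its own copy.
    return MockModel(size)

@lru_cache(maxsize=1)
def single_model(size):
    # Single-model mode (custom -trt / -fw models): load once, reuse everywhere.
    return MockModel(size)
```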


### Running the Client
- Initializing the client with the following parameters:
- `lang`: Language of the input audio, applicable only if using a multilingual model.
- `translate`: If set to `True`, translates from the source language to English (`en`).
- `model`: Whisper model size.
- `use_vad`: Whether to use `Voice Activity Detection` on the server.
- `save_output_recording`: Set to True to save the microphone input as a `.wav` file during live transcription. This option is helpful for recording sessions for later playback or analysis. Defaults to `False`.
- `output_recording_filename`: Specifies the `.wav` file path where the microphone input will be saved if `save_output_recording` is set to `True`.
- `max_clients`: Specifies the maximum number of clients the server should allow. Defaults to 4.
- `max_connection_time`: Maximum connection time for each client in seconds. Defaults to 600.

```python
from whisper_live.client import TranscriptionClient
client = TranscriptionClient(
"localhost",
9090,
lang="en",
translate=False,
model="small",  # also supports HuggingFace model ids, e.g. "Systran/faster-whisper-small"
use_vad=False,
save_output_recording=True,  # Only used for microphone input; defaults to False
output_recording_filename="./output_recording.wav", # Only used for microphone input
max_clients=4,
max_connection_time=600
)
```
It connects to the server running on localhost at port 9090. With a multilingual model, the language of the transcription is detected automatically; you can also use the `lang` option to specify a target language, in this case English ("en"). Set `translate` to `True` to translate from the source language to English, or `False` to transcribe in the source language.

- Transcribe an audio file:
```python
client("tests/jfk.wav")
```
@@ -80,9 +107,14 @@ client("tests/jfk.wav")
client()
```

- To transcribe from an RTSP stream:
```python
client(rtsp_url="rtsp://admin:admin@192.168.0.1/rtsp")
```

- To transcribe from an HLS stream:
```python
client(hls_url="http://as-hls-ww-live.akamaized.net/pool_904/live/ww/bbc_1xtra/bbc_1xtra.isml/bbc_1xtra-audio%3d96000.norewind.m3u8")
```

## Browser Extensions
@@ -96,7 +128,22 @@ client(hls_url="http://as-hls-ww-live.akamaized.net/pool_904/live/ww/bbc_1xtra/b
docker run -it --gpus all -p 9090:9090 ghcr.io/collabora/whisperlive-gpu:latest
```

- TensorRT. See the [TensorRT_whisper readme](https://github.com/collabora/WhisperLive/blob/main/TensorRT_whisper.md) for full setup details. We provide a pre-built docker image with TensorRT-LLM built and ready to use.
```bash
docker run -p 9090:9090 --runtime=nvidia --gpus all --entrypoint /bin/bash -it ghcr.io/collabora/whisperlive-tensorrt

# Build small.en engine
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en # float16
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int8 # int8 weight only quantization
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int4 # int4 weight only quantization

# Run server with small.en
python3 run_server.py --port 9090 \
--backend tensorrt \
--trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_float16"
# to use the int8 / int4 engines built above, pass their paths instead:
#   --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_int8"
#   --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_int4"
```

- CPU
```bash
51 changes: 11 additions & 40 deletions TensorRT_whisper.md
@@ -1,67 +1,38 @@
# WhisperLive-TensorRT
We have only tested the TensorRT backend in docker, so we recommend docker for a smooth TensorRT backend setup.
**Note**: We use `tensorrt_llm==0.15.0.dev2024111200`

## Installation
- Install [docker](https://docs.docker.com/engine/install/)
- Install [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

- Run WhisperLive TensorRT in docker
```bash
docker run -p 9090:9090 --runtime=nvidia --gpus all --entrypoint /bin/bash -it ghcr.io/collabora/whisperlive-tensorrt:latest
```


## Whisper TensorRT Engine
- We build `small.en` and `small` (multilingual) TensorRT engines in the examples below. The script logs the path of the directory containing the Whisper TensorRT engine; that model_path is needed to run the server.
```bash
# convert small.en
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en # float16
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int8 # int8 weight only quantization
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int4 # int4 weight only quantization

# convert small multilingual model
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small
```

## Run WhisperLive Server with TensorRT Backend
```bash

# Run English only model
python3 run_server.py --port 9090 \
--backend tensorrt \
--trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_float16"

# Run Multilingual model
python3 run_server.py --port 9090 \
--backend tensorrt \
--trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_float16" \
--trt_multilingual
```
25 changes: 13 additions & 12 deletions docker/Dockerfile.cpu
@@ -1,22 +1,23 @@
FROM python:3.10-bookworm

ARG DEBIAN_FRONTEND=noninteractive

# install lib required for pyaudio
RUN apt update && apt install -y portaudio19-dev && apt-get clean && rm -rf /var/lib/apt/lists/*

# update pip to support whl.metadata -> less downloading
RUN pip install --no-cache-dir -U "pip>=24"

# create a working directory
RUN mkdir /app
WORKDIR /app

# install pytorch, but without the nvidia-libs that are only necessary for gpu
RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu

# install the requirements for running the whisper-live server
COPY requirements/server.txt /app/
RUN pip install --no-cache-dir -r server.txt && rm server.txt

COPY whisper_live /app/whisper_live
COPY run_server.py /app
39 changes: 16 additions & 23 deletions docker/Dockerfile.gpu
@@ -1,33 +1,26 @@
FROM python:3.10-bookworm

ARG DEBIAN_FRONTEND=noninteractive


# install lib required for pyaudio
RUN apt update && apt install -y portaudio19-dev && apt-get clean && rm -rf /var/lib/apt/lists/*

# update pip to support whl.metadata -> less downloading
RUN pip install --no-cache-dir -U "pip>=24"

# create a working directory
RUN mkdir /app
WORKDIR /app

# install the requirements for running the whisper-live server
COPY requirements/server.txt /app/
RUN pip install --no-cache-dir -r server.txt && rm server.txt

# make the paths of the nvidia libs installed as wheels visible. equivalent to:
# export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
ENV LD_LIBRARY_PATH="/usr/local/lib/python3.10/site-packages/nvidia/cublas/lib:/usr/local/lib/python3.10/site-packages/nvidia/cudnn/lib"

COPY whisper_live /app/whisper_live

COPY run_server.py /app

CMD ["python", "run_server.py"]
30 changes: 30 additions & 0 deletions docker/Dockerfile.tensorrt
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
FROM nvidia/cuda:12.5.1-runtime-ubuntu22.04 AS base

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs wget \
&& rm -rf /var/lib/apt/lists/*

FROM base AS devel
RUN pip3 install --no-cache-dir -U tensorrt_llm==0.15.0.dev2024111200 --extra-index-url https://pypi.nvidia.com
WORKDIR /app
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && cd TensorRT-LLM && \
git checkout c629546ce429623c8a163633095230154a6f0574 && cd ../ && \
mv TensorRT-LLM/examples ./TensorRT-LLM-examples && \
rm -rf TensorRT-LLM


FROM devel AS release
WORKDIR /app
COPY assets/ ./assets
RUN wget -nc -P assets/ https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/mel_filters.npz

COPY scripts/setup.sh ./
RUN apt update && bash setup.sh && rm setup.sh

COPY requirements/server.txt .
RUN pip install --no-cache-dir -r server.txt && rm server.txt
COPY whisper_live ./whisper_live
COPY scripts/build_whisper_tensorrt.sh .
COPY run_server.py .
9 changes: 5 additions & 4 deletions requirements/server.txt
@@ -1,12 +1,13 @@
faster-whisper==1.1.0
websockets
onnxruntime==1.16.0
numba
kaldialign
soundfile
ffmpeg-python
scipy
jiwer
evaluate
numpy<2
openai-whisper==20240930
tokenizers==0.20.3
16 changes: 14 additions & 2 deletions run_server.py
@@ -1,5 +1,5 @@
import argparse
import os

if __name__ == "__main__":
parser = argparse.ArgumentParser()
@@ -21,18 +21,30 @@
parser.add_argument('--trt_multilingual', '-m',
action="store_true",
help='Boolean only for TensorRT model. True if multilingual.')
parser.add_argument('--omp_num_threads', '-omp',
type=int,
default=1,
help="Number of threads to use for OpenMP")
parser.add_argument('--no_single_model', '-nsm',
action='store_true',
help='Set this if every connection should instantiate its own model. Only relevant for custom models passed using -trt or -fw.')
args = parser.parse_args()

if args.backend == "tensorrt":
if args.trt_model_path is None:
raise ValueError("Please provide a valid TensorRT model path")

if "OMP_NUM_THREADS" not in os.environ:
os.environ["OMP_NUM_THREADS"] = str(args.omp_num_threads)

from whisper_live.server import TranscriptionServer
server = TranscriptionServer()
server.run(
"0.0.0.0",
port=args.port,
backend=args.backend,
faster_whisper_custom_model_path=args.faster_whisper_custom_model_path,
whisper_tensorrt_path=args.trt_model_path,
trt_multilingual=args.trt_multilingual,
single_model=not args.no_single_model,
)