
Updated results for faster-whisper + WhisperX
TODO: detailed results for WhisperX
greenw0lf committed Jul 11, 2024
1 parent b00d22c commit 66b60c6
Showing 3 changed files with 16 additions and 13 deletions.
12 changes: 6 additions & 6 deletions NISV/res_labelled.md
@@ -33,7 +33,7 @@ Here is a matrix with **WER** results of the baseline implementation from OpenAI
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|11.1%|11.0%|12.9%|13.2%|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|17.1%|16.9%|16.6%|16.6%|
-|**[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)**|**10.3%**|**10.2%**|**11.8%**|**11.8%**|
+|**[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)**|**10.3%**|**10.3%**|**11.8%**|**11.8%**|
|[WhisperX](https://github.com/m-bain/whisperX/)|12.3%|12.4%|13.0%|12.9%|
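The **WER** figures above are standard word error rates: word-level edit distance between hypothesis and reference, divided by reference length. A minimal stdlib sketch of the metric, as a generic illustration rather than the exact evaluation pipeline behind these numbers:

```python
# Minimal WER via word-level Levenshtein distance. A generic sketch of the
# metric reported above, not the benchmark's actual scoring script.

def wer(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)
```

In practice a library such as `jiwer` (with its normalization options) is the usual choice; the sketch only pins down what the percentages mean.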

<br>
@@ -44,8 +44,8 @@ And a matrix with the **time** spent in total by each implementation **to load and transcribe**:
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|36m:06s|32m:41s|42m:08s|30m:25s|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|21m:48s|19m:13s|23m:22s|22m:02s|
-|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|11m:40s|22m:27s|**11m:18s**|21m:56s|
-|**[WhisperX](https://github.com/m-bain/whisperX/)\***|**11m:17s**|**15m:54s**|11m:29s|**15m:05s**|
+|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|11m:39s|22m:29s|11m:04s|24m:04s|
+|**[WhisperX](https://github.com/m-bain/whisperX/)\***|**11m:34s**|**15m:15s**|**11m:01s**|**15m:04s**|

\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in [this section](./whisperx.md).

@@ -55,10 +55,10 @@ Finally, a matrix with the **maximum GPU memory consumption + maximum GPU power usage**:

|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
-|[OpenAI](https://github.com/openai/whisper)|10621 MiB / 240 W|**10639 MiB** / 264 W|10927 MiB / 238 W|10941 MiB / 266 W|
+|[OpenAI](https://github.com/openai/whisper)|10621 MiB / 240 W|10639 MiB / 264 W|10927 MiB / 238 W|10941 MiB / 266 W|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|15073 MiB / **141 W**|12981 MiB / **215 W**|14566 MiB / **123 W**|19385 MiB / **235 W**|
-|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**8576 MiB** / 188 W|11694 MiB / 235 W|**8567 MiB** / 195 W|**6942 MiB** / 237 W|
-|[WhisperX](https://github.com/m-bain/whisperX/)*|9419 MiB / 246 W|13548 MiB / 249 W|9417 MiB / 243 W|13539 MiB / 247 W|
+|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**4287 MiB** / 230 W|**7776 MiB** / 263 W|**4292 MiB** / 230 W|**7768 MiB** / 262 W|
+|[WhisperX](https://github.com/m-bain/whisperX/)*|9947 MiB / 249 W|13940 MiB / 252 W|9944 MiB / 250 W|14094 MiB / 254 W|

\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used.
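Peak figures like the ones above can be gathered by polling `nvidia-smi` while transcription runs. A hedged sketch of such a monitor; the polling command and CSV column layout are assumptions about a typical measurement setup, not taken from this commit:

```python
import subprocess

# One CSV sample per poll, e.g. "10621, 240.3" (MiB, watts).
QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,power.draw",
         "--format=csv,noheader,nounits"]

def parse_sample(line: str) -> tuple[int, float]:
    """Parse one CSV sample into (memory in MiB, power in watts)."""
    mem, power = line.split(",")
    return int(mem.strip()), float(power.strip())

def peak(samples: list[str]) -> tuple[int, float]:
    """Maximum memory and maximum power over a list of polled samples."""
    parsed = [parse_sample(s) for s in samples]
    return max(m for m, _ in parsed), max(p for _, p in parsed)

def poll_once() -> str:
    # Requires an NVIDIA GPU with driver tools installed; run this in a
    # loop (e.g. once per second) alongside the transcription process.
    return subprocess.check_output(QUERY, text=True).strip()
```

Note that the two maxima need not occur at the same moment, which matches how the table reports them as separate peaks.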

15 changes: 9 additions & 6 deletions NISV/res_unlabelled.md
@@ -13,7 +13,10 @@ For each Whisper implementation, 2 variables have been modified:
- The compute type: `float16` vs. `float32` (check [here](./res_labelled.md) for more details about this parameter)
- For Huggingface (HF) and WhisperX: `batch_size`
  - `4` for `HF float16`, `2` for `HF float32`
-  - `64/48` for `WhisperX float16 labelled/unlabelled`, `16` for `WhisperX float32`
+  - For **WhisperX**:
+    - `44` for `float16 large-v2`
+    - `48` for `float16 large-v3`
+    - `16` for `float32 large-v2/large-v3`
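Under those settings, a WhisperX run might look as follows. The `BATCH_SIZE` values come from the list above; the `whisperx` call shapes follow the WhisperX README and are an assumption, not code taken from this commit:

```python
# Benchmark batch sizes from the list above, keyed by (model, compute type).
BATCH_SIZE = {
    ("large-v2", "float16"): 44,
    ("large-v3", "float16"): 48,
    ("large-v2", "float32"): 16,
    ("large-v3", "float32"): 16,
}

def transcribe(audio_path, model_name="large-v2", compute_type="float16"):
    # Requires `pip install whisperx` and a CUDA GPU. Call shapes follow
    # the WhisperX README; treat this as a sketch, not the benchmark script.
    import whisperx
    model = whisperx.load_model(model_name, "cuda", compute_type=compute_type)
    audio = whisperx.load_audio(audio_path)
    return model.transcribe(audio,
                            batch_size=BATCH_SIZE[(model_name, compute_type)])
```

A larger `batch_size` trades GPU memory for throughput, which is why the `float32` runs (roughly double the memory per sample) use smaller batches.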

<br>

@@ -23,8 +26,8 @@ Here's a matrix with the **time** spent in total by each implementation **to load and transcribe**:
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|1h:43m:47s|1h:20m:29s|1h:57m:06s|1h:28m:50s|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|43m:05s|1h:05m:17s|41m:39s|1h:01m:45s|
-|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|39m:59s|1h:18m:34s|40m:07s|1h:23m:07s|
-|**[WhisperX](https://github.com/m-bain/whisperX/)\***|**26m:57s**|**31m:57s**|**27m:00s**|**31m:43s**|
+|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|38m:52s|1h:17m:38s|39m:26s|1h:24m:21s|
+|**[WhisperX](https://github.com/m-bain/whisperX/)\***|**24m:52s**|**32m:01s**|**25m:42s**|**31m:24s**|

\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in a different section.

@@ -35,9 +38,9 @@ And also a matrix with the **maximum GPU memory consumption + maximum GPU power usage**:
|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|10943 MiB / 274 W|10955 MiB / 293 W|11094 MiB / 279 W|11164 MiB / 291 W|
-|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|16629 MiB / 269 W|18563 MiB / 287 W|12106 MiB / 259 W|15061 MiB / 288 W|
-|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|3789 MiB / 156 W|6813 MiB / 191 W|3750 MiB / 160 W|6953 MiB / 200 W|
-|[WhisperX](https://github.com/m-bain/whisperX/)*|22273 MiB / 268 W|21375 MiB / 281 W|22142 MiB / 267 W|21298 MiB / 279 W|
+|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|16629 MiB / 269 W|18563 MiB / 287 W|12106 MiB / **259 W**|15061 MiB / 288 W|
+|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**4811 MiB** / 269 W|**8519 MiB** / 286 W|**4865 MiB** / 267 W|**9179 MiB** / 282 W|
+|[WhisperX](https://github.com/m-bain/whisperX/)*|21676 MiB / **268 W**|21657 MiB / **279 W**|22425 MiB / 267 W|21580 MiB / **279 W**|

\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used.

2 changes: 1 addition & 1 deletion NISV/whisperx.md
@@ -4,7 +4,7 @@

[**WhisperX**](https://github.com/m-bain/whisperX/) is an implementation of Whisper with support for batching, word/character level time alignment using wav2vec 2.0, and speaker diarization.

-There are 4 components/parts that we can define: **loading** Whisper, the **transcriber** (the part where Whisper runs to generate the text transcription without any timestamps), the **aligner** (generates the word-level timestamps using wav2vec 2.0), and the **diarizer** (identifies the speaker per segment and per word by assigning speaker IDs).
+There are 4 components/parts that we can define: **loading** Whisper + speaker diarization module, the **transcriber** (the part where Whisper runs to generate the text transcription without any timestamps), the **aligner** (generates the word-level timestamps using wav2vec 2.0), and the **diarizer** (identifies the speaker per segment and per word by assigning speaker IDs).

Because the wav2vec 2.0-based aligner does not support aligning digits and currency symbols, numbers and currencies have been converted to their written form by setting `suppress_numerals=True`. In addition, the [original aligner](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-dutch) used in WhisperX, based on XLSR-53, has been replaced with an [aligner based on XLS-R](https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-dutch). This was done because the original aligner struggled to align some characters that are less common in Dutch but are part of its orthography (mainly accented vowels). This might have increased the time spent on alignment compared to the XLSR-53 version; an ablation study is planned as future work to confirm this hypothesis.
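The aligner swap described above can be sketched as follows. The call shapes follow the WhisperX README (`load_align_model` accepts a custom `model_name`), and the `asr_options` placement of `suppress_numerals` reflects recent WhisperX versions; both are assumptions, not this repository's actual scripts:

```python
# Hedged sketch of the custom Dutch aligner setup, not the benchmark code.
XLS_R_ALIGNER = "jonatasgrosman/wav2vec2-xls-r-1b-dutch"  # replacement aligner

def align_dutch(segments, audio, device="cuda"):
    # Requires `pip install whisperx` and a CUDA GPU.
    import whisperx
    # `suppress_numerals=True` would be passed when loading the ASR model,
    # e.g. whisperx.load_model(..., asr_options={"suppress_numerals": True}).
    align_model, metadata = whisperx.load_align_model(
        language_code="nl", device=device, model_name=XLS_R_ALIGNER)
    # Attach word-level timestamps to the transcribed segments.
    return whisperx.align(segments, align_model, metadata, audio, device)
```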

