Add Flemish N-Best results
greenw0lf committed Apr 4, 2024
1 parent f3ceeed commit f5dc793
Showing 4 changed files with 42 additions and 40 deletions.
2 changes: 1 addition & 1 deletion RU/hardware.md
@@ -10,7 +10,7 @@ For Kaldi_NL, the memory it uses for diarization is 370 MB on average, and for N

**Whisper**: [Official github repository](https://github.com/openai/whisper)

For Whisper, no GPUs were used. The computer's hardware consists of 16GB RAM and 4 CPU cores, under a Windows-Docker environment. It has been observed that Whisper requires on average 8 GB of VRAM per recording, with an initial memory requirement of 9.9 GB.
For Whisper, no GPUs were used. The computer's hardware consists of 16 GB of RAM and 4 CPU cores, running in a Docker environment under Windows. It has been observed that Whisper requires on average 8 GB of RAM per recording, with an initial memory requirement of 9.9 GB.


**Wav2vec2.0-large-xlsr-53-dutch**: [Official github repository](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-dutch)
70 changes: 36 additions & 34 deletions UT/N-Best/nbest_res.md
@@ -2,33 +2,35 @@

<h2>Kaldi_NL vs. Whisper - N-Best 2008 Dutch</h2>

The N-Best 2008 Dutch Evaluation corpus is a corpus designed to evaluate Dutch/Flemish Speech Recognition systems in 2008. UT has only evaluated the Dutch subset of the corpus. The Dutch part of the corpus consists of 2 subsets:
- `bn_nl`: Broadcast News programmes in the Netherlands,
- `cts_nl`: Conversational Telephone Speech in the Netherlands.
The N-Best 2008 Dutch Evaluation corpus was designed to evaluate Dutch/Flemish Speech Recognition systems in 2008. The corpus consists of 4 subsets:
- `bn_nl`: Broadcast News programmes in the Netherlands;
- `cts_nl`: Conversational Telephone Speech in the Netherlands;
- `bn_vl`: Broadcast News programmes in Belgium;
- `cts_vl`: Conversational Telephone Speech in Belgium.

For more details about the corpus, click [here](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=32b10cb0f4cb99ba934f5be5066638a5ad9b19f2).

<br>

Here is a matrix with **WER** results of the baseline model, Kaldi_NL, as well as different versions of zero-shot Whisper tested on this corpus:

|Model\Dataset|bn_nl|cts_nl|
|---|---|---|
|Kaldi_NL|12.6%|38.6%|
|Whisper v2|12.7%|25.9%|
|Whisper v3|13.7%|28.1%|
|Whisper v2 w/ VAD|11.6%|25.3%|
|Whisper v3 w/ VAD|14.1%|26.5%|
|faster-whisper v2|10.6%|24.1%|
|faster-whisper v3|12.5%|25.5%|
|**faster-whisper v2 w/ VAD**|**10.0%**|**23.9%**|
|faster-whisper v3 w/ VAD|12.3%|25.1%|
|XLS-R FT on Dutch|14.8%|33.5%|
|MMS - 102 languages|23.4%|49.0%|
|MMS - 1162 languages|18.5%|42.7%|
Here is a table with **WER** results of the baseline model, Kaldi_NL, as well as of the different end-to-end models tested on this corpus:

|Model\Dataset|bn_nl|cts_nl|bn_vl|cts_vl|
|---|---|---|---|---|
|Kaldi_NL|12.6%|38.6%|21.2%|59.4%|
|Whisper v2|12.7%|25.9%|-|-|
|Whisper v3|13.7%|28.1%|-|-|
|Whisper v2 w/ VAD|11.6%|25.3%|-|-|
|Whisper v3 w/ VAD|14.1%|26.5%|-|-|
|faster-whisper v2|10.6%|24.1%|**13.0%**|38.5%|
|faster-whisper v3|12.5%|25.5%|14.9%|38.4%|
|**faster-whisper v2 w/ VAD**|**10.0%**|**23.9%**|13.6%|37.9%|
|faster-whisper v3 w/ VAD|12.3%|25.1%|14.6%|**36.9%**|
|XLS-R FT on Dutch|14.8%|33.5%|17.0%|51.7%|
|MMS - 102 languages|23.4%|49.0%|25.0%|64.2%|
|MMS - 1162 languages|18.5%|42.7%|19.4%|57.7%|
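
All numbers above are word error rates (WER): the minimum number of word substitutions, deletions, and insertions needed to turn a hypothesis into the reference, divided by the number of reference words. As a reminder of the metric, here is a minimal sketch (not the actual scoring pipeline used here, which is linked in the setup page):

```python
# Sketch: WER = (substitutions + deletions + insertions) / reference words,
# computed with a plain Levenshtein distance over word tokens.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("dit is een test", "dit is de test"))  # 0.25, i.e. 25% WER
```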

<br>
And here are results for the same models on bn_nl with the foreign speech lines removed from the dataset:
And here are results for the same models on `bn_nl` with the foreign speech lines removed from the dataset:

|Model\Dataset|bn_nl|
|---|---|
@@ -44,20 +46,20 @@ Something to note about this setup of the dataset is that Whisper v3 was already

Here is also a table with the total **time** spent by each model **to evaluate** the respective subset:

|Model\Dataset|bn_nl|cts_nl|
|---|---|---|
|Kaldi_NL|0h:08m:58s|0h:14m:47s|
|Whisper v2|1h:11m:59s|0h:53m:55s|
|Whisper v3|1h:09m:00s|0h:40m:20s|
|Whisper v2 w/ VAD|0h:52m:03s|0h:40m:09s|
|Whisper v3 w/ VAD|1h:02m:13s|0h:37m:50s|
|faster-whisper v2|0h:11m:31s|0h:09m:30s|
|faster-whisper v3|0h:11m:21s|0h:09m:41s|
|faster-whisper v2 w/ VAD|0h:12m:13s|0h:09m:36s|
|faster-whisper v3 w/ VAD|0h:12m:25s|0h:09m:13s|
|XLS-R FT on Dutch|0h:07m:36s|0h:07m:52s|
|*MMS - 102 languages*|**0h:04m:33s**|0h:04m:23s|
|*MMS - 1162 languages*|0h:05m:26s|**0h:04m:14s**|
|Model\Dataset|bn_nl|cts_nl|bn_vl|cts_vl|
|---|---|---|---|---|
|Kaldi_NL|0h:08m:58s|0h:14m:47s|0h:15m:57s|0h:20m:07s|
|Whisper v2|1h:11m:59s|0h:53m:55s|-|-|
|Whisper v3|1h:09m:00s|0h:40m:20s|-|-|
|Whisper v2 w/ VAD|0h:52m:03s|0h:40m:09s|-|-|
|Whisper v3 w/ VAD|1h:02m:13s|0h:37m:50s|-|-|
|faster-whisper v2|0h:11m:31s|0h:09m:30s|0h:10m:55s|0h:09m:39s|
|faster-whisper v3|0h:11m:21s|0h:09m:41s|0h:10m:36s|0h:09m:54s|
|faster-whisper v2 w/ VAD|0h:12m:13s|0h:09m:36s|0h:11m:01s|0h:09m:46s|
|faster-whisper v3 w/ VAD|0h:12m:25s|0h:09m:13s|0h:10m:45s|0h:09m:04s|
|XLS-R FT on Dutch|0h:07m:36s|0h:07m:52s|0h:08m:44s|0h:08m:21s|
|*MMS - 102 languages*|**0h:04m:33s**|0h:04m:23s|0h:04m:38s|**0h:03m:54s**|
|*MMS - 1162 languages*|0h:05m:26s|**0h:04m:14s**|**0h:04m:37s**|0h:03m:55s|

### Preprocessing, setup, and postprocessing
For more details, click [here](./nbest_setup.md).
6 changes: 3 additions & 3 deletions UT/N-Best/nbest_setup.md
@@ -3,12 +3,12 @@
## N-Best setup
### Preprocessing

Not all speech was annotated in the reference files. After the very first results, I observed a lot of insertions in the segments where speech was not annotated. Therefore, I decided to use a [script](https://github.com/greenw0lf/OH-SMArt/blob/master/reference2stm/segment_nbest.ipynb) that splits the audio files into smaller segments that can each be transcribed by Whisper/Kaldi_NL. Spoken utterances are merged into one audio/transcription file if they are no more than 5 seconds apart from each other.
Not all speech was annotated in the reference files. After the very first results, I observed a lot of insertions in the segments where speech was not annotated. Therefore, I decided to use a [script](https://github.com/greenw0lf/OH-SMArt/blob/master/reference2stm/segment_nbest.ipynb) that splits the audio files into smaller segments that can each be transcribed by the evaluated models. Spoken utterances are merged into one audio/transcription file if they are no more than 5 seconds apart from each other.
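
A minimal sketch of that merging rule (not the actual notebook; utterances are assumed to be `(start, end, text)` tuples with times in seconds, sorted by start time):

```python
# Sketch: group annotated utterances into larger segments, starting a new
# segment whenever the gap to the previous utterance exceeds 5 seconds.
MAX_GAP = 5.0  # seconds

def merge_utterances(utterances):
    """utterances: list of (start, end, text) tuples, sorted by start time."""
    segments = []
    for start, end, text in utterances:
        if segments and start - segments[-1]["end"] <= MAX_GAP:
            # Close enough to the previous utterance: extend that segment.
            segments[-1]["end"] = end
            segments[-1]["text"] += " " + text
        else:
            segments.append({"start": start, "end": end, "text": text})
    return segments

# Example: the two utterances are 3 s apart, so they end up in one segment.
print(merge_utterances([(0.0, 2.5, "goedemorgen"), (5.5, 7.0, "welkom")]))
```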

The CTS subset of the dataset contained a string that was not properly encoded as a character in UTF-8. I replaced it with the correct character (`\xE2` => `â`).
The `cts_nl` subset of the dataset contained a string that was not properly encoded as a character in UTF-8. I replaced it with the correct character (`\xE2` => `â`).
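
A minimal sketch of that fix, assuming the stray byte only occurs in the broken spot (the file name is hypothetical):

```python
# Sketch: replace the raw 0xE2 byte with the UTF-8 encoding of "â" before the
# reference file is decoded. Note: 0xE2 is also the lead byte of many valid
# UTF-8 sequences, so this blanket replace assumes the byte appears only once,
# in the broken spot.
with open("cts_nl_reference.stm", "rb") as f:
    raw = f.read()

fixed = raw.replace(b"\xE2", "â".encode("utf-8"))

with open("cts_nl_reference_fixed.stm", "wb") as f:
    f.write(fixed)
```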

### Postprocessing

Whisper v3, with or without VAD, sometimes output segments such as "Beeld &". The issue is that only one word is expected per timestamp, and the CTM format that I converted the output to uses whitespace as a delimiter. Thus, it would cause an error when running the evaluation. This was fixed by adjusting the problematic lines manually, either by removing the second word or by adjusting the timestamps so that the second word fits in. The script used for finding the error lines can be found [here](https://github.com/greenw0lf/OH-SMArt/blob/master/whisper2ctm/validate_ctm.py).
Whisper v3, with or without VAD (`whisper-timestamped` implementation), sometimes output segments such as "Beeld &". The issue is that only one word is expected per timestamp, and the CTM format that I converted the output to uses whitespace as a delimiter. Thus, it would cause an error when running the evaluation. This was fixed by adjusting the problematic lines manually, either by removing the second word or by adjusting the timestamps so that the second word fits in. The script used for finding the error lines can be found [here](https://github.com/greenw0lf/OH-SMArt/blob/master/whisper2ctm/validate_ctm.py).
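
The check boils down to counting fields per CTM line; a sketch of that idea (not the actual `validate_ctm.py`, and the expected field count is an assumption that depends on whether a confidence column is present):

```python
# Sketch: a CTM line is expected to look like
#   <recording> <channel> <start> <duration> <word>
# (optionally followed by a confidence value), so any line with extra
# whitespace-separated fields contains a multi-word token such as "Beeld &".
import sys

def find_bad_lines(ctm_path, expected_fields=5):
    # Use expected_fields=6 if the CTM also carries a confidence column.
    bad = []
    with open(ctm_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if len(line.split()) > expected_fields:
                bad.append((lineno, line.rstrip()))
    return bad

if __name__ == "__main__":
    for lineno, line in find_bad_lines(sys.argv[1]):
        print(f"line {lineno}: {line}")
```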

Aside from that, other steps are applied, such as normalization of numbers and characters, or mapping word variants to spellings that are acceptable in the context of evaluation. For more details, check the [ASR-NL-benchmark](https://github.com/opensource-spraakherkenning-nl/ASR_NL_benchmark) repository, where details about the hypothesis/reference file format can also be found.
4 changes: 2 additions & 2 deletions UT/hardware.md
@@ -4,7 +4,7 @@

For **Kaldi_NL**, evaluation was run on a local machine instead of a cluster. The local machine used is a Lenovo ThinkPad P15v Gen 3 with an AMD Ryzen 7 PRO 6850H CPU and 32 GB of RAM. The reasons are that it is trickier to set up a Docker container on the cluster used for Whisper (see below), since I do not have admin rights, and that Kaldi_NL was meant to run on modern local machines rather than to require a powerful GPU/CPU.

For **Whisper** and **XLS-R**, a high-performance computing cluster was used. The cluster's hardware consists of 2 x Nvidia A10 with 24 GB VRAM each, using CUDA version 11.6, 256 GB RAM and 56 CPU cores. For more details, check the [wiki](https://jupyter.wiki.utwente.nl/) page of the cluster.
For **Whisper, XLS-R, and MMS**, a high-performance computing cluster was used. The cluster's hardware consists of 2 x Nvidia A10 GPUs with 24 GB of VRAM each (CUDA version 11.6), 256 GB of RAM, and 56 CPU cores. For more details, check the [wiki](https://jupyter.wiki.utwente.nl/) page of the cluster.

The implementation used to output word-level timestamps from Whisper is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped). In addition, a more optimized implementation of Whisper has been used, namely [faster-whisper](https://github.com/SYSTRAN/faster-whisper). Both of them use the same parameters as the original implementation from [OpenAI](https://github.com/openai/whisper). The parameters are:
- `beam_size=5`
@@ -28,6 +28,6 @@ It has been observed that `whisper-timestamped` (simplified to "Whisper" in the

The better-optimized implementation, `faster-whisper`, however, uses on average 3.2 - 3.7 GB of VRAM per recording. Therefore, this implementation can be used on GPUs with smaller video memory capacity.
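
For reference, a minimal sketch of transcribing one recording with `faster-whisper` under the setup above (only `beam_size=5` comes from the parameter list; the model name, audio path, and VAD flag are illustrative assumptions):

```python
# Sketch: word-level transcription of a single recording with faster-whisper.
from faster_whisper import WhisperModel

# "large-v2" corresponds to Whisper v2 in the tables; use "large-v3" for v3.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "recording.wav",       # placeholder path
    language="nl",
    beam_size=5,           # same beam size as the original OpenAI implementation
    word_timestamps=True,  # word-level timestamps, needed for CTM conversion
    # vad_filter=True,     # roughly what the "w/ VAD" rows enable (assumption)
)

for segment in segments:
    for word in segment.words:
        print(f"{word.start:.2f}\t{word.end:.2f}\t{word.word.strip()}")
```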

XLS-R uses on average 4.3 GB of VRAM per recording, sometimes going as low as 4.0 GB and as high as 8.3 GB. Initial memory requirement varies from 8 GB to 13.7 GB, though the higher value could have been influenced by other previous processes that were executed on the GPU.
`XLS-R` and `MMS` use on average 4.3 GB of VRAM per recording, sometimes going as low as 4.0 GB and as high as 8.3 GB. The initial memory requirement varies from 8 GB to 13.7 GB, though the higher value could have been influenced by previous processes executed on the GPU.
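
The per-recording VRAM figures can be measured along these lines (a sketch; the model identifier is the Dutch XLS-R model linked earlier, the audio path is a placeholder, and real evaluation runs chunk long recordings instead of passing them whole):

```python
# Sketch: measure peak GPU memory for one recording with the Dutch XLS-R model.
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "jonatasgrosman/wav2vec2-large-xlsr-53-dutch"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id).to("cuda")

# Load one recording at 16 kHz (placeholder path).
audio, sr = librosa.load("recording.wav", sr=16_000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    logits = model(**inputs).logits
transcript = processor.batch_decode(torch.argmax(logits, dim=-1))[0]

print(transcript)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```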

For Kaldi_NL, the memory it uses for diarization is 355 MB on average, and for NNet3 decoding, 1.6 GB. Keep in mind that RAM is used in this case since it runs on the CPU.
