Refer to the inference tutorial for detailed guidance on how to run inference with SeamlessM4T models. This tutorial briefly describes the evaluation protocol used for all tasks supported by SeamlessM4T.
The sacreBLEU library is used to compute BLEU scores. To be consistent with Whisper, a character-level (char) tokenizer is used for Mandarin Chinese (cmn), Japanese (jpn), Thai (tha), Lao (lao), and Burmese (mya); the default 13a tokenizer is used for all other languages. Raw (unnormalized) references and predictions are used for computing the scores.
import sacrebleu
bleu_metric = sacrebleu.BLEU(tokenize=<TOKENIZER>) ## 'char' or '13a', per the rule above
bleu_score = bleu_metric.corpus_score(<PREDICTIONS>, [<REFERENCES>]) ## hypotheses first, then a list of reference lists
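As a concrete sketch of how the tokenizer can be selected per target language (the helper name and the hard-coded language set below are illustrative, not part of the official evaluation code):

import sacrebleu

## Languages scored with the character-level tokenizer, following Whisper
CHAR_LEVEL_LANGS = {'cmn', 'jpn', 'tha', 'lao', 'mya'}

def corpus_bleu(predictions, references, tgt_lang):
    ## 'char' for the five languages above, sacrebleu's default '13a' otherwise
    tokenizer = 'char' if tgt_lang in CHAR_LEVEL_LANGS else '13a'
    bleu_metric = sacrebleu.BLEU(tokenize=tokenizer)
    ## corpus_score takes a list of hypotheses and a list of reference lists
    return bleu_metric.corpus_score(predictions, [references])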
To measure the quality of translated speech outputs, the audio is first transcribed with a Whisper ASR model, and BLEU is computed by comparing these ASR transcriptions against the ground-truth text references.
Whisper large-v2 is used for non-English target languages, while medium.en (trained on English-only data) is used for English due to its superior performance on English.
import whisper
model = whisper.load_model('medium.en') ## For English targets
model = whisper.load_model('large-v2') ## For non-English targets
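A small helper can pick the checkpoint from the target language; this is a minimal sketch, assuming English targets are identified by the code 'eng':

import whisper

## Minimal sketch; the 'eng' code convention is our assumption
def load_asr_model(tgt_lang):
    return whisper.load_model('medium.en' if tgt_lang == 'eng' else 'large-v2')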
To reproduce the Whisper transcriptions, and thereby the ASR-BLEU scores, greedy decoding is used with the temperature fixed at 0. The target language is also passed to the Whisper model.
prediction = model.transcribe(<AUDIO_PATH>, language=<LANGUAGE>, temperature=0, beam_size=1)["text"]
Whisper-normalizer is run on both the ground-truth references and the model-generated transcriptions. ASR-BLEU scores are then computed with sacrebleu, following the same tokenization described for S2TT.
from whisper_normalizer.basic import BasicTextNormalizer
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer() ## To be used for English
normalizer = BasicTextNormalizer() ## For non-English directions
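Putting these steps together, an end-to-end ASR-BLEU computation might look like the sketch below. The function and variable names are illustrative; note that Whisper expects its own language codes (e.g. 'en'), which are passed here separately from the three-letter codes used for the tokenizer rule.

import sacrebleu
import whisper
from whisper_normalizer.basic import BasicTextNormalizer
from whisper_normalizer.english import EnglishTextNormalizer

## Illustrative sketch: audio_paths are the generated speech outputs,
## references are the ground-truth target texts, tgt_lang is e.g. 'eng'
def asr_bleu(audio_paths, references, tgt_lang, whisper_lang):
    model = whisper.load_model('medium.en' if tgt_lang == 'eng' else 'large-v2')
    normalizer = EnglishTextNormalizer() if tgt_lang == 'eng' else BasicTextNormalizer()
    ## Greedy decoding with temperature 0, passing the target language
    transcriptions = [
        model.transcribe(p, language=whisper_lang, temperature=0, beam_size=1)['text']
        for p in audio_paths
    ]
    ## Normalize both sides before scoring
    hyps = [normalizer(t) for t in transcriptions]
    refs = [normalizer(r) for r in references]
    tokenizer = 'char' if tgt_lang in {'cmn', 'jpn', 'tha', 'lao', 'mya'} else '13a'
    return sacrebleu.BLEU(tokenize=tokenizer).corpus_score(hyps, [refs])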
Similar to S2TT, raw (unnormalized) references and predictions are used to compute chrF++ scores for text-to-text translation (T2TT).
import sacrebleu
chrf_metric = sacrebleu.CHRF(word_order=2) ## word_order=2 gives chrF++
chrf_score = chrf_metric.corpus_score(<PREDICTIONS>, [<REFERENCES>])
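For example, with the hypotheses passed first and the references wrapped in a list (toy sentences, for illustration only):

import sacrebleu

chrf_metric = sacrebleu.CHRF(word_order=2)
## Toy example; real evaluation runs over the full test set
predictions = ['the cat sat on the mat']
references = ['the cat sat on a mat']
chrf_score = chrf_metric.corpus_score(predictions, [references])
print(chrf_score.score)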
Similar to Whisper, the character error rate (CER) metric is used for Mandarin Chinese (cmn), Japanese (jpn), Thai (tha), Lao (lao), and Burmese (mya); the word error rate (WER) metric is used for the remaining languages. Whisper-normalizer is applied to both the ground-truth references and the model-generated transcriptions. The jiwer library is used to compute the CER and WER scores.
import jiwer
wer = jiwer.wer(<REFERENCES>, <PREDICTIONS>) ## WER
cer = jiwer.cer(<REFERENCES>, <PREDICTIONS>) ## CER
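A minimal sketch of the per-language metric selection, with normalization applied first; the helper name, the 'eng' code, and the hard-coded language set mirror the conventions above and are our assumptions:

import jiwer
from whisper_normalizer.basic import BasicTextNormalizer
from whisper_normalizer.english import EnglishTextNormalizer

## Languages scored with CER rather than WER, following Whisper
CER_LANGS = {'cmn', 'jpn', 'tha', 'lao', 'mya'}

def asr_error_rate(references, predictions, tgt_lang):
    ## Normalize both sides before scoring, as described above
    normalizer = EnglishTextNormalizer() if tgt_lang == 'eng' else BasicTextNormalizer()
    refs = [normalizer(r) for r in references]
    hyps = [normalizer(p) for p in predictions]
    ## CER for the five character-level languages, WER otherwise
    if tgt_lang in CER_LANGS:
        return jiwer.cer(refs, hyps)
    return jiwer.wer(refs, hyps)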