Some speech zone with low-volume or high distortion are not detected #45

olevanss · 2023-02-28T11:26:29Z

Hi there!

I tried whisper_timestamped to get better timestamps. My test samples are pre-recorded voice lines in a single .wav file, separated by 2 seconds. This looked like an easy task for Whisper, but after it failed I moved to whisper_timestamped.

I attach the results I got. The blue graph is the waveform of my audio, while red ranges are the timelines provided by whisper_timestamped.

The most frequent errors I get are:

1. whisper_timestamped ignores some phrases when timestamping.
2. It puts wrong timestamps on them.
3. It attaches the silent periods bordering a phrase to that phrase's timestamps.

So, while overall performance is good, I would like to ask whether I can do something to improve it. I already tried allowing the Whisper timestamps to be refined by up to 5 seconds.
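For comparing the reported ranges against a waveform, as in the plots described above, a small helper like the following may be useful. It is only a sketch: it assumes the result is a dictionary with "segments", each holding a "words" list with "start" and "end" in seconds. The function name `phrase_ranges`, the `max_gap` threshold, and the sample values are made up for illustration.

```python
def phrase_ranges(result, max_gap=1.0):
    """Merge word timestamps into (start, end) phrase ranges.

    Words whose gap to the previous range is below `max_gap` seconds
    are treated as part of the same phrase (max_gap is an assumption).
    """
    words = [
        word
        for segment in result.get("segments", [])
        for word in segment.get("words", [])
    ]
    words.sort(key=lambda word: word["start"])

    ranges = []
    for word in words:
        if ranges and word["start"] - ranges[-1][1] < max_gap:
            # Close enough to the previous range: extend it.
            ranges[-1][1] = max(ranges[-1][1], word["end"])
        else:
            # Far enough away: start a new phrase range.
            ranges.append([word["start"], word["end"]])
    return [tuple(r) for r in ranges]


# Made-up example data, not taken from the recordings in this thread.
sample = {
    "segments": [
        {"words": [{"start": 0.5, "end": 0.9}, {"start": 1.0, "end": 1.4}]},
        {"words": [{"start": 3.6, "end": 4.0}]},
    ]
}
ranges = phrase_ranges(sample)
```

With voice lines recorded 2 seconds apart, a range that is much longer than the spoken phrase, or two phrases that end up in one range, would point to silence being attached to a phrase.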

Jeronymous · 2023-02-28T11:46:41Z

Maintainer

Can you please describe the problem more precisely:

- What kind of phrases are problematic? Can it be things like speech disfluencies?
- Do your 3 plots correspond to the 3 cases you describe? (The first and last ones do not show obviously wrong behavior.)

If you are using the transcribe() function in Python, you can:

- try beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0) (for cases where some phrases are missing from the transcription)
- try trust_whisper_timestamps = False (then there is no need to tune refine_whisper_precision, it will just recompute all timestamps)
- try a larger model
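Put together, the suggestions above could look roughly like this. It is a minimal sketch, not a definitive recipe: the helper names `build_transcribe_options` and `transcribe_file` and the default model name "small" are assumptions for illustration, and trust_whisper_timestamps needs a recent enough whisper_timestamped.

```python
def build_transcribe_options(recompute_timestamps=True):
    """Collect the keyword arguments suggested in this answer."""
    options = {
        "beam_size": 5,
        "best_of": 5,
        # Fallback temperatures, tried in order if decoding fails.
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    }
    if recompute_timestamps:
        # Recompute all timestamps instead of trusting Whisper's own.
        options["trust_whisper_timestamps"] = False
    return options


def transcribe_file(audio_path, model_name="small"):
    """Transcribe one file with the options above (model name is an assumption)."""
    # Imported here so the options helper works without the package installed.
    import whisper_timestamped as whisper

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model(model_name)
    return whisper.transcribe(model, audio, **build_transcribe_options())
```

Calling transcribe_file("recording.wav") (a placeholder path) should return a result whose segments carry word-level "start" and "end" times, which can then be compared against the waveform.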

3 replies

olevanss Feb 28, 2023
Author

In the first example it ignores the last phrase, which is low in volume.
In the middle you can see some sounds that are not speech.

So my problem is that I want to detect low-volume phrases. I would also still like to timestamp sounds that are not speech. Would that be possible?

The second speech sample is a little distorted, since I analyze phone call recordings. The distortion is not dramatic, and it doesn't stop Whisper from timestamping it more or less OK in the first half.

In the third sample the speech is not detected, and a pause is counted as speech. The problem with this whole region is that it is pronounced a bit like mumbling and basically consists of "yes" repeated many times.

As for the plots: partially, yes, they line up with the errors I listed. But you can still observe problem №3 in all of the samples, as well as the others.

I haven't tried any of what you wrote except for the larger model :)

I will try your advice and let you know how it works.
Thank you for your help!
UPD: my IDE says there are no usages of the trust_whisper_timestamps variable, so I need to update my whisper_timestamped.

Jeronymous Feb 28, 2023
Maintainer

Thank you @olevanss for the clarification.

In what you describe, there are several sources of problems, some of which directly come from the core Whisper neural network models (and might need re-training to be correctly addressed).
So solving some of these issues is currently out of the scope of whisper-timestamped (which is an extension of whisper).

There are also some particular mistakes made by Whisper models for which something can be done when trying to recover accurate timestamps.
For instance, we are currently working on another option to be more accurate around speech disfluencies, which are usually (but not always) removed by Whisper.
Disfluencies are things like fillers ("hmm", ...) and repetitions ("I di I did it", ...).

PS: Indeed trust_whisper_timestamps is a recent option (version 1.10+).
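One way to check whether an installed copy already has the option, rather than relying on an IDE search, is to inspect the function signature. The sketch below uses only the standard library; `accepts_option`, `version_at_least`, and the two stand-in functions are hypothetical names for illustration, and it assumes the option is declared as a named parameter of transcribe().

```python
import inspect


def accepts_option(func, name):
    """Return True if `func` declares a parameter called `name`."""
    return name in inspect.signature(func).parameters


def version_at_least(version, minimum):
    """Compare dotted versions numerically, so that "1.10" is newer than "1.9"."""
    def parse(text):
        return tuple(int(part) for part in text.split(".") if part.isdigit())
    return parse(version) >= parse(minimum)


# Stand-ins for an older and a newer transcribe(); not the real functions.
def old_transcribe(model, audio, refine_whisper_precision=0.5):
    return None


def new_transcribe(model, audio, refine_whisper_precision=0.5,
                   trust_whisper_timestamps=True):
    return None


old_ok = accepts_option(old_transcribe, "trust_whisper_timestamps")
new_ok = accepts_option(new_transcribe, "trust_whisper_timestamps")
```

With the real package, the same check would be accepts_option(whisper_timestamped.transcribe, "trust_whisper_timestamps"). The numeric comparison matters because a plain string comparison would rank "1.10" below "1.9".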

olevanss Mar 1, 2023
Author

Thank you!
The new option for not trusting Whisper's timestamps, or setting beam_size, helped me improve and stabilize the output reasonably well.
