-
Can you show a pseudo example? Like, real: speaker1 and speaker2 said "hello" together. Maybe even upload it. pyannote-rs uses the pyannote model, and it has an overlapped-speech feature, meaning we can know when 2 or 3 speakers spoke at the same time.
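For example, a minimal sketch of what I mean by overlap-aware output (the `Segment` struct and the timings here are made up for illustration; this is not the pyannote-rs API):

```rust
// Hypothetical illustration only: a local struct, not the pyannote-rs API.
struct Segment {
    speaker: &'static str,
    start: f64, // seconds
    end: f64,   // seconds
}

fn main() {
    let segments = [
        Segment { speaker: "speaker1", start: 0.0, end: 2.5 },
        Segment { speaker: "speaker2", start: 1.8, end: 3.0 },
    ];
    // Two segments overlap when each starts before the other ends.
    for (i, a) in segments.iter().enumerate() {
        for b in &segments[i + 1..] {
            let overlap_start = a.start.max(b.start);
            let overlap_end = a.end.min(b.end);
            if overlap_start < overlap_end {
                println!(
                    "{} and {} spoke together from {:.1}s to {:.1}s",
                    a.speaker, b.speaker, overlap_start, overlap_end
                );
            }
        }
    }
}
```

This would print something like: `speaker1 and speaker2 spoke together from 1.8s to 2.5s`.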
-
I shamefully found a video that we could use as a test, but unfortunately I couldn't find any "disturbing" videos other than political ones (again, sorry I couldn't find a better example of a public use case with unbalanced voice tones, speech interruptions, and overlapping speech 😂, but I thought this could be a very nice test case haha). So I thought we could test parallel speech and audio normalization, and push the limits. It is a 7-minute video; I am not sharing it as something political in any way! But I thought we could use it for further testing. If this starts to produce good results, I think we can be sure that other results around the world will be close to "perfect". (I trimmed 00:00 to 6:30.) https://www.youtube.com/watch?v=EHnqaDZ2KLk I downloaded this video in mp4 format. Vibe results (I used the small model):
-
Here is the simple code, which performs a little better, but the speaker identification is the worst :-D
code:

```rust
use eyre::{bail, Context, OptionExt, Result};
use pyannote_rs::{EmbeddingExtractor, EmbeddingManager};
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};
use std::time::Instant;
use std::fs::File;
use std::io::Write;
use std::path::Path;
use std::panic::{catch_unwind, AssertUnwindSafe};
use rubato::{Resampler, SincFixedIn, SincInterpolationParameters, SincInterpolationType, WindowFunction};

/// Resample audio with a windowed-sinc resampler (rubato).
fn resample(input: &[f32], input_sample_rate: usize, output_sample_rate: usize) -> Result<Vec<f32>> {
    let params = SincInterpolationParameters {
        sinc_len: 256,
        f_cutoff: 0.95,
        interpolation: SincInterpolationType::Linear,
        oversampling_factor: 160,
        window: WindowFunction::BlackmanHarris2,
    };
    let mut resampler = SincFixedIn::<f32>::new(
        output_sample_rate as f64 / input_sample_rate as f64,
        2.0,
        params,
        input.len(), // process the whole file as a single chunk
        1,           // mono
    )?;
    let waves_in: Vec<Vec<f32>> = vec![input.to_vec()];
    let waves_out = resampler.process(&waves_in, None)?;
    Ok(waves_out[0].clone())
}

/// Read a WAV file, normalize samples to f32 in [-1.0, 1.0], and downmix to mono.
fn read_audio_file(path: &str) -> Result<(i32, Vec<f32>)> {
    let mut reader = hound::WavReader::open(path)?;
    let spec = reader.spec();
    let sample_rate = spec.sample_rate as i32;
    let channels = spec.channels as usize;
    if channels > 1 {
        println!("Non-mono audio detected. Converting to mono...");
    }
    let samples: Vec<f32> = match spec.sample_format {
        hound::SampleFormat::Int => match spec.bits_per_sample {
            16 => reader.samples::<i16>().map(|s| s.unwrap() as f32 / i16::MAX as f32).collect(),
            24 => reader.samples::<i32>().map(|s| s.unwrap() as f32 / 8_388_608.0).collect(), // 2^23
            32 => reader.samples::<i32>().map(|s| s.unwrap() as f32 / i32::MAX as f32).collect(),
            _ => bail!("Unsupported bits per sample: {}", spec.bits_per_sample),
        },
        hound::SampleFormat::Float => reader.samples::<f32>().map(|s| s.unwrap()).collect(),
    };
    // Average interleaved channels into a single mono channel.
    let mono_samples = if channels > 1 {
        samples.chunks(channels).map(|chunk| chunk.iter().sum::<f32>() / channels as f32).collect()
    } else {
        samples
    };
    Ok((sample_rate, mono_samples))
}

fn create_whisper_context(model_path: &str) -> Result<WhisperContext> {
    println!("Opening model...");
    let model_path = Path::new(model_path);
    if !model_path.exists() {
        bail!("Whisper model file doesn't exist");
    }
    let mut ctx_params = WhisperContextParameters::default();
    // Enable GPU only when a CUDA/ROCm toolchain is advertised via the environment.
    if std::env::var("CUDA_VERSION").is_ok() || std::env::var("ROCM_VERSION").is_ok() {
        ctx_params.use_gpu = true;
    }
    println!("Use GPU: {:?}", ctx_params.use_gpu);
    let model_path = model_path.to_str().ok_or_eyre("Can't convert model path to string")?;
    println!("Creating Whisper context with model path {}", model_path);
    // whisper.cpp can abort on a corrupt model, so guard against panics.
    let ctx_result = catch_unwind(AssertUnwindSafe(|| {
        WhisperContext::new_with_params(model_path, ctx_params)
    }));
    match ctx_result {
        Err(error) => bail!("Create Whisper context crashed: {:?}", error),
        Ok(ctx) => {
            println!("Created context successfully");
            Ok(ctx?)
        }
    }
}

fn setup_whisper_params() -> FullParams<'static, 'static> {
    let mut params = FullParams::new(SamplingStrategy::default());
    params.set_print_special(false);
    params.set_print_progress(true);
    params.set_print_realtime(false);
    params.set_print_timestamps(true);
    params.set_suppress_blank(true);
    params.set_token_timestamps(true);
    params.set_language(Some("en"));
    params.set_n_threads(4);
    params.set_translate(false);
    params.set_no_context(false);
    params.set_single_segment(false);
    params.set_split_on_word(true);
    params.set_max_tokens(0);
    params.set_temperature(0.0);
    params
}

fn transcribe(
    ctx: &WhisperContext,
    audio_path: &str,
    output_path: &str,
    diarize_options: DiarizeOptions,
) -> Result<()> {
    println!("Transcribe called for {}", audio_path);
    let (sample_rate, original_samples) = read_audio_file(audio_path)?;
    // Whisper expects 16 kHz mono input.
    let whisper_samples = if sample_rate != 16000 {
        println!("Resampling audio from {} Hz to 16000 Hz", sample_rate);
        resample(&original_samples, sample_rate as usize, 16000)?
    } else {
        original_samples
    };
    let mut state = ctx.create_state().context("Failed to create state")?;
    let mut params = setup_whisper_params();
    let mut output_file = File::create(output_path)?;
    let start_time = Instant::now();
    params.set_single_segment(false); // Process the entire audio for better context
    // Convert f32 samples to i16 for pyannote_rs::segment
    let i16_samples: Vec<i16> = whisper_samples.iter()
        .map(|&x| (x * i16::MAX as f32) as i16)
        .collect();
    let diarize_segments = pyannote_rs::segment(&i16_samples, 16000, &diarize_options.segment_model_path)?;
    let mut embedding_manager = EmbeddingManager::new(diarize_options.max_speakers);
    let mut extractor = EmbeddingExtractor::new(&diarize_options.embedding_model_path)?;
    // Transcribe the entire audio
    state.full(params, &whisper_samples).context("failed to transcribe")?;
    let num_segments = state.full_n_segments().context("failed to get number of segments")?;
    let mut current_diarize_segment = 0;
    let mut current_speaker = String::new();
    for s in 0..num_segments {
        let text = state.full_get_segment_text(s).context("failed to get segment")?;
        // Whisper timestamps are in centiseconds; convert to seconds.
        let start = state.full_get_segment_t0(s)? as f64 / 100.0;
        let end = state.full_get_segment_t1(s)? as f64 / 100.0;
        // Find the corresponding diarize segment. Note: this assigns a single
        // speaker per Whisper segment, so overlapping speech gets only one label.
        while current_diarize_segment < diarize_segments.len() && diarize_segments[current_diarize_segment].end <= start {
            current_diarize_segment += 1;
        }
        if current_diarize_segment < diarize_segments.len() {
            let diarize_segment = &diarize_segments[current_diarize_segment];
            let embedding_result: Vec<f32> = match extractor.compute(&diarize_segment.samples) {
                Ok(iter) => iter.collect(),
                Err(e) => {
                    eprintln!("Error computing embedding: {:?}", e);
                    Vec::new()
                }
            };
            // Once the speaker budget is full, match against known speakers only;
            // otherwise allow new speakers above the similarity threshold.
            let speaker = if embedding_manager.get_all_speakers().len() == diarize_options.max_speakers {
                embedding_manager
                    .get_best_speaker_match(embedding_result)
                    .map(|r| r.to_string())
                    .unwrap_or("Unknown".to_string())
            } else {
                embedding_manager
                    .search_speaker(embedding_result, diarize_options.threshold)
                    .map(|r| r.to_string())
                    .unwrap_or("Unknown".to_string())
            };
            if speaker != current_speaker {
                writeln!(output_file, "Speaker {}:", speaker)?;
                current_speaker = speaker.clone();
            }
            if !text.trim().is_empty() {
                let output_line = format!(
                    "start = {:.2}, end = {:.2}, text = {}",
                    start, end, text.trim()
                );
                writeln!(output_file, "{}", output_line)?;
                println!("Speaker {}: {}", speaker, output_line);
            }
        }
    }
    let processing_time = start_time.elapsed().as_secs();
    println!("Processing time: {} seconds", processing_time);
    println!("Output written to: {}", output_path);
    Ok(())
}

#[derive(Debug)]
struct DiarizeOptions {
    segment_model_path: String,
    embedding_model_path: String,
    threshold: f32,
    max_speakers: usize,
}

fn main() -> Result<()> {
    let audio_path = "political.wav";
    let output_path = "political.txt";
    let diarize_options = DiarizeOptions {
        segment_model_path: "segmentation-3.0.onnx".to_string(),
        embedding_model_path: "wespeaker_en_voxceleb_CAM++.onnx".to_string(),
        threshold: 0.5,
        max_speakers: 3,
    };
    let whisper_model_path = "ggml-small.bin";
    let ctx = create_whisper_context(whisper_model_path)?;
    transcribe(&ctx, audio_path, output_path, diarize_options)?;
    Ok(())
}
```
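If anyone wants to try building this, it assumes a Cargo manifest roughly like the following (the version numbers are my guesses, not pinned requirements; check crates.io for whatever pyannote-rs and whisper-rs currently need):

```toml
[dependencies]
# Versions below are assumptions for illustration; adjust to your setup.
eyre = "0.6"
hound = "3.5"
pyannote-rs = "0.2"
rubato = "0.15"
whisper-rs = "0.12"
```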
-
In the latest version I added audio normalization. It seems that the transcription is much more accurate.
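The general idea is something like simple peak normalization; this is only a sketch of the concept, not the exact code from the release:

```rust
// Scale samples so the loudest peak hits a target level (simple peak
// normalization). Assumes f32 samples in [-1.0, 1.0].
fn normalize_peak(samples: &mut [f32], target_peak: f32) {
    let peak = samples.iter().fold(0.0f32, |acc, &s| acc.max(s.abs()));
    if peak > 0.0 {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}

fn main() {
    let mut samples = vec![0.1f32, -0.4, 0.25];
    normalize_peak(&mut samples, 0.95);
    println!("{:?}", samples); // loudest sample is now at ±0.95
}
```

Scaling to a peak just below 1.0 avoids clipping while bringing quiet speakers up, which seems to help the transcription.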
-
I think it's a fairly difficult problem to solve, but if there are any ideas, I would love to hear them.
Right now, diarization is not working very well on parallel speech, and that's a limitation.