-
Can you show a pseudo example? Like, real: speaker1 and speaker2 said "hello" together. Maybe even upload it. pyannote-rs uses the pyannote model, and it has an overlapped-speech feature, meaning we can know when 2 or 3 speakers spoke at the same time.
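For example, a minimal sketch of what I mean by overlap-aware output (the `Segment` struct and the timings here are made up for illustration; this is not the pyannote-rs API):

```rust
// Hypothetical illustration only: a local struct, not the pyannote-rs API.
struct Segment {
    speaker: &'static str,
    start: f64, // seconds
    end: f64,   // seconds
}

fn main() {
    let segments = [
        Segment { speaker: "speaker1", start: 0.0, end: 2.5 },
        Segment { speaker: "speaker2", start: 1.8, end: 3.0 },
    ];
    // Two segments overlap when each starts before the other ends.
    for (i, a) in segments.iter().enumerate() {
        for b in &segments[i + 1..] {
            let overlap_start = a.start.max(b.start);
            let overlap_end = a.end.min(b.end);
            if overlap_start < overlap_end {
                println!(
                    "{} and {} spoke together from {:.1}s to {:.1}s",
                    a.speaker, b.speaker, overlap_start, overlap_end
                );
            }
        }
    }
}
```

This would print something like: `speaker1 and speaker2 spoke together from 1.8s to 2.5s`.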
-
I shamefully found a video that we could use as a test, but unfortunately I couldn't find any "disturbing" videos other than political ones (again, sorry I couldn't find a better example of a public use case with unbalanced voice tones, speech interruptions, and overlapping speech 😂, but I thought this could be a very nice test case haha). So I thought we could test parallel speech and audio normalization, and push the limits. It is a 7-minute video; I am not sharing it as something political in any way! But I thought we could use it for further testing. If this starts to produce good results, I think we can be sure that other results around the world will be close to "perfect". (I trimmed 00:00 to 6:30.) https://www.youtube.com/watch?v=EHnqaDZ2KLk I downloaded this video in mp4 format. Vibe results (I used the small model):
-
Here is the simple code, which performs a little better, but the speaker identification is the worst :-D
code:

```rust
use eyre::{bail, Context, OptionExt, Result};
use pyannote_rs::{EmbeddingExtractor, EmbeddingManager};
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};
use std::time::Instant;
use std::fs::File;
use std::io::Write;
use std::path::Path;
use std::panic::{catch_unwind, AssertUnwindSafe};
use rubato::{Resampler, SincFixedIn, SincInterpolationParameters, SincInterpolationType, WindowFunction};

/// Resample audio with a windowed-sinc resampler (rubato).
fn resample(input: &[f32], input_sample_rate: usize, output_sample_rate: usize) -> Result<Vec<f32>> {
    let params = SincInterpolationParameters {
        sinc_len: 256,
        f_cutoff: 0.95,
        interpolation: SincInterpolationType::Linear,
        oversampling_factor: 160,
        window: WindowFunction::BlackmanHarris2,
    };
    let mut resampler = SincFixedIn::<f32>::new(
        output_sample_rate as f64 / input_sample_rate as f64,
        2.0,
        params,
        input.len(), // process the whole file as a single chunk
        1,           // mono
    )?;
    let waves_in: Vec<Vec<f32>> = vec![input.to_vec()];
    let waves_out = resampler.process(&waves_in, None)?;
    Ok(waves_out[0].clone())
}

/// Read a WAV file, normalize samples to f32 in [-1.0, 1.0], and downmix to mono.
fn read_audio_file(path: &str) -> Result<(i32, Vec<f32>)> {
    let mut reader = hound::WavReader::open(path)?;
    let spec = reader.spec();
    let sample_rate = spec.sample_rate as i32;
    let channels = spec.channels as usize;
    if channels > 1 {
        println!("Non-mono audio detected. Converting to mono...");
    }
    let samples: Vec<f32> = match spec.sample_format {
        hound::SampleFormat::Int => match spec.bits_per_sample {
            16 => reader.samples::<i16>().map(|s| s.unwrap() as f32 / i16::MAX as f32).collect(),
            24 => reader.samples::<i32>().map(|s| s.unwrap() as f32 / 8_388_608.0).collect(), // 2^23
            32 => reader.samples::<i32>().map(|s| s.unwrap() as f32 / i32::MAX as f32).collect(),
            _ => bail!("Unsupported bits per sample: {}", spec.bits_per_sample),
        },
        hound::SampleFormat::Float => reader.samples::<f32>().map(|s| s.unwrap()).collect(),
    };
    // Average interleaved channels into a single mono channel.
    let mono_samples = if channels > 1 {
        samples.chunks(channels).map(|chunk| chunk.iter().sum::<f32>() / channels as f32).collect()
    } else {
        samples
    };
    Ok((sample_rate, mono_samples))
}

fn create_whisper_context(model_path: &str) -> Result<WhisperContext> {
    println!("Opening model...");
    let model_path = Path::new(model_path);
    if !model_path.exists() {
        bail!("Whisper model file doesn't exist");
    }
    let mut ctx_params = WhisperContextParameters::default();
    // Enable GPU only when a CUDA/ROCm toolchain is advertised via the environment.
    if std::env::var("CUDA_VERSION").is_ok() || std::env::var("ROCM_VERSION").is_ok() {
        ctx_params.use_gpu = true;
    }
    println!("Use GPU: {:?}", ctx_params.use_gpu);
    let model_path = model_path.to_str().ok_or_eyre("Can't convert model path to string")?;
    println!("Creating Whisper context with model path {}", model_path);
    // whisper.cpp can abort on a corrupt model, so guard against panics.
    let ctx_result = catch_unwind(AssertUnwindSafe(|| {
        WhisperContext::new_with_params(model_path, ctx_params)
    }));
    match ctx_result {
        Err(error) => bail!("Create Whisper context crashed: {:?}", error),
        Ok(ctx) => {
            println!("Created context successfully");
            Ok(ctx?)
        }
    }
}

fn setup_whisper_params() -> FullParams<'static, 'static> {
    let mut params = FullParams::new(SamplingStrategy::default());
    params.set_print_special(false);
    params.set_print_progress(true);
    params.set_print_realtime(false);
    params.set_print_timestamps(true);
    params.set_suppress_blank(true);
    params.set_token_timestamps(true);
    params.set_language(Some("en"));
    params.set_n_threads(4);
    params.set_translate(false);
    params.set_no_context(false);
    params.set_single_segment(false);
    params.set_split_on_word(true);
    params.set_max_tokens(0);
    params.set_temperature(0.0);
    params
}

fn transcribe(
    ctx: &WhisperContext,
    audio_path: &str,
    output_path: &str,
    diarize_options: DiarizeOptions,
) -> Result<()> {
    println!("Transcribe called for {}", audio_path);
    let (sample_rate, original_samples) = read_audio_file(audio_path)?;
    // Whisper expects 16 kHz mono input.
    let whisper_samples = if sample_rate != 16000 {
        println!("Resampling audio from {} Hz to 16000 Hz", sample_rate);
        resample(&original_samples, sample_rate as usize, 16000)?
    } else {
        original_samples
    };
    let mut state = ctx.create_state().context("Failed to create state")?;
    let mut params = setup_whisper_params();
    let mut output_file = File::create(output_path)?;
    let start_time = Instant::now();
    params.set_single_segment(false); // Process the entire audio for better context
    // Convert f32 samples to i16 for pyannote_rs::segment
    let i16_samples: Vec<i16> = whisper_samples.iter()
        .map(|&x| (x * i16::MAX as f32) as i16)
        .collect();
    let diarize_segments = pyannote_rs::segment(&i16_samples, 16000, &diarize_options.segment_model_path)?;
    let mut embedding_manager = EmbeddingManager::new(diarize_options.max_speakers);
    let mut extractor = EmbeddingExtractor::new(&diarize_options.embedding_model_path)?;
    // Transcribe the entire audio
    state.full(params, &whisper_samples).context("failed to transcribe")?;
    let num_segments = state.full_n_segments().context("failed to get number of segments")?;
    let mut current_diarize_segment = 0;
    let mut current_speaker = String::new();
    for s in 0..num_segments {
        let text = state.full_get_segment_text(s).context("failed to get segment")?;
        // Whisper timestamps are in centiseconds; convert to seconds.
        let start = state.full_get_segment_t0(s)? as f64 / 100.0;
        let end = state.full_get_segment_t1(s)? as f64 / 100.0;
        // Find the corresponding diarize segment. Note: this assigns a single
        // speaker per Whisper segment, so overlapping speech gets only one label.
        while current_diarize_segment < diarize_segments.len() && diarize_segments[current_diarize_segment].end <= start {
            current_diarize_segment += 1;
        }
        if current_diarize_segment < diarize_segments.len() {
            let diarize_segment = &diarize_segments[current_diarize_segment];
            let embedding_result: Vec<f32> = match extractor.compute(&diarize_segment.samples) {
                Ok(iter) => iter.collect(),
                Err(e) => {
                    eprintln!("Error computing embedding: {:?}", e);
                    Vec::new()
                }
            };
            // Once the speaker budget is full, match against known speakers only;
            // otherwise allow new speakers above the similarity threshold.
            let speaker = if embedding_manager.get_all_speakers().len() == diarize_options.max_speakers {
                embedding_manager
                    .get_best_speaker_match(embedding_result)
                    .map(|r| r.to_string())
                    .unwrap_or("Unknown".to_string())
            } else {
                embedding_manager
                    .search_speaker(embedding_result, diarize_options.threshold)
                    .map(|r| r.to_string())
                    .unwrap_or("Unknown".to_string())
            };
            if speaker != current_speaker {
                writeln!(output_file, "Speaker {}:", speaker)?;
                current_speaker = speaker.clone();
            }
            if !text.trim().is_empty() {
                let output_line = format!(
                    "start = {:.2}, end = {:.2}, text = {}",
                    start, end, text.trim()
                );
                writeln!(output_file, "{}", output_line)?;
                println!("Speaker {}: {}", speaker, output_line);
            }
        }
    }
    let processing_time = start_time.elapsed().as_secs();
    println!("Processing time: {} seconds", processing_time);
    println!("Output written to: {}", output_path);
    Ok(())
}

#[derive(Debug)]
struct DiarizeOptions {
    segment_model_path: String,
    embedding_model_path: String,
    threshold: f32,
    max_speakers: usize,
}

fn main() -> Result<()> {
    let audio_path = "political.wav";
    let output_path = "political.txt";
    let diarize_options = DiarizeOptions {
        segment_model_path: "segmentation-3.0.onnx".to_string(),
        embedding_model_path: "wespeaker_en_voxceleb_CAM++.onnx".to_string(),
        threshold: 0.5,
        max_speakers: 3,
    };
    let whisper_model_path = "ggml-small.bin";
    let ctx = create_whisper_context(whisper_model_path)?;
    transcribe(&ctx, audio_path, output_path, diarize_options)?;
    Ok(())
}
```
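If anyone wants to try building this, it assumes a Cargo manifest roughly like the following (the version numbers are my guesses, not pinned requirements; check crates.io for whatever pyannote-rs and whisper-rs currently need):

```toml
[dependencies]
# Versions below are assumptions for illustration; adjust to your setup.
eyre = "0.6"
hound = "3.5"
pyannote-rs = "0.2"
rubato = "0.15"
whisper-rs = "0.12"
```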
-
In the latest version I added audio normalization. It seems that the transcription is much more accurate.
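The general idea is something like simple peak normalization; this is only a sketch of the concept, not the exact code from the release:

```rust
// Scale samples so the loudest peak hits a target level (simple peak
// normalization). Assumes f32 samples in [-1.0, 1.0].
fn normalize_peak(samples: &mut [f32], target_peak: f32) {
    let peak = samples.iter().fold(0.0f32, |acc, &s| acc.max(s.abs()));
    if peak > 0.0 {
        let gain = target_peak / peak;
        for s in samples.iter_mut() {
            *s *= gain;
        }
    }
}

fn main() {
    let mut samples = vec![0.1f32, -0.4, 0.25];
    normalize_peak(&mut samples, 0.95);
    println!("{:?}", samples); // loudest sample is now at ±0.95
}
```

Scaling to a peak just below 1.0 avoids clipping while bringing quiet speakers up, which seems to help the transcription.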
-
I think it's a fairly difficult problem to solve, but if there are any ideas, I would love to hear them.
Right now, diarization is not working very well on parallel speech, and that's a limitation.