Extending the allowed speech pause when capturing audio #5

Open
fuyud opened this issue Jan 21, 2025 · 5 comments

fuyud commented Jan 21, 2025

void startListening({
  double positiveSpeechThreshold = 0.5,
  double negativeSpeechThreshold = 0.35,
  int preSpeechPadFrames = 1,
  int redemptionFrames = 8,
  int frameSamples = 1536,
  int minSpeechFrames = 3,
  bool submitUserSpeechOnPause = false,
});
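What does each of these parameters do?
I want to lengthen the pause that is tolerated while capturing audio, so that more audio is captured before it stops.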

@keyur2maru
Owner

With the default Silero v4 model and frameSamples=1536, one frame equals 96ms (at 16kHz sample rate). For Silero VAD v5 model, frameSamples must be 512, making one frame 32ms.

Parameters explained:

  • positiveSpeechThreshold (0.5): Confidence threshold to detect speech. Higher values mean more certainty needed to mark as speech.
  • negativeSpeechThreshold (0.35): Threshold to mark as non-speech. Lower values are more sensitive to potential speech.
  • preSpeechPadFrames (1): Number of frames to include before detected speech starts.
  • redemptionFrames (8): Key parameter for your need - number of silent frames needed before marking speech as ended. Default is 8 frames (768ms with default settings). Increase this to extend pause detection time.
  • frameSamples (1536): Samples per frame. Fixed at 1536 for legacy model (96ms) or 512 for v5 model (32ms).
  • minSpeechFrames (3): Minimum frames of speech needed before triggering speech detection.
  • submitUserSpeechOnPause (false): Whether to submit collected audio when paused.

To extend pause detection time, increase redemptionFrames. For example, setting it to 16 would require 1.536 seconds of silence before stopping.
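
The arithmetic above can be sketched in Dart. This is only an illustration: the helper names `frameDurationMs` and `silenceBeforeEndMs` are hypothetical and not part of the package, and the 16 kHz sample rate is taken from the explanation above.

```dart
/// Duration of one VAD frame in milliseconds.
/// Hypothetical helper for illustration, not part of the package API.
double frameDurationMs(int frameSamples, {int sampleRate = 16000}) =>
    frameSamples * 1000 / sampleRate;

/// Silence (in milliseconds) that must elapse before speech is marked as ended.
/// Hypothetical helper for illustration, not part of the package API.
double silenceBeforeEndMs(int redemptionFrames, int frameSamples) =>
    redemptionFrames * frameDurationMs(frameSamples);

void main() {
  print(frameDurationMs(1536)); // legacy model: 96 ms per frame
  print(frameDurationMs(512)); // v5 model: 32 ms per frame
  print(silenceBeforeEndMs(8, 1536)); // default: 768 ms
  print(silenceBeforeEndMs(16, 1536)); // 1536 ms, i.e. 1.536 s
}
```

Note that the same redemptionFrames value gives a much shorter pause with the v5 model, because each frame is 32 ms rather than 96 ms.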

@fuyud
Author

fuyud commented Jan 25, 2025 via email

@keyur2maru
Owner

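redemptionFrames controls how long the VAD waits after detecting silence before it decides that speech has really ended.

Taking your example "Hello (pause) I want to go to the airport, how do I take the subway":

With the default setting (redemptionFrames = 8):

  • Each frame lasts 96 ms (with the default frameSamples=1536)
  • 8 frames = 8 * 96 ms = 768 ms (about 0.77 seconds)
  • So if the pause lasts longer than 0.77 seconds, recording stops

To extend this to 2 seconds, increase redemptionFrames:

  • 2 seconds = 2000 ms
  • 2000 ms ÷ 96 ms ≈ 21 frames

So you can set it like this:

vadHandler.startListening(
  redemptionFrames: 21  // stops only after about 2 seconds of silence is detected
);

With this setting, recording continues through short pauses in the middle of speech until the whole sentence has been spoken.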
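
For other target pauses or the v5 model, the same calculation can be written as a small Dart helper. This is a minimal sketch: `redemptionFramesFor` is a hypothetical name, not part of the package, and it assumes 16 kHz audio. It rounds up so that the tolerated pause is at least the requested duration.

```dart
/// Smallest redemptionFrames value that tolerates at least [pauseMs] of silence.
/// Hypothetical helper for illustration, not part of the package API.
/// Assumes 16 kHz audio unless [sampleRate] is overridden.
int redemptionFramesFor(
  int pauseMs, {
  int frameSamples = 1536,
  int sampleRate = 16000,
}) {
  final frameMs = frameSamples * 1000 / sampleRate; // 96 ms or 32 ms per frame
  return (pauseMs / frameMs).ceil(); // round up to whole frames
}

void main() {
  print(redemptionFramesFor(2000)); // legacy model, 1536 samples per frame
  print(redemptionFramesFor(2000, frameSamples: 512)); // v5 model
}
```

For a 2-second pause this gives 21 frames with frameSamples=1536 and 63 frames with frameSamples=512.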

@fuyud
Author

fuyud commented Jan 25, 2025 via email
