Hi, thanks for open-sourcing this nice work.
In the CosyVoice decoder, a speaker embedding is used, while many works (VoiceBox, SoundStorm, E2-TTS, F5-TTS, etc.) do not use a speaker embedding on the decoder side.

In CosyVoice 2, the speech tokenizer's ability has improved considerably. If the speech tokens carry very little speaker information, then relying only on the prefix prompt should work well for the zero-shot cloning task. I assume your team has already run experiments on dropping the speaker embedding. Is there a good reason to keep the speaker embedding in the flow-matching decoder? I hope such results can be included in the CosyVoice 2 tech report, or in the tech report for the next version of the CosyVoice model.
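To make the two conditioning strategies concrete, here is a toy sketch (not the actual CosyVoice code; all names, shapes, and the additive fusion are illustrative assumptions). Prefix-prompt conditioning concatenates the acoustic prompt along the time axis, as in VoiceBox/E2-TTS-style in-context cloning, while speaker-embedding conditioning broadcasts an utterance-level vector over every frame:

```python
import numpy as np

def decoder_inputs(token_feats, prompt_feats, spk_emb=None):
    """Assemble conditioning for a flow-matching decoder step (toy sketch).

    token_feats:  (T, D)   features derived from semantic speech tokens
    prompt_feats: (T_p, D) features of the acoustic prefix prompt
    spk_emb:      (D,)     optional utterance-level speaker embedding
    """
    # Prefix-prompt conditioning: prepend the prompt along the time axis,
    # so the decoder can copy speaker identity in-context.
    x = np.concatenate([prompt_feats, token_feats], axis=0)
    if spk_emb is not None:
        # Speaker-embedding conditioning: add the utterance-level vector
        # to every frame (one simple way to inject a global condition).
        x = x + spk_emb[None, :]
    return x

T, T_p, D = 8, 4, 16
rng = np.random.default_rng(0)
tokens = rng.normal(size=(T, D))
prompt = rng.normal(size=(T_p, D))
spk = rng.normal(size=(D,))

with_spk = decoder_inputs(tokens, prompt, spk)
without_spk = decoder_inputs(tokens, prompt)
print(with_spk.shape, without_spk.shape)  # both (12, 16)
```

The question above is essentially whether the `spk_emb` branch is redundant once the prompt frames already sit in the decoder's context.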