Description
I’ve been experimenting extensively with the Step-Audio-EditX model and created a custom ComfyUI node that expands the model’s capabilities for more advanced, production-oriented audio generation.
This node supports:
- Multiple speakers, each with its own reference audio
- Long-form dialog generation
- Inline speaker switching using [speakerX] tags
- Inline pauses using [pause]300 (duration in ms)
- Inline emotion / style / speed control per line (e.g., [happy], [whisper], [slower])
- Special paralinguistic tags such as [Laughter], [Breathing], [Dissatisfaction-hnn], etc., which are preserved in the text rather than stripped (see the example script after this list)
- Automatic concatenation of generated segments into a single final audio file
- Optional editing passes using the native emotion/style/speed edit functions
- Progress visualization and cancellation support
- Designed for multi-person conversation generation, audiobooks, and long-form content
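
For concreteness, a short script using this tag syntax might look like the following (a hypothetical example; the exact tag vocabulary is documented in the node's README):

```
[speaker1][happy] Hey, you made it! [Laughter]
[speaker2][whisper] Keep your voice down... [pause]300 we're already recording.
[speaker1][slower] Right, sorry about that. [Breathing]
```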
Together, these make it possible to generate fully scripted multi-voice scenes with inline emotional shifts and natural pauses, all in a single pass.
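
Under the hood, the flow is presumably something like: tokenize the script on [tags], synthesize each text segment with the currently active speaker and style, insert silence for pauses, and concatenate everything at the end. Below is a minimal Python sketch of that flow; synthesize(), the CONTROL tag set, and the 24 kHz sample rate are hypothetical stand-ins for the node's actual internals, not its real API.

```python
import re
import numpy as np

SR = 24000  # assumed sample rate; the real node takes this from the model output
CONTROL = {"happy", "sad", "angry", "whisper", "slower", "faster"}  # illustrative subset

def synthesize(text: str, speaker: str, style: str | None) -> np.ndarray:
    """Stand-in for the node's call into Step-Audio-EditX voice cloning;
    returns silence here so the sketch runs without the model."""
    return np.zeros(int(0.5 * SR), dtype=np.float32)

def render_script(script: str) -> np.ndarray:
    chunks, buf = [], []
    speaker, style = "speaker1", None

    def flush():
        if buf:
            chunks.append(synthesize(" ".join(buf), speaker, style))
            buf.clear()

    # Tokenize into [tag] markers (with an optional pause duration) and plain text.
    for tag, ms, text in re.findall(r"\[([^\]]+)\](\d+)?|([^\[\]]+)", script):
        low = tag.lower()
        if low == "pause" and ms:           # [pause]300 -> 300 ms of silence
            flush()
            chunks.append(np.zeros(int(ms) * SR // 1000, dtype=np.float32))
        elif low.startswith("speaker"):     # [speaker2] -> switch voices
            flush()
            speaker = low
        elif low in CONTROL:                # [whisper] -> set style for what follows
            flush()
            style = low
        elif tag:                           # [Laughter] etc.: keep the tag in the text
            buf.append(f"[{tag}]")
        elif text.strip():
            buf.append(text.strip())
    flush()
    return np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.float32)

audio = render_script("[speaker1][happy] Hey! [pause]300 [speaker2] Hi. [Laughter]")
```

Note that paralinguistic tags fall through to the text buffer rather than being treated as control tags, matching the pass-through behavior described above; a style tag stays active until the next one appears, so per-line resets would need an extra rule.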
I’m attaching a workflow screenshot in case anyone is interested in using it or contributing improvements.
Thanks for the awesome model — it’s incredibly powerful and flexible, and this node builds directly on the existing cloning/editing functionality without modifying the core engine.
ComfyUI Custom Node: https://github.com/vantagewithai/Vantage-Step-Audio-EditX
