Advanced ComfyUI node with multi-person cloning (up to 6) and unlimited-length audio generation, with all features. #33

@vantagewithai

I’ve been experimenting extensively with the Step-Audio-EditX model and created a custom ComfyUI node that expands the model’s capabilities for more advanced, production-oriented audio generation.

This node supports:

Multiple speakers with separate reference audios

Long-form dialog generation

Inline speaker switching using [speakerX] tags

Inline pauses using [pause]300 (duration in ms)

Inline emotion / style / speed control per line (e.g., [happy], [whisper], [slower])

Support for special paralinguistic tags such as [Laughter], [Breathing], [Dissatisfaction-hnn], etc. These tags are preserved in the text and not stripped.

Automatic concatenation of generated segments into a single final audio

Optional editing passes using the native emotion/style/speed edit functions

Progress visualization + cancellation support

Designed for multi-person conversation generation, audiobooks, and long-form content

This makes it possible to generate fully scripted multi-voice scenes with inline emotional shifts and natural pauses, all in one pass.
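To give a rough idea of how a script written with these tags could be split into per-speaker segments and stitched back together, here is a minimal sketch. It is not the node's actual code. The names `Segment`, `parse_script`, `join_clips`, and `CONTROL_TAGS` are hypothetical, and so are the assumptions that each line carries at most one speaker tag, that a line without a speaker tag keeps the previous speaker, and that [pause]N adds N ms of silence after the line.

```python
import re
from dataclasses import dataclass, field

MAX_SPEAKERS = 6  # the node supports up to 6 cloned speakers
# Assumed control tags; the node's full list is not given in this post.
CONTROL_TAGS = {"happy", "whisper", "slower"}

SPEAKER_RE = re.compile(r"\[speaker(\d+)\]")
PAUSE_RE = re.compile(r"\[pause\](\d+)")
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


@dataclass
class Segment:
    speaker: int
    text: str
    controls: list = field(default_factory=list)
    pause_ms: int = 0  # silence to add after this segment


def parse_script(script):
    """Split a tagged script into one Segment per non-empty line."""
    segments, speaker = [], 1
    for line in script.splitlines():
        m = SPEAKER_RE.search(line)
        if m:
            speaker = int(m.group(1))
            if not 1 <= speaker <= MAX_SPEAKERS:
                raise ValueError(f"unknown speaker {speaker}")
            line = SPEAKER_RE.sub("", line)
        pause_ms = sum(int(p) for p in PAUSE_RE.findall(line))
        line = PAUSE_RE.sub("", line)
        controls = []

        def strip_control(match):
            name = match.group(1)
            if name in CONTROL_TAGS:
                controls.append(name)
                return ""
            return match.group(0)  # e.g. [Laughter] stays in the text

        text = " ".join(TAG_RE.sub(strip_control, line).split())
        if text or pause_ms:
            segments.append(Segment(speaker, text, controls, pause_ms))
    return segments


def join_clips(clips, pauses_ms, sample_rate):
    """Concatenate mono clips (lists of float samples), adding silence after each."""
    out = []
    for clip, pause_ms in zip(clips, pauses_ms):
        out.extend(clip)
        out.extend([0.0] * (sample_rate * pause_ms // 1000))
    return out
```

In this sketch, a line such as `[speaker2] [whisper] Keep it down. [pause]300` would become a segment for speaker 2 with the control `whisper`, the text `Keep it down.`, and a 300 ms pause, while a tag like [Laughter] stays inside the text. Each segment would then be generated with that speaker's reference audio, and the resulting clips passed to `join_clips` in order.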

I’m attaching a workflow screenshot in case anyone is interested in using it or contributing improvements.

Thanks for the awesome model — it’s incredibly powerful and flexible, and this node builds directly on the existing cloning/editing functionality without modifying the core engine.

ComfyUI Custom Node: https://github.com/vantagewithai/Vantage-Step-Audio-EditX

[Image: workflow screenshot]
