
Add LTX2 Condition Pipeline #13058

Open

dg845 wants to merge 12 commits into main from ltx2-add-condition-pipeline

Conversation

@dg845
Collaborator

@dg845 dg845 commented Jan 30, 2026

What does this PR do?

This PR adds LTX2ConditionPipeline, a pipeline which supports visual conditioning at arbitrary frames for the LTX-2 model (paper, code, weights), following the original code. This is an analogue of LTXConditionPipeline for LTX-2, as both the original LTX models and LTX-2 support a similar conditioning scheme.

Fixes #12926

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
@yiyixuxu

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dg845
Collaborator Author

dg845 commented Feb 4, 2026

Sample first-last-frame-to-video (FLF2V) script:

import torch

from diffusers import LTX2ConditionPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.utils import load_image


model_id = "Lightricks/LTX-2"
device = "cuda:0"
dtype = torch.bfloat16
seed = 42

width = 768
height = 512
frame_rate = 24.0

pipe = LTX2ConditionPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."

first_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)
last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0)
conditions = [first_cond, last_cond]

video, audio = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=generator,
    output_type="np",
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_cond_flf2v.mp4",
)

@dg845
Collaborator Author

dg845 commented Feb 4, 2026

Unfortunately the pipeline isn't quite working as of ed52c0d:

Official FLF2V sample:

ltx2_flf2v_official.mp4

Current diffusers FLF2V sample (using condition pipeline):

ltx2_cond_flf2v.mp4

Not sure why the video colors are messed up at the first and last frames (where the conditions are applied); will debug.

@dg845
Collaborator Author

dg845 commented Feb 4, 2026

I think the color issue is now fixed:

ltx2_cond_flf2v_fixed.mp4

@dg845
Collaborator Author

dg845 commented Feb 4, 2026

The condition pipeline also works with the distilled checkpoint:

FLF2V Distilled Script
import torch

from diffusers import LTX2ConditionPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.utils import load_image


model_id = "rootonchair/LTX-2-19b-distilled"
device = "cuda:0"
dtype = torch.bfloat16
seed = 42

width = 768
height = 512
frame_rate = 24.0

pipe = LTX2ConditionPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."

first_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)
last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0)
conditions = [first_cond, last_cond]

video_latent, audio_latent = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    generator=generator,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    model_id,
    subfolder="latent_upsampler",
    torch_dtype=dtype,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

video, audio = pipe(
    conditions=conditions,
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width * 2,
    height=height * 2,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=3,
    # noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    generator=generator,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_cond_flf2v_distilled.mp4",
)

Official FLF2V distilled sample:

ltx2_flf2v_distilled_official.mp4

diffusers FLF2V distilled sample:

ltx2_cond_flf2v_distilled.mp4

@dg845 dg845 marked this pull request as ready for review February 4, 2026 06:26
@dg845 dg845 requested review from sayakpaul and yiyixuxu February 4, 2026 06:26
@sayakpaul
Member

Thanks for the work, @dg845! Looking at the outputs, is there a discrepancy in the resolutions? The main bird subject seems a bit compressed to me in our implementation.

Member

@sayakpaul sayakpaul left a comment

Thanks for this work! It looks very clean and the implementation also seems faithful to the original one. I left some comments, LMK if they're clear.

Additionally, I would like to see some extensions being used in the pipeline. For example, multiple conditions (multiple images) with different indices. Would it be possible?

Comment on lines 529 to 533
conditions,
image,
video,
cond_index,
strength,
Member

Should we only accept condition as an input to simplify logic? If so, I think check_inputs() would then only accept condition (which could be a single item or a list of conditions). WDYT?

Collaborator Author

The current logic follows LTXConditionPipeline in also accepting image and video arguments (and therefore also needing cond_index and strength arguments). However, I agree that only accepting conditions would probably be better because it is less ambiguous. (One reservation I have is that you would have to import LTX2VideoCondition every time, but maybe that's not a big deal.)

Member

(One reservation I have is that you would have to import LTX2VideoCondition every time, but maybe that's not a big deal.)

That's just a one-time import, no? Then one could create the conditions either as a list of LTX2VideoCondition or as a single LTX2VideoCondition (in the case of a single condition).

If so, I think that's fine?
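
For illustration, a conditions-only call would look roughly like the sketch below. This assumes the conditions-only signature discussed above (not the PR's current API) and reuses pipe, prompt, first_image, etc. from the FLF2V script earlier in the thread:

from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition

# Hypothetical conditions-only usage: `conditions` takes either a single
# LTX2VideoCondition or a list of them, and the image / video / cond_index /
# strength arguments are dropped from the pipeline call.
single_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)

video, audio = pipe(
    conditions=single_cond,  # or a list, e.g. [first_cond, last_cond]
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=generator,
    output_type="np",
    return_dict=False,
)

Either form would keep check_inputs() simple, since it would only need to validate LTX2VideoCondition objects.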

num_frames = (num_frames - 1) // scale_factor * scale_factor + 1
return num_frames

def latent_idx_from_index(self, frame_idx: int, index_type: str = "latent") -> int:
Member

It's currently just a single type. I guess we can just do it inside the caller instead of having a separate function?

Collaborator Author

I was also thinking of supporting a "data" index_type, where the index is interpreted in data (pixel) space rather than latent space, as LTXConditionPipeline appears to support, but I don't quite understand the frame_index logic in LTXConditionPipeline yet. My current understanding is that the original LTX-2 code only supports latent indices (but I might be mistaken).

Member

but I don't quite understand the frame_index logic in LTXConditionPipeline yet. My current understanding is that the original LTX-2 code only supports latent indices (but I might be mistaken).

If so, let's keep it inline then?
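
For reference, a "data" index_type would roughly amount to the conversion sketched below, assuming LTX-2 uses the causal 8x temporal compression implied by the (num_frames - 1) // scale_factor * scale_factor + 1 rounding above (the helper name and mapping convention are hypothetical, not part of this PR):

def latent_idx_from_data_idx(frame_idx: int, temporal_scale_factor: int = 8) -> int:
    # Hypothetical helper: map a pixel-space (data) frame index to a latent frame
    # index, assuming latent frame 0 encodes only pixel frame 0 and every later
    # latent frame encodes `temporal_scale_factor` pixel frames.
    if frame_idx == 0:
        return 0
    return (frame_idx - 1) // temporal_scale_factor + 1


# Pixel frames 0, 1-8, 9-16, ... map to latent indices 0, 1, 2, ...
assert latent_idx_from_data_idx(0) == 0
assert latent_idx_from_data_idx(8) == 1
assert latent_idx_from_data_idx(9) == 2

If the original LTX-2 code indeed only supports latent indices, this conversion can stay out of the pipeline for now.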

negative_prompt_embeds: Optional[torch.Tensor] = None,
negative_prompt_attention_mask: Optional[torch.Tensor] = None,
decode_timestep: Union[float, List[float]] = 0.0,
decode_noise_scale: Optional[Union[float, List[float]]] = None,
Member

A bit unrelated: I haven't seen decode_noise_scale being used in LTX-2. If that's indeed the case, WDYT of removing it from the LTX-2 pipelines?

Collaborator Author

@dg845 dg845 Feb 5, 2026

My understanding is that decode_noise_scale is used if the video VAE supports timestep conditioning:

if not self.vae.config.timestep_conditioning:
    timestep = None
else:
    noise = randn_tensor(latents.shape, generator=generator, device=device, dtype=latents.dtype)
    if not isinstance(decode_timestep, list):
        decode_timestep = [decode_timestep] * batch_size
    if decode_noise_scale is None:
        decode_noise_scale = decode_timestep
    elif not isinstance(decode_noise_scale, list):
        decode_noise_scale = [decode_noise_scale] * batch_size

The LTX-2 VAE currently uses timestep_conditioning=False. It's unclear to me whether the LTX-2 code intends to support it, as the video decoder model still accepts a timestep_conditioning argument:

https://github.com/Lightricks/LTX-2/blob/4f410820b198e05074a1e92de793e3b59e9ab5a0/packages/ltx-core/src/ltx_core/model/video_vae/video_vae.py#L432

but the VAE decoding code doesn't support a timestep argument that would be necessary if timestep_conditioning=True:

https://github.com/Lightricks/LTX-2/blob/4f410820b198e05074a1e92de793e3b59e9ab5a0/packages/ltx-core/src/ltx_core/model/video_vae/video_vae.py#L813
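
For context, when timestep conditioning is enabled, the existing LTX pipelines use decode_noise_scale to blend fresh noise into the latents before decoding, roughly like the sketch below (paraphrased rather than an exact excerpt; with timestep_conditioning=False, as in LTX-2, this branch never runs):

# Sketch of the timestep-conditioned decode path: the latents are lerped towards
# fresh noise by `decode_noise_scale`, then decoded with `decode_timestep` as the
# VAE timestep. Latents are assumed to have shape (B, C, F, H, W) here.
noise = randn_tensor(latents.shape, generator=generator, device=device, dtype=latents.dtype)
decode_noise_scale = torch.tensor(decode_noise_scale, device=device, dtype=latents.dtype)
decode_noise_scale = decode_noise_scale[:, None, None, None, None]  # broadcast over (C, F, H, W)
latents = (1 - decode_noise_scale) * latents + decode_noise_scale * noise

Since the LTX-2 VAE config sets timestep_conditioning=False, this path is currently dead code in the LTX-2 pipelines.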

Member

Yes, I am on the same side. Do you think it could make sense to remove this logic in a separate PR then? It will simplify the code a bit.

# Convert the noise_pred_video velocity model prediction into a sample (x0) prediction
denoised_sample = latents - noise_pred_video * sigma
# Apply the (packed) conditioning mask to the denoised (x0) sample, which will blend the conditions
# with the denoised sample according to the conditioning strength (a strength of 1.0 means we fully
Member

a strength of 1.0 means we fully

(nit): However, from the code, it's not clear how this strength is incorporated. Consider expanding on the comment a little bit.

Collaborator Author

The conditioning strengths are used here:

# Overwrite the portion of latents starting with start_token_idx with the condition
latents[:, start_token_idx:end_token_idx] = cond
conditioning_mask[:, start_token_idx:end_token_idx] = strength
clean_latents[:, start_token_idx:end_token_idx] = cond

Perhaps it would be clearer to say that the conditioning_mask itself specifies the strength with which the conditions are applied?
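
For what it's worth, with a per-token strength mask the blend is essentially a per-token lerp, roughly like the sketch below (illustrative only, assuming conditioning_mask has shape (batch, num_tokens) and holds the per-token strengths; the actual code in this PR may differ in details):

# Sketch (not necessarily the exact code in this PR): blend the clean conditioning
# latents with the model's x0 prediction per token, weighted by the strength mask
# (1.0 keeps the condition exactly, 0.0 keeps the denoised prediction).
denoised_sample = latents - noise_pred_video * sigma
mask = conditioning_mask.unsqueeze(-1)  # (B, num_tokens, 1), broadcast over channels
denoised_sample = mask * clean_latents + (1 - mask) * denoised_sample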

Member

Yes!

@dg845
Collaborator Author

dg845 commented Feb 5, 2026

Additionally, I would like to see some extensions being used in the pipeline. For example, multiple conditions (multiple images) with different indices. Would it be possible?

The FLF2V script in #13058 (comment) gives an example using multiple conditions that's not possible in LTX2ImageToVideoPipeline: here we condition on a last frame as well as an initial frame.
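
For example, a sketch with an additional mid-video condition (reusing pipe, prompt, first_image, last_image, etc. from the FLF2V script above; the middle image path, its latent index, and its strength are illustrative placeholders):

from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.utils import load_image

# The middle image path, latent index, and strength below are illustrative placeholders.
middle_image = load_image("path/to/middle_frame.png")

conditions = [
    LTX2VideoCondition(frames=first_image, index=0, strength=1.0),
    LTX2VideoCondition(frames=middle_image, index=8, strength=0.8),  # mid-video latent index
    LTX2VideoCondition(frames=last_image, index=-1, strength=1.0),
]

video, audio = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=generator,
    output_type="np",
    return_dict=False,
)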

@dg845
Collaborator Author

dg845 commented Feb 5, 2026

Looking at the outputs, is there a discrepancy in the resolutions? The main bird subject seems a bit compressed to me in our implementation.

I believe the discrepancy is because the original LTX-2 code center-crops images:

https://github.com/Lightricks/LTX-2/blob/4f410820b198e05074a1e92de793e3b59e9ab5a0/packages/ltx-pipelines/src/ltx_pipelines/utils/media_io.py#L88

but VideoProcessor.preprocess_video doesn't expose a resize_mode like VaeImageProcessor.preprocess does, so we end up using the "default" resize_mode, which resizes the image to the target resolution without preserving the aspect ratio.
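
Until that's supported, a center crop can be approximated on the PIL side before passing the images to the pipeline (a minimal sketch; it crops to the target aspect ratio and lets the pipeline's processor do the final resize):

from PIL import Image


def center_crop_to_aspect(image: Image.Image, width: int, height: int) -> Image.Image:
    # Center-crop `image` to the target width/height aspect ratio without resizing.
    src_w, src_h = image.size
    target_ratio = width / height
    if src_w / src_h > target_ratio:
        # Too wide: crop the left and right edges.
        new_w = int(round(src_h * target_ratio))
        left = (src_w - new_w) // 2
        return image.crop((left, 0, left + new_w, src_h))
    # Too tall: crop the top and bottom edges.
    new_h = int(round(src_w / target_ratio))
    top = (src_h - new_h) // 2
    return image.crop((0, top, src_w, top + new_h))


first_image = center_crop_to_aspect(first_image, width, height)
last_image = center_crop_to_aspect(last_image, width, height)

Exposing a resize_mode in VideoProcessor.preprocess_video (or a crop option in the pipeline) would make this workaround unnecessary.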

@sayakpaul
Member

sayakpaul commented Feb 5, 2026

but VideoProcessor.preprocess_video doesn't expose a resize_mode like VaeImageProcessor.preprocess does, so we end up using the "default" resize_mode, which resizes the image to the target resolution without preserving the aspect ratio.

Should we try to implement this then? Perhaps club it inside #13084. I think that the results are better with a center crop. WDYT?
