Add LTX2 Condition Pipeline by dg845 · Pull Request #13058 · huggingface/diffusers

dg845 · 2026-01-30T07:28:27Z

What does this PR do?

This PR adds LTX2ConditionPipeline, a pipeline which supports visual conditioning at arbitrary frames for the LTX-2 model (paper, code, weights), following the original code. This is an analogue of LTXConditionPipeline for LTX-2, as both the original LTX models and LTX-2 support a similar conditioning scheme.

Fixes #12926

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
@yiyixuxu

HuggingFaceDocBuilderDev · 2026-01-30T07:36:48Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

dg845 · 2026-02-04T02:35:57Z

Sample first-last-frame-to-video (FLF2V) script:

import torch

from diffusers import LTX2ConditionPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.utils import load_image


model_id = "Lightricks/LTX-2"
device = "cuda:0"
dtype = torch.bfloat16
seed = 42

width = 768
height = 512
frame_rate = 24.0

pipe = LTX2ConditionPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."

first_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)
last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0)
conditions = [first_cond, last_cond]

video, audio = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=generator,
    output_type="np",
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_cond_flf2v.mp4",
)

dg845 · 2026-02-04T02:45:07Z

Unfortunately the pipeline isn't quite working as of ed52c0d:

Official FLF2V sample:

ltx2_flf2v_official.mp4

Current diffusers FLF2V sample (using condition pipeline):

ltx2_cond_flf2v.mp4

Not sure why the video colors are messed up at the first and last frames (where the conditions are), will debug.

…elocity space

dg845 · 2026-02-04T04:54:18Z

I think the color issue is now fixed:

ltx2_cond_flf2v_fixed.mp4

dg845 · 2026-02-04T05:46:17Z

The condition pipeline also works with the distilled checkpoint:

FLF2V Distilled Script

import torch

from diffusers import LTX2ConditionPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2 import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.utils import load_image


model_id = "rootonchair/LTX-2-19b-distilled"
device = "cuda:0"
dtype = torch.bfloat16
seed = 42

width = 768
height = 512
frame_rate = 24.0

pipe = LTX2ConditionPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()

generator = torch.Generator(device).manual_seed(seed)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."

first_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

first_cond = LTX2VideoCondition(frames=first_image, index=0, strength=1.0)
last_cond = LTX2VideoCondition(frames=last_image, index=-1, strength=1.0)
conditions = [first_cond, last_cond]

video_latent, audio_latent = pipe(
    conditions=conditions,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=8,
    sigmas=DISTILLED_SIGMA_VALUES,
    guidance_scale=1.0,
    generator=generator,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    model_id,
    subfolder="latent_upsampler",
    torch_dtype=dtype,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
upsample_pipe.enable_model_cpu_offload(device=device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

video, audio = pipe(
    conditions=conditions,
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width * 2,
    height=height * 2,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=3,
    # noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0],
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    generator=generator,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_cond_flf2v_distilled.mp4",
)

Official FLF2V distilled sample:

ltx2_flf2v_distilled_official.mp4

diffusers FLF2V distilled sample:

ltx2_cond_flf2v_distilled.mp4

sayakpaul · 2026-02-04T10:21:10Z

Thanks for the work, @dg845! Looking at the outputs, is there a discrepancy in the resolutions? The main bird subject seems a bit compressed to me in our implementation.

sayakpaul

Thanks for this work! It looks very clean and the implementation also seems faithful to the original one. I left some comments, LMK if they're clear.

Additionally, I would like to see some extensions being used in the pipeline. For example, multiple conditions (multiple images) with different indices. Would it be possible?