feat: implement RAE autoencoder (#13046)
Conversation
|
@bytetriper if you could take a look? |
|
nice work @Ando233, checking |
|
Off the bat,
let's sort out these things and then re-look |
|
Agree with @kashif. Also, if possible, we can bake all the params into the config so we can enable `.from_pretrained()`, which is more elegant and aligns with diffusers usage. I can help convert our released ckpt to HF format afterwards |
|
@Ando233 we're happy to provide assistance if needed. |
|
@Ando233 the one remaining thing is the use of the |
|
@bytetriper could you kindly try to run the conversion scripts and upload the diffusers style weights to your huggingface hub for the checkpoints you have? |
|
Thank you for the efforts @kashif, let me try to implement the remaining |
|
@Ando233 I added that already, so next we can wait for @bytetriper for a review and see if the weight conversion works on his end |
|
Thanks for the implementation! I just checked and weight conversion works on my end. Converted models are under https://huggingface.co/collections/nyu-visionx/rae. @kashif @Ando233 Can you check whether the converted models work on your end? |
|
@bytetriper thanks! What would be the quickest way to validate if the implementation is correct? We can do a quick value assertion test between the original model and the converted model on the same inputs. Would you be able to do it? |
sayakpaul left a comment:
Left a bunch of comments. The major thing is we need to be a bit more explicit in terms of how we're defining the configs, loading encoder state dicts, etc.
I think we could aim for the following entrypoint for instantiating the AutoencoderRAE class:
AutoencoderRAE(..., encoder_type="dinov2")
Inside the implementation of AutoencoderRAE `__init__()`, specifically, we can have a simple if/else block to dispatch the encoder based on `encoder_type`:
if encoder_type == "dinov2":
    encoder = Dinov2Encoder()
elif encoder_type == "siglip2":
    encoder = Siglip2Encoder()
...
And then, when a user does AutoencoderRAE.from_pretrained(...), the state dict should have both the encoder and decoder state dicts, following how it's done in the other Autoencoder implementations of diffusers.
I will also let @dg845 take a look and provide feedback.
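The if/else dispatch proposed above can be sketched with a small registry. This is a hypothetical illustration: the encoder classes and hidden sizes below are stand-ins, not the actual diffusers implementations.

```python
# Stand-in encoder classes for illustration only.
class Dinov2Encoder:
    hidden_size = 768

class Siglip2Encoder:
    hidden_size = 1152

# Registry mapping encoder_type strings to encoder classes.
_ENCODER_TYPES = {
    "dinov2": Dinov2Encoder,
    "siglip2": Siglip2Encoder,
}

class AutoencoderRAE:
    def __init__(self, encoder_type: str = "dinov2"):
        if encoder_type not in _ENCODER_TYPES:
            raise ValueError(f"Unknown encoder_type: {encoder_type!r}")
        # Dispatch the frozen encoder based on the config value.
        self.encoder = _ENCODER_TYPES[encoder_type]()

model = AutoencoderRAE(encoder_type="siglip2")
print(type(model.encoder).__name__)  # Siglip2Encoder
```

A registry dict keeps the dispatch in one place and makes adding a new encoder a one-line change, while still supporting the plain if/else reading of the review comment.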
| `AutoencoderRAE` is a representation autoencoder that combines a frozen vision encoder (DINOv2, SigLIP2, or MAE) with a ViT-MAE-style decoder.
|
| Paper: [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690).
|
| The model follows the standard diffusers autoencoder API:
| - `encode(...)` returns an `EncoderOutput` with a `latent` tensor.
| - `decode(...)` returns a `DecoderOutput` with a `sample` tensor.
Cc: @stevhliu. Could you leave suggestions on the docs?
| model = AutoencoderRAE(
|     encoder_cls="dinov2",
|     encoder_name_or_path="facebook/dinov2-with-registers-base",
| - `encode(...)` returns an `EncoderOutput` with a `latent` tensor.
| - `decode(...)` returns a `DecoderOutput` with a `sample` tensor.
|
| ## Usage
|
| For latent normalization, use `latents_mean` and `latents_std` (matching other diffusers autoencoders).
|
| See `examples/research_projects/autoencoder_rae/train_autoencoder_rae.py` for a stage-1 style training script
What does stage-2 have? Generation?
| `encoder_cls` supports `"dinov2"`, `"siglip2"`, and `"mae"`.
|
| For latent normalization, use `latents_mean` and `latents_std` (matching other diffusers autoencoders).
We should provide an example for this.
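For instance, a hedged sketch of what applying `latents_mean` / `latents_std` looks like; the values below are made up, and the normalize/denormalize convention mirrors other diffusers autoencoders rather than the final RAE API.

```python
import torch

# Illustrative per-channel statistics (not real checkpoint values).
latents_mean = torch.tensor([0.5, -0.1])
latents_std = torch.tensor([2.0, 0.5])

latents = torch.tensor([[1.5, 0.4]])

# Normalize before handing latents to a diffusion model ...
normalized = (latents - latents_mean) / latents_std
# ... and denormalize before decoding back to pixels.
restored = normalized * latents_std + latents_mean

print(normalized)  # tensor([[0.5000, 1.0000]])
```

The round trip is exact by construction, which is a handy property to assert in the docs example.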
| self.model.layernorm.weight = None
| self.model.layernorm.bias = None
These params are not used in the forward pass anyway. So, maybe it's not needed?
This seems to have not been addressed.
| from transformers import AutoImageProcessor
|
| proc = AutoImageProcessor.from_pretrained(encoder_name_or_path)
| encoder_mean = torch.tensor(proc.image_mean, dtype=torch.float32).view(1, 3, 1, 1)
| encoder_std = torch.tensor(proc.image_std, dtype=torch.float32).view(1, 3, 1, 1)
This should be explicitly in the conversion script. This is an antipattern for the library.
We could do something like:
https://github.com/huggingface/diffusers/blob/a80b19218b4bd4faf2d6d8c428dcf1ae6f11e43d/src/diffusers/models/autoencoders/autoencoder_kl_ltx2.py#L1112C9-L1116C1
Then in the conversion script, make these a part of the converted state dict before loading that into the diffusers implementation. LMK if it's unclear.
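A rough sketch of that buffer approach (the module and buffer names below are assumptions, not the linked LTX2 code): register the per-channel stats as buffers so they ship inside the state dict, and inject them in the conversion script instead of calling `AutoImageProcessor` at model runtime.

```python
import torch
from torch import nn

class TinyRAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Buffers are saved in the state dict but are not trainable parameters.
        self.register_buffer("encoder_pixel_mean", torch.zeros(1, 3, 1, 1))
        self.register_buffer("encoder_pixel_std", torch.ones(1, 3, 1, 1))

# In the conversion script: add the processor stats to the converted
# state dict before loading it into the diffusers model.
state_dict = TinyRAE().state_dict()
state_dict["encoder_pixel_mean"] = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
state_dict["encoder_pixel_std"] = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

model = TinyRAE()
model.load_state_dict(state_dict)
print(model.encoder_pixel_mean.flatten().tolist())
```

This keeps the model free of a runtime `transformers` processor dependency while preserving the exact normalization constants in the checkpoint.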
| # Optional latent normalization (RAE-main uses mean/var)
| latents_mean_tensor = _as_optional_tensor(latents_mean)
| self.do_latent_normalization = latents_mean is not None or latents_std is not None
| if latents_mean_tensor is not None:
|     self.register_buffer("_latents_mean", latents_mean_tensor, persistent=True)
| else:
|     self._latents_mean = None
| if latents_std_tensor is not None:
|     self.register_buffer("_latents_std", latents_std_tensor, persistent=True)
| else:
|     self._latents_std = None
Seems like this can be removed?
| if encoder_hidden_size is None:
|     raise ValueError(f"Encoder '{encoder_cls}' must define `.hidden_size` attribute.")
|
| decoder_config = SimpleNamespace(
| - trainable_cls_token
| """
|
| def __init__(self, config, num_patches: int):
We should split out the config and expand the __init__ args here. That's how it's done in diffusers.
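A minimal sketch of that refactor; the argument names and defaults below are guesses extrapolated from the quoted hunk, not the final API.

```python
# Before: def __init__(self, config, num_patches: int): ...
# After: explicit arguments, as other diffusers modules expose them.
class RAEDecoder:
    def __init__(
        self,
        num_patches: int,
        hidden_size: int = 768,
        num_layers: int = 12,
        num_attention_heads: int = 12,
        trainable_cls_token: bool = True,
    ):
        # Explicit args make the module introspectable and let a
        # register_to_config-style decorator serialize them.
        self.num_patches = num_patches
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_attention_heads = num_attention_heads
        self.trainable_cls_token = trainable_cls_token

decoder = RAEDecoder(num_patches=196, hidden_size=1024)
print(decoder.hidden_size)  # 1024
```

Compared to a `SimpleNamespace` config, explicit keyword arguments surface the full decoder surface in the signature and fail loudly on typos.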
I tested and the converted model produces identical outputs on my end up to some small numerical differences. Just want to make sure it also has the same behavior on others' ends :) I generally agree that we should have the encoder in the ckpt as well. Can help with conversion afterwards |
|
Cool then. I will give you a heads up when the PR is ready for another look. Thank you! |
|
@bytetriper I sent you some fixes to the weights, if you can kindly merge |
|
@kashif Merged! |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
sayakpaul left a comment:
Left some comments. Let me know if this makes sense. @bytetriper it would be great if you could also test the diffusers counterparts of RAE and let us know your thoughts.
| specific language governing permissions and limitations under the License.
| -->
|
| # AutoencoderRAE
|     "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
| ).to("cuda").eval()
|
| image = Image.open("cat.png").convert("RGB").resize((224, 224))
Can we use an example snippet that just works? In that case, we should load the image directly from a public URL and then use it further. We can leverage load_image for this (from diffusers import load_image).
| # Latent normalization is handled automatically inside encode/decode
| # when the checkpoint config includes latents_mean/latents_std.
| self.model.layernorm.bias = None
|
| def forward(self, images: torch.Tensor, requires_grad: bool = False) -> torch.Tensor:
Should the users need to pass requires_grad or can we just do it based on self.training?
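The `self.training`-based alternative could be sketched like this (the layer shapes are arbitrary; this is an illustration of the pattern, not the PR's encoder):

```python
import torch
from torch import nn

class FrozenEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Training path: keep the autograd graph.
        if self.training:
            return self.proj(images)
        # Eval path: skip building the graph for the frozen encoder.
        with torch.no_grad():
            return self.proj(images)

enc = FrozenEncoder().eval()
out = enc(torch.randn(2, 4))
print(out.requires_grad)  # False
```

Callers then control gradient behavior through the standard `model.train()` / `model.eval()` toggle instead of a bespoke `requires_grad` argument.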
| - trainable_cls_token
| """
|
| def __init__(self, config, num_patches: int):
| if encoder_hidden_size is None:
|     raise ValueError(f"Encoder '{encoder_cls}' must define `.hidden_size` attribute.")
|
| decoder_config = SimpleNamespace(
| _ENCODER_TYPES["tiny_test"] = TinyTestEncoder
|
| class AutoencoderRAETests(unittest.TestCase):
Can we not use the existing tester mixin for this?
| self.assertTrue(torch.allclose(z_eval_1, z_eval_2, atol=1e-6, rtol=1e-5))
|
| @slow
We can skip this test suite for now. Let's first have enough growth for the model.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
What does this PR do?
This PR adds a new representation autoencoder implementation, AutoencoderRAE, to diffusers.
Implements diffusers.models.autoencoders.autoencoder_rae.AutoencoderRAE with a frozen pretrained vision encoder (DINOv2 / SigLIP2 / ViT-MAE) and a ViT-MAE style decoder.
The decoder implementation is aligned with the RAE-main GeneralDecoder parameter structure, enabling loading of existing trained decoder checkpoints (e.g. model.pt) without key mismatches when encoder/decoder settings are consistent.
Adds unit/integration tests under diffusers/tests/models/autoencoders/test_models_autoencoder_rae.py.
Registers exports so users can import directly via from diffusers import AutoencoderRAE.
Fixes #13000
Before submitting
documentation guidelines, and here are tips on formatting docstrings.
Usage
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.