Add support for Ovis-Image #12740

Conversation
Ovis-Image has been released:

@bot /style
yiyixuxu left a comment
Thanks so much for the PR! I left a few comments, and I think we can merge this very soon.

Congrats on the release!! Sorry we overlooked the PR (it was the Thanksgiving holiday in the US).

We will reach out to set up a collaboration channel for your future releases.
```python
def enable_vae_slicing(self):
    r"""
    Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to
    compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
    """
    depr_message = f"Calling `enable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_slicing()`."
    deprecate(
        "enable_vae_slicing",
        "0.40.0",
        depr_message,
    )
    self.vae.enable_slicing()

def disable_vae_slicing(self):
    r"""
    Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to
    computing decoding in one step.
    """
    depr_message = f"Calling `disable_vae_slicing()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.disable_slicing()`."
    deprecate(
        "disable_vae_slicing",
        "0.40.0",
        depr_message,
    )
    self.vae.disable_slicing()

def enable_vae_tiling(self):
    r"""
    Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to
    compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow
    processing larger images.
    """
    depr_message = f"Calling `enable_vae_tiling()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.enable_tiling()`."
    deprecate(
        "enable_vae_tiling",
        "0.40.0",
        depr_message,
    )
    self.vae.enable_tiling()

def disable_vae_tiling(self):
    r"""
    Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to
    computing decoding in one step.
    """
    depr_message = f"Calling `disable_vae_tiling()` on a `{self.__class__.__name__}` is deprecated and this method will be removed in a future version. Please use `pipe.vae.disable_tiling()`."
    deprecate(
        "disable_vae_tiling",
        "0.40.0",
        depr_message,
    )
    self.vae.disable_tiling()
```
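For reference, the deprecation messages above point users to the VAE's own toggles. A minimal usage sketch, assuming `pipe` is an already-loaded diffusers pipeline with a standard VAE:

```python
# Call the memory-saving toggles on the VAE directly instead of via the pipeline wrappers.
pipe.vae.enable_slicing()  # decode batched latents one slice at a time
pipe.vae.enable_tiling()   # decode large latents tile by tile

# ... run inference ...

# Restore single-pass decoding afterwards if desired.
pipe.vae.disable_slicing()
pipe.vae.disable_tiling()
```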
Suggested change: delete the four deprecated VAE slicing/tiling helpers above and have callers use the `pipe.vae.*` methods directly.
```python
self,
prompt: Union[str, List[str]] = None,
negative_prompt: Union[str, List[str]] = None,
true_cfg_scale: float = 5.0,
```
Suggested change:

```diff
- true_cfg_scale: float = 5.0,
+ guidance_scale: float = 5.0,
```
Can we use `guidance_scale`, given that this is not a distilled checkpoint? Since the model is already out with this PR, we can add a deprecation message if you prefer.
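A minimal sketch of the rename-plus-deprecation path, reusing the `deprecate` helper already called elsewhere in this file; the shim itself is an assumption rather than code from the PR, and the `0.40.0` removal version is copied from the other deprecations above:

```python
def __call__(
    self,
    prompt=None,
    negative_prompt=None,
    guidance_scale: float = 5.0,
    true_cfg_scale: float = None,  # hypothetical: accepted only for backward compatibility
    **kwargs,
):
    if true_cfg_scale is not None:
        # Warn once and forward the old argument to the new one.
        deprecate(
            "true_cfg_scale",
            "0.40.0",
            "Passing `true_cfg_scale` is deprecated; please use `guidance_scale` instead.",
        )
        guidance_scale = true_cfg_scale
    ...
```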
```python
if image_embeds is not None:
    self._joint_attention_kwargs["ip_adapter_image_embeds"] = image_embeds
```
Suggested change:

```diff
- if image_embeds is not None:
-     self._joint_attention_kwargs["ip_adapter_image_embeds"] = image_embeds
```
Let's remove the IP-adapter-related logic if we don't support it yet.
```python
device = self._execution_device

has_neg_prompt = negative_prompt is not None or (
```
The Flux/Qwen pipelines were written this way to support both distilled guidance and regular CFG. The user experience was pretty bad, and we very much regret the design choice.

If Ovis only supports regular CFG, let's not follow their path :)
For standard CFG, one pattern you can use is:

```python
prompt_embeds, text_ids = self.encode_prompt(...)
if do_classifier_free_guidance:
    negative_prompt_embeds, negative_text_ids = self.encode_prompt(...)
```
Suggested change:

```diff
- image_embeds = None
- negative_image_embeds = None
```
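To illustrate the reviewer's point, here is a minimal sketch of the plain-CFG denoising step (two forward passes combined with the guidance scale). The variable names (`latents`, `timestep`) and the `.sample` output attribute are assumptions about this pipeline's internals, not the actual implementation:

```python
do_classifier_free_guidance = guidance_scale > 1.0 and negative_prompt_embeds is not None

noise_pred = self.transformer(
    hidden_states=latents, encoder_hidden_states=prompt_embeds, timestep=timestep
).sample
if do_classifier_free_guidance:
    neg_noise_pred = self.transformer(
        hidden_states=latents, encoder_hidden_states=negative_prompt_embeds, timestep=timestep
    ).sample
    # Standard classifier-free guidance: push the prediction away from the unconditional branch.
    noise_pred = neg_noise_pred + guidance_scale * (noise_pred - neg_noise_pred)
```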
What does this PR do?

This PR introduces Ovis-Image into the diffusers library. Ovis-Image integrates a diffusion-based visual decoder with the Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text-rendering performance on par with significantly larger open models such as Qwen-Image, and approaches closed-source systems like Seedream and GPT-4o.
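For anyone trying out the branch, a hypothetical usage sketch: the checkpoint id is a placeholder, and it is assumed that `DiffusionPipeline` auto-class loading resolves to the new pipeline added by this PR.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder checkpoint id; see the merged PR for the actual repository name.
pipe = DiffusionPipeline.from_pretrained("<ovis-image-checkpoint>", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    prompt="a storefront sign that reads 'OVIS IMAGE' in neon letters",
    guidance_scale=5.0,  # the CFG scale discussed in the review above
    num_inference_steps=30,
).images[0]
image.save("ovis_image.png")
```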