
Feature/qwen3.5 export support #1638

Closed
jpatrickiles-dev wants to merge 3 commits into huggingface:main from
jpatrickiles-dev:feature/qwen3.5-export-support

Conversation

@jpatrickiles-dev

Title:

[OpenVINO] Add Qwen3.5 hybrid model export support with transformers 5.x compatibility

Architecture:

Qwen3.5 is a hybrid VLM architecture combining GatedDeltaNet (linear attention/SSM) layers (~75%) with full attention layers (~25%). All model sizes (0.8B–35B-A3B MoE) use this hybrid design with Qwen3_5DynamicCache storing both KV cache and recurrent/conv states.

Fixes:

Fixes transformers 5.x compatibility (handles APIs removed, moved, or renamed in 5.x: transformers.onnx, is_offline_mode, HfFolder, AutoModelForVision2Seq)
Adds full Qwen3.5 GatedDeltaNet hybrid model export support (0.8B, 4B, 9B, 35B-A3B MoE)
Implements Qwen3_5TextModelPatcher and Qwen3_5MoeTextModelPatcher with static GatedDeltaNet forward patching and hybrid cache handling
Fixes GPU VariadicSplit layout error via 4D reshape in the patcher

Tickets:

N/A (no existing issue — this is new feature + compatibility fix)

Tested on:

Intel Arc Xe-LPG (Meteor Lake), Ubuntu 24.04, transformers 5.3.0, OpenVINO 2026.1. Validated Qwen3.5 0.8B/4B/9B export and inference on both CPU and GPU.

AI Assistance:

Yes. Claude was used for architecture analysis and the patcher implementation. Human validation: exported and ran inference on all model sizes, and verified coherent output on CPU and GPU.

jpatrickiles-dev and others added 3 commits March 15, 2026 01:14
transformers 5.x removed/moved several APIs that optimum-intel imports:

- transformers.onnx module removed: inline ParameterFormat and
  compute_serialized_parameters_size (trivial enum + multiply)
- transformers.utils.is_offline_mode moved to
  huggingface_hub.constants.is_offline_mode
- AutoModelForVision2Seq renamed to AutoModelForImageTextToText
- huggingface_hub.HfFolder removed: add shim delegating to get_token()

All fixes use try/except fallbacks to maintain backward compatibility
with transformers 4.x.
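
The try/except fallback pattern described above can be sketched as follows. This is an illustrative shape only, not the actual optimum-intel code (the real fixes live at the individual import sites); the inlined ParameterFormat/compute_serialized_parameters_size bodies mirror the trivial enum-plus-multiply the commit message describes, and the nested hub fallback is added here so the sketch stays self-contained:

```python
# Sketch of the try/except fallback pattern (assumed shape; the actual
# code lives at optimum-intel's import sites, not in one block like this).
try:
    # transformers 4.x: the onnx helpers still exist.
    from transformers.onnx import ParameterFormat, compute_serialized_parameters_size
except ImportError:
    # transformers 5.x removed transformers.onnx: inline the trivial pieces.
    from enum import Enum

    class ParameterFormat(Enum):
        Float = "float32"

        @property
        def size(self) -> int:
            # Bytes per parameter; float32 -> 4.
            return 4

    def compute_serialized_parameters_size(num_parameters: int,
                                           dtype: ParameterFormat) -> int:
        return num_parameters * dtype.size

try:
    from huggingface_hub import HfFolder  # removed in newer huggingface_hub
except ImportError:
    try:
        from huggingface_hub import get_token
    except ImportError:  # hub not installed at all (minimal environments)
        def get_token():
            return None

    class HfFolder:
        """Shim delegating the legacy static API to get_token()."""
        @staticmethod
        def get_token():
            return get_token()
```

Either branch leaves callers with the same names, so downstream code needs no version checks.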

Also adds missing Qwen3_5MoeModelPatcher import and registers qwen3_5,
qwen3_5_text, and qwen3_5_moe model types for OpenVINO export.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement full OpenVINO export patcher for Qwen3.5 hybrid
(GatedDeltaNet + Attention) models, enabling export of all Qwen3.5
variants: 0.8B, 4B, 9B dense and 35B-A3B MoE.

Model patcher (model_patcher.py):
- qwen3_5_gated_delta_net_forward: patched forward for GatedDeltaNet
  layers that eliminates dynamic control flow for OV tracing. Uses
  ov_causal_conv1d for unified prefill/decode conv handling, and
  patched_recurrent_gated_delta_rule + RecurrentAttentionCell for the
  recurrent attention computation.
- Qwen3_5TextModelPatcher: handles cache decomposition (flat tensor
  list <-> per-layer conv_states/recurrent_states/key_cache/value_cache),
  GatedDeltaNet forward patching, and OV conversion extensions.
- Qwen3_5MoeTextModelPatcher: extends above with MoE expert forward
  patching for the 35B-A3B model.
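
The flat-list <-> per-layer cache decomposition can be sketched as below. The 4-entries-per-layer, layer-major ordering is an assumption for illustration; the actual ordering is defined by the patcher and Qwen3_5DynamicCache:

```python
def decompose_cache(flat, num_layers):
    """Split a flat tensor list into per-layer
    (conv_state, recurrent_state, key_cache, value_cache) tuples.
    Assumes 4 entries per layer in layer-major order (illustrative)."""
    assert len(flat) == 4 * num_layers, "unexpected flat cache length"
    return [tuple(flat[4 * i: 4 * i + 4]) for i in range(num_layers)]

def flatten_cache(per_layer):
    """Inverse of decompose_cache: rebuild the flat tensor list."""
    return [state for layer in per_layer for state in layer]
```

The flat form is what the stateful OpenVINO model sees as its state tensors; the per-layer form is what the patched forward consumes.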

Config registration (model_configs.py):
- Register qwen3_5, qwen3_5_text, qwen3_5_moe, qwen3_5_moe_text as
  hybrid SSM model types inheriting from Qwen3NextOpenVINOConfig.

Stateful transformation (stateful.py):
- Normalize cache state dtypes to f32 before make_stateful to prevent
  bf16/f32 type mismatch in CPU plugin's Assign ops.
- Convert bf16 logits output to f32 for openvino_genai compatibility.

Other fixes:
- Add Qwen3.5 model types to SSM_MODELS list for correct
  OVModelWithMambaForCausalLM class selection.
- Handle AttributeError in _get_non_default_generation_parameters for
  Qwen3_5TextConfig which lacks this method.

Tested: Qwen3.5-0.8B exports to INT4 and generates coherent text
via openvino_genai.LLMPipeline on CPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GPU plugin cannot handle 3D bf16 VariadicSplit/StridedSlice ops
(no bfyx layout format available for rank-3 bf16 tensors).

Fix: reshape the post-conv mixed_qkv tensor to 4D by inserting a
dummy head dimension before slicing QKV components. The unsqueeze
makes all slice operations 4D, which the GPU plugin handles correctly.
The per-head reshape happens after slicing, preserving the correct
contiguous [key_dim, key_dim, value_dim] layout from in_proj_qkv.
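
At the shape level the workaround looks like the sketch below (dimension names are assumed from this commit message, not taken from the actual patcher code):

```python
def qkv_slice_shapes(batch, seq, key_dim, value_dim):
    """Shapes in the 4D-reshape workaround: insert a dummy head dim
    before slicing so VariadicSplit/StridedSlice run on rank-4 tensors,
    which the GPU plugin can lay out. Illustrative only."""
    mixed_3d = (batch, seq, 2 * key_dim + value_dim)  # post-conv mixed_qkv
    mixed_4d = (batch, seq, 1, mixed_3d[-1])          # dummy head dim added
    # Contiguous [key_dim, key_dim, value_dim] split on the last axis:
    q = (batch, seq, 1, key_dim)
    k = (batch, seq, 1, key_dim)
    v = (batch, seq, 1, value_dim)
    return mixed_4d, (q, k, v)
```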

Also adds _convert_bf16_to_f16 utility in stateful.py for future use
with GPU devices that support f16 but not bf16 (Intel Arc Xe-LPG).
Currently disabled as bf16→f16 conversion causes precision loss for
Qwen3.5 models. The exported model runs correctly on CPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rkazants
Collaborator

Hi @jpatrickiles-dev,

We are already working on enabling this model: #1634.
Please feel free to take a good-first-issue if you want to contribute to optimum-intel.

@rkazants rkazants closed this Mar 16, 2026
@jpatrickiles-dev
Author

No problem, my fork is working fine for me, so I'm happy.

@sund00bie

> No problem, my fork is working fine for me, so I'm happy.

Did you have to edit Qwen3.5-27B/config.json at all?

@jpatrickiles-dev
Author

jpatrickiles-dev commented Mar 16, 2026 via email

@SearchSavior

Hey @jpatrickiles-dev great work on this.

Feel free to join us on the OpenArc Discord. I am also doing a lot of work with OpenVINO in my project, which you might find useful. Since you are deep-diving into NPU: I have an end-to-end, from-scratch reimplementation of Qwen TTS and Qwen ASR built into OpenArc, releasing soon, but I do not have an NPU device and cannot optimize for it. My from-scratch code has been easier to study outside of the library abstractions, and if you are motivated it could be a fantastic contribution/learning adventure.

https://discord.gg/asu5jTP6b
