
Feature/qwen3.5 export support #1638

Closed
jpatrickiles-dev wants to merge 3 commits into huggingface:main from
jpatrickiles-dev:feature/qwen3.5-export-support

Conversation

@jpatrickiles-dev

Title:

[OpenVINO] Add Qwen3.5 hybrid model export support with transformers 5.x compatibility

Architecture:

Qwen3.5 is a hybrid VLM architecture combining GatedDeltaNet (linear attention/SSM) layers (~75%) with full attention layers (~25%). All model sizes (0.8B–35B-A3B MoE) use this hybrid design with Qwen3_5DynamicCache storing both KV cache and recurrent/conv states.

Fixes:

Fixes transformers 5.x compatibility (handles APIs removed, moved, or renamed in 5.x: transformers.onnx, is_offline_mode, HfFolder, AutoModelForVision2Seq)
Adds full Qwen3.5 GatedDeltaNet hybrid model export support (0.8B, 4B, 9B, 35B-A3B MoE)
Implements Qwen3_5TextModelPatcher and Qwen3_5MoeTextModelPatcher with static GatedDeltaNet forward patching and hybrid cache handling
Fixes GPU VariadicSplit layout error via 4D reshape in the patcher

Tickets:

N/A (no existing issue — this is new feature + compatibility fix)

Tested on:

Intel Arc Xe-LPG (Meteor Lake), Ubuntu 24.04, transformers 5.3.0, OpenVINO 2026.1. Validated Qwen3.5 0.8B/4B/9B export and inference on both CPU and GPU.

AI Assistance:

Yes. Claude was used for architecture analysis and the patcher implementation. Human validation: exported and ran inference on all model sizes, and verified coherent output on CPU and GPU.

jpatrickiles-dev and others added 3 commits March 15, 2026 01:14
transformers 5.x removed/moved several APIs that optimum-intel imports:

- transformers.onnx module removed: inline ParameterFormat and
  compute_serialized_parameters_size (trivial enum + multiply)
- transformers.utils.is_offline_mode moved to
  huggingface_hub.constants.is_offline_mode
- AutoModelForVision2Seq renamed to AutoModelForImageTextToText
- huggingface_hub.HfFolder removed: add shim delegating to get_token()

All fixes use try/except fallbacks to maintain backward compatibility
with transformers 4.x.
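
The try/except fallback pattern described above can be sketched as follows. This is an illustrative shape only, not the actual optimum-intel code (the real fixes live at the individual import sites); the inlined ParameterFormat/compute_serialized_parameters_size bodies mirror the trivial enum-plus-multiply the commit message describes, and the nested hub fallback is added here so the sketch stays self-contained:

```python
# Sketch of the try/except fallback pattern (assumed shape; the actual
# code lives at optimum-intel's import sites, not in one block like this).
try:
    # transformers 4.x: the onnx helpers still exist.
    from transformers.onnx import ParameterFormat, compute_serialized_parameters_size
except ImportError:
    # transformers 5.x removed transformers.onnx: inline the trivial pieces.
    from enum import Enum

    class ParameterFormat(Enum):
        Float = "float32"

        @property
        def size(self) -> int:
            # Bytes per parameter; float32 -> 4.
            return 4

    def compute_serialized_parameters_size(num_parameters: int,
                                           dtype: ParameterFormat) -> int:
        return num_parameters * dtype.size

try:
    from huggingface_hub import HfFolder  # removed in newer huggingface_hub
except ImportError:
    try:
        from huggingface_hub import get_token
    except ImportError:  # hub not installed at all (minimal environments)
        def get_token():
            return None

    class HfFolder:
        """Shim delegating the legacy static API to get_token()."""
        @staticmethod
        def get_token():
            return get_token()
```

Either branch leaves callers with the same names, so downstream code needs no version checks.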

Also adds missing Qwen3_5MoeModelPatcher import and registers qwen3_5,
qwen3_5_text, and qwen3_5_moe model types for OpenVINO export.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implement full OpenVINO export patcher for Qwen3.5 hybrid
(GatedDeltaNet + Attention) models, enabling export of all Qwen3.5
variants: 0.8B, 4B, 9B dense and 35B-A3B MoE.

Model patcher (model_patcher.py):
- qwen3_5_gated_delta_net_forward: patched forward for GatedDeltaNet
  layers that eliminates dynamic control flow for OV tracing. Uses
  ov_causal_conv1d for unified prefill/decode conv handling, and
  patched_recurrent_gated_delta_rule + RecurrentAttentionCell for the
  recurrent attention computation.
- Qwen3_5TextModelPatcher: handles cache decomposition (flat tensor
  list <-> per-layer conv_states/recurrent_states/key_cache/value_cache),
  GatedDeltaNet forward patching, and OV conversion extensions.
- Qwen3_5MoeTextModelPatcher: extends above with MoE expert forward
  patching for the 35B-A3B model.
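
The flat-list <-> per-layer cache decomposition can be sketched as below. The 4-entries-per-layer, layer-major ordering is an assumption for illustration; the actual ordering is defined by the patcher and Qwen3_5DynamicCache:

```python
def decompose_cache(flat, num_layers):
    """Split a flat tensor list into per-layer
    (conv_state, recurrent_state, key_cache, value_cache) tuples.
    Assumes 4 entries per layer in layer-major order (illustrative)."""
    assert len(flat) == 4 * num_layers, "unexpected flat cache length"
    return [tuple(flat[4 * i: 4 * i + 4]) for i in range(num_layers)]

def flatten_cache(per_layer):
    """Inverse of decompose_cache: rebuild the flat tensor list."""
    return [state for layer in per_layer for state in layer]
```

The flat form is what the stateful OpenVINO model sees as its state tensors; the per-layer form is what the patched forward consumes.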

Config registration (model_configs.py):
- Register qwen3_5, qwen3_5_text, qwen3_5_moe, qwen3_5_moe_text as
  hybrid SSM model types inheriting from Qwen3NextOpenVINOConfig.

Stateful transformation (stateful.py):
- Normalize cache state dtypes to f32 before make_stateful to prevent
  bf16/f32 type mismatch in CPU plugin's Assign ops.
- Convert bf16 logits output to f32 for openvino_genai compatibility.

Other fixes:
- Add Qwen3.5 model types to SSM_MODELS list for correct
  OVModelWithMambaForCausalLM class selection.
- Handle AttributeError in _get_non_default_generation_parameters for
  Qwen3_5TextConfig which lacks this method.

Tested: Qwen3.5-0.8B exports to INT4 and generates coherent text
via openvino_genai.LLMPipeline on CPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GPU plugin cannot handle 3D bf16 VariadicSplit/StridedSlice ops
(no bfyx layout format available for rank-3 bf16 tensors).

Fix: reshape the post-conv mixed_qkv tensor to 4D by inserting a
dummy head dimension before slicing QKV components. The unsqueeze
makes all slice operations 4D, which the GPU plugin handles correctly.
The per-head reshape happens after slicing, preserving the correct
contiguous [key_dim, key_dim, value_dim] layout from in_proj_qkv.
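
At the shape level the workaround looks like the sketch below (dimension names are assumed from this commit message, not taken from the actual patcher code):

```python
def qkv_slice_shapes(batch, seq, key_dim, value_dim):
    """Shapes in the 4D-reshape workaround: insert a dummy head dim
    before slicing so VariadicSplit/StridedSlice run on rank-4 tensors,
    which the GPU plugin can lay out. Illustrative only."""
    mixed_3d = (batch, seq, 2 * key_dim + value_dim)  # post-conv mixed_qkv
    mixed_4d = (batch, seq, 1, mixed_3d[-1])          # dummy head dim added
    # Contiguous [key_dim, key_dim, value_dim] split on the last axis:
    q = (batch, seq, 1, key_dim)
    k = (batch, seq, 1, key_dim)
    v = (batch, seq, 1, value_dim)
    return mixed_4d, (q, k, v)
```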

Also adds _convert_bf16_to_f16 utility in stateful.py for future use
with GPU devices that support f16 but not bf16 (Intel Arc Xe-LPG).
Currently disabled as bf16→f16 conversion causes precision loss for
Qwen3.5 models. The exported model runs correctly on CPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rkazants
Collaborator

Hi @jpatrickiles-dev,

We are already working on enabling this model: #1634.
Please feel free to take a good-first-issue if you want to contribute to optimum-intel.

@rkazants rkazants closed this Mar 16, 2026
@jpatrickiles-dev
Author

No problem, my fork is working fine for me, so I'm happy.

@sund00bie

> No problem, my fork is working fine for me, so I'm happy.

Did you have to edit Qwen3.5-27B/config.json at all?

@jpatrickiles-dev
Author

jpatrickiles-dev commented Mar 16, 2026 via email

@SearchSavior

Hey @jpatrickiles-dev great work on this.

Feel free to join us on the OpenArc Discord. I am also doing a lot of work with OpenVINO in my project, which you might find useful. Since you are deep-diving into NPU: I have an end-to-end, from-scratch reimplementation of Qwen TTS and Qwen ASR built into OpenArc, releasing soon, but I do not have an NPU device and cannot optimize for it. My from-scratch code has been easier to study outside of the library abstractions, and if you are motivated it could be a fantastic contribution/learning adventure.

https://discord.gg/asu5jTP6b
