
[Contribution] GLM4MoeForCausalLM Support #58

Open
lifelongeeek wants to merge 8 commits into aws-neuron:main from lifelongeeek:feat/glm4-moe-support

Conversation


@lifelongeeek lifelongeeek commented Mar 6, 2026

Description

Adds NeuronX Distributed Inference (NxDI) support for GLM-4.5 MoE (Glm4MoeForCausalLM), a ~70B-parameter Mixture-of-Experts language model from ZhipuAI / Tsinghua University. Follows the guidelines in contrib/CONTRIBUTING.md and uses PR #34 as a structural reference.

Model Information

Model Name: GLM-4.5 MoE (Air variant)
HuggingFace: zai-org/GLM-4.5-Air
Model Architecture: Decoder-only MoE transformer with partial RoPE, sigmoid group-limited routing, shared experts, and dense-first layers.
Parameters: ~70B total, ~9B active per token (128 routed experts, top-8 per token)
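
The partial-RoPE behaviour mentioned above (rotary embeddings applied to only half of each head's dimensions) can be sketched in pure Python. This is a minimal illustration using the classic interleaved pairing convention, with a made-up helper name; it is not the NxDI attention code, and real implementations typically pair dimensions differently and vectorize over tensors:

```python
import math

def apply_partial_rope(q, position, head_dim, partial_rotary_factor=0.5, base=10000.0):
    """Hypothetical helper: rotate only the first
    head_dim * partial_rotary_factor dimensions of one head's query/key
    vector; pass the remaining dimensions through unchanged."""
    rot_dim = int(head_dim * partial_rotary_factor)  # e.g. 64 of 128 dims
    out = list(q)
    for i in range(0, rot_dim, 2):
        theta = position / (base ** (i / rot_dim))
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out  # dims [rot_dim:] carry no positional rotation
```

At position 0 the rotation is the identity, and for any position the upper half of the vector is returned untouched, which is the property the `partial_rotary_factor=0.5` setting encodes.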

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (contrib/models/glm4_moe/test/integration/test_model.py)

    • Integration test using check_accuracy_logits_v2 with a reduced 2-layer random-weight model
    • Validates CPU (HuggingFace) vs. Neuron logit consistency (divergence_difference_tol=0.001)
    • Compiles and runs on Neuron hardware (trn2.3xlarge, no checkpoint download required)
  • README.md (contrib/models/glm4_moe/README.md)

    • Usage example (compile + generate)
    • Compatibility matrix (trn2.3xlarge tested, trn2.48xlarge recommended)
    • Example checkpoints with HuggingFace links
    • Testing instructions for unit and integration tests
  • Source Code (contrib/models/glm4_moe/src/glm4_moe/)

    • modeling_glm4_moe.py: full NxDI implementation
    • Properly structured under contrib/models/glm4_moe/
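
The idea behind the logit-consistency check can be illustrated with a toy stand-in (the real `check_accuracy_logits_v2` helper in NxDI has a different interface and a more involved divergence criterion): compare per-position logits from the CPU/HuggingFace reference against the Neuron-compiled model and fail if any entry differs by more than the tolerance.

```python
def logits_close(cpu_logits, neuron_logits, tol=0.001):
    """Toy sketch, not the NxDI API: return (passed, worst_abs_diff)
    for two lists of per-token logit rows."""
    worst = 0.0
    for ref_row, test_row in zip(cpu_logits, neuron_logits):
        for r, t in zip(ref_row, test_row):
            worst = max(worst, abs(r - t))
    return worst <= tol, worst
```

A 2-layer random-weight model keeps this comparison cheap while still exercising the full compile-and-execute path on Neuron hardware.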

Optional Components

  • Unit Tests (contrib/models/glm4_moe/test/unit/)

    • test_router.py: sigmoid group-limited top-k routing (10 tests, CPU-only)
    • test_attention.py: partial RoPE, QK norm, GQA (24 tests, CPU-only)
    • test_decoder.py: dense vs. MoE layer dispatch via first_k_dense_replace (15 tests, CPU-only)
  • vLLM Integration (contrib/models/glm4_moe/vllm/)

    • Offline inference script and OpenAI-compatible server launcher
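
The dense-vs-MoE dispatch rule exercised by `test_decoder.py` is simple enough to state in a few lines. The sketch below is a hypothetical helper, not the contrib code: the first `first_k_dense_replace` decoder layers use a dense MLP and every later layer uses the MoE block.

```python
def build_layer_kinds(num_layers, first_k_dense_replace=1):
    """Hypothetical helper: classify each decoder layer as 'dense' or
    'moe' according to the first_k_dense_replace config field."""
    return ["dense" if i < first_k_dense_replace else "moe"
            for i in range(num_layers)]
```

With GLM-4.5's `first_k_dense_replace=1`, only layer 0 is dense and all remaining layers route through experts.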

Folder Structure

```
contrib/models/glm4_moe/
├── README.md
├── examples/
│   └── generation_glm4_moe_demo.py    # contrib-level demo
├── src/glm4_moe/
│   ├── __init__.py
│   └── modeling_glm4_moe.py
├── test/
│   ├── conftest.py
│   ├── integration/
│   │   ├── config_glm4_moe_2layers.json
│   │   ├── test_model.py
│   │   └── utils.py
│   └── unit/
│       ├── test_attention.py
│       ├── test_decoder.py
│       └── test_router.py
└── vllm/
    ├── README.md
    ├── run_offline_inference.py
    └── start-vllm-server.sh

examples/
└── generation_glm4_moe.py             # top-level example (mirrors generation_qwen3_moe_demo.py)
```

Architecture Notes

GLM-4.5 MoE has several differences from standard MoE models (e.g. Qwen3MoE) that required custom implementations:

| Feature | GLM-4.5 MoE | Notes |
|---|---|---|
| RoPE | Partial (`partial_rotary_factor=0.5`) | Applied to first 50% of `head_dim` only |
| QKV bias | Yes (`attention_bias=True`) | |
| Router activation | Sigmoid (not softmax) | DeepSeek-style |
| Routing | Group-limited top-k (`n_group`, `topk_group`) | |
| Correction bias | `e_score_correction_bias` | Frozen buffer |
| Weight normalization | `norm_topk_prob` + `routed_scaling_factor` | |
| Shared experts | `n_shared_experts=1` | Always active |
| Dense-first layers | `first_k_dense_replace=1` | First N layers use dense MLP |
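
Several of these features meet in the router. The sketch below shows one plausible reading of sigmoid group-limited top-k routing for a single token, in pure Python. It is an illustration only, not the `NeuronGlm4MoeRouter` code: the helper name, the use of each group's best biased score for group ranking, and the exact normalization order are assumptions.

```python
import math

def route_tokens(logits, n_group=4, topk_group=2, top_k=2,
                 correction_bias=None, routed_scaling_factor=1.0):
    """Illustrative sketch: sigmoid scores plus a frozen correction bias
    pick topk_group expert groups, then top_k experts within them; gate
    weights are normalized unbiased scores times routed_scaling_factor."""
    n = len(logits)
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]      # sigmoid, not softmax
    bias = correction_bias or [0.0] * n
    biased = [s + b for s, b in zip(scores, bias)]             # e_score_correction_bias
    group_size = n // n_group
    # rank expert groups by their best biased score, keep topk_group groups
    group_best = [max(biased[g*group_size:(g+1)*group_size]) for g in range(n_group)]
    kept = sorted(range(n_group), key=lambda g: group_best[g], reverse=True)[:topk_group]
    allowed = [i for g in kept for i in range(g*group_size, (g+1)*group_size)]
    # top-k experts among the allowed groups, selected on biased scores
    chosen = sorted(allowed, key=lambda i: biased[i], reverse=True)[:top_k]
    # gate weights come from the unbiased scores (norm_topk_prob), rescaled
    total = sum(scores[i] for i in chosen) or 1.0
    weights = [scores[i] / total * routed_scaling_factor for i in chosen]
    return chosen, weights
```

With GLM-4.5's real configuration (128 routed experts, top-8 per token) the same shape applies, plus the always-active shared expert added outside the router.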

Testing

How did you test this change?

Tests included:

  • Unit tests for all model components (router, attention, decoder) — CPU only, no Neuron hardware required
  • Integration test validating logit accuracy using check_accuracy_logits_v2
```shell
# Unit tests (CPU only)
cd contrib/models/glm4_moe
pytest test/unit/ -v
# → 49/49 PASS

# Integration tests (requires Trn2 with ≥2 NeuronCores)
pytest test/integration/test_model.py -v -s
# → 4/4 PASS
```

Test Results (2026-03-06, trn2.3xlarge, NxDI 2.21+, transformers 4.56.2):

| Test Suite | Result |
|---|---|
| Unit: router top-k (10 tests) | ✅ PASS |
| Unit: partial RoPE / attention (24 tests) | ✅ PASS |
| Unit: decoder dispatch (15 tests) | ✅ PASS |
| Integration: model compile + load (3 tests) | ✅ PASS |
| Integration: check_accuracy_logits_v2 | ✅ PASS (divergence_difference_tol=0.001) |
| Total | ✅ 53/53 PASS |

circle-jin and others added 7 commits February 19, 2026 09:06
Adds NXD inference support for GLM-4.5 MoE (Glm4MoeForCausalLM) models.
Based on the DeepSeek architecture with group-limited routing, sigmoid
activation, and optional partial RoPE.

Key components:
- NeuronGlm4MoeForCausalLM: top-level CausalLM model class
- NeuronGlm4MoeModel: transformer body with dense + MoE layer selection
- NeuronGlm4MoeAttention: multi-head GQA with partial RoPE support
- NeuronGlm4MoeDecoderLayer: decoder layer dispatching dense vs. MoE MLP
- Glm4MoeInferenceConfig: config loader with Glm4Moe-specific field mapping
- NeuronGlm4MoeRouter: sigmoid-based group-limited top-k routing
- initialize_glm4_moe_module: wires router + ExpertMLPsV2 + SharedExperts

Supports tp_degree/moe_tp_degree/moe_ep_degree sharding via NXD process groups.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- examples/generation_glm4_moe_demo.py: compile and run inference demo
  with configurable tp_degree, seq_len, and model/traced-model paths
- test_glm4_moe_accuracy.py: CPU (HuggingFace) vs Neuron token-matching
  accuracy test; passes with greedy decoding (top_k=1)
- create_glm4_tiny_random.py: utility to create a small random-weight
  GLM-4.5 MoE checkpoint for local testing without downloading the full model

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- docs/glm4_moe_implementation.md: architecture overview, module breakdown,
  weight conversion details, sharding configuration guide
- docs/glm4_moe_testing.md: step-by-step testing guide with tiny random
  model, expected outputs, and troubleshooting notes

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- Add tensor_capture_hook = kwargs.get('tensor_capture_hook', None) to
  prepare_inputs_for_generation to fix NameError when tensor_capture_hook
  was referenced before assignment
- Add import inspect
- Remove unconditional tensor_capture_hook from model_inputs dict
- Conditionally include tensor_capture_hook only when the model's forward()
  signature accepts it (multimodal models only), preventing TypeError for
  text-only models like GLM-4.5 MoE
…tegration

- Add contrib/models/glm4_moe/ following NxDI contrib structure
- Source model: src/glm4_moe/modeling_glm4_moe.py (Glm4MoeInferenceConfig,
  NeuronGlm4MoeForCausalLM, partial RoPE, sigmoid group routing, shared experts)
- Unit tests: test/unit/ — router top-k, partial RoPE, decoder dispatch (49 tests, all PASS)
- Integration tests: test/integration/ — compile + check_accuracy_logits_v2 with
  reduced 2-layer random-weight config on trn2.3xlarge (PASS)
- Examples: examples/generation_glm4_moe_demo.py with CLI args
- vLLM integration: vllm/run_offline_inference.py + start-vllm-server.sh
- README.md with architecture details, compatibility matrix, validation results

Tested on trn2.3xlarge (LNC=2, TP=2), NxDI 2.21+, transformers>=4.56.0
Following the pattern of examples/generation_qwen3_moe_demo.py.
Targets trn2.48xlarge (tp=32, moe_tp=4, moe_ep=8, bs=4, seq=4096).
Adds contrib src path via sys.path for the contrib-based model.
@lifelongeeek lifelongeeek marked this pull request as ready for review March 6, 2026 10:13
