
[Contribution] GLM4MoeForCausalLM Support #58

Open
lifelongeeek wants to merge 8 commits into aws-neuron:main from lifelongeeek:feat/glm4-moe-support

Conversation


@lifelongeeek lifelongeeek commented Mar 6, 2026

Description

Adds NeuronX Distributed Inference (NxDI) support for GLM-4.5 MoE (Glm4MoeForCausalLM), a ~70B-parameter Mixture-of-Experts language model from ZhipuAI / Tsinghua University. Follows the guidelines in contrib/CONTRIBUTING.md and uses PR #34 as a structural reference.

Model Information

Model Name: GLM-4.5 MoE (Air variant)
HuggingFace: zai-org/GLM-4.5-Air
Model Architecture: Decoder-only MoE transformer with partial RoPE, sigmoid group-limited routing, shared experts, and dense-first layers.
Parameters: ~70B total, ~9B active per token (128 routed experts, top-8 per token)
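
The partial-RoPE behaviour mentioned above (rotary embeddings applied to only half of each head's dimensions) can be sketched in pure Python. This is a minimal illustration using the classic interleaved pairing convention, with a made-up helper name; it is not the NxDI attention code, and real implementations typically pair dimensions differently and vectorize over tensors:

```python
import math

def apply_partial_rope(q, position, head_dim, partial_rotary_factor=0.5, base=10000.0):
    """Hypothetical helper: rotate only the first
    head_dim * partial_rotary_factor dimensions of one head's query/key
    vector; pass the remaining dimensions through unchanged."""
    rot_dim = int(head_dim * partial_rotary_factor)  # e.g. 64 of 128 dims
    out = list(q)
    for i in range(0, rot_dim, 2):
        theta = position / (base ** (i / rot_dim))
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = q[i], q[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out  # dims [rot_dim:] carry no positional rotation
```

At position 0 the rotation is the identity, and for any position the upper half of the vector is returned untouched, which is the property the `partial_rotary_factor=0.5` setting encodes.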

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (contrib/models/glm4_moe/test/integration/test_model.py)

    • Integration test using check_accuracy_logits_v2 with a reduced 2-layer random-weight model
    • Validates CPU (HuggingFace) vs. Neuron logit consistency (divergence_difference_tol=0.001)
    • Compiles and runs on Neuron hardware (trn2.3xlarge, no checkpoint download required)
  • README.md (contrib/models/glm4_moe/README.md)

    • Usage example (compile + generate)
    • Compatibility matrix (trn2.3xlarge tested, trn2.48xlarge recommended)
    • Example checkpoints with HuggingFace links
    • Testing instructions for unit and integration tests
  • Source Code (contrib/models/glm4_moe/src/glm4_moe/)

    • modeling_glm4_moe.py: full NxDI implementation
    • Properly structured under contrib/models/glm4_moe/
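
The idea behind the logit-consistency check can be illustrated with a toy stand-in (the real `check_accuracy_logits_v2` helper in NxDI has a different interface and a more involved divergence criterion): compare per-position logits from the CPU/HuggingFace reference against the Neuron-compiled model and fail if any entry differs by more than the tolerance.

```python
def logits_close(cpu_logits, neuron_logits, tol=0.001):
    """Toy sketch, not the NxDI API: return (passed, worst_abs_diff)
    for two lists of per-token logit rows."""
    worst = 0.0
    for ref_row, test_row in zip(cpu_logits, neuron_logits):
        for r, t in zip(ref_row, test_row):
            worst = max(worst, abs(r - t))
    return worst <= tol, worst
```

A 2-layer random-weight model keeps this comparison cheap while still exercising the full compile-and-execute path on Neuron hardware.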

Optional Components

  • Unit Tests (contrib/models/glm4_moe/test/unit/)

    • test_router.py: sigmoid group-limited top-k routing (10 tests, CPU-only)
    • test_attention.py: partial RoPE, QK norm, GQA (24 tests, CPU-only)
    • test_decoder.py: dense vs. MoE layer dispatch via first_k_dense_replace (15 tests, CPU-only)
  • vLLM Integration (contrib/models/glm4_moe/vllm/)

    • Offline inference script and OpenAI-compatible server launcher
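
The dense-vs-MoE dispatch rule exercised by `test_decoder.py` is simple enough to state in a few lines. The sketch below is a hypothetical helper, not the contrib code: the first `first_k_dense_replace` decoder layers use a dense MLP and every later layer uses the MoE block.

```python
def build_layer_kinds(num_layers, first_k_dense_replace=1):
    """Hypothetical helper: classify each decoder layer as 'dense' or
    'moe' according to the first_k_dense_replace config field."""
    return ["dense" if i < first_k_dense_replace else "moe"
            for i in range(num_layers)]
```

With GLM-4.5's `first_k_dense_replace=1`, only layer 0 is dense and all remaining layers route through experts.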

Folder Structure

```
contrib/models/glm4_moe/
├── README.md
├── examples/
│   └── generation_glm4_moe_demo.py    # contrib-level demo
├── src/glm4_moe/
│   ├── __init__.py
│   └── modeling_glm4_moe.py
├── test/
│   ├── conftest.py
│   ├── integration/
│   │   ├── config_glm4_moe_2layers.json
│   │   ├── test_model.py
│   │   └── utils.py
│   └── unit/
│       ├── test_attention.py
│       ├── test_decoder.py
│       └── test_router.py
└── vllm/
    ├── README.md
    ├── run_offline_inference.py
    └── start-vllm-server.sh

examples/
└── generation_glm4_moe.py             # top-level example (mirrors generation_qwen3_moe_demo.py)
```

Architecture Notes

GLM-4.5 MoE has several differences from standard MoE models (e.g. Qwen3MoE) that required custom implementations:

| Feature | GLM-4.5 MoE | Notes |
|---|---|---|
| RoPE | Partial (`partial_rotary_factor=0.5`) | Applied to first 50% of `head_dim` only |
| QKV bias | Yes (`attention_bias=True`) | |
| Router activation | Sigmoid (not softmax) | DeepSeek-style |
| Routing | Group-limited top-k (`n_group`, `topk_group`) | |
| Correction bias | `e_score_correction_bias` | Frozen buffer |
| Weight normalization | `norm_topk_prob` + `routed_scaling_factor` | |
| Shared experts | `n_shared_experts=1` | Always active |
| Dense-first layers | `first_k_dense_replace=1` | First N layers use dense MLP |
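
Several of these features meet in the router. The sketch below shows one plausible reading of sigmoid group-limited top-k routing for a single token, in pure Python. It is an illustration only, not the `NeuronGlm4MoeRouter` code: the helper name, the use of each group's best biased score for group ranking, and the exact normalization order are assumptions.

```python
import math

def route_tokens(logits, n_group=4, topk_group=2, top_k=2,
                 correction_bias=None, routed_scaling_factor=1.0):
    """Illustrative sketch: sigmoid scores plus a frozen correction bias
    pick topk_group expert groups, then top_k experts within them; gate
    weights are normalized unbiased scores times routed_scaling_factor."""
    n = len(logits)
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]      # sigmoid, not softmax
    bias = correction_bias or [0.0] * n
    biased = [s + b for s, b in zip(scores, bias)]             # e_score_correction_bias
    group_size = n // n_group
    # rank expert groups by their best biased score, keep topk_group groups
    group_best = [max(biased[g*group_size:(g+1)*group_size]) for g in range(n_group)]
    kept = sorted(range(n_group), key=lambda g: group_best[g], reverse=True)[:topk_group]
    allowed = [i for g in kept for i in range(g*group_size, (g+1)*group_size)]
    # top-k experts among the allowed groups, selected on biased scores
    chosen = sorted(allowed, key=lambda i: biased[i], reverse=True)[:top_k]
    # gate weights come from the unbiased scores (norm_topk_prob), rescaled
    total = sum(scores[i] for i in chosen) or 1.0
    weights = [scores[i] / total * routed_scaling_factor for i in chosen]
    return chosen, weights
```

With GLM-4.5's real configuration (128 routed experts, top-8 per token) the same shape applies, plus the always-active shared expert added outside the router.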

Testing

How did you test this change?

Tests included:

  • Unit tests for all model components (router, attention, decoder) — CPU only, no Neuron hardware required
  • Integration test validating logit accuracy using check_accuracy_logits_v2
```shell
# Unit tests (CPU only)
cd contrib/models/glm4_moe
pytest test/unit/ -v
# → 49/49 PASS

# Integration tests (requires Trn2 with ≥2 NeuronCores)
pytest test/integration/test_model.py -v -s
# → 4/4 PASS
```

Test Results (2026-03-06, trn2.3xlarge, NxDI 2.21+, transformers 4.56.2):

| Test Suite | Result |
|---|---|
| Unit: router top-k (10 tests) | ✅ PASS |
| Unit: partial RoPE / attention (24 tests) | ✅ PASS |
| Unit: decoder dispatch (15 tests) | ✅ PASS |
| Integration: model compile + load (3 tests) | ✅ PASS |
| Integration: check_accuracy_logits_v2 | ✅ PASS (divergence_difference_tol=0.001) |
| Total | ✅ 53/53 PASS |

circle-jin and others added 7 commits February 19, 2026 09:06
Adds NXD inference support for GLM-4.5 MoE (Glm4MoeForCausalLM) models.
Based on the DeepSeek architecture with group-limited routing, sigmoid
activation, and optional partial RoPE.

Key components:
- NeuronGlm4MoeForCausalLM: top-level CausalLM model class
- NeuronGlm4MoeModel: transformer body with dense + MoE layer selection
- NeuronGlm4MoeAttention: multi-head GQA with partial RoPE support
- NeuronGlm4MoeDecoderLayer: decoder layer dispatching dense vs. MoE MLP
- Glm4MoeInferenceConfig: config loader with Glm4Moe-specific field mapping
- NeuronGlm4MoeRouter: sigmoid-based group-limited top-k routing
- initialize_glm4_moe_module: wires router + ExpertMLPsV2 + SharedExperts

Supports tp_degree/moe_tp_degree/moe_ep_degree sharding via NXD process groups.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- examples/generation_glm4_moe_demo.py: compile and run inference demo
  with configurable tp_degree, seq_len, and model/traced-model paths
- test_glm4_moe_accuracy.py: CPU (HuggingFace) vs Neuron token-matching
  accuracy test; passes with greedy decoding (top_k=1)
- create_glm4_tiny_random.py: utility to create a small random-weight
  GLM-4.5 MoE checkpoint for local testing without downloading the full model

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- docs/glm4_moe_implementation.md: architecture overview, module breakdown,
  weight conversion details, sharding configuration guide
- docs/glm4_moe_testing.md: step-by-step testing guide with tiny random
  model, expected outputs, and troubleshooting notes

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
- Add tensor_capture_hook = kwargs.get('tensor_capture_hook', None) to
  prepare_inputs_for_generation to fix NameError when tensor_capture_hook
  was referenced before assignment
- Add import inspect
- Remove unconditional tensor_capture_hook from model_inputs dict
- Conditionally include tensor_capture_hook only when the model's forward()
  signature accepts it (multimodal models only), preventing TypeError for
  text-only models like GLM-4.5 MoE
…tegration

- Add contrib/models/glm4_moe/ following NxDI contrib structure
- Source model: src/glm4_moe/modeling_glm4_moe.py (Glm4MoeInferenceConfig,
  NeuronGlm4MoeForCausalLM, partial RoPE, sigmoid group routing, shared experts)
- Unit tests: test/unit/ — router top-k, partial RoPE, decoder dispatch (49 tests, all PASS)
- Integration tests: test/integration/ — compile + check_accuracy_logits_v2 with
  reduced 2-layer random-weight config on trn2.3xlarge (PASS)
- Examples: examples/generation_glm4_moe_demo.py with CLI args
- vLLM integration: vllm/run_offline_inference.py + start-vllm-server.sh
- README.md with architecture details, compatibility matrix, validation results

Tested on trn2.3xlarge (LNC=2, TP=2), NxDI 2.21+, transformers>=4.56.0
Following the pattern of examples/generation_qwen3_moe_demo.py.
Targets trn2.48xlarge (tp=32, moe_tp=4, moe_ep=8, bs=4, seq=4096).
Adds contrib src path via sys.path for the contrib-based model.
@lifelongeeek lifelongeeek marked this pull request as ready for review March 6, 2026 10:13
