Add Trinity model family (AfmoeForCausalLM) contrib#55
Open
jimburtoft wants to merge 5 commits into aws-neuron:main from
Conversation
Unified NxDI implementation supporting all three Arcee AI Trinity sizes (Nano ~6B, Mini ~26B, Large ~250B) from a single modeling_trinity.py. Validated on SDK 2.27 (NxDI 0.7.15063, neuronx-cc 2.22.12471):

- Nano: inf2.8xlarge (TP=1) and trn2.3xlarge (TP=2)
- Mini: trn2.3xlarge (TP=4)
- Large: trn2.48xlarge (TP=64)
aarondou
approved these changes
Feb 27, 2026
Add layer_to_cache_size_mapping in setup_attr_for_model() to provide per-layer KV cache sizes for mixed attention models. Without this, KVCacheManager sizes all layers to sliding_window, causing a tensor shape mismatch in compute_for_token_gen when seq_len > sliding_window.

Update README with validated max sequence lengths:

- Nano TP=2: 40960, TP=4: 49152 (trn2.3xlarge)
- Mini TP=4: 32768 (trn2.3xlarge)
- Large TP=64: 30720 (trn2.48xlarge)

All verified with actual token generation at max seq_len.
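The per-layer sizing described above can be sketched as follows. This is an illustrative helper, not the PR's exact code; the names `layer_types`, `sliding_window`, and `build_layer_cache_sizes` are assumptions for the example. The idea is that sliding-window layers only need a window-sized KV cache, while global-attention layers need the full sequence length.

```python
# Hypothetical sketch: build a per-layer KV cache size map for a
# mixed sliding/global attention model. Sliding layers cap their
# cache at the window; global layers use the full max_length.
def build_layer_cache_sizes(layer_types, max_length, sliding_window):
    return [
        min(sliding_window, max_length) if t == "sliding" else max_length
        for t in layer_types
    ]

sizes = build_layer_cache_sizes(
    ["sliding", "sliding", "global", "sliding"],
    max_length=8192,
    sliding_window=4096,
)
# sizes == [4096, 4096, 8192, 4096]
```

Sizing every layer to `sliding_window` (the old behavior) would give the global layer a 4096-entry cache, which breaks once seq_len exceeds the window.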
aarondou
approved these changes
Mar 2, 2026
- Add TrinityKVCacheManager: per-layer KV cache management with uniform max_length buffers, per-layer scatter modulation (sliding vs global), and per-layer KV read slicing. Replaces layer_to_cache_size_mapping.
- Enable has_mixed_attn=True for dual attention masks (global + local)
- Restore sliding_window on attention layers for windowed_attention_forward
- Add bucketing config examples for all three model sizes (Nano/Mini/Large)
- Document bucketing restrictions (CTE buckets >= sliding_window, apply_seq_ids_mask required)
- Add validated bucketing results (Nano, trn2.3xlarge, TP=2, 4 prompts PASS)
- Update test_trinity.py with bucketing test support
- Update validation date to 2026-03-02
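The per-layer KV read slicing with uniform buffers can be illustrated with a minimal sketch (assumed shapes and the helper name `read_kv_for_layer` are mine, not the PR's API): every layer allocates the same max_length buffer, but sliding-window layers only read back the last `sliding_window` positions, while global layers read the full prefix.

```python
import numpy as np

# Hypothetical sketch of per-layer KV read slicing with uniform
# max_length buffers: the slice a layer reads depends on whether
# it uses sliding-window or global attention.
def read_kv_for_layer(cache, pos, is_sliding, sliding_window):
    # cache: [batch, heads, max_length, head_dim]; pos = tokens written
    start = max(0, pos - sliding_window) if is_sliding else 0
    return cache[:, :, start:pos, :]

cache = np.zeros((1, 2, 8192, 64))
kv = read_kv_for_layer(cache, pos=6000, is_sliding=True, sliding_window=4096)
# sliding layer reads positions 1904..5999 (length 4096)
```

Uniform buffers trade some memory for simpler cache management than per-layer buffer sizes; the sliding behavior moves into the read path instead.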
- Trinity-Mini: trn2.3xlarge, TP=4, buckets=[2048,4096], ALL PASS
- Trinity-Large: trn2.48xlarge, TP=64, buckets=[4096,8192], ALL PASS
- All three model sizes now have tested bucketing examples
- Add 34 CPU-only unit tests: test_config.py (22 tests) and test_weight_conversion.py (12 tests) covering config parsing, layer type generation, sliding window clamping, fused MoE eligibility, weight name mappings, muP scaling, expert stacking, route_scale baking, shared expert mapping, and gate padding
- Add Apache 2.0 copyright headers to all Python files
- Replace all print() with logging module (modeling + test files)
- Remove unused imports (KVCacheManager, Union)
- Add NxDI version provenance comments to copied attention methods
- Remove hardcoded paths from integration tests (env var + skip)
- Add configurable performance pass/fail criteria to integration tests
- Fix gate_proj interleaved padding for num_heads % tp_degree != 0 (was using incorrect 3D reshape; affects Large model TP=64 only)
- Update README with Neuron vs CPU accuracy results and test docs
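The interleaved padding fix for num_heads % tp_degree != 0 can be sketched roughly as below. This is a hypothetical illustration of the general technique, not the PR's gate_proj code: when heads do not divide evenly across tensor-parallel ranks, each rank's slice is zero-padded to the same head count inside its own shard, rather than appending all padding at the end of the weight.

```python
import numpy as np

# Hypothetical sketch of interleaved (per-shard) head padding for a
# row-wise weight of shape [num_heads * head_dim, in_features].
def pad_heads_interleaved(w, num_heads, head_dim, tp_degree):
    heads = w.reshape(num_heads, head_dim, -1)
    per_shard = -(-num_heads // tp_degree)       # ceil division
    base, extra = divmod(num_heads, tp_degree)   # uneven split
    shards, start = [], 0
    for rank in range(tp_degree):
        n = base + (1 if rank < extra else 0)    # real heads on this rank
        shard = heads[start:start + n]
        start += n
        if n < per_shard:                        # zero-pad this rank's slice
            pad = np.zeros((per_shard - n, head_dim, heads.shape[-1]))
            shard = np.concatenate([shard, pad], axis=0)
        shards.append(shard)
    out = np.concatenate(shards, axis=0)
    return out.reshape(per_shard * tp_degree * head_dim, -1)
```

With padding interleaved per shard, every rank sees a contiguous run of its own real heads followed by its own pad heads, which a flat end-of-tensor pad (or a wrong 3D reshape) does not guarantee.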
Description
Unified NxDI implementation for the Arcee AI Trinity model family (AfmoeForCausalLM). A single modeling_trinity.py supports all three model sizes -- Nano (~6B), Mini (~26B), and Large (~250B) -- with config-driven differences only.
Trinity is a Mixture-of-Experts architecture with several unique features: gated attention (sigmoid gate before o_proj), mixed sliding/full attention, QK normalization, conditional RoPE, expert bias in routing, and route_scale baked into weights.
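Of the features above, the gated attention can be shown with a minimal numpy sketch (illustrative only; function and argument names are assumptions, not the PR's code): a sigmoid gate elementwise-modulates the attention output before it passes through the output projection o_proj.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sketch of gated attention: the gate squashes each
# attention-output channel into (0, 1) before the o_proj matmul.
def gated_attn_output(attn_out, gate, w_o):
    # attn_out, gate: [seq, hidden]; w_o: [hidden, hidden]
    return (sigmoid(gate) * attn_out) @ w_o
```

In the real model the gate values come from a learned projection of the hidden states; here they are just passed in as an array.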
Model Information
Model Name: Trinity (Nano, Mini, Large)
Model Architecture: Mixture-of-Experts decoder-only transformer (AfmoeForCausalLM)
Purpose: Text generation (causal language modeling)
Checklist
Required Components
Optional Components
Folder Structure
/contrib/models/Trinity/
README.md
__init__.py
modeling_trinity.py
__init__.py
__init__.py
__init__.py
test_model.py
Testing
How did you test this change?
Each model size was compiled and loaded on the appropriate Neuron instance. Forward passes were run on 3 standardized prompts, and top-1 token predictions were verified for coherence. Multi-token generation (5 tokens) was tested via a naive autoregressive loop. CPU reference comparison is in progress, but all outputs are coherent and grammatically correct.
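A naive autoregressive loop of the kind described is sketched below (this is a generic illustration, not the repo's test harness; `model` is assumed to be any callable mapping a token-id array to per-position logits).

```python
import numpy as np

# Illustrative naive greedy decoding loop: feed the full sequence
# back through the model each step and append the argmax token.
def generate_naive(model, input_ids, max_new_tokens=5):
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = model(np.array(ids))         # [seq, vocab]
        next_id = int(np.argmax(logits[-1]))  # greedy top-1
        ids.append(next_id)
    return ids
```

Re-running the full prefix every step is O(n^2) in sequence length, which is fine for a 5-token smoke test but is exactly what the KV cache machinery avoids in production decoding.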
Test Results:
Sample first-token predictions (all models):
Compatibility
Tested with:
Additional Information
Key porting challenges solved:
Known limitations:
Related Issues
N/A -- This is a new model contribution.
By submitting this PR, I confirm that: