Add Trinity model family (AfmoeForCausalLM) contrib#55
Open
jimburtoft wants to merge 5 commits into aws-neuron:main from
Conversation
Unified NxDI implementation supporting all three Arcee AI Trinity sizes (Nano ~6B, Mini ~26B, Large ~250B) from a single modeling_trinity.py. Validated on SDK 2.27 (NxDI 0.7.15063, neuronx-cc 2.22.12471):

- Nano: inf2.8xlarge (TP=1) and trn2.3xlarge (TP=2)
- Mini: trn2.3xlarge (TP=4)
- Large: trn2.48xlarge (TP=64)
aarondou
approved these changes
Feb 27, 2026
Add layer_to_cache_size_mapping in setup_attr_for_model() to provide per-layer KV cache sizes for mixed attention models. Without this, KVCacheManager sizes all layers to sliding_window, causing a tensor shape mismatch in compute_for_token_gen when seq_len > sliding_window.

Update README with validated max sequence lengths:

- Nano TP=2: 40960, TP=4: 49152 (trn2.3xlarge)
- Mini TP=4: 32768 (trn2.3xlarge)
- Large TP=64: 30720 (trn2.48xlarge)

All verified with actual token generation at max seq_len.
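The per-layer sizing described above can be sketched as follows. This is an illustrative helper, not the PR's exact code; the names `layer_types`, `sliding_window`, and `build_layer_cache_sizes` are assumptions for the example. The idea is that sliding-window layers only need a window-sized KV cache, while global-attention layers need the full sequence length.

```python
# Hypothetical sketch: build a per-layer KV cache size map for a
# mixed sliding/global attention model. Sliding layers cap their
# cache at the window; global layers use the full max_length.
def build_layer_cache_sizes(layer_types, max_length, sliding_window):
    return [
        min(sliding_window, max_length) if t == "sliding" else max_length
        for t in layer_types
    ]

sizes = build_layer_cache_sizes(
    ["sliding", "sliding", "global", "sliding"],
    max_length=8192,
    sliding_window=4096,
)
# sizes == [4096, 4096, 8192, 4096]
```

Sizing every layer to `sliding_window` (the old behavior) would give the global layer a 4096-entry cache, which breaks once seq_len exceeds the window.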
aarondou
approved these changes
Mar 2, 2026
- Add TrinityKVCacheManager: per-layer KV cache management with uniform max_length buffers, per-layer scatter modulation (sliding vs global), and per-layer KV read slicing. Replaces layer_to_cache_size_mapping.
- Enable has_mixed_attn=True for dual attention masks (global + local)
- Restore sliding_window on attention layers for windowed_attention_forward
- Add bucketing config examples for all three model sizes (Nano/Mini/Large)
- Document bucketing restrictions (CTE buckets >= sliding_window, apply_seq_ids_mask required)
- Add validated bucketing results (Nano, trn2.3xlarge, TP=2, 4 prompts PASS)
- Update test_trinity.py with bucketing test support
- Update validation date to 2026-03-02
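The per-layer KV read slicing with uniform buffers can be illustrated with a minimal sketch (assumed shapes and the helper name `read_kv_for_layer` are mine, not the PR's API): every layer allocates the same max_length buffer, but sliding-window layers only read back the last `sliding_window` positions, while global layers read the full prefix.

```python
import numpy as np

# Hypothetical sketch of per-layer KV read slicing with uniform
# max_length buffers: the slice a layer reads depends on whether
# it uses sliding-window or global attention.
def read_kv_for_layer(cache, pos, is_sliding, sliding_window):
    # cache: [batch, heads, max_length, head_dim]; pos = tokens written
    start = max(0, pos - sliding_window) if is_sliding else 0
    return cache[:, :, start:pos, :]

cache = np.zeros((1, 2, 8192, 64))
kv = read_kv_for_layer(cache, pos=6000, is_sliding=True, sliding_window=4096)
# sliding layer reads positions 1904..5999 (length 4096)
```

Uniform buffers trade some memory for simpler cache management than per-layer buffer sizes; the sliding behavior moves into the read path instead.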
- Trinity-Mini: trn2.3xlarge, TP=4, buckets=[2048,4096], ALL PASS
- Trinity-Large: trn2.48xlarge, TP=64, buckets=[4096,8192], ALL PASS
- All three model sizes now have tested bucketing examples
- Add 34 CPU-only unit tests: test_config.py (22 tests) and test_weight_conversion.py (12 tests) covering config parsing, layer type generation, sliding window clamping, fused MoE eligibility, weight name mappings, muP scaling, expert stacking, route_scale baking, shared expert mapping, and gate padding
- Add Apache 2.0 copyright headers to all Python files
- Replace all print() with logging module (modeling + test files)
- Remove unused imports (KVCacheManager, Union)
- Add NxDI version provenance comments to copied attention methods
- Remove hardcoded paths from integration tests (env var + skip)
- Add configurable performance pass/fail criteria to integration tests
- Fix gate_proj interleaved padding for num_heads % tp_degree != 0 (was using incorrect 3D reshape; affects Large model TP=64 only)
- Update README with Neuron vs CPU accuracy results and test docs
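The interleaved padding fix for num_heads % tp_degree != 0 can be sketched roughly as below. This is a hypothetical illustration of the general technique, not the PR's gate_proj code: when heads do not divide evenly across tensor-parallel ranks, each rank's slice is zero-padded to the same head count inside its own shard, rather than appending all padding at the end of the weight.

```python
import numpy as np

# Hypothetical sketch of interleaved (per-shard) head padding for a
# row-wise weight of shape [num_heads * head_dim, in_features].
def pad_heads_interleaved(w, num_heads, head_dim, tp_degree):
    heads = w.reshape(num_heads, head_dim, -1)
    per_shard = -(-num_heads // tp_degree)       # ceil division
    base, extra = divmod(num_heads, tp_degree)   # uneven split
    shards, start = [], 0
    for rank in range(tp_degree):
        n = base + (1 if rank < extra else 0)    # real heads on this rank
        shard = heads[start:start + n]
        start += n
        if n < per_shard:                        # zero-pad this rank's slice
            pad = np.zeros((per_shard - n, head_dim, heads.shape[-1]))
            shard = np.concatenate([shard, pad], axis=0)
        shards.append(shard)
    out = np.concatenate(shards, axis=0)
    return out.reshape(per_shard * tp_degree * head_dim, -1)
```

With padding interleaved per shard, every rank sees a contiguous run of its own real heads followed by its own pad heads, which a flat end-of-tensor pad (or a wrong 3D reshape) does not guarantee.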
Description
Unified NxDI implementation for the Arcee AI Trinity model family (AfmoeForCausalLM). A single modeling_trinity.py supports all three model sizes -- Nano (~6B), Mini (~26B), and Large (~250B) -- with config-driven differences only.
Trinity is a Mixture-of-Experts architecture with several unique features: gated attention (sigmoid gate before o_proj), mixed sliding/full attention, QK normalization, conditional RoPE, expert bias in routing, and route_scale baked into weights.
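Of the features above, the gated attention can be shown with a minimal numpy sketch (illustrative only; function and argument names are assumptions, not the PR's code): a sigmoid gate elementwise-modulates the attention output before it passes through the output projection o_proj.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sketch of gated attention: the gate squashes each
# attention-output channel into (0, 1) before the o_proj matmul.
def gated_attn_output(attn_out, gate, w_o):
    # attn_out, gate: [seq, hidden]; w_o: [hidden, hidden]
    return (sigmoid(gate) * attn_out) @ w_o
```

In the real model the gate values come from a learned projection of the hidden states; here they are just passed in as an array.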
Model Information
Model Name: Trinity (Nano, Mini, Large)
Model Architecture: Mixture-of-Experts decoder-only transformer (AfmoeForCausalLM)
Purpose: Text generation (causal language modeling)
Checklist
Required Components
Optional Components
Folder Structure
/contrib/models/Trinity/
README.md
__init__.py
modeling_trinity.py
__init__.py
__init__.py
__init__.py
test_model.py
Testing
How did you test this change?
Each model size was compiled and loaded on the appropriate Neuron instance. Forward passes were run on 3 standardized prompts, and top-1 token predictions were verified for coherence. Multi-token generation (5 tokens) was tested via a naive autoregressive loop. CPU reference comparison is in progress, but all outputs are coherent and grammatically correct.
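A naive autoregressive loop of the kind described is sketched below (this is a generic illustration, not the repo's test harness; `model` is assumed to be any callable mapping a token-id array to per-position logits).

```python
import numpy as np

# Illustrative naive greedy decoding loop: feed the full sequence
# back through the model each step and append the argmax token.
def generate_naive(model, input_ids, max_new_tokens=5):
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = model(np.array(ids))         # [seq, vocab]
        next_id = int(np.argmax(logits[-1]))  # greedy top-1
        ids.append(next_id)
    return ids
```

Re-running the full prefix every step is O(n^2) in sequence length, which is fine for a 5-token smoke test but is exactly what the KV cache machinery avoids in production decoding.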
Test Results:
Sample first-token predictions (all models):
Compatibility
Tested with:
Additional Information
Key porting challenges solved:
Known limitations:
Related Issues
N/A -- This is a new model contribution.
By submitting this PR, I confirm that: