Draft
Changes from all commits
166 commits
e56397d
[None][feat] Support tensor parallelism of trtllm moe backend for nem…
Wanli-Jiang Feb 26, 2026
41dd9e0
[None][test] Add tests for all database configs. (#11653)
fsaady Feb 26, 2026
3fd5faf
[https://nvbugs/5911143][fix] add async worker to MTP/Eagle3 sampler,…
dhansen-nvidia Feb 26, 2026
80aa8ca
[TRTLLM-10886][feat] Support PARD(Parallel Draft Model) in one-model …
ziyixiong-nv Feb 26, 2026
cde2592
[None][fix] Fix disagg cancellation (#11730)
Tabrizian Feb 26, 2026
a7b17f3
[None][fix] Use prefer_pinned() in pard.py (#11762)
mikeiovine Feb 27, 2026
3e12071
[None][fix] Make KVCacheManagerV2 release mem immediately on shutdown…
lowsfer Feb 27, 2026
097ecea
[TRTLLM-11115][feat] enable autotuner for visual gen + Compilation Co…
NVShreyas Feb 27, 2026
6f7138a
[None][chore] Minor fix in w4a8 mxfp4 mxfp8 test. (#11745)
Tracin Feb 27, 2026
50b48c1
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Feb 27, 2026
c65aee9
[None][infra] Move B200 test stage to AIHub (#11692)
yuanjingx87 Feb 27, 2026
57c2904
[None][infra] Waive failed cases for main on 02/27 (#11770)
EmmaQiaoCh Feb 27, 2026
2237d7d
[TRTLLM-11064][fix] Remove duplicated MoE Computation with Helix CP+D…
brb-nv Feb 27, 2026
37ab642
[TRTLLM-10386][fix] torch.compile: register add+norm fallback pass in…
luyiyun1021 Feb 27, 2026
55077fe
[None][feat] Support heterogeneous tokens_per_block (#11751)
lowsfer Feb 27, 2026
ab99ddf
[None][chore] Remove closed bugs (#11527)
xinhe-nv Feb 27, 2026
cb1a872
[None][test] local wheel installation support and add gb300 cases dem…
fredricz-20070104 Feb 27, 2026
7a06614
[None][feat] Refactor cache manager v2 to simplify new model support …
jiaganc Feb 27, 2026
c2d766b
[https://nvbugs/5879614][test] Waive test_guided_decoding_with_eagle3…
ziyixiong-nv Feb 27, 2026
985f81d
[https://nvbugs/5911788][test] Waive test_llm_partial_update_weights[…
liji-nv Feb 27, 2026
63c33c7
[None][feat] add globaltimer-based timing backend for autotuner profi…
dhansen-nvidia Feb 27, 2026
2220d48
[https://nvbugs/5926823][fix] Propagate logprobs from prefill to deco…
brb-nv Feb 27, 2026
d42911a
[TRTLLMINF-9][chore] Remove submodule pulls from TRT-LLM git checkout…
dpitman-nvda Feb 27, 2026
b5921b1
[https://nvbugs/5685010][fix] Delete test_eagle3_output_repetition_4g…
zheyuf Feb 27, 2026
ab7a20a
[None][fix] enable separate draft KV cache pool for aggregated + KVBM…
zyang-Modular Feb 27, 2026
3fe0908
[TRTLLM-11058][feat] Support Helix CP with GQA (#11570)
brb-nv Feb 27, 2026
e396442
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Feb 28, 2026
5ddeaf9
[None][perf] Vectorize quantize_fp8_blockwise with CUDA kernel (#11724)
karljang Feb 28, 2026
bb5cf9b
[https://nvbugs/5868616][fix] Fix warnings when building moe_kernels.…
yumin066 Feb 28, 2026
1d576c3
[None][chore] Add CI trigger and test failure retrieval instructions …
lucaslie Feb 28, 2026
b8bf27a
[None][fix] Fix typo: avaiable_blocks -> available_blocks in schedule…
kaiyux Feb 28, 2026
7bd01d2
[TRTLLM-11568][feat] Fix collective calls (#11632)
greg-kwasniewski1 Feb 28, 2026
841608f
[None][perf] Use F.rms_norm for per-head QK normalization in visual g…
karljang Mar 1, 2026
49b9e1b
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Mar 1, 2026
f6acde1
[TRTLLM-11185][test] Add back WAN VBench test in CI (#11804)
chang-l Mar 1, 2026
0df2ec6
[TRTLLM-9782][feat] Support to skip KV cache memory estimation (#11714)
HuiGao-NV Mar 1, 2026
1a349cd
[None][doc] Fix typos, grammar, and accuracy across documentation (#1…
kaiyux Mar 1, 2026
37343d4
[None][fix] cleanup mem in rollout process (#11658)
hchings Mar 1, 2026
a413f21
[None][feat] Add --served-model-name option to serve command (#11711)
slin1237 Mar 1, 2026
ea7a708
[None][chore] Update AGENTS.md (#11809)
lucaslie Mar 1, 2026
a20745a
[None][fix] AutoDeploy: Fix shape handling for singleton prefill (#11…
galagam Mar 1, 2026
17eaed5
[None][infra] Waive failed cases for main on 03/01 (#11811)
EmmaQiaoCh Mar 1, 2026
e8ad899
[None][feat] TRT-LLM Gen MoE finalize kernel optimization (#11501)
nekorobov Mar 1, 2026
17e03fa
[None][test] Add E2E test for cancelled disagg gen request with overl…
Tabrizian Mar 2, 2026
aa7632e
[None][chore] pass nsight options to ray_executor and trigger profili…
davidmlw Mar 2, 2026
4e9aa86
[TRTLLM-10962][feat] Refactor video encoding to use ffmpeg CLI or pur…
JunyiXu-nv Mar 2, 2026
50713d8
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Mar 2, 2026
85dc52a
[https://nvbugs/5823212][fix] Warmup maybe_compiled_cat in forward_co…
yuantailing Mar 2, 2026
3ab7770
[None][feat] Extract embeding as .savetensors and support float8 quan…
nvyocox Mar 2, 2026
361132b
[https://nvbugs/5885070][fix] fix deepeplowlatency with cutedsl moe b…
leslie-fang25 Mar 2, 2026
9013b58
[None][fix] Fix FP8 per-tensor torch.compile graph break in dynamic q…
karljang Mar 2, 2026
a28def9
[TRTLLM-9687][feat] Improve are_stop_words performance (#11196)
stnie Mar 2, 2026
f449845
[https://nvbugs/5883738][fix] fix bug for illegal memory access on Qw…
sunnyqgg Mar 2, 2026
4b4de81
[#10693][chore] AutoDeploy: Add L1 tests from coverage dashboard (#11…
marinayanov Mar 2, 2026
3b8b91f
[https://nvbugs/5764627][fix] Fix generation logits with streaming an…
stnie Mar 2, 2026
788b868
[https://nvbugs/5934461][fix] Propagate logits from prefill to decode…
brb-nv Mar 2, 2026
812d2ce
[#11726][feat] AutoDeploy: Fuse gemms of mixed children (#11793)
taylor-yb-lee Mar 2, 2026
971c4f0
[None][fix] Fix overly aggressive capacity scheduler (#11731)
jthomson04 Mar 2, 2026
a632f0f
[https://nvbugs/5689262][fix] use proper tokens when exclude_input_in…
lazykyama Mar 2, 2026
c6b1ed4
[https://nvbugs/5863912][fix] Fix with move launch_dependent_grids af…
benzh-2025 Mar 3, 2026
9c9d00d
[https://nvbugs/5938603][fix] Fix E/PD disagg chunked prefill bug (#1…
2ez4bz Mar 3, 2026
8553560
[None][test] add deepseek RCCA perf test case (#11736)
ruodil Mar 3, 2026
998e939
[None][fix] remove torch compile models arg (#11836)
NVShreyas Mar 3, 2026
ec2d8f4
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Mar 3, 2026
9afa7a4
[None][test] add b200 multi nodes tests db (#11783)
xinhe-nv Mar 3, 2026
ae6bc67
[None][fix] Fix SM120 issue for rms_norm with nvfp4_quant_fusion (#11…
Wanli-Jiang Mar 3, 2026
58e9f1b
[None][infra] Waive failed cases for main for post-merge 2564 (#11848)
ZhanruiSunCh Mar 3, 2026
fa76c44
[https://nvbugs/5936502][fix] remove dead codes (#11813)
bo-nv Mar 3, 2026
d5f6053
[None][chore] a GitHub Action to assign the PR to the author (#11673)
zhenhuaw-me Mar 3, 2026
e9c495e
[None][infra] Fix a typo in waives.txt (#11852)
EmmaQiaoCh Mar 3, 2026
6367dc2
[None][test] Fix wrong lora config (#11818)
yufeiwu-nv Mar 3, 2026
11fc7cb
[None][test] fix flaky issues (#11814)
xinhe-nv Mar 3, 2026
fd1738a
[None][fix] Fix OOM issue/dummy request allocation/chunked prefill/pp…
yizhang-nv Mar 3, 2026
5d13ebb
[None][test] update waive list (#11815)
xinhe-nv Mar 3, 2026
695d7a0
[TRTLLM-9939][perf] Short-sequence MHA optimization for DSA MLA prefi…
kaiyux Mar 3, 2026
3d348ab
[None][refactor] Revisit attention interface for AutoDeploy (#11796)
lucaslie Mar 3, 2026
2ba9140
[None][feat] Add a flag in trtllm serve to support overriding kv cach…
cjluo-nv Mar 3, 2026
3606afa
[TRTLLMINF-9][chore] Use checkoutFile in mergeWaiveList to avoid full…
dpitman-nvda Mar 3, 2026
e6374a8
[None][chore] Refresh inferenceX configs in recipes (#11595)
venkywonka Mar 3, 2026
e99d74f
[TRTLLM-11042][feat] Implement suffix automaton on device for spec an…
cascade812 Mar 3, 2026
6041a78
[https://nvbugs/5941681][fix] Handle dict type for speculative_config…
ziyixiong-nv Mar 4, 2026
7c74539
[None][feat] Add Kimi-K2.5 text model support (NVFP4) (#11777)
lancelly Mar 4, 2026
c2c9a42
[None][chore] Bump version to 1.3.0rc7 (#11864)
yuanjingx87 Mar 4, 2026
0a5a5e7
[https://nvbugs/5919026][fix] Fix AttributeError when DSA indexer acc…
ziyixiong-nv Mar 4, 2026
54d14df
[TRTLLM-11184][feat] Explicit video encode format support (#11830)
JunyiXu-nv Mar 4, 2026
763bce5
[None][test] Enable DeepGemm + DeepEPLowLatency MoE test combination …
Tabrizian Mar 4, 2026
1c2afe9
[#10009][fix] Fix json_schema response_format to support OpenAI API w…
JunyiXu-nv Mar 4, 2026
0518cc4
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Mar 4, 2026
4f2a230
[https://nvbugs/5927620][fix] Override mMaxAttentionWindow with the a…
ziyixiong-nv Mar 4, 2026
72091b3
[None][feat] Support mix quantization between shared experts and rout…
dmtri35 Mar 4, 2026
a106419
[#11666][fix] Fix inmemory model dir detection (#11753)
capyun007 Mar 4, 2026
77d48e0
[None][infra] Waive 3 failed cases for main in post-merge 2566 (#11881)
ZhanruiSunCh Mar 4, 2026
a9d247f
[None][doc] Add sparse attention tech blog (#11644)
heyuhhh Mar 4, 2026
2dbd154
[TRTLLM-9392][feat] Support MoE output to alltoall's workspace for al…
bobboli Mar 4, 2026
d3536f1
[TRTLLM-10852][feat] Enhance logprobs functionality to always return …
stnie Mar 4, 2026
e3d1e55
[None][fix] Fix typos, grammar, and formatting in comments and docstr…
kaiyux Mar 4, 2026
2bae787
[None][fix] Update check_is_moe into support mlp_layer_types after co…
eagle705 Mar 4, 2026
941a245
[https://nvbugs/5946303][fix] Fix incorrect GPU timing in time breakd…
luyiyun1021 Mar 4, 2026
b15062e
[None][chore] Update autotuner (#11859)
jiahanc Mar 4, 2026
e0f54b3
[None][chore] Handle failure in auto-assign author workflow (#11906)
zhenhuaw-me Mar 4, 2026
cb231c5
[https://nvbugs/5930934][fix] Fix OOM hang with NCCL_SYMMETRIC fallba…
peihu-nv Mar 4, 2026
e4332e0
[None][fix] Qwen3.5 fix positions ids input for text-only usage (#11877)
bmarimuthu-nv Mar 4, 2026
234eb83
[None][fix] Refactor nanoV3+superV3 accuracy tests to load example co…
galagam Mar 4, 2026
e3788f3
[None][chore] Deprecate eagle3 2-model (#11761)
mikeiovine Mar 4, 2026
5b81307
[#11819][fix] Disable preload for Llama4 scout (#11873)
taylor-yb-lee Mar 4, 2026
460889f
[None][chore] Fix format issue in tensorrt_llm/serve/openai_server.py…
chienchunhung Mar 4, 2026
011af4c
[None][feat] Separate radix search tree implementation (#10862)
thorjohnsen Mar 5, 2026
e01c38f
[None][feat] Add support for expert_number<=2048 and K<=32 (#11510)
ChristinaZ Mar 5, 2026
4e735db
[None][infra] Waive 1 failed cases for main in pre-merge 29212 (#11929)
ZhanruiSunCh Mar 5, 2026
559828c
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Mar 5, 2026
22007ca
[None][fix] remove leak check for kimi (#11825)
xinhe-nv Mar 5, 2026
12f2f39
[https://nvbugs/5907477][chore] unwaive test (#11896)
reasonsolo Mar 5, 2026
17921f8
[TRTLLM-10956][infra] Support build-only mode for GenPostMergeBuilds …
mzweilz Mar 5, 2026
9da717a
[#11755][feat] AutoDeploy onboarding agent + Kimi K2.5 AD modeling co…
bmarimuthu-nv Mar 5, 2026
5b0e8a9
[None][fix] Prevent RuntimeError from dict mutation during iteration …
Bias92 Mar 5, 2026
2f4ed7d
[TRTLLM-11101][feat] VisualGen benchmarking script (#11651)
zhenhuaw-me Mar 5, 2026
2ee7dba
[None][feat] Run extra general warmup to warm up memory pool (#10340)
liji-nv Mar 5, 2026
517ee94
[None][fix] Fix nemotron super MTP crash on SM90 (#11807)
sunnyqgg Mar 5, 2026
6062df4
[None][chore] Use cluster service discover in disagg CI tests (#11242)
ekou24 Mar 5, 2026
4786834
[None][feat] External Drafter One Model (#11758)
IzzyPutterman Mar 5, 2026
497b07d
[None][chore] Update model list (#11827)
tcherckez-nvidia Mar 5, 2026
5f1fb7c
[#11578][fix] Use string stop/bad words in gRPC proto instead of pre-…
CatherineSue Mar 6, 2026
e699f23
[None][feat] Add support for bidirectional sliding window attention m…
djns99 Mar 6, 2026
c6dbef2
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Mar 6, 2026
dd61fd5
[TRTLLM-11036][feat] Enable new moe test and clean the legacy moe tes…
xxi-nv Mar 6, 2026
93a62dc
[None][infra] Waive 4 failed cases for main in post-merge 2571 (#11968)
ZhanruiSunCh Mar 6, 2026
93ac4a0
[None][test] Fix deepseek-r1 OOM issue for H100 perf test (#11948)
yufeiwu-nv Mar 6, 2026
a018c48
[None][fix] Remove incorrect Python import style rule from AGENTS.md …
yuxianq Mar 6, 2026
a7c0af5
[https://nvbugs/5896577][fix] fix bug of mistral large3 with eagle (#…
byshiue Mar 6, 2026
f639e8b
[https://nvbugs/5819048][fix] unwaive test of qwen3-235b eagle3 (#11969)
byshiue Mar 6, 2026
c6c6dc1
[None][feat] Avoid duplicated computation with ADP + Helix CP in GQA …
brb-nv Mar 6, 2026
191e349
[https://nvbugs/5624818][fix] Add unittest for GPT-OSS non-paged_cont…
pengbowang-nv Mar 6, 2026
7f458ab
[#10245][feat] AutoDeploy: Support Finegrained FP8 quantization (#10897)
bmarimuthu-nv Mar 6, 2026
b94656c
[TRTLLM-11284][infra] Move large models test to post-merge (#11933)
EmmaQiaoCh Mar 6, 2026
dc740c2
[TRTLLM-11155][infra] Run multi-GPU tests even single-GPU tests are f…
yiqingy0 Mar 6, 2026
4dc7bc5
[None][fix] Refine tests/unittest/_torch/flashinfer/test_trtllm_flash…
yihwang-nv Mar 6, 2026
b5a4e34
[#11422][feat] AutoDeploy: Piecewise cudagraph support Prototype (#11…
nvchenghaoz Mar 6, 2026
5b0c956
[TRTLLM-11189][fix] VisualGen isolated TeaCache Wan fix (#11964)
o-stoner Mar 6, 2026
22c4706
[https://nvbugs/5846166][fix] Update Perf Triage Scripts to Fix gen_o…
chenfeiz0326 Mar 6, 2026
ac8bc6e
[TRTLLM-11057][feat] Add Helix CP support for DSV3.2 (#11507)
brb-nv Mar 6, 2026
427369e
[#2912][feat] Support Cohere Command A model (#11505)
torotoki Mar 6, 2026
498b25c
[TRTLLM-11259][perf] Parallel VAE harness and implementation for WAN …
NVShreyas Mar 6, 2026
5918348
[#11578][feat] support multimodal image input in gRPC server (#11800)
CatherineSue Mar 6, 2026
d1ba3b8
[TRTLLM-11093][feat] add 5D A2A for fused ulysses (#11787)
NVShreyas Mar 6, 2026
7dbda08
[TRTLLM-11189][fix] Fix TeaCache broken caching for FLUX.1 and FLUX.2…
karljang Mar 6, 2026
2087b24
[None][refactor] Request management in ScheduledRequests (#11784)
Funatiq Mar 7, 2026
10348f8
[None][perf] Add Triton FP8 blockwise quant kernel and autotuner buck…
chang-l Mar 7, 2026
2eb332c
[TRTLLM-11290][feat] Enable trtllm-serve E2E tests (#11985)
JunyiXu-nv Mar 7, 2026
cc16289
[None][feat] Optimize by fuse nvfp4_quant to layernorm_gated for mamb…
Wanli-Jiang Mar 7, 2026
dd8ffbd
[None][infra] Check in most recent lock file from nightly pipeline
tensorrt-cicd Mar 7, 2026
86e0282
[None][chore] Autodeploy: add models for sprint (#11999)
nvchenghaoz Mar 7, 2026
656091b
[None][infra] Update CI allow list 20260305 (#11965)
yuanjingx87 Mar 7, 2026
1dcb6ec
[https://nvbugs/5809169][unwaive] Unwaive TestGPTOSS test (#11416)
peaceh-nv Feb 26, 2026
039b06f
[https://nvbugs/5859881][fix] Unwaive test (#11716)
hyukn Feb 26, 2026
75e038e
[None][feat] add sanity tests for release1.2 version (#11738)
yingguo-trt Feb 26, 2026
05bb5c1
[https://nvbugs/5889841][fix] Add custom option class to allow subcom…
FrankD412 Feb 26, 2026
b548320
[https://nvbugs/5875522][docs] Add known issue for disaggregated serv…
Tabrizian Feb 27, 2026
2f725ea
[https://nvbugs/5775256] [fix] Reopen fp8_dsl_fused_moe ut. (#11779)
limin2021 Mar 2, 2026
07fbb5d
[https://nvbugs/5762822][chore] Unwaive longbenchV2 test (#11647)
heyuhhh Mar 2, 2026
2d9ed59
[https://nvbugs/5936273][fix] Fix bugs of Mistral Large3 (#11885)
byshiue Mar 4, 2026
9c6ce75
[https://nvbugs/5949098][doc] Fixing docs links (#11912)
pcastonguay Mar 4, 2026
f4593cf
[None][doc] Replace the TensorRT-LLM with TensorRT LLM (#11914)
nv-guomingz Mar 5, 2026
0579ac6
[None][chore] Fix/disagg perf failure detection (#11904)
yingguo-trt Mar 5, 2026
a0a9e33
Update tests/integration/test_lists/waives.txt
chzblych Mar 7, 2026
6b04973
[None][fix] Fix Collect Perf Sanity Result's import requests Error (#…
chenfeiz0326 Mar 7, 2026
ae00b20
Test CI/CD setup
ZhanruiSunCh Mar 9, 2026
41 changes: 41 additions & 0 deletions .claude/agents/ad-debug-agent.md
@@ -0,0 +1,41 @@
---
name: ad-debug-agent
description: Debug the AutoDeploy model onboarding process
tools: Read, Grep, Glob, Bash, Edit, Write
model: sonnet
---

Usually, we run a model with AutoDeploy using the command below. If you are not given the model-id and config, ask the user first.

Also ask whether the user wants to rerun the flow to get a fresh log and IR dump.
Keep the log and IR dump directory under $PWD.

Workflow:
1. Run the AD flow with the user-given model-id and YAML config using the command below.
How to run:
```bash
AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR> python examples/auto_deploy/build_and_run_ad.py \
--model <MODEL_HF_ID> \
--args.yaml-extra examples/auto_deploy/model_registry/configs/<CONFIG_YAML_FILE> \
2>&1 | tee <LOG_FILE>
```
Here `AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR>` is the directory where the graphs will be dumped (auto-created by the script), `<MODEL_HF_ID>` is the HF model-id of the model we want to run (it can also be a local path to a model checkpoint), and `<CONFIG_YAML_FILE>` is the configuration file for the model.

If there's any error, we check the log file `<LOG_FILE>` and IR files in the `AD_DUMP_GRAPHS_DIR` directory to see what went wrong.

2. If you hit an error, first inform the user what you observed. Then analyze the issue and think through possible root causes. Don't jump to fixing anything yet.

3. Based on the discussion with the user, implement the fix, run again, and iterate.


Remember to use your own tools - Read, Grep, Glob, Bash, Edit, Write

Some common strategies to iterate faster and debug issues:
* Use fewer hidden layers - can be done by updating the YAML file with `model_kwargs`. Usually this is simple, but it needs to match what the model config expects; some models have alternating layer patterns (e.g., one full-attention layer, one linear-attention layer), so update `model_kwargs` accordingly.
* Enable/disable sharding - can be done by setting `world_size = 1` or `world_size > 1` (say 2) in the YAML file.

Common pitfalls:
* Weights in the HF safetensors do not match what the AD custom modeling code expects, so weight loading fails. Usually there are load hooks registered in the AD modeling code, but you should verify that. The HF safetensors index JSON is a helpful reference.
* The custom model has a different module hierarchy than what the checkpoint safetensors expect. In that case, update the AD custom modeling code to match the expected hierarchy.

Remember to use your own tools - Read, Grep, Glob, Bash, Edit, Write
123 changes: 123 additions & 0 deletions .claude/agents/ad-onboard-reviewer.md
@@ -0,0 +1,123 @@
---
name: ad-onboard-reviewer
description: Independent reviewer for AutoDeploy model onboarding. Validates created model and test files against all onboarding requirements. Use after completing model onboarding work.
tools: Read, Grep, Glob
model: sonnet
---

You are an independent code reviewer for AutoDeploy model onboarding.

**Your role is adversarial.** You exist because the implementing agent misses details.
Do NOT trust any claims from the caller. You will be given a model name and file paths.
Read every file yourself, line by line, and verify each checklist item with concrete evidence.

## Inputs You Will Receive

- `model_name`: The model being onboarded
- `model_file`: Path to the created `modeling_*.py`
- `test_file`: Path to the created `test_*_modeling.py`
- `init_file`: Always `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py`

## Validation Checklist

Read the actual source code for each check. Cite `file:line_number` for every PASS and FAIL.


### B. Self-Containment

| # | Check | How to verify |
|---|-------|---------------|
| B1 | No imports from other AD custom models (`from .modeling_*`) | Grep for `from .modeling_` — only `from .` imports of non-model utilities are OK (e.g., `mla_rope_utils`) |
| B2 | Config class is defined in the file OR imported from transformers (not from another AD model) | Check where the config class comes from |
| B3 | If config not in installed transformers, file has `AutoConfig.register()` | Grep for `AutoConfig.register` |

### BA. Checkpoint Compatibility

| # | Check |
|---|-------|
| BA1 | The custom modeling code's nn.Module hierarchy matches the model hierarchy expected in the checkpoint safetensors JSON. |
| BA2 | If our modeling code has expert-list style MoE experts and the checkpoint has fused MoE experts, a load hook loads the safetensors correctly into our expert-list weights. |
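
Where BA2 applies, the hook to look for is a load-state-dict pre-hook on the custom module that fans a fused expert tensor out into per-expert keys. A minimal sketch of that pattern, with hypothetical checkpoint key names:

```python
def _split_fused_experts_hook(state_dict, prefix, *args):
    """Sketch: split a fused [num_experts, out, in] checkpoint tensor into the
    per-expert keys that an nn.ModuleList expert layout expects."""
    fused_key = prefix + "experts.gate_up_proj"  # hypothetical fused key name
    if fused_key in state_dict:
        fused = state_dict.pop(fused_key)
        for i, w in enumerate(fused.unbind(dim=0)):
            state_dict[f"{prefix}experts.{i}.gate_up_proj.weight"] = w


# Registered inside the custom module's __init__:
# self._register_load_state_dict_pre_hook(_split_fused_experts_hook)
```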

### C. Ops & Compatibility

| # | Check | How to verify |
|---|-------|---------------|
| C1 | Only uses `torch_*` reference ops from `auto_deploy.custom_ops` or plain PyTorch | Grep for `torch.ops.` calls — only `torch.ops.auto_deploy.torch_*` allowed |
| C2 | No `triton_*`, `flashinfer_*`, `trtllm.*` ops (no exceptions for routers or router GEMMs; all must be CPU-compatible torch ops) | Grep for these prefixes |
| C3 | No KV cache logic (no `past_key_values`, no cache classes) | Grep for `past_key_value`, `cache`, `DynamicCache` |
| C4 | No training paths (no `self.training` checks, no `dropout`) | Grep for `self.training`, `dropout`, `Dropout` |
| C5 | No flash attention variants (`flash_attn`, `sdpa`, `_flash_attention`) | Grep for these strings |

### D. RoPE & MoE Conventions

| # | Check | How to verify |
|---|-------|---------------|
| D1 | RoPE buffers use `_ad_` prefix (`_ad_cos_cached`, `_ad_sin_cached`) | Grep for `register_buffer` calls with `_ad_` |
| D2 | RoPE `forward()` returns full table (not sliced by seq_len) | Read the RoPE forward method — should return full cached tensors |
| D3 | Position slicing happens downstream (in attention, by `position_ids`) | Check attention forward for `cos[position_ids]` or similar pattern |
| D4 | MoE experts use `nn.ModuleList` (not stacked tensor parameters) | Grep for `nn.ModuleList` in MoE class |
| D5 | Each expert has individual `gate_proj`, `up_proj`, `down_proj` weights | Check expert structure |

Note: D1-D3 only apply if the model uses RoPE. D4-D5 only apply if the model has MoE.
Mark as N/A with justification if the model doesn't have the relevant component.
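
For reference, a RoPE module that satisfies D1-D3 looks roughly like this (a sketch; dtype and scaling details are simplified):

```python
import torch
import torch.nn as nn


class ADRotaryEmbeddingSketch(nn.Module):
    def __init__(self, head_dim: int, max_positions: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        freqs = torch.outer(torch.arange(max_positions).float(), inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("_ad_cos_cached", emb.cos(), persistent=False)  # D1
        self.register_buffer("_ad_sin_cached", emb.sin(), persistent=False)  # D1

    def forward(self):
        # D2: return the full tables; no seq_len slicing here.
        return self._ad_cos_cached, self._ad_sin_cached


# D3: the attention module slices downstream, e.g.
# cos, sin = rope(); cos, sin = cos[position_ids], sin[position_ids]
```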

### F. Test File — Structure

| # | Check | How to verify |
|---|-------|---------------|
| F1 | Uses small config (hidden_size ~64, num_hidden_layers 2-3, vocab_size ~1000) | Read the test config creation |
| F2 | No smoke tests — every test has meaningful assertions (`assert_close`, `assert_rmse_close`, shape checks, finiteness checks) | Check each test for substantive assertions |
| F3 | Do not rely on only `isnan`/`isinf` checks; include functional equivalence assertions | Check tests use `assert_close` or `assert_rmse_close` against reference outputs |
| F4 | Test imports must be self-contained (transformers imports or copied reference classes only); no hardcoded local/temp path imports | Inspect imports and helper loaders |

### G. Test File — Hierarchical Levels

| # | Check | How to verify |
|---|-------|---------------|
| G1 | **Block equivalence**: Tests individual blocks (MLP, Attention, MoE, Norm) comparing AD output vs HF output. Blocks with identical math (plain MLP, Norm) should use `torch.testing.assert_close` with tight tolerance. Blocks with fused custom ops (Attention with MLA/RoPE, MoE with fused routing) must use `assert_rmse_close` from `_model_test_utils` with appropriate `rmse_ratio_tol` (attention: 0.10, MoE: 0.02). | Look for per-block test functions loading same weights into both implementations; verify correct comparison function and tolerance |
| G2 | **Layer equivalence**: Tests a full decoder layer (if model has heterogeneous layers like dense vs MoE, tests each type). Must use `assert_rmse_close` with `rmse_ratio_tol=0.05`. | Look for layer-level test with `assert_rmse_close` |
| G3 | **Full model equivalence**: End-to-end logits comparison AD vs HF with same weights with minimum number layers. Must use `assert_rmse_close` with `rmse_ratio_tol=0.05`. Also, need to be able to run on CPU. | Look for full model test with logits `assert_rmse_close` |
| G4 | **Export test**: Uses `torch_export_to_gm` with `Dim.DYNAMIC` for both batch and sequence dimensions | Grep for `torch_export_to_gm` and `Dim.DYNAMIC` |
| G5 | Export test runs a second forward with a different shape to verify dynamic dims work | Look for a second input with different B, S values |

### H. Test File — Weight Conversion

| # | Check | How to verify |
|---|-------|---------------|
| H1 | If MoE model: has state_dict converter from HF stacked format to per-expert format | Look for conversion function |
| H2 | Equivalence tests load identical weights into both HF and AD models before comparing | Check that `load_state_dict` is called with converted weights |

## Output Format

```text
REVIEW RESULT: PASS | FAIL

=== A. Structure & Hierarchy ===
A1 PASS modeling_foo.py:45 — FooPreTrainedModel(PreTrainedModel)
A2 PASS modeling_foo.py:30 — @dataclass FooCausalLMOutput(ModelOutput)
A3 FAIL modeling_foo.py:120 — forward(self, input_ids, attention_mask, ...) — missing position_ids
A4 PASS modeling_foo.py:135 — returns FooCausalLMOutput(logits=logits)

=== B. Self-Containment ===
B1 PASS No `from .modeling_` imports found
B2 PASS modeling_foo.py:15 — FooConfig defined in file
B3 PASS modeling_foo.py:80 — AutoConfig.register("foo", FooConfig, exist_ok=True)

=== C. Ops & Compatibility ===
...

=== Summary ===
PASSED: 22/26
FAILED: 4/26

Failed items requiring fixes:
1. A3 — Forward signature missing position_ids parameter (modeling_foo.py:120)
2. G2 — No layer equivalence test found
3. G4 — Export test missing Dim.DYNAMIC
4. H1 — No MoE weight converter despite model having MoE layers
```

## Rules

1. Be strict. If something is ambiguous or borderline, mark it FAIL and explain why.
2. A PASS result means EVERY SINGLE item passed. Even one FAIL means overall FAIL.
3. Always cite file:line_number. No exceptions.
4. Read the actual files. Never infer or assume based on the caller's description.
5. If a check is not applicable (e.g., D4 for a non-MoE model), mark it N/A with justification.
104 changes: 104 additions & 0 deletions .claude/skills/ad-model-onboard/SKILL.md
@@ -0,0 +1,104 @@
---
name: ad-model-onboard
description: Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops and validates it with hierarchical equivalence tests.
---

# AutoDeploy Model Onboarding

**Input:** HuggingFace model ID. **Output:** prefill-only custom model file + hierarchical tests + summary report.

## Phase 0 — Gather All Resources Upfront
Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.

**Step 1 — Check local transformers install first:**
```bash
python -c "import transformers; print(transformers.__file__)"
```
Look for `models/{model_type}/modeling_*.py` under that path. If found, use it directly — no network needed.

**Step 2 — If not found, download the HF repo (code only, skip weights):**
```bash
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
```
This downloads config, code, and tokenizer files into the standard HF cache (`$HF_HOME` or `~/.cache/huggingface/`) while skipping large weight files. Files cached here are automatically found by `transformers.AutoConfig.from_pretrained` and similar APIs — no extra path wiring needed. Once downloaded you can work fully offline — read `config.json` and `modeling_*.py` from the cache snapshot directory printed by the command.

## Phase 1 — Analyze HF Model
Study the locally-available `config.json` and `modeling_*.py` (NOT from `tensorrt_llm/_torch/models/`). Identify attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break `torch.export` (e.g. `torch.nonzero`, data-conditioned `if`).

## Phase 2 — Write Prefill-Only Model
Create `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py`. Use `modeling_glm4_moe_lite.py` as a **structural template only** (class layout, dataclass outputs, forward signature). Strip: KV cache, training paths, dropout, flash attention variants. Keep: `PreTrainedModel` hierarchy, `ModelOutput` dataclass, minimal forward `(input_ids, position_ids, inputs_embeds=None, **kwargs)`.

**Critical**
Make sure the custom modeling code's module hierarchy matches the model hierarchy expected in the checkpoint safetensors JSON.

**Critical rule: Do NOT import or reuse existing AD custom model code** (e.g. `from .modeling_deepseek import ...`). Every `modeling_{name}.py` must be self-contained. Use the HF source (`$CLONE_DIR/modeling_*.py`) as the source of truth for the model's logic and translate it fresh — even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.

## Phase 3 — Use Reference Custom Ops Only
Replace HF ops with `torch_*` prefixed AD reference ops. **Never** use `triton_*`/`flashinfer_*`/`trtllm_*` — backend selection happens later in AD transforms. Browse `tensorrt_llm/_torch/auto_deploy/custom_ops/` for all available reference ops and their exact signatures. For vanilla components (RMSNorm, MLP), plain PyTorch is also fine — AD fusion passes replace them.

## Phase 4 — Register
1. Bottom of model file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)`.
2. Add import + `__all__` entry in `models/custom/__init__.py`.
3. If config not in installed transformers, bundle config class and `AutoConfig.register(model_type, ConfigCls, exist_ok=True)`.
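
Putting steps 1-3 together, the bottom of a model file ends up looking like this sketch (`FooConfig`/`FooForCausalLM` are hypothetical, and the factory import path is abbreviated):

```python
from transformers import AutoConfig

# Step 3: only needed when FooConfig is not in the installed transformers.
AutoConfig.register("foo", FooConfig, exist_ok=True)

# Step 1: map the HF config class name to the custom causal-LM class.
# (AutoModelForCausalLMFactory import path omitted in this sketch.)
AutoModelForCausalLMFactory.register_custom_model_cls("FooConfig", FooForCausalLM)
```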

## Phase 5 — Model Input Contract
The custom model's forward signature must follow these rules:

1. **Always `input_ids`** — The top-level model always receives `input_ids`. A submodule graph may internally receive `inputs_embeds` (e.g., after the embedding layer), but the exported entry point takes token IDs.
2. **Always `position_ids`** — Vanilla sequential `position_ids` are always provided. If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of these vanilla `position_ids`.
3. **Multi-modal inputs** — If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside `input_ids`.
4. **No attention mask, no cache inputs, no HF-runtime features** — Do not accept `attention_mask`, `past_key_values`, `use_cache`, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
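
Taken together, the exported entry point reduces to the minimal signature from Phase 2; a sketch (output dataclass name hypothetical):

```python
from typing import Optional

import torch
import torch.nn as nn


class FooForCausalLM(nn.Module):  # sketch; the real base is PreTrainedModel
    def forward(
        self,
        input_ids: torch.LongTensor,        # rule 1: always token IDs
        position_ids: torch.LongTensor,     # rule 2: vanilla sequential positions
        inputs_embeds: Optional[torch.Tensor] = None,
        # rule 3: multi-modal inputs (e.g. pixel_values) would be added here
        **kwargs,  # rule 4: no attention_mask, past_key_values, or use_cache
    ) -> "FooCausalLMOutput":
        ...
```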

## Phase 6 — Hierarchical Tests
Create `tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py`. Use `test_glm4_moe_lite_modeling.py` as template. **No smoke tests.** Small config (hidden=64, layers=2-3, vocab=1000). Use `pytest.skip` if HF class unavailable.

**HF Reference Strategy:** Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs.
- **If HF modules exist in the installed `transformers`**: import them directly (e.g., `from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM`). Wrap imports in `_get_hf_*_class()` try/except helpers that return `None` on `ImportError`, and use `pytest.skip` when `None`.
- **If HF modules are NOT in the installed `transformers`**: copy the minimal module definitions from the HF `modeling_*.py` source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific `transformers` version.
- **Weight conversion helpers**: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using `load_state_dict` pre-hooks already registered on the custom model.
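
For example, a test-only stacked-to-per-expert converter might look like this sketch (the stacked key pattern is hypothetical and depends on the checkpoint):

```python
def convert_hf_moe_to_per_expert(hf_sd: dict) -> dict:
    """Fan a stacked [num_experts, out, in] HF tensor out into the per-expert
    nn.ModuleList keys the AD model expects."""
    out = {}
    for key, value in hf_sd.items():
        if key.endswith("mlp.experts.down_proj"):  # hypothetical stacked key
            base = key.rsplit(".", 1)[0]
            for i, w in enumerate(value.unbind(0)):
                out[f"{base}.{i}.down_proj.weight"] = w
        else:
            out[key] = value
    return out
```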

**Numerical comparison:** For equivalence tests comparing custom ops against HF reference, use the shared `assert_rmse_close` utility from `_model_test_utils`:
```python
from _model_test_utils import assert_rmse_close
```
This computes `rmse(actual - expected) / rmse(expected)` — more robust than per-element `torch.testing.assert_close` since a few outlier elements won't fail the test. Use `torch.testing.assert_close` only for blocks with identical math (e.g., plain MLP with no custom ops).
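
In other words, the check is roughly equivalent to the following sketch (not the actual `_model_test_utils` implementation):

```python
import torch


def assert_rmse_close_sketch(actual, expected, rmse_ratio_tol):
    def rmse(t: torch.Tensor) -> torch.Tensor:
        return t.float().pow(2).mean().sqrt()

    ratio = rmse(actual - expected) / rmse(expected)
    assert ratio <= rmse_ratio_tol, f"rmse ratio {ratio:.4f} > {rmse_ratio_tol}"
```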

Recommended `rmse_ratio_tol` values for bfloat16:
- **Identical math** (MLP, Norm): use `torch.testing.assert_close` with tight rtol/atol (1e-3)
- **MoE block** (fused routing): `0.02`
- **Decoder layer / MoE layer / full model**: `0.05`
- **Attention**: `0.10`

**Bottom-up levels (each must pass before next):**
1. **Block equivalence** — Test MLP, Attention, MoE, Norm individually: same weights + same input → `assert_rmse_close` (or `torch.testing.assert_close` for identical-math blocks).
2. **Layer equivalence** — Full decoder layer. If model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
3. **Full model equivalence** — End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
4. **Export test** — `torch_export_to_gm` with `Dim.DYNAMIC` for batch+seq, verify finite output, test a second shape.
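
A sketch of what level 4 can look like, given a `model` built from the small test config (the `torch_export_to_gm` import path and keyword names here are assumptions to verify against the AD utilities):

```python
import torch
from torch.export import Dim

bs, seq = 2, 8
input_ids = torch.randint(0, 1000, (bs, seq))
position_ids = torch.arange(seq).expand(bs, -1)
dynamic_shapes = (
    {0: Dim.DYNAMIC, 1: Dim.DYNAMIC},  # input_ids: dynamic batch and seq
    {0: Dim.DYNAMIC, 1: Dim.DYNAMIC},  # position_ids
)
gm = torch_export_to_gm(model, args=(input_ids, position_ids), dynamic_shapes=dynamic_shapes)

logits = gm(input_ids, position_ids).logits  # assumes the output dataclass survives export
assert torch.isfinite(logits).all()

# Re-run with a different (B, S) to verify the dynamic dims actually work.
ids2 = torch.randint(0, 1000, (3, 5))
pos2 = torch.arange(5).expand(3, -1)
assert torch.isfinite(gm(ids2, pos2).logits).all()
```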

## Phase 7 — Independent Review (MANDATORY)

Invoke the `ad-onboard-reviewer` subagent with ONLY the following information:
- Model name
- Path to the model file created
- Path to the test file created

**Do NOT include your own assessment of correctness. Do NOT summarize what you did.** Let the reviewer read the files and judge independently.

If the reviewer returns **FAIL** on any item:
1. Read the reviewer's specific failure reasons and file:line references
2. Fix each failed item
3. Invoke the reviewer again with the same minimal inputs
4. Repeat until you get a full **PASS**

Do NOT proceed to Phase 8 until the reviewer returns PASS.

## Phase 8 — Summary Report
Print (not file) after completion: (1) model overview + unique features, (2) tricky parts needing human review, (3) files created/modified, (4) test results table (name | validates | PASS/FAIL), (5) known limitations, (6) reviewer result (PASS + how many review iterations it took).

## Key Gotchas
- **Self-contained files only**: Never import from other AD custom models. Each `modeling_{name}.py` is a standalone translation from HF source.
- RoPE buffers: `_ad_` prefix, return full table (not sliced), slice by `position_ids` downstream.
- MoE weights: use `nn.ModuleList` per-expert for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
- `noaux_tc` routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale); a sketch follows this list. AD transforms can replace it with fused `trtllm` kernels at deployment time.
- Vision towers are typically **not** exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
- Model code and tests must run on CPU. Use only torch reference ops in AutoDeploy (e.g., `torch_rmsnorm`, `torch_mla`, `torch_moe`) and avoid CUDA-only kernels in the modeling path.
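
For the `noaux_tc` bullet above, the vanilla-PyTorch routing shape is roughly the following sketch (DeepSeek-V3 style; the top-2-sum group score and argument names are model-specific assumptions):

```python
import torch


def noaux_tc_route_sketch(logits, bias, n_group, topk_group, top_k, scale):
    """sigmoid + bias + group top-k + normalize + scale."""
    scores = logits.sigmoid()                             # [tokens, n_experts]
    biased = scores + bias                                # bias affects selection only
    groups = biased.view(-1, n_group, biased.shape[-1] // n_group)
    group_scores = groups.topk(2, dim=-1).values.sum(-1)  # [tokens, n_group]
    top_groups = group_scores.topk(topk_group, dim=-1).indices
    mask = torch.zeros_like(group_scores).scatter(1, top_groups, 1.0)
    masked = (groups * mask.unsqueeze(-1)).flatten(1)     # non-top groups zeroed
    expert_idx = masked.topk(top_k, dim=-1).indices
    weights = scores.gather(1, expert_idx)                # weights use unbiased scores
    return expert_idx, weights / weights.sum(-1, keepdim=True) * scale
```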
18 changes: 18 additions & 0 deletions .github/workflows/auto-assign-author.yml
@@ -0,0 +1,18 @@
name: Auto Assign PR to Author

on:
pull_request_target:
types: [opened]

jobs:
assign-author:
runs-on: ubuntu-latest
permissions:
pull-requests: write # Required to modify the PR
steps:
- name: Assign PR to Author
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_URL: ${{ github.event.pull_request.html_url }}
AUTHOR: ${{ github.actor }}
run: gh pr edit $PR_URL --add-assignee $AUTHOR || echo "Could not assign $AUTHOR (not a collaborator), skipping."