
[1/N][feat] Make ATOM work with vLLM and SGLang#126

Open
zejunchen-zejun wants to merge 1 commit into ROCm:main from zejunchen-zejun:zejun/plugin_for_atom_1223

Conversation


@zejunchen-zejun zejunchen-zejun commented Jan 12, 2026

This PR makes ATOM work with vLLM and SGLang, preserving the out-of-the-box (OOB) experience of these popular frameworks while adding the optimizations from ATOM.

For vLLM, this PR uses vLLM's official out-of-tree mechanism and has ATOM provide the platform, models, and attention to vLLM. Here are the design diagram and performance snapshot. Compared to vanilla vLLM, vLLM+ATOM shows a 6-20% performance uplift.
(images: design diagram and performance snapshot)
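vLLM's out-of-tree mechanism discovers plugins through Python entry points. The sketch below shows the general shape of such a registration hook; the entry-point group name matches vLLM's documented plugin system, but the module path, the `ATOM_ENABLED` environment variable, and the function body are illustrative assumptions, not code from this PR.

```python
# Hedged sketch of a vLLM out-of-tree platform registration hook.
# At startup vLLM calls every entry point in the "vllm.platform_plugins"
# group; each hook returns the fully qualified platform class path,
# or None to leave the plugin disabled.
# NOTE: "ATOM_ENABLED" and "atom.plugin.platform.ATOMPlatform" are
# illustrative names, not taken from this PR.
import os

def register_platform():
    if os.environ.get("ATOM_ENABLED", "1") == "0":
        return None  # disabled: vLLM keeps its built-in ROCm platform
    return "atom.plugin.platform.ATOMPlatform"
```

Packaging-wise, a hook like this would be exposed under the `vllm.platform_plugins` entry-point group in the package metadata so that vLLM can find it without any upstream code change.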

Here is the RFC:

For SGLang, this PR uses the official model impl backend mechanism. Here is the design diagram.
(image: design diagram)

For attention, this PR constructs BaseAttention and makes paged attention and radix attention inherit from this base class. The implementation details of ATOM's server mode and plugin mode have been moved into PagedAttentionImpl.
(image: attention design diagram)
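The hierarchy described above can be sketched in plain Python. The class names BaseAttention, PagedAttention, RadixAttention, and PagedAttentionImpl follow this PR, but the method signatures and bodies are invented for illustration.

```python
# Illustrative sketch of the attention hierarchy described above; class
# names follow the PR, signatures and bodies are invented.
from abc import ABC, abstractmethod

class PagedAttentionImpl:
    """Mode-specific details (ATOM server mode vs. plugin mode) live here."""
    def __init__(self, mode: str):
        self.mode = mode  # "server" or "plugin"

    def run(self, query: str) -> str:
        return f"paged-attn[{self.mode}]({query})"

class BaseAttention(ABC):
    @abstractmethod
    def forward(self, query: str) -> str:
        ...

class PagedAttention(BaseAttention):
    def __init__(self, mode: str = "server"):
        self.impl = PagedAttentionImpl(mode)  # delegates to the impl class

    def forward(self, query: str) -> str:
        return self.impl.run(query)

class RadixAttention(BaseAttention):
    """SGLang-facing variant; full integration lands in a follow-up PR."""
    def forward(self, query: str) -> str:
        raise NotImplementedError("radix attention falls back to SGLang mainline")
```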

@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from e6e0128 to 45ec455 on January 13, 2026 14:58
@zejunchen-zejun changed the title from "[WP][feat] Make ATOM can be model impl backend for vLLM and SGLang" to "[WIP][feat] Make ATOM can be model impl backend for vLLM and SGLang" on Jan 13, 2026
@zejunchen-zejun force-pushed the branch 3 times, most recently from cabd144 to c2657a9 on January 14, 2026 07:11
@zejunchen-zejun changed the title to "[WIP][feat] Make ATOM work as model impl backend for vLLM and SGLang" on Jan 15, 2026
@zejunchen-zejun force-pushed the branch 4 times, most recently from ae1f5e9 to 02e39be on January 16, 2026 12:28
@zejunchen-zejun force-pushed the branch 2 times, most recently from d0f4d79 to 2b10d8f on January 26, 2026 07:59
@zejunchen-zejun force-pushed the branch 3 times, most recently from bdf7a06 to 09cc7ed on January 29, 2026 14:09
@zejunchen-zejun changed the title to "[feat] Make ATOM work as model impl backend for vLLM and SGLang" on Feb 2, 2026
@zejunchen-zejun marked this pull request as ready for review on February 2, 2026 04:01
Copilot AI review requested due to automatic review settings February 2, 2026 04:01
Copilot AI left a comment

Pull request overview

This pull request enables ATOM to work as a model implementation backend for vLLM and SGLang, allowing users to specify --model-impl atom when launching these frameworks. The implementation follows an official registry mechanism and combines framework-level features from vLLM/SGLang with model-level fusion kernels from ATOM/AITER.

Changes:

  • Adds plugin infrastructure to register ATOM models and attention backends with vLLM and SGLang
  • Implements attention metadata builders and handlers for plugin mode
  • Refactors model implementations (Qwen3, Qwen3MoE, etc.) to support both server and plugin modes
  • Adds documentation recipe with setup instructions and known limitations

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 43 comments.

Summary per file:

  • recipes/Model-Impl-Backend.md — Documentation and setup guide for using ATOM with vLLM and SGLang
  • atom/plugin/*.py — Core plugin infrastructure including registration, config generation, and attention handling
  • atom/models/*.py — Model implementations updated to support plugin mode with consistent APIs
  • atom/model_ops/*.py — Attention operations refactored with base classes and plugin-specific implementations
  • atom/model_loader/loader.py — Weight loading updated to support plugin mode
  • atom/config.py — Configuration extended with plugin-specific settings
  • atom/utils/*.py — Utilities updated for plugin mode support


@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from 1440b34 to dd0e196 on February 2, 2026 04:34
Copilot AI review requested due to automatic review settings February 2, 2026 04:37
@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from dd0e196 to f6e3e47 on February 2, 2026 04:37
Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 33 comments.



Copilot AI review requested due to automatic review settings February 2, 2026 08:38
Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 20 comments.



Copilot AI review requested due to automatic review settings March 1, 2026 14:44
@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from 2ea48e3 to 155f991 on March 1, 2026 14:44
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.



@ChuanLi1101 (Collaborator) left a comment

Thanks for the hard work. It took me a while to review the PR. I’ve left some comments on a few more serious issues that may cause bugs, for your reference.

@zejunchen-zejun
Contributor Author

Thanks for the hard work. It took me a while to review the PR. I’ve left some comments on a few more serious issues that may cause bugs, for your reference.

Thank you for the significant suggestions. I will resolve them soon!

Copilot AI review requested due to automatic review settings March 2, 2026 12:25
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.




sunway513 commented Mar 3, 2026

PR #126 Review: [1/N][feat] Make ATOM work with vLLM and SGLang

+3,243 / -179 | 32 files | 21 commits


Overall Assessment

This is an ambitious PR that enables ATOM to function as an out-of-tree plugin for vLLM and SGLang. The architectural approach is sound — leveraging vLLM's official OOT mechanism makes the design transparent to vLLM with zero upstream code changes. The reported 6-20% performance uplift is compelling.

I performed an initial review and then double-checked each finding against the latest commit on this branch. Several issues flagged earlier (by myself and others on older commits) have already been addressed. Below are the remaining items.

Note: An earlier version of this comment listed 7 critical issues. After verification against the latest code, most were either already fixed or did not hold up on closer inspection. This updated version reflects the corrected assessment.


Remaining Issues

1. extend_for_sliding_window type annotation mismatch — atom/plugin/attention_mha.py

The function signature declares k_scale: float and v_scale: float, but in practice self.k_scale / self.v_scale (which are torch.Tensor or None) are passed in, and the downstream cp_mha_gather_cache expects torch.Tensor. The code works at runtime, but the type annotation is misleading and will cause issues with type checkers and IDE tooling.

2. extend_workspace buffer bypasses framework memory management — atom/plugin/attention.py

The workspace buffer [2, 32*1024, num_kv_heads, head_dim] is allocated outside of vLLM's cache manager. For a model like Qwen3-235B (8 kv heads, 128 head_dim, bf16), this is ~128 MB of untracked GPU memory. At high gpu_mem_utilization, this creates OOM risk. Also flagged by @wuhuikx — recommend at minimum adding a prominent comment/warning, and ideally integrating with vLLM's memory accounting.
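The ~128 MB figure checks out from the stated shape and dtype:

```python
# Workspace shape [2, 32*1024, num_kv_heads, head_dim] with a
# Qwen3-235B-style KV config (8 KV heads, head_dim 128) stored in
# bf16 (2 bytes per element).
num_elems = 2 * 32 * 1024 * 8 * 128
size_mib = num_elems * 2 / (1024 ** 2)  # bf16: 2 bytes/element
print(size_mib)  # → 128.0
```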


Design Suggestions (Non-blocking)

3. Heavy use of decorators increases debugging difficulty

Four decorators dynamically modify class inheritance chains and methods at import time (PagedAttentionImplDecoratorForPluginMode, AiterAttentionMetadataBuilderDecoratorForPluginMode, AiterBackendDecoratorForPluginMode, FusedMoEDecoratorForPluginMode). This makes stack traces harder to follow and debugging more difficult. Worth considering more explicit inheritance patterns in future iterations.
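The import-time class rewriting described above can be illustrated with a toy class decorator. The decorator name echoes the PR's naming convention, but the body is invented; it also shows why a server-mode constructor path (e.g. a model_runner argument) is never reached in plugin mode.

```python
# Toy version of an import-time "decorator for plugin mode": it swaps the
# class's __init__ wholesale, so server-mode arguments are never consumed.
# Names and bodies are illustrative, not taken from the PR.

def DecoratorForPluginMode(cls):
    def plugin_init(self, *args, **kwargs):
        # Replaces the server-mode constructor entirely; the original
        # model_runner-based path below becomes dead code in plugin mode.
        self.mode = "plugin"
    cls.__init__ = plugin_init
    return cls

@DecoratorForPluginMode
class AiterAttention:
    def __init__(self, model_runner):
        self.mode = "server"
        self.model_runner = model_runner
```

The cost, as noted above, is debuggability: a stack trace points at plugin_init rather than at the constructor visible in the class's source.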

4. is_plugin_mode() / is_vllm() / is_sglang() scattered across core code

Core files (attention_mha.py, base_attention.py, embed_head.py, loader.py, moe.py) now contain plugin mode conditionals throughout. Consider using strategy pattern or abstract interfaces to better isolate plugin-specific behavior in future PRs.
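The suggested refactor could look like the following strategy sketch (all names invented), where the framework choice is injected once instead of queried via is_vllm()/is_sglang() at every call site:

```python
# Sketch of the suggested strategy pattern: plugin-specific behavior is
# isolated behind an interface, so core code never branches on the
# framework. All names here are illustrative, not from the PR.
from abc import ABC, abstractmethod

class FrameworkStrategy(ABC):
    @abstractmethod
    def weights_source(self) -> str: ...

class VllmStrategy(FrameworkStrategy):
    def weights_source(self) -> str:
        return "vllm-managed checkpoint"

class SglangStrategy(FrameworkStrategy):
    def weights_source(self) -> str:
        return "sglang-managed checkpoint"

class Loader:
    # Instead of: if is_vllm(): ... elif is_sglang(): ...
    def __init__(self, strategy: FrameworkStrategy):
        self.strategy = strategy

    def load(self) -> str:
        return f"loading from {self.strategy.weights_source()}"
```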

5. SGLang support is incomplete

Multiple locations raise NotImplementedError for SGLang paths. Agree with @wuhuikx's suggestion to focus this PR on vLLM only and split SGLang into a follow-up PR.

6. Commit history cleanup

21 commits with several messages like "add" or "make lint happy". Recommend squashing before merge.


Previously Flagged Issues — Verified as Resolved or Not Applicable

For transparency, the following were flagged earlier but do not hold up against the latest code:

  • elif 0: dead code → Fixed in latest commit (now elif use_triton_attn and self.rotary_emb is not None)
  • positions None crash → positions is passed from model forward, not from Context.positions; dummy runs exit early
  • sliding_window None TypeError → Proper None guard exists (if sliding_window is None or sliding_window == -1)
  • paged_attention_triton_plugin_mode arg name mismatch → Fixed in latest commit, call sites now use k_cache= / v_cache=
  • ATOMPlatform fallback missing method → When disabled, ATOMPlatform = None and register_platform() returns None; no class instantiation occurs
  • model_runner=None crash in aiter_attention.py → In plugin mode, the decorator replaces __init__ entirely, so the original code path is not reached

CI/CD Coverage Requirement

See issue #255 for the full CI enhancement plan. Summary below:

CPU Unit Tests — Validated and Ready to Cherry-Pick

23 tests across 4 files have been developed, validated, and pass on CPU without GPU/vllm/aiter/triton dependencies (runs in ~1 second). The code is available on branch ci/plugin-mode-test-coverage in the ATOM repo for cherry-picking.

  • tests/test_plugin_prepare.py — 7 tests — is_vllm(), is_sglang(), is_plugin_mode(), _set_framework_backbone(), invalid framework, case insensitivity
  • tests/test_plugin_config.py — 6 tests — PluginConfig defaults, vllm/sglang mode fields, field completeness
  • tests/test_plugin_vllm_register.py — 6 tests — register_platform() enable/disable, register_model() skip, model registry overrides, set_attn_cls() → PagedAttention/RadixAttention
  • tests/test_plugin_vllm_platform.py — 4 tests — ATOMPlatform None when disabled, inherits RocmPlatform when enabled, returns ATOM backend, fallback when attention disabled

Also included on the branch:

  • .github/workflows/atom-plugin-test.yaml — New workflow (CPU unit tests on every PR + GPU smoke test)
  • .github/scripts/atom_plugin_test.sh — vLLM/SGLang plugin launch + inference + accuracy script

GPU Tests — Concept Only (For Follow-Up Development)

The following GPU test levels are proposed but not yet implemented — they require actual GPU hardware and full vLLM + AITER stack:

  • L1: Plugin wiring — decorator application, method injection, plugin discovery — 1× MI355, ~5 min, P0
  • L2: Kernel dispatch — verify the correct attention kernel is selected per config (fusion/triton/asm paths, sliding window, FP8 vs BF16) — 1× MI355, ~15 min, P1
  • L3: E2E correctness — plugin mode vs server mode output consistency, accuracy (gsm8k), multi-turn — 8× MI355, ~30 min, P1
  • L4: Perf regression — throughput comparison of plugin vs server mode (>= 95% of baseline) — 8× MI355, ~60 min, P2 (nightly)

Positive Aspects

  • Leverages vLLM's official OOT mechanism — zero upstream code changes needed
  • Sound attention abstraction hierarchy (BaseAttention → PagedAttention / RadixAttention)
  • 6-20% performance uplift backed by benchmark data
  • CI passing
  • Recipe documentation and RFC provided
  • Good responsiveness to review feedback — multiple issues from earlier reviews already addressed

Updated after double-checking all findings against the latest commit on this branch. CPU tests validated locally — 23/23 passing.


zejunchen-zejun commented Mar 3, 2026

  1. extend_for_sliding_window type annotation mismatch — atom/plugin/attention_mha.py
  2. extend_workspace buffer bypasses framework memory management — atom/plugin/attention.py
  3. Heavy use of decorators increases debugging difficulty
  4. is_plugin_mode() / is_vllm() / is_sglang() scattered across core code
  5. SGLang support is incomplete
  6. Commit history cleanup

Thank you for the comments. Here are the fixes and feedback:

  1. Fixed the type mismatch.
  2. Added a warning for frontend users about the untracked GPU memory usage. This buffer stores the fetched KV cache for the extend path (chunked prefill). Accounting for it in vLLM's memory budget is not easy, because attention is bypassed entirely during vLLM's profiler run; vLLM assumes attention consumes no extra memory.
  3. Yes, that makes sense. We considered the side effects of the decorators but still plan to use them. The purpose is to isolate plugin-mode behavior from server-mode behavior while reusing the same component class; AiterBackendDecoratorForPluginMode and FusedMoEDecoratorForPluginMode are quite simple, as they just add methods or rename the class to follow vLLM's calling convention. The backbone code for FusedMoE and AiterBackend is shared between server mode and plugin mode.
  4. I fully agree. By the design principle, it is a little unsafe to maintain that state across core components. We plan to use an abstract interface to check the current mode instead of is_vllm / is_sglang in future PRs.
  5. For SGLang, this PR only enables basic functionality and falls attention back to the SGLang mainline. We have another PR (based on this one) that integrates radix attention into ATOM for the SGLang+ATOM mode.
  6. Done!

Copilot AI review requested due to automatic review settings March 3, 2026 03:27


@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch 2 times, most recently from ebd438d to 7e9cece on March 3, 2026 03:57
wuhuikx previously approved these changes Mar 3, 2026
ganyi1996ppo previously approved these changes Mar 3, 2026
@wuhuikx requested review from ChuanLi1101 and valarLip March 3, 2026 07:06
framework

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings March 3, 2026 12:40
@zejunchen-zejun dismissed stale reviews from ganyi1996ppo and wuhuikx via 7364488 March 3, 2026 12:41
@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from 7e9cece to 7364488 on March 3, 2026 12:41
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.





8 participants