
[1/N][feat] Make ATOM work with vLLM and SGLang#126

Open
zejunchen-zejun wants to merge 1 commit into ROCm:main from zejunchen-zejun:zejun/plugin_for_atom_1223

Conversation


@zejunchen-zejun zejunchen-zejun commented Jan 12, 2026

This PR makes ATOM work with vLLM and SGLang, preserving the out-of-the-box (OOB) experience of these popular frameworks while adding the optimizations from ATOM.

For vLLM, this PR uses vLLM's official out-of-tree mechanism and has ATOM provide the platform, models, and attention to vLLM. Here are the design diagram and performance snapshot. Compared to vanilla vLLM, vLLM+ATOM shows a 6-20% performance uplift.
(images: design diagram and performance snapshot)
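vLLM's out-of-tree mechanism discovers plugins through Python entry points. The sketch below shows the general shape of such a registration hook; the entry-point group name matches vLLM's documented plugin system, but the module path, the `ATOM_ENABLED` environment variable, and the function body are illustrative assumptions, not code from this PR.

```python
# Hedged sketch of a vLLM out-of-tree platform registration hook.
# At startup vLLM calls every entry point in the "vllm.platform_plugins"
# group; each hook returns the fully qualified platform class path,
# or None to leave the plugin disabled.
# NOTE: "ATOM_ENABLED" and "atom.plugin.platform.ATOMPlatform" are
# illustrative names, not taken from this PR.
import os

def register_platform():
    if os.environ.get("ATOM_ENABLED", "1") == "0":
        return None  # disabled: vLLM keeps its built-in ROCm platform
    return "atom.plugin.platform.ATOMPlatform"
```

Packaging-wise, a hook like this would be exposed under the `vllm.platform_plugins` entry-point group in the package metadata so that vLLM can find it without any upstream code change.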

Here is the RFC:

For SGLang, this PR uses the official model impl backend mechanism. Here is the design diagram.
(image: design diagram)

For attention, this PR constructs BaseAttention and makes paged attention and radix attention inherit from this base class. The implementation details of ATOM's server mode and plugin mode have been moved into PagedAttentionImpl.
(image: attention design diagram)
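The hierarchy described above can be sketched in plain Python. The class names BaseAttention, PagedAttention, RadixAttention, and PagedAttentionImpl follow this PR, but the method signatures and bodies are invented for illustration.

```python
# Illustrative sketch of the attention hierarchy described above; class
# names follow the PR, signatures and bodies are invented.
from abc import ABC, abstractmethod

class PagedAttentionImpl:
    """Mode-specific details (ATOM server mode vs. plugin mode) live here."""
    def __init__(self, mode: str):
        self.mode = mode  # "server" or "plugin"

    def run(self, query: str) -> str:
        return f"paged-attn[{self.mode}]({query})"

class BaseAttention(ABC):
    @abstractmethod
    def forward(self, query: str) -> str:
        ...

class PagedAttention(BaseAttention):
    def __init__(self, mode: str = "server"):
        self.impl = PagedAttentionImpl(mode)  # delegates to the impl class

    def forward(self, query: str) -> str:
        return self.impl.run(query)

class RadixAttention(BaseAttention):
    """SGLang-facing variant; full integration lands in a follow-up PR."""
    def forward(self, query: str) -> str:
        raise NotImplementedError("radix attention falls back to SGLang mainline")
```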

@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from e6e0128 to 45ec455 on January 13, 2026 14:58
@zejunchen-zejun changed the title from "[WP][feat] Make ATOM can be model impl backend for vLLM and SGLang" to "[WIP][feat] Make ATOM can be model impl backend for vLLM and SGLang" on Jan 13, 2026
@zejunchen-zejun force-pushed the branch 3 times, most recently from cabd144 to c2657a9 on January 14, 2026 07:11
@zejunchen-zejun changed the title to "[WIP][feat] Make ATOM work as model impl backend for vLLM and SGLang" on Jan 15, 2026
@zejunchen-zejun force-pushed the branch 4 times, most recently from ae1f5e9 to 02e39be on January 16, 2026 12:28
@zejunchen-zejun force-pushed the branch 2 times, most recently from d0f4d79 to 2b10d8f on January 26, 2026 07:59
@zejunchen-zejun force-pushed the branch 3 times, most recently from bdf7a06 to 09cc7ed on January 29, 2026 14:09
@zejunchen-zejun changed the title to "[feat] Make ATOM work as model impl backend for vLLM and SGLang" on Feb 2, 2026
@zejunchen-zejun marked this pull request as ready for review on February 2, 2026 04:01
Copilot AI review requested due to automatic review settings February 2, 2026 04:01
Copilot AI left a comment

Pull request overview

This pull request enables ATOM to work as a model implementation backend for vLLM and SGLang, allowing users to specify --model-impl atom when launching these frameworks. The implementation follows an official registry mechanism and combines framework-level features from vLLM/SGLang with model-level fusion kernels from ATOM/AITER.

Changes:

  • Adds plugin infrastructure to register ATOM models and attention backends with vLLM and SGLang
  • Implements attention metadata builders and handlers for plugin mode
  • Refactors model implementations (Qwen3, Qwen3MoE, etc.) to support both server and plugin modes
  • Adds documentation recipe with setup instructions and known limitations

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 43 comments.

Summary per file:

  • recipes/Model-Impl-Backend.md — Documentation and setup guide for using ATOM with vLLM and SGLang
  • atom/plugin/*.py — Core plugin infrastructure including registration, config generation, and attention handling
  • atom/models/*.py — Model implementations updated to support plugin mode with consistent APIs
  • atom/model_ops/*.py — Attention operations refactored with base classes and plugin-specific implementations
  • atom/model_loader/loader.py — Weight loading updated to support plugin mode
  • atom/config.py — Configuration extended with plugin-specific settings
  • atom/utils/*.py — Utilities updated for plugin mode support


@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from 1440b34 to dd0e196 on February 2, 2026 04:34
Copilot AI review requested due to automatic review settings February 2, 2026 04:37
@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from dd0e196 to f6e3e47 on February 2, 2026 04:37
Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 33 comments.



Copilot AI review requested due to automatic review settings February 2, 2026 08:38
Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 20 comments.



Copilot AI review requested due to automatic review settings March 1, 2026 14:44
@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from 2ea48e3 to 155f991 on March 1, 2026 14:44
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.



@ChuanLi1101 (Collaborator) left a comment

Thanks for the hard work. It took me a while to review the PR. I’ve left some comments on a few more serious issues that may cause bugs, for your reference.

@zejunchen-zejun
Contributor Author

Thanks for the hard work. It took me a while to review the PR. I’ve left some comments on a few more serious issues that may cause bugs, for your reference.

Thank you for the significant suggestions. I will resolve them soon!

Copilot AI review requested due to automatic review settings March 2, 2026 12:25
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.




sunway513 commented Mar 3, 2026

PR #126 Review: [1/N][feat] Make ATOM work with vLLM and SGLang

+3,243 / -179 | 32 files | 21 commits


Overall Assessment

This is an ambitious PR that enables ATOM to function as an out-of-tree plugin for vLLM and SGLang. The architectural approach is sound — leveraging vLLM's official OOT mechanism makes the design transparent to vLLM with zero upstream code changes. The reported 6-20% performance uplift is compelling.

I performed an initial review and then double-checked each finding against the latest commit on this branch. Several issues flagged earlier (by myself and others on older commits) have already been addressed. Below are the remaining items.

Note: An earlier version of this comment listed 7 critical issues. After verification against the latest code, most were either already fixed or did not hold up on closer inspection. This updated version reflects the corrected assessment.


Remaining Issues

1. extend_for_sliding_window type annotation mismatch — atom/plugin/attention_mha.py

The function signature declares k_scale: float and v_scale: float, but in practice self.k_scale / self.v_scale (which are torch.Tensor or None) are passed in, and the downstream cp_mha_gather_cache expects torch.Tensor. The code works at runtime, but the type annotation is misleading and will cause issues with type checkers and IDE tooling.

2. extend_workspace buffer bypasses framework memory management — atom/plugin/attention.py

The workspace buffer [2, 32*1024, num_kv_heads, head_dim] is allocated outside of vLLM's cache manager. For a model like Qwen3-235B (8 kv heads, 128 head_dim, bf16), this is ~128 MB of untracked GPU memory. At high gpu_mem_utilization, this creates OOM risk. Also flagged by @wuhuikx — recommend at minimum adding a prominent comment/warning, and ideally integrating with vLLM's memory accounting.
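The ~128 MB figure checks out from the stated shape and dtype:

```python
# Workspace shape [2, 32*1024, num_kv_heads, head_dim] with a
# Qwen3-235B-style KV config (8 KV heads, head_dim 128) stored in
# bf16 (2 bytes per element).
num_elems = 2 * 32 * 1024 * 8 * 128
size_mib = num_elems * 2 / (1024 ** 2)  # bf16: 2 bytes/element
print(size_mib)  # → 128.0
```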


Design Suggestions (Non-blocking)

3. Heavy use of decorators increases debugging difficulty

Four decorators dynamically modify class inheritance chains and methods at import time (PagedAttentionImplDecoratorForPluginMode, AiterAttentionMetadataBuilderDecoratorForPluginMode, AiterBackendDecoratorForPluginMode, FusedMoEDecoratorForPluginMode). This makes stack traces harder to follow and debugging more difficult. Worth considering more explicit inheritance patterns in future iterations.
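The import-time class rewriting described above can be illustrated with a toy class decorator. The decorator name echoes the PR's naming convention, but the body is invented; it also shows why a server-mode constructor path (e.g. a model_runner argument) is never reached in plugin mode.

```python
# Toy version of an import-time "decorator for plugin mode": it swaps the
# class's __init__ wholesale, so server-mode arguments are never consumed.
# Names and bodies are illustrative, not taken from the PR.

def DecoratorForPluginMode(cls):
    def plugin_init(self, *args, **kwargs):
        # Replaces the server-mode constructor entirely; the original
        # model_runner-based path below becomes dead code in plugin mode.
        self.mode = "plugin"
    cls.__init__ = plugin_init
    return cls

@DecoratorForPluginMode
class AiterAttention:
    def __init__(self, model_runner):
        self.mode = "server"
        self.model_runner = model_runner
```

The cost, as noted above, is debuggability: a stack trace points at plugin_init rather than at the constructor visible in the class's source.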

4. is_plugin_mode() / is_vllm() / is_sglang() scattered across core code

Core files (attention_mha.py, base_attention.py, embed_head.py, loader.py, moe.py) now contain plugin mode conditionals throughout. Consider using strategy pattern or abstract interfaces to better isolate plugin-specific behavior in future PRs.
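The suggested refactor could look like the following strategy sketch (all names invented), where the framework choice is injected once instead of queried via is_vllm()/is_sglang() at every call site:

```python
# Sketch of the suggested strategy pattern: plugin-specific behavior is
# isolated behind an interface, so core code never branches on the
# framework. All names here are illustrative, not from the PR.
from abc import ABC, abstractmethod

class FrameworkStrategy(ABC):
    @abstractmethod
    def weights_source(self) -> str: ...

class VllmStrategy(FrameworkStrategy):
    def weights_source(self) -> str:
        return "vllm-managed checkpoint"

class SglangStrategy(FrameworkStrategy):
    def weights_source(self) -> str:
        return "sglang-managed checkpoint"

class Loader:
    # Instead of: if is_vllm(): ... elif is_sglang(): ...
    def __init__(self, strategy: FrameworkStrategy):
        self.strategy = strategy

    def load(self) -> str:
        return f"loading from {self.strategy.weights_source()}"
```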

5. SGLang support is incomplete

Multiple locations raise NotImplementedError for SGLang paths. Agree with @wuhuikx's suggestion to focus this PR on vLLM only and split SGLang into a follow-up PR.

6. Commit history cleanup

21 commits with several messages like "add" or "make lint happy". Recommend squashing before merge.


Previously Flagged Issues — Verified as Resolved or Not Applicable

For transparency, the following were flagged earlier but do not hold up against the latest code:

  • elif 0: dead code → Fixed in latest commit (now elif use_triton_attn and self.rotary_emb is not None)
  • positions None crash → positions is passed from model forward, not from Context.positions; dummy runs exit early
  • sliding_window None TypeError → Proper None guard exists (if sliding_window is None or sliding_window == -1)
  • paged_attention_triton_plugin_mode arg name mismatch → Fixed in latest commit, call sites now use k_cache= / v_cache=
  • ATOMPlatform fallback missing method → When disabled, ATOMPlatform = None and register_platform() returns None; no class instantiation occurs
  • model_runner=None crash in aiter_attention.py → In plugin mode, the decorator replaces __init__ entirely, so the original code path is not reached

CI/CD Coverage Requirement

See issue #255 for the full CI enhancement plan. Summary below:

CPU Unit Tests — Validated and Ready to Cherry-Pick

23 tests across 4 files have been developed, validated, and pass on CPU without GPU/vllm/aiter/triton dependencies (runs in ~1 second). The code is available on branch ci/plugin-mode-test-coverage in the ATOM repo for cherry-picking.

  • tests/test_plugin_prepare.py — 7 tests — is_vllm(), is_sglang(), is_plugin_mode(), _set_framework_backbone(), invalid framework, case insensitivity
  • tests/test_plugin_config.py — 6 tests — PluginConfig defaults, vllm/sglang mode fields, field completeness
  • tests/test_plugin_vllm_register.py — 6 tests — register_platform() enable/disable, register_model() skip, model registry overrides, set_attn_cls() → PagedAttention/RadixAttention
  • tests/test_plugin_vllm_platform.py — 4 tests — ATOMPlatform None when disabled, inherits RocmPlatform when enabled, returns ATOM backend, fallback when attention disabled

Also included on the branch:

  • .github/workflows/atom-plugin-test.yaml — New workflow (CPU unit tests on every PR + GPU smoke test)
  • .github/scripts/atom_plugin_test.sh — vLLM/SGLang plugin launch + inference + accuracy script

GPU Tests — Concept Only (For Follow-Up Development)

The following GPU test levels are proposed but not yet implemented — they require actual GPU hardware and full vLLM + AITER stack:

  • L1: Plugin wiring — decorator application, method injection, plugin discovery — 1× MI355, ~5 min, P0
  • L2: Kernel dispatch — verify the correct attention kernel is selected per config (fusion/triton/asm paths, sliding window, FP8 vs BF16) — 1× MI355, ~15 min, P1
  • L3: E2E correctness — plugin mode vs server mode output consistency, accuracy (gsm8k), multi-turn — 8× MI355, ~30 min, P1
  • L4: Perf regression — throughput comparison of plugin vs server mode (>= 95% of baseline) — 8× MI355, ~60 min, P2 (nightly)

Positive Aspects

  • Leverages vLLM's official OOT mechanism — zero upstream code changes needed
  • Sound attention abstraction hierarchy (BaseAttention → PagedAttention / RadixAttention)
  • 6-20% performance uplift backed by benchmark data
  • CI passing
  • Recipe documentation and RFC provided
  • Good responsiveness to review feedback — multiple issues from earlier reviews already addressed

Updated after double-checking all findings against the latest commit on this branch. CPU tests validated locally — 23/23 passing.


zejunchen-zejun commented Mar 3, 2026

  1. extend_for_sliding_window type annotation mismatch — atom/plugin/attention_mha.py
  2. extend_workspace buffer bypasses framework memory management — atom/plugin/attention.py
  3. Heavy use of decorators increases debugging difficulty
  4. is_plugin_mode() / is_vllm() / is_sglang() scattered across core code
  5. SGLang support is incomplete
  6. Commit history cleanup

Thank you for the comments. Here are the fixes and feedback:

  1. Fixed the type mismatch.
  2. Added a warning for frontend users about the untracked GPU memory usage. This buffer stores the fetched KV cache for the extend path (chunked prefill). Accounting for it in vLLM's memory budget is not easy, because attention is bypassed entirely during vLLM's profiler run; vLLM assumes attention consumes no extra memory.
  3. Yes, that makes sense. We considered the side effects of the decorators but still plan to use them. The purpose is to isolate plugin-mode behavior from server-mode behavior while reusing the same component class; AiterBackendDecoratorForPluginMode and FusedMoEDecoratorForPluginMode are quite simple, as they just add methods or rename the class to follow vLLM's calling convention. The backbone code for FusedMoE and AiterBackend is shared between server mode and plugin mode.
  4. I fully agree. By the design principle, it is a little unsafe to maintain that state across core components. We plan to use an abstract interface to check the current mode instead of is_vllm / is_sglang in future PRs.
  5. For SGLang, this PR only enables basic functionality and falls attention back to the SGLang mainline. We have another PR (based on this one) that integrates radix attention into ATOM for the SGLang+ATOM mode.
  6. Done!

Copilot AI review requested due to automatic review settings March 3, 2026 03:27


@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch 2 times, most recently from ebd438d to 7e9cece on March 3, 2026 03:57
wuhuikx previously approved these changes Mar 3, 2026
ganyi1996ppo previously approved these changes Mar 3, 2026
@wuhuikx requested review from ChuanLi1101 and valarLip March 3, 2026 07:06
framework

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI review requested due to automatic review settings March 3, 2026 12:40
@zejunchen-zejun dismissed stale reviews from ganyi1996ppo and wuhuikx via 7364488 March 3, 2026 12:41
@zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from 7e9cece to 7364488 on March 3, 2026 12:41
Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.





8 participants