[1/N][feat] Make ATOM work with vLLM and SGLang #126
zejunchen-zejun wants to merge 1 commit into ROCm:main
Conversation
Pull request overview
This pull request enables ATOM to work as a model implementation backend for vLLM and SGLang, allowing users to specify --model-impl atom when launching these frameworks. The implementation follows an official registry mechanism and combines framework-level features from vLLM/SGLang with model-level fusion kernels from ATOM/AITER.
Changes:
- Adds plugin infrastructure to register ATOM models and attention backends with vLLM and SGLang
- Implements attention metadata builders and handlers for plugin mode
- Refactors model implementations (Qwen3, Qwen3MoE, etc.) to support both server and plugin modes
- Adds documentation recipe with setup instructions and known limitations
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 43 comments.
Summary per file:
| File | Description |
|---|---|
| recipes/Model-Impl-Backend.md | Documentation and setup guide for using ATOM with vLLM and SGLang |
| atom/plugin/*.py | Core plugin infrastructure including registration, config generation, and attention handling |
| atom/models/*.py | Model implementations updated to support plugin mode with consistent APIs |
| atom/model_ops/*.py | Attention operations refactored with base classes and plugin-specific implementations |
| atom/model_loader/loader.py | Weight loading updated to support plugin mode |
| atom/config.py | Configuration extended with plugin-specific settings |
| atom/utils/*.py | Utilities updated for plugin mode support |
ChuanLi1101 left a comment:
Thanks for the hard work. It took me a while to review the PR. I’ve left some comments on a few more serious issues that may cause bugs, for your reference.
Thank you for the significant suggestions. I will resolve them soon!
PR #126 Review:
| Test File | Tests | Coverage |
|---|---|---|
| tests/test_plugin_prepare.py | 7 | is_vllm(), is_sglang(), is_plugin_mode(), _set_framework_backbone(), invalid framework, case insensitivity |
| tests/test_plugin_config.py | 6 | PluginConfig defaults, vllm/sglang mode fields, field completeness |
| tests/test_plugin_vllm_register.py | 6 | register_platform() enable/disable, register_model() skip, model registry overrides, set_attn_cls() → PagedAttention/RadixAttention |
| tests/test_plugin_vllm_platform.py | 4 | ATOMPlatform None when disabled, inherits RocmPlatform when enabled, returns ATOM backend, fallback when attention disabled |
Also included on the branch:
- `.github/workflows/atom-plugin-test.yaml` — new workflow (CPU unit tests on every PR + GPU smoke test)
- `.github/scripts/atom_plugin_test.sh` — vLLM/SGLang plugin launch + inference + accuracy script
GPU Tests — Concept Only (For Follow-Up Development)
The following GPU test levels are proposed but not yet implemented — they require actual GPU hardware and full vLLM + AITER stack:
| Level | Description | GPU | Est. Time | Priority |
|---|---|---|---|---|
| L1: Plugin wiring | Decorator application, method injection, plugin discovery | 1× MI355 | ~5 min | P0 |
| L2: Kernel dispatch | Verify correct attention kernel selected per config (fusion/triton/asm paths, sliding window, FP8 vs BF16) | 1× MI355 | ~15 min | P1 |
| L3: E2E correctness | Plugin mode vs server mode output consistency, accuracy (gsm8k), multi-turn | 8× MI355 | ~30 min | P1 |
| L4: Perf regression | Throughput comparison plugin vs server mode (>= 95% baseline) | 8× MI355 | ~60 min | P2 (nightly) |
Positive Aspects
- Leverages vLLM's official OOT mechanism — zero upstream code changes needed
- Sound attention abstraction hierarchy (BaseAttention → PagedAttention / RadixAttention)
- 6-20% performance uplift backed by benchmark data
- CI passing
- Recipe documentation and RFC provided
- Good responsiveness to review feedback — multiple issues from earlier reviews already addressed
Updated after double-checking all findings against the latest commit on this branch. CPU tests validated locally — 23/23 passing.
Thank you for the comments. Let me fix these and follow up with feedback.
This PR makes ATOM work with vLLM and SGLang, preserving the out-of-the-box experience of these popular frameworks while providing ATOM's optimizations.
For vLLM, this PR uses vLLM's official out-of-tree mechanism and makes ATOM provide the platform, models, and attention backend to vLLM. Here are the design diagram and performance snapshot. Compared to plain vLLM, vLLM+ATOM shows a 6-20% performance uplift.


Here is the RFC:
For SGLang, this PR uses SGLang's official model-impl backend mechanism. Here is the design diagram.

For attention, this PR introduces a BaseAttention class and makes paged attention and radix attention inherit from it. The implementation details of ATOM server mode and plugin mode have been moved into PagedAttentionImpl.
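The hierarchy described above can be sketched as follows. This is a minimal structural sketch, not the PR's actual code: the real classes wrap AITER kernels, so `forward()` is stubbed here with strings purely to show how both attention variants delegate to a shared impl that encapsulates the server-mode/plugin-mode split:

```python
from abc import ABC, abstractmethod

class PagedAttentionImpl:
    """Holds the server-mode vs. plugin-mode implementation details."""
    def __init__(self, plugin_mode: bool):
        self.plugin_mode = plugin_mode

    def run(self, query: str) -> str:
        # In the real code this dispatches to the appropriate AITER kernel;
        # here we just report which mode was selected.
        mode = "plugin" if self.plugin_mode else "server"
        return f"paged-attention[{mode}]({query})"

class BaseAttention(ABC):
    """Common interface shared by the vLLM- and SGLang-facing attention classes."""
    @abstractmethod
    def forward(self, query: str) -> str: ...

class PagedAttention(BaseAttention):
    def __init__(self, plugin_mode: bool = False):
        self.impl = PagedAttentionImpl(plugin_mode)

    def forward(self, query: str) -> str:
        return self.impl.run(query)

class RadixAttention(BaseAttention):
    """SGLang-style attention; same base interface, same shared impl."""
    def __init__(self, plugin_mode: bool = True):
        self.impl = PagedAttentionImpl(plugin_mode)

    def forward(self, query: str) -> str:
        return self.impl.run(query)
```

Keeping the mode split inside `PagedAttentionImpl` means the public `PagedAttention`/`RadixAttention` classes stay identical across server and plugin deployments.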
