PD Disaggregation Test Coverage Plan (PR #253) #22

@sunway513

Motivation

PR #253 introduces Prefill/Decode disaggregation — a critical production feature that splits inference across separate GPU instances via MORI-IO RDMA. Current test coverage is minimal:

  • 1 test file (test_kv_aggregator.py, 96 lines) covering only KVOutputAggregator
  • 0 tests for the core transfer engine (1,624 lines), proxy (372 lines), scheduler integration (344 lines), and async worker plumbing (212 lines)

This issue tracks the plan to add layered test coverage following the same strategy as the plugin-mode CI (#255).

Approach: Layered Testing by Module

L1: CPU Unit Tests (P0 — gate for merge)

Pure logic tests with mocked GPU/RDMA/ZMQ dependencies. Run on ubuntu-latest in < 5 seconds.

tests/disaggregation/
  ├── test_kv_aggregator.py          # Enhance existing
  ├── test_connector_metadata.py     # New
  ├── test_kv_connector_scheduler.py # New
  ├── test_proxy.py                  # New
  ├── test_transfer_utils.py         # New
  └── test_scheduler_kv_integration.py # New

Mock strategy: Mock aiter, mori.io, torch.distributed, zmq at sys.modules level.
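
A minimal sketch of what mocking at the sys.modules level could look like, assuming the tests only need the heavy modules to be importable. The helper names make_stub_modules and import_with_stubs are hypothetical and not part of the PR; patch.dict restores sys.modules when the block exits, so stubs do not leak between tests.

```python
import importlib
import sys
from unittest.mock import MagicMock, patch

# Module names taken from the mock strategy above.
STUBBED = ("aiter", "mori", "mori.io", "torch.distributed", "zmq")


def make_stub_modules(names=STUBBED):
    """Build one MagicMock per module name (hypothetical helper)."""
    stubs = {name: MagicMock(name=name) for name in names}
    for name, stub in stubs.items():
        parent, _, child = name.rpartition(".")
        if parent in stubs:
            # Keep `mori.io` reachable as an attribute of the `mori` stub.
            setattr(stubs[parent], child, stub)
    return stubs


def import_with_stubs(module_name, names=STUBBED):
    """Import module_name while the stubs are temporarily in sys.modules."""
    with patch.dict(sys.modules, make_stub_modules(names)):
        return importlib.import_module(module_name)
```

In a real suite the same patch.dict call would more likely live in a conftest.py fixture wrapped around the import of the module under test.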

test_kv_aggregator.py (enhance)

  • All workers report same ID → emitted
  • Partial workers → not emitted until all report
  • Multiple rounds accumulate correctly
  • Empty inputs don't crash
  • Interleaved send/recv tracked independently
  • reset() clears pending
  • Counter entries deleted after emission (no leak)
  • world_size <= 0 raises ValueError
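
The cases above can be pictured with a toy counting aggregator. This is a hypothetical stand-in written for illustration, not the real KVOutputAggregator: the class name, the aggregate signature, and the tuple-of-sets input shape are all assumptions.

```python
from collections import defaultdict


class ToyKVAggregator:
    """Hypothetical stand-in: emits an ID once every worker has reported it."""

    def __init__(self, world_size):
        if world_size <= 0:
            raise ValueError("world_size must be positive")
        self.world_size = world_size
        self._send_counts = defaultdict(int)
        self._recv_counts = defaultdict(int)

    def _collect(self, counts, ids):
        done = set()
        for req_id in ids:
            counts[req_id] += 1
            if counts[req_id] == self.world_size:
                del counts[req_id]  # drop the counter so finished IDs do not leak
                done.add(req_id)
        return done

    def aggregate(self, worker_outputs):
        """worker_outputs: iterable of (finished_sending, finished_recving) sets."""
        sent, received = set(), set()
        for sending, recving in worker_outputs:
            sent |= self._collect(self._send_counts, sending)
            received |= self._collect(self._recv_counts, recving)
        return sent, received

    def reset(self):
        """Forget all partially reported IDs."""
        self._send_counts.clear()
        self._recv_counts.clear()
```

Send and receive counters are kept separate so that an ID reported as sent by one worker and received by another never completes by accident.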

test_connector_metadata.py

  • add_new_req_to_recv builds correct ReqMeta from kv_transfer_params
  • add_new_req_to_save builds correct ReqMeta
  • Missing required params raises KeyError
  • Multiple reqs don't clobber each other
  • request_id_to_transfer_id mapping passthrough

test_kv_connector_scheduler.py

  • get_num_new_matched_tokens returns (prompt_len, True) for do_remote_prefill
  • Second call returns (0, False): kv_async_tagged is idempotent
  • No kv_transfer_params → (0, False)
  • update_state_after_alloc consumer: queues req, sets transfer_id mapping
  • update_state_after_alloc producer: does NOT queue
  • do_remote_prefill flag cleared after processing
  • build_connector_meta drains pending queue into metadata
  • build_connector_meta on empty queue → no crash
  • request_finished producer output contains block_table, engine_id, host, port
  • request_finished consumer cleans up transfer_id mapping
  • transfer_id ↔ request_id always bidirectionally consistent
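
One way to make the bidirectional-consistency check concrete is a small invariant helper. The sketch below is hypothetical: TransferIdMap and its methods are illustration names, and the real scheduler may store the mapping differently.

```python
class TransferIdMap:
    """Hypothetical two-way map between request_id and transfer_id."""

    def __init__(self):
        self._by_request = {}
        self._by_transfer = {}

    def add(self, request_id, transfer_id):
        # Remove any stale pairing on either side before inserting.
        self.remove_request(request_id)
        stale_request = self._by_transfer.pop(transfer_id, None)
        if stale_request is not None:
            self._by_request.pop(stale_request, None)
        self._by_request[request_id] = transfer_id
        self._by_transfer[transfer_id] = request_id

    def remove_request(self, request_id):
        transfer_id = self._by_request.pop(request_id, None)
        if transfer_id is not None:
            self._by_transfer.pop(transfer_id, None)
        return transfer_id

    def is_consistent(self):
        """True when both directions describe exactly the same pairs."""
        return len(self._by_request) == len(self._by_transfer) and all(
            self._by_transfer.get(t) == r for r, t in self._by_request.items()
        )
```

A test could call is_consistent() after every update_state_after_alloc and request_finished step rather than only at the end.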

test_proxy.py

  • _append_whole_dict_unique deduplicates
  • Dedup ignores index field
  • Transfer mode mismatch raises ValueError
  • No instances → 503 response
  • Round-robin cycles through instances evenly
  • _extract_ip_port on valid URL
  • _extract_ip_port on invalid URL raises ValueError
  • Prefill request sets max_tokens=1 and stream=False

test_transfer_utils.py

  • convert_virtual_to_physical_pages default 16→1 expansion
  • Same size → no expansion
  • Custom block_size ratios
  • merge_contiguous_blocks — all contiguous → 1 merged
  • None contiguous → N transfers
  • Partial merge
  • Empty/single input
  • Unsorted input → auto-sorts
  • _compute_block_transfer_offsets MHA (5D) vs MLA (3D)
  • make_zmq_path IPv4, IPv6, no-port
  • RoleManager singleton + thread safety
  • set_role / get_role round-trip
  • get_port_offset formula: dp_rank * tp_size + tp_rank
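
To illustrate the merge and port-offset cases, here is a minimal sketch under simplifying assumptions: blocks are identified by unique integer IDs and a merged transfer is represented as a (start, count) pair. merge_contiguous_block_ids is a hypothetical simplification of merge_contiguous_blocks, whose real signature may work on offsets and sizes instead; the argument order of get_port_offset is also assumed.

```python
def merge_contiguous_block_ids(block_ids):
    """Collapse block IDs into sorted (start, count) runs (hypothetical helper)."""
    runs = []
    for block_id in sorted(set(block_ids)):
        if runs and block_id == runs[-1][0] + runs[-1][1]:
            start, count = runs[-1]
            runs[-1] = (start, count + 1)  # extend the current run
        else:
            runs.append((block_id, 1))  # start a new run
    return runs


def get_port_offset(dp_rank, tp_rank, tp_size):
    """Formula from the list above: dp_rank * tp_size + tp_rank."""
    return dp_rank * tp_size + tp_rank
```

With tp_size = 4, the worker at dp_rank = 1, tp_rank = 2 gets offset 6, so each (dp_rank, tp_rank) pair maps to a distinct port as long as tp_rank < tp_size.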

test_scheduler_kv_integration.py

  • Seq enters WAITING_FOR_REMOTE_KV state
  • Finished recv moves seq to RUNNING
  • Finished send triggers block cleanup
  • None kv_connector_output → no crash
  • Seqs waiting for KV excluded from scheduled batch
  • connector_meta_output attached to ScheduledBatch

L2: CPU Integration Tests (P0)

| Test | Description |
| --- | --- |
| ZMQ handshake roundtrip | Listener + client threads in-process, verify metadata exchange |
| Service discovery registration | Simulate proxy ZMQ ROUTER, verify msgpack format and dedup |
| AsyncIOProcManager KV aggregation | Mock multiple worker KV outputs, verify call_func_with_aggregation |
| _pop_done_transfers all-status check | Bug: current code only checks status_list[-1]. Test with [FAIL, SUCCESS] → should NOT mark done |
| OpenAI server kv_params roundtrip | Request with kv_transfer_params → response contains output |
| Proxy prefill→decode read-mode flow | Simulate: prefill response → extract block metadata → decode request |
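
The _pop_done_transfers case can be sketched as a pair of predicates: one that mirrors the last-entry-only check described in the bug note and one that requires every status to succeed. TransferState and both function names are hypothetical; the real status objects come from MORI-IO and may expose a different interface.

```python
from enum import Enum


class TransferState(Enum):
    """Hypothetical status values standing in for the real status objects."""

    IN_PROGRESS = "in_progress"
    SUCCESS = "success"
    FAIL = "fail"


def done_last_only(status_list):
    """Mirrors the reported bug: only the final status is inspected."""
    return bool(status_list) and status_list[-1] is TransferState.SUCCESS


def done_all(status_list):
    """Proposed behavior: every status must be SUCCESS."""
    return bool(status_list) and all(
        status is TransferState.SUCCESS for status in status_list
    )
```

For the input [FAIL, SUCCESS], done_last_only reports the transfer as done while done_all does not, which is the regression the L2 test is meant to pin down.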

L3: GPU Tests (P1 — design only)

| Test | Env | Description |
| --- | --- | --- |
| register_kv_caches RDMA metadata | 1 GPU | Real KV tensors → verify RDMA metadata non-null |
| MoRIIO wrapper tensor registration | 1 GPU | CUDA tensor → packed metadata valid |
| Single-node loopback transfer | 2+ GPU | Producer → consumer RDMA read, verify data match |
| E2E proxy+prefill+decode | 8 GPU | Full 3-process inference |
| Multi-request concurrent | 8 GPU | Concurrent P/D pipeline |

Known Bugs to Cover

  1. _pop_done_transfers only checks status_list[-1] — should check ALL statuses
  2. start_load_kv busy-wait: `while need_handshake: continue` burns CPU
  3. Proxy timeout: aiohttp.ClientTimeout(total=6*6000*6000) evaluates to 216,000,000 s (about 60,000 hours, not 600) and should be configurable
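
For the busy-wait in item 2, one possible replacement is a blocking wait on threading.Event, which lets the waiting thread sleep until the handshake completes. This is a sketch of the general technique only: HandshakeGate is a hypothetical name, and how start_load_kv would be wired to it is not specified in the PR.

```python
import threading


class HandshakeGate:
    """Hypothetical helper: block until a handshake is marked complete."""

    def __init__(self):
        self._done = threading.Event()

    def mark_done(self):
        """Called by the handshake thread once metadata has been exchanged."""
        self._done.set()

    def wait(self, timeout=None):
        """Sleep until mark_done() is called; return False if timeout expires."""
        return self._done.wait(timeout)
```

A CPU unit test can cover both paths: wait() with a short timeout returns False before mark_done(), and returns True once another thread calls mark_done().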

Estimated Effort

| Layer | Files | Test Cases | Lines (est.) |
| --- | --- | --- | --- |
| L1 | 6 | ~55 | ~800 |
| L2 | 1 | ~6 | ~300 |
| L3 | design only | ~5 | ~200 |
| Total | 7 | ~66 | ~1,300 |

CI Integration

Add to existing workflow or new atom-pd-test.yaml:

pd-unit-tests:
  name: PD Disaggregation Unit Tests (CPU)
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v4
      with:
        python-version: "3.12"
    - run: pip install pytest msgpack msgspec numpy aiohttp quart
    - run: pip install torch --index-url https://download.pytorch.org/whl/cpu
    - run: pytest tests/disaggregation/ -v --tb=short

Reference

  • Design doc: docs/plans/2026-03-04-pd-disaggregation-test-coverage-design.md
  • Related: PR #253, Issue #255
