-
Notifications
You must be signed in to change notification settings - Fork 664
[Feature] [PD Disaggregation] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports #5415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…factor post-init and usage for all ports
|
Thanks for your contribution! |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #5415 +/- ##
==========================================
Coverage ? 59.39%
==========================================
Files ? 329
Lines ? 41214
Branches ? 6277
==========================================
Hits ? 24478
Misses ? 14837
Partials ? 1899
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors port configuration for PD-disaggregated deployments by introducing automatic port allocation and consolidating port management logic. The changes aim to simplify configuration by removing the need for users to manually specify multiple ports per service.
Key Changes
- Introduced automatic port discovery via
find_free_ports()function with validation inpost_init_all_ports() - Centralized port extraction logic in
FDConfig.postprocess()to set per-DP ports (local_engine_worker_queue_port, etc.) - Added RDMA environment auto-detection via new
get_rdma_nics.shscript
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| fastdeploy/utils.py | Added parse_ports(), find_free_ports(), and modified is_port_available() for port management |
| fastdeploy/engine/args_utils.py | Added post_init_all_ports() to automatically allocate ports; changed default cache_transfer_protocol to "ipc,rdma" |
| fastdeploy/config.py | Moved port extraction logic to postprocess() method; added local_engine_worker_queue_port attribute |
| fastdeploy/worker/worker_process.py | Moved port assignment from pre-config to post-config initialization |
| fastdeploy/engine/engine.py | Updated to use local_engine_worker_queue_port for DP queue services |
| fastdeploy/engine/common_engine.py | Simplified port handling by using local_engine_worker_queue_port directly |
| fastdeploy/engine/expert_service.py | Removed local port/device ID extraction logic (now handled in config) |
| fastdeploy/splitwise/splitwise_connector.py | Changed to use scalar pd_comm_port and local_device_ids |
| fastdeploy/cache_manager/prefix_cache_manager.py | Renamed pid_suffix to ipc_suffix; updated RDMA port usage |
| fastdeploy/cache_manager/cache_transfer_manager.py | Renamed engine_pid to ipc_suffix for clarity |
| fastdeploy/cache_manager/cache_messager.py | Renamed engine_pid to ipc_suffix for consistency |
| fastdeploy/cache_manager/transfer_factory/rdma_cache_transfer.py | Added automatic RDMA environment setup and NIC detection |
| fastdeploy/cache_manager/transfer_factory/get_rdma_nics.sh | New script for detecting RDMA-capable network interfaces |
| examples/splitwise/utils.sh | Improved health check display with UDP port checking |
| examples/splitwise/start_v1_tp1.sh | Simplified example by removing manual port specifications |
fastdeploy/cache_manager/transfer_factory/rdma_cache_transfer.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 16 comments.
Comments suppressed due to low confidence (1)
fastdeploy/engine/args_utils.py:557
- This comment appears to contain commented-out code.
# for port in ports:
# assert is_port_available("0.0.0.0", port), f"Parameter `{name}`:{port} is already in use."
| # for port in ports: | ||
| # assert is_port_available("0.0.0.0", port), f"Parameter `{name}`:{port} is already in use." | ||
|
|
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commented-out assertion for port availability check has been removed. This check was important to ensure that the specified ports are not already in use before starting the service. Without this validation, the server might fail to start with cryptic error messages, making debugging more difficult.
Consider keeping this validation, especially since the PR aims to simplify deployment. Early detection of port conflicts improves user experience.
| engine_worker_queue_ports = [random.randint(8000, 65535) for _ in range(data_parallel_size)] | ||
| cache_queue_ports = [random.randint(8000, 65535) for _ in range(data_parallel_size)] | ||
| pd_comm_ports = [random.randint(8000, 65535) for _ in range(data_parallel_size)] | ||
| rdma_comm_ports = [random.randint(8000, 65535) for _ in range(data_parallel_size * tensor_parallel_size)] |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test uses random port numbers which could lead to flaky tests. If a randomly selected port happens to be in use or reserved, the test might fail unpredictably. Additionally, using random.randint(8000, 65535) can generate duplicate port numbers across the different lists.
Consider using a deterministic sequence of ports for testing, such as:
engine_worker_queue_ports = list(range(10000, 10000 + data_parallel_size))
cache_queue_ports = list(range(11000, 11000 + data_parallel_size))
pd_comm_ports = list(range(12000, 12000 + data_parallel_size))
rdma_comm_ports = list(range(13000, 13000 + data_parallel_size * tensor_parallel_size))This ensures deterministic, non-overlapping ports for unit tests.
| ) | ||
| except Exception as e: | ||
| logger.error(f"Failed to extract local devices or ports. Servers may not be able to start properly. {e}") | ||
|
|
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling in postprocess_devices_and_ports catches all exceptions and logs an error, but then continues execution. This could lead to the system running with incorrectly configured ports/devices, causing cryptic failures later.
Consider either:
- Re-raising the exception after logging to prevent the system from starting in a broken state
- Returning a boolean success indicator that callers can check
- Setting the config to a safe default state rather than leaving it partially configured
| raise |
Motivation
该 PR 旨在实现两个目标:
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.