-
Notifications
You must be signed in to change notification settings - Fork 685
[Feature] Add Golang-based Router for Request Scheduling and Load Balancing #5882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for your contribution! |
|
mouxin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a new Golang-based router component to FastDeploy that provides request routing, load balancing, and health monitoring capabilities for inference services. The router supports both "splitwise" mode (separate prefill/decode instances) and "mixed" mode (unified instances).
Key Changes:
- Complete Golang router implementation with scheduling strategies (random, round-robin, power-of-two, process-tokens, request-num)
- Health check system for monitoring worker instances
- Prometheus metrics integration for observability
- Configuration management with YAML support
- Comprehensive test coverage across packages
Reviewed changes
Copilot reviewed 43 out of 44 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| cmd/main.go | Main entry point initializing router, manager, scheduler, and health monitoring |
| pkg/logger/logger.go | Logging infrastructure with level-based filtering and file/stdout output |
| pkg/metrics/metrics.go | Prometheus metrics for HTTP requests and inference tracking |
| internal/config/config.go | Configuration loading from YAML with validation and defaults |
| internal/manager/*.go | Instance registration, health checks, and worker management |
| internal/scheduler/handler/*.go | Worker selection strategies and load balancing logic |
| internal/gateway/completions.go | Request forwarding to inference workers with streaming support |
| internal/middleware/*.go | Logging, recovery, and metrics middleware |
| internal/router/router.go | HTTP route definitions and handler registration |
| run.sh, build.sh, Makefile | Build and deployment scripts |
| go.mod, go.sum | Go module dependencies |
| examples/* | Example configurations and startup scripts |
| @@ -0,0 +1,151 @@ | |||
| # fd-router | |||
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR description is incomplete. According to the coding guidelines, it should explain why these modifications are being made and what problem is being solved. Please add a detailed description covering:
- Why this Golang router is needed
- What problems it solves
- Key features and benefits
- How it differs from existing solutions
| PID=$(ps -ef | grep "fd-router" | grep -v grep | awk '{print $2}') | ||
| if [ -n "$PID" ]; then | ||
| echo "Killing existing fd-router process (PID: $PID)" | ||
| kill -9 $PID |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using 'kill -9' is a harsh signal that doesn't allow processes to clean up gracefully. Consider using 'kill -15' (SIGTERM) first, which allows the process to perform cleanup operations. Reserve 'kill -9' (SIGKILL) only as a last resort if SIGTERM doesn't work after a timeout.
| kill -9 $PID | |
| # Try graceful shutdown first | |
| kill -15 $PID | |
| TIMEOUT=10 | |
| while kill -0 $PID 2>/dev/null && [ $TIMEOUT -gt 0 ]; do | |
| echo "Waiting for fd-router (PID: $PID) to exit gracefully... ($TIMEOUT seconds remaining)" | |
| sleep 1 | |
| TIMEOUT=$((TIMEOUT - 1)) | |
| done | |
| # Force kill if still running after timeout | |
| if kill -0 $PID 2>/dev/null; then | |
| echo "fd-router (PID: $PID) did not exit gracefully; sending SIGKILL..." | |
| kill -9 $PID | |
| fi |
| PID=$(ps -ef | grep "fd-router" | grep -v grep | awk '{print $2}') | ||
| if [ -n "$PID" ]; then | ||
| echo "Killing existing fd-router process (PID: $PID)" | ||
| kill -9 $PID |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using 'kill -9' is a harsh signal that doesn't allow processes to clean up gracefully. Consider using 'kill -15' (SIGTERM) first, which allows the process to perform cleanup operations. Reserve 'kill -9' (SIGKILL) only as a last resort if SIGTERM doesn't work after a timeout.
| kill -9 $PID | |
| # First try to terminate gracefully | |
| kill -15 "$PID" | |
| sleep 5 | |
| # If still running after timeout, force kill as a last resort | |
| if ps -p "$PID" > /dev/null 2>&1; then | |
| echo "Process $PID did not terminate gracefully; force killing..." | |
| kill -9 "$PID" | |
| fi |
fastdeploy/golang_router/go.env
Outdated
| @@ -0,0 +1,3 @@ | |||
| GO111MODULE=on | |||
| GOPRIVATE=*.baidu.com | |||
| GOPROXY=http://goproxy.baidu-int.com | |||
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a hardcoded internal Baidu proxy (goproxy.baidu-int.com) makes the code non-portable for external contributors and open-source usage. This should either be removed or made configurable through environment variables so external users can build the project.
| GOPROXY=http://goproxy.baidu-int.com | |
| GOPROXY=${GOPROXY:-https://proxy.golang.org,direct} |
| r := rand.New(randomSource) | ||
| randomNum := r.Intn(len(workers)) | ||
| return workers[randomNum], nil |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The random number generator is not thread-safe. Multiple goroutines calling 'RandomSelectWorker' concurrently will access 'rand.New(randomSource)' which shares the same source. This can lead to race conditions. Either use 'math/rand/v2' with concurrent-safe generators, or protect access with a mutex, or use 'rand.Intn()' directly from the global generator which is thread-safe.
| r := rand.New(powerOfTwoSource) | ||
| length := len(workers) | ||
| randomNum1 := r.Intn(length) | ||
| randomNum2 := r.Intn(length) | ||
|
|
||
| for randomNum2 == randomNum1 { | ||
| randomNum2 = r.Intn(length) |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The random number generator is not thread-safe. Multiple goroutines calling 'PowerOfTwoSelectWorker' concurrently will access 'rand.New(powerOfTwoSource)' which shares the same source. This can lead to race conditions. Either use 'math/rand/v2' with concurrent-safe generators, or protect access with a mutex, or use 'rand.Intn()' directly from the global generator which is thread-safe.
| if err != nil { | ||
| log.Fatalln("Failed to open log file:", err) | ||
| } | ||
| defer file.Close() |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'defer file.Close()' statement is unreachable in the current implementation because it's placed after opening the file but inside 'once.Do'. Since loggers are assigned after this defer and the function doesn't return an error, this defer will execute after the 'once.Do' completes, which might close the file prematurely. The loggers will then write to a closed file, causing errors. Consider restructuring the code to ensure proper file lifecycle management.
| @@ -0,0 +1,151 @@ | |||
| # fd-router | |||
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR title "[Feature] add golang router" should follow the required format. According to the coding guidelines, PR titles should be more descriptive. Consider changing to something like: "[Feature] Add Golang-based Router for Request Scheduling and Load Balancing"
| if err != nil { | ||
| log.Fatalln("Failed to open log file:", err) | ||
| } | ||
| defer file.Close() |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File handle may be writable as a result of data flow from a call to OpenFile and closing it may result in data loss upon failure, which is not handled explicitly.
| # check fd-router binary | ||
| if [ ! -x "${FD_ROUTER_BIN}" ]; then | ||
| echo "⚠️ fd-router not found at ${FD_ROUTER_BIN}, downloading..." | ||
| mkdir -p "${FD_BIN_DIR}" | ||
| wget -q --no-proxy "${FD_ROUTER_URL}" -O "${FD_ROUTER_BIN}" || { | ||
| echo "❌ Failed to download fd-router" | ||
| exit 1 | ||
| } | ||
| chmod +x "${FD_ROUTER_BIN}" |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script automatically downloads a precompiled fd-router binary from https://paddle-qa.bj.bcebos.com/FastDeploy/fd-router into /usr/local/bin and executes it without any integrity verification (no checksum, signature, or pinned hash). If an attacker can tamper with the download path (via DNS poisoning, compromised storage, or MITM within the network), they can deliver a malicious binary that will be run with the user's privileges. Add a strong integrity check (e.g., pinned hash/signature validation or a versioned, authenticated distribution mechanism) before marking the binary executable and starting it.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #5882 +/- ##
==========================================
Coverage ? 66.59%
==========================================
Files ? 347
Lines ? 44467
Branches ? 6835
==========================================
Hits ? 29612
Misses ? 12670
Partials ? 2185
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.