Contributor

@mouxinqq mouxinqq commented Jan 5, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag to the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If there are none, explain why in this PR.
  • Provide accuracy results.
  • If the current PR targets the release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot bot commented Jan 5, 2026

Thanks for your contribution!

CLAassistant commented Jan 5, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ mouxin
❌ mouxinqq


mouxin does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@paddle-bot paddle-bot bot added the contributor External developers label Jan 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a new Golang-based router component to FastDeploy that provides request routing, load balancing, and health monitoring capabilities for inference services. The router supports both "splitwise" mode (separate prefill/decode instances) and "mixed" mode (unified instances).

Key Changes:

  • Complete Golang router implementation with scheduling strategies (random, round-robin, power-of-two, process-tokens, request-num)
  • Health check system for monitoring worker instances
  • Prometheus metrics integration for observability
  • Configuration management with YAML support
  • Comprehensive test coverage across packages

Reviewed changes

Copilot reviewed 43 out of 44 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| cmd/main.go | Main entry point initializing router, manager, scheduler, and health monitoring |
| pkg/logger/logger.go | Logging infrastructure with level-based filtering and file/stdout output |
| pkg/metrics/metrics.go | Prometheus metrics for HTTP requests and inference tracking |
| internal/config/config.go | Configuration loading from YAML with validation and defaults |
| internal/manager/*.go | Instance registration, health checks, and worker management |
| internal/scheduler/handler/*.go | Worker selection strategies and load balancing logic |
| internal/gateway/completions.go | Request forwarding to inference workers with streaming support |
| internal/middleware/*.go | Logging, recovery, and metrics middleware |
| internal/router/router.go | HTTP route definitions and handler registration |
| run.sh, build.sh, Makefile | Build and deployment scripts |
| go.mod, go.sum | Go module dependencies |
| examples/* | Example configurations and startup scripts |

@@ -0,0 +1,151 @@
# fd-router

Copilot AI Jan 6, 2026

The PR description is incomplete. According to the coding guidelines, it should explain why these modifications are being made and what problem is being solved. Please add a detailed description covering:

  1. Why this Golang router is needed
  2. What problems it solves
  3. Key features and benefits
  4. How it differs from existing solutions

Copilot generated this review using guidance from repository custom instructions.
PID=$(ps -ef | grep "fd-router" | grep -v grep | awk '{print $2}')
if [ -n "$PID" ]; then
echo "Killing existing fd-router process (PID: $PID)"
kill -9 $PID

Copilot AI Jan 6, 2026

Using 'kill -9' is a harsh signal that doesn't allow processes to clean up gracefully. Consider using 'kill -15' (SIGTERM) first, which allows the process to perform cleanup operations. Reserve 'kill -9' (SIGKILL) only as a last resort if SIGTERM doesn't work after a timeout.

Suggested change
- kill -9 $PID
+ # Try graceful shutdown first
+ kill -15 $PID
+ TIMEOUT=10
+ while kill -0 $PID 2>/dev/null && [ $TIMEOUT -gt 0 ]; do
+     echo "Waiting for fd-router (PID: $PID) to exit gracefully... ($TIMEOUT seconds remaining)"
+     sleep 1
+     TIMEOUT=$((TIMEOUT - 1))
+ done
+ # Force kill if still running after timeout
+ if kill -0 $PID 2>/dev/null; then
+     echo "fd-router (PID: $PID) did not exit gracefully; sending SIGKILL..."
+     kill -9 $PID
+ fi

Copilot uses AI. Check for mistakes.
PID=$(ps -ef | grep "fd-router" | grep -v grep | awk '{print $2}')
if [ -n "$PID" ]; then
echo "Killing existing fd-router process (PID: $PID)"
kill -9 $PID

Copilot AI Jan 6, 2026

Using 'kill -9' is a harsh signal that doesn't allow processes to clean up gracefully. Consider using 'kill -15' (SIGTERM) first, which allows the process to perform cleanup operations. Reserve 'kill -9' (SIGKILL) only as a last resort if SIGTERM doesn't work after a timeout.

Suggested change
- kill -9 $PID
+ # First try to terminate gracefully
+ kill -15 "$PID"
+ sleep 5
+ # If still running after timeout, force kill as a last resort
+ if ps -p "$PID" > /dev/null 2>&1; then
+     echo "Process $PID did not terminate gracefully; force killing..."
+     kill -9 "$PID"
+ fi

@@ -0,0 +1,3 @@
GO111MODULE=on
GOPRIVATE=*.baidu.com
GOPROXY=http://goproxy.baidu-int.com

Copilot AI Jan 6, 2026

Using a hardcoded internal Baidu proxy (goproxy.baidu-int.com) makes the code non-portable for external contributors and open-source usage. This should either be removed or made configurable through environment variables so external users can build the project.

Suggested change
- GOPROXY=http://goproxy.baidu-int.com
+ GOPROXY=${GOPROXY:-https://proxy.golang.org,direct}

Comment on lines 18 to 20
r := rand.New(randomSource)
randomNum := r.Intn(len(workers))
return workers[randomNum], nil

Copilot AI Jan 6, 2026

The random number generator is not thread-safe. Every call to 'RandomSelectWorker' wraps the same shared 'randomSource' in a new generator via 'rand.New(randomSource)', and a 'rand.Source' such as the one returned by 'rand.NewSource' is not safe for concurrent use, so goroutines calling this function concurrently can race on the source. Either protect access with a mutex, or call the package-level 'rand.Intn()' (or the top-level functions of 'math/rand/v2'), which are safe for concurrent use.
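
The package-level 'rand.Intn()' option mentioned above could look roughly like the sketch below. It assumes workers are plain strings; `pickRandom` and the worker names are hypothetical and are not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
)

// pickRandom returns a uniformly random element of workers.
// It calls the package-level rand.Intn, which is safe for concurrent use,
// instead of wrapping a shared rand.Source in a new *rand.Rand per call.
// The name and the []string worker type are assumptions for illustration.
func pickRandom(workers []string) (string, error) {
	if len(workers) == 0 {
		return "", errors.New("no available workers")
	}
	return workers[rand.Intn(len(workers))], nil
}

func main() {
	workers := []string{"w0", "w1", "w2"}

	// Call the selector from several goroutines at once.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := pickRandom(workers); err != nil {
				fmt.Println("error:", err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```

Running a sketch like this with `go run -race` is one way to check that the selector no longer reports a data race.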

Comment on lines 23 to 29
r := rand.New(powerOfTwoSource)
length := len(workers)
randomNum1 := r.Intn(length)
randomNum2 := r.Intn(length)

for randomNum2 == randomNum1 {
randomNum2 = r.Intn(length)

Copilot AI Jan 6, 2026

The random number generator is not thread-safe. Every call to 'PowerOfTwoSelectWorker' wraps the same shared 'powerOfTwoSource' in a new generator via 'rand.New(powerOfTwoSource)', and a 'rand.Source' such as the one returned by 'rand.NewSource' is not safe for concurrent use, so goroutines calling this function concurrently can race on the source. Either protect access with a mutex, or call the package-level 'rand.Intn()' (or the top-level functions of 'math/rand/v2'), which are safe for concurrent use.
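
For the power-of-two strategy, a rough sketch under the same assumption (package-level 'rand.Intn()') is shown below. The `worker` type, its `Load` field, and the function names are hypothetical; the PR's actual worker type and load metric are not visible in this thread. The sketch also draws the second index from a range one smaller than the first, which avoids the retry loop and the case where a single-worker list could never yield two distinct indexes.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// worker is a hypothetical stand-in for the router's worker type.
type worker struct {
	Name string
	Load int // assumed load metric; lower is better
}

// twoDistinct returns two different indexes in [0, n). It requires n >= 2.
// The second draw comes from n-1 values and is shifted past the first,
// so no retry loop is needed.
func twoDistinct(n int) (int, int) {
	i := rand.Intn(n)
	j := rand.Intn(n - 1)
	if j >= i {
		j++
	}
	return i, j
}

// pickPowerOfTwo samples two distinct workers and returns the less loaded one.
func pickPowerOfTwo(workers []worker) (worker, error) {
	switch len(workers) {
	case 0:
		return worker{}, errors.New("no available workers")
	case 1:
		return workers[0], nil
	}
	i, j := twoDistinct(len(workers))
	if workers[j].Load < workers[i].Load {
		return workers[j], nil
	}
	return workers[i], nil
}

func main() {
	workers := []worker{
		{Name: "a", Load: 5},
		{Name: "b", Load: 1},
		{Name: "c", Load: 9},
	}
	w, err := pickPowerOfTwo(workers)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("picked", w.Name)
}
```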

if err != nil {
log.Fatalln("Failed to open log file:", err)
}
defer file.Close()

Copilot AI Jan 6, 2026

The 'defer file.Close()' statement is placed inside the function passed to 'once.Do', so it runs as soon as that function returns rather than when the program exits. The loggers are assigned after this defer and the function doesn't return an error, which means the file can be closed before the loggers are ever used. The loggers will then write to a closed file, causing errors. Consider restructuring the code so that the file stays open for as long as the loggers need it and is closed explicitly at shutdown.
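
One possible restructuring is sketched below, assuming a single package-level logger: the file handle is kept in a package-level variable and closed by an explicit function at shutdown instead of by a defer inside 'once.Do'. All names here (`initLogger`, `closeLogger`, `infoLog`, `exampleLogPath`) are hypothetical and do not come from the PR.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"sync"
)

var (
	once    sync.Once
	logFile *os.File    // kept open for the lifetime of the loggers
	infoLog *log.Logger // hypothetical logger backed by logFile
	initErr error
)

// exampleLogPath returns a throwaway path used only by this sketch.
func exampleLogPath() string {
	return filepath.Join(os.TempDir(), "fd-router-example.log")
}

// initLogger opens the log file at most once. It does not defer Close,
// so the file stays open after once.Do returns.
func initLogger(path string) error {
	once.Do(func() {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
		if err != nil {
			initErr = err
			return
		}
		logFile = f
		infoLog = log.New(f, "[INFO] ", log.LstdFlags)
	})
	return initErr
}

// closeLogger closes the log file; it is meant to run once at shutdown.
func closeLogger() error {
	if logFile == nil {
		return nil
	}
	err := logFile.Close()
	logFile = nil
	return err
}

func main() {
	if err := initLogger(exampleLogPath()); err != nil {
		log.Fatalln("Failed to open log file:", err)
	}
	defer func() {
		if err := closeLogger(); err != nil {
			fmt.Fprintln(os.Stderr, "close log file:", err)
		}
	}()
	infoLog.Println("router started")
}
```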

@@ -0,0 +1,151 @@
# fd-router

Copilot AI Jan 6, 2026

The PR title "[Feature] add golang router" should follow the required format. According to the coding guidelines, PR titles should be more descriptive. Consider changing to something like: "[Feature] Add Golang-based Router for Request Scheduling and Load Balancing"

Copilot generated this review using guidance from repository custom instructions.
if err != nil {
log.Fatalln("Failed to open log file:", err)
}
defer file.Close()

Copilot AI Jan 6, 2026

File handle may be writable as a result of data flow from a call to OpenFile and closing it may result in data loss upon failure, which is not handled explicitly.
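
A common way to handle this explicitly is to capture the error from Close in a named return value, so a failed close on a writable file is reported rather than dropped. The sketch below is illustrative only; `appendLine` and `examplePath` are hypothetical helpers, not functions from the PR.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// examplePath returns a throwaway path used only by this sketch.
func examplePath() string {
	return filepath.Join(os.TempDir(), "fd-router-close-example.log")
}

// appendLine appends one line to the file at path. The deferred function
// checks the error from Close and returns it when the write itself
// succeeded, so a failed close is not silently ignored.
func appendLine(path, line string) (err error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer func() {
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	_, err = fmt.Fprintln(f, line)
	return err
}

func main() {
	path := examplePath()
	if err := appendLine(path, "router started"); err != nil {
		fmt.Fprintln(os.Stderr, "append failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote", path)
}
```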

Comment on lines 35 to 43
# check fd-router binary
if [ ! -x "${FD_ROUTER_BIN}" ]; then
echo "⚠️ fd-router not found at ${FD_ROUTER_BIN}, downloading..."
mkdir -p "${FD_BIN_DIR}"
wget -q --no-proxy "${FD_ROUTER_URL}" -O "${FD_ROUTER_BIN}" || {
echo "❌ Failed to download fd-router"
exit 1
}
chmod +x "${FD_ROUTER_BIN}"

Copilot AI Jan 6, 2026

This script automatically downloads a precompiled fd-router binary from https://paddle-qa.bj.bcebos.com/FastDeploy/fd-router into /usr/local/bin and executes it without any integrity verification (no checksum, signature, or pinned hash). If an attacker can tamper with the download path (via DNS poisoning, compromised storage, or MITM within the network), they can deliver a malicious binary that will be run with the user's privileges. Add a strong integrity check (e.g., pinned hash/signature validation or a versioned, authenticated distribution mechanism) before marking the binary executable and starting it.
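
A pinned-hash check of the kind suggested above can be sketched as follows. The script under review is shell; this Go version only illustrates the comparison step. It assumes the expected SHA-256 digest is published through a separate trusted channel, and `fileSHA256`, `verifySHA256`, and `writeTemp` are hypothetical names. In `main`, the digest is computed from a temporary file purely to keep the demo self-contained.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// fileSHA256 returns the lowercase hex SHA-256 digest of the file at path.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close() // read-only handle, so the Close error is not significant
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// verifySHA256 returns an error unless the file's digest equals wantHex.
func verifySHA256(path, wantHex string) error {
	got, err := fileSHA256(path)
	if err != nil {
		return err
	}
	if !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
		return fmt.Errorf("checksum mismatch for %s: got %s", path, got)
	}
	return nil
}

// writeTemp writes content to a new temporary file and returns its path.
func writeTemp(content string) (string, error) {
	f, err := os.CreateTemp("", "fd-router-example-*")
	if err != nil {
		return "", err
	}
	if _, err := f.WriteString(content); err != nil {
		f.Close()
		return "", err
	}
	return f.Name(), f.Close()
}

func main() {
	path, err := writeTemp("pretend this is the downloaded binary")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer os.Remove(path)

	// In real use the pinned digest would come from a trusted source,
	// not from the downloaded file itself.
	pinned, err := fileSHA256(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("matching hash:", verifySHA256(path, pinned))
	fmt.Println("tampered hash:", verifySHA256(path, strings.Repeat("0", 64)))
}
```

The binary would be marked executable and started only after a check like `verifySHA256` returns nil.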

@mouxinqq mouxinqq changed the title from "[Feature] add golang router" to "[Feature] Add Golang-based Router for Request Scheduling and Load Balancing" Jan 6, 2026
Jiang-Jia-Jun previously approved these changes Jan 6, 2026
Jiang-Jia-Jun previously approved these changes Jan 7, 2026
codecov-commenter commented Jan 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@7ad5737). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5882   +/-   ##
==========================================
  Coverage           ?   66.59%           
==========================================
  Files              ?      347           
  Lines              ?    44467           
  Branches           ?     6835           
==========================================
  Hits               ?    29612           
  Misses             ?    12670           
  Partials           ?     2185           
Flag Coverage Δ
GPU 66.59% <ø> (?)

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 0a92e96 into PaddlePaddle:develop Jan 7, 2026
15 of 21 checks passed