Skip to content

Conversation

@xiaolei373
Copy link
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick,PR标题需遵循格式,在最开始加上[Cherry-Pick]标签,以及最后面加上原PR ID,例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
Copy link

paddle-bot bot commented Dec 7, 2025

Thanks for your contribution!

@codecov-commenter
Copy link

Codecov Report

❌ Patch coverage is 66.15679% with 177 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@c3a8a16). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/metrics/trace.py 71.39% 87 Missing and 32 partials ⚠️
fastdeploy/engine/common_engine.py 6.25% 15 Missing ⚠️
fastdeploy/entrypoints/openai/api_server.py 53.84% 12 Missing ⚠️
fastdeploy/entrypoints/openai/serving_chat.py 40.00% 10 Missing and 2 partials ⚠️
...astdeploy/entrypoints/openai/serving_completion.py 40.00% 10 Missing and 2 partials ⚠️
fastdeploy/output/token_processor.py 63.63% 4 Missing ⚠️
fastdeploy/engine/request.py 33.33% 2 Missing ⚠️
fastdeploy/entrypoints/cli/tokenizer.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5417   +/-   ##
==========================================
  Coverage           ?   59.25%           
==========================================
  Files              ?      327           
  Lines              ?    40915           
  Branches           ?     6225           
==========================================
  Hits               ?    24243           
  Misses             ?    14788           
  Partials           ?     1884           
Flag Coverage Δ
GPU 59.25% <66.15%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces fine-grained distributed tracing for FastDeploy using OpenTelemetry to track request latency across different stages (preprocessing, scheduling, prefill, decode, postprocessing). This is Part 1 of the tracing implementation.

Key Changes:

  • Implemented comprehensive OpenTelemetry-based tracing infrastructure with span context propagation
  • Added tracing integration points across API server, scheduler, and token processor
  • Provided documentation and example configurations for Jaeger/OTel Collector setup

Reviewed changes

Copilot reviewed 31 out of 32 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
fastdeploy/metrics/trace.py New comprehensive tracing implementation with span management and context propagation
tests/metrics/test_trace.py Extensive test coverage for tracing functionality
fastdeploy/entrypoints/openai/api_server.py Integrated tracing initialization and span decorators
fastdeploy/entrypoints/openai/serving_chat.py Added request tracing start/finish and postprocessing spans
fastdeploy/entrypoints/openai/serving_completion.py Added request tracing start/finish and postprocessing spans
fastdeploy/entrypoints/engine_client.py Added preprocessing span tracking
fastdeploy/engine/common_engine.py Added scheduler span tracking and context propagation
fastdeploy/output/token_processor.py Added prefill/decode span tracking
fastdeploy/engine/request.py Added trace_carrier field to RequestOutput
fastdeploy/envs.py Added OTLP exporter configuration variables
docs/zh/observability/trace.md Chinese documentation for tracing features
docs/observability/trace.md English documentation for tracing features
examples/observability/ Docker Compose examples for Prometheus, Grafana, Jaeger, and OTel Collector

}
)

# 统一填充 reqs_context 的 Root Span 信息
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a Chinese comment in the code: # 初始化用于存储 Upstream Context的变量. All comments in the codebase should be in English for consistency and maintainability. Please translate this and other Chinese comments (lines 437, 480, 483, 501, 509, 513, 530) to English.

Suggested change
# 统一填充 reqs_context 的 Root Span 信息
# Consistently populate the Root Span information in reqs_context

Copilot uses AI. Check for mistakes.
Comment on lines 100 to 105
# with open(log_path, "w") as logfile:
with open(log_path, "w"):
process = subprocess.Popen(
cmd,
stdout=logfile,
stderr=subprocess.STDOUT,
# stdout=logfile,
# stderr=subprocess.STDOUT,
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stdout and stderr redirects have been commented out without explanation. This means the subprocess output is not being captured to the log file, which could make debugging failures difficult. Either restore the redirects or add a comment explaining why they were removed. If this change is intentional for debugging, it should not be in the final PR.

Copilot uses AI. Check for mistakes.
Comment on lines +716 to +708
# trace_carrier = tracing.trace_get_proc_propagate_context(rid=rid)

Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out code should be removed. The line # trace_carrier = tracing.trace_get_proc_propagate_context(rid=rid) suggests incomplete implementation or debugging code that should either be removed or uncommented if it's needed.

Suggested change
# trace_carrier = tracing.trace_get_proc_propagate_context(rid=rid)

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,785 @@
"""
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description is missing key information. According to the custom guidelines, the PR description should explain why these modifications are being made and what problem is being solved. The current description only lists checklist items. Please provide:

  1. The motivation for adding fine-grained tracing
  2. What problems this solves (e.g., debugging performance bottlenecks, request flow analysis)
  3. An overview of the implementation approach

Copilot generated this review using guidance from repository custom instructions.
self._processor.force_flush(timeout_millis)


def lable_span(request):
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name has a spelling error: 'lable_span' should be 'label_span'.

Suggested change
def lable_span(request):
def label_span(request):

Copilot uses AI. Check for mistakes.
for task in tasks:
start_span_request("DEQUEUE", task, trace.SpanKind.CONSUMER)
# for task in tasks:
# start_span_request("DEQUEUE", task, trace.SpanKind.CONSUMER)
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment appears to contain commented-out code.

Suggested change
# start_span_request("DEQUEUE", task, trace.SpanKind.CONSUMER)

Copilot uses AI. Check for mistakes.
from opentelemetry import trace
from opentelemetry.propagate import extract

import fastdeploy.metrics.trace as tracing
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Module 'fastdeploy.metrics.trace' is imported with both 'import' and 'import from'.

Copilot uses AI. Check for mistakes.
import threading
import time
import unittest
from unittest import mock
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Module 'unittest' is imported with both 'import' and 'import from'.

Suggested change
from unittest import mock

Copilot uses AI. Check for mistakes.
Comment on lines +225 to +229
with mock.patch("fastdeploy.metrics.trace.logger"):
trace.process_tracing_init()
# Should log error but not crash
# Check if error was called (may not always be called depending on implementation)
pass
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary 'pass' statement.

Suggested change
with mock.patch("fastdeploy.metrics.trace.logger"):
trace.process_tracing_init()
# Should log error but not crash
# Check if error was called (may not always be called depending on implementation)
pass
with mock.patch("fastdeploy.metrics.trace.logger") as mock_logger:
trace.process_tracing_init()
# Should log error but not crash
# Check if error was called (may not always be called depending on implementation)
assert mock_logger.error.called

Copilot uses AI. Check for mistakes.

# Should log warnings but not crash
# Check if warning was called (may not always be called depending on implementation)
pass
Copy link

Copilot AI Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary 'pass' statement.

Suggested change
pass

Copilot uses AI. Check for mistakes.
@xiaolei373 xiaolei373 force-pushed the support_tracing_feature_part1 branch 3 times, most recently from 16220c9 to 34ad14f Compare December 8, 2025 09:52
@xiaolei373 xiaolei373 force-pushed the support_tracing_feature_part1 branch from 34ad14f to bc1decd Compare December 8, 2025 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants