[Feature] Tracing: Fine-Grained Tracing for Request Latency Part1 #5417
base: develop
Thanks for your contribution!
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #5417 +/- ##
==========================================
Coverage ? 59.25%
==========================================
Files ? 327
Lines ? 40915
Branches ? 6225
==========================================
Hits ? 24243
Misses ? 14788
Partials ? 1884
Flags with carried forward coverage won't be shown.
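The figures in the coverage diff are internally consistent, which a few lines of Python can confirm. The formula used here (hits divided by total lines) is an assumption about how the percentage was derived; the numbers themselves are copied from the report.

```python
# Figures copied from the Codecov report.
hits, misses, partials = 24243, 14788, 1884
lines = 40915

# Assumption: every tracked line is counted as a hit, a miss, or a partial.
assert hits + misses + partials == lines

# Assumption: coverage is hits / lines, rounded to two decimals.
coverage_pct = round(hits / lines * 100, 2)
print(coverage_pct)
```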
Pull request overview
This PR introduces fine-grained distributed tracing for FastDeploy using OpenTelemetry to track request latency across different stages (preprocessing, scheduling, prefill, decode, postprocessing). This is Part 1 of the tracing implementation.
Key Changes:
- Implemented comprehensive OpenTelemetry-based tracing infrastructure with span context propagation
- Added tracing integration points across API server, scheduler, and token processor
- Provided documentation and example configurations for Jaeger/OTel Collector setup
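Span context propagation generally means serializing the active trace and span IDs into a small dictionary (a "carrier") that travels with the request from one component to the next. The sketch below illustrates the idea with the W3C `traceparent` format, using only the standard library. It is not the code in `fastdeploy/metrics/trace.py`; the helper names `new_ids`, `inject_context`, and `extract_context` are hypothetical.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_ids() -> Tuple[str, str]:
    """Return a random (trace_id, span_id) pair as lowercase hex strings."""
    return secrets.token_hex(16), secrets.token_hex(8)


def inject_context(trace_id: str, span_id: str, carrier: Dict[str, str]) -> None:
    """Hypothetical helper: write the current span context into the carrier."""
    carrier["traceparent"] = f"00-{trace_id}-{span_id}-01"  # 01 = sampled


def extract_context(carrier: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: read (trace_id, parent_span_id) back, or None."""
    match = _TRACEPARENT.match(carrier.get("traceparent", ""))
    if match is None:
        return None
    return match.group(1), match.group(2)


# API-server side: start a trace and attach it to the outgoing request.
trace_id, root_span_id = new_ids()
carrier: Dict[str, str] = {}
inject_context(trace_id, root_span_id, carrier)

# Engine side: recover the parent context so a child span can be linked to it.
parent = extract_context(carrier)
```

A carrier that is a plain dictionary of strings can be stored on a request object and serialized across process boundaries, which is presumably why the PR adds a `trace_carrier` field to `RequestOutput`.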
Reviewed changes
Copilot reviewed 31 out of 32 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| fastdeploy/metrics/trace.py | New comprehensive tracing implementation with span management and context propagation |
| tests/metrics/test_trace.py | Extensive test coverage for tracing functionality |
| fastdeploy/entrypoints/openai/api_server.py | Integrated tracing initialization and span decorators |
| fastdeploy/entrypoints/openai/serving_chat.py | Added request tracing start/finish and postprocessing spans |
| fastdeploy/entrypoints/openai/serving_completion.py | Added request tracing start/finish and postprocessing spans |
| fastdeploy/entrypoints/engine_client.py | Added preprocessing span tracking |
| fastdeploy/engine/common_engine.py | Added scheduler span tracking and context propagation |
| fastdeploy/output/token_processor.py | Added prefill/decode span tracking |
| fastdeploy/engine/request.py | Added trace_carrier field to RequestOutput |
| fastdeploy/envs.py | Added OTLP exporter configuration variables |
| docs/zh/observability/trace.md | Chinese documentation for tracing features |
| docs/observability/trace.md | English documentation for tracing features |
| examples/observability/ | Docker Compose examples for Prometheus, Grafana, Jaeger, and OTel Collector |
| } | ||
| ) | ||
| # 统一填充 reqs_context 的 Root Span 信息 |
Copilot AI, Dec 8, 2025
There's a Chinese comment in the code: # 初始化用于存储 Upstream Context的变量. All comments in the codebase should be in English for consistency and maintainability. Please translate this and other Chinese comments (lines 437, 480, 483, 501, 509, 513, 530) to English.
| # 统一填充 reqs_context 的 Root Span 信息 | |
| # Consistently populate the Root Span information in reqs_context |
| # with open(log_path, "w") as logfile: | ||
| with open(log_path, "w"): | ||
| process = subprocess.Popen( | ||
| cmd, | ||
| stdout=logfile, | ||
| stderr=subprocess.STDOUT, | ||
| # stdout=logfile, | ||
| # stderr=subprocess.STDOUT, |
Copilot AI, Dec 8, 2025
The stdout and stderr redirects have been commented out without explanation. This means the subprocess output is not being captured to the log file, which could make debugging failures difficult. Either restore the redirects or add a comment explaining why they were removed. If this change is intentional for debugging, it should not be in the final PR.
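Restoring the redirects would look roughly like the following. This is a sketch, not the project's launcher: the child command and the temporary `log_path` are placeholders chosen so the snippet runs on its own.

```python
import os
import subprocess
import sys
import tempfile

# Placeholder command: a child process that writes to both streams.
cmd = [sys.executable, "-c", "import sys; print('out'); print('err', file=sys.stderr)"]

fd, log_path = tempfile.mkstemp(suffix=".log")
os.close(fd)

# Bind the file handle to a name so it can be passed to Popen.
with open(log_path, "w") as logfile:
    process = subprocess.Popen(
        cmd,
        stdout=logfile,            # child stdout goes to the log file
        stderr=subprocess.STDOUT,  # child stderr is merged into stdout
    )
    return_code = process.wait()

with open(log_path) as f:
    captured = f.read()
os.remove(log_path)
```

With `with open(log_path, "w"):` and no `as logfile`, the file is still created and truncated, but nothing is written to it because the child inherits the parent's streams instead.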
| # trace_carrier = tracing.trace_get_proc_propagate_context(rid=rid) | ||
Copilot AI, Dec 8, 2025
Commented-out code should be removed. The line # trace_carrier = tracing.trace_get_proc_propagate_context(rid=rid) suggests incomplete implementation or debugging code that should either be removed or uncommented if it's needed.
| # trace_carrier = tracing.trace_get_proc_propagate_context(rid=rid) |
| @@ -0,0 +1,785 @@ | |||
| """ | |||
Copilot AI, Dec 8, 2025
The PR description is missing key information. According to the custom guidelines, the PR description should explain why these modifications are being made and what problem is being solved. The current description only lists checklist items. Please provide:
- The motivation for adding fine-grained tracing
- What problems this solves (e.g., debugging performance bottlenecks, request flow analysis)
- An overview of the implementation approach
| self._processor.force_flush(timeout_millis) | ||
| def lable_span(request): |
Copilot AI, Dec 8, 2025
Function name has a spelling error: 'lable_span' should be 'label_span'.
| def lable_span(request): | |
| def label_span(request): |
| for task in tasks: | ||
| start_span_request("DEQUEUE", task, trace.SpanKind.CONSUMER) | ||
| # for task in tasks: | ||
| # start_span_request("DEQUEUE", task, trace.SpanKind.CONSUMER) |
Copilot AI, Dec 8, 2025
This comment appears to contain commented-out code.
| # start_span_request("DEQUEUE", task, trace.SpanKind.CONSUMER) |
| from opentelemetry import trace | ||
| from opentelemetry.propagate import extract | ||
| import fastdeploy.metrics.trace as tracing |
Copilot AI, Dec 8, 2025
Module 'fastdeploy.metrics.trace' is imported with both 'import' and 'import from'.
| import threading | ||
| import time | ||
| import unittest | ||
| from unittest import mock |
Copilot AI, Dec 8, 2025
Module 'unittest' is imported with both 'import' and 'import from'.
| from unittest import mock |
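If the `from unittest import mock` line is dropped, the test module still needs `unittest.mock` to be loaded; `import unittest` on its own may not do that. One option, sketched here with a throwaway test case (`ExampleTest` is hypothetical), is to import the submodule explicitly and refer to it by its full name.

```python
import os
import unittest
import unittest.mock  # loads the submodule; no 'from unittest import ...' needed


class ExampleTest(unittest.TestCase):
    def test_patch(self):
        # Reference mock through the package instead of a separate name.
        with unittest.mock.patch("os.getcwd", return_value="/tmp/example"):
            self.assertEqual(os.getcwd(), "/tmp/example")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```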
| with mock.patch("fastdeploy.metrics.trace.logger"): | ||
| trace.process_tracing_init() | ||
| # Should log error but not crash | ||
| # Check if error was called (may not always be called depending on implementation) | ||
| pass |
Copilot AI, Dec 8, 2025
Unnecessary 'pass' statement.
| with mock.patch("fastdeploy.metrics.trace.logger"): | |
| trace.process_tracing_init() | |
| # Should log error but not crash | |
| # Check if error was called (may not always be called depending on implementation) | |
| pass | |
| with mock.patch("fastdeploy.metrics.trace.logger") as mock_logger: | |
| trace.process_tracing_init() | |
| # Should log error but not crash | |
| # Check if error was called (may not always be called depending on implementation) | |
| assert mock_logger.error.called |
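The suggested assertion only works if the mock is bound to a name and the code under test really calls `logger.error`. The sketch below shows the pattern in isolation; `init_tracing` and its failure condition are hypothetical stand-ins for `trace.process_tracing_init()`, whose body is not shown in this review.

```python
import logging
from unittest import mock

logger = logging.getLogger("tracing_example")


def init_tracing(endpoint: str) -> bool:
    """Hypothetical initializer: logs an error and returns False on bad config."""
    if not endpoint:
        logger.error("tracing disabled: no exporter endpoint configured")
        return False
    return True


# Patch only the 'error' method so the real logger object stays in place.
with mock.patch.object(logger, "error") as mock_error:
    ok = init_tracing("")

error_logged = mock_error.called
```

If the real initializer does not always log an error, as the inline comment in the test hints, the assertion would need to be tied to a specific failure condition rather than applied unconditionally.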
|
|
||
| # Should log warnings but not crash | ||
| # Check if warning was called (may not always be called depending on implementation) | ||
| pass |
Copilot AI, Dec 8, 2025
Unnecessary 'pass' statement.
| pass |
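For the warning case, `unittest.TestCase.assertLogs` is an alternative to mocking the logger: the test fails if nothing at the given level is logged. The function `load_config` and the logger name below are hypothetical examples, not code from the PR.

```python
import logging
import unittest

logger = logging.getLogger("tracing_example.config")


def load_config(sample_ratio: float) -> float:
    """Hypothetical helper: clamp an out-of-range ratio and warn about it."""
    if not 0.0 <= sample_ratio <= 1.0:
        logger.warning("sample_ratio %s out of range, clamping", sample_ratio)
        return min(max(sample_ratio, 0.0), 1.0)
    return sample_ratio


class ConfigWarningTest(unittest.TestCase):
    def test_warns_on_bad_ratio(self):
        # assertLogs raises AssertionError if no WARNING (or higher) is emitted.
        with self.assertLogs("tracing_example.config", level="WARNING") as captured:
            value = load_config(1.5)
        self.assertEqual(value, 1.0)
        self.assertIn("out of range", captured.output[0])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConfigWarningTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```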
16220c9 to 34ad14f
34ad14f to bc1decd
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.