feat: GPU Telemetry Integration - Milestone 2 #316
base: feature/gpu-telemetry
Conversation
Walkthrough

Introduces a GPU telemetry subsystem: new config group and user setting, telemetry enums/messages/models, telemetry manager and data collector, records aggregation and processing, controller coordination and shutdown logic, and telemetry export paths (console, CSV, JSON). Updates metrics base validation and energy unit, constants, and tests to reflect telemetry data structures and flows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant User as User
participant SystemController as SystemController
participant ServiceMgr as ServiceManager
participant TelemetryMgr as TelemetryManager
participant RecordsMgr as RecordsManager
participant Exporters as ExporterManager
User->>SystemController: start()
SystemController->>ServiceMgr: start()
SystemController->>TelemetryMgr: start()
TelemetryMgr->>SystemController: publish(TelemetryStatusMessage)
alt telemetry enabled and endpoints reachable
loop collection intervals
TelemetryMgr->>RecordsMgr: publish(TelemetryRecordsMessage)
RecordsMgr->>RecordsMgr: aggregate telemetry hierarchy
end
else telemetry disabled or none reachable
Note over SystemController: Will not wait for telemetry results
end
par profiling completes
RecordsMgr-->>SystemController: publish(ProcessRecordsResultMessage)
SystemController->>SystemController: mark profile_results_received
and telemetry processing completes
RecordsMgr-->>SystemController: publish(ProcessTelemetryResultMessage)
SystemController->>SystemController: store telemetry_results
end
SystemController->>SystemController: _check_and_trigger_shutdown()
alt ready to shutdown
SystemController->>Exporters: export(results, telemetry_results)
Exporters->>Exporters: console/csv/json emit telemetry sections
SystemController-->>User: done
end
```

```mermaid
sequenceDiagram
autonumber
participant ExporterManager
participant Csv as CsvExporter
participant Json as JsonExporter
participant Console as GPUTelemetryConsoleExporter
ExporterManager->>Csv: init(ExporterConfig{telemetry_results})
ExporterManager->>Json: init(ExporterConfig{telemetry_results})
ExporterManager->>Console: init(ExporterConfig{telemetry_results})
par CSV
Csv->>Csv: _generate_csv_content(records, telemetry_results)
Csv->>Csv: _write_telemetry_section()
and JSON
Json->>Json: build telemetry_export_data
and Console
Console->>Console: get_renderable(telemetry_results)
end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes

Pre-merge checks and finishing touches: ❌ 1 failed check (warning), ✅ 2 passed checks
@coderabbitai review

✅ Actions performed: review triggered.
Pull Request Overview
This PR implements GPU telemetry integration as Milestone 2 of the telemetry project, adding comprehensive service architecture integration, metrics processing pipeline, and export capabilities to AIPerf's core service system.
Key changes include:
- Service architecture integration with TelemetryManager as an optional service in SystemController's lifecycle
- Complete metrics processing pipeline with statistical analysis and export integration
- Hierarchical telemetry data storage enabling real-time access and final aggregation
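The hierarchical storage called out in the last bullet can be pictured with a minimal sketch. This is an illustration only, assuming an endpoint → GPU → metric nesting; the real models live in `aiperf/common/models/telemetry_models.py` and may be shaped differently.

```python
from collections import defaultdict

class TelemetryHierarchy:
    """Illustrative only: nested storage for telemetry samples."""

    def __init__(self) -> None:
        # endpoint URL -> GPU id -> metric tag -> raw samples
        self._data: dict[str, dict[str, dict[str, list[float]]]] = defaultdict(
            lambda: defaultdict(lambda: defaultdict(list))
        )

    def add_sample(self, endpoint: str, gpu_id: str, metric: str, value: float) -> None:
        self._data[endpoint][gpu_id][metric].append(value)

    def samples(self, endpoint: str, gpu_id: str, metric: str) -> list[float]:
        # Real-time access path: read current samples before final aggregation.
        return self._data[endpoint][gpu_id][metric]
```

Real-time access falls out of the nesting: readers can walk endpoint → GPU → metric at any point during collection, while final aggregation reduces each leaf list to summary statistics.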
Reviewed Changes
Copilot reviewed 45 out of 45 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
aiperf/gpu_telemetry/ | New telemetry module with TelemetryManager service and TelemetryDataCollector |
aiperf/post_processors/telemetry_results_processor.py | Telemetry-specific results processor implementing TelemetryResultsProcessorProtocol |
aiperf/metrics/gpu_telemetry_types/ | GPU telemetry metric definitions extending BaseTelemetryMetric (see the sketch after this table)
aiperf/common/models/telemetry_models.py | Comprehensive telemetry data models with hierarchical storage |
aiperf/exporters/ | Enhanced exporters with telemetry data integration for CSV, JSON, and console output |
aiperf/records/records_manager.py | Extended RecordsManager to handle telemetry records alongside inference metrics |
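For the `gpu_telemetry_types` row above, here is a hypothetical sketch of what one metric definition might look like. The actual `BaseTelemetryMetric` contract is not shown on this page, so every field and method here is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass
class GPUPowerUsageMetric:
    # Hypothetical fields; the real contract is defined in
    # aiperf/metrics/base_telemetry_metric.py and may differ.
    tag: str = "gpu_power_usage"
    unit: str = "W"
    dcgm_field: str = "DCGM_FI_DEV_POWER_USAGE"  # standard DCGM exporter field name

    def parse(self, raw_value: float) -> float:
        # DCGM reports power in watts; no conversion needed for this metric.
        return raw_value
```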
Docstrings generation was requested by @ilana-n in #316 (comment). The following files were modified:

* `aiperf/common/config/user_config.py`
* `aiperf/common/enums/metric_enums.py`
* `aiperf/common/messages/telemetry_messages.py`
* `aiperf/common/models/telemetry_models.py`
* `aiperf/common/protocols.py`
* `aiperf/controller/system_controller.py`
* `aiperf/exporters/console_metrics_exporter.py`
* `aiperf/exporters/csv_exporter.py`
* `aiperf/exporters/exporter_manager.py`
* `aiperf/exporters/gpu_telemetry_console_exporter.py`
* `aiperf/exporters/json_exporter.py`
* `aiperf/gpu_telemetry/telemetry_data_collector.py`
* `aiperf/gpu_telemetry/telemetry_manager.py`
* `aiperf/metrics/base_metric.py`
* `aiperf/metrics/base_telemetry_metric.py`
* `aiperf/metrics/gpu_telemetry_types/energy_consumption_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_memory_used_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_power_usage_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_temperature_metric.py`
* `aiperf/metrics/gpu_telemetry_types/gpu_utilization_metric.py`
* `aiperf/metrics/gpu_telemetry_types/memory_clock_frequency_metric.py`
* `aiperf/metrics/gpu_telemetry_types/memory_temperature_metric.py`
* `aiperf/metrics/gpu_telemetry_types/sm_clock_frequency_metric.py`
* `aiperf/post_processors/telemetry_results_processor.py`
* `aiperf/records/records_manager.py`
* `tests/gpu_telemetry/conftest.py`
* `tests/gpu_telemetry/test_telemetry_data_collector.py`
Thread safety:
    Uses self._shutdown_lock to prevent race conditions when ProcessRecordsResultMessage
    and ProcessTelemetryResultMessage arrive concurrently. The lock ensures atomic
    check-and-set of _shutdown_triggered, preventing double-triggering of stop().
Scenario 1: Both messages arrive simultaneously (telemetry enabled & ready)
- Without lock: Both tasks pass the _shutdown_triggered check, both call stop() → double-trigger
- With lock: First task acquires lock, sets _shutdown_triggered = True, calls stop(). Second task waits, then hits early-return guard → single trigger
Scenario 2: Profile results arrive first, telemetry still pending
- Task acquires lock, evaluates telemetry_ready_for_shutdown = (not True or None) = False
- Logs "Waiting for telemetry results..." and returns without setting _shutdown_triggered
- When telemetry arrives later, second call acquires lock and correctly triggers shutdown
Scenario 3: Telemetry disabled
- Single call evaluates telemetry_ready_for_shutdown = (not False or ...) = True
- Triggers shutdown immediately (lock provides no overhead in sequential case)
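To make the three scenarios concrete, here is a minimal asyncio sketch of the lock-guarded check-and-set. Names follow the discussion above (`_shutdown_lock`, `_shutdown_triggered`, `stop()`), but the real `SystemController` signatures may differ:

```python
import asyncio

class SystemController:
    """Illustrative sketch of the shutdown coordination described above."""

    def __init__(self, telemetry_enabled: bool) -> None:
        self._shutdown_lock = asyncio.Lock()
        self._shutdown_triggered = False
        self._profile_results_received = False
        self._telemetry_results = None
        self._telemetry_enabled = telemetry_enabled

    async def _check_and_trigger_shutdown(self) -> None:
        async with self._shutdown_lock:
            if self._shutdown_triggered:
                return  # early-return guard: a concurrent caller already won
            telemetry_ready_for_shutdown = (
                not self._telemetry_enabled or self._telemetry_results is not None
            )
            if not (self._profile_results_received and telemetry_ready_for_shutdown):
                return  # e.g. profile results arrived first; wait for telemetry
            self._shutdown_triggered = True  # atomic check-and-set under the lock
        await self.stop()  # exactly one caller reaches this line

    async def stop(self) -> None:
        ...
```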
- Add comprehensive GPU telemetry data collection and monitoring system
- Implement TelemetryManager and TelemetryDataCollector for real-time GPU metrics
- Create GPU telemetry console exporter for CLI display of telemetry data
- Add telemetry results post-processor for data analysis and aggregation
- Implement comprehensive unit test coverage for all telemetry components
- Add integration tests for end-to-end telemetry functionality
- Update system controller to integrate GPU telemetry with existing infrastructure
- Extend existing exporters to support telemetry data formats
- Add comprehensive error handling and async callback support
- Include code cleanup, linting fixes, and documentation improvements
Still have more to go through. Submitting my comments so far. Looking really great and I am excited to test it out!
Have added some more comments and feedback. I appreciate the well-documented code. Thanks!
Pull Request Overview
Copilot reviewed 47 out of 47 changed files in this pull request and generated 3 comments.
```python
await self._check_if_all_records_received()
```
The decorator changed from `@on_pull_message` to `@on_message`. This change in message handling pattern should be documented or explained, as it affects how telemetry records are processed in the system architecture.
```python
# Note: We use @on_message (push-based delivery) for TELEMETRY_RECORDS instead of @on_pull_message.
# This is intentional: telemetry records are pushed to the RecordsManager as they are generated,
# ensuring timely processing and reducing latency. In contrast, metric records use a pull-based
# approach (@on_pull_message) to allow for controlled, demand-driven retrieval. This distinction
# is important for system performance and reliability.
```
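For illustration, a self-contained sketch of what the push-based registration might look like; the real `@on_message` decorator is framework machinery whose signature isn't shown here, so it is stubbed with a simple marker:

```python
from typing import Any, Callable

def on_message(message_type: str) -> Callable:
    # Stand-in for the framework decorator: tag the handler for push delivery.
    def wrap(fn: Callable) -> Callable:
        fn._delivery = ("push", message_type)
        return fn
    return wrap

class RecordsManager:
    @on_message("TELEMETRY_RECORDS")
    async def _on_telemetry_records(self, message: Any) -> None:
        # Push-based: invoked as soon as each TelemetryRecordsMessage arrives,
        # so aggregation happens with minimal latency.
        self._aggregate(message.records)
        await self._check_if_all_records_received()

    def _aggregate(self, records: list) -> None: ...

    async def _check_if_all_records_received(self) -> None: ...
```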
GPU Telemetry Milestone 2 - Service Integration
Summary
This PR builds upon the GPU telemetry work from the previous milestone and delivers integration with AIPerf's core service architecture.
Building on Previous Work: This milestone extends the initial telemetry data collection foundation by fully integrating the TelemetryManager as an AIPerf service, implementing the complete telemetry metrics processing pipeline, and exporting statistical analysis to the console (with --verbose on) and to the CSV and JSON file exports (always).
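As a rough picture of the per-metric statistical summary that lands in those exports (a hypothetical sketch; the fields the actual telemetry results processor computes may differ):

```python
import statistics

def summarize(samples: list[float]) -> dict[str, float]:
    # Reduce one metric's raw samples to the kind of stats shown in exports.
    ordered = sorted(samples)
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p99": ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))],
    }

# e.g. GPU power samples in watts collected over one run
print(summarize([62.0, 71.5, 68.2, 90.1]))
```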
Components of the project that are NOT included yet and will be added in future PRs:
Key Features Added in This Milestone
Service Architecture Integration
Metrics Processing Pipeline
Architecture Improvements
Export & Results Integration
Running and Testing
Automatic Integration
GPU telemetry collection is now automatically enabled when DCGM is available:
```bash
# Standard AIPerf command - telemetry collection happens automatically
# and is displayed in the console when --verbose is on
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --concurrency 10 \
  --request-count 100 \
  --verbose
```
```bash
# AIPerf command with custom DCGM endpoint URLs
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --streaming \
  --concurrency 10 \
  --request-count 100 \
  --server-metrics-url http://localhost:9400/metrics http://localhost:9401/metrics http://localhost:9402/metrics \
  --verbose
```
Prerequisites

- DCGM exporter running and reachable at http://localhost:9401/metrics (default); see tutorial here
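A quick way to sanity-check that the exporter endpoint is reachable (a minimal sketch, assuming the default URL above; the response should be Prometheus-format text containing DCGM_FI_* metric lines):

```python
import urllib.request

# Fetch the metrics page and show the first few lines.
with urllib.request.urlopen("http://localhost:9401/metrics", timeout=5) as resp:
    body = resp.read().decode()

print(body.splitlines()[:5])
```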
Example Output
Verification
Service Registration
You should see TelemetryManager registration in the logs:
Testing & Development
Unit Testing
```bash
cd /home/inguyen/code/aiperf
python -m pytest tests/gpu_telemetry/ -v
python -m pytest tests/common/models/test_telemetry_models.py -v
python -m pytest tests/post_processors/test_telemetry_results_processor.py -v
```