[countersyncd]: Add retry between client and otel collector #4131
prsunny merged 12 commits into sonic-net:master
Signed-off-by: Janet Cui <janet970527@gmail.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This pull request adds retry logic to the countersyncd OpenTelemetry exporter to handle transient connection failures between the client and the OTEL collector. The implementation attempts to reconnect for up to 5 minutes (30 attempts with 10-second intervals) before giving up.
Changes:
- Added a `consecutive_failures` counter to track sequential export failures
- Implemented retry loop with exponential-style backoff (1 second initial, then 10 seconds)
- Enhanced logging to provide feedback during retry attempts and upon successful reconnection
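The retry behaviour summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `export_with_retry`, `retry_delay`, `MAX_ATTEMPTS`, `INITIAL_DELAY`, and `RETRY_DELAY` are hypothetical, the exporter is modelled as a plain closure, and the sketch assumes "30 attempts" means 30 export calls in total. Only the `consecutive_failures` counter and the 1-second/10-second delays come from the review summary.

```rust
use std::time::Duration;

// Assumed limits, taken from the review summary rather than the actual diff.
const MAX_ATTEMPTS: u32 = 30;
const INITIAL_DELAY: Duration = Duration::from_secs(1);
const RETRY_DELAY: Duration = Duration::from_secs(10);

/// Delay to wait after the given number of consecutive failures:
/// 1 second after the first failure, 10 seconds after every later one.
fn retry_delay(consecutive_failures: u32) -> Duration {
    if consecutive_failures <= 1 {
        INITIAL_DELAY
    } else {
        RETRY_DELAY
    }
}

/// Calls `export` until it succeeds or MAX_ATTEMPTS calls have failed.
/// `sleep` is injected so the loop can be exercised without real waiting.
/// Returns the number of failures seen before success, or the last error.
fn export_with_retry<E, F, S>(mut export: F, mut sleep: S) -> Result<u32, E>
where
    F: FnMut() -> Result<(), E>,
    S: FnMut(Duration),
{
    let mut consecutive_failures: u32 = 0;
    loop {
        match export() {
            // Success: report how many failures preceded it.
            Ok(()) => return Ok(consecutive_failures),
            Err(err) => {
                consecutive_failures += 1;
                if consecutive_failures >= MAX_ATTEMPTS {
                    // Give up and hand the last error back to the caller.
                    return Err(err);
                }
                sleep(retry_delay(consecutive_failures));
            }
        }
    }
}
```

In a blocking caller, `std::thread::sleep` could be passed as the `sleep` argument; an async exporter would need an async-aware delay instead, which this sketch does not cover.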
…logics Signed-off-by: Janet Cui <janet970527@gmail.com>
Cherry-pick PR to msft-202412: Azure/sonic-swss.msft#197
…t#4131) What I did Added a retry mechanism in countersyncd for communication between the OTEL client and the Open Telemetry collector. Why I did it To improve the robustness of countersyncd by handling transient connection failures, ensuring metrics are reliably sent even if the otel collector is temporarily unavailable. Signed-off-by: ganglyu <glv@nvidia.com>
…t#4131) What I did Added a retry mechanism in countersyncd for communication between the OTEL client and the Open Telemetry collector. Why I did it To improve the robustness of countersyncd by handling transient connection failures, ensuring metrics are reliably sent even if the otel collector is temporarily unavailable. Signed-off-by: Yash Pandit <ypcisco@gmail.com>
@Janetxxx please manually create the 202511 PR, as there are cherry-pick conflicts.
@vmittal-msft it should be fixed now that the other PRs have been cherry-picked to 202511.
Cherry-pick PR to 202511: #4220
…t#4131) What I did Added a retry mechanism in countersyncd for communication between the OTEL client and the Open Telemetry collector. Why I did it To improve the robustness of countersyncd by handling transient connection failures, ensuring metrics are reliably sent even if the otel collector is temporarily unavailable. Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>
What I did
Added a retry mechanism in countersyncd for communication between the OTEL client and the OpenTelemetry collector.
Why I did it
To improve the robustness of countersyncd by handling transient connection failures, so that metrics are still delivered when the OTEL collector is temporarily unavailable.
How I verified it
Details if related