[countersyncd]: Add retry between client and otel collector #4131

Merged
prsunny merged 12 commits into sonic-net:master from Janetxxx:dev/jc/otel-reconnect
Jan 20, 2026

Janetxxx (Contributor) commented Jan 13, 2026

What I did
Added a retry mechanism in countersyncd for communication between the OTEL client and the OpenTelemetry collector.

Why I did it
To improve the robustness of countersyncd by handling transient connection failures, so that metrics are still sent reliably when the OTEL collector is temporarily unavailable.

How I verified it

Details if related

Signed-off-by: Janet Cui <janet970527@gmail.com>
Janetxxx requested a review from prsunny as a code owner January 13, 2026 07:54
Copilot AI review requested due to automatic review settings January 13, 2026 07:54
mssonicbld (Collaborator) commented

/azp run

Janetxxx removed the request for review from prsunny January 13, 2026 07:54
azure-pipelines commented

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull request overview

This pull request adds retry logic to the countersyncd OpenTelemetry exporter to handle transient connection failures between the client and the OTEL collector. The implementation attempts to reconnect for up to 5 minutes (30 attempts with 10-second intervals) before giving up.

Changes:

  • Added a consecutive_failures counter to track sequential export failures
  • Implemented retry loop with a stepped backoff (1 second initial delay, then 10 seconds)
  • Enhanced logging to provide feedback during retry attempts and upon successful reconnection
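
The retry behaviour summarized in this overview can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the countersyncd code from this PR: the names `export_with_retry`, `send`, and `sleep` are hypothetical, and the exact point at which the loop gives up is a guess. Only the figures (1-second first delay, 10-second later delays, 30 attempts) and the idea of a `consecutive_failures` counter are taken from the overview.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 30        # from the overview: 30 attempts
INITIAL_DELAY_S = 1.0    # from the overview: 1 second initial
RETRY_DELAY_S = 10.0     # from the overview: then 10 seconds


def export_with_retry(
    send: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds; return how many failures preceded success.

    Re-raises ConnectionError once MAX_ATTEMPTS attempts in a row have failed.
    Both `send` and `sleep` are hypothetical hooks used for illustration.
    """
    consecutive_failures = 0
    while True:
        try:
            send()
        except ConnectionError:
            consecutive_failures += 1
            if consecutive_failures >= MAX_ATTEMPTS:
                raise  # give up; the caller decides what happens next
            # Short pause after the first failure, longer pauses afterwards.
            delay = INITIAL_DELAY_S if consecutive_failures == 1 else RETRY_DELAY_S
            sleep(delay)
        else:
            return consecutive_failures


# Usage: a stand-in collector that rejects the first three exports.
# Delays are recorded instead of slept so the demo finishes immediately.
outcomes = iter([False, False, False, True])
recorded_waits = []


def flaky_send() -> None:
    if not next(outcomes):
        raise ConnectionError("collector unavailable")


failures = export_with_retry(flaky_send, sleep=recorded_waits.append)
print(failures, recorded_waits)  # prints: 3 [1.0, 10.0, 10.0]
```

Under this reading, a collector that never comes back costs 29 waits totalling 1 + 28 × 10 = 281 seconds, slightly below the 5 minutes (30 × 10 seconds) quoted above; the two descriptions in the overview do not pin down the schedule exactly, so treat the sketch as one plausible interpretation.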

…logics

Signed-off-by: Janet Cui <janet970527@gmail.com>
Janetxxx requested a review from Pterosaur January 19, 2026 03:16

prsunny merged commit 416a0eb into sonic-net:master Jan 20, 2026
16 checks passed
mssonicbld (Collaborator) commented

Cherry-pick PR to msft-202412: Azure/sonic-swss.msft#197

arpit-nexthop pushed a commit to nexthop-ai/sonic-swss that referenced this pull request Jan 21, 2026
…t#4131)

ganglyu pushed a commit to ganglyu/sonic-swss that referenced this pull request Jan 26, 2026
…t#4131)

Signed-off-by: ganglyu <glv@nvidia.com>
ypcisco pushed a commit to ypcisco/sonic-swss that referenced this pull request Jan 29, 2026
…t#4131)

Signed-off-by: Yash Pandit <ypcisco@gmail.com>
vmittal-msft (Contributor) commented

@Janetxxx please manually create 202511 PR as there are cherry-pick conflicts

DavidZagury (Contributor) commented Feb 15, 2026

Quoting @vmittal-msft: @Janetxxx please manually create 202511 PR as there are cherry-pick conflicts

@vmittal-msft it should be fixed now, after other PRs have been cherry-picked to 202511

mssonicbld (Collaborator) commented

Cherry-pick PR to 202511: #4220

baorliu pushed a commit to baorliu/sonic-swss that referenced this pull request Feb 23, 2026
…t#4131)

Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>