Skip to content

Comments

[countersyncd]: Modify the exit behavior of the main function#4197

Merged
prsunny merged 2 commits intosonic-net:masterfrom
Pterosaur:fix_retry
Feb 5, 2026
Merged

[countersyncd]: Modify the exit behavior of the main function#4197
prsunny merged 2 commits intosonic-net:masterfrom
Pterosaur:fix_retry

Conversation

@Pterosaur
Copy link
Contributor

What I did
The main function exits as soon as any actor terminates.

Why I did it
Otel actor may terminate due to failed to connect the otel collector. In the previous behavior, the main function will not exit because it's waiting for all actor terminating.

How I verified it
Check it locally:

[2026-02-04 11:16:01.248] [crates/countersyncd/src/actor/otel.rs:340] [WARN] Export attempt 29 failed: status: Unavailable, message: "tcp connect error", details: [], metadata: MetadataMap { headers: {} }
[2026-02-04 11:16:11.253] [crates/countersyncd/src/actor/otel.rs:340] [WARN] Export attempt 30 failed: status: Unavailable, message: "tcp connect error", details: [], metadata: MetadataMap { headers: {} }
[2026-02-04 11:16:21.255] [crates/countersyncd/src/actor/otel.rs:405] [ERROR] Failed to export buffered metrics (consecutive failures 30): OtelActorExportError("Max export retries exceeded")
[2026-02-04 11:16:21.256] [crates/countersyncd/src/actor/otel.rs:431] [INFO] Shutting down OtelActor...
[2026-02-04 11:16:22.257] [crates/countersyncd/src/actor/otel.rs:439] [INFO] OtelActor shutdown complete. 2339 messages, 0 exports, 1 failures
[2026-02-04 11:16:22.257] [crates/countersyncd/src/main.rs:421] [INFO] OpenTelemetry actor terminated
[2026-02-04 11:16:22.257] [crates/countersyncd/src/main.rs:464] [ERROR] OpenTelemetry actor failed: OtelActorExportError("Max export retries exceeded")

Details if related

Signed-off-by: Ze Gan <ganze718@gmail.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur Pterosaur marked this pull request as ready for review February 4, 2026 11:18
@Pterosaur Pterosaur requested a review from prsunny as a code owner February 4, 2026 11:18
@Pterosaur Pterosaur requested review from Janetxxx and Copilot February 4, 2026 11:18
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request modifies the countersyncd daemon's exit behavior to terminate the entire process as soon as any actor completes, rather than waiting for all actors to finish. This addresses a specific issue where the OpenTelemetry actor could fail (e.g., due to connection failures to the OTEL collector) but the daemon would continue running without exiting.

Changes:

  • Replaced the "wait-for-all" actor completion pattern with a tokio::select! block that exits on first actor termination
  • Added a new exit_on_join helper function to handle actor exit with appropriate exit codes
  • Modified OtelActor error propagation to properly return errors rather than using early ? returns
  • Added .cargo/config.toml with rustflags to treat warnings as errors

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
crates/countersyncd/src/main.rs Replaced synchronous wait-for-all actors with tokio::select! for first-exit semantics; added exit_on_join helper; updated exit code imports
crates/countersyncd/src/actor/otel.rs Modified error handling to capture and propagate errors through run_error variable instead of early returns
.cargo/config.toml Added build configuration to treat compiler warnings as errors

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Ze Gan <ganze718@gmail.com>
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@Janetxxx Janetxxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Pterosaur
Copy link
Contributor Author

Hi @prsunny , could you please help to merge this PR?

displayName: "Compile sonic swss"
- script: |
cargo test
RUSTFLAGS=-Dwarnings cargo test
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saiarcot895 , could you review this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pterosaur could you set this in Cargo.toml instead, so that it applies always?

https://doc.rust-lang.org/cargo/reference/manifest.html#the-lints-section for reference.

Copy link
Contributor Author

@Pterosaur Pterosaur Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it before, but here is copilot review suggestions:

Setting rustflags to treat warnings as errors (-Dwarnings) is a good practice for CI/CD pipelines to ensure code quality. However, this can be problematic for local development as it makes the build process more strict. Different Rust versions may introduce new warnings, causing builds to fail unexpectedly on local machines or with newer toolchain versions.

Consider moving this configuration to CI-specific settings instead of applying it globally via .cargo/config.toml. This could be done through environment variables (RUSTFLAGS=-Dwarnings) in your CI configuration, or by using a separate profile. This approach allows developers to work with warnings locally while still enforcing zero-warning policy in CI.

I feel it's reasonable. To the developer, the local warning restriction may be not convenient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point; it can also help with moving to newer versions of the Rust compiler if it is constrained to this CI pipeline.

@prsunny prsunny merged commit 702dae4 into sonic-net:master Feb 5, 2026
16 checks passed
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to msft-202412: Azure/sonic-swss.msft#206

@Pterosaur
Copy link
Contributor Author

Retrigger mssonicbld

@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202511: #4225

ksravani-hcl pushed a commit to ksravani-hcl/sonic-swss that referenced this pull request Feb 20, 2026
…net#4197)

What I did
The main function exits as soon as any actor terminates.

Why I did it
Otel actor may terminate due to failed to connect the otel collector. In the previous behavior, the main function will not exit because it's waiting for all actor terminating.
baorliu pushed a commit to baorliu/sonic-swss that referenced this pull request Feb 23, 2026
…net#4197)

What I did
The main function exits as soon as any actor terminates.

Why I did it
Otel actor may terminate due to failed to connect the otel collector. In the previous behavior, the main function will not exit because it's waiting for all actor terminating.

Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>
@mssonicbld
Copy link
Collaborator

@Pterosaur cherry pick PR didn't pass PR checker. Please check!!!
#4225

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants