Skip to content

feat(signer): implement graceful retry and timeout for remote signers#412

Merged
dev-jodee merged 9 commits intosolana-foundation:mainfrom
raushan728:feat/signer-retry-timeout
Apr 3, 2026
Merged

feat(signer): implement graceful retry and timeout for remote signers#412
dev-jodee merged 9 commits intosolana-foundation:mainfrom
raushan728:feat/signer-retry-timeout

Conversation

@raushan728
Copy link
Copy Markdown
Contributor

@raushan728 raushan728 commented Mar 28, 2026

Remote signers (KMS, Fireblocks) can hang or fail intermittently on HTTP calls.
This PR adds configurable retry and timeout, integrated with the health monitoring system.

Changes:

  • Each signing attempt wrapped with tokio::timeout (default 10s)
  • Exponential backoff (100ms × 2^exp) capped at ~12.8s
  • New config fields: sign_timeout_seconds (default 10), sign_max_retries (default 2)
  • Validator rejects timeout < 1s, warns if retries > 10
  • Signing failure reported once per request (not per retry) to preserve health counter accuracy
  • Fixed TOML serialization order for signing config fields
  • Added #[serial] to config tests to fix flakiness on shared global state

Open with Devin

@raushan728
Copy link
Copy Markdown
Contributor Author

raushan728 commented Mar 28, 2026

hi @dev-jodee this PR depends on #399 for pool telemetry. The diff will resolve automatically once the health PR is merged into main

devin-ai-integration[bot]

This comment was marked as resolved.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 28, 2026

Greptile Summary

This PR adds configurable retry and timeout logic to remote signer calls (sign_message) in VersionedTransactionResolved::sign_transaction, along with a new health-tracking mechanism in SignerPool that marks signers as unhealthy after repeated failures and probes for recovery after a 30-second cooldown. The two new KoraConfig fields (sign_timeout_seconds, sign_max_retries) are properly serialized, defaulted, and validated.

Key changes:

  • versioned_transaction.rs: Retry loop wraps each sign_message call in tokio::timeout; exponential backoff (100ms × 2^exp, capped at 7) between attempts; success/failure reported to SignerPool.
  • pool.rs: SignerWithMetadata gains two Mutex-protected health fields; healthy_signers() filters the pool before selection; get_next_signer and get_signer_by_pubkey both respect the health check with a 30-second recovery probe window.
  • config.rs / config_validator.rs: Two new fields with serde defaults and startup validation (error on timeout = 0, warn on retries > 10).

Issues found:

  • P1 — Each retry attempt within a single request calls record_signing_failure independently. With defaults (sign_max_retries = 2, MAX_CONSECUTIVE_FAILURES = 3), a single request that exhausts all retries generates exactly 3 failure reports, immediately marking the signer unhealthy. This couples two otherwise-independent constants in a non-obvious way and defeats the retry mechanism's goal of transparently handling transient failures.
  • P2health and last_failed_at are two separate Mutexes that are always acquired together in a fixed order. While deadlock-safe today, combining them into a single struct under one lock would be simpler and less fragile.
  • P2get_signers_info / SignerInfo doesn't expose health state, making it impossible to observe which signers are being bypassed via monitoring endpoints.
  • P2 — The global current_index round-robin counter is modulo'd against the filtered healthy slice; pool shrinkage can cause non-uniform distribution until the counter wraps.

Confidence Score: 4/5

Safe to merge after addressing the per-retry health reporting — the current defaults cause a single failed request to immediately blacklist a signer, undermining the retry mechanism's resilience goal.

The P1 finding is a real present-behavior defect: with the default sign_max_retries = 2 and the hardcoded MAX_CONSECUTIVE_FAILURES = 3, one request exhausting all retries generates exactly 3 consecutive record_signing_failure calls, hitting the threshold and marking the signer unhealthy immediately. This makes healthy signers susceptible to a 30-second blackout after a single transient failure — the opposite of what retries are meant to achieve. The P2 findings are quality/observability concerns that do not block correctness. Config, validator, and test-mock changes are clean.

crates/lib/src/transaction/versioned_transaction.rs (per-retry health reporting) and crates/lib/src/signer/pool.rs (dual-mutex design, observability gap).

Important Files Changed

Filename Overview
crates/lib/src/transaction/versioned_transaction.rs Adds retry/timeout loop around sign_message; each retry attempt reports independently to the health pool, meaning one fully-retried request can immediately mark a signer unhealthy (P1).
crates/lib/src/signer/pool.rs Adds health tracking (consecutive failures, 30s recovery probe) to SignerPool; lock ordering is consistent so no deadlock, but two separate Mutexes for health state are fragile; round-robin index skew and missing observability are minor concerns.
crates/lib/src/config.rs Adds sign_timeout_seconds (default 10) and sign_max_retries (default 2) to KoraConfig with serde defaults and Default impl; straightforward and correct.
crates/lib/src/validator/config_validator.rs Adds validation: errors on sign_timeout_seconds == 0, warns if sign_max_retries > 10; correct and consistent with existing validator patterns.
crates/lib/src/tests/config_mock.rs Adds hardcoded defaults for the two new fields in both mock builders; no issues.
crates/lib/src/tests/toml_mock.rs Extends ConfigBuilder with with_signing_config helper and TOML serialization for the two new fields; clean addition with no issues.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant VTx as VersionedTransactionResolved
    participant Pool as SignerPool
    participant Signer as Remote Signer

    Caller->>Pool: get_next_signer()
    Pool->>Pool: healthy_signers() — filter unhealthy
    Pool-->>Caller: Arc<Signer>

    Caller->>VTx: sign_transaction(config, signer, rpc_client)

    loop attempt 0..=max_retries
        VTx->>VTx: tokio::timeout(sign_timeout, ...)
        VTx->>Signer: sign_message(&message_bytes)

        alt Success
            Signer-->>VTx: Ok(signature)
            VTx->>Pool: record_signing_success(signer)
            VTx-->>Caller: (transaction, encoded)
        else Signing error
            Signer-->>VTx: Err(e)
            VTx->>Pool: record_signing_failure(signer) per-attempt
            VTx->>VTx: backoff (100ms x 2^exp)
        else Timeout
            VTx->>Pool: record_signing_failure(signer) per-attempt
            VTx->>VTx: backoff (100ms x 2^exp)
        end
    end

    opt All retries exhausted
        VTx-->>Caller: Err(SigningError)
    end

    Note over Pool: After MAX_CONSECUTIVE_FAILURES=3 signer marked unhealthy. Recovery probe allowed after 30s.
Loading

Comments Outside Diff (1)

  1. crates/lib/src/signer/pool.rs, line 297-307 (link)

    P2 get_signers_info Omits Health State — Limits Observability

    SignerInfo doesn't surface is_healthy or consecutive_failures. Operators consuming /getConfig or similar monitoring endpoints have no way to tell which signers are actively being skipped by the health filter. Given that this PR introduces the health mechanism as a production reliability feature, exposing it in the monitoring surface would make it far easier to debug degraded pools.

    Consider adding a healthy: bool (or consecutive_failures: u32) field to SignerInfo and populating it from the health lock in get_signers_info.

Reviews (1): Last reviewed commit: "merge(signer): resolve conflicts and fix..." | Re-trigger Greptile

devin-ai-integration[bot]

This comment was marked as resolved.

@raushan728 raushan728 force-pushed the feat/signer-retry-timeout branch from f67e065 to 7d3c420 Compare March 29, 2026 06:38
devin-ai-integration[bot]

This comment was marked as resolved.

@raushan728 raushan728 force-pushed the feat/signer-retry-timeout branch from e5b9cde to 8bab33f Compare April 3, 2026 09:51
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 3, 2026

✅ Fork external live tests passed.

fork-external-live-pass:dfc33405b22391d3a357e77801ef6ce5d809b04b
run: https://github.com/solana-foundation/kora/actions/runs/23952067537

@dev-jodee dev-jodee merged commit 5739b1f into solana-foundation:main Apr 3, 2026
12 of 13 checks passed
@raushan728 raushan728 deleted the feat/signer-retry-timeout branch April 4, 2026 01:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants