feat(signer): implement graceful retry and timeout for remote signers by raushan728 · Pull Request #412 · solana-foundation/kora

raushan728 · 2026-03-28T11:03:23Z

Remote signers (KMS, Fireblocks) can hang or fail intermittently on HTTP calls.
This PR adds configurable retry and timeout, integrated with the health monitoring system.

Changes:

Each signing attempt wrapped with tokio::timeout (default 10s)
Exponential backoff (100ms × 2^exp) with the exponent capped at 7, so the delay tops out at ~12.8s
New config fields: sign_timeout_seconds (default 10), sign_max_retries (default 2)
Validator rejects timeout < 1s, warns if retries > 10
Signing failure reported once per request (not per retry) to preserve health counter accuracy
Fixed TOML serialization order for signing config fields
Added #[serial] to config tests to fix flakiness on shared global state
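The backoff schedule described in the list above can be sketched as a small pure function. This is a minimal illustration, not the code from the PR: the constant and function names are assumptions, and only the 100ms base and the exponent cap of 7 come from the description.

```rust
use std::time::Duration;

// Values quoted in the PR description; the names are illustrative assumptions.
const BASE_DELAY_MS: u64 = 100;
const MAX_BACKOFF_EXP: u32 = 7;

/// Delay before the next attempt: 100ms * 2^exp, with exp capped at 7.
fn backoff_delay(attempt: u32) -> Duration {
    let exp = attempt.min(MAX_BACKOFF_EXP);
    Duration::from_millis(BASE_DELAY_MS * (1u64 << exp))
}
```

With the default sign_max_retries = 2, only the first two delays (100ms and 200ms) would ever be used; the 12.8s ceiling matters only when retries are configured much higher.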

raushan728 · 2026-03-28T11:05:25Z

Hi @dev-jodee, this PR depends on #399 for pool telemetry. The diff will resolve automatically once the health PR is merged into main.

greptile-apps · 2026-03-28T11:09:20Z

Greptile Summary

This PR adds configurable retry and timeout logic to remote signer calls (sign_message) in VersionedTransactionResolved::sign_transaction, along with a new health-tracking mechanism in SignerPool that marks signers as unhealthy after repeated failures and probes for recovery after a 30-second cooldown. The two new KoraConfig fields (sign_timeout_seconds, sign_max_retries) are properly serialized, defaulted, and validated.

Key changes:

versioned_transaction.rs: Retry loop wraps each sign_message call in tokio::timeout; exponential backoff (100ms × 2^exp, with exp capped at 7) between attempts; success/failure reported to SignerPool.
pool.rs: SignerWithMetadata gains two Mutex-protected health fields; healthy_signers() filters the pool before selection; get_next_signer and get_signer_by_pubkey both respect the health check with a 30-second recovery probe window.
config.rs / config_validator.rs: Two new fields with serde defaults and startup validation (error on timeout = 0, warn on retries > 10).
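The defaults and startup checks summarized above might look roughly like the following. This is a sketch under stated assumptions: `SigningConfig` and `validate_signing_config` are hypothetical stand-ins for the real `KoraConfig` and validator, and the message strings are invented for illustration. Only the field names, defaults, and thresholds come from the summary.

```rust
/// Hypothetical stand-in for the two new fields; not the real KoraConfig.
#[derive(Debug, Clone, PartialEq)]
struct SigningConfig {
    sign_timeout_seconds: u64,
    sign_max_retries: u32,
}

impl Default for SigningConfig {
    fn default() -> Self {
        // Defaults as stated in the PR description.
        Self { sign_timeout_seconds: 10, sign_max_retries: 2 }
    }
}

/// Returns a list of warnings on success, or an error if the config is rejected.
fn validate_signing_config(cfg: &SigningConfig) -> Result<Vec<String>, String> {
    // A zero timeout would make every signing attempt fail immediately.
    if cfg.sign_timeout_seconds == 0 {
        return Err("sign_timeout_seconds must be at least 1".to_string());
    }
    let mut warnings = Vec::new();
    // High retry counts are allowed but flagged.
    if cfg.sign_max_retries > 10 {
        warnings.push(format!(
            "sign_max_retries = {} is unusually high",
            cfg.sign_max_retries
        ));
    }
    Ok(warnings)
}
```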

Issues found:

P1 — Each retry attempt within a single request calls record_signing_failure independently. With defaults (sign_max_retries = 2, MAX_CONSECUTIVE_FAILURES = 3), a single request that exhausts all retries generates exactly 3 failure reports, immediately marking the signer unhealthy. This couples two otherwise-independent constants in a non-obvious way and defeats the retry mechanism's goal of transparently handling transient failures.
P2 — health and last_failed_at are two separate Mutexes that are always acquired together in a fixed order. While deadlock-safe today, combining them into a single struct under one lock would be simpler and less fragile.
P2 — get_signers_info / SignerInfo doesn't expose health state, making it impossible to observe which signers are being bypassed via monitoring endpoints.
P2 — The global current_index round-robin counter is modulo'd against the filtered healthy slice; pool shrinkage can cause non-uniform distribution until the counter wraps.
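The round-robin concern in the last item can be reproduced with a toy pool. The `Pool` type below is a hypothetical simplification (signers are just a name and a health flag), not the real SignerPool. It only shows what happens when a shared counter is taken modulo the length of a filtered slice that changes size.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical pool; layout and names are assumptions for illustration.
struct Pool {
    signers: Vec<(&'static str, bool)>, // (name, healthy)
    current_index: AtomicUsize,
}

impl Pool {
    /// Applies the global counter to the healthy subset only.
    fn next_signer(&self) -> Option<&'static str> {
        let healthy: Vec<&'static str> = self
            .signers
            .iter()
            .filter(|(_, ok)| *ok)
            .map(|(name, _)| *name)
            .collect();
        if healthy.is_empty() {
            return None;
        }
        let i = self.current_index.fetch_add(1, Ordering::Relaxed);
        Some(healthy[i % healthy.len()])
    }
}
```

With signers A, B, C all healthy, the first two calls return A and B. If B is then marked unhealthy, the counter is 2 and the healthy slice is [A, C], so the next pick is A again rather than C: the position jumps whenever the slice length changes.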

Confidence Score: 4/5

Safe to merge after addressing the per-retry health reporting — the current defaults cause a single failed request to immediately blacklist a signer, undermining the retry mechanism's resilience goal.

The P1 finding is a real present-behavior defect: with the default sign_max_retries = 2 and the hardcoded MAX_CONSECUTIVE_FAILURES = 3, one request exhausting all retries generates exactly 3 consecutive record_signing_failure calls, hitting the threshold and marking the signer unhealthy immediately. This makes healthy signers susceptible to a 30-second blackout after a single transient failure — the opposite of what retries are meant to achieve. The P2 findings are quality/observability concerns that do not block correctness. Config, validator, and test-mock changes are clean.
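The arithmetic behind the P1 finding can be checked with a small simulation. `SignerHealth` and `fail_one_request` are hypothetical names introduced for this sketch. Only `MAX_CONSECUTIVE_FAILURES = 3`, `sign_max_retries = 2`, and the `record_signing_failure` name come from the review.

```rust
/// Threshold quoted in the review.
const MAX_CONSECUTIVE_FAILURES: u32 = 3;

/// Hypothetical stand-in for the per-signer health counter.
#[derive(Default)]
struct SignerHealth {
    consecutive_failures: u32,
}

impl SignerHealth {
    fn record_signing_failure(&mut self) {
        self.consecutive_failures += 1;
    }

    fn is_healthy(&self) -> bool {
        self.consecutive_failures < MAX_CONSECUTIVE_FAILURES
    }
}

/// Simulates one request in which every attempt fails.
/// `per_attempt = true` reports on each attempt (the flagged behavior);
/// `per_attempt = false` reports once after retries are exhausted.
fn fail_one_request(health: &mut SignerHealth, sign_max_retries: u32, per_attempt: bool) {
    for _attempt in 0..=sign_max_retries {
        if per_attempt {
            health.record_signing_failure();
        }
    }
    if !per_attempt {
        health.record_signing_failure();
    }
}
```

With the defaults, per-attempt reporting yields 1 + 2 = 3 reports from a single request, which reaches the threshold. Reporting once per request leaves the counter at 1 and the signer stays selectable.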

Files needing attention: crates/lib/src/transaction/versioned_transaction.rs (per-retry health reporting) and crates/lib/src/signer/pool.rs (dual-mutex design, observability gap).

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/lib/src/transaction/versioned_transaction.rs | Adds retry/timeout loop around `sign_message`; each retry attempt reports independently to the health pool, meaning one fully-retried request can immediately mark a signer unhealthy (P1). |
| crates/lib/src/signer/pool.rs | Adds health tracking (consecutive failures, 30s recovery probe) to SignerPool; lock ordering is consistent so no deadlock, but two separate Mutexes for health state are fragile; round-robin index skew and missing observability are minor concerns. |
| crates/lib/src/config.rs | Adds `sign_timeout_seconds` (default 10) and `sign_max_retries` (default 2) to `KoraConfig` with serde defaults and `Default` impl; straightforward and correct. |
| crates/lib/src/validator/config_validator.rs | Adds validation: errors on `sign_timeout_seconds == 0`, warns if `sign_max_retries > 10`; correct and consistent with existing validator patterns. |
| crates/lib/src/tests/config_mock.rs | Adds hardcoded defaults for the two new fields in both mock builders; no issues. |
| crates/lib/src/tests/toml_mock.rs | Extends `ConfigBuilder` with `with_signing_config` helper and TOML serialization for the two new fields; clean addition with no issues. |
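The pool.rs entry notes that the two health fields sit behind separate Mutexes. One way the single-lock alternative suggested in the review could look is sketched below. All type and method names other than `record_signing_failure` and `record_signing_success` are assumptions, and the clock is passed in explicitly so the 30-second probe window can be exercised without sleeping.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

const MAX_CONSECUTIVE_FAILURES: u32 = 3;
const RECOVERY_PROBE_AFTER: Duration = Duration::from_secs(30);

/// Both health fields live in one struct so a single lock guards them together.
#[derive(Default)]
struct HealthState {
    consecutive_failures: u32,
    last_failed_at: Option<Instant>,
}

/// Hypothetical wrapper; the real SignerWithMetadata carries more fields.
struct SignerHealth {
    state: Mutex<HealthState>,
}

impl SignerHealth {
    fn new() -> Self {
        Self { state: Mutex::new(HealthState::default()) }
    }

    fn record_signing_failure(&self, now: Instant) {
        let mut s = self.state.lock().unwrap();
        s.consecutive_failures += 1;
        s.last_failed_at = Some(now);
    }

    fn record_signing_success(&self) {
        // A success clears both fields in one critical section.
        *self.state.lock().unwrap() = HealthState::default();
    }

    /// Selectable while under the threshold, or once the cooldown has elapsed.
    fn is_selectable(&self, now: Instant) -> bool {
        let s = self.state.lock().unwrap();
        if s.consecutive_failures < MAX_CONSECUTIVE_FAILURES {
            return true;
        }
        match s.last_failed_at {
            Some(t) => now.duration_since(t) >= RECOVERY_PROBE_AFTER,
            None => true,
        }
    }
}
```

Because the counter and the timestamp are read and written under the same guard, there is no lock ordering to maintain and no window in which one field is updated and the other is stale.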

Sequence Diagram

sequenceDiagram
    participant Caller
    participant VTx as VersionedTransactionResolved
    participant Pool as SignerPool
    participant Signer as Remote Signer

    Caller->>Pool: get_next_signer()
    Pool->>Pool: healthy_signers() — filter unhealthy
    Pool-->>Caller: Arc<Signer>

    Caller->>VTx: sign_transaction(config, signer, rpc_client)

    loop attempt 0..=max_retries
        VTx->>VTx: tokio::timeout(sign_timeout, ...)
        VTx->>Signer: sign_message(&message_bytes)

        alt Success
            Signer-->>VTx: Ok(signature)
            VTx->>Pool: record_signing_success(signer)
            VTx-->>Caller: (transaction, encoded)
        else Signing error
            Signer-->>VTx: Err(e)
            VTx->>Pool: record_signing_failure(signer) per-attempt
            VTx->>VTx: backoff (100ms x 2^exp)
        else Timeout
            VTx->>Pool: record_signing_failure(signer) per-attempt
            VTx->>VTx: backoff (100ms x 2^exp)
        end
    end

    opt All retries exhausted
        VTx-->>Caller: Err(SigningError)
    end

    Note over Pool: After MAX_CONSECUTIVE_FAILURES=3 signer marked unhealthy. Recovery probe allowed after 30s.
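The control flow in the diagram can be approximated synchronously using only the standard library. This is a sketch with several assumptions: `sign_with_retry`, `AttemptError`, and `HealthLog` are hypothetical names; the timeout is modeled as an error variant rather than an actual `tokio::timeout` call; and the sleep function is injected so the backoff is observable. Unlike the reviewed commit shown in the diagram, the sketch reports a failure once after retries are exhausted, which is the behavior the PR description states.

```rust
use std::time::Duration;

/// Outcome of a failed attempt; in the PR the timeout case comes from tokio::timeout.
#[derive(Debug)]
enum AttemptError {
    Timeout,
    Signer(String),
}

/// Hypothetical sink for health reports, standing in for SignerPool.
#[derive(Default)]
struct HealthLog {
    successes: u32,
    failures: u32,
}

fn sign_with_retry<A, S>(
    max_retries: u32,
    mut attempt: A,
    mut sleep: S,
    health: &mut HealthLog,
) -> Result<Vec<u8>, AttemptError>
where
    A: FnMut(u32) -> Result<Vec<u8>, AttemptError>,
    S: FnMut(Duration),
{
    let mut n = 0;
    loop {
        match attempt(n) {
            Ok(signature) => {
                health.successes += 1;
                return Ok(signature);
            }
            // Last attempt failed: report once and surface the final error.
            Err(e) if n >= max_retries => {
                health.failures += 1;
                return Err(e);
            }
            // Otherwise back off (100ms * 2^n, exponent capped at 7) and retry.
            Err(_) => sleep(Duration::from_millis(100 * (1u64 << n.min(7)))),
        }
        n += 1;
    }
}
```

Injecting `sleep` keeps the sketch testable; an async version would presumably await a timer instead.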

Comments Outside Diff (1)

crates/lib/src/signer/pool.rs, lines 297-307

get_signers_info Omits Health State — Limits Observability

SignerInfo doesn't surface is_healthy or consecutive_failures. Operators consuming /getConfig or similar monitoring endpoints have no way to tell which signers are actively being skipped by the health filter. Given that this PR introduces the health mechanism as a production reliability feature, exposing it in the monitoring surface would make it far easier to debug degraded pools.

Consider adding a healthy: bool (or consecutive_failures: u32) field to SignerInfo and populating it from the health lock in get_signers_info.
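A rough sketch of that suggestion follows. The struct shapes are hypothetical: the real `SignerInfo` and `SignerWithMetadata` have other fields that are omitted here, and the health counter is reduced to a single `Mutex<u32>` for brevity.

```rust
use std::sync::Mutex;

const MAX_CONSECUTIVE_FAILURES: u32 = 3;

/// Hypothetical monitoring view with the two proposed fields added.
#[derive(Debug, PartialEq)]
struct SignerInfo {
    name: String,
    healthy: bool,
    consecutive_failures: u32,
}

/// Hypothetical pool entry; only what the sketch needs.
struct SignerWithMetadata {
    name: String,
    consecutive_failures: Mutex<u32>,
}

/// Reads each signer's health under its lock and copies it into the view.
fn get_signers_info(signers: &[SignerWithMetadata]) -> Vec<SignerInfo> {
    signers
        .iter()
        .map(|s| {
            let failures = *s.consecutive_failures.lock().unwrap();
            SignerInfo {
                name: s.name.clone(),
                healthy: failures < MAX_CONSECUTIVE_FAILURES,
                consecutive_failures: failures,
            }
        })
        .collect()
}
```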

Reviews (1): Last reviewed commit: "merge(signer): resolve conflicts and fix..."

github-actions · 2026-04-03T15:47:30Z

✅ Fork external live tests passed.

fork-external-live-pass:dfc33405b22391d3a357e77801ef6ce5d809b04b
run: https://github.com/solana-foundation/kora/actions/runs/23952067537

raushan728 requested review from amilz and dev-jodee as code owners March 28, 2026 11:03

greptile-apps bot reviewed Mar 28, 2026

raushan728 force-pushed the feat/signer-retry-timeout branch from f67e065 to 7d3c420 Compare March 29, 2026 06:38

raushan728 added 5 commits April 3, 2026 09:39

feat(signer): add graceful retry and timeout logic

8bf1711

fix: move signing config fields before subsections in toml_mock and add regression test

d5aa145

fix: report signing failure once after all retries exhausted

da3f365

fix: add #[serial] to test_validate_sign_timeout_zero for global state safety

27038b6

fix(signer): resolve conflicts and integrate maintainer logging in retry loop

8bab33f

raushan728 force-pushed the feat/signer-retry-timeout branch from e5b9cde to 8bab33f Compare April 3, 2026 09:51

dev-jodee added 4 commits April 3, 2026 10:29

fix(signer): refine signer pool retry and validation behavior

8b65bed

fix: align signer exports with current retry implementation

0346363

feat(signer): add bundle signing retry and shared backoff util

f40be37

refactor(signer): centralize retry flow for transaction and bundle signing

dfc3340

dev-jodee approved these changes Apr 3, 2026

dev-jodee merged commit 5739b1f into solana-foundation:main Apr 3, 2026
12 of 13 checks passed

raushan728 deleted the feat/signer-retry-timeout branch April 4, 2026 01:50

dev-jodee merged 9 commits into solana-foundation:main from raushan728:feat/signer-retry-timeout