feat(signer): implement graceful retry and timeout for remote signers #412
Hi @dev-jodee, this PR depends on #399 for pool telemetry. The diff will resolve automatically once the health PR is merged into main.
Greptile Summary
This PR adds configurable retry and timeout logic to remote signer calls (
Key changes:
Issues found:
Confidence Score: 4/5
Safe to merge after addressing the per-retry health reporting: with the current defaults, a single failed request immediately blacklists a signer, which undermines the retry mechanism's resilience goal. The P1 finding is a real present-behavior defect: with the default
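The per-retry reporting concern can be illustrated with a minimal sketch. It is not the PR's code: `SignerHealth`, `failed_request`, and the `max_retries` value of 2 used in the usage note are hypothetical, while the threshold of 3 consecutive failures and the 30s recovery probe are taken from the note in the summary's sequence diagram.

```rust
use std::time::{Duration, Instant};

// Values quoted in the sequence diagram's note.
const MAX_CONSECUTIVE_FAILURES: u32 = 3;
const RECOVERY_PROBE_AFTER: Duration = Duration::from_secs(30);

/// Hypothetical per-signer health state (not the PR's actual type).
#[derive(Default)]
struct SignerHealth {
    consecutive_failures: u32,
    last_failure: Option<Instant>,
}

impl SignerHealth {
    fn record_failure(&mut self, now: Instant) {
        self.consecutive_failures += 1;
        self.last_failure = Some(now);
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.last_failure = None;
    }

    /// Usable if below the failure threshold, or if the recovery window has elapsed.
    fn is_available(&self, now: Instant) -> bool {
        if self.consecutive_failures < MAX_CONSECUTIVE_FAILURES {
            return true;
        }
        match self.last_failure {
            Some(t) => now.duration_since(t) >= RECOVERY_PROBE_AFTER,
            None => true,
        }
    }
}

/// Simulates one request in which every attempt fails.
/// `per_attempt = true` reports each failed attempt to the health state;
/// `per_attempt = false` reports one failure for the whole request.
fn failed_request(health: &mut SignerHealth, max_retries: u32, per_attempt: bool, now: Instant) {
    for _attempt in 0..=max_retries {
        if per_attempt {
            health.record_failure(now);
        }
    }
    if !per_attempt {
        health.record_failure(now);
    }
}
```

Under the assumed `max_retries` of 2, one failed request with per-attempt reporting records three failures and makes the signer unavailable until the 30s window elapses. Reporting once per request leaves it at one failure and still available.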
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller
participant VTx as VersionedTransactionResolved
participant Pool as SignerPool
participant Signer as Remote Signer
Caller->>Pool: get_next_signer()
Pool->>Pool: healthy_signers() — filter unhealthy
Pool-->>Caller: Arc<Signer>
Caller->>VTx: sign_transaction(config, signer, rpc_client)
loop attempt 0..=max_retries
VTx->>VTx: tokio::timeout(sign_timeout, ...)
VTx->>Signer: sign_message(&message_bytes)
alt Success
Signer-->>VTx: Ok(signature)
VTx->>Pool: record_signing_success(signer)
VTx-->>Caller: (transaction, encoded)
else Signing error
Signer-->>VTx: Err(e)
VTx->>Pool: record_signing_failure(signer) per-attempt
VTx->>VTx: backoff (100ms x 2^exp)
else Timeout
VTx->>Pool: record_signing_failure(signer) per-attempt
VTx->>VTx: backoff (100ms x 2^exp)
end
end
opt All retries exhausted
VTx-->>Caller: Err(SigningError)
end
Note over Pool: After MAX_CONSECUTIVE_FAILURES=3 signer marked unhealthy. Recovery probe allowed after 30s.
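The retry loop in the diagram can be sketched in plain synchronous Rust. This is a simplified illustration, not the PR's implementation: `retry_with_backoff`, `backoff_delay`, the injected `sleep` callback, and the exponent cap of 10 are all hypothetical. In the async code, each attempt would presumably be wrapped in `tokio::time::timeout`, with an elapsed timeout treated as one more `Err`.

```rust
use std::time::Duration;

/// Base delay from the diagram ("100ms x 2^exp").
const BASE_BACKOFF_MS: u64 = 100;

/// Delay to wait after failed attempt number `attempt` (0-based).
/// The exponent is capped at 10 (an assumed cap) so the delay stays bounded.
fn backoff_delay(attempt: u32) -> Duration {
    let exp = attempt.min(10);
    Duration::from_millis(BASE_BACKOFF_MS << exp)
}

/// Calls `op` up to `max_retries + 1` times, mirroring `loop attempt 0..=max_retries`.
/// `sleep` is injected so the sketch can be exercised without real waiting;
/// production code would pass an actual sleep.
fn retry_with_backoff<T, E>(
    max_retries: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Retries exhausted: surface the last error to the caller.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}
```

With `max_retries = 3` and an operation that fails twice before succeeding, the sketch sleeps for 100ms and then 200ms and returns the successful result on the third attempt.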
f67e065 to 7d3c420
…dd regression test
e5b9cde to 8bab33f
✅ Fork external live tests passed. fork-external-live-pass:dfc33405b22391d3a357e77801ef6ce5d809b04b
Remote signers (KMS, Fireblocks) can hang or fail intermittently on HTTP calls.
This PR adds configurable retry and timeout, integrated with the health monitoring system.
Changes: