
Add WDC session and key health #281

Merged
kwsantiago merged 13 commits into main from WDC-session-and-key-health
Feb 26, 2026

Conversation

@wksantiago
Contributor

@wksantiago wksantiago commented Feb 26, 2026

Summary

  • Add key health monitoring with liveness checks for FROST shares
  • Add configurable session timeouts for descriptor coordination
  • Harden security: mutex poisoning, timeout bounds, timestamp handling
  • Simplify descriptor session and health check code paths

Summary by CodeRabbit

  • New Features

    • Frost network health-check command: run checks, report responsive/unresponsive peers, configurable timeout, and persist per-share health history with age/staleness display.
    • Wallet proposal timeout: optional per-proposal timeout that adjusts session and acknowledgement waits with validation and sensible defaults.
    • Mobile health APIs: health-check, persisted health-status listing, callback hooks, and a KeyHealthStatusInfo type for apps.
  • Bug Fixes

    • Safer node pubkey display and mutex handling to avoid panics.

@coderabbitai

coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a health-check feature and per-proposal timeout propagation: new CLI HealthCheck, KfpNode.health_check and HealthCheckComplete event, persistent KeyHealthStatus storage and Keep APIs, descriptor session timeout validation/propagation across protocol, node, CLI, and mobile layers.

Changes

CLI & entrypoint — keep-cli/src/cli.rs, keep-cli/src/main.rs
  • Added FrostNetworkCommands::HealthCheck and passed an optional timeout into the wallet Propose flow; wired CLI dispatch to cmd_frost_network_health_check and updated the wallet propose call to include the timeout.

CLI health execution — keep-cli/src/commands/frost_network/mod.rs
  • Added cmd_frost_network_health_check implementing vault unlock, node init/announce/run, peer discovery, bounded node.health_check invocation, result aggregation, persistence, and formatted history; minor robustness fixes (pubkey display, mutex handling).

Wallet CLI proposal — keep-cli/src/commands/wallet.rs
  • Added an optional timeout_secs parameter, validated against DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS; uses request_descriptor_with_timeout and recalculates session/ACK timeouts from it.

Core health model & storage — keep-core/src/wallet.rs, keep-core/src/storage.rs, keep-core/src/backend.rs, keep-core/src/lib.rs
  • Introduced the KeyHealthStatus type and thresholds; added a HEALTH_STATUS table and encrypted CRUD (store/get/list/delete); exposed Keep APIs: store_health_status, get_health_status, list_health_statuses, list_stale_health_statuses.

Frost protocol & sessions — keep-frost-net/src/protocol.rs, keep-frost-net/src/descriptor_session.rs, keep-frost-net/src/node/descriptor.rs
  • Added DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS, DescriptorProposePayload.timeout_secs and its builder, timeout validation, a Result-returning DescriptorSessionManager::with_timeout, create_session_with_timeout, and request_descriptor_with_timeout propagation.

Node health check & events — keep-frost-net/src/node/mod.rs, keep-frost-net/src/lib.rs
  • Added HealthCheckResult, KfpNode::health_check(timeout), KfpNodeEvent::HealthCheckComplete, peer-ping snapshotting with tolerant ping behavior, and re-exported health types/constants.

Mobile SDK integration — keep-mobile/src/lib.rs, keep-mobile/src/persistence.rs, keep-mobile/src/types.rs, keep-mobile/src/error.rs
  • Added a HealthCallbacks trait, the health_check API, persistence of health entries, the KeyHealthStatusInfo type, health_status_list, and an InvalidInput error; added wallet_descriptor_propose_with_timeout.

Desktop UI small fix — keep-desktop/src/app.rs
  • Restricts wallet descriptor progress handling to update the setup phase only when progress is Contributed.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as "keep-cli\ncmd_frost_network_health_check"
    participant Keep as "keep-core\nKeep (Vault & Storage)"
    participant Node as "keep-frost-net\nKfpNode"
    participant Peers as "Network\nPeers"

    User->>CLI: health_check(group, relay?, share?, timeout)
    CLI->>Keep: unlock_vault() & read share
    Keep-->>CLI: share_data
    CLI->>Node: init_node_and_announce(share)
    Node->>Peers: discover_peers()
    Peers-->>Node: peer_list
    CLI->>Node: node.health_check(timeout)
    Node->>Peers: ping(peer) (parallel, bounded by timeout)
    Peers-->>Node: pong / no response
    Node-->>CLI: HealthCheckResult{responsive, unresponsive}
    CLI->>Keep: store_health_status(KeyHealthStatus)
    Keep-->>CLI: persisted
    CLI-->>User: summary + history

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Poem

🐰 I hopped the network, quick and spry,

pinged my peers beneath the sky,
timed the waits and marked the score,
saved each share’s remembered lore,
then bounded home and knocked once more.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 36.92%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the PR title 'Add WDC session and key health' accurately describes the main changes: configurable session timeouts for descriptor coordination and key health monitoring with liveness checks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@wksantiago wksantiago self-assigned this Feb 26, 2026
@wksantiago wksantiago linked an issue Feb 26, 2026 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
keep-frost-net/src/node/descriptor.rs (1)

32-72: ⚠️ Potential issue | 🟠 Major

Validate timeout_secs before creating the local descriptor session.

request_descriptor_with_timeout currently accepts any Option<u64> and uses it immediately for local session creation. Invalid values can cause local behavior to diverge from protocol constraints (e.g., zero-second sessions) before peer-side validation kicks in.

🔧 Proposed fix
 pub async fn request_descriptor_with_timeout(
     &self,
     policy: WalletPolicy,
     network: &str,
     own_xpub: &str,
     own_fingerprint: &str,
     timeout_secs: Option<u64>,
 ) -> Result<[u8; 32]> {
+    if let Some(t) = timeout_secs {
+        if t == 0 || t > DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS {
+            return Err(FrostNetError::Protocol(format!(
+                "timeout_secs must be between 1 and {DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS}"
+            )));
+        }
+    }
+
     let our_index = self.share.metadata.identifier;
@@
     let msg = KfpMessage::DescriptorPropose(payload);
+    msg.validate()
+        .map_err(|e| FrostNetError::Protocol(e.to_string()))?;
     let json = msg.to_json()?;

Also applies to: 87-101

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-frost-net/src/node/descriptor.rs` around lines 32 - 72, In
request_descriptor_with_timeout validate the timeout_secs parameter before
converting/using it to create a local session: check that if timeout_secs is
Some(t) then t > 0 (or meets any protocol min) and return an Err early for
invalid values; only then set session_timeout and call
sessions.create_session_with_timeout. Update the same validation where you later
create sessions (the block referenced around create_session_with_timeout) so
both uses of session_timeout are guarded and invalid zero or out-of-range
durations never produce a local session.
🧹 Nitpick comments (4)
keep-frost-net/src/protocol.rs (1)

275-283: Add explicit boundary tests for timeout_secs.

Please add coverage for None, 0, 1, DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS, and max + 1 to lock this protocol behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-frost-net/src/protocol.rs` around lines 275 - 283, Add unit tests that
exercise KfpMessage::DescriptorPropose timeout handling for the five boundary
cases: None, 0, 1, DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS, and
DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS + 1; locate the logic in protocol.rs around
KfpMessage::DescriptorPropose and the timeout_secs field and add tests that
construct DescriptorPropose messages (or call the validation path used there)
asserting success for None, 1, and DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS and
asserting an error for 0 and DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS + 1 so the
protocol behavior is fully covered.
keep-mobile/src/lib.rs (1)

743-748: Consider logging callback failures instead of discarding them silently.

Suppressing callback errors makes troubleshooting harder in app integrations.

Suggested observability improvement
             if let Some(cb) = self.health_callbacks.read().await.as_ref() {
-                let _ = cb.on_health_check_complete(
-                    result.responsive.clone(),
-                    result.unresponsive.clone(),
-                );
+                if let Err(e) = cb.on_health_check_complete(
+                    result.responsive.clone(),
+                    result.unresponsive.clone(),
+                ) {
+                    tracing::warn!("Health callback error: {e}");
+                }
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-mobile/src/lib.rs` around lines 743 - 748, Replace the silent discard of
the callback result (the line using "let _ = cb.on_health_check_complete(...)")
with explicit handling: capture the Result returned by
cb.on_health_check_complete in a variable, match it, and on Err(e) emit an error
log that includes a descriptive message, the error value and the relevant
context (e.g. result.responsive / result.unresponsive). Update the block that
reads self.health_callbacks and calls on_health_check_complete so failed
callbacks are logged (using your project's logging facility such as
tracing::error! or log::error!), rather than being ignored.
keep-mobile/src/persistence.rs (1)

399-422: load_health_statuses should prune stale/corrupt index entries.

Right now dead keys are silently skipped but retained, which can cause persistent read churn and index bloat.

Suggested cleanup pattern (matching descriptor behavior)
 pub(crate) fn load_health_statuses(storage: &Arc<dyn SecureStorage>) -> Vec<KeyHealthStatusInfo> {
@@
-    let index = load_health_index(storage);
-    let mut results = Vec::new();
-    for key in index {
-        if let Ok(data) = storage.load_share_by_key(key) {
-            if let Ok(stored) = serde_json::from_slice::<StoredHealthStatus>(&data) {
+    let index = load_health_index(storage);
+    let mut results = Vec::new();
+    let mut stale = Vec::new();
+    for key in &index {
+        match storage.load_share_by_key(key.clone()) {
+            Ok(data) => match serde_json::from_slice::<StoredHealthStatus>(&data) {
+                Ok(stored) => {
                 let stale_age = now.saturating_sub(stored.last_check_timestamp);
                 results.push(KeyHealthStatusInfo {
                     group_pubkey: stored.group_pubkey,
                     share_index: stored.share_index,
                     last_check_timestamp: stored.last_check_timestamp,
                     responsive: stored.responsive,
                     created_at: stored.created_at.unwrap_or(stored.last_check_timestamp),
                     is_stale: stale_age >= keep_core::wallet::KEY_HEALTH_STALE_THRESHOLD_SECS,
                     is_critical: stale_age >= keep_core::wallet::KEY_HEALTH_CRITICAL_THRESHOLD_SECS,
                 });
-            }
+                }
+                Err(_) => stale.push(key.clone()),
+            },
+            Err(_) => stale.push(key.clone()),
         }
     }
+    if !stale.is_empty() {
+        let cleaned: Vec<String> = index.into_iter().filter(|k| !stale.contains(k)).collect();
+        let _ = persist_health_index(storage, &cleaned);
+    }
     results
 }
keep-cli/src/commands/frost_network/mod.rs (1)

607-631: Health history output should be sorted for deterministic CLI UX.

Current order depends on storage iteration order, so output can be unstable across runs/backends.

Suggested deterministic ordering
-        let group_statuses: Vec<_> = all_statuses
+        let mut group_statuses: Vec<_> = all_statuses
             .iter()
             .filter(|s| s.group_pubkey == group_pubkey)
             .collect();
+        group_statuses.sort_by_key(|s| s.share_index);
         if !group_statuses.is_empty() {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-cli/src/commands/frost_network/mod.rs` around lines 607 - 631, The
health-history iteration is non-deterministic; make group_statuses mutable (let
mut group_statuses) and sort it deterministically before printing — e.g. sort
primarily by last_check_timestamp descending (newest first) and secondarily by
share_index ascending so ties are stable; use group_statuses.sort_by(|a,b|
b.last_check_timestamp.cmp(&a.last_check_timestamp).then(a.share_index.cmp(&b.share_index)))
after the collection and then iterate the sorted group_statuses for output.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@keep-core/src/storage.rs`:
- Around line 784-788: delete_health_status currently always returns Ok even
when the record was missing; update it to mirror other delete APIs by returning
an error for a missing record: call backend.delete(HEALTH_STATUS_TABLE,
key.as_bytes()), detect the backend's "not found" outcome (pattern-match or map
the backend error) and map it to KeepError::NotFound (or the module's
equivalent) instead of swallowing it, while still propagating other backend
errors; reference the delete_health_status function and health_status_key helper
when making the change.

In `@keep-frost-net/src/descriptor_session.rs`:
- Around line 406-415: Validate the incoming timeout in
create_session_with_timeout (and the related create_session overload) to reject
Duration::ZERO and excessively large durations: enforce a minimum >
Duration::ZERO and a configurable maximum (e.g., define a MAX_SESSION_TIMEOUT
constant such as Duration::from_secs(86_400) or wire in a configured cap) and
return an Err if the passed timeout is out of bounds; update callers to use the
same validation path or helper function (e.g., add a
validate_session_timeout(timeout: Duration) -> Result<Duration>) so
DescriptorSession construction always receives a sane, clamped timeout and avoid
immediate expiry or unbounded retention.

In `@keep-mobile/src/lib.rs`:
- Around line 731-741: Validate the timeout_secs argument at the start of
health_check before calling runtime.block_on: enforce a lower bound of 1 and an
upper bound (e.g., 300 seconds to match CLI behavior), and if out of range
return an appropriate KeepMobileError (e.g., KeepMobileError::InvalidInput with
a descriptive message). Update the health_check function to check timeout_secs,
error early on invalid values, and only construct
Duration::from_secs(timeout_secs) after the validation so downstream calls
(node.health_check(...)) never receive 0 or an unbounded value.
- Around line 766-777: The call to persistence::persist_health_status is
currently ignoring its Result (let _ = ...), which can hide failures; update the
code around persist_health_status (the call that builds KeyHealthStatusInfo with
group_pubkey, share_index, last_check_timestamp, responsive, created_at,
is_stale, is_critical) to properly handle the Result: at minimum match or use if
let Err(e) = persistence::persist_health_status(...) and log the error with
context (include group_pubkey and share_index) via the existing logger or
return/propagate the error upward where appropriate so failures are not silently
dropped.

---

Outside diff comments:
In `@keep-frost-net/src/node/descriptor.rs`:
- Around line 32-72: In request_descriptor_with_timeout validate the
timeout_secs parameter before converting/using it to create a local session:
check that if timeout_secs is Some(t) then t > 0 (or meets any protocol min) and
return an Err early for invalid values; only then set session_timeout and call
sessions.create_session_with_timeout. Update the same validation where you later
create sessions (the block referenced around create_session_with_timeout) so
both uses of session_timeout are guarded and invalid zero or out-of-range
durations never produce a local session.

---

Nitpick comments:
In `@keep-cli/src/commands/frost_network/mod.rs`:
- Around line 607-631: The health-history iteration is non-deterministic; make
group_statuses mutable (let mut group_statuses) and sort it deterministically
before printing — e.g. sort primarily by last_check_timestamp descending (newest
first) and secondarily by share_index ascending so ties are stable; use
group_statuses.sort_by(|a,b|
b.last_check_timestamp.cmp(&a.last_check_timestamp).then(a.share_index.cmp(&b.share_index)))
after the collection and then iterate the sorted group_statuses for output.

In `@keep-frost-net/src/protocol.rs`:
- Around line 275-283: Add unit tests that exercise
KfpMessage::DescriptorPropose timeout handling for the five boundary cases:
None, 0, 1, DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS, and
DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS + 1; locate the logic in protocol.rs around
KfpMessage::DescriptorPropose and the timeout_secs field and add tests that
construct DescriptorPropose messages (or call the validation path used there)
asserting success for None, 1, and DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS and
asserting an error for 0 and DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS + 1 so the
protocol behavior is fully covered.

In `@keep-mobile/src/lib.rs`:
- Around line 743-748: Replace the silent discard of the callback result (the
line using "let _ = cb.on_health_check_complete(...)") with explicit handling:
capture the Result returned by cb.on_health_check_complete in a variable, match
it, and on Err(e) emit an error log that includes a descriptive message, the
error value and the relevant context (e.g. result.responsive /
result.unresponsive). Update the block that reads self.health_callbacks and
calls on_health_check_complete so failed callbacks are logged (using your
project's logging facility such as tracing::error! or log::error!), rather than
being ignored.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8a2ac8 and 808d431.

📒 Files selected for processing (16)
  • keep-cli/src/cli.rs
  • keep-cli/src/commands/frost_network/mod.rs
  • keep-cli/src/commands/wallet.rs
  • keep-cli/src/main.rs
  • keep-core/src/backend.rs
  • keep-core/src/lib.rs
  • keep-core/src/storage.rs
  • keep-core/src/wallet.rs
  • keep-frost-net/src/descriptor_session.rs
  • keep-frost-net/src/lib.rs
  • keep-frost-net/src/node/descriptor.rs
  • keep-frost-net/src/node/mod.rs
  • keep-frost-net/src/protocol.rs
  • keep-mobile/src/lib.rs
  • keep-mobile/src/persistence.rs
  • keep-mobile/src/types.rs

@wksantiago wksantiago linked an issue Feb 26, 2026 that may be closed by this pull request

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (3)
keep-cli/src/commands/frost_network/mod.rs (1)

579-595: Consider graceful handling of partial failures during health status persistence.

If get_health_status or store_health_status fails for one share index, the entire function returns an error, potentially leaving the database in a partially updated state. Depending on the desired behavior, you might want to log individual failures and continue processing remaining shares.

♻️ Optional: Continue on individual storage failures
         for (&idx, responsive) in result
             .responsive
             .iter()
             .map(|i| (i, true))
             .chain(result.unresponsive.iter().map(|i| (i, false)))
         {
-            let existing = keep.get_health_status(&group_pubkey, idx)?;
+            let existing = match keep.get_health_status(&group_pubkey, idx) {
+                Ok(s) => s,
+                Err(e) => {
+                    tracing::warn!(share_index = idx, error = %e, "failed to get existing health status");
+                    None
+                }
+            };
             let created_at = existing.and_then(|s| s.created_at).unwrap_or(now);
             let status = keep_core::wallet::KeyHealthStatus {
                 group_pubkey,
                 share_index: idx,
                 last_check_timestamp: now,
                 responsive,
                 created_at: Some(created_at),
             };
-            keep.store_health_status(&status)?;
+            if let Err(e) = keep.store_health_status(&status) {
+                tracing::warn!(share_index = idx, error = %e, "failed to store health status");
+            }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-cli/src/commands/frost_network/mod.rs` around lines 579 - 595, The loop
currently bails out on the first DB error from keep.get_health_status or
keep.store_health_status; change it to handle per-share failures so one bad
write/read doesn't abort the whole run: wrap the calls to
keep.get_health_status(&group_pubkey, idx) and keep.store_health_status(&status)
in per-iteration error handling (e.g., match or if let Err(e) => { log the error
including group_pubkey and share index idx and continue } ), and optionally
collect encountered errors into a Vec or Count to return/emit after the loop if
you need an aggregate failure signal; ensure you still compute created_at from
existing when get_health_status succeeds and build status from
group_pubkey/idx/now before attempting store.
keep-frost-net/src/descriptor_session.rs (1)

21-33: Avoid timeout-cap drift by using the protocol constant directly.

Line 21 hardcodes 86_400 even though descriptor timeout limits already live in protocol constants. Reusing the shared constant keeps bounds consistent across modules.

♻️ Suggested alignment
 use crate::protocol::{
     KeySlot, WalletPolicy, DESCRIPTOR_ACK_PHASE_TIMEOUT_SECS, DESCRIPTOR_CONTRIBUTION_TIMEOUT_SECS,
-    DESCRIPTOR_FINALIZE_TIMEOUT_SECS, DESCRIPTOR_SESSION_TIMEOUT_SECS, MAX_FINGERPRINT_LENGTH,
+    DESCRIPTOR_FINALIZE_TIMEOUT_SECS, DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS,
+    DESCRIPTOR_SESSION_TIMEOUT_SECS, MAX_FINGERPRINT_LENGTH,
     MAX_XPUB_LENGTH,
 };
@@
-const MAX_SESSION_TIMEOUT: Duration = Duration::from_secs(86_400);
+const MAX_SESSION_TIMEOUT: Duration =
+    Duration::from_secs(DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-frost-net/src/descriptor_session.rs` around lines 21 - 33, The file
defines a local MAX_SESSION_TIMEOUT using the hardcoded 86_400s; update
validate_session_timeout to use the shared protocol timeout constant instead of
the magic number: either construct MAX_SESSION_TIMEOUT from the protocol's
session timeout constant or compare timeout.as_secs() directly against the
protocol constant (import the protocol constant into this module). Change
references inside validate_session_timeout (function name:
validate_session_timeout, symbol: MAX_SESSION_TIMEOUT) so bounds come from the
protocol constant to keep limits consistent across modules.
keep-mobile/src/persistence.rs (1)

398-421: Prune stale health-index entries during reads.

At lines 406-419, missing/corrupt records are skipped but never removed from HEALTH_STATUS_INDEX_KEY. That causes repeated failed loads and index drift over time. Consider cleaning stale keys the same way load_descriptors does.

♻️ Suggested cleanup pattern
 pub(crate) fn load_health_statuses(storage: &Arc<dyn SecureStorage>) -> Vec<KeyHealthStatusInfo> {
@@
-    let index = load_health_index(storage);
-    let mut results = Vec::new();
-    for key in index {
-        if let Ok(data) = storage.load_share_by_key(key) {
-            if let Ok(stored) = serde_json::from_slice::<StoredHealthStatus>(&data) {
+    let index = load_health_index(storage);
+    let mut results = Vec::new();
+    let mut stale = Vec::new();
+    for key in &index {
+        match storage.load_share_by_key(key.clone()) {
+            Ok(data) => match serde_json::from_slice::<StoredHealthStatus>(&data) {
                 let stale_age = now.saturating_sub(stored.last_check_timestamp);
                 results.push(KeyHealthStatusInfo {
@@
-                });
-            }
+                });
+            }
+            Err(_) => stale.push(key.clone()),
+        }
+    }
+
+    if !stale.is_empty() {
+        let cleaned: Vec<String> = index.into_iter().filter(|k| !stale.contains(k)).collect();
+        let _ = persist_health_index(storage, &cleaned);
+    }
-        }
-    }
     results
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-mobile/src/persistence.rs` around lines 398 - 421, In
load_health_statuses: the code currently skips missing/corrupt entries from
load_health_index but leaves them in the stored HEALTH_STATUS_INDEX_KEY, causing
repeated failures; modify load_health_statuses to build a cleaned index (e.g.,
iterate over the keys from load_health_index, add only keys that successfully
load and parse StoredHealthStatus to results) and remove keys that fail
storage.load_share_by_key or serde_json::from_slice; after iteration persist the
cleaned index back (use the existing index save helper analogous to
load_descriptors’ save logic — e.g., a save_health_index or the same persistence
API used elsewhere) so the next read won’t repeatedly encounter bad entries;
keep existing KeyHealthStatusInfo construction and stale/critical checks as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@keep-frost-net/src/node/descriptor.rs`:
- Around line 198-202: The current code silently drops invalid
payload.timeout_secs by filtering into propose_timeout, so replace the
filter/map pattern with explicit validation: check payload.timeout_secs and if
it is Some(t) but t == 0 or t > DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS return an
error (fail fast) with a clear message, otherwise map the valid value to
std::time::Duration::from_secs and assign to propose_timeout (or leave None when
timeout is legitimately absent); reference the existing propose_timeout binding,
payload.timeout_secs, and DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS when implementing
this change.
- Around line 32-40: The function request_descriptor_with_timeout no longer
rejects invalid network values early, allowing bad proposals to be broadcast;
add an early validation of the network parameter inside
request_descriptor_with_timeout (before any session creation or broadcast calls)
that checks the provided network string against the allowed networks and
immediately returns an Err when invalid so invalid proposals cannot proceed;
reference the network parameter and the request_descriptor_with_timeout function
to locate where to insert the guard.

In `@keep-frost-net/src/node/mod.rs`:
- Around line 1036-1050: health_check currently reads peers into all_peers and
then calls ping_peers which takes its own snapshot, causing inconsistent
responsive/unresponsive classification if peers change; fix by taking a single
peers snapshot in health_check (including share_index and any data ping_peers
needs such as PublicKey and last-seen timestamp), add a new helper like
ping_peers_from_snapshot(&self, peers_snapshot: &[(u16, PublicKey, Instant)],
timeout: Duration) -> Result<Vec<u16>> (or change ping_peers to accept a
snapshot), call that from health_check using the same snapshot, and then compute
unresponsive by comparing against that snapshot's share_index values so both
responsive and unresponsive derive from the identical peer set.

In `@keep-mobile/src/lib.rs`:
- Around line 748-753: The code currently swallows errors from the health
callback call in the block that reads self.health_callbacks and invokes
cb.on_health_check_complete(result.responsive.clone(),
result.unresponsive.clone()) using let _ = ..., so change that to capture the
Result and at minimum log failures instead of discarding them; locate the call
to on_health_check_complete and replace the ignored-result pattern with code
that awaits the call result, matches or uses if let Err(e) to emit a clear log
via your logger (or tracing) including the error and context (e.g., which
callback and the responsive/unresponsive values) so callback integration errors
are surfaced.

---

Nitpick comments:
In `@keep-cli/src/commands/frost_network/mod.rs`:
- Around line 579-595: The loop currently bails out on the first DB error from
keep.get_health_status or keep.store_health_status; change it to handle
per-share failures so one bad write/read doesn't abort the whole run: wrap the
calls to keep.get_health_status(&group_pubkey, idx) and
keep.store_health_status(&status) in per-iteration error handling (e.g., match
or if let Err(e) => { log the error including group_pubkey and share index idx
and continue } ), and optionally collect encountered errors into a Vec or Count
to return/emit after the loop if you need an aggregate failure signal; ensure
you still compute created_at from existing when get_health_status succeeds and
build status from group_pubkey/idx/now before attempting store.

In `@keep-frost-net/src/descriptor_session.rs`:
- Around line 21-33: The file defines a local MAX_SESSION_TIMEOUT using the
hardcoded 86_400s; update validate_session_timeout to use the shared protocol
timeout constant instead of the magic number: either construct
MAX_SESSION_TIMEOUT from the protocol's session timeout constant or compare
timeout.as_secs() directly against the protocol constant (import the protocol
constant into this module). Change references inside validate_session_timeout
(function name: validate_session_timeout, symbol: MAX_SESSION_TIMEOUT) so bounds
come from the protocol constant to keep limits consistent across modules.

In `@keep-mobile/src/persistence.rs`:
- Around line 398-421: In load_health_statuses: the code currently skips
missing/corrupt entries from load_health_index but leaves them in the stored
HEALTH_STATUS_INDEX_KEY, causing repeated failures; modify load_health_statuses
to build a cleaned index (e.g., iterate over the keys from load_health_index,
add only keys that successfully load and parse StoredHealthStatus to results)
and remove keys that fail storage.load_share_by_key or serde_json::from_slice;
after iteration persist the cleaned index back (use the existing index save
helper analogous to load_descriptors’ save logic — e.g., a save_health_index or
the same persistence API used elsewhere) so the next read won’t repeatedly
encounter bad entries; keep existing KeyHealthStatusInfo construction and
stale/critical checks as-is.
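The index-pruning idea can be sketched with a map standing in for the secure-storage API. Everything here is hypothetical scaffolding (the `HashMap`, the `"health_index"` key, the `Option<String>` encoding of missing/corrupt records); the point is rebuilding the index from keys that actually round-trip and persisting the cleaned list.

```rust
use std::collections::HashMap;

// Sketch: keep only keys whose records load and parse, then write the
// cleaned index back so bad entries are not retried on every read.
// `Some(Some(v))` = readable record, `Some(None)` = corrupt, `None` = missing.
fn load_and_prune(
    storage: &mut HashMap<String, Option<String>>,
    index: Vec<String>,
) -> (Vec<String>, Vec<String>) {
    let mut loaded = Vec::new();
    let mut cleaned = Vec::new();
    for key in index {
        match storage.get(&key) {
            Some(Some(v)) => {
                loaded.push(v.clone());
                cleaned.push(key); // only keys that round-trip survive
            }
            _ => { /* drop missing/corrupt keys from the index */ }
        }
    }
    // Persist the cleaned index under the index key (illustrative encoding).
    storage.insert("health_index".to_string(), Some(cleaned.join(",")));
    (loaded, cleaned)
}
```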

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 808d431 and 22e8e8e.

📒 Files selected for processing (8)
  • keep-cli/src/commands/frost_network/mod.rs
  • keep-core/src/storage.rs
  • keep-frost-net/src/descriptor_session.rs
  • keep-frost-net/src/node/descriptor.rs
  • keep-frost-net/src/node/mod.rs
  • keep-mobile/src/error.rs
  • keep-mobile/src/lib.rs
  • keep-mobile/src/persistence.rs

@wksantiago
Contributor Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Feb 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@wksantiago
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Feb 26, 2026

✅ Actions performed

Review triggered.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
keep-frost-net/src/node/mod.rs (2)

1085-1088: Log ping build/send failures during health checks.

These failures are currently swallowed, which makes transport issues look like pure peer liveness issues and complicates diagnosis.

🪵 Proposed logging for observability
-        for (_, pubkey, _) in peers_snapshot {
-            if let Ok(event) = KfpEventBuilder::ping(&self.keys, pubkey) {
-                let _ = self.client.send_event(&event).await;
-            }
-        }
+        for (_, pubkey, _) in peers_snapshot {
+            match KfpEventBuilder::ping(&self.keys, pubkey) {
+                Ok(event) => {
+                    if let Err(e) = self.client.send_event(&event).await {
+                        debug!(peer = %pubkey, error = %e, "Failed to send health-check ping");
+                    }
+                }
+                Err(e) => {
+                    debug!(peer = %pubkey, error = %e, "Failed to build health-check ping");
+                }
+            }
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-frost-net/src/node/mod.rs` around lines 1085 - 1088, The ping build/send
failures inside the health-check loop are being ignored; when iterating over
peers_snapshot and calling KfpEventBuilder::ping(&self.keys, pubkey) and then
self.client.send_event(&event).await, capture and log any Err returned by
KfpEventBuilder::ping and the Result from send_event instead of discarding
them—use the module's existing logger (or tracing macros) to emit descriptive
error messages that include the peer pubkey and the error details so transport
failures are distinguishable from peer liveness issues.

1036-1036: Apply timeout bounds at the node API boundary as well.

KfpNode::health_check is public, but timeout bounds are only enforced in keep-mobile/src/lib.rs (Lines 730-735). Adding the same guard here keeps behavior consistent for all callers.

⏱️ Proposed guard in `health_check`
 pub async fn health_check(&self, timeout: Duration) -> Result<HealthCheckResult> {
+    if timeout < Duration::from_secs(1) || timeout > Duration::from_secs(300) {
+        return Err(FrostNetError::Protocol(format!(
+            "health_check timeout must be between 1 and 300 seconds, got {timeout:?}"
+        )));
+    }
+
     let peers_snapshot: Vec<(u16, PublicKey, std::time::Instant)> = self
         .peers
         .read()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-frost-net/src/node/mod.rs` at line 1036, KfpNode::health_check currently
accepts any Duration but timeout bounds are only applied in keep-mobile;
replicate the same guard at the API boundary by clamping the incoming timeout to
the allowed range before proceeding. In the body of pub async fn
health_check(&self, timeout: Duration) -> Result<HealthCheckResult>, apply the
same min/max clamp logic used in keep-mobile (or call the shared helper if one
exists) to constrain timeout to the configured MIN/MAX values, then use the
clamped value for the rest of the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@keep-frost-net/src/node/mod.rs`:
- Around line 398-399: Validate session_timeout before performing any
relay/network side effects by calling DescriptorSessionManager::with_timeout(t)
early and handling its error immediately; if session_timeout is Some(t) attempt
DescriptorSessionManager::with_timeout(t)? before any relay add/connect work and
store the returned manager (or, when None, use DescriptorSessionManager::new()),
then reuse that validated manager for subsequent network setup so invalid
timeouts fail fast and avoid side effects and startup latency.

In `@keep-mobile/src/lib.rs`:
- Around line 738-746: The read guard on self.node (node_guard) is held across
an awaited call to node.health_check and any subsequent callback paths, causing
write contention; fix by minimizing the lock scope: immediately clone or extract
the minimal data needed (e.g., call node.as_ref().ok_or(...)? then clone
node.group_pubkey() into a local group_pubkey String or bytes) and drop
node_guard before calling node.health_check(...). Use a short-lived block or
drop(node_guard) so the awaited health_check and any callbacks run without
holding self.node's read lock; apply the same pattern for the other similar
section that holds node_guard across async work.
- Around line 807-812: Add explicit timeout bounds validation to
wallet_descriptor_propose_with_timeout: check if timeout_secs is Some and reject
values <= 0 or > DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS with a clear
KeepMobileError (mirroring health_check's 1–300 pattern), returning an Err
before forwarding the proposal; also apply the same validation to the other
proposer method around the 837–843 block so both
wallet_descriptor_propose_with_timeout and its sibling perform identical checks
instead of relying on downstream peer filtering.

---

Nitpick comments:
In `@keep-frost-net/src/node/mod.rs`:
- Around line 1085-1088: The ping build/send failures inside the health-check
loop are being ignored; when iterating over peers_snapshot and calling
KfpEventBuilder::ping(&self.keys, pubkey) and then
self.client.send_event(&event).await, capture and log any Err returned by
KfpEventBuilder::ping and the Result from send_event instead of discarding
them—use the module's existing logger (or tracing macros) to emit descriptive
error messages that include the peer pubkey and the error details so transport
failures are distinguishable from peer liveness issues.
- Line 1036: KfpNode::health_check currently accepts any Duration but timeout
bounds are only applied in keep-mobile; replicate the same guard at the API
boundary by clamping the incoming timeout to the allowed range before
proceeding. In the body of pub async fn health_check(&self, timeout: Duration)
-> Result<HealthCheckResult>, apply the same min/max clamp logic used in
keep-mobile (or call the shared helper if one exists) to constrain timeout to
the configured MIN/MAX values, then use the clamped value for the rest of the
function.


📥 Commits

Reviewing files that changed from the base of the PR and between 22e8e8e and 6acd489.

📒 Files selected for processing (3)
  • keep-frost-net/src/node/descriptor.rs
  • keep-frost-net/src/node/mod.rs
  • keep-mobile/src/lib.rs

wksantiago and others added 11 commits February 26, 2026 13:17
- Panic on poisoned mutex instead of silently recovering (H3)
- Bound health check timeout to 3600s max (M1)
- Use hex fallback instead of empty string for bech32 failure (L1)
- Floor ACK timeout at DESCRIPTOR_ACK_TIMEOUT_SECS (ACK asymmetry)
- Fix clippy lint: use function reference in map
- Use Option<u64> for created_at to avoid false staleness on legacy records
- Update lib.rs exports after conflict resolution with main
@kwsantiago kwsantiago force-pushed the WDC-session-and-key-health branch from 6acd489 to e6f9cff Compare February 26, 2026 18:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
keep-mobile/src/lib.rs (1)

738-746: ⚠️ Potential issue | 🟠 Major

Drop self.node read guards before awaited network work.

Both blocks keep RwLockReadGuard alive across .await, which can delay writers for the full network/timeout window.

🔧 Suggested lock-scope fix
-            let (group_pubkey, result) = {
-                let node_guard = self.node.read().await;
-                let node = node_guard.as_ref().ok_or(KeepMobileError::NotInitialized)?;
-                let gp = hex::encode(node.group_pubkey());
-                let r = node
-                    .health_check(Duration::from_secs(timeout_secs))
-                    .await
-                    .map_err(|e| KeepMobileError::NetworkError { msg: e.to_string() })?;
-                (gp, r)
-            };
+            let node = {
+                let node_guard = self.node.read().await;
+                node_guard
+                    .as_ref()
+                    .cloned()
+                    .ok_or(KeepMobileError::NotInitialized)?
+            };
+            let group_pubkey = hex::encode(node.group_pubkey());
+            let result = node
+                .health_check(Duration::from_secs(timeout_secs))
+                .await
+                .map_err(|e| KeepMobileError::NetworkError { msg: e.to_string() })?;
-            let session_id = {
-                let node_guard = self.node.read().await;
-                let node = node_guard.as_ref().ok_or(KeepMobileError::NotInitialized)?;
+            let node = {
+                let node_guard = self.node.read().await;
+                node_guard
+                    .as_ref()
+                    .cloned()
+                    .ok_or(KeepMobileError::NotInitialized)?
+            };
+            let session_id = {
                 // existing logic...
                 node.request_descriptor_with_timeout(
                     policy,
                     &network,
                     &xpub,
                     &fingerprint,
                     timeout_secs,
                 )
                 .await
                 .map_err(|e| KeepMobileError::NetworkError { msg: e.to_string() })?
             };
#!/bin/bash
rg -n -C6 'let node_guard = self\.node\.read\(\)\.await;|health_check\(Duration::from_secs|request_descriptor_with_timeout\(' keep-mobile/src/lib.rs

Also applies to: 830-855

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-mobile/src/lib.rs` around lines 738 - 746, The read guard from
self.node.read().await (node_guard) is held across an await
(node.health_check(...).await), blocking writers; fix by limiting the guard's
scope: acquire self.node.read().await, check that node is Some (returning
KeepMobileError::NotInitialized if not), clone or extract only the needed sync
data (e.g., call node.group_pubkey() and hex::encode it or clone any request
params), then drop the guard (end the block) before calling async functions like
node.health_check(Duration::from_secs(timeout_secs)). Do the same pattern for
other occurrences such as request_descriptor_with_timeout to ensure no
RwLockReadGuard is alive across .await.
keep-frost-net/src/node/descriptor.rs (1)

205-209: ⚠️ Potential issue | 🟠 Major

Fail fast on invalid proposal timeout instead of silently downgrading.

Filtering invalid payload.timeout_secs to None converts bad input into default behavior; this handler should reject out-of-range values explicitly.

🔒 Suggested strict timeout handling
-        let propose_timeout = payload
-            .timeout_secs
-            .filter(|&t| t > 0 && t <= DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS)
-            .map(std::time::Duration::from_secs);
+        let propose_timeout = match payload.timeout_secs {
+            Some(0) => {
+                return Err(FrostNetError::Session(
+                    "Descriptor timeout must be greater than zero".into(),
+                ));
+            }
+            Some(t) if t > DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS => {
+                return Err(FrostNetError::Session(format!(
+                    "Descriptor timeout {t}s exceeds maximum {}s",
+                    DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS
+                )));
+            }
+            Some(t) => Some(std::time::Duration::from_secs(t)),
+            None => None,
+        };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-frost-net/src/node/descriptor.rs` around lines 205 - 209, The code
silently converts out-of-range payload.timeout_secs to None by filtering;
instead validate payload.timeout_secs explicitly and return an error when it's
Some but <= 0 or > DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS. In the block that
currently computes propose_timeout, check payload.timeout_secs: if None, leave
propose_timeout None; if Some(t) and t is within
(0..=DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS) set propose_timeout =
Some(Duration::from_secs(t)); otherwise return an Err (or propagate a validation
error) from the surrounding handler so invalid timeouts fail fast; reference the
propose_timeout variable, payload.timeout_secs, and
DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS to locate and update the logic.
🧹 Nitpick comments (1)
keep-mobile/src/persistence.rs (1)

398-421: Prune stale health-index entries during load.

Missing/corrupt records are ignored but kept in the index, so dead keys can accumulate and be retried forever.

♻️ Suggested cleanup pattern
 pub(crate) fn load_health_statuses(storage: &Arc<dyn SecureStorage>) -> Vec<KeyHealthStatusInfo> {
@@
-    let index = load_health_index(storage);
+    let index = load_health_index(storage);
     let mut results = Vec::new();
-    for key in index {
-        if let Ok(data) = storage.load_share_by_key(key) {
+    let mut stale = Vec::new();
+    for key in &index {
+        if let Ok(data) = storage.load_share_by_key(key.clone()) {
             if let Ok(stored) = serde_json::from_slice::<StoredHealthStatus>(&data) {
                 let stale_age = now.saturating_sub(stored.last_check_timestamp);
                 results.push(KeyHealthStatusInfo {
@@
                 });
+            } else {
+                stale.push(key.clone());
             }
+        } else {
+            stale.push(key.clone());
         }
     }
+    if !stale.is_empty() {
+        let cleaned: Vec<String> = index.into_iter().filter(|k| !stale.contains(k)).collect();
+        let _ = persist_health_index(storage, &cleaned);
+    }
     results
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-mobile/src/persistence.rs` around lines 398 - 421, load_health_statuses
currently ignores missing/corrupt entries but leaves their keys in the health
index (loaded via load_health_index), causing dead keys to accumulate; modify
load_health_statuses so that when storage.load_share_by_key(key) returns Err or
serde_json::from_slice::<StoredHealthStatus>(&data) fails it removes that key
from the in-memory index and after iterating persists the cleaned index back to
storage (call the existing index save function, e.g. save_health_index or
equivalent) so only successfully loaded keys remain in the stored index; keep
all other logic (timestamp/staleness calculation and constructing
KeyHealthStatusInfo) unchanged and ensure errors do not abort the loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@keep-desktop/src/app.rs`:
- Around line 1358-1361: The issue is that matches!(progress,
DescriptorProgress::Contributed) moves progress (DescriptorProgress is Clone but
not Copy) before it is reused; update the condition to borrow progress instead
(e.g., use matches!(&progress, DescriptorProgress::Contributed) or match on
&progress) so the subsequent use of progress when setting s.phase =
SetupPhase::Coordinating(progress) compiles; modify the if condition around
DescriptorProgress and ensure you still pass the owned progress into
SetupPhase::Coordinating when assigning via self.screen -> Screen::Wallet /
WalletScreen.

In `@keep-frost-net/src/node/mod.rs`:
- Around line 1068-1070: The health_check function currently silently clamps the
caller-provided timeout; instead validate the timeout argument and return an
explicit error when it's out of the allowed range. In the health_check method,
remove the timeout.clamp(...) call and add a bounds check against the allowed
minimum and maximum (e.g., Duration::from_secs(1) and Duration::from_secs(300));
if the timeout is outside that range return an Err (use the same
Result/anyhow/Error type used by health_check) with a clear message describing
the invalid timeout value. Keep the rest of health_check logic unchanged (use
the validated timeout variable thereafter).

---

Duplicate comments:
In `@keep-frost-net/src/node/descriptor.rs`:
- Around line 205-209: The code silently converts out-of-range
payload.timeout_secs to None by filtering; instead validate payload.timeout_secs
explicitly and return an error when it's Some but <= 0 or >
DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS. In the block that currently computes
propose_timeout, check payload.timeout_secs: if None, leave propose_timeout
None; if Some(t) and t is within (0..=DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS) set
propose_timeout = Some(Duration::from_secs(t)); otherwise return an Err (or
propagate a validation error) from the surrounding handler so invalid timeouts
fail fast; reference the propose_timeout variable, payload.timeout_secs, and
DESCRIPTOR_SESSION_MAX_TIMEOUT_SECS to locate and update the logic.

In `@keep-mobile/src/lib.rs`:
- Around line 738-746: The read guard from self.node.read().await (node_guard)
is held across an await (node.health_check(...).await), blocking writers; fix by
limiting the guard's scope: acquire self.node.read().await, check that node is
Some (returning KeepMobileError::NotInitialized if not), clone or extract only
the needed sync data (e.g., call node.group_pubkey() and hex::encode it or clone
any request params), then drop the guard (end the block) before calling async
functions like node.health_check(Duration::from_secs(timeout_secs)). Do the same
pattern for other occurrences such as request_descriptor_with_timeout to ensure
no RwLockReadGuard is alive across .await.

---

Nitpick comments:
In `@keep-mobile/src/persistence.rs`:
- Around line 398-421: load_health_statuses currently ignores missing/corrupt
entries but leaves their keys in the health index (loaded via
load_health_index), causing dead keys to accumulate; modify load_health_statuses
so that when storage.load_share_by_key(key) returns Err or
serde_json::from_slice::<StoredHealthStatus>(&data) fails it removes that key
from the in-memory index and after iterating persists the cleaned index back to
storage (call the existing index save function, e.g. save_health_index or
equivalent) so only successfully loaded keys remain in the stored index; keep
all other logic (timestamp/staleness calculation and constructing
KeyHealthStatusInfo) unchanged and ensure errors do not abort the loop.


📥 Commits

Reviewing files that changed from the base of the PR and between 6acd489 and e6f9cff.

📒 Files selected for processing (18)
  • keep-cli/src/cli.rs
  • keep-cli/src/commands/frost_network/mod.rs
  • keep-cli/src/commands/wallet.rs
  • keep-cli/src/main.rs
  • keep-core/src/backend.rs
  • keep-core/src/lib.rs
  • keep-core/src/storage.rs
  • keep-core/src/wallet.rs
  • keep-desktop/src/app.rs
  • keep-frost-net/src/descriptor_session.rs
  • keep-frost-net/src/lib.rs
  • keep-frost-net/src/node/descriptor.rs
  • keep-frost-net/src/node/mod.rs
  • keep-frost-net/src/protocol.rs
  • keep-mobile/src/error.rs
  • keep-mobile/src/lib.rs
  • keep-mobile/src/persistence.rs
  • keep-mobile/src/types.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • keep-core/src/backend.rs
  • keep-cli/src/main.rs
  • keep-core/src/wallet.rs
  • keep-mobile/src/types.rs


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
keep-frost-net/src/node/descriptor.rs (1)

219-244: ⚠️ Potential issue | 🟠 Major

Don’t collapse all session-creation errors into “duplicate proposal.”

Err(_) currently treats every failure as a duplicate session. That can mask real protocol/state errors and make malformed proposals invisible operationally. Only the true “already exists” path should be ignored.

🔧 Suggested fix
-        let session_created = {
-            let mut sessions = self.descriptor_sessions.write();
-            match sessions.create_session_with_timeout(
-                payload.session_id,
-                self.group_pubkey,
-                payload.policy.clone(),
-                payload.network.clone(),
-                expected_contributors,
-                HashSet::new(),
-                propose_timeout,
-            ) {
-                Ok(session) => {
-                    session.set_initiator(sender);
-
-                    if let Err(e) = session.add_contribution(
-                        sender_share_index,
-                        payload.initiator_xpub.clone(),
-                        payload.initiator_fingerprint.clone(),
-                    ) {
-                        debug!("Failed to store initiator contribution: {e}");
-                    }
-                    true
-                }
-                Err(_) => {
-                    debug!("Descriptor session already exists, ignoring duplicate proposal");
-                    false
-                }
-            }
-        };
+        let session_created = {
+            let mut sessions = self.descriptor_sessions.write();
+            if sessions.get_session(&payload.session_id).is_some() {
+                debug!("Descriptor session already exists, ignoring duplicate proposal");
+                false
+            } else {
+                let session = sessions.create_session_with_timeout(
+                    payload.session_id,
+                    self.group_pubkey,
+                    payload.policy.clone(),
+                    payload.network.clone(),
+                    expected_contributors,
+                    HashSet::new(),
+                    propose_timeout,
+                )?;
+                session.set_initiator(sender);
+
+                if let Err(e) = session.add_contribution(
+                    sender_share_index,
+                    payload.initiator_xpub.clone(),
+                    payload.initiator_fingerprint.clone(),
+                ) {
+                    debug!("Failed to store initiator contribution: {e}");
+                }
+                true
+            }
+        };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-frost-net/src/node/descriptor.rs` around lines 219 - 244, The Err arm of
the sessions.create_session_with_timeout call currently treats all errors as
duplicates; change the match to capture the error (e.g., Err(e)) and explicitly
handle the “already exists”/duplicate session variant by comparing e to the
specific error (or using matches!(e, SessionError::AlreadyExists | ...)) and
return false only for that case, while logging and returning false/propagating
(per surrounding convention) for any other error; update the debug message in
the non-duplicate branch to include the error details so malformed proposals and
real protocol/state errors are visible (refer to
sessions.create_session_with_timeout, the Err(_) branch and the surrounding
match block).
🧹 Nitpick comments (1)
keep-frost-net/src/node/mod.rs (1)

1084-1088: Consider using a set for responsive membership checks.

responsive.contains(idx) in a loop is quadratic. Converting responsive to HashSet<u16> first keeps this linear and clearer.
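The linear-time variant looks like this. The function name and slice-based signature are illustrative; the real code partitions a peer snapshot of `(u16, PublicKey, Instant)` tuples, but the `HashSet` lookup is the same.

```rust
use std::collections::HashSet;

// Sketch: build the responsive set once, then each membership check is O(1),
// making the whole partition O(n + m) instead of O(n * m).
fn unresponsive_peers(peers: &[u16], responsive: &[u16]) -> Vec<u16> {
    let responsive_set: HashSet<u16> = responsive.iter().copied().collect();
    peers
        .iter()
        .copied()
        .filter(|idx| !responsive_set.contains(idx))
        .collect()
}
```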

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@keep-frost-net/src/node/mod.rs` around lines 1084 - 1088, The current
collection of unresponsive peers is doing a membership check against responsive
in a loop causing O(n*m) behavior; convert responsive into a HashSet<u16> first
(e.g., build responsive_set from responsive via iter().cloned().collect()) and
then use responsive_set.contains(idx) inside the peers_snapshot mapping to
produce unresponsive, and add use std::collections::HashSet if not already
imported; update references to use responsive_set.contains in the block that
produces unresponsive.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@keep-mobile/src/persistence.rs`:
- Around line 414-423: The current use of saturating_sub when computing
stale_age hides future timestamps (last_check_timestamp > now) by yielding 0 and
marking keys as fresh; change the computation to treat future timestamps as aged
by replacing saturating_sub with checked_sub and handling None as a very large
age (e.g., now.checked_sub(stored.last_check_timestamp).unwrap_or(u64::MAX)) so
that is_stale/is_critical (which compare against KEY_HEALTH_STALE_THRESHOLD_SECS
and KEY_HEALTH_CRITICAL_THRESHOLD_SECS) correctly flag skewed future timestamps
as stale/critical; update the stale_age binding accordingly (referencing
stale_age, last_check_timestamp, now, and the KEY_HEALTH_* constants).
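The `checked_sub` fix can be shown in isolation. The threshold value below is illustrative, not the real `KEY_HEALTH_STALE_THRESHOLD_SECS`; the point is that a future timestamp maps to `u64::MAX` (maximally aged) instead of 0 (freshly checked).

```rust
// Sketch: treat a future last-check timestamp (clock skew) as maximally aged
// rather than age zero, so staleness thresholds still trigger.
fn stale_age(now: u64, last_check: u64) -> u64 {
    // saturating_sub would return 0 for last_check > now and make a
    // skewed-future record look fresh; checked_sub exposes that case.
    now.checked_sub(last_check).unwrap_or(u64::MAX)
}

// Illustrative threshold; the real constant lives in keep-mobile.
const KEY_HEALTH_STALE_THRESHOLD_SECS: u64 = 7 * 24 * 3600;

fn is_stale(now: u64, last_check: u64) -> bool {
    stale_age(now, last_check) > KEY_HEALTH_STALE_THRESHOLD_SECS
}
```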
- Around line 390-394: The read-modify-write in load_health_index(...) /
persist_health_index(...) is non-atomic and can lose entries under concurrent
writes; change this to an atomic update: use the storage layer's
transaction/compare-and-swap or a retry loop that reloads the index and
re-applies the push until persist_health_index succeeds (or use a mutex around
HEALTH_STATUS_INDEX_KEY updates if storage has no transaction API).
Specifically, locate the block using load_health_index, the local variable
index, the key variable, and the call to persist_health_index and replace it
with an atomic update strategy (transactional write, CAS, or retry-on-conflict)
so two concurrent updates cannot overwrite each other.
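The mutex option mentioned above can be sketched with std types. The `HealthIndex` struct is a hypothetical stand-in for the persisted `HEALTH_STATUS_INDEX_KEY` value; the essential property is that load, modify, and store all happen under one lock.

```rust
use std::sync::Mutex;

// Sketch: serialize read-modify-write of the health index so concurrent
// updates cannot drop each other's entries.
struct HealthIndex {
    keys: Mutex<Vec<String>>,
}

impl HealthIndex {
    fn add_key(&self, key: String) {
        // Holding the lock across the whole read-modify-write makes the
        // update atomic with respect to other callers of add_key.
        let mut keys = self.keys.lock().expect("health index mutex poisoned");
        if !keys.contains(&key) {
            keys.push(key);
        }
        // In the real code, persist_health_index(&keys) would run here,
        // still under the lock, before the guard is dropped.
    }
}
```

If the storage layer offers transactions or compare-and-swap, those are preferable to a process-local mutex because they also protect against updates from other processes.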

---

Outside diff comments:
In `@keep-frost-net/src/node/descriptor.rs`:
- Around line 219-244: The Err arm of the sessions.create_session_with_timeout
call currently treats all errors as duplicates; change the match to capture the
error (e.g., Err(e)) and explicitly handle the “already exists”/duplicate
session variant by comparing e to the specific error (or using matches!(e,
SessionError::AlreadyExists | ...)) and return false only for that case, while
logging and returning false/propagating (per surrounding convention) for any
other error; update the debug message in the non-duplicate branch to include the
error details so malformed proposals and real protocol/state errors are visible
(refer to sessions.create_session_with_timeout, the Err(_) branch and the
surrounding match block).

---

Nitpick comments:
In `@keep-frost-net/src/node/mod.rs`:
- Around line 1084-1088: The current collection of unresponsive peers is doing a
membership check against responsive in a loop causing O(n*m) behavior; convert
responsive into a HashSet<u16> first (e.g., build responsive_set from responsive
via iter().cloned().collect()) and then use responsive_set.contains(idx) inside
the peers_snapshot mapping to produce unresponsive, and add use
std::collections::HashSet if not already imported; update references to use
responsive_set.contains in the block that produces unresponsive.


📥 Commits

Reviewing files that changed from the base of the PR and between e6f9cff and 124a47f.

📒 Files selected for processing (3)
  • keep-frost-net/src/node/descriptor.rs
  • keep-frost-net/src/node/mod.rs
  • keep-mobile/src/persistence.rs

@kwsantiago kwsantiago merged commit c8f2989 into main Feb 26, 2026
9 checks passed
@kwsantiago kwsantiago deleted the WDC-session-and-key-health branch February 26, 2026 19:16


Development

Successfully merging this pull request may close these issues.

Add periodic key health check pings
Support configurable WDC session timeout up to 24h

2 participants