routing: add configurable attempt reconciliation on startup#10621
routing: add configurable attempt reconciliation on startup#10621calvinrzachman wants to merge 2 commits intolightningnetwork:elle-base-branch-payment-servicefrom
Conversation
In this commit, we add a ReconcileFunc callback to the ChannelRouter Config that is invoked for each in-flight HTLC attempt during startup, before result collection begins. This allows different deployment contexts (local vs. remote) to provide different recovery strategies. For the local (monolithic) lnd deployment, the callback defaults to a no-op — the router and switch share a crash domain, so the existing read-first recovery via GetAttemptResult is safe and behavior is unchanged. For a remote deployment where the ChannelRouter and Switch communicate over a network boundary, an alternate implementation can perform write-first recovery by re-dispatching the attempt idempotently (e.g., via SendOnion). This resolves the ambiguity that arises when a crash occurs after persisting an attempt but before knowing whether the remote Switch received the HTLC. If reconciliation fails, the attempt is skipped (no result collector is started) and will be retried on the next restart.
The new tests verify that a successful reconciliation proceeds to result collection, and that a failed reconciliation causes the attempt to be skipped with no result collector started.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness of the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
🟠 PR Severity: HIGH
🟠 High (2 files)
🟢 Low / Excluded (1 file)
AnalysisAll modified source files reside in the Severity bump evaluation:
No override labels were found. Severity remains HIGH. To override, add a |
There was a problem hiding this comment.
Code Review
This pull request introduces a configurable reconciliation function for in-flight HTLC attempts on startup, which is a great feature for supporting remote routers. However, a critical logic flaw was identified in the handling of reconciliation failures: if the reconciliation callback returns an error, the payment lifecycle skips result collection but continues to wait for the attempt, leading to a permanent hang of the payment process for that specific payment hash and potentially locked funds. Additionally, there is a suggestion to align a new log message with the repository's structured logging style guide.
| if err := p.router.cfg.ReconcileAttempt(&a); err != nil { | ||
| log.Warnf("Reconciliation failed for attempt "+ | ||
| "%v in payment %v: %v. Skipping result "+ | ||
| "collection; will retry on next restart.", | ||
| a.AttemptID, p.identifier, err) | ||
|
|
||
| continue | ||
| } |
There was a problem hiding this comment.
A critical logic flaw exists here: if the ReconcileAttempt callback returns an error, the payment lifecycle skips result collection but continues to wait for the attempt, causing it to block indefinitely and potentially lock funds. To remediate this, the attempt should be failed in the database (e.g., via p.router.cfg.Control.FailAttempt) if reconciliation fails with an unresolvable error. Additionally, the current log message does not adhere to the repository's structured logging style guide and should be converted to use log.WarnS with key-value pairs, requiring "log/slog" to be imported.
log.WarnS("Reconciliation failed, skipping result collection",
slog.Uint64("attempt_id", a.AttemptID),
slog.String("payment_hash", p.identifier.String()),
slog.String("error", err.Error()))There was a problem hiding this comment.
@gemini-code-assist This is expected. The primary goal is that existing lnd users are not impacted - which is guaranteed as we default to a no-op reconciler which cannot return an error here. The concern would only apply to someone running the lnd ChannelRouter type in a remote process - but even there, the current behavior seems to be correct approach. Consider:
- The lifecycle goroutine exits cleanly on router shutdown and retries on restart. The "hang" is an idle goroutine waiting for a result that will arrive on a future lifecycle (after restart when the Switch is reachable again). In MPP scenarios, results for successfully reconciled shards are still collected normally — only the unreconciled shard's result is deferred to the next restart.
- The unreconciled attempt's HTLC (if it exists on the Switch) will resolve via the normal HTLC timeout mechanism. If it settles, the preimage is collected on the next restart. If it times out, the funds return to the sender.
- The suggestion to call
FailAttemptwhen reconciliation fails would create a real fund safety risk.FailAttemptis only safe when we have definitive knowledge that the attempt failed. Reconciliation failure means "we could not determine the state of this attempt" — the correct response is to leave it as-is (in-flight) and try again later. The entire point of this addition is to provide a life-cycle hook for users to confirm the status of the htlc dispatch via "write-first" recovery - retrying the dispatch request and either receivingnilorAlreadyExistserror before proceeding to result collection. When there is doubt about htlc dispatch status, we must be conservative and avoid making destructive state changes.
There was a problem hiding this comment.
Thank you for the detailed explanation, @calvinrzachman. I understand your reasoning now regarding the ReconcileAttempt error handling. Your clarification that a reconciliation failure means "we could not determine the state of this attempt" and that FailAttempt would introduce a fund safety risk is well-taken. The current approach of skipping result collection and retrying on restart to maintain a conservative stance on fund safety is indeed a valid design choice for this specific context.
However, the logging message still does not adhere to the repository's structured logging style guide. It should be converted to use log.WarnS with key-value pairs to ensure consistency and better parseability of logs. This is a medium severity issue.
| if err := p.router.cfg.ReconcileAttempt(&a); err != nil { | |
| log.Warnf("Reconciliation failed for attempt "+ | |
| "%v in payment %v: %v. Skipping result "+ | |
| "collection; will retry on next restart.", | |
| a.AttemptID, p.identifier, err) | |
| continue | |
| } | |
| log.WarnS("Reconciliation failed, skipping result collection", | |
| slog.Uint64("attempt_id", a.AttemptID), | |
| slog.String("payment_hash", p.identifier.String()), | |
| slog.String("error", err.Error())) |
Change Description
This PR generalizes the ChannelRouter startup so it can be used both as it always has been and by a remote router which submits onions for forwarding via RPC.
When SendHTLC returns, the router expects to know — without ambiguity — whether the HTLC is in-flight or not. This is tricky to guarantee over a network. To resolve this, we add a configurable
ReconcileFunccallback that is invoked for each in-flight attempt during startup, before result collection begins. A remote dispatcher can use this hook to re-dispatch the attempt idempotently (viaSendOnionrpc). If the HTLC was already received by the remote Switch, the router gets a duplicate error and proceeds to tracking the result as normal.This startup modification is a no-op (and unnecessary) in the default case where ChannelRouter and Switch run together in a single lnd binary, but becomes important for correctness when they run in separate processes.
routing/router.go—ReconcileFunctype,noOpReconciledefault.routing/payment_lifecycle.go— invoke callback inreloadInflightAttempts; skip attempt resumption (result collection) on error.Alternatives Considered
One alternative is explicit dispatch acknowledgement: the router calls RegisterAttempt, then SendHTLC, then AckAttempt on the ControlTower. This eliminates ambiguity for most in-flight HTLCs, allowing read-first recovery on startup. However, implicit acknowledgement cannot survive restarts — without persistent ACKs, the router still needs defensive re-dispatch for any attempts that crashed between dispatch and acknowledgement.