
routing: add configurable attempt reconciliation on startup #10621

Open
calvinrzachman wants to merge 2 commits into lightningnetwork:elle-base-branch-payment-service from calvinrzachman:generalized-router-startup

@calvinrzachman
Collaborator

Change Description

This PR generalizes the ChannelRouter startup so it can be used both as it always has been and by a remote router which submits onions for forwarding via RPC.

When SendHTLC returns, the router expects to know, without ambiguity, whether the HTLC is in-flight or not. This is tricky to guarantee over a network. To resolve this, we add a configurable ReconcileFunc callback that is invoked for each in-flight attempt during startup, before result collection begins. A remote dispatcher can use this hook to re-dispatch the attempt idempotently (via the SendOnion RPC). If the HTLC was already received by the remote Switch, the router gets a duplicate error and proceeds to track the result as normal.

This startup modification is a no-op (and unnecessary) in the default case where ChannelRouter and Switch run together in a single lnd binary, but becomes important for correctness when they run in separate processes.

  • routing/router.go: ReconcileFunc type, noOpReconcile default.
  • routing/payment_lifecycle.go: invoke callback in reloadInflightAttempts; skip attempt resumption (result collection) on error.
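
As a rough illustration of the router.go side, here is a minimal, self-contained sketch of the hook and its default. The ReconcileFunc signature is inferred from the `ReconcileAttempt(&a)` call quoted later in this review; HTLCAttempt, Config, and New are simplified stand-ins and not the actual lnd definitions.

```go
package main

import "fmt"

// HTLCAttempt is a simplified stand-in for the router's persisted attempt
// record. Only the field needed for this sketch is included.
type HTLCAttempt struct {
	AttemptID uint64
}

// ReconcileFunc is invoked once per in-flight attempt during startup, before
// result collection begins. The signature is an assumption based on the
// call site shown in the review.
type ReconcileFunc func(attempt *HTLCAttempt) error

// noOpReconcile is the default for a monolithic deployment: there is nothing
// to reconcile, so it always returns nil.
func noOpReconcile(_ *HTLCAttempt) error {
	return nil
}

// Config holds only the hook relevant to this sketch.
type Config struct {
	ReconcileAttempt ReconcileFunc
}

// ChannelRouter is a minimal stand-in for the real router type.
type ChannelRouter struct {
	cfg *Config
}

// New falls back to noOpReconcile when no callback was configured, so
// existing callers see no behavior change.
func New(cfg Config) *ChannelRouter {
	if cfg.ReconcileAttempt == nil {
		cfg.ReconcileAttempt = noOpReconcile
	}

	return &ChannelRouter{cfg: &cfg}
}

func main() {
	r := New(Config{})
	err := r.cfg.ReconcileAttempt(&HTLCAttempt{AttemptID: 1})
	fmt.Println("default reconcile error:", err)
}
```

Defaulting in the constructor keeps the call site in the lifecycle code free of nil checks.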

Alternatives Considered

One alternative is explicit dispatch acknowledgement: the router calls RegisterAttempt, then SendHTLC, then AckAttempt on the ControlTower. This eliminates ambiguity for most in-flight HTLCs, allowing read-first recovery on startup. However, implicit acknowledgement cannot survive restarts: without persistent ACKs, the router still needs defensive re-dispatch for any attempt where a crash occurred between dispatch and acknowledgement.
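
To make the trade-off concrete, the following sketch models how a router using this alternative might pick a recovery strategy on startup. Everything here (the Attempt struct, the Acked flag, chooseRecovery) is hypothetical and only illustrates the reasoning above; it is not code from this PR.

```go
package main

import "fmt"

// Attempt models the persisted state under the explicit-acknowledgement
// alternative. The Acked field is hypothetical: it would be set by AckAttempt
// after SendHTLC returned successfully.
type Attempt struct {
	ID    uint64
	Acked bool
}

// Recovery names the two startup strategies discussed in the description.
type Recovery int

const (
	// ReadFirst: dispatch is known to have happened, so just query the result.
	ReadFirst Recovery = iota
	// WriteFirst: dispatch status is unknown, so re-dispatch idempotently.
	WriteFirst
)

func (r Recovery) String() string {
	if r == ReadFirst {
		return "read-first"
	}
	return "write-first"
}

// chooseRecovery shows why acknowledgements alone are not enough: any attempt
// persisted without an ACK (a crash between dispatch and acknowledgement)
// still needs the defensive re-dispatch path.
func chooseRecovery(a Attempt) Recovery {
	if a.Acked {
		return ReadFirst
	}
	return WriteFirst
}

func main() {
	attempts := []Attempt{{ID: 1, Acked: true}, {ID: 2}}
	for _, a := range attempts {
		fmt.Printf("attempt %d: %s\n", a.ID, chooseRecovery(a))
	}
}
```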

In this commit, we add a ReconcileFunc callback to the ChannelRouter
Config that is invoked for each in-flight HTLC attempt during startup,
before result collection begins. This allows different deployment
contexts (local vs. remote) to provide different recovery strategies.

For the local (monolithic) lnd deployment, the callback defaults to a
no-op — the router and switch share a crash domain, so the existing
read-first recovery via GetAttemptResult is safe and behavior is
unchanged.

For a remote deployment where the ChannelRouter and Switch communicate
over a network boundary, an alternate implementation can perform
write-first recovery by re-dispatching the attempt idempotently (e.g.,
via SendOnion). This resolves the ambiguity that arises when a crash
occurs after persisting an attempt but before knowing whether the
remote Switch received the HTLC.
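
A write-first reconciler for the remote case could look roughly like the sketch below. The Dispatcher interface, the ErrDuplicate sentinel, and fakeSwitch are assumptions made for illustration; the real SendOnion RPC and its duplicate error are not shown in this PR description.

```go
package main

import (
	"errors"
	"fmt"
)

// HTLCAttempt is a simplified stand-in for the persisted attempt record.
type HTLCAttempt struct {
	AttemptID uint64
}

// ErrDuplicate is a hypothetical sentinel meaning "the Switch already has
// this attempt".
var ErrDuplicate = errors.New("duplicate attempt")

// Dispatcher is a hypothetical client for a SendOnion-style RPC.
type Dispatcher interface {
	SendOnion(attemptID uint64) error
}

// NewRemoteReconciler returns a callback that re-dispatches the attempt.
// Both a fresh acceptance and a duplicate error confirm that the HTLC is
// in-flight. Any other error means the state is still unknown.
func NewRemoteReconciler(d Dispatcher) func(*HTLCAttempt) error {
	return func(a *HTLCAttempt) error {
		err := d.SendOnion(a.AttemptID)
		switch {
		case err == nil, errors.Is(err, ErrDuplicate):
			return nil
		default:
			return fmt.Errorf("reconcile attempt %d: %w",
				a.AttemptID, err)
		}
	}
}

// fakeSwitch is an in-memory stand-in that remembers attempt IDs it has seen.
type fakeSwitch struct {
	seen map[uint64]bool
	down bool
}

func (s *fakeSwitch) SendOnion(id uint64) error {
	if s.down {
		return errors.New("switch unreachable")
	}
	if s.seen[id] {
		return ErrDuplicate
	}
	s.seen[id] = true

	return nil
}

func main() {
	sw := &fakeSwitch{seen: map[uint64]bool{1: true}}
	reconcile := NewRemoteReconciler(sw)

	// Attempt 1 reached the Switch before the crash; attempt 2 did not.
	fmt.Println(reconcile(&HTLCAttempt{AttemptID: 1}))
	fmt.Println(reconcile(&HTLCAttempt{AttemptID: 2}))

	// With the Switch unreachable, the state stays ambiguous.
	sw.down = true
	fmt.Println(reconcile(&HTLCAttempt{AttemptID: 3}))
}
```

The important property is that the dispatch call is idempotent on the receiving side, so repeating it after a crash cannot create a second HTLC for the same attempt.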

If reconciliation fails, the attempt is skipped (no result collector
is started) and will be retried on the next restart.

The new tests verify that a successful reconciliation proceeds to result
collection, and that a failed reconciliation causes the attempt to be
skipped with no result collector started.
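
The skip-on-error behavior can be modeled with a small loop. This is a simplified sketch rather than the actual reloadInflightAttempts: instead of starting result collectors it returns the IDs for which collection would start, and all names other than ReconcileFunc are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// HTLCAttempt is a simplified stand-in for the persisted attempt record.
type HTLCAttempt struct {
	AttemptID uint64
}

// ReconcileFunc mirrors the hook described above (signature assumed).
type ReconcileFunc func(*HTLCAttempt) error

// errUnreachable is an illustrative error for a Switch that cannot be reached.
var errUnreachable = errors.New("switch unreachable")

// reloadInflight reconciles each in-flight attempt and returns the IDs for
// which result collection would be started. Attempts whose reconciliation
// fails are left untouched (still in-flight) and skipped until next restart.
func reloadInflight(attempts []HTLCAttempt, reconcile ReconcileFunc) []uint64 {
	var collecting []uint64
	for i := range attempts {
		a := &attempts[i]
		if err := reconcile(a); err != nil {
			fmt.Printf("reconciliation failed for attempt %d: %v; "+
				"skipping\n", a.AttemptID, err)

			continue
		}
		collecting = append(collecting, a.AttemptID)
	}

	return collecting
}

// failSecond simulates a reconciler that cannot confirm attempt 2.
func failSecond(a *HTLCAttempt) error {
	if a.AttemptID == 2 {
		return errUnreachable
	}
	return nil
}

func main() {
	attempts := []HTLCAttempt{{AttemptID: 1}, {AttemptID: 2}, {AttemptID: 3}}
	fmt.Println(reloadInflight(attempts, failSecond)) // prints [1 3] after the skip message
}
```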
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the ChannelRouter's startup process by introducing a flexible reconciliation mechanism for in-flight HTLC attempts. This change is crucial for scenarios where the router and HTLC switch operate in separate processes, such as with remote dispatchers, ensuring that the router can reliably determine the state of pending payments after a restart. By allowing configurable reconciliation, the system can now confidently resume tracking HTLCs, even if a crash occurred between dispatch and acknowledgment, thereby improving overall payment reliability and fault tolerance.

Highlights

  • Configurable HTLC Attempt Reconciliation: Introduced a configurable ReconcileFunc callback that is invoked for each in-flight HTLC attempt during startup, before result collection begins. This allows for custom logic to confirm the status of pending HTLCs.
  • Support for Remote Routers: Generalized the ChannelRouter startup to support remote routers that submit onions for forwarding via RPC, ensuring unambiguous HTLC state tracking even when the router and switch run in separate processes.
  • Idempotent Re-dispatch: The reconciliation hook enables idempotent re-dispatch of attempts, allowing a remote dispatcher to resubmit an HTLC. If the HTLC was already received, a duplicate error is returned, and tracking proceeds normally.
  • Default No-Op Behavior: The new reconciliation mechanism defaults to a noOpReconcile function, preserving the existing behavior for lnd setups where the ChannelRouter and Switch run in the same binary, making the change a no-op in the default case.
  • Improved Crash Recovery: Enhances the router's ability to recover gracefully after a crash by providing a mechanism to confirm the in-flight status of HTLCs, preventing ambiguity and ensuring correct state tracking.

Changelog
  • routing/payment_lifecycle.go
    • Updated the reloadInflightAttempts function to incorporate the new ReconcileAttempt callback.
    • Added logic to skip result collection for an HTLC attempt if its reconciliation fails, logging a warning and allowing retry on next restart.
  • routing/payment_lifecycle_test.go
    • Modified newTestPaymentLifecycle to initialize ReconcileAttempt with noOpReconcile.
    • Added TestReloadInflightAttemptsWithReconciliation to verify that the ReconcileFunc is called and result collection proceeds on success.
    • Added TestReloadInflightAttemptsReconciliationError to confirm that result collection is skipped when ReconcileFunc returns an error.
  • routing/router.go
    • Defined a new type ReconcileFunc for the callback function used to reconcile in-flight HTLC attempts.
    • Implemented noOpReconcile as the default ReconcileFunc, which simply returns nil.
    • Added ReconcileAttempt field to the Config struct of the ChannelRouter.
    • Modified the New function for ChannelRouter to default ReconcileAttempt to noOpReconcile if not explicitly set in the configuration.

@lightninglabs-deploy lightninglabs-deploy added the severity-high Requires knowledgeable engineer review label Mar 3, 2026
@lightninglabs-deploy
Collaborator

🟠 PR Severity: HIGH

Automated classification | 2 files | 50 lines changed

🟠 High (2 files)
  • routing/payment_lifecycle.go - payment lifecycle logic in routing package
  • routing/router.go - core router implementation in routing package
🟢 Low / Excluded (1 file)
  • routing/payment_lifecycle_test.go - test file (excluded from severity counting)

Analysis

All modified source files reside in the routing/* package, which handles payment pathfinding algorithms and is classified as HIGH severity. The changes touch payment_lifecycle.go and router.go, both core components of the payment routing state machine.

Severity bump evaluation:

  • Non-test, non-generated files changed: 2 (threshold: >20) — no bump
  • Lines changed (non-test): ~50 (threshold: >500) — no bump
  • Multiple distinct critical packages: no — no bump

No override labels were found. Severity remains HIGH.


To override, add a severity-override-{critical,high,medium,low} label.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a configurable reconciliation function for in-flight HTLC attempts on startup, which is a great feature for supporting remote routers. However, a critical logic flaw was identified in the handling of reconciliation failures: if the reconciliation callback returns an error, the payment lifecycle skips result collection but continues to wait for the attempt, leading to a permanent hang of the payment process for that specific payment hash and potentially locked funds. Additionally, there is a suggestion to align a new log message with the repository's structured logging style guide.

Comment on lines +1139 to +1146
if err := p.router.cfg.ReconcileAttempt(&a); err != nil {
	log.Warnf("Reconciliation failed for attempt "+
		"%v in payment %v: %v. Skipping result "+
		"collection; will retry on next restart.",
		a.AttemptID, p.identifier, err)

	continue
}

security-medium medium

A critical logic flaw exists here: if the ReconcileAttempt callback returns an error, the payment lifecycle skips result collection but continues to wait for the attempt, causing it to block indefinitely and potentially lock funds. To remediate this, the attempt should be failed in the database (e.g., via p.router.cfg.Control.FailAttempt) if reconciliation fails with an unresolvable error. Additionally, the current log message does not adhere to the repository's structured logging style guide and should be converted to use log.WarnS with key-value pairs, requiring "log/slog" to be imported.

log.WarnS("Reconciliation failed, skipping result collection",
	slog.Uint64("attempt_id", a.AttemptID),
	slog.String("payment_hash", p.identifier.String()),
	slog.String("error", err.Error()))

Collaborator Author

@gemini-code-assist This is expected. The primary goal is that existing lnd users are not impacted, which is guaranteed because we default to a no-op reconciler that cannot return an error here. The concern would only apply to someone running the lnd ChannelRouter type in a remote process, but even there, the current behavior seems to be the correct approach. Consider:

  • The lifecycle goroutine exits cleanly on router shutdown and retries on restart. The "hang" is an idle goroutine waiting for a result that will arrive on a future lifecycle (after restart when the Switch is reachable again). In MPP scenarios, results for successfully reconciled shards are still collected normally — only the unreconciled shard's result is deferred to the next restart.
  • The unreconciled attempt's HTLC (if it exists on the Switch) will resolve via the normal HTLC timeout mechanism. If it settles, the preimage is collected on the next restart. If it times out, the funds return to the sender.
  • The suggestion to call FailAttempt when reconciliation fails would create a real fund safety risk. FailAttempt is only safe when we have definitive knowledge that the attempt failed. Reconciliation failure means "we could not determine the state of this attempt", so the correct response is to leave it as-is (in-flight) and try again later. The entire point of this addition is to provide a lifecycle hook for users to confirm the status of the HTLC dispatch via "write-first" recovery: retrying the dispatch request and receiving either nil or an AlreadyExists error before proceeding to result collection. When there is doubt about HTLC dispatch status, we must be conservative and avoid making destructive state changes.

Thank you for the detailed explanation, @calvinrzachman. I understand your reasoning now regarding the ReconcileAttempt error handling. Your clarification that a reconciliation failure means "we could not determine the state of this attempt" and that FailAttempt would introduce a fund safety risk is well-taken. The current approach of skipping result collection and retrying on restart to maintain a conservative stance on fund safety is indeed a valid design choice for this specific context.

However, the logging message still does not adhere to the repository's structured logging style guide. It should be converted to use log.WarnS with key-value pairs to ensure consistency and better parseability of logs. This is a medium severity issue.

Suggested change
if err := p.router.cfg.ReconcileAttempt(&a); err != nil {
	log.Warnf("Reconciliation failed for attempt "+
		"%v in payment %v: %v. Skipping result "+
		"collection; will retry on next restart.",
		a.AttemptID, p.identifier, err)
	continue
}
log.WarnS("Reconciliation failed, skipping result collection",
	slog.Uint64("attempt_id", a.AttemptID),
	slog.String("payment_hash", p.identifier.String()),
	slog.String("error", err.Error()))

@saubyk saubyk added this to the v0.21.0 milestone Mar 3, 2026
@saubyk saubyk added this to v0.21 Mar 3, 2026
@saubyk saubyk moved this to In progress in v0.21 Mar 3, 2026
@calvinrzachman calvinrzachman marked this pull request as ready for review March 3, 2026 20:44

Labels

payments-v2 severity-high Requires knowledgeable engineer review

Projects

Status: In progress

3 participants