docs: clarify engine to replica ping behavior #1170

Merged Aug 1, 2024 (1 commit)

@ejweber ejweber commented Aug 1, 2024

Which issue(s) this PR fixes:

Contributes to longhorn/longhorn#8711.

What this PR does / why we need it:

While working through longhorn/longhorn#8711 (comment), I "discovered" that failure to receive a ping response does NOT cause the engine to mark a replica as ERR. I initially set out to address that "issue".

After quite a bit of internal discussion and several POCs that would either remove the engine-to-replica ping entirely or fold it into the rolling engine-to-replica timeout alongside I/O operations, we have decided to leave it the way it is. See the code comments in this PR for an accurate description of its current behavior.

The main reasons for this decision are:

  • The ping can sometimes detect a replica in a weird state (e.g. closed when we think it should be open). It does this in one existing engine integration test. Removing it would potentially reduce our ability to handle such an occurrence.
  • The ping is otherwise largely only useful when there is no I/O in flight. If there is I/O in flight, engine-replica-timeout governs whether we should mark a replica as ERR.
    • There isn't much benefit to quickly detecting a replica in a bad state if there is no active I/O.
    • If the issue is a temporary network disruption that can resolve itself during a period of no active I/O, it would be quite disruptive (and pointless) to rebuild the replica.
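
To make the reasoning above concrete, here is a minimal sketch of the two paths. It is not the longhorn-engine implementation: `replicaTracker`, its methods, and the 8-second timeout value are hypothetical names and numbers chosen for illustration, and the rolling timeout is simplified to "time since the oldest outstanding I/O began". The case where a ping response reports an unexpected replica state is not modeled.

```go
package main

import "time"

// replicaMode is a hypothetical stand-in for the engine's view of a replica.
type replicaMode string

const (
	modeRW  replicaMode = "RW"
	modeERR replicaMode = "ERR"
)

// replicaTracker is an illustrative model, not a type from longhorn-engine.
type replicaTracker struct {
	mode        replicaMode
	missedPings int           // counted for observability only
	ioStarted   time.Time     // zero value means no I/O is in flight
	ioTimeout   time.Duration // plays the role of engine-replica-timeout
}

// pingMissed records a missing ping response. It deliberately does not
// change the replica mode: a missed ping alone never marks the replica ERR.
func (t *replicaTracker) pingMissed() { t.missedPings++ }

// startIO notes when the oldest outstanding I/O began.
func (t *replicaTracker) startIO(now time.Time) {
	if t.ioStarted.IsZero() {
		t.ioStarted = now
	}
}

// finishIO clears the in-flight marker once the outstanding I/O completes.
func (t *replicaTracker) finishIO() { t.ioStarted = time.Time{} }

// check marks the replica ERR only if I/O has been in flight for longer
// than the timeout, then returns the current mode.
func (t *replicaTracker) check(now time.Time) replicaMode {
	if !t.ioStarted.IsZero() && now.Sub(t.ioStarted) > t.ioTimeout {
		t.mode = modeERR
	}
	return t.mode
}

// demo walks through the scenario from the description: missed pings while
// idle, then an I/O that stays in flight past the timeout.
func demo() (afterPings, duringIO, afterTimeout replicaMode) {
	start := time.Unix(1000, 0)
	t := &replicaTracker{mode: modeRW, ioTimeout: 8 * time.Second}

	// A long idle period with many missed pings: the replica stays RW.
	for i := 0; i < 10; i++ {
		t.pingMissed()
	}
	afterPings = t.check(start.Add(time.Hour))

	// I/O begins; within the timeout the replica is still RW.
	ioStart := start.Add(time.Hour)
	t.startIO(ioStart)
	duringIO = t.check(ioStart.Add(5 * time.Second))

	// The same I/O is still outstanding after the timeout: now ERR.
	afterTimeout = t.check(ioStart.Add(9 * time.Second))
	return afterPings, duringIO, afterTimeout
}
```

In this model, a temporary network disruption during an idle period leaves the replica untouched, which matches the argument that rebuilding it in that case would be disruptive and pointless.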

Thanks to @PhanLe1010 for reasoning this out and helping us reach the conclusion.

Longhorn 8711

Signed-off-by: Eric Weber <eric.weber@suse.com>
@PhanLe1010 PhanLe1010 left a comment


LGTM.

Thank you for the investigation

@PhanLe1010 PhanLe1010 merged commit 49c2c72 into longhorn:master Aug 1, 2024
11 checks passed