Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug the engine might choose a replica with a smaller head size to be the source of truth for auto-salvage #1114

Merged
merged 2 commits into from
May 31, 2024

Conversation

PhanLe1010
Copy link
Contributor

More details are in the issue description of longhorn/longhorn#8659

ejweber
ejweber previously approved these changes May 29, 2024
Copy link
Contributor

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good catch.

@PhanLe1010
Copy link
Contributor Author

PhanLe1010 commented May 29, 2024

Update: I am also Modify the salvageRevisionCounterDisabledReplicas logic in this PR:

The old logic is that:

  1. Filter replica candidates to keep only replicas which was modified
    within the last 5 seconds from the last modified replica
  2. Then filter to keep only replicas with head size equals to the biggest
    one
  3. Then pick a random replica from the set

The new logic:

  1. Filter replica candiates to keep only replicas which was modified
    within the last 5 seconds from the last modified replica
  2. Then filter to keep only replicas with head size equal to the biggest
    one
  3. Then pick the last modified replica from the set

WDYT @ejweber @shuo-wu @derekbit

ejweber
ejweber previously approved these changes May 29, 2024
Copy link
Contributor

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only downside I can think of here is that repeated attempts to salvage will always choose the same replica. If that replica is broken somehow, we can never make a different choice. I think this is not a compelling enough reason to change the logic.

pkg/controller/control.go Outdated Show resolved Hide resolved
james-munson
james-munson previously approved these changes May 29, 2024
Copy link
Contributor

@james-munson james-munson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Eric's notes. Looks good.

@PhanLe1010
Copy link
Contributor Author

Looks like the github action is taking a long time waiting for an available runner:
image

Do we need to change the runner from oracle-aarch64-4cpu-16gb to longhorn-infra-arm64-runners as we did in the longhorn-instance-manager https://github.com/longhorn/longhorn-instance-manager/pull/500/files ?

Btw, may I have a question about the location (providers) of oracle-aarch64-4cpu-16gb and longhorn-infra-arm64-runners? @derekbit @FrankYang0529 ?

Copy link
Contributor

@james-munson james-munson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change from map to slice looks good, too.

the source of truth for auto-salvage

longhorn-8659

Signed-off-by: Phan Le <phan.le@suse.com>
The old logic is that:
1. Filter replica candidates to keep only replicas which was modified
   within the last 5 seconds from the last modified replica
2. Then filter to keep only replicas with head size equals to the biggest
   one
3. Then pick a random replica from the set

The new logic:
1. Filter replica candiates to keep only replicas which was modified
   within the last 5 seconds from the last modified replica
2. Then filter to keep only replicas with head size equal to the biggest
   one
3. Then pick the last modified replica from the set

longhorn-8659
longhorn-8563

Signed-off-by: Phan Le <phan.le@suse.com>
@PhanLe1010
Copy link
Contributor Author

ping @shuo-wu Can I get an approve for merging this PR. Thank you

Copy link
Contributor

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Member

@derekbit derekbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekbit
Copy link
Member

The only downside I can think of here is that repeated attempts to salvage will always choose the same replica. If that replica is broken somehow, we can never make a different choice. I think this is not a compelling enough reason to change the logic.

Then, we need records the broken replica and prevent picking up it in the retries.

@PhanLe1010
Copy link
Contributor Author

If a replica is broken beyond repairable state it wouldn't be able to start running state in the first place? In addition, if the best replica here is still broken, I would prefer the user manual intervention here instead. Wdyt?

@innobead innobead merged commit 15c1e89 into longhorn:master May 31, 2024
10 checks passed
@innobead
Copy link
Member

@mergify backport v1.6.x v1.5.x

Copy link

mergify bot commented May 31, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants