Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Share-manager HA - lease renewal #222

Merged
merged 2 commits into from
Jul 23, 2024

Conversation

james-munson
Copy link
Contributor

@james-munson james-munson commented May 20, 2024

Which issue(s) this PR fixes:

This is part of issue longhorn/longhorn#6205, Share Manager HA. This code will look for a lease associated to the share manager's volume and run a goroutine to renew it regularly if it exists.

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

I am marking it as draft because the related code in longhorn-manager is still in development, although it should have no effect if there is no Lease object.

@james-munson james-munson marked this pull request as draft May 20, 2024 21:36
@derekbit
Copy link
Member

derekbit commented Jul 8, 2024

Is it ready for review?

Comment on lines 127 to 163
if err := m.getLease(); err != nil {
m.logger.WithError(err).Warn("Failed to get lease - no lease renewal will be done")
} else {
m.logger.Info("Found lease")
if err = m.takeLease(); err != nil {
m.logger.WithError(err).Warn("Failed to take lease - no lease renewal will be done.")
} else {
go m.runLeaseRenew()
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a new EVN to the share-manager pod like enableFastFailureDetection so that when it is enabled, share-manager pod is required to be able to find the lease object and take ownership of the lease object? In this case, if share-manager pod is not able to find the lease object/take ownership, we need to crash the pod to signal that the enableFastFailureDetection requirement is not able to fulfilled

When enableFastFailureDetection is disabled, share-manager pod ignore the lease mechanism. Just save the ETCD traffic similar to previous Longhorn version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we can do this. I hadn't pictured that a failure to find the lease would be a serious enough deal to crash the share-manager app. Normally that would mean that a lease-capable share-manager image is running opposite an older longhorn-manager image that doesn't create the lease, which is easily fixed. It could also mean that lease creation failed, for whatever reason, which would be tougher.

Note that the longhorn-manager code makes the lease regardless of "feature gate" setting enable-share-manager-fast-failover. That only controls whether it checks for staleness. So the combination of setting with environment variable is

  • setting TRUE, evn TRUE: lease is updated, staleness is checked.
  • setting FALSE, evn TRUE: lease is updated, but never checked.
  • setting TRUE, evn FALSE, lease is not updated, so when it is checked, there will be no holder identity, so not stale.
  • setting FALSE, evn FALSE, lease is neither updated nor checked. This would be the default shipping config?

We would just have to be clear in the documentation that both items have to be enabled for fast failover to work. Is that what you are picturing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. See below.

Copy link

mergify bot commented Jul 15, 2024

This pull request is now in conflict. Could you fix it @james-munson? 🙏

@james-munson james-munson marked this pull request as ready for review July 16, 2024 19:32
Copy link
Contributor

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for leaving a bunch of single comments instead of one review with multiple comments. That was not my intention.

It seems we are executing a bunch of code repeatedly in a way we don't intend. (Though it is possible I am reading something wrong.)

I was reading it wrong: #222 (comment).

Copy link
Member

@derekbit derekbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something need improvemnt

  1. If the fast failover feature is disabled by the corresponding global setting, share-manager pod has to be aware of the setting and respect to the it. That means
  • lease should not be created
  • the check and renewal of lease is not required
  1. messages: redundant messages and flooding messages
  2. lease holder: the holder can be either pod name or node name. Need to clarify it.
  3. A ton of runLeaseRenew goroutines are created

pkg/server/share_manager.go Outdated Show resolved Hide resolved
pkg/server/share_manager.go Show resolved Hide resolved
pkg/server/share_manager.go Outdated Show resolved Hide resolved
pkg/server/share_manager.go Outdated Show resolved Hide resolved
pkg/server/share_manager.go Outdated Show resolved Hide resolved
@derekbit derekbit requested a review from ChanYiLin July 18, 2024 01:47
Copy link

mergify bot commented Jul 19, 2024

This pull request is now in conflict. Could you fix it @james-munson? 🙏

@ejweber ejweber self-requested a review July 19, 2024 15:35
@james-munson james-munson force-pushed the 6205-renew-lease branch 2 times, most recently from fe264b6 to 02ce7a6 Compare July 19, 2024 23:52
Copy link

mergify bot commented Jul 20, 2024

This pull request is now in conflict. Could you fix it @james-munson? 🙏

pkg/server/nfs/nfs_server.go Outdated Show resolved Hide resolved
pkg/server/nfs/nfs_server.go Outdated Show resolved Hide resolved
pkg/server/share_manager.go Outdated Show resolved Hide resolved
pkg/server/share_manager.go Outdated Show resolved Hide resolved
pkg/server/share_manager.go Outdated Show resolved Hide resolved
pkg/server/share_manager.go Show resolved Hide resolved
pkg/server/share_manager.go Outdated Show resolved Hide resolved
Copy link
Contributor

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM barring a nit I posted and some others from @derekbit that I agree with.

pkg/server/share_manager.go Show resolved Hide resolved
pkg/server/share_manager.go Show resolved Hide resolved
pkg/server/share_manager.go Show resolved Hide resolved
Signed-off-by: James Munson <james.munson@suse.com>
Signed-off-by: James Munson <james.munson@suse.com>
Copy link
Member

@derekbit derekbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekbit
Copy link
Member

Thanks @james-munson and @PhanLe1010 for your effort. All issues are addressed, so we can merge the PR.

@derekbit derekbit merged commit 7ee089b into longhorn:master Jul 23, 2024
7 checks passed
@derekbit
Copy link
Member

@mergify backport v1.7.x

Copy link

mergify bot commented Jul 23, 2024

backport v1.7.x

✅ Backports have been created

@james-munson james-munson deleted the 6205-renew-lease branch July 26, 2024 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants