Share-manager HA - lease renewal #222

james-munson · 2024-05-20T21:35:57Z

Which issue(s) this PR fixes:

This is part of issue longhorn/longhorn#6205, Share Manager HA. This code looks for a lease associated with the share manager's volume and, if one exists, runs a goroutine to renew it regularly.

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

I am marking it as draft because the related code in longhorn-manager is still in development, although it should have no effect if there is no Lease object.

derekbit · 2024-07-08T05:19:56Z

Is it ready for review?

PhanLe1010 · 2024-07-15T23:29:46Z

pkg/server/share_manager.go

+			if err := m.getLease(); err != nil {
+				m.logger.WithError(err).Warn("Failed to get lease - no lease renewal will be done")
+			} else {
+				m.logger.Info("Found lease")
+				if err = m.takeLease(); err != nil {
+					m.logger.WithError(err).Warn("Failed to take lease - no lease renewal will be done.")
+				} else {
+					go m.runLeaseRenew()
+				}
+			}
+


Should we add a new ENV to the share-manager pod, such as enableFastFailureDetection, so that when it is enabled the share-manager pod is required to find the lease object and take ownership of it? In that case, if the share-manager pod cannot find the lease object or take ownership, we need to crash the pod to signal that the enableFastFailureDetection requirement cannot be fulfilled.

When enableFastFailureDetection is disabled, the share-manager pod ignores the lease mechanism. That saves the etcd traffic, matching the behavior of previous Longhorn versions.

I suppose we can do this. I hadn't pictured that a failure to find the lease would be a serious enough deal to crash the share-manager app. Normally that would mean that a lease-capable share-manager image is running opposite an older longhorn-manager image that doesn't create the lease, which is easily fixed. It could also mean that lease creation failed, for whatever reason, which would be tougher.

Note that the longhorn-manager code creates the lease regardless of the "feature gate" setting enable-share-manager-fast-failover. That setting only controls whether it checks for staleness. So the combinations of the setting and the environment variable are:

setting TRUE, env TRUE: lease is updated, staleness is checked.

setting FALSE, env TRUE: lease is updated, but never checked.

setting TRUE, env FALSE: lease is not updated, so when it is checked, there will be no holder identity, so not stale.

setting FALSE, env FALSE: lease is neither updated nor checked. This would be the default shipping config?

We would just have to be clear in the documentation that both items have to be enabled for fast failover to work. Is that what you are picturing?
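The four combinations above can be written down as a small truth table. The following sketch only encodes what the comment states (the environment variable controls whether the lease is updated, the setting controls whether staleness is checked); the type and function names are made up for the illustration.

```go
package main

import "fmt"

// leaseBehavior records what happens for one setting/env combination.
type leaseBehavior struct {
	Updated bool // share-manager renews the lease
	Checked bool // longhorn-manager checks the lease for staleness
}

// behavior maps the global setting and the pod environment variable to
// the resulting lease behavior, as described in the comment above.
func behavior(setting, env bool) leaseBehavior {
	return leaseBehavior{Updated: env, Checked: setting}
}

// FastFailover reports whether fast failover can actually work, which
// requires the lease to be both updated and checked.
func (b leaseBehavior) FastFailover() bool {
	return b.Updated && b.Checked
}

func main() {
	for _, setting := range []bool{true, false} {
		for _, env := range []bool{true, false} {
			b := behavior(setting, env)
			fmt.Printf("setting=%-5t env=%-5t updated=%-5t checked=%-5t fastFailover=%t\n",
				setting, env, b.Updated, b.Checked, b.FastFailover())
		}
	}
}
```

Only the combination with both values true yields working fast failover, which is the point the documentation would need to make.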

Done. See below.

mergify · 2024-07-15T23:30:57Z

This pull request is now in conflict. Could you fix it @james-munson? 🙏

pkg/server/share_manager.go

ejweber

Sorry for leaving a bunch of single comments instead of one review with multiple comments. That was not my intention.

~~It seems we are executing a bunch of code repeatedly in a way we don't intend. (Though it is possible I am reading something wrong.)~~

I was reading it wrong: #222 (comment).

derekbit

Some things need improvement:

If the fast failover feature is disabled by the corresponding global setting, the share-manager pod has to be aware of the setting and respect it. That means:

lease should not be created
checking and renewing the lease are not required

messages: redundant messages and flooding messages
lease holder: the holder can be either pod name or node name. Need to clarify it.
A ton of runLeaseRenew goroutines are created
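On the last point, one common way to guarantee that a background loop is started at most once is to guard the start with `sync.Once`. The sketch below shows that pattern in isolation; it is not necessarily how this PR addressed the comment, and the `leaseRenewer` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// leaseRenewer is a hypothetical holder for the renewal goroutine state.
type leaseRenewer struct {
	once    sync.Once
	wg      sync.WaitGroup
	started int // number of goroutines actually launched
}

// ensureStarted launches loop in a goroutine the first time it is called
// and does nothing on every later call.
func (r *leaseRenewer) ensureStarted(loop func()) {
	r.once.Do(func() {
		r.started++
		r.wg.Add(1)
		go func() {
			defer r.wg.Done()
			loop()
		}()
	})
}

// startsAfter calls ensureStarted the given number of times and returns
// how many goroutines were launched. It exists only to exercise the sketch.
func startsAfter(calls int) int {
	r := &leaseRenewer{}
	for i := 0; i < calls; i++ {
		r.ensureStarted(func() {})
	}
	r.wg.Wait()
	return r.started
}

func main() {
	fmt.Println("goroutines started after 100 calls:", startsAfter(100))
}
```

With this guard, calling the start path from a repeated health check or retry loop cannot pile up renewal goroutines.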

pkg/server/share_manager.go

mergify · 2024-07-19T00:30:58Z

This pull request is now in conflict. Could you fix it @james-munson? 🙏

mergify · 2024-07-20T02:55:16Z

This pull request is now in conflict. Could you fix it @james-munson? 🙏

pkg/server/nfs/nfs_server.go

pkg/server/share_manager.go

ejweber

LGTM barring a nit I posted and some others from @derekbit that I agree with.

pkg/server/share_manager.go

Signed-off-by: James Munson <james.munson@suse.com>

derekbit

LGTM

derekbit · 2024-07-23T00:45:11Z

Thanks @james-munson and @PhanLe1010 for your effort. All issues are addressed, so we can merge the PR.

derekbit · 2024-07-23T04:23:21Z

@mergify backport v1.7.x

mergify · 2024-07-23T04:23:29Z

backport v1.7.x

✅ Backports have been created

#265 Share-manager HA - lease renewal (backport #222) has been created for branch v1.7.x

james-munson marked this pull request as draft May 20, 2024 21:36

james-munson requested review from PhanLe1010, ejweber and shuo-wu May 20, 2024 21:37

PhanLe1010 reviewed Jul 15, 2024

james-munson force-pushed the 6205-renew-lease branch from 3d2a940 to a7b1755 Compare July 16, 2024 19:32

james-munson marked this pull request as ready for review July 16, 2024 19:32