Share-manager HA - lease renewal #222
Is it ready for review?
pkg/server/share_manager.go
    if err := m.getLease(); err != nil {
        m.logger.WithError(err).Warn("Failed to get lease - no lease renewal will be done")
    } else {
        m.logger.Info("Found lease")
        if err = m.takeLease(); err != nil {
            m.logger.WithError(err).Warn("Failed to take lease - no lease renewal will be done.")
        } else {
            go m.runLeaseRenew()
        }
    }
Should we add a new env variable to the share-manager pod, such as enableFastFailureDetection, so that when it is enabled the share-manager pod is required to find the lease object and take ownership of it? In that case, if the share-manager pod cannot find the lease object or take ownership of it, we need to crash the pod to signal that the enableFastFailureDetection requirement cannot be fulfilled.
When enableFastFailureDetection is disabled, the share-manager pod ignores the lease mechanism entirely, which saves the etcd traffic, as in previous Longhorn versions.
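A minimal sketch of the gating proposed here, not the code that was merged: the lease steps are passed in as plain functions so the control flow can be shown without a Kubernetes client. The names `startLease`, `required`, and `errNoLease` are hypothetical; `getLease`, `takeLease`, and `startRenew` stand in for `m.getLease`, `m.takeLease`, and `go m.runLeaseRenew()` from the diff above.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoLease is a hypothetical error used to illustrate a missing lease.
var errNoLease = errors.New("lease not found")

// startLease sketches the proposed behavior. When required is false the
// lease mechanism is skipped entirely. When required is true, any failure
// to find or take the lease is returned so the caller can exit the process
// instead of logging a warning and carrying on.
func startLease(required bool, getLease, takeLease func() error, startRenew func()) error {
	if !required {
		// Feature disabled: no lease lookups, no renewal traffic.
		return nil
	}
	if err := getLease(); err != nil {
		return fmt.Errorf("failed to get lease: %w", err)
	}
	if err := takeLease(); err != nil {
		return fmt.Errorf("failed to take lease: %w", err)
	}
	// Only start renewing once the lease is found and owned.
	startRenew()
	return nil
}
```

The caller would treat a non-nil return as fatal when the variable is enabled, which gives the "crash the pod" signal described above.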
I suppose we can do this. I hadn't pictured that a failure to find the lease would be a serious enough deal to crash the share-manager app. Normally that would mean that a lease-capable share-manager image is running opposite an older longhorn-manager image that doesn't create the lease, which is easily fixed. It could also mean that lease creation failed, for whatever reason, which would be tougher.
Note that the longhorn-manager code makes the lease regardless of the "feature gate" setting enable-share-manager-fast-failover. That setting only controls whether it checks for staleness. So the combinations of setting and environment variable are:
- setting TRUE, env TRUE: lease is updated, staleness is checked.
- setting FALSE, env TRUE: lease is updated, but never checked.
- setting TRUE, env FALSE: lease is not updated, so when it is checked there will be no holder identity, and it is not considered stale.
- setting FALSE, env FALSE: lease is neither updated nor checked. This would be the default shipping config?
We would just have to be clear in the documentation that both items have to be enabled for fast failover to work. Is that what you are picturing?
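The four combinations can be sketched with a small in-memory model. This is an illustration under stated assumptions, not the longhorn-manager logic: the `lease` struct, its fields, and `detectsFailure` are hypothetical stand-ins for the Kubernetes Lease object and the two components; the only rule taken from the comment above is that a lease with no holder identity is treated as not stale.

```go
package main

import "time"

// lease is a hypothetical stand-in for the Kubernetes Lease object.
type lease struct {
	holder    string
	renewTime time.Time
	duration  time.Duration
}

// renew models the share-manager side (env variable enabled).
func (l *lease) renew(holder string, now time.Time) {
	l.holder = holder
	l.renewTime = now
}

// isStale models the longhorn-manager side (setting enabled).
// A lease that was never taken has no holder and is never stale.
func (l *lease) isStale(now time.Time) bool {
	if l.holder == "" {
		return false
	}
	return now.Sub(l.renewTime) > l.duration
}

// detectsFailure simulates a share-manager that renews once (if env is
// true) and then dies, and reports whether the manager would notice
// (only possible if setting is true).
func detectsFailure(setting, env bool) bool {
	start := time.Unix(0, 0)
	l := &lease{duration: time.Minute}
	if env {
		l.renew("share-manager-pod", start)
	}
	later := start.Add(2 * time.Minute) // well past the lease duration
	return setting && l.isStale(later)
}
```

In this model only the TRUE/TRUE combination reports the failure, which matches the point that both items have to be enabled for fast failover to work.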
Done. See below.
This pull request is now in conflict. Could you fix it @james-munson? 🙏
Sorry for leaving a bunch of single comments instead of one review with multiple comments. That was not my intention.
It seems we are executing a bunch of code repeatedly in a way we don't intend. (Though it is possible I am reading something wrong.)
I was reading it wrong: #222 (comment).
Some things need improvement:
- If the fast failover feature is disabled by the corresponding global setting, the share-manager pod has to be aware of the setting and respect it. That means
  - the lease should not be created
  - checking and renewing the lease is not required
- messages: redundant messages and flooding messages
- lease holder: the holder can be either the pod name or the node name. This needs to be clarified.
- A ton of runLeaseRenew goroutines are created
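One common way to avoid starting many renewal goroutines is to guard the start with `sync.Once`. The sketch below is only an illustration of that pattern, not necessarily how this PR addressed the issue; `renewLoop`, `Start`, and `Started` are hypothetical names, and `renew` stands in for whatever updates the lease.

```go
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

// renewLoop guarantees that at most one renewal goroutine is started,
// no matter how many times Start is called.
type renewLoop struct {
	once    sync.Once
	started int32 // number of goroutines launched, for illustration
}

// Start launches the renewal goroutine on the first call only.
// The goroutine calls renew every interval until stop is closed.
func (r *renewLoop) Start(stop <-chan struct{}, interval time.Duration, renew func() error) {
	r.once.Do(func() {
		atomic.AddInt32(&r.started, 1)
		go func() {
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for {
				select {
				case <-stop:
					return
				case <-ticker.C:
					// Errors are ignored here; real code would log them.
					_ = renew()
				}
			}
		}()
	})
}

// Started reports how many renewal goroutines were launched.
func (r *renewLoop) Started() int {
	return int(atomic.LoadInt32(&r.started))
}
```

With this guard, a caller that runs on every reconcile or retry can call `Start` repeatedly and still end up with a single renewal goroutine.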
LGTM barring a nit I posted and some others from @derekbit that I agree with.
Signed-off-by: James Munson <james.munson@suse.com>
LGTM
Thanks @james-munson and @PhanLe1010 for your effort. All issues are addressed, so we can merge the PR.
@mergify backport v1.7.x
✅ Backports have been created
Which issue(s) this PR fixes:
This is part of issue longhorn/longhorn#6205, Share Manager HA. This code looks for a lease associated with the share manager's volume and, if it exists, runs a goroutine to renew it regularly.
What this PR does / why we need it:
Special notes for your reviewer:
Additional documentation or context
I am marking it as draft because the related code in longhorn-manager is still in development, although it should have no effect if there is no Lease object.