
Share manager HA #2811

Merged: 21 commits into longhorn:master on Jul 23, 2024

Conversation

james-munson
Contributor

Which issue(s) this PR fixes:

This PR is part of longhorn/longhorn#6205 - Share manager HA. It creates the Lease, checks it, and takes action to delete the share-manager pod if it is stale and replace it with one on a different node.

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

I am marking this PR as a draft because it is not complete. It needs to make the detachment and re-attachment to the new pod happen before the old pod's node goes notReady.
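
For context, a minimal sketch of the kind of staleness check described above, using the coordination.k8s.io/v1 Lease type from client-go. The helper name and exact policy here are illustrative assumptions, not the PR's actual code:

    package sketch

    import (
        "time"

        coordinationv1 "k8s.io/api/coordination/v1"
    )

    // isLeaseStale reports whether the holder has failed to renew the Lease
    // within its declared duration. A stale lease is the hint that the
    // share-manager pod's node is unhealthy and the pod should be replaced
    // on a different node.
    func isLeaseStale(lease *coordinationv1.Lease, now time.Time) bool {
        if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
            // Without a renew time or duration we cannot judge staleness.
            return false
        }
        expiry := lease.Spec.RenewTime.Add(time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second)
        return now.After(expiry)
    }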

@PhanLe1010
Contributor

Should we start reviewing, or should we wait for the PR to be marked as ready for review? @james-munson


mergify bot commented Jul 10, 2024

This pull request is now in conflict. Could you fix it @james-munson? 🙏

@PhanLe1010
Contributor

Update:

Currently, there are 2 remaining challenges:

  1. Get the volume detachment to happen faster. Right now, it takes 30-60s to detect the down node and detach the volume from it. This is difficult because we need to figure out the right code flow to detect a delinquent RWX volume and skip waiting for engine/replica deletion in that case (see the sketch after this list).
  2. We need to make sure that one broken RWX volume doesn't cause Longhorn to evict all other RWX volumes on the same node.
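
A purely illustrative sketch of the skip logic in challenge 1; the helper name and flags are assumptions, not Longhorn's actual API:

    // When tearing down an RWX volume whose share-manager lease is delinquent,
    // don't block on engine/replica process deletion on the unreachable node;
    // that wait is what adds the 30-60s delay before detachment.
    func shouldWaitForProcessDeletion(isRWXVolume, leaseDelinquent bool) bool {
        if isRWXVolume && leaseDelinquent {
            return false
        }
        return true
    }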

@PhanLe1010
Contributor

PhanLe1010 commented Jul 12, 2024

We are making progress on challenge 1. Will update soon

@PhanLe1010
Contributor

PhanLe1010 commented Jul 12, 2024

Update:

The POC for challenge 1 from the comment above is working now:

Get the volume detachment to happen faster. Right now, it takes 30-60s to detect the down node and detach the volume from it. This is difficult because we need to figure out the right code flow to detect a delinquent RWX volume and skip waiting for engine/replica deletion in that case.

Screencast.from.07-12-2024.12.26.30.PM.webm

The remaining challenge is:

  1. We need to make sure that one broken RWX volume doesn't cause Longhorn to evict all other RWX volumes on the same node, and fine-tune the implementation.

@PhanLe1010
Contributor

PhanLe1010 commented Jul 15, 2024

FYI: I discussed with @james-munson and drew the state machine of the lease CR. This might help speed up your review process.

lease-cr-statemachine

@james-munson
Contributor Author

james-munson commented Jul 16, 2024

Rebased on current master and resolved conflicts.

@derekbit
Member

I would suggest consolidating the commits.

Member

@derekbit derekbit left a comment

I will continue reviewing the share manager controller soon.

Files with review comments: types/setting.go, datastore/longhorn.go, datastore/kubernetes.go, controller/instance_handler.go, app/daemon.go, controller/node_controller.go, controller/replica_controller.go

mergify bot commented Jul 17, 2024

This pull request is now in conflict. Could you fix it @james-munson? 🙏

@ejweber
Collaborator

ejweber commented Jul 22, 2024

This one might be needed as the volume controller needs to wake up quickly and switch the ownership of the volume CR to the same node as the node of the new RWX pod

So the flow is (I think):

  • All share manager controllers detect the lease is stale and enqueue the share manager.
  • Interim share manager controller takes over.
  • Interim share manager controller marks the lease delinquent.
  • Interim share manager controller updates sharemanager.status.state = error.
        if sm.Status.State != longhorn.ShareManagerStateStopped {
            log.Info("Updating share manager to error state")
            sm.Status.State = longhorn.ShareManagerStateError
        }
  • Interim share manager controller cleans up the share manager pod.
        if sm.Status.State == longhorn.ShareManagerStateError {
            err = c.cleanupShareManagerPod(sm)
        }

So it looks like the pod and share manager updates will both trigger a volume reconcile simultaneously. (Everything Kubernetes actually does to the pod will be slower than this flow, since it normally doesn't try to do anything for a long time.)

@james-munson
Contributor Author

Either would work as an informer, so long as we just drive the ownership off of the pod's node. We don't want everybody changing to the interim owner of the SM only to change again once the pod is scheduled.

@PhanLe1010
Contributor

Regarding #2811 (comment)

The next step in that flow would be that a new SM pod is recreated by the share-manager controller. The volume controller needs to quickly detect the node of the new pod, so it might need to watch the pod?

Though at this moment the share-manager CR might have switched to the starting state, which could also trigger the volume controller if we were using the share manager CR informer instead.

Let me do a test.

@ejweber
Collaborator

ejweber commented Jul 22, 2024

Either would work as an informer, so long as we just drive the ownership off of the pod's node. We don't want everybody changing to the interim owner of the SM only to change again once the pod is scheduled.

This makes sense to me. My main concern is that this PR explicitly switches the volume controller to monitoring pods for requeues instead of share managers. But the share manager controller is already monitoring share manager pods and updating share manager state. (So far) I don't see the reason why the volume controller has to as well now, instead of just being triggered by share manager state as always.
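
A rough illustration of the share manager CR informer approach under discussion, using client-go's standard event-handler pattern. The function, the enqueue callback, and the assumption that the share manager shares its RWX volume's name are illustrative, not code from this PR:

    package sketch

    import (
        "k8s.io/client-go/tools/cache"

        longhorn "github.com/longhorn/longhorn-manager/k8s/pkg/apis/longhorn/v1beta2"
    )

    // registerShareManagerHandler wires a share manager informer to an enqueue
    // callback so that share manager CR updates (for example, the switch to the
    // starting state on the new node) wake the volume controller, instead of the
    // volume controller watching share manager pods directly.
    func registerShareManagerHandler(informer cache.SharedInformer, enqueueVolume func(namespace, name string)) {
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(old, cur interface{}) {
                sm, ok := cur.(*longhorn.ShareManager)
                if !ok {
                    return
                }
                // Assuming the share manager is named after its RWX volume, the
                // share manager name doubles as the volume name to re-enqueue.
                enqueueVolume(sm.Namespace, sm.Name)
            },
        })
    }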

longhorn-6205

Signed-off-by: Phan Le <phan.le@suse.com>
@PhanLe1010
Contributor

Testing shows that reverting the share manager pod informer back to the share manager CR informer does not slow down the flow, as @ejweber expected. ce11b41

I reverted them. Any additional concerns, @ejweber @james-munson?

@PhanLe1010
Contributor

At this moment, I think the last item to get LGTM from @ejweber is the setting name change longhorn/longhorn#8804 (comment)

@james-munson is helping with that

Signed-off-by: James Munson <james.munson@suse.com>
ejweber
ejweber previously approved these changes Jul 22, 2024
Collaborator

@ejweber ejweber left a comment


LGTM barring any necessary squashing/rebasing and the resolution to some of @derekbit's comments.

Thanks for the hard work on this!

@PhanLe1010
Contributor

Does the feature need to handle upgrade? Will a running RWX volume remain functional after upgrading the longhorn-system without any detachment?

ref: #2811 (comment)

@derekbit Yes, a running RWX volume will remain functional after upgrading the longhorn-system without any detachment

@PhanLe1010
Contributor

Lease check monitor: lack of recovery mechanism

Ref: #2811 (comment)

Can you give more details on how the disk monitor recovers? From a quick reading, I don't quite get it yet @derekbit

Member

@derekbit derekbit left a comment

In general, LGTM.
Some minor issues and one TODO suggestion.

@james-munson Can we check if we need the swap or keep it as the previous order?

	if err = c.syncShareManagerPod(sm); err != nil {
		return err
	}

	if err = c.syncShareManagerVolume(sm); err != nil {
		return err
	}

Files with review comments: app/daemon.go, types/setting.go, datastore/longhorn.go, controller/node_controller.go, controller/replica_controller.go, controller/share_manager_controller.go
@PhanLe1010
Contributor

PhanLe1010 commented Jul 23, 2024

Can we check if we need the swap or keep it as the previous order?

	if err = c.syncShareManagerPod(sm); err != nil {
		return err
	}

	if err = c.syncShareManagerVolume(sm); err != nil {
		return err
	}

I just tried swapping the order and testing it. The result is that volume detach/attach is somehow much slower, maybe because it has to make multiple sync loops. I think we can keep the current order. I will continue to investigate after the PR is merged. Added this to the TODO at longhorn/longhorn#6205 (comment)

cc @james-munson @derekbit

@PhanLe1010
Contributor

TODO: I'm thinking we can make lease_lifetime, lease_check_period, and so on global settings in the future. This change would allow some use cases to tolerate longer delinquent intervals.

Maybe a new GitHub ticket for this? @derekbit @james-munson

longhorn-6205

Signed-off-by: Phan Le <phan.le@suse.com>
@derekbit
Member

make lease_lifetime, lease_check_period, and so on global settings in the future. This change would allow some use cases to tolerate longer delinquent intervals.

Let's track it in longhorn/longhorn#9062

@james-munson
Contributor Author

My testing matches @PhanLe1010's. With the sync order swapped, we time out trying to do mount operations in the volume sync before ever dealing with the staleness of the pod, which I saw get to more than 20 seconds overdue. With that, any time gain is lost, so I think the order needs to stay as it is. FWIW, in the e2e tests that I ran, it did not appear to affect normal RWO behavior.

@PhanLe1010 PhanLe1010 merged commit 1b5cafd into longhorn:master Jul 23, 2024
6 checks passed
@derekbit
Member

@mergify backport v1.7.x


mergify bot commented Jul 23, 2024

backport v1.7.x

✅ Backports have been created
