
Share manager HA #2811

Merged: 21 commits into longhorn:master on Jul 23, 2024

Conversation

james-munson
Contributor

Which issue(s) this PR fixes:

This PR is part of longhorn/longhorn#6205 - Share manager HA. It creates the Lease, checks it, and takes action to delete the share-manager pod if it is stale and replace it with one on a different node.

What this PR does / why we need it:

Special notes for your reviewer:

Additional documentation or context

I am marking this PR as a draft because it is not complete. It needs to make the detachment and re-attachment to the new pod happen before the old pod's node goes notReady.
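
For context, a minimal sketch of the kind of staleness check described above, using the coordination.k8s.io/v1 Lease type from client-go. The helper name and exact policy here are illustrative assumptions, not the PR's actual code:

    package sketch

    import (
        "time"

        coordinationv1 "k8s.io/api/coordination/v1"
    )

    // isLeaseStale reports whether the holder has failed to renew the Lease
    // within its declared duration. A stale lease is the hint that the
    // share-manager pod's node is unhealthy and the pod should be replaced
    // on a different node.
    func isLeaseStale(lease *coordinationv1.Lease, now time.Time) bool {
        if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
            // Without a renew time or duration we cannot judge staleness.
            return false
        }
        expiry := lease.Spec.RenewTime.Add(time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second)
        return now.After(expiry)
    }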

@PhanLe1010
Contributor

Should we start reviewing, or should we wait for the PR to be marked as ready for review? @james-munson


mergify bot commented Jul 10, 2024

This pull request is now in conflict. Could you fix it @james-munson? 🙏

@PhanLe1010
Contributor

Update:

Currently, there are 2 remaining challenges:

  1. Get the volume detachment to happen faster. Right now, it takes 30-60s to detect the down node and detach the volume from it. This is difficult because we need to figure out the right code flow to detect a delinquent RWX volume and skip waiting for engine/replica deletion in that case (see the sketch after this list).
  2. We need to make sure that one broken RWX volume doesn't cause Longhorn to evict all other RWX volumes on the same node.
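
A purely illustrative sketch of the skip logic in challenge 1; the helper name and flags are assumptions, not Longhorn's actual API:

    // When tearing down an RWX volume whose share-manager lease is delinquent,
    // don't block on engine/replica process deletion on the unreachable node;
    // that wait is what adds the 30-60s delay before detachment.
    func shouldWaitForProcessDeletion(isRWXVolume, leaseDelinquent bool) bool {
        if isRWXVolume && leaseDelinquent {
            return false
        }
        return true
    }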

@PhanLe1010
Contributor

PhanLe1010 commented Jul 12, 2024

We are making progress on challenge 1. Will update soon

@PhanLe1010
Contributor

PhanLe1010 commented Jul 12, 2024

Update:

The POC for challenge 1 from the comment above is working now:

Get the volume detachment to happen faster. Right now, it takes 30-60s to detect the down node and detach the volume from it. This is difficult because we need to figure out the right code flow to detect a delinquent RWX volume and skip waiting for engine/replica deletion in that case.

Screencast.from.07-12-2024.12.26.30.PM.webm

The remaining challenge is:

  1. We need to make sure that one broken RWX volume doesn't cause Longhorn to evict all other RWX volumes on the same node, and fine-tune the implementation.

@PhanLe1010
Contributor

PhanLe1010 commented Jul 15, 2024

FYI: I discussed with @james-munson and drew the state machine of the lease CR. This might help speed up your review process.

lease-cr-statemachine

@james-munson
Contributor Author

james-munson commented Jul 16, 2024

Rebased on current master and resolved conflicts.

@derekbit
Member

I would suggest consolidating the commits.

Member

@derekbit derekbit left a comment

I will continue reviewing the share manager controller soon.

Files with review comments: types/setting.go, datastore/longhorn.go, datastore/kubernetes.go, controller/instance_handler.go, app/daemon.go, controller/node_controller.go, controller/replica_controller.go

mergify bot commented Jul 17, 2024

This pull request is now in conflict. Could you fix it @james-munson? 🙏

@ejweber
Collaborator

ejweber commented Jul 22, 2024

This one might be needed as the volume controller needs to wake up quickly and switch the ownership of the volume CR to the same node as the node of the new RWX pod

So the flow is (I think):

  • All share manager controllers detect the lease is stale and enqueue the share manager.
  • Interim share manager controller takes over.
  • Interim share manager controller marks the lease delinquent.
  • Interim share manager controller updates sharemanager.status.state = error.
        if sm.Status.State != longhorn.ShareManagerStateStopped {
            log.Info("Updating share manager to error state")
            sm.Status.State = longhorn.ShareManagerStateError
        }
  • Interim share manager controller cleans up the share manager pod.
        if sm.Status.State == longhorn.ShareManagerStateError {
            err = c.cleanupShareManagerPod(sm)
        }

So it looks like the pod and share manager updates will both trigger a volume reconcile simultaneously. (Everything Kubernetes actually does to the pod will be slower than this flow, since it normally doesn't try to do anything for a long time.)

@james-munson
Contributor Author

Either would work as an informer, so long as we just drive the ownership off of the pod's node. We don't want everybody changing to the interim owner of the SM only to change again once the pod is scheduled.

@PhanLe1010
Contributor

Regarding #2811 (comment)

The next step in that flow would be that a new SM pod is recreated by the share-manager controller. The volume controller needs to quickly detect the node of the new pod, so it might need to watch the pod?

Though at this moment the share-manager CR might have switched to the starting state, which could also trigger the volume controller if we were using the share manager CR informer instead.

Let me do a test.

@ejweber
Collaborator

ejweber commented Jul 22, 2024

Either would work as an informer, so long as we just drive the ownership off of the pod's node. We don't want everybody changing to the interim owner of the SM only to change again once the pod is scheduled.

This makes sense to me. My main concern is that this PR explicitly switches the volume controller to monitoring pods for requeues instead of share managers. But the share manager controller is already monitoring share manager pods and updating share manager state. (So far) I don't see the reason why the volume controller has to as well now, instead of just being triggered by share manager state as always.
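
A rough illustration of the share manager CR informer approach under discussion, using client-go's standard event-handler pattern. The function, the enqueue callback, and the assumption that the share manager shares its RWX volume's name are illustrative, not code from this PR:

    package sketch

    import (
        "k8s.io/client-go/tools/cache"

        longhorn "github.com/longhorn/longhorn-manager/k8s/pkg/apis/longhorn/v1beta2"
    )

    // registerShareManagerHandler wires a share manager informer to an enqueue
    // callback so that share manager CR updates (for example, the switch to the
    // starting state on the new node) wake the volume controller, instead of the
    // volume controller watching share manager pods directly.
    func registerShareManagerHandler(informer cache.SharedInformer, enqueueVolume func(namespace, name string)) {
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(old, cur interface{}) {
                sm, ok := cur.(*longhorn.ShareManager)
                if !ok {
                    return
                }
                // Assuming the share manager is named after its RWX volume, the
                // share manager name doubles as the volume name to re-enqueue.
                enqueueVolume(sm.Namespace, sm.Name)
            },
        })
    }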

longhorn-6205

Signed-off-by: Phan Le <phan.le@suse.com>
@PhanLe1010
Contributor

Testing shows that reverting the share manager pod informer back to the share manager CR informer does not slow down the flow, as @ejweber expected. ce11b41

I reverted them. Any additional concerns, @ejweber @james-munson?

@PhanLe1010
Contributor

At this moment, I think the last item to get LGTM from @ejweber is the setting name change longhorn/longhorn#8804 (comment)

@james-munson is helping with that

Signed-off-by: James Munson <james.munson@suse.com>
ejweber
ejweber previously approved these changes Jul 22, 2024
Collaborator

@ejweber ejweber left a comment


LGTM barring any necessary squashing/rebasing and the resolution to some of @derekbit's comments.

Thanks for the hard work on this!

@PhanLe1010
Contributor

Does the feature need to handle upgrade? Will a running RWX volume remain functional after upgrading the longhorn-system without any detachment?

ref: #2811 (comment)

@derekbit Yes, a running RWX volume will remain functional after upgrading the longhorn-system without any detachment

@PhanLe1010
Contributor

Lease check monitor: lack of recovery mechanism

Ref: #2811 (comment)

Can you give more details on how the disk monitor recovers? From a quick reading, I don't quite get it yet @derekbit

Member

@derekbit derekbit left a comment

In general, LGTM.
Some minor issues and one TODO suggestion.

@james-munson Can we check if we need the swap or keep it as the previous order?

	if err = c.syncShareManagerPod(sm); err != nil {
		return err
	}

	if err = c.syncShareManagerVolume(sm); err != nil {
		return err
	}

Files with review comments: app/daemon.go, types/setting.go, datastore/longhorn.go, controller/node_controller.go, controller/replica_controller.go, controller/share_manager_controller.go
@PhanLe1010
Contributor

PhanLe1010 commented Jul 23, 2024

Can we check if we need the swap or keep it as the previous order?

	if err = c.syncShareManagerPod(sm); err != nil {
		return err
	}

	if err = c.syncShareManagerVolume(sm); err != nil {
		return err
	}

I just tried swapping the order and testing it. The result is that volume detach/attach is somehow much slower, maybe because it has to make multiple sync loops. I think we can keep the current order. I will continue to investigate after the PR is merged. Added this to the TODO at longhorn/longhorn#6205 (comment)

cc @james-munson @derekbit

@PhanLe1010
Contributor

TODO: I'm thinking we can make lease_lifetime, lease_check_period, and so on global settings in the future. This change would allow some use cases to tolerate longer delinquent intervals.

Maybe a new GitHub ticket for this? @derekbit @james-munson

longhorn-6205

Signed-off-by: Phan Le <phan.le@suse.com>
@derekbit
Member

make lease_lifetime, lease_check_period, and so on global settings in the future. This change would allow some use cases to tolerate longer delinquent intervals.

Let's track it in longhorn/longhorn#9062

@james-munson
Contributor Author

My testing matches @PhanLe1010's. With the sync order swapped, we time out trying to do mount operations in the volume sync before ever dealing with the staleness of the pod, which I saw get to more than 20 seconds overdue. With that, any time gain is lost, so I think the order needs to stay as it is. FWIW, in the e2e tests that I ran, it did not appear to affect normal RWO behavior.

@PhanLe1010 PhanLe1010 merged commit 1b5cafd into longhorn:master Jul 23, 2024
6 checks passed
@derekbit
Member

@mergify backport v1.7.x


mergify bot commented Jul 23, 2024

backport v1.7.x

✅ Backports have been created
