feat(webhook selector): add longhorn-manager labels when webhook ready. #2920
Conversation
As I mentioned in longhorn/longhorn#8803 (comment), the PR only adds the labels to the webhook but doesn't handle the deletion. Could you help clear up my confusion?
I hope so. This PR (and longhorn/longhorn#8804 in longhorn) just adds the new selector labels, which should make no difference to normal operation. After it is merged, I can pull the changes into my RWX HA branch, because the logic for deletion and restoration depends on conditions and checks that only exist in that branch.
Force-pushed from 1484c2c to 5b60abd.
I see. Then let's merge the PR after the RWX HA branch is ready for review. Then we can see how the webhook selector works in the implementation. WDYT?
app/daemon.go (outdated)

	clients, err := client.NewClients(kubeconfigPath, true, ctx.Done())
	if err != nil {
		return err
	}

	if err := webhook.StartWebhook(ctx, types.WebhookTypeConversion, clientsWithoutDatastore); err != nil {
		return err
	}
I think this change might be problematic. We are reverting the fix from #2812. In other words, we are trying to start the datastore before the conversion webhook, which might create a deadlock, since the datastore has to wait for the conversion webhook to be running first.
WDYT?
Good catch!
I think we can preserve this approach pretty easily, though, by changing how it works. We already make a number of details about the pod available to its running process using the downward API. metadata.name is not one of the fields we currently expose, but we can add it.
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
https://kubernetes.io/docs/concepts/workloads/pods/downward-api/
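A sketch of how metadata.name could be exposed alongside the existing entries (the variable name POD_NAME is an assumption, chosen to match the existing naming pattern):

```yaml
# Hypothetical addition to the longhorn-manager pod spec's env list:
# expose the pod's own name via the downward API.
- name: POD_NAME
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: metadata.name
```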
It would look something like this:
pod, err := clientsWithoutDatastore.Clients.K8s.CoreV1().Pods(os.Getenv(types.EnvPodNamespace)).Get(context.Background(), os.Getenv("POD_NAME"), v1.GetOptions{})
if err != nil {
	return err
}
pod.Labels["key"] = "value"
_, err = clientsWithoutDatastore.Clients.K8s.CoreV1().Pods(os.Getenv(types.EnvPodNamespace)).Update(context.Background(), pod, v1.UpdateOptions{})
This strategy is not quite as nice as using the datastore function you are planning to use for other interactions with the webhook labels, but it is (I think) a solid way for a longhorn-manager pod to add the necessary label to itself without initializing the datastore.
I'll buy that. I did wonder a bit, but it behaved fine in my testing. I can see where it might set up a race with a possible deadlock, though. Thanks, @ejweber, for the suggestion.
For what it is worth, here are the logs from a run with the previous method. As you can see, the localhost test passes even before the call to the datastore to modify the pod.
2024-06-28T22:34:06.978508405Z I0628 22:34:06.978322 1 shared_informer.go:320] Caches are synced for longhorn datastore
2024-06-28T22:34:06.978630751Z time="2024-06-28T22:34:06Z" level=info msg="Starting longhorn conversion webhook server" func=webhook.StartWebhook file="webhook.go:24"
2024-06-28T22:34:06.978639002Z time="2024-06-28T22:34:06Z" level=info msg="Waiting for conversion webhook to become ready" func=webhook.StartWebhook file="webhook.go:43"
2024-06-28T22:34:06.981596794Z time="2024-06-28T22:34:06Z" level=warning msg="Failed to check endpoint https://localhost:9501/v1/healthz" func=webhook.isServiceAvailable file="webhook.go:78" error="Get \"https://localhost:9501/v1/healthz\": dial tcp [::1]:9501: connect: connection refused"
2024-06-28T22:34:06.988202655Z time="2024-06-28T22:34:06Z" level=info msg="Active TLS secret longhorn-system/longhorn-webhook-tls (ver=51640559) (count 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhor-59584d:longhorn-admission-webhook.longhorn-system.svc listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=A22EFDA2A84D7A4EFD53D3A7926D1ACA54A52804]" func="memory.(*memory).Update" file="memory.go:42"
2024-06-28T22:34:06.988244638Z time="2024-06-28T22:34:06Z" level=info msg="Listening on :9501" func=server.ListenAndServe.func2 file="server.go:77"
2024-06-28T22:34:07.315294352Z time="2024-06-28T22:34:07Z" level=info msg="Starting /v1, Kind=Secret controller" func="controller.(*controller).run" file="controller.go:148"
2024-06-28T22:34:07.315566430Z time="2024-06-28T22:34:07Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller" func="controller.(*controller).run" file="controller.go:148"
2024-06-28T22:34:07.315814195Z time="2024-06-28T22:34:07Z" level=info msg="Building conversion rules..." func="server.(*WebhookServer).runConversionWebhookListenAndServe.func1" file="server.go:193"
2024-06-28T22:34:07.316297058Z time="2024-06-28T22:34:07Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller" func="controller.(*controller).run" file="controller.go:148"
2024-06-28T22:34:07.321634322Z time="2024-06-28T22:34:07Z" level=info msg="Updating TLS secret for longhorn-system/longhorn-webhook-tls (count: 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhor-59584d:longhorn-admission-webhook.longhorn-system.svc listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=A22EFDA2A84D7A4EFD53D3A7926D1ACA54A52804]" func="kubernetes.(*storage).saveInK8s" file="controller.go:225"
2024-06-28T22:34:08.982742691Z time="2024-06-28T22:34:08Z" level=info msg="Started longhorn conversion webhook server on localhost" func=webhook.StartWebhook file="webhook.go:47"
2024-06-28T22:34:09.024365655Z time="2024-06-28T22:34:09Z" level=info msg="conversion webhook service is now accessible" func=webhook.CheckWebhookServiceAvailability file="webhook.go:63"
Updated as suggested to avoid datastore use for conversion webhook setup.
> For what it is worth, here are the logs from a run with the previous method. As you can see, the localhost test passes even before the call to the datastore to modify the pod.
IIUC, it is only an issue in very specific cases (which are hopefully no longer relevant). In the case @PhanLe1010 fixed:
- Datastore initialization failed due to a BackupTarget using an outdated version (v1alpha1) and no conversion webhook.
- Since datastore initialization failed, we couldn't get a conversion webhook up.

Realistically, any cluster you use that isn't super old or intentionally broken has no chicken/egg issue.
Agree with @ejweber that the cluster needs to be old to trigger the issue (i.e., it started using Longhorn at v1.2.6 or earlier).
Though, some nits:
- The outdated version can be v1beta1, not necessarily v1alpha1.
- It is not only the backup target CR; it can also happen with the engine image CR, etc.
- There is arguably a nontrivial chance of this happening, since it only requires the user to install Longhorn <= 1.2.x and then upgrade to Longhorn >= 1.4.x.
Force-pushed from 5b60abd to cf89658.
Looks reasonable to me
Signed-off-by: James Munson <james.munson@suse.com>
Force-pushed from cf89658 to 99fe4c6.
This pull request is now in conflict. Could you fix it, @james-munson? 🙏
The change in this PR has been pulled into the RWX HA PR #2811. I will close this one, since it is unlikely to be adopted on its own.
Which issue(s) this PR fixes:
Issue longhorn/longhorn#8803 and longhorn/longhorn#8612
What this PR does / why we need it:
Add the longhorn-manager pod labels that are used as webhook service selectors, applying them dynamically after the webhooks have started.
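For illustration, the idea is that the webhook Service's selector includes a label that longhorn-manager pods only carry once their webhook server is actually serving, so the Service routes traffic only to ready pods. The label key and value below are hypothetical, not the actual selectors added by this PR:

```yaml
# Illustrative sketch only: the label key/value are made up for this example.
apiVersion: v1
kind: Service
metadata:
  name: longhorn-conversion-webhook
  namespace: longhorn-system
spec:
  selector:
    app: longhorn-manager
    longhorn.io/conversion-webhook: "true"  # added by the pod to itself once its webhook is ready
  ports:
    - name: conversion-webhook
      port: 9501
      targetPort: 9501
```

Until a pod labels itself, it does not match the selector and receives no webhook traffic; removing the label (handled in the RWX HA branch) takes it back out of rotation.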
Special notes for your reviewer:
Note that this PR must be accompanied by the manifest changes in PR longhorn/longhorn#8804. They depend on each other.
Additional documentation or context