
feat(webhook selector): add longhorn-manager labels when webhook ready. #2920

Closed · wants to merge 1 commit

Conversation

@james-munson (Contributor) commented Jun 28, 2024

Which issue(s) this PR fixes:

Issues longhorn/longhorn#8803 and longhorn/longhorn#8612

What this PR does / why we need it:

Dynamically add the labels used as webhook service selectors to the longhorn-manager pods after the webhooks have started.
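
A minimal sketch of the mechanism, for illustration only (the label key, the helper name, and the direct use of the Kubernetes client below are assumptions, not the PR's exact code):

// Sketch: after a webhook server passes its readiness check, add a
// selector label to this manager's own pod so that the webhook Service
// starts routing requests to it.
package daemon

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// addWebhookSelectorLabel is a hypothetical helper that labels the given
// pod so a webhook Service selecting on this label will match it.
func addWebhookSelectorLabel(ctx context.Context, kubeClient kubernetes.Interface, namespace, podName string) error {
	pod, err := kubeClient.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	pod.Labels["longhorn.io/conversion-webhook"] = "enabled" // assumed label key/value
	_, err = kubeClient.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}

The matching selector on the webhook Services comes from the accompanying manifest changes in longhorn/longhorn#8804.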

Special notes for your reviewer:

Note that this PR must be accompanied by the manifest changes in PR longhorn/longhorn#8804. They depend on each other.

Additional documentation or context

@derekbit (Member) commented Jun 30, 2024

As I mentioned in longhorn/longhorn#8803 (comment), the PR only adds the labels for the webhook but doesn't handle their deletion. Could you help clear up my confusion?

@james-munson (Contributor Author) commented Jul 1, 2024

I hope so. This PR (and longhorn/longhorn#8804 in longhorn) just adds the new selector labels; it should make no difference to normal operation. After it is committed, I can pull the changes into my RWX HA branch, because the logic for deletion and restoration depends on conditions and checks that exist only in that branch.
I did it this way, rather than just introducing it all in that branch, because it also offers a fix for the situation described by @PhanLe1010 in longhorn/longhorn#8612.

@derekbit (Member) commented Jul 2, 2024

> I hope so. This PR (and longhorn/longhorn#8804 in longhorn) just adds the new selector labels. [...] it also offers a fix for the situation described by @PhanLe1010 in longhorn/longhorn#8612.

I see. Then let's merge this PR after the RWX HA branch is ready for review, so we can see how the webhook selector works in the implementation. WDYT?

app/daemon.go (Outdated)
Comment on lines 159 to 181
clients, err := client.NewClients(kubeconfigPath, true, ctx.Done())
if err != nil {
return err
}

if err := webhook.StartWebhook(ctx, types.WebhookTypeConversion, clientsWithoutDatastore); err != nil {
return err
}
@PhanLe1010 (Contributor) commented Jul 2, 2024

I think this change might be problematic. We are reverting the fix #2812. That is, we are trying to start the datastore before the conversion webhook, which might create a deadlock, since the datastore has to wait for the conversion webhook to be running first.

WDYT?

@ejweber (Collaborator) commented Jul 2, 2024

Good catch!

I think we can preserve this approach pretty easily, though, by changing how it works. We already make a number of details about the pod available to its running process using the downward API. metadata.name is not one of the fields we use, but we could add it (see the hypothetical entry appended to the example below).

env:
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
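    # Hypothetical addition (not in the current manifest): expose the pod's
    # own name via the same downward API mechanism, so the process can look
    # itself up without initializing the datastore.
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name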

https://kubernetes.io/docs/concepts/workloads/pods/downward-api/

It would look something like this:

pod, err := clientsWithoutDatastore.Clients.K8s.CoreV1().Pods(os.Getenv(types.EnvPodNamespace)).Get(context.Background(), os.Getenv("POD_NAME"), v1.GetOptions{})
if err != nil {
	return err
}
pod.Labels["key"] = "value" // placeholder label key/value
_, err = clientsWithoutDatastore.Clients.K8s.CoreV1().Pods(os.Getenv(types.EnvPodNamespace)).Update(context.Background(), pod, v1.UpdateOptions{})

This strategy is not quite as nice as using the datastore function you are planning to use for other interactions with the webhook labels, but it is (I think) a solid way for a longhorn-manager pod to add the necessary label TO ITSELF without initializing the datastore.

@james-munson (Contributor Author) commented:

I'll buy that. I did wonder a bit, but it was happy in my testing. I can see where it might set up a race with a possible deadlock, though. Thanks, @ejweber, for the suggestion.

@james-munson (Contributor Author) commented Jul 2, 2024

For what it is worth, here are the logs from a run with the previous method. As you can see, the localhost test passes even before the call to the datastore to modify the pod.

2024-06-28T22:34:06.978508405Z I0628 22:34:06.978322       1 shared_informer.go:320] Caches are synced for longhorn datastore
2024-06-28T22:34:06.978630751Z time="2024-06-28T22:34:06Z" level=info msg="Starting longhorn conversion webhook server" func=webhook.StartWebhook file="webhook.go:24"
2024-06-28T22:34:06.978639002Z time="2024-06-28T22:34:06Z" level=info msg="Waiting for conversion webhook to become ready" func=webhook.StartWebhook file="webhook.go:43"
2024-06-28T22:34:06.981596794Z time="2024-06-28T22:34:06Z" level=warning msg="Failed to check endpoint https://localhost:9501/v1/healthz" func=webhook.isServiceAvailable file="webhook.go:78" error="Get \"https://localhost:9501/v1/healthz\": dial tcp [::1]:9501: connect: connection refused"
2024-06-28T22:34:06.988202655Z time="2024-06-28T22:34:06Z" level=info msg="Active TLS secret longhorn-system/longhorn-webhook-tls (ver=51640559) (count 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhor-59584d:longhorn-admission-webhook.longhorn-system.svc listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=A22EFDA2A84D7A4EFD53D3A7926D1ACA54A52804]" func="memory.(*memory).Update" file="memory.go:42"
2024-06-28T22:34:06.988244638Z time="2024-06-28T22:34:06Z" level=info msg="Listening on :9501" func=server.ListenAndServe.func2 file="server.go:77"
2024-06-28T22:34:07.315294352Z time="2024-06-28T22:34:07Z" level=info msg="Starting /v1, Kind=Secret controller" func="controller.(*controller).run" file="controller.go:148"
2024-06-28T22:34:07.315566430Z time="2024-06-28T22:34:07Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller" func="controller.(*controller).run" file="controller.go:148"
2024-06-28T22:34:07.315814195Z time="2024-06-28T22:34:07Z" level=info msg="Building conversion rules..." func="server.(*WebhookServer).runConversionWebhookListenAndServe.func1" file="server.go:193"
2024-06-28T22:34:07.316297058Z time="2024-06-28T22:34:07Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller" func="controller.(*controller).run" file="controller.go:148"
2024-06-28T22:34:07.321634322Z time="2024-06-28T22:34:07Z" level=info msg="Updating TLS secret for longhorn-system/longhorn-webhook-tls (count: 2): map[listener.cattle.io/cn-longhorn-admission-webhook.longhor-59584d:longhorn-admission-webhook.longhorn-system.svc listener.cattle.io/cn-longhorn-conversion-webhook.longho-6a0089:longhorn-conversion-webhook.longhorn-system.svc listener.cattle.io/fingerprint:SHA1=A22EFDA2A84D7A4EFD53D3A7926D1ACA54A52804]" func="kubernetes.(*storage).saveInK8s" file="controller.go:225"
2024-06-28T22:34:08.982742691Z time="2024-06-28T22:34:08Z" level=info msg="Started longhorn conversion webhook server on localhost" func=webhook.StartWebhook file="webhook.go:47"
2024-06-28T22:34:09.024365655Z time="2024-06-28T22:34:09Z" level=info msg="conversion webhook service is now accessible" func=webhook.CheckWebhookServiceAvailability file="webhook.go:63"

@james-munson (Contributor Author) commented:

Updated as suggested to avoid datastore use for conversion webhook setup.

@ejweber (Collaborator) commented:

> For what it is worth, here are the logs from a run with the previous method. As you can see, the localhost test passes even before the call to the datastore to modify the pod.

IIUC, it is only an issue in very specific cases (which are hopefully no longer relevant). In the case @PhanLe1010 fixed:

  • Datastore initialization failed due to a BackupTarget using an outdated version (v1alpha1) and no conversion webhook.
  • Since datastore initialization failed, we couldn't get a conversion webhook up.

Realistically, any cluster you use that isn't super old or intentionally broken has no chicken/egg issue.

@PhanLe1010 (Contributor) commented Jul 2, 2024

Agree with @ejweber that the cluster needs to be old to trigger the issue (one that started using Longhorn at v1.2.6 or earlier).

Though some nits:

  • The outdated version can be v1beta1, not necessarily v1alpha1.
  • It can happen not only with the backup target CR but also with the engine image CR, among others.
  • There is arguably a nontrivial chance of this happening, since it only requires the user to install Longhorn <= 1.2.x and then upgrade to Longhorn >= 1.4.x.

@PhanLe1010 (Contributor) left a comment:

Looks reasonable to me

Signed-off-by: James Munson <james.munson@suse.com>

mergify bot commented Jul 11, 2024

This pull request is now in conflict. Could you fix it @james-munson? 🙏

@james-munson (Contributor Author) commented:
The change in this PR has been pulled into RWX HA PR #2811. I will close this one, since it is unlikely to be adopted on its own.
