kafka-broker-receiver crashes on startup starting with 1.14.0 in EKS #3901
cc @creydr
Hello @wSedlacek,
Sadly there isn't much else to go off of in the logs. With the default logging it simply tries to start and then terminates. My only real lead is that it works with 1.13.9 but stops working with 1.14.0, so it has to be something that changed between those two versions.
Do you have any termination logs, or logs explaining why the container was terminated (or something in the pod's .status.containerStatuses)? OIDC support was added in 1.14.
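For example, something like this should surface the last termination state (a sketch; the pod name is a placeholder):

```sh
# Show exit code, reason, and signal for the container's last termination.
kubectl -n knative-eventing get pod <kafka-broker-receiver-pod> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
```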
Ah, yes. This might be useful.
FWIW I was forced to revert 1.14 to 1.13.x; I couldn't spot anything interesting in the logs to take a guess at why it wasn't starting up. I believe my containerStatuses was similar to the above.
I'm trying to remove that dependency, as we don't need it, but I will need some way to test it out on EKS: #3997. We can also release it, as it's a cleanup that shouldn't affect functionality (tests are passing).
I don't believe this is fixed; even without jboss:
Liveness probe failed: Get "http://10.207.25.223:8080/healthz": dial tcp 10.207.25.223:8080: connect: connection refused
@treyhyde which patch release of 1.14 are you running? The most recent one I can see is 1.14.7, which does not have the jboss fix in it.
This was 1.15.0; I'm assuming the patch is in that build, as I no longer see the jboss messages. 1.13.x is the last version that actually started successfully.
Okay, reopening this for now then. Thanks for trying 1.15 out, @treyhyde! We'll keep trying to debug and fix this.
@treyhyde since I don't have access to an EKS cluster, if I were to share a receiver image with extra logging, would you be open to testing that out on EKS and sharing the logs with us?
@Cali0707 absolutely
@treyhyde the image with extra startup logs is ready. It was built off of https://github.com/Cali0707/eventing-kafka-broker/tree/extra-receiver-startup-logs. Thanks for helping to debug this!
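If it helps, a quick way to swap the debug image in (a sketch; it assumes the container in the Deployment is named kafka-broker-receiver, and <debug-image> is a placeholder for the image reference):

```sh
# Point the receiver Deployment at the debug build.
kubectl -n knative-eventing set image deployment/kafka-broker-receiver \
  kafka-broker-receiver=<debug-image>
```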
And then it stays there until the liveness probes terminate it.
Well, I take that back: I extended the liveness probes, and your version seems to have a bit of a different behavior, unless I messed up the liveness probe hacks before... If I wait long enough, it fails (as expected, sort of), and then goes on to actually start up and go ready. I didn't see this happen on 1.15.0, but again, I could have messed up the probes and not properly extended the timing enough.
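For reference, extending the probes was roughly this (a sketch, not the exact patch; it assumes the container is named kafka-broker-receiver):

```sh
# Give the receiver more headroom before the liveness probe kills it.
kubectl -n knative-eventing patch deployment kafka-broker-receiver \
  --type strategic -p '{"spec":{"template":{"spec":{"containers":[{
    "name":"kafka-broker-receiver",
    "livenessProbe":{"initialDelaySeconds":120,"failureThreshold":10}}]}}}}'
```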
Thanks @treyhyde! I added one more log that we were missing (specifically, the exception from failing to load the OIDC configuration). Would you mind grabbing the logs one more time? The updated image should be on the same tag as before. In the meantime, I'll open a PR so that we don't even try to load the OIDC config if it is not enabled. But this is not a root-cause fix for EKS, so I would still appreciate it if you could share the new logs so we can figure out what's causing it to fail there!
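If you want to double-check the flag on your side, the OIDC toggle should live in the config-features ConfigMap (a sketch; it assumes the default knative-eventing namespace and the authentication-oidc key):

```sh
# Prints the OIDC feature flag; "disabled" (the default) means the receiver
# shouldn't need the OIDC discovery endpoint at all.
kubectl -n knative-eventing get configmap config-features \
  -o jsonpath='{.data.authentication-oidc}'
```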
I'm honestly not sure where that endpoint is. It's not the associated OIDC endpoint, for sure; it's not any kube svc; and it's not in the pod CIDR or the VPC CIDR.
BTW, thanks @Cali0707 for the attention here. I am interested in the OIDC rollout, but I agree, the best first fix is to only load that config conditionally. IMO, that unblocks us to upgrade off of 1.13.x. We'd be very excited to get a 1.14.x and/or 1.15.x point release that includes your new PR.
@treyhyde can you try to curl that endpoint? My guess is that it is set in the
Just a note here: we normally only support upgrades like 1.13.x -> 1.14.y -> 1.15.z, and that is especially the case for these releases, as there are various data plane migrations that need to occur correctly :)
OK, there you go: it's the JWKS URL.
aws/containers-roadmap#2234 seems relevant.
@treyhyde would you mind checking that in your cluster by curling
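Something like this should work for an in-cluster check (a sketch; the issuer URL is a placeholder for your cluster's OIDC URL):

```sh
# Run a throwaway pod and curl the EKS OIDC discovery document from inside the cluster.
kubectl run oidc-check --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -v --max-time 10 \
  "https://oidc.eks.<region>.amazonaws.com/id/<CLUSTER_ID>/.well-known/openid-configuration"
```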
curl -v https://oidc.eks.us-west-2.amazonaws.com/id/**REDACTED**/.well-known/openid-configuration gives me
curl -v https://oidc.eks.us-west-2.amazonaws.com/id/***REDACTEDCLUSTERID***/keys does indeed appear to be a JWKS endpoint
BTW, if I (generously) extend the liveness probes, I can confirm that 1.14.8 also eventually goes "ready". It just needs to get past that 60-second timeout on the JWKS fetch.
Thanks for the help debugging, @treyhyde! I've opened knative/eventing#8121 to track fixing the root cause of this, and hopefully we can merge #4021 soon, which should give us 1.14.9 and 1.15.1 next Tuesday.
@Cali0707 glad I could help, thanks for the quick action
Following up here: I think all that's left is making the OIDC discovery URL configurable in this repo as well. WDYT @creydr?
Yes. So, doing knative/eventing#8121 for eventing-kafka-broker.
Describe the bug
The container in kafka-broker-receiver reports exit code 143 on startup (143 = 128 + SIGTERM, i.e. something is killing the container). The only logs are
Enabling verbose logging does show a few more DNS requests, but the important part of those verbose logs seems to be
I suspect the OIDC features are causing the issue as I don't see this behavior locally in k3d.
Expected behavior
When OIDC is disabled (the default), OIDC should not be used (which I think is what's causing the crash), but more importantly, it should not crash on startup.
To Reproduce
In EKS, install with
Knative release version
Knative Eventing: 1.14.0
Additional context
Downgrading to 1.13.9 does not produce the issue.
Upgrading to 1.14.1 or 1.14.2 still has the issue.