KIC fails to start. All pods down: nginx [emerg] 1#0: bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use) #13730
Comments
FYI: the "deadlock" is cleared by deleting the pods manually with kubectl -n kong-dbless delete pods --selector=app.kubernetes.io/instance=kong-green; the kong KIC pods (controller and gateways) then restart normally.
It seems your issue has been resolved. Feel free to reopen if you have any further concerns.
thanks @StarlightIbuki for taking this issue. However, your answer doesn't help much. Could you please point me to the resolution? How has this issue been solved? And what's the fix? Thank you in advance.
Sorry, I thought you had found the solution. @randmonkey Could you also take a look into this?
hi @randmonkey, we are getting the issue above multiple times per day and it's getting very frustrating. Do you have any insights to share? On my side, I've also been searching for solutions, and a closer look at the behaviour shows the liveness probe failing, which only restarts the container. Restarting the container doesn't help: Kong is able to start only when the pod is deleted (manually), which leads me towards cleaning up the PID.
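That observation fits how the chart mounts the prefix: it is an emptyDir volume, and an emptyDir lives as long as the Pod. A liveness-probe failure restarts only the container, so the volume (including the stale socket files) survives; deleting the Pod discards it. A minimal sketch of the relevant part of the gateway Pod spec, with assumed names (the chart generates its own):

volumes:
  - name: kong-prefix-dir        # assumed name; the chart derives its own
    emptyDir: {}                 # survives container restarts, not Pod deletion
containers:
  - name: proxy
    image: kong:3.8.0
    volumeMounts:
      - name: kong-prefix-dir
        mountPath: /kong_prefix/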
the issue seems to be the same as Kong/kubernetes-ingress-controller#5324. I have the following hypothesis on what is happening (steps 1-8):

Steps 5-8 would be the likely cause of the issue. As for steps 1-4, KIC failing to push the config will not make the liveness probe fail and then restart the gateway pod.
👋 I think I have some insight on this. In 3.8, we relocated Kong's internal sockets into a subdirectory in the prefix tree (#13409). There is some code that runs as part of Kong's startup that cleans up old/stale unix socket files before NGINX starts. This logic is unfortunately duplicated in our docker entrypoint script because it circumvents that startup path. The docker entrypoint code was not updated to point to the new socket directory that Kong is using as of 3.8 (an oversight). I've opened a PR to remedy this, which I think should resolve the issue.

For those using the Helm chart, a workaround is to add an initContainer that removes the stale socket directory before the gateway starts.*

*In fact, enabling this kind of ops pattern in 3.8 was part of the underlying intent of #13409: establishing more segregation between persistent and transient data so that lifecycle management doesn't require non-trivial amounts of scripting (like what is found in the aforementioned docker entrypoint).
hello @flrgh, we added an initContainer along those lines:

ingress:
  controller:
    # controller config
  gateway:
    enabled: true
    deployment:
      kong:
        enabled: true
      initContainers:
        - command:
            - rm
            - '-vrf'
            - ${KONG_PREFIX}/sockets
          env:
            - name: KONG_PREFIX
              value: /kong_prefix/
          image: kong:3.8.0
          imagePullPolicy: IfNotPresent
          name: clear-stale-pid-custom
          volumeMounts:
            - mountPath: /kong_prefix/
              name: kong-green-gateway-prefix-dir

but it didn't help when our preemptible node got "restarted" just a few minutes ago: Kong was not able to restart properly and crashed again with the same errors.
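A possible reason the initContainer above has no effect (an observation from the config, not something confirmed in the thread): Kubernetes only expands $(VAR)-style references in command and args, so ${KONG_PREFIX}/sockets is passed to rm literally, and -f silently hides the resulting no-op. A minimal sketch of the same initContainer with a shell performing the expansion, reusing the image and volume names from the config above:

initContainers:
  - name: clear-stale-sockets    # hypothetical name
    image: kong:3.8.0
    command:
      - sh
      - -c
      - rm -vrf "$KONG_PREFIX/sockets"   # the shell expands the variable
    env:
      - name: KONG_PREFIX
        value: /kong_prefix/
    volumeMounts:
      - mountPath: /kong_prefix/
        name: kong-green-gateway-prefix-dir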
@joran-fonjallaz that's odd. My k8s knowledge is minimal, so bear with me a little.
hello @flrgh, so your feeling that the issue might be linked to 3.8 does seem correct.
Is there an existing issue for this?
Kong version (kong version)
3.8.0
Current Behavior
hello,
We run kong KIC on GKE clusters: every night the preemptible nodes are reclaimed in our staging envs, and most of the time this takes down all kong gateway pods (2 replicas) for hours.
Versions:
- GKE: 1.30.4-gke.1348000
- ingress Helm chart: 0.14.1
- KIC: 3.3.1
- Kong gateway: 3.8.0
Additional info
It seems that the liveness probe is responding OK while the readiness probe remains unhealthy, leaving the gateway pods running but unable to process traffic.
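For context, a sketch of the gateway probes as the Helm chart typically renders them (assumed defaults; the exact manifest may differ). /status answers as soon as NGINX and its status listener are up, while /status/ready returns 200 only after a configuration has been loaded, which matches a pod that stays live but never becomes ready:

livenessProbe:
  httpGet:
    path: /status        # 200 once NGINX and the status listener are up
    port: 8100           # status listener port (chart default assumed)
readinessProbe:
  httpGet:
    path: /status/ready  # 200 only after a config has been applied
    port: 8100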
Error logs
The controller fails to talk to the gateways. Kong finds itself in some sort of "deadlock" until the pods are deleted manually. Any insights?
Below is the values.yaml file configuring kong.

Expected Behavior
kong gateway pods should restart properly instead of failing with:
bind() to unix:/kong_prefix/sockets/we failed (98: Address already in use)
Steps To Reproduce
I could reproduce the error by killing the nodes (kubectl delete nodes) on which the kong pods were running. After killing the nodes, KIC fails to restart as it enters the deadlock situation described above. See screenshot.

Anything else?
dump of a failing gateway pod (kubectl describe) and logs: