
[Bug]: amd build failing due to DeadlineExceeded #203

Open · 1 task done
uptownhr opened this issue Jan 15, 2025 · 1 comment

Labels: bug (Something isn't working), triage (Needs to be triaged)

Comments

@uptownhr
Collaborator

Prior Search

  • I have already searched this project's issues to determine if a bug report has already been made.

What happened?

I am using wf_dockerfile_build to build an image; however, the amd64 step consistently fails with the following error:

error: listing workers for Build: failed to list workers: DeadlineExceeded: logical service 10.0.102.169:1234: route default.endpoint: backend default.unknown: service in fail-fast

Steps to Reproduce

Unknown

Relevant log output

Full logs from the workflow step:

{"argo":true,"level":"info","msg":"waiting for dependency \"scale-buildkit\"","time":"2025-01-15T14:58:42.586Z"}
{"argo":true,"level":"info","msg":"waiting for dependency \"clone\"","time":"2025-01-15T14:58:43.586Z"}
{"argo":true,"level":"info","msg":"capturing logs","time":"2025-01-15T14:58:44.586Z"}
error: listing workers for Build: failed to list workers: DeadlineExceeded: logical service 10.0.102.169:1234: route default.endpoint: backend default.unknown: service in fail-fast
{"argo":true,"error":null,"level":"info","msg":"sub-process exited","time":"2025-01-15T14:58:49.593Z"}
{"argo":true,"level":"info","msg":"not saving outputs - not main container","time":"2025-01-15T14:58:49.593Z"}
Error: exit status 1
@uptownhr added the bug (Something isn't working) and triage (Needs to be triaged) labels on Jan 15, 2025
@fullykubed
Member

To me, this indicates an issue with Cilium launching on new amd64 nodes.

My guess is that because we use the VPA to set resource requests/limits for the Cilium node agent, and all nodes in the cluster are usually arm64, the VPA sets inappropriate resources (too low, leading to OOM) for Cilium when the first amd64 node launches in the cluster. This should eventually resolve itself as the VPA adjusts its resource recommendations upwards.

I will have to dig into this more to propose a solution. I am not yet sure how to get the VPA to make different recommendations based on the CPU architecture.
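One possible stopgap (a sketch, not a confirmed fix for this issue) is to put a floor under the VPA's recommendation for the Cilium agent via `minAllowed`, so a freshly launched amd64 node never starts Cilium below a baseline known to work on both architectures. The namespace, DaemonSet name, container name, and resource values below are illustrative assumptions, not values taken from this cluster:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: cilium-agent
  namespace: kube-system   # assumed namespace; adjust to wherever Cilium runs
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: cilium           # assumed DaemonSet name
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: cilium-agent   # assumed container name
        # Floor values are illustrative guesses, not measured requirements.
        minAllowed:
          cpu: 100m
          memory: 256Mi
```

Note that this does not make the VPA's recommendations architecture-aware; it only prevents recommendations (learned on arm64) from dropping below a floor that is adequate on amd64 as well.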
