Description
Expected Behavior
Tekton TaskRuns should not fail due to init container exit code 255 when the init containers (prepare, place-scripts) complete their work successfully.
Actual Behavior
On CRI-O-based clusters (including OpenShift), Tekton's init containers intermittently fail with exit code 255 and no error logs. The container's work completed successfully — the log message "Entrypoint initialization" is the success return from subcommands.OK{message: "Entrypoint initialization"} in cmd/entrypoint/subcommands/subcommands.go:66, confirming the binary was copied and step directories were created.
The TaskRun is marked as Failed with:
init container failed, "prepare" exited with code 255
CRI-O logs on the affected node show:
level=error msg="Failed to update container state for <container-id>: stdout: , stderr: "
Why Tekton is uniquely affected
Tekton's init containers are among the fastest-exiting containers in any Kubernetes cluster:
- prepare: runs `/ko-app/entrypoint init` → `cp(src, dst)` + `stepInit(steps)` (binary copy + symlink creation) → exits in <1ms
- place-scripts: runs a shell to write script files to a volume → exits in ~1-10ms
This hits a container runtime race condition where the exit code cannot be captured for very fast-exiting processes. The root cause is tracked upstream:
- CRI-O reports exit code 255 for fast-exiting init containers in production (conmon race condition) cri-o/cri-o#9840
- e2e_node tests upstream spuriously fail with containers exiting with exit code 255 cri-o/cri-o#8980 (closed without fix)
Note: This may also affect containerd — not yet verified.
Steps to Reproduce the Problem
- Deploy Tekton Pipelines on a CRI-O-based cluster (OpenShift 4.x)
- Run many TaskRuns concurrently
- Observe intermittent init container failures with exit code 255 and no error logs
- Verify via CRI-O logs on the affected node:
journalctl -u crio --no-pager | grep "Failed to update container state"
The issue is intermittent, not reproducible on demand, and not related to node resource pressure.
Possible Tekton-side mitigations
While the root cause is in the container runtime, Tekton can mitigate this:
Option A: Add a brief delay before init container exit
In cmd/entrypoint/subcommands/init.go, add time.Sleep(10 * time.Millisecond) after the filesystem operations complete. This widens the window for the container runtime to set up process monitoring. Cost: 10ms per TaskRun (negligible for CI/CD pipelines).
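A minimal sketch of Option A, with a hypothetical `runInit` standing in for the real `entrypointInit` (its name and shape here are assumed, not Tekton's actual API):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// runInit stands in for entrypointInit in
// cmd/entrypoint/subcommands/init.go. It performs the filesystem
// work, then sleeps briefly before returning so the runtime's
// monitor (conmon) can attach before the process exits.
func runInit() (string, error) {
	// ... cp(src, dst) and stepInit(steps) would run here ...

	// Option A mitigation: ~10ms of extra latency per TaskRun.
	time.Sleep(10 * time.Millisecond)
	return "Entrypoint initialization", nil
}

func main() {
	msg, err := runInit()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(msg)
}
```

The sleep does not guarantee the race is avoided; it only widens the window, which is why Option B is still worth pairing with it.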
Option B: Reconciler retry on init container exit 255
In pkg/reconciler/taskrun/, detect when a pod fails with init container exit code 255 and automatically recreate the pod. This handles the failure transparently without adding latency to any TaskRun.
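The detection logic for Option B could look like the sketch below. `ContainerStatus` is a simplified stand-in for `corev1.ContainerStatus` (only the fields this check needs), and the reconciler wiring is omitted:

```go
package main

import "fmt"

// ContainerStatus is a simplified stand-in for corev1.ContainerStatus.
type ContainerStatus struct {
	Name       string
	Terminated bool
	ExitCode   int32
}

// isInitExitCodeRace reports whether a failed pod matches the CRI-O
// fast-exit race: one of Tekton's init containers (prepare or
// place-scripts) terminated with exit code 255.
func isInitExitCodeRace(initStatuses []ContainerStatus) bool {
	tektonInit := map[string]bool{"prepare": true, "place-scripts": true}
	for _, s := range initStatuses {
		if tektonInit[s.Name] && s.Terminated && s.ExitCode == 255 {
			return true
		}
	}
	return false
}

func main() {
	failed := []ContainerStatus{{Name: "prepare", Terminated: true, ExitCode: 255}}
	if isInitExitCodeRace(failed) {
		// A reconciler in pkg/reconciler/taskrun/ would delete and
		// recreate the pod here instead of failing the TaskRun.
		fmt.Println("retryable runtime race: recreate pod")
	}
}
```

A retry cap would be needed in practice so a genuinely broken init container (which could also exit 255) does not loop forever.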
Option C: Both
Option A reduces the frequency. Option B handles the remaining edge cases.
Additional Info
- Kubernetes version: OpenShift 4.20.15
- Tekton Pipeline version: affects all versions (the init container code path has existed since the entrypoint refactor)
- Current user workaround: set `retries` on PipelineTask definitions to automatically retry failed tasks
- Code locations:
  - Init container creation: `pkg/pod/pod.go:619` (prepare), `pkg/pod/script.go:98` (place-scripts)
  - Init subcommand: `cmd/entrypoint/subcommands/init.go:24` (entrypointInit)
  - Success log: `cmd/entrypoint/subcommands/subcommands.go:66` (`OK{message: "Entrypoint initialization"}`)