Init containers (prepare, place-scripts) vulnerable to CRI-O exit code 255 race condition #9654

@waveywaves

Description

Expected Behavior

Tekton TaskRuns should not fail due to init container exit code 255 when the init containers (prepare, place-scripts) complete their work successfully.

Actual Behavior

On CRI-O-based clusters (including OpenShift), Tekton's init containers intermittently fail with exit code 255 and no error logs. The container's work completed successfully — the log message "Entrypoint initialization" is the success return from subcommands.OK{message: "Entrypoint initialization"} in cmd/entrypoint/subcommands/subcommands.go:66, confirming the binary was copied and step directories were created.

The TaskRun is marked as Failed with:

init container failed, "prepare" exited with code 255

CRI-O logs on the affected node show:

level=error msg="Failed to update container state for <container-id>: stdout: , stderr: "

Why Tekton is uniquely affected

Tekton's init containers are among the fastest-exiting containers in any Kubernetes cluster:

  • prepare: runs /ko-app/entrypoint init, which performs cp(src, dst) + stepInit(steps) (binary copy + step directory/symlink creation) → exits in <1ms
  • place-scripts: runs a shell that writes script files to a volume → exits in ~1-10ms

This hits a container runtime race condition in which the exit code of a very fast-exiting process cannot be captured: the process terminates before the runtime finishes setting up its state tracking. The root cause is tracked upstream in the CRI-O project.

Note: This may also affect containerd — not yet verified.

Steps to Reproduce the Problem

  1. Deploy Tekton Pipelines on a CRI-O-based cluster (OpenShift 4.x)
  2. Run many TaskRuns concurrently
  3. Observe intermittent init container failures with exit code 255 and no error logs
  4. Verify via CRI-O logs on the affected node:
    journalctl -u crio --no-pager | grep "Failed to update container state"

The issue is intermittent, not reproducible on demand, and not related to node resource pressure.

Possible Tekton-side mitigations

While the root cause is in the container runtime, Tekton can mitigate this:

Option A: Add a brief delay before init container exit

In cmd/entrypoint/subcommands/init.go, add time.Sleep(10 * time.Millisecond) after the filesystem operations complete. This widens the window for the container runtime to set up process monitoring. Cost: 10ms per TaskRun (negligible for CI/CD pipelines).

Option B: Reconciler retry on init container exit 255

In pkg/reconciler/taskrun/, detect when a pod fails with init container exit code 255 and automatically recreate the pod. This handles the failure transparently without adding latency to any TaskRun.
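A sketch of the detection side of Option B. The struct below is a hypothetical stand-in for the relevant fields of corev1.ContainerStatus; the real reconciler in pkg/reconciler/taskrun/ would inspect pod.Status.InitContainerStatuses directly:

```go
package main

import "fmt"

// initContainerState is a minimal stand-in for the fields of
// corev1.ContainerStatus that the check needs.
type initContainerState struct {
	Name     string
	ExitCode int32
	Finished bool
}

// shouldRecreatePod sketches Option B's trigger: recreate the pod only when
// a Tekton-owned init container (prepare, place-scripts) exited with 255,
// the signature of the CRI-O state-update race. User workloads and other
// exit codes are left to the normal failure path.
func shouldRecreatePod(states []initContainerState) bool {
	tektonInit := map[string]bool{"prepare": true, "place-scripts": true}
	for _, s := range states {
		if s.Finished && tektonInit[s.Name] && s.ExitCode == 255 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldRecreatePod([]initContainerState{
		{Name: "prepare", ExitCode: 255, Finished: true},
	}))
	fmt.Println(shouldRecreatePod([]initContainerState{
		{Name: "prepare", ExitCode: 0, Finished: true},
	}))
}
```

Recreation would also need a retry cap to avoid looping if a node persistently returns 255 for unrelated reasons.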

Option C: Both

Option A reduces the frequency. Option B handles the remaining edge cases.

Additional Info

  • Kubernetes version: OpenShift 4.20.15
  • Tekton Pipeline version: affects all versions (the init container code path has existed since the entrypoint refactor)
  • Current user workaround: set retries on PipelineTask definitions to automatically retry failed tasks
  • Code locations:
    • Init container creation: pkg/pod/pod.go:619 (prepare), pkg/pod/script.go:98 (place-scripts)
    • Init subcommand: cmd/entrypoint/subcommands/init.go:24 (entrypointInit)
    • Success log: cmd/entrypoint/subcommands/subcommands.go:66 (OK{message: "Entrypoint initialization"})

Labels: kind/bug