Description
Expected Behavior
Tekton TaskRuns should not fail due to init container exit code 255 when the init containers (prepare, place-scripts) complete their work successfully.
Actual Behavior
On CRI-O-based clusters (including OpenShift), Tekton's init containers intermittently fail with exit code 255 and no error logs. The container's work completed successfully — the log message "Entrypoint initialization" is the success return from subcommands.OK{message: "Entrypoint initialization"} in cmd/entrypoint/subcommands/subcommands.go:66, confirming the binary was copied and step directories were created.
The TaskRun is marked as Failed with:
init container failed, "prepare" exited with code 255
CRI-O logs on the affected node show:
level=error msg="Failed to update container state for <container-id>: stdout: , stderr: "
Why Tekton is uniquely affected
Tekton's init containers are among the fastest-exiting containers in any Kubernetes cluster:
- prepare: runs `/ko-app/entrypoint init` → `cp(src, dst)` + `stepInit(steps)` (binary copy + symlink creation) → exits in <1ms
- place-scripts: runs a shell to write script files to a volume → exits in ~1-10ms
This hits a container runtime race condition where the exit code cannot be captured for very fast-exiting processes. The root cause is tracked upstream:
- CRI-O reports exit code 255 for fast-exiting init containers in production (conmon race condition) cri-o/cri-o#9840
- e2e_node tests upstream spuriously fail with containers exiting with exit code 255 cri-o/cri-o#8980 (closed without fix)
Note: This may also affect containerd — not yet verified.
Steps to Reproduce the Problem
- Deploy Tekton Pipelines on a CRI-O-based cluster (OpenShift 4.x)
- Run many TaskRuns concurrently
- Observe intermittent init container failures with exit code 255 and no error logs
- Verify via CRI-O logs on the affected node:
journalctl -u crio --no-pager | grep "Failed to update container state"
The issue is intermittent, not reproducible on demand, and not related to node resource pressure.
Possible Tekton-side mitigations
While the root cause is in the container runtime, Tekton can mitigate this:
Option A: Add a brief delay before init container exit
In cmd/entrypoint/subcommands/init.go, add time.Sleep(10 * time.Millisecond) after the filesystem operations complete. This widens the window for the container runtime to set up process monitoring. Cost: 10ms per TaskRun (negligible for CI/CD pipelines).
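A minimal sketch of Option A, with a hypothetical `runInit` standing in for the real `entrypointInit` (its name and shape here are assumed, not Tekton's actual API):

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// runInit stands in for entrypointInit in
// cmd/entrypoint/subcommands/init.go. It performs the filesystem
// work, then sleeps briefly before returning so the runtime's
// monitor (conmon) can attach before the process exits.
func runInit() (string, error) {
	// ... cp(src, dst) and stepInit(steps) would run here ...

	// Option A mitigation: ~10ms of extra latency per TaskRun.
	time.Sleep(10 * time.Millisecond)
	return "Entrypoint initialization", nil
}

func main() {
	msg, err := runInit()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(msg)
}
```

The sleep does not guarantee the race is avoided; it only widens the window, which is why Option B is still worth pairing with it.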
Option B: Reconciler retry on init container exit 255
In pkg/reconciler/taskrun/, detect when a pod fails with init container exit code 255 and automatically recreate the pod. This handles the failure transparently without adding latency to any TaskRun.
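The detection logic for Option B could look like the sketch below. `ContainerStatus` is a simplified stand-in for `corev1.ContainerStatus` (only the fields this check needs), and the reconciler wiring is omitted:

```go
package main

import "fmt"

// ContainerStatus is a simplified stand-in for corev1.ContainerStatus.
type ContainerStatus struct {
	Name       string
	Terminated bool
	ExitCode   int32
}

// isInitExitCodeRace reports whether a failed pod matches the CRI-O
// fast-exit race: one of Tekton's init containers (prepare or
// place-scripts) terminated with exit code 255.
func isInitExitCodeRace(initStatuses []ContainerStatus) bool {
	tektonInit := map[string]bool{"prepare": true, "place-scripts": true}
	for _, s := range initStatuses {
		if tektonInit[s.Name] && s.Terminated && s.ExitCode == 255 {
			return true
		}
	}
	return false
}

func main() {
	failed := []ContainerStatus{{Name: "prepare", Terminated: true, ExitCode: 255}}
	if isInitExitCodeRace(failed) {
		// A reconciler in pkg/reconciler/taskrun/ would delete and
		// recreate the pod here instead of failing the TaskRun.
		fmt.Println("retryable runtime race: recreate pod")
	}
}
```

A retry cap would be needed in practice so a genuinely broken init container (which could also exit 255) does not loop forever.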
Option C: Both
Option A reduces the frequency. Option B handles the remaining edge cases.
Additional Info
- Kubernetes version: OpenShift 4.20.15
- Tekton Pipeline version: affects all versions (the init container code path has existed since the entrypoint refactor)
- Current user workaround: set `retries` on PipelineTask definitions to automatically retry failed tasks
- Code locations:
  - Init container creation: `pkg/pod/pod.go:619` (prepare), `pkg/pod/script.go:98` (place-scripts)
  - Init subcommand: `cmd/entrypoint/subcommands/init.go:24` (entrypointInit)
  - Success log: `cmd/entrypoint/subcommands/subcommands.go:66` (`OK{message: "Entrypoint initialization"}`)