Rewrite service readiness check and remove orchestration service #1583

hugosantos · 2025-10-12T21:16:49Z

Summary

Replace port-forwarding based service readiness checks with a simpler, more reliable one-shot pod approach. This eliminates the orchestration service RPC entirely and makes readiness checks work consistently for both in-cluster and remote Kubernetes clusters.

Key Changes

1. Service Readiness Check Rewrite

Before:

Used RawDialServer() with port-forwarding
Required in-cluster access via *.svc.cluster.local DNS
Complex async error handling
Bash-specific /dev/tcp device

After:

Deploys one-shot pod with shell script inside cluster
Uses nc (netcat) for TCP connectivity tests
Simple pass/fail via pod exit code
POSIX-compliant shell script for busybox compatibility
Works for both remote and in-cluster deployments

2. Orchestration Service Removal

Deleted:

orchestration/service/ package (~400 lines)
orchestration/proto/ package (~800 lines)
CallAreServicesReady() RPC function

The AreServicesReady RPC was the only remaining orchestration service endpoint. Since we now handle readiness checks directly without RPC, the entire service package is no longer needed.

3. Orchestrator Deployment Changes

Set UseOrchestrator = false (was true)
Made orchestrator preparation conditional on flag
Orchestrator still available via --use_orchestrator flag for runtime config controllers

4. Security Improvements

Uses Chainguard busybox image (cgr.dev/chainguard/busybox:latest)
Minimal, secure base image with no CVEs
Runs as uid 65532 (nonroot)
Read-only root filesystem enforced

Implementation Details

The new readiness checker:

Lists K8s services matching the deployable
Generates a POSIX-compliant shell script that tests each TCP port
Deploys a one-shot pod running the script inside the cluster
Script uses nc -z to test TCP connectivity with retry logic (60 attempts × 500ms = 30s)
Returns success if pod exits with code 0, failure otherwise
Pod is automatically cleaned up via defer

Testing

✅ All Go tests pass (go test ./...)
✅ Full codebase builds (go build ./...)
✅ nsdev prepare local completes without deploying orchestrator
✅ nsdev test runs successfully (service readiness check executes)

Benefits

Simpler: Direct K8s API calls, no RPC layer
More reliable: Actual TCP dial inside cluster vs async port-forward errors
Works remotely: Uses K8s API to deploy/watch pod
More secure: Chainguard minimal image, CVE-free
Cleaner codebase: -585 lines of code (mostly dead orchestration service)

Breaking Changes

None. The orchestration service RPC was already unused - this just removes the dead code.

🤖 Generated with Claude Code

Replace port-forwarding based service readiness checks with a simpler, more reliable one-shot pod approach. This eliminates the orchestration service RPC entirely and makes readiness checks work consistently for both in-cluster and remote Kubernetes clusters. Key changes: - Rewrite AreServicesReady() to deploy one-shot checker pod with shell script - Use Chainguard busybox image (cgr.dev/chainguard/busybox:latest) - Use nc (netcat) for TCP connectivity tests instead of bash /dev/tcp - POSIX-compliant shell script for busybox compatibility - Delete orchestration/service and orchestration/proto packages (~580 lines) - Set UseOrchestrator = false (orchestrator no longer needed for readiness) - Make orchestrator deployment conditional on UseOrchestrator flag Benefits: - Actually tests TCP connect() from inside the cluster - Works for remote clusters via K8s API - Simple pass/fail via pod exit code - Built-in retry: 60 attempts × 500ms = 30s per port - More secure: Chainguard minimal, CVE-free image - Simpler: No RPC layer, no async port-forward errors 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

- Increase retry timeout from 30s to 2 minutes (240 attempts) - Replace 'timeout 0.1 nc' with 'nc -z -w 1' (nc native timeout) - The timeout command may not be available in busybox - Longer timeout needed for CI environments where services take longer to start 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

hugosantos and others added 2 commits October 12, 2025 23:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rewrite service readiness check and remove orchestration service #1583

Rewrite service readiness check and remove orchestration service #1583

Uh oh!

hugosantos commented Oct 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Rewrite service readiness check and remove orchestration service #1583

Are you sure you want to change the base?

Rewrite service readiness check and remove orchestration service #1583

Uh oh!

Conversation

hugosantos commented Oct 12, 2025

Summary

Key Changes

1. Service Readiness Check Rewrite

2. Orchestration Service Removal

3. Orchestrator Deployment Changes

4. Security Improvements

Implementation Details

Testing

Benefits

Breaking Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant