Description
Summary
Add support for deploying egg infrastructure on Google Cloud Run using Terraform, as an additional deployment target alongside the existing local Docker-based setup. Both deployment methods must remain fully functional and share as much configuration and code as possible. Additionally, support a hybrid mode where developers run agent sandboxes locally while connecting to remote Cloud Run gateway and orchestrator instances.
Motivation
Cloud Run provides autoscaling, managed TLS, built-in observability, and pay-per-use pricing — making it well-suited for egg's architecture where workloads are bursty (agent sessions are triggered on-demand). Moving to Cloud Run eliminates the need to manage VMs or container orchestration while maintaining the containerized deployment model we already use.
Terraform gives us reproducible, version-controlled infrastructure with a plan/apply workflow that fits our review-before-deploy model.
Design Principle: Local-First, Cloud-Compatible
The local Docker Compose deployment is the primary development and operational mode. Cloud Run is an additional deployment target — not a replacement. All changes must preserve full local functionality.
Deployment Modes
The system must support three deployment topologies:
1. Fully Local (existing, unchanged)
All components run locally via Docker Compose: orchestrator, gateway, agent sandboxes, PostgreSQL, Redis. This is the current setup and the default for development.
┌─────────────────────────────────────────────┐
│ Developer Machine (Docker Compose) │
│ ┌─────────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Orchestrator │ │ Gateway │ │ Agent │ │
│ └──────┬──────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ┌──────┴──────┐ ┌────┴────┐ │
│ │ PostgreSQL │ │ Redis │ │
│ └─────────────┘ └─────────┘ │
└─────────────────────────────────────────────┘
2. Fully Cloud Run
All components run on Cloud Run (services + jobs), backed by Cloud SQL and Memorystore. No local components needed.
┌─────────────────────────────────────────────┐
│ Google Cloud Run │
│ ┌─────────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Orchestrator │ │ Gateway │ │ Agent │ │
│ │ (service) │ │(service)│ │ (job) │ │
│ └──────┬──────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ┌──────┴──────┐ ┌────┴────┐ │
│ │ Cloud SQL │ │Memorystore│ │
│ └─────────────┘ └──────────┘ │
└─────────────────────────────────────────────┘
3. Hybrid: Local Sandbox + Remote Control Plane
Developers run agent sandboxes locally (for fast iteration, debugging, local file access) but connect to the shared remote gateway and orchestrator on Cloud Run. This is the key workflow for teams where infrastructure is centrally managed but agents run on developer machines.
┌──────────────────────┐ ┌──────────────────────────┐
│ Developer Machine │ │ Google Cloud Run │
│ │ │ │
│ ┌─────────────────┐ │ │ ┌─────────────┐ │
│ │ Agent Sandbox │ │ HTTPS │ │ Orchestrator │ │
│ │ (Docker / local) │─┼──────┼─▶│ (service) │ │
│ │ │ │ │ └──────┬──────┘ │
│ │ GATEWAY_URL= │ │ │ │ │
│ │ https://gw.run │ │ │ ┌──────┴──────┐ │
│ │ EGG_ORCH_URL= │ │ HTTPS │ │ Gateway │ │
│ │ https://orch.run│─┼──────┼─▶│ (service) │ │
│ └─────────────────┘ │ │ └──────┬──────┘ │
│ │ │ │ │
│ │ │ ┌──────┴──────┐ ┌──────┐│
│ │ │ │ Cloud SQL │ │Redis ││
│ │ │ └─────────────┘ └──────┘│
└──────────────────────┘ └──────────────────────────┘
Hybrid mode requirements:
- Local sandbox connects to remote gateway and orchestrator via HTTPS (just set `GATEWAY_URL` and `EGG_ORCHESTRATOR_URL` to the Cloud Run service URLs)
- Authentication: local sandbox must present a valid identity token to call Cloud Run services. Options: GCP service account key, `gcloud` auth, or a shared API key verified by the gateway
- The gateway must accept connections from outside the VPC (local dev machines), so it needs an auth layer, not just VPC-internal trust
- Local sandbox does NOT need local PostgreSQL or Redis; the remote orchestrator owns all state
- The `egg` CLI should support a `--remote` flag or config file to switch between local and remote control plane
- Agent registration: the remote orchestrator needs to track locally-spawned agents. The local sandbox should call the orchestrator's registration API on startup, same as Cloud Run jobs do
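A minimal sketch of the startup registration call described in the requirements above, assuming a bearer token is already available (for example from `gcloud auth print-identity-token` or a configured API key). The `/agents/register` path, the payload fields, and the function name are hypothetical placeholders; the real registration API is not defined in this issue.

```python
import json
import urllib.request


def build_registration_request(
    orchestrator_url: str, token: str, agent_id: str
) -> urllib.request.Request:
    """Build (but do not send) the registration request a local sandbox would issue."""
    # Hypothetical endpoint path and payload shape, used only for illustration.
    url = orchestrator_url.rstrip("/") + "/agents/register"
    body = json.dumps({"agent_id": agent_id, "origin": "local-sandbox"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # The same bearer-token header would carry either a GCP identity
            # token or a shared API key, depending on what the gateway accepts.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

The sandbox would pass the value of `EGG_ORCHESTRATOR_URL` as `orchestrator_url` and send the request with `urllib.request.urlopen` on startup. Building the request separately from sending it keeps the contract easy to unit test without network access.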
Shared vs. Environment-Specific
| Layer | Shared | Local-Only | Cloud Run-Only | Hybrid-Specific |
|---|---|---|---|---|
| Container images | Same Dockerfiles, same images | — | Pushed to Artifact Registry | Local sandbox image, remote services |
| Application code | All business logic, APIs, policy enforcement | — | — | — |
| Configuration | Env var names and semantics | `docker-compose.yml`, `.env` | Terraform `tfvars`, Secret Manager | `.env` with remote URLs + auth token |
| Service discovery | Same env vars (`GATEWAY_URL`, `EGG_ORCHESTRATOR_URL`) | Docker DNS names | Cloud Run service URLs | Cloud Run URLs from local env |
| Secrets | Same env var interface | `.env` file or Docker secrets | Secret Manager as env vars | Local `.env` with GCP auth token |
| Database | Same schema, same connection string | Local PostgreSQL container | Cloud SQL | Remote (owned by orchestrator) |
| Redis | Same usage patterns | Local Redis container | Memorystore | Remote (owned by orchestrator) |
| Container spawning | Same orchestrator interface | Docker API | Cloud Run Jobs API | N/A (dev launches sandbox manually) |
| Network policy | Same proxy concept | Filtering proxy container | VPC Service Controls | Local proxy → remote gateway |
| Auth (agent→gateway) | Same API contract | Docker network (implicit trust) | Cloud Run IAM | Identity token or API key over HTTPS |
Abstraction Strategy
The key places where local and Cloud Run diverge should be behind runtime-selected backends, not compile-time or build-time switches:
- Container spawner: The orchestrator currently spawns agents via Docker API. Add a `CloudRunJobSpawner` that implements the same interface. Select via `EGG_SPAWNER_BACKEND=docker|cloudrun` env var.
- Secret loading: Currently reads from env vars / `.env`. On Cloud Run, secrets come from Secret Manager but are still mounted as env vars, so no application changes needed.
- Service discovery: Already driven by env vars (`GATEWAY_URL`, `EGG_ORCHESTRATOR_URL`). On Cloud Run, these point to Cloud Run service URLs instead of Docker DNS names. In hybrid mode, local sandbox sets these to the same Cloud Run URLs. No application changes needed.
- Auth middleware: Gateway currently trusts callers implicitly (Docker network boundary). Add an optional auth layer that validates identity tokens or API keys when `EGG_AUTH_MODE=token` is set. Disabled by default for local mode, enabled for Cloud Run and hybrid.
- Health checks: Existing health endpoints should work as-is for Cloud Run probes and hybrid connectivity checks.
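The runtime-selected spawner backend could look roughly like the sketch below. `DockerSpawner`, `CloudRunJobSpawner`, and `EGG_SPAWNER_BACKEND` come from this issue; the `Spawner` protocol, the `spawn` signature, the `select_spawner` helper, and the choice of `docker` as the default are assumptions made for illustration, and the method bodies are stubs rather than real Docker or Cloud Run calls.

```python
import os
from typing import Mapping, Protocol


class Spawner(Protocol):
    """Hypothetical interface shared by all spawner backends."""

    def spawn(self, image: str, env: Mapping[str, str]) -> str: ...


class DockerSpawner:
    def spawn(self, image: str, env: Mapping[str, str]) -> str:
        # Stub: a real implementation would start a container via the Docker API.
        return f"docker:{image}"


class CloudRunJobSpawner:
    def spawn(self, image: str, env: Mapping[str, str]) -> str:
        # Stub: a real implementation would start an execution via the Cloud Run Jobs API.
        return f"cloudrun:{image}"


_BACKENDS = {"docker": DockerSpawner, "cloudrun": CloudRunJobSpawner}


def select_spawner(environ: Mapping[str, str] = os.environ) -> Spawner:
    """Pick the backend at runtime; defaulting to docker is an assumption."""
    name = environ.get("EGG_SPAWNER_BACKEND", "docker")
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown EGG_SPAWNER_BACKEND: {name!r}") from None
```

Because the backend is resolved from the environment at startup, the same orchestrator image runs unchanged in both modes, and fully-local mode never has to import or authenticate against GCP client libraries.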
What NOT to Change
- Do not remove or modify Docker Compose files
- Do not add Cloud Run-specific logic to core business code (gateway policy enforcement, SDLC pipeline, checkpoint storage, etc.)
- Do not change env var names or semantics — Cloud Run config should use the same env vars
- Do not break the filtering proxy for local mode when adding Cloud Run network controls
- Do not require GCP credentials for fully-local mode
Proposed Architecture
Orchestrator → Cloud Run Service
- Long-running Cloud Run service handling pipeline lifecycle, phase management, and HITL decisions
- Needs persistent connections (WebSocket or SSE) for real-time pipeline status
- Backed by Cloud SQL (PostgreSQL) and Memorystore (Redis) for state management
- Min instances: 1 (to avoid cold start latency for pipeline operations)
- Must accept agent registration from both Cloud Run jobs and remote local sandboxes
- Same container image as local, different env vars
Gateway → Cloud Run Service
- Long-running Cloud Run service handling git/gh proxying, policy enforcement, and credential management
- Needs access to GitHub App credentials via Secret Manager
- Enforces branch ownership, merge blocking, and phase-validated commits
- Should be in the same VPC as the orchestrator for low-latency communication
- Must authenticate incoming requests (Cloud Run IAM for service-to-service, token/API key for hybrid local agents)
- Same container image as local, different env vars
Agent Containers → Cloud Run Jobs OR Local Sandbox
- Cloud Run: Headless Claude Code instances (`claude --print`) launched as Cloud Run jobs
- Hybrid: Developer runs sandbox locally via `egg --remote`, connecting to Cloud Run gateway/orchestrator
egg --remote, connecting to Cloud Run gateway/orchestrator - Both use the same agent image and the same env var contract
- Jobs spawned by orchestrator use Cloud Run Jobs API; local sandboxes self-register with the orchestrator
Terraform Structure
terraform/
├── environments/
│ ├── dev/
│ │ ├── main.tf # Dev environment composition
│ │ ├── terraform.tfvars # Dev-specific values
│ │ └── backend.tf # GCS state backend (dev bucket)
│ ├── staging/
│ │ └── ...
│ └── prod/
│ └── ...
├── modules/
│ ├── networking/
│ │ ├── main.tf # VPC, subnets, VPC connector, firewall rules
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── cloud-run-service/
│ │ ├── main.tf # Reusable Cloud Run service (used by orchestrator + gateway)
│ │ ├── variables.tf # image, env vars, min/max instances, VPC connector, etc.
│ │ └── outputs.tf # service URL, service account
│ ├── cloud-run-job/
│ │ ├── main.tf # Cloud Run Job template for agent containers
│ │ ├── variables.tf # image, timeout, env vars, resource limits
│ │ └── outputs.tf
│ ├── database/
│ │ ├── main.tf # Cloud SQL (PostgreSQL) instance + databases
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── cache/
│ │ ├── main.tf # Memorystore (Redis) instance
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── secrets/
│ │ ├── main.tf # Secret Manager secrets + IAM bindings
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── artifact-registry/
│ │ ├── main.tf # Artifact Registry repo for container images
│ │ ├── variables.tf
│ │ └── outputs.tf
│ └── iam/
│ ├── main.tf # Service accounts, roles, workload identity
│ ├── variables.tf
│ └── outputs.tf
├── versions.tf # Provider version constraints
└── README.md
Key Terraform Resources
| Resource | Purpose |
|---|---|
| `google_cloud_run_v2_service` | Orchestrator and gateway services |
| `google_cloud_run_v2_job` | Agent container job template |
| `google_sql_database_instance` | PostgreSQL for pipeline state |
| `google_redis_instance` | Redis for caching/pub-sub |
| `google_secret_manager_secret` | GitHub App key, API keys, Anthropic key |
| `google_artifact_registry_repository` | Container image registry |
| `google_vpc_access_connector` | VPC connector for Cloud Run → private resources |
| `google_compute_network` / `google_compute_subnetwork` | VPC and subnet configuration |
| `google_service_account` | Per-service identity (least privilege) |
| `google_cloud_run_v2_service_iam_member` | Service-to-service auth |
Terraform Workflow
- Plan: `terraform plan` on PR (GitHub Actions), to review infra changes alongside code
- Apply: `terraform apply` after merge to main, automated via GitHub Actions
- State: Remote state in GCS bucket with state locking via Cloud Storage
- Secrets: Sensitive values via `terraform.tfvars` (gitignored) or injected from Secret Manager in CI
Key Design Considerations
- Backwards compatibility: Every application change must be tested in both local Docker Compose and Cloud Run. CI should run the existing local test suite unchanged.
- Hybrid auth: The gateway and orchestrator need an auth mechanism that works for both Cloud Run IAM (service-to-service) and external callers (hybrid local sandboxes). A dual-mode approach: accept Cloud Run IAM tokens OR a bearer token/API key. The `egg` CLI can obtain a GCP identity token via `gcloud auth print-identity-token` or use a configured API key.
- Hybrid networking: Cloud Run services need to allow ingress from the internet (for hybrid mode). Use Cloud Run's built-in ingress controls (`--ingress=all`) with auth rather than network-level restrictions for the gateway and orchestrator endpoints that hybrid agents call.
- Hybrid DX: A developer should be able to go from fully local to hybrid with a simple config change. For example, `egg --remote dev` reads a profile that sets `GATEWAY_URL`, `EGG_ORCHESTRATOR_URL`, and auth credentials. No other changes to the sandbox.
- Networking: Orchestrator, gateway, and agent containers need to communicate. Use VPC connector or direct VPC egress for Cloud Run-to-Cloud Run. Hybrid agents reach services over public HTTPS.
- Secrets: GitHub App private key, API keys → Secret Manager, mounted as env vars. Same env var names as local, so the application sees no difference.
- Timeouts: Cloud Run services have a max request timeout of 60 min. Cloud Run jobs have a max timeout of 24h. Agent sessions may need the longer job timeout.
- Cold starts: Orchestrator and gateway should use min-instances=1. Agent containers can tolerate cold starts since they're launched asynchronously.
- State: Pipeline state currently lives in memory and on the local filesystem. It needs to be externalized to Cloud SQL + GCS for Cloud Run. Local mode continues using local PostgreSQL and filesystem.
- Network policy: Local mode keeps the filtering proxy. Cloud Run uses VPC Service Controls or firewall rules. Hybrid mode uses the local filtering proxy configured to allow outbound to remote gateway/orchestrator.
- Container images: Same Dockerfiles for both. GitHub Actions builds images on push, tags with commit SHA, pushes to Artifact Registry. Local just builds with `docker compose build`.
- Service-to-service auth: Cloud Run IAM invoker roles between orchestrator → gateway and orchestrator → agent jobs. Hybrid agents use token-based auth.
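The dual-mode auth idea from the considerations above can be sketched as a single check that the gateway and orchestrator middleware would both call. `EGG_AUTH_MODE=token` comes from this issue; the `authorize` function, its parameters, and the injected `verify_identity_token` callback are hypothetical. Real identity-token verification (signature, audience, expiry) is deliberately left to that callback and is not implemented here.

```python
import hmac
from typing import Callable, Mapping, Optional


def authorize(
    headers: Mapping[str, str],
    auth_mode: Optional[str],
    api_key: Optional[str],
    verify_identity_token: Callable[[str], bool],
) -> bool:
    """Return True if the request may proceed."""
    # Local mode: auth is disabled and the Docker network boundary is trusted.
    if auth_mode != "token":
        return True

    scheme, _, credential = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not credential:
        return False

    # Mode 1: shared API key, compared in constant time.
    if api_key and hmac.compare_digest(credential.encode(), api_key.encode()):
        return True

    # Mode 2: identity token, verified by the injected (hypothetical) callback.
    return verify_identity_token(credential)
```

Keeping token verification behind a callback means fully-local mode never needs GCP credentials or client libraries, which matches the "What NOT to Change" constraints.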
Tasks
Abstraction Layer (Do First — No Cloud Run Dependency)
- Introduce container spawner interface in orchestrator with `DockerSpawner` (current behavior) and pluggable backend selection via `EGG_SPAWNER_BACKEND` env var
- Add optional auth middleware to gateway and orchestrator (disabled by default, enabled via `EGG_AUTH_MODE`)
- Add agent self-registration endpoint to orchestrator (for hybrid local sandboxes and Cloud Run jobs that need to register on startup)
- Verify all service discovery is driven by env vars (no hardcoded hostnames)
- Verify health check endpoints work as HTTP probes (no changes expected)
- Ensure database connection is configured via `DATABASE_URL` env var (works for both local PG and Cloud SQL)
CLI Support for Hybrid Mode
- Add `egg --remote <profile>` flag that loads remote gateway/orchestrator URLs and auth config from a profile file (e.g., `~/.egg/profiles/dev.env`)
- Implement profile config format: `GATEWAY_URL`, `EGG_ORCHESTRATOR_URL`, auth token or `gcloud` integration
- On startup in hybrid mode: self-register with remote orchestrator, verify connectivity to remote gateway
- Update filtering proxy config to allow outbound to remote gateway/orchestrator URLs in hybrid mode
- Document hybrid setup: `gcloud auth`, profile creation, first-run verification
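A possible shape for the profile loader, assuming profiles are plain `KEY=VALUE` files like the `.env` files already used locally. Only `GATEWAY_URL` and `EGG_ORCHESTRATOR_URL` are taken from this issue; the `parse_profile` name, the comment and quoting rules, and treating both URLs as required are assumptions.

```python
from typing import Dict

# Keys the issue says every remote profile must set.
REQUIRED_KEYS = ("GATEWAY_URL", "EGG_ORCHESTRATOR_URL")


def parse_profile(text: str) -> Dict[str, str]:
    """Parse .env-style profile text into a dict and check the required keys."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VALUE, got: {raw!r}")
        # Tolerate optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    missing = [k for k in REQUIRED_KEYS if k not in values]
    if missing:
        raise ValueError("profile is missing: " + ", ".join(missing))
    return values
```

With this in place, `egg --remote dev` would read the `dev` profile file, pass its contents to `parse_profile`, and export the resulting values into the sandbox environment before running the connectivity check.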
Terraform Foundation
- Set up Terraform directory structure with modules as described above
- Configure GCS backend for remote state
- Create `networking` module (VPC, subnets, VPC connector, firewall rules)
- Create `iam` module (service accounts, per-developer identity for hybrid access)
- Create `secrets` module (Secret Manager resources + IAM bindings)
- Create `artifact-registry` module
Cloud Run Services
- Create `cloud-run-service` module (reusable for orchestrator and gateway)
- Define orchestrator service (env vars, Cloud SQL connection, min instances, auth enabled)
- Define gateway service (Secret Manager mounts, VPC config, auth enabled)
- Configure service-to-service IAM (orchestrator ↔ gateway)
- Configure ingress to allow hybrid agents (internet ingress + auth)
Cloud Run Jobs (Agents)
- Create `cloud-run-job` module for agent containers
- Implement `CloudRunJobSpawner` behind the spawner interface
- Configure job execution environment (timeout, memory, CPU, env vars)
Data Layer
- Create `database` module (Cloud SQL PostgreSQL)
- Create `cache` module (Memorystore Redis)
- Externalize any filesystem-based state to GCS (Cloud Run only; local mode unchanged)
CI/CD
- GitHub Actions workflow: build + push images to Artifact Registry
- GitHub Actions workflow: `terraform plan` on PR, `terraform apply` on merge
- Environment promotion pipeline (dev → staging → prod)
- Ensure existing local test suite continues to pass in CI
Validation
- Integration testing in a staging Cloud Run environment
- Hybrid mode testing: local sandbox → remote Cloud Run gateway/orchestrator
- Side-by-side validation: same operations tested on local, Cloud Run, and hybrid
- Documentation for Cloud Run + Terraform deployment (separate from local docs)
- Documentation for hybrid setup for developers
Open Questions
- Should we use Cloud Run Functions (2nd gen, which is built on Cloud Run) for agents instead of Cloud Run Jobs? Functions give simpler invocation but less control over execution environment.
- What's the right approach for agent-to-gateway communication in fully-Cloud-Run mode? Sidecar pattern doesn't work in Cloud Run — options include service-to-service auth or a shared VPC.
- Do we need Cloud Run multi-container support (currently in preview) to co-locate gateway logic with agents?
- Should Terraform state be in a dedicated GCP project or the same project as the workloads?
- Do we want Workload Identity Federation for GitHub Actions → GCP auth, or a service account key?
- For hybrid auth, should we use GCP identity tokens (requires each dev to have GCP IAM access), a shared API key (simpler but less granular), or both?
- Should the orchestrator enforce any limits on hybrid agents (max concurrent per developer, allowed repos, etc.)?