@hskiba commented on Jan 11, 2026

Summary

Add a warm pool of pre-stopped EC2 instances to reduce GitHub runner startup time from ~90s to ~20-30s.

Features

  • Pre-create and stop EC2 instances that can be quickly started when jobs are queued
  • Configurable pool sizes per instance type via JSON: {"c8a.4xlarge":1,"c8a.2xlarge":2}
  • Scheduled maintenance (every 5 minutes) keeps each pool at its target size
  • Instances are ephemeral: they terminate after job completion and are not reused
  • Graceful fallback to launching a fresh instance if no warm instance is available

Changes

  • main.go: Add warm pool logic, scheduled maintenance handler, and multipart MIME user-data for cloud-init (template shape sketched after this list)
  • template.yaml: Add WarmPoolConfig parameter, scheduled event, IAM permissions
  • user-data.sh: Handle both pre-extracted runner and fallback to cache extraction
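
The multipartTemplate constant (see the refactor notes below) isn't shown in this description. A common shape for cloud-init multipart user-data in this kind of setup pairs a cloud-config part (for example, one that makes user scripts run on every boot rather than only the first) with the shell script part. A purely illustrative Go constant; the boundary, cloud-config contents, and %s placeholder are assumptions, not the PR's actual template:

```go
// Illustrative only: not the PR's actual multipart template.
const multipartTemplate = `Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-config; charset="us-ascii"

#cloud-config
cloud_final_modules:
  - [scripts-user, always]

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

%s
--==BOUNDARY==--
`
```
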

How it works

  1. Maintenance Lambda runs every 5 minutes to ensure pool is populated
  2. When a job is queued, Lambda checks for available stopped instance in pool
  3. If found: updates tags, changes shutdown behavior to TERMINATE, injects job-specific user-data, and starts the instance (see the sketch after this list)
  4. If not found: launches a fresh instance (original behavior)
  5. After job completes, instance terminates (ephemeral)
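
A minimal sketch of steps 2-3 (tryAcquireWarmInstance() is named in the commits below), assuming aws-sdk-go-v2 and a hypothetical warm-pool tag on pooled instances; the actual filters and tag values may differ:

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// tryAcquireWarmInstance looks for a stopped pool instance of the requested
// type, makes it ephemeral, injects the job's user-data, and starts it.
// An empty ID with a nil error means the pool had nothing available.
func tryAcquireWarmInstance(ctx context.Context, client *ec2.Client, instanceType, userData string) (string, error) {
	// Step 2: find a stopped instance of this type carrying the pool tag.
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		Filters: []types.Filter{
			{Name: aws.String("instance-state-name"), Values: []string{"stopped"}},
			{Name: aws.String("instance-type"), Values: []string{instanceType}},
			{Name: aws.String("tag:warm-pool"), Values: []string{"ready"}},
		},
	})
	if err != nil {
		return "", err
	}
	if len(out.Reservations) == 0 || len(out.Reservations[0].Instances) == 0 {
		return "", nil // pool empty: caller falls back to a fresh launch
	}
	id := *out.Reservations[0].Instances[0].InstanceId

	// Step 3a: retag so maintenance no longer counts this instance as available.
	if _, err := client.CreateTags(ctx, &ec2.CreateTagsInput{
		Resources: []string{id},
		Tags:      []types.Tag{{Key: aws.String("warm-pool"), Value: aws.String("in-use")}},
	}); err != nil {
		return "", err
	}
	// Step 3b: terminate (rather than stop) on shutdown, and swap in the job's user-data.
	if _, err := client.ModifyInstanceAttribute(ctx, &ec2.ModifyInstanceAttributeInput{
		InstanceId:                        aws.String(id),
		InstanceInitiatedShutdownBehavior: &types.AttributeValue{Value: aws.String("terminate")},
	}); err != nil {
		return "", err
	}
	if _, err := client.ModifyInstanceAttribute(ctx, &ec2.ModifyInstanceAttributeInput{
		InstanceId: aws.String(id),
		UserData:   &types.BlobAttributeValue{Value: []byte(userData)},
	}); err != nil {
		return "", err
	}
	// Step 3c: start the instance.
	_, err = client.StartInstances(ctx, &ec2.StartInstancesInput{InstanceIds: []string{id}})
	return id, err
}
```

Returning an empty ID with a nil error lets the webhook handler fall through to the fresh-launch path in step 4.
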

Test plan

  • Deploy with WARM_POOL_CONFIG={"c8a.4xlarge":1}
  • Verify instances are created and stopped by maintenance
  • Queue a job and verify warm instance is activated
  • Verify pool is replenished after use
  • Test fallback when pool is empty

Commits

Skip extraction step since the runner is now pre-extracted during the AMI
build at /opt/actions-runner/. This reduces startup time and disk I/O.

Implement a warm pool of pre-stopped EC2 instances to reduce GitHub
runner startup time. Key features:

- New WarmPoolConfig parameter (JSON map of instance type to pool size)
- Warm pool instances stop after first boot, ready for quick activation
- When activated, shutdown behavior changes to TERMINATE (ephemeral)
- Pool automatically replenishes when instances are used
- Pool size of 0 or empty config disables feature (current behavior)

Example config: {"c8a.4xlarge":2,"c8a.2xlarge":3}
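
A minimal sketch of the parsing, assuming the config arrives via the WARM_POOL_CONFIG environment variable used in the test plan (needs encoding/json and os); the exact signature is an assumption:

```go
// parseWarmPoolConfig reads the JSON map of instance type -> pool size.
// An unset or empty value yields an empty map, which disables the feature.
func parseWarmPoolConfig() (map[string]int, error) {
	cfg := map[string]int{}
	raw := os.Getenv("WARM_POOL_CONFIG") // e.g. {"c8a.4xlarge":2,"c8a.2xlarge":3}
	if raw == "" {
		return cfg, nil
	}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}
```
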

New Lambda permissions: ec2:DescribeInstances, ec2:StartInstances,
ec2:StopInstances, ec2:ModifyInstanceAttribute.

- Extract warmPoolFilters() helper to reduce duplication
- Consolidate EC2 launch logic into buildRunInstancesInput() and launchInstance()
- Extract tryAcquireWarmInstance() and replenishWarmPool() helpers
- Use single multipartTemplate constant
- Simplify nil map access (Go returns zero value for nil maps)

Net reduction of ~80 lines while improving readability.
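
A rough sketch of what the consolidated launch path could look like; the launchConfig fields, tag key, and AMI handling are assumptions rather than the PR's exact code (same aws-sdk-go-v2 imports as the activation sketch above, plus encoding/base64):

```go
// launchConfig is a hypothetical stand-in for what getLaunchConfig() returns.
type launchConfig struct {
	AMI          string
	InstanceType string
	UserData     string
	WarmPool     bool // pool instances stop after first boot; job instances terminate
}

func buildRunInstancesInput(cfg launchConfig) *ec2.RunInstancesInput {
	input := &ec2.RunInstancesInput{
		ImageId:      aws.String(cfg.AMI),
		InstanceType: types.InstanceType(cfg.InstanceType),
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
		// RunInstances expects the user-data already base64-encoded.
		UserData:                          aws.String(base64.StdEncoding.EncodeToString([]byte(cfg.UserData))),
		InstanceInitiatedShutdownBehavior: types.ShutdownBehaviorTerminate,
	}
	if cfg.WarmPool {
		// Warm pool instances stop after first boot and are tagged for later lookup.
		input.InstanceInitiatedShutdownBehavior = types.ShutdownBehaviorStop
		input.TagSpecifications = []types.TagSpecification{{
			ResourceType: types.ResourceTypeInstance,
			Tags:         []types.Tag{{Key: aws.String("warm-pool"), Value: aws.String("ready")}},
		}}
	}
	return input
}

func launchInstance(ctx context.Context, client *ec2.Client, cfg launchConfig) (string, error) {
	out, err := client.RunInstances(ctx, buildRunInstancesInput(cfg))
	if err != nil {
		return "", err
	}
	return *out.Instances[0].InstanceId, nil
}
```
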
- Add CloudWatch Events rule that triggers every 5 minutes
- Add handleMaintenance() to check and populate all configured instance types
- Refactor handler to dispatch between API Gateway and scheduled events
- Extract getLaunchConfig() helper to reduce duplication

The maintenance function iterates through all configured instance types
and launches instances to reach target pool sizes.
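
A sketch of how the dispatch and maintenance loop might be wired, assuming aws-lambda-go's events package and encoding/json; handleWebhook() and replenishWarmPool() are the PR's own helpers, and the real signatures may differ:

```go
// handler accepts the raw payload so one function can serve both the
// API Gateway webhook and the 5-minute scheduled event.
func handler(ctx context.Context, raw json.RawMessage) (events.APIGatewayProxyResponse, error) {
	// Scheduled events from CloudWatch Events/EventBridge carry a detail-type.
	var sched events.CloudWatchEvent
	if json.Unmarshal(raw, &sched) == nil && sched.DetailType == "Scheduled Event" {
		return events.APIGatewayProxyResponse{StatusCode: 200}, handleMaintenance(ctx)
	}

	// Anything else is treated as a webhook delivered through API Gateway.
	var req events.APIGatewayProxyRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 400}, err
	}
	return handleWebhook(ctx, req)
}

// handleMaintenance tops every configured instance type back up to its target size.
func handleMaintenance(ctx context.Context) error {
	pools, err := parseWarmPoolConfig()
	if err != nil {
		return err
	}
	for instanceType, target := range pools {
		if err := replenishWarmPool(ctx, instanceType, target); err != nil {
			return err
		}
	}
	return nil
}
```
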
Fall back to extracting from /opt/runner-cache/ if /opt/actions-runner
doesn't exist. This supports both old AMIs (with runner cache) and new
AMIs (with pre-extracted runner).

- Return empty map instead of nil from parseWarmPoolConfig()
- Remove redundant nil check in handleMaintenance()
- Consolidate duplicate launch config building in handleWebhook()
  by reusing getLaunchConfig() (~40 lines removed)

Generate JIT config from Lambda via GitHub API instead of passing a PAT
to the instance. This eliminates the 15-30 second config.sh registration
step on the runner.

Changes:
- main.go: Add generateJITConfig() to call GitHub's JIT config API
- main.go: Build labels list and pass JIT config to user-data template
- user-data.sh: Remove get_github_token() and config.sh steps
- user-data.sh: Use ./run.sh --jitconfig instead
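
A sketch of what generateJITConfig() might look like against GitHub's repo-level generate-jitconfig endpoint; the owner/repo parameters, runner group ID, and token plumbing are assumptions (uses net/http, bytes, encoding/json, fmt):

```go
func generateJITConfig(ctx context.Context, token, owner, repo, runnerName string, labels []string) (string, error) {
	payload, err := json.Marshal(map[string]any{
		"name":            runnerName,
		"runner_group_id": 1, // default runner group; an assumption
		"labels":          labels,
	})
	if err != nil {
		return "", err
	}
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runners/generate-jitconfig", owner, repo)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return "", fmt.Errorf("generate-jitconfig: unexpected status %s", resp.Status)
	}

	var out struct {
		EncodedJITConfig string `json:"encoded_jit_config"` // handed to ./run.sh --jitconfig
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.EncodedJITConfig, nil
}
```
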
ModifyInstanceAttribute with BlobAttributeValue automatically handles
base64 encoding, so we shouldn't pre-encode. This was causing user-data
to exceed the 16KB limit.
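
The shape of the fix, sketched with the same aws-sdk-go-v2 types as the activation example above; setUserData is a hypothetical helper name:

```go
// Pass the raw script; the SDK base64-encodes BlobAttributeValue itself,
// so pre-encoding doubles the work and can blow the 16 KB user-data limit.
func setUserData(ctx context.Context, client *ec2.Client, instanceID, script string) error {
	_, err := client.ModifyInstanceAttribute(ctx, &ec2.ModifyInstanceAttributeInput{
		InstanceId: aws.String(instanceID),
		UserData:   &types.BlobAttributeValue{Value: []byte(script)}, // raw, not pre-encoded
	})
	return err
}
```
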
Cloud-init caches user-data from first boot, so when we update user-data
on warm pool activation, the cached script runs instead of the new one.

Fix by storing JIT config in SSM Parameter Store (/github-runner/jit-config/{instance-id})
and having the user-data script fetch it from there. This works because:
1. The script itself doesn't change (no templating needed)
2. The SSM parameter is created fresh for each job

Changes:
- main.go: Add storeJITConfigInSSM(), remove template usage
- user-data.sh: Fetch JIT config from SSM using instance ID
- template.yaml: Add SSM permissions for Lambda and EC2 instance
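
A sketch of storeJITConfigInSSM(), assuming aws-sdk-go-v2's SSM client (ssmtypes below aliases github.com/aws/aws-sdk-go-v2/service/ssm/types):

```go
func storeJITConfigInSSM(ctx context.Context, client *ssm.Client, instanceID, jitConfig string) error {
	_, err := client.PutParameter(ctx, &ssm.PutParameterInput{
		Name:      aws.String("/github-runner/jit-config/" + instanceID),
		Value:     aws.String(jitConfig),
		Type:      ssmtypes.ParameterTypeSecureString, // encrypted at rest
		Overwrite: aws.Bool(true),                     // fresh value for every job
	})
	return err
}
```

On the instance side, user-data.sh can read the parameter back with aws ssm get-parameter --with-decryption, using the instance ID obtained from instance metadata.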