@hskiba commented on Jan 11, 2026

Summary

Add a warm pool of pre-stopped EC2 instances to reduce GitHub runner startup time from ~90s to ~20-30s.

Features

  • Pre-create and stop EC2 instances that can be quickly started when jobs are queued
  • Configurable pool sizes per instance type via JSON: {"c8a.4xlarge":1,"c8a.2xlarge":2}
  • Scheduled maintenance (every 5 minutes) keeps each pool at its target size
  • Instances are ephemeral: they terminate after job completion and are not reused
  • Graceful fallback to launching a fresh instance if no warm instance is available

Changes

  • main.go: Add warm pool logic, scheduled maintenance handler, and multipart MIME user-data for cloud-init (template shape sketched after this list)
  • template.yaml: Add WarmPoolConfig parameter, scheduled event, IAM permissions
  • user-data.sh: Handle both pre-extracted runner and fallback to cache extraction
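
The multipartTemplate constant (see the refactor notes below) isn't shown in this description. A common shape for cloud-init multipart user-data in this kind of setup pairs a cloud-config part (for example, one that makes user scripts run on every boot rather than only the first) with the shell script part. A purely illustrative Go constant; the boundary, cloud-config contents, and %s placeholder are assumptions, not the PR's actual template:

```go
// Illustrative only: not the PR's actual multipart template.
const multipartTemplate = `Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-config; charset="us-ascii"

#cloud-config
cloud_final_modules:
  - [scripts-user, always]

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

%s
--==BOUNDARY==--
`
```
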

How it works

  1. Maintenance Lambda runs every 5 minutes to ensure pool is populated
  2. When a job is queued, Lambda checks for available stopped instance in pool
  3. If found: updates tags, changes shutdown behavior to TERMINATE, injects job-specific user-data, and starts the instance (see the sketch after this list)
  4. If not found: launches a fresh instance (original behavior)
  5. After job completes, instance terminates (ephemeral)
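
A minimal sketch of steps 2-3 (tryAcquireWarmInstance() is named in the commits below), assuming aws-sdk-go-v2 and a hypothetical warm-pool tag on pooled instances; the actual filters and tag values may differ:

```go
package main

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// tryAcquireWarmInstance looks for a stopped pool instance of the requested
// type, makes it ephemeral, injects the job's user-data, and starts it.
// An empty ID with a nil error means the pool had nothing available.
func tryAcquireWarmInstance(ctx context.Context, client *ec2.Client, instanceType, userData string) (string, error) {
	// Step 2: find a stopped instance of this type carrying the pool tag.
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		Filters: []types.Filter{
			{Name: aws.String("instance-state-name"), Values: []string{"stopped"}},
			{Name: aws.String("instance-type"), Values: []string{instanceType}},
			{Name: aws.String("tag:warm-pool"), Values: []string{"ready"}},
		},
	})
	if err != nil {
		return "", err
	}
	if len(out.Reservations) == 0 || len(out.Reservations[0].Instances) == 0 {
		return "", nil // pool empty: caller falls back to a fresh launch
	}
	id := *out.Reservations[0].Instances[0].InstanceId

	// Step 3a: retag so maintenance no longer counts this instance as available.
	if _, err := client.CreateTags(ctx, &ec2.CreateTagsInput{
		Resources: []string{id},
		Tags:      []types.Tag{{Key: aws.String("warm-pool"), Value: aws.String("in-use")}},
	}); err != nil {
		return "", err
	}
	// Step 3b: terminate (rather than stop) on shutdown, and swap in the job's user-data.
	if _, err := client.ModifyInstanceAttribute(ctx, &ec2.ModifyInstanceAttributeInput{
		InstanceId:                        aws.String(id),
		InstanceInitiatedShutdownBehavior: &types.AttributeValue{Value: aws.String("terminate")},
	}); err != nil {
		return "", err
	}
	if _, err := client.ModifyInstanceAttribute(ctx, &ec2.ModifyInstanceAttributeInput{
		InstanceId: aws.String(id),
		UserData:   &types.BlobAttributeValue{Value: []byte(userData)},
	}); err != nil {
		return "", err
	}
	// Step 3c: start the instance.
	_, err = client.StartInstances(ctx, &ec2.StartInstancesInput{InstanceIds: []string{id}})
	return id, err
}
```

Returning an empty ID with a nil error lets the webhook handler fall through to the fresh-launch path in step 4.
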

Test plan

  • Deploy with WARM_POOL_CONFIG={"c8a.4xlarge":1}
  • Verify instances are created and stopped by maintenance
  • Queue a job and verify warm instance is activated
  • Verify pool is replenished after use
  • Test fallback when pool is empty

Commits

Skip extraction step since the runner is now pre-extracted during the AMI
build at /opt/actions-runner/. This reduces startup time and disk I/O.

Implement a warm pool of pre-stopped EC2 instances to reduce GitHub
runner startup time. Key features:

- New WarmPoolConfig parameter (JSON map of instance type to pool size)
- Warm pool instances stop after first boot, ready for quick activation
- When activated, shutdown behavior changes to TERMINATE (ephemeral)
- Pool automatically replenishes when instances are used
- Pool size of 0 or empty config disables feature (current behavior)

Example config: {"c8a.4xlarge":2,"c8a.2xlarge":3}
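
A minimal sketch of the parsing, assuming the config arrives via the WARM_POOL_CONFIG environment variable used in the test plan (needs encoding/json and os); the exact signature is an assumption:

```go
// parseWarmPoolConfig reads the JSON map of instance type -> pool size.
// An unset or empty value yields an empty map, which disables the feature.
func parseWarmPoolConfig() (map[string]int, error) {
	cfg := map[string]int{}
	raw := os.Getenv("WARM_POOL_CONFIG") // e.g. {"c8a.4xlarge":2,"c8a.2xlarge":3}
	if raw == "" {
		return cfg, nil
	}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}
```
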

New Lambda permissions: ec2:DescribeInstances, ec2:StartInstances,
ec2:StopInstances, ec2:ModifyInstanceAttribute.

- Extract warmPoolFilters() helper to reduce duplication
- Consolidate EC2 launch logic into buildRunInstancesInput() and launchInstance()
- Extract tryAcquireWarmInstance() and replenishWarmPool() helpers
- Use single multipartTemplate constant
- Simplify nil map access (Go returns zero value for nil maps)

Net reduction of ~80 lines while improving readability.
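
A rough sketch of what the consolidated launch path could look like; the launchConfig fields, tag key, and AMI handling are assumptions rather than the PR's exact code (same aws-sdk-go-v2 imports as the activation sketch above, plus encoding/base64):

```go
// launchConfig is a hypothetical stand-in for what getLaunchConfig() returns.
type launchConfig struct {
	AMI          string
	InstanceType string
	UserData     string
	WarmPool     bool // pool instances stop after first boot; job instances terminate
}

func buildRunInstancesInput(cfg launchConfig) *ec2.RunInstancesInput {
	input := &ec2.RunInstancesInput{
		ImageId:      aws.String(cfg.AMI),
		InstanceType: types.InstanceType(cfg.InstanceType),
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
		// RunInstances expects the user-data already base64-encoded.
		UserData:                          aws.String(base64.StdEncoding.EncodeToString([]byte(cfg.UserData))),
		InstanceInitiatedShutdownBehavior: types.ShutdownBehaviorTerminate,
	}
	if cfg.WarmPool {
		// Warm pool instances stop after first boot and are tagged for later lookup.
		input.InstanceInitiatedShutdownBehavior = types.ShutdownBehaviorStop
		input.TagSpecifications = []types.TagSpecification{{
			ResourceType: types.ResourceTypeInstance,
			Tags:         []types.Tag{{Key: aws.String("warm-pool"), Value: aws.String("ready")}},
		}}
	}
	return input
}

func launchInstance(ctx context.Context, client *ec2.Client, cfg launchConfig) (string, error) {
	out, err := client.RunInstances(ctx, buildRunInstancesInput(cfg))
	if err != nil {
		return "", err
	}
	return *out.Instances[0].InstanceId, nil
}
```
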
- Add CloudWatch Events rule that triggers every 5 minutes
- Add handleMaintenance() to check and populate all configured instance types
- Refactor handler to dispatch between API Gateway and scheduled events
- Extract getLaunchConfig() helper to reduce duplication

The maintenance function iterates through all configured instance types
and launches instances to reach target pool sizes.
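
A sketch of how the dispatch and maintenance loop might be wired, assuming aws-lambda-go's events package and encoding/json; handleWebhook() and replenishWarmPool() are the PR's own helpers, and the real signatures may differ:

```go
// handler accepts the raw payload so one function can serve both the
// API Gateway webhook and the 5-minute scheduled event.
func handler(ctx context.Context, raw json.RawMessage) (events.APIGatewayProxyResponse, error) {
	// Scheduled events from CloudWatch Events/EventBridge carry a detail-type.
	var sched events.CloudWatchEvent
	if json.Unmarshal(raw, &sched) == nil && sched.DetailType == "Scheduled Event" {
		return events.APIGatewayProxyResponse{StatusCode: 200}, handleMaintenance(ctx)
	}

	// Anything else is treated as a webhook delivered through API Gateway.
	var req events.APIGatewayProxyRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 400}, err
	}
	return handleWebhook(ctx, req)
}

// handleMaintenance tops every configured instance type back up to its target size.
func handleMaintenance(ctx context.Context) error {
	pools, err := parseWarmPoolConfig()
	if err != nil {
		return err
	}
	for instanceType, target := range pools {
		if err := replenishWarmPool(ctx, instanceType, target); err != nil {
			return err
		}
	}
	return nil
}
```
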
Fall back to extracting from /opt/runner-cache/ if /opt/actions-runner
doesn't exist. This supports both old AMIs (with runner cache) and new
AMIs (with pre-extracted runner).

- Return empty map instead of nil from parseWarmPoolConfig()
- Remove redundant nil check in handleMaintenance()
- Consolidate duplicate launch config building in handleWebhook()
  by reusing getLaunchConfig() (~40 lines removed)

Generate JIT config from Lambda via GitHub API instead of passing a PAT
to the instance. This eliminates the 15-30 second config.sh registration
step on the runner.

Changes:
- main.go: Add generateJITConfig() to call GitHub's JIT config API
- main.go: Build labels list and pass JIT config to user-data template
- user-data.sh: Remove get_github_token() and config.sh steps
- user-data.sh: Use ./run.sh --jitconfig instead
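
A sketch of what generateJITConfig() might look like against GitHub's repo-level generate-jitconfig endpoint; the owner/repo parameters, runner group ID, and token plumbing are assumptions (uses net/http, bytes, encoding/json, fmt):

```go
func generateJITConfig(ctx context.Context, token, owner, repo, runnerName string, labels []string) (string, error) {
	payload, err := json.Marshal(map[string]any{
		"name":            runnerName,
		"runner_group_id": 1, // default runner group; an assumption
		"labels":          labels,
	})
	if err != nil {
		return "", err
	}
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runners/generate-jitconfig", owner, repo)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return "", fmt.Errorf("generate-jitconfig: unexpected status %s", resp.Status)
	}

	var out struct {
		EncodedJITConfig string `json:"encoded_jit_config"` // handed to ./run.sh --jitconfig
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.EncodedJITConfig, nil
}
```
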
ModifyInstanceAttribute with BlobAttributeValue automatically handles
base64 encoding, so we shouldn't pre-encode. This was causing user-data
to exceed the 16KB limit.
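
The shape of the fix, sketched with the same aws-sdk-go-v2 types as the activation example above; setUserData is a hypothetical helper name:

```go
// Pass the raw script; the SDK base64-encodes BlobAttributeValue itself,
// so pre-encoding doubles the work and can blow the 16 KB user-data limit.
func setUserData(ctx context.Context, client *ec2.Client, instanceID, script string) error {
	_, err := client.ModifyInstanceAttribute(ctx, &ec2.ModifyInstanceAttributeInput{
		InstanceId: aws.String(instanceID),
		UserData:   &types.BlobAttributeValue{Value: []byte(script)}, // raw, not pre-encoded
	})
	return err
}
```
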
Cloud-init caches user-data from first boot, so when we update user-data
on warm pool activation, the cached script runs instead of the new one.

Fix by storing JIT config in SSM Parameter Store (/github-runner/jit-config/{instance-id})
and having the user-data script fetch it from there. This works because:
1. The script itself doesn't change (no templating needed)
2. The SSM parameter is created fresh for each job

Changes:
- main.go: Add storeJITConfigInSSM(), remove template usage
- user-data.sh: Fetch JIT config from SSM using instance ID
- template.yaml: Add SSM permissions for Lambda and EC2 instance
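
A sketch of storeJITConfigInSSM(), assuming aws-sdk-go-v2's SSM client (ssmtypes below aliases github.com/aws/aws-sdk-go-v2/service/ssm/types):

```go
func storeJITConfigInSSM(ctx context.Context, client *ssm.Client, instanceID, jitConfig string) error {
	_, err := client.PutParameter(ctx, &ssm.PutParameterInput{
		Name:      aws.String("/github-runner/jit-config/" + instanceID),
		Value:     aws.String(jitConfig),
		Type:      ssmtypes.ParameterTypeSecureString, // encrypted at rest
		Overwrite: aws.Bool(true),                     // fresh value for every job
	})
	return err
}
```

On the instance side, user-data.sh can read the parameter back with aws ssm get-parameter --with-decryption, using the instance ID obtained from instance metadata.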