@hskiba commented Jun 6, 2025

Summary

  • Fixed intermittent runner registration failures by implementing retry logic and error handling
  • Added comprehensive CloudWatch logging to help debug registration and runtime issues
  • Created IAM role and instance profile for EC2 instances to enable CloudWatch logging

Problem

The GitHub Actions runners were experiencing intermittent failures where:

  • Runners would deploy but not always register successfully with GitHub
  • Instances would shut down immediately after starting
  • No visibility into what was happening during the registration process
  • Jobs would wait indefinitely for runners that never became available

Solution

1. Enhanced Error Handling and Retry Logic

  • Added retry logic for GitHub API token retrieval (5 attempts with exponential backoff)
  • Added retry logic for runner configuration (3 attempts)
  • Captured and logged the runner's exit code when the runner fails
  • Simplified runner lifecycle management - removed complex monitoring scripts
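
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration in bash, not the actual contents of user-data.sh: the helper name `retry_with_backoff`, the initial delay, and the doubling factor are assumptions, since the PR only states the attempt counts (5 for token retrieval, 3 for runner configuration).

```shell
# Hypothetical helper (not taken from user-data.sh); assumes bash.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after every failed attempt. Returns the exit code of the last attempt.
retry_with_backoff() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1 status=0
  while true; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "attempt ${attempt}/${max_attempts} failed (exit ${status}); giving up" >&2
      return "$status"
    fi
    echo "attempt ${attempt}/${max_attempts} failed (exit ${status}); retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example calls (fetch_registration_token and configure_runner are
# placeholder names for whatever the script actually runs):
#   retry_with_backoff 5 2 fetch_registration_token
#   retry_with_backoff 3 5 configure_runner
```

Returning the last exit code, rather than a generic failure, keeps the original error available for logging.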

2. Comprehensive CloudWatch Logging

  • Created CloudWatch log group /aws/ec2/github-runner
  • Log streams now include instance ID for easier tracking: runner-i-XXXXX-YYYYMMDD-HHMMSS
  • All major steps in the runner lifecycle are logged
  • Error messages and exit codes are captured
  • Last 50 lines of runner output sent to CloudWatch on failure
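
As a rough sketch of the naming and formatting described above, the helpers below build a log stream name in the `runner-<instance-id>-YYYYMMDD-HHMMSS` shape and a JSON-formatted message. The function names and the `level`/`message` field names are assumptions for illustration; the real script may structure its events differently, and sending the event to CloudWatch is left out here.

```shell
# Hypothetical helpers (names are illustrative); assumes bash.

# Build a stream name such as runner-i-0abc123-20250606-065600 (UTC).
log_stream_name() {
  local instance_id=$1
  printf 'runner-%s-%s\n' "$instance_id" "$(date -u +%Y%m%d-%H%M%S)"
}

# Escape backslashes, double quotes, and newlines so the value can be placed
# inside a JSON string. Other control characters are not handled here.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  printf '%s' "$s"
}

# Format one log event as a single-line JSON object.
log_event_json() {
  local level=$1 message=$2
  printf '{"level":"%s","message":"%s"}\n' \
    "$(json_escape "$level")" "$(json_escape "$message")"
}
```

Putting the instance ID in the stream name means a failed instance can be looked up directly in the /aws/ec2/github-runner log group without searching message bodies.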

3. IAM Permissions

  • Created IAM role RunnerInstanceRole with CloudWatch logs permissions
  • Added instance profile to EC2 launch configuration
  • Ensured proper permissions for log group/stream creation and event publishing

4. Simplified Runner Management

  • Removed complex process monitoring that was causing premature shutdowns
  • Runner now runs in foreground and manages its own lifecycle
  • 60-minute timeout remains as safety net for stuck instances
  • Instance shuts down cleanly after job completion
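
One way to run the runner in the foreground with a time limit is sketched below. It assumes GNU coreutils `timeout` is available on the instance (it exits with 124 when the limit is hit); the wrapper name `run_with_timeout` and the log file path are illustrative and may not match user-data.sh.

```shell
# Hypothetical wrapper (not taken from user-data.sh); assumes bash and
# GNU coreutils `timeout`.
# Usage: run_with_timeout LIMIT LOG_FILE COMMAND [ARGS...]
# Runs COMMAND in the foreground, writing its output to LOG_FILE, and
# returns its exit code (124 if the time limit was reached).
run_with_timeout() {
  local limit=$1 log_file=$2
  shift 2
  local status=0
  timeout "$limit" "$@" >"$log_file" 2>&1 || status=$?
  if [ "$status" -eq 124 ]; then
    echo "runner exceeded ${limit}; terminated" >&2
  elif [ "$status" -ne 0 ]; then
    echo "runner failed with exit code ${status}; last 50 lines of output:" >&2
    tail -n 50 "$log_file" >&2
  fi
  return "$status"
}

# Example call (the log path is a placeholder):
#   run_with_timeout 60m /var/log/runner-output.log ./run.sh
#   shutdown -h now
```

Because the wrapper blocks until the runner exits or the limit expires, the shutdown step runs on every path and no separate monitoring process is needed.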

Changes Made

Infrastructure (template.yaml)

  • Added RunnerInstanceRole IAM role with CloudWatch permissions
  • Added RunnerInstanceProfile for EC2 instances
  • Added INSTANCE_PROFILE_ARN environment variable to Lambda
  • Updated Lambda to use instance profile when launching EC2s

Lambda Function (main.go)

  • Added instance profile ARN validation
  • Updated RunInstances call to include IAM instance profile

User Data Script (user-data.sh)

  • Complete rewrite with proper error handling
  • Added CloudWatch logging function with JSON formatting
  • Added retry logic for API calls
  • Simplified runner lifecycle management
  • Added instance ID to log stream names
  • Proper error capture and logging

Testing Files

  • Added event.json for local testing with sam local invoke
  • Added event-custom-instance.json for testing custom instance types
  • Added env.json template for local environment variables

Testing

  1. Deploy the changes:

    sam build
    sam deploy --config-env <environment>

  2. Monitor CloudWatch logs in the /aws/ec2/github-runner log group

  3. Test with a GitHub workflow using the ephemeral label:

    runs-on: [self-hosted, ephemeral, X64, Linux]

Rollback Plan

If issues arise, roll back by reverting this PR. The changes are backwards compatible and don't affect existing runner functionality.

This commit addresses intermittent runner registration failures by adding:
- Comprehensive CloudWatch logging for debugging
- Retry logic for GitHub API calls and runner configuration
- IAM role for EC2 instances with CloudWatch permissions
- Proper error handling and exit code logging
- Instance ID in log stream names for easier tracking

The simplified approach removes complex monitoring scripts and lets the runner process manage its own lifecycle, with a 60-minute timeout as a safety net.

@hskiba merged commit b466591 into main Jun 6, 2025
1 check passed
@hskiba deleted the fix/runner-registration-reliability branch June 6, 2025 06:56