HelmReboot Operator

A Kubernetes operator that automatically restarts failed Flux HelmRelease resources when they encounter timeout errors. This operator monitors HelmRelease objects and triggers reconciliation by adding the fluxcd.io/reconcileAt annotation when specific failure conditions are detected.

Overview

The HelmReboot Operator addresses a common issue in GitOps workflows: Flux HelmRelease resources that fail because of temporary network issues, registry timeouts, or other transient errors. Instead of requiring manual intervention, the operator detects these failures and triggers a retry by adding a reconciliation annotation.

Key Features

Automatic Recovery: Detects failed HelmRelease resources and triggers automatic retries
Smart Detection: Only restarts releases that failed due to specific timeout errors
Monitoring Ready: Includes Prometheus metrics and comprehensive logging
Lightweight: Minimal resource footprint with efficient reconciliation loops
Secure: Follows Kubernetes RBAC best practices with minimal required permissions
Well Tested: Comprehensive unit and end-to-end test coverage

How It Works

The operator continuously monitors all HelmRelease resources in the cluster and:

Watches for HelmRelease objects with failed conditions
Detects specific error patterns (e.g., "context deadline exceeded")
Triggers automatic retry by adding the fluxcd.io/reconcileAt annotation
Logs all restart actions for audit and debugging purposes
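
As a minimal sketch of the retry step above, the annotation can be expressed as a JSON merge patch against the HelmRelease. The annotation key is the one named in this README; the function name `buildReconcilePatch` and the RFC 3339 timestamp format are assumptions made for illustration, not the operator's confirmed implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// reconcileAnnotation is the annotation key named in this README.
const reconcileAnnotation = "fluxcd.io/reconcileAt"

// buildReconcilePatch returns a JSON merge patch that sets the reconcile
// annotation to the given timestamp. The helper is hypothetical.
func buildReconcilePatch(timestamp string) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]string{
				reconcileAnnotation: timestamp,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Assumption: the annotation value is an RFC 3339 timestamp in UTC.
	now := time.Now().UTC().Format(time.RFC3339)
	patch, err := buildReconcilePatch(now)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

A merge patch touches only the annotation, so it would not overwrite other fields that Flux or a user changed in the meantime.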

Supported Error Patterns

Currently, the operator handles:

context deadline exceeded - Network timeouts during chart operations
Additional patterns can be added by extending the controller logic
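
The pattern check can be sketched as a substring match against a list of known messages. The names `recoverablePatterns` and `isRecoverable` are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// recoverablePatterns lists lowercase message fragments treated as transient.
// Only the first entry comes from this README; new ones would be appended here.
var recoverablePatterns = []string{
	"context deadline exceeded",
}

// isRecoverable reports whether a failure message contains a known pattern.
func isRecoverable(message string) bool {
	msg := strings.ToLower(message)
	for _, pattern := range recoverablePatterns {
		if strings.Contains(msg, pattern) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isRecoverable("Helm install failed: context deadline exceeded"))
	fmt.Println(isRecoverable("chart not found in repository"))
}
```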

Installation

Prerequisites

Kubernetes cluster (v1.20+)
Flux v2 installed and running
kubectl configured to access your cluster

Quick Start

Install using kubectl:

kubectl apply -k "github.com/sfotiadis/helmreboot-operator/config/default?ref=main"

Or build and deploy from source:

git clone https://github.com/sfotiadis/helmreboot-operator.git
cd helmreboot-operator
make deploy

Verify installation:

kubectl get pods -n helmreboot-operator-system

Helm Installation (Coming Soon)

helm repo add helmreboot-operator https://sfotiadis.github.io/helmreboot-operator
helm install helmreboot-operator helmreboot-operator/helmreboot-operator

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `METRICS_BIND_ADDRESS` | Address for metrics server | `:8080` |
| `HEALTH_PROBE_BIND_ADDRESS` | Address for health probes | `:8081` |
| `LEADER_ELECT` | Enable leader election | `false` |
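
To illustrate how these variables and defaults fit together, here is a small sketch that reads them into a struct. The `Config` type and the `loadConfig` helper are hypothetical, and falling back to the default on an unparsable `LEADER_ELECT` value is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config mirrors the documented environment variables (hypothetical type).
type Config struct {
	MetricsBindAddress     string
	HealthProbeBindAddress string
	LeaderElect            bool
}

// loadConfig reads settings through lookup and applies the documented
// defaults when a variable is unset or empty.
func loadConfig(lookup func(string) (string, bool)) Config {
	get := func(key, fallback string) string {
		if v, ok := lookup(key); ok && v != "" {
			return v
		}
		return fallback
	}
	leader, err := strconv.ParseBool(get("LEADER_ELECT", "false"))
	if err != nil {
		leader = false // assumption: bad input falls back to the default
	}
	return Config{
		MetricsBindAddress:     get("METRICS_BIND_ADDRESS", ":8080"),
		HealthProbeBindAddress: get("HEALTH_PROBE_BIND_ADDRESS", ":8081"),
		LeaderElect:            leader,
	}
}

func main() {
	fmt.Printf("%+v\n", loadConfig(os.LookupEnv))
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the defaults easy to check without touching the process environment.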

RBAC Permissions

The operator requires the following permissions:

get, list, watch on HelmRelease resources
patch, update on HelmRelease resources (for adding annotations)
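
In a Kubebuilder project, permissions like these are usually declared as RBAC markers next to the reconciler, and controller-gen generates the ClusterRole from them. The sketch below is an assumption about how that could look here: the `HelmReleaseReconciler` type and its stub method are hypothetical, while `helm.toolkit.fluxcd.io` is the API group of Flux HelmRelease resources.

```go
package main

import "fmt"

// HelmReleaseReconciler is a hypothetical stand-in for the operator's
// reconciler type.
type HelmReleaseReconciler struct{}

// +kubebuilder:rbac:groups=helm.toolkit.fluxcd.io,resources=helmreleases,verbs=get;list;watch;update;patch

// Reconcile is a stub; a real reconciler would take a context and a request.
func (r *HelmReleaseReconciler) Reconcile() error {
	return nil
}

func main() {
	r := &HelmReleaseReconciler{}
	fmt.Println("reconcile error:", r.Reconcile())
}
```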

Monitoring

Prometheus Metrics

The operator exposes metrics on the /metrics endpoint:

controller_runtime_reconcile_total - Total number of reconciliations
controller_runtime_reconcile_errors_total - Total number of reconciliation errors
controller_runtime_reconcile_time_seconds - Time spent in reconciliation
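
As a rough illustration of reading these metrics, the sketch below sums every sample of one counter from Prometheus text output. The sample payload, the `controller="helmrelease"` label value, and the `sumMetric` helper are made up for the example; a real setup would scrape `/metrics` with Prometheus instead.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleMetrics is invented example output, not captured from the operator.
const sampleMetrics = `# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="helmrelease",result="success"} 12
controller_runtime_reconcile_total{controller="helmrelease",result="error"} 3
controller_runtime_reconcile_errors_total{controller="helmrelease"} 3
`

// sumMetric adds up all samples of the named metric, ignoring labels.
func sumMetric(exposition, name string) (float64, error) {
	var total float64
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !strings.HasPrefix(line, name) {
			continue // comments, blank lines, and other metrics
		}
		rest := line[len(name):]
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue // a longer metric name that only shares the prefix
		}
		// The value is the first field after the closing label brace, if any.
		fields := strings.Fields(rest[strings.LastIndex(rest, "}")+1:])
		if len(fields) == 0 {
			return 0, fmt.Errorf("no value in sample %q", line)
		}
		value, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("bad value in sample %q: %w", line, err)
		}
		total += value
	}
	return total, scanner.Err()
}

func main() {
	total, err := sumMetric(sampleMetrics, "controller_runtime_reconcile_total")
	if err != nil {
		panic(err)
	}
	fmt.Println("reconciliations:", total)
}
```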

Health Checks

Health endpoints are available:

GET /healthz - Liveness probe
GET /readyz - Readiness probe
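
The behavior of the two probes can be sketched with the standard library alone. This is a stand-in for illustration, not the operator's actual wiring: `newHealthMux`, `probe`, and the readiness callback are hypothetical names, and the 503 response for an unready process is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newHealthMux serves /healthz (always OK while the process runs) and
// /readyz (OK only when ready reports true).
func newHealthMux(ready func() bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if !ready() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe sends an in-memory GET request and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newHealthMux(func() bool { return true })
	fmt.Println("/healthz:", probe(mux, "/healthz"))
	fmt.Println("/readyz:", probe(mux, "/readyz"))
}
```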

Development

Prerequisites

Go 1.21+
Docker
kubectl
Kubebuilder 3.0+

Local Development

Clone the repository:

git clone https://github.com/sfotiadis/helmreboot-operator.git
cd helmreboot-operator

Install dependencies:
```
go mod download
```
Run tests:
```
make test
```
Run locally against your cluster:
```
make install run
```

Building

# Build the binary
make build

# Build the Docker image
make docker-build

# Build and push Docker image
make docker-build-push

# Run tests with coverage
make test

# Run linting
make lint

Testing

The project includes comprehensive testing:

# Unit tests
make test

# End-to-end tests
make test-e2e

# Integration tests with coverage
make test-integration

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  HelmRelease    │    │  HelmReboot     │    │  Flux           │
│  (Failed)       │───▶│  Operator       │───▶│  Controller     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │
                              ▼
                       ┌─────────────────┐
                       │  Add reconcile  │
                       │  annotation     │
                       └─────────────────┘

Controller Logic

Watch Phase: Monitor all HelmRelease resources for status changes
Analysis Phase: Check if the failure matches known recoverable patterns
Action Phase: Add fluxcd.io/reconcileAt annotation to trigger Flux retry
Monitoring Phase: Log actions and update metrics
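
The phases above can be sketched as one function over simplified types. Everything here is illustrative: `HelmRelease` and `Condition` are reduced stand-ins for the Flux API types, `reconcile` is a hypothetical name, and treating a `Ready` condition with status `False` as the failure signal is an assumption about how the controller reads status.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a reduced stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string
	Message string
}

// HelmRelease is a reduced stand-in for the Flux HelmRelease object.
type HelmRelease struct {
	Name        string
	Annotations map[string]string
	Conditions  []Condition
}

// reconcile runs the analysis, action, and monitoring phases for one release.
// It returns whether a retry was requested and a log line describing why.
func reconcile(hr *HelmRelease, timestamp string) (bool, string) {
	for _, c := range hr.Conditions {
		// Analysis: only a failed Ready condition with a known message qualifies.
		if c.Type != "Ready" || c.Status != "False" {
			continue
		}
		if !strings.Contains(c.Message, "context deadline exceeded") {
			continue
		}
		// Action: set the annotation named in this README.
		if hr.Annotations == nil {
			hr.Annotations = map[string]string{}
		}
		hr.Annotations["fluxcd.io/reconcileAt"] = timestamp
		// Monitoring: report what was done so the caller can log it.
		return true, fmt.Sprintf("requested retry for %s: %s", hr.Name, c.Message)
	}
	return false, ""
}

func main() {
	hr := &HelmRelease{
		Name: "demo",
		Conditions: []Condition{
			{Type: "Ready", Status: "False", Message: "install failed: context deadline exceeded"},
		},
	}
	retried, logLine := reconcile(hr, "2024-01-02T03:04:05Z")
	fmt.Println(retried, logLine)
}
```

The watch phase is left out because it depends on the controller framework; in this sketch the caller is assumed to invoke `reconcile` whenever a release changes.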

Roadmap

Helm Chart for easy installation
Support for additional error patterns
Configurable retry delays and limits
Dashboard for monitoring restart actions
Integration with popular monitoring systems
Multi-cluster support

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

Flux CD for the excellent GitOps toolkit
Kubebuilder for the operator framework
Controller Runtime for the underlying controller libraries
