A Kubernetes operator that automatically restarts failed Flux HelmRelease resources when they encounter timeout errors. This operator monitors HelmRelease objects and triggers reconciliation by adding the fluxcd.io/reconcileAt annotation when specific failure conditions are detected.
The HelmReboot Operator is designed to solve a common issue in GitOps workflows where Flux HelmRelease resources fail due to temporary network issues, registry timeouts, or other transient errors. Instead of requiring manual intervention, the operator detects these failures and triggers a retry by adding a reconciliation annotation.
- Automatic Recovery: Detects failed HelmRelease resources and triggers automatic retries
- Smart Detection: Only restarts releases that failed due to specific timeout errors
- Monitoring Ready: Includes Prometheus metrics and comprehensive logging
- Lightweight: Minimal resource footprint with efficient reconciliation loops
- Secure: Follows Kubernetes RBAC best practices with minimal required permissions
- Well Tested: Comprehensive unit and end-to-end test coverage
The operator continuously monitors all HelmRelease resources in the cluster and:
- Watches for HelmRelease objects with failed conditions
- Detects specific error patterns (e.g., "context deadline exceeded")
- Triggers automatic retry by adding the `fluxcd.io/reconcileAt` annotation
- Logs all restart actions for audit and debugging purposes
Currently, the operator handles:
- `context deadline exceeded` - Network timeouts during chart operations
- Additional patterns can be easily configured in the controller logic
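
The pattern check described above could look roughly like the following sketch. The names `recoverablePatterns` and `isRecoverable` are hypothetical, and the case-insensitive substring match is an assumption; the controller's actual matching logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// recoverablePatterns lists failure-message fragments treated as transient.
// Entries must be lowercase because messages are lowercased before matching.
// Only the first entry comes from this README; extend the slice to add more.
var recoverablePatterns = []string{
	"context deadline exceeded",
}

// isRecoverable reports whether a failure message contains any known
// recoverable pattern. (Hypothetical helper; matching is case-insensitive.)
func isRecoverable(message string) bool {
	msg := strings.ToLower(message)
	for _, p := range recoverablePatterns {
		if strings.Contains(msg, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isRecoverable("Helm install failed: context deadline exceeded"))
	fmt.Println(isRecoverable("chart not found"))
}
```

Keeping the patterns in a single slice means a new transient error can be supported by adding one string, without changing the matching code.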
- Kubernetes cluster (v1.20+)
- Flux v2 installed and running
- `kubectl` configured to access your cluster
- Install using kubectl:

      kubectl apply -f https://raw.githubusercontent.com/sfotiadis/helmreboot-operator/main/config/default/kustomization.yaml

- Or build and deploy from source:

      git clone https://github.com/sfotiadis/helmreboot-operator.git
      cd helmreboot-operator
      make deploy

- Verify installation:

      kubectl get pods -n helmreboot-operator-system

    helm repo add helmreboot-operator https://sfotiadis.github.io/helmreboot-operator
    helm install helmreboot-operator helmreboot-operator/helmreboot-operator

| Variable | Description | Default |
|---|---|---|
| `METRICS_BIND_ADDRESS` | Address for metrics server | `:8080` |
| `HEALTH_PROBE_BIND_ADDRESS` | Address for health probes | `:8081` |
| `LEADER_ELECT` | Enable leader election | `false` |
The operator requires the following permissions:
- `get`, `list`, `watch` on HelmRelease resources
- `patch`, `update` on HelmRelease resources (for adding annotations)
The operator exposes metrics on the /metrics endpoint:
- `controller_runtime_reconcile_total` - Total number of reconciliations
- `controller_runtime_reconcile_errors_total` - Total number of reconciliation errors
- `controller_runtime_reconcile_time_seconds` - Time spent in reconciliation
Health endpoints are available:
- `GET /healthz` - Liveness probe
- `GET /readyz` - Readiness probe
- Go 1.21+
- Docker
- kubectl
- Kubebuilder 3.0+
- Clone the repository:

      git clone https://github.com/sfotiadis/helmreboot-operator.git
      cd helmreboot-operator

- Install dependencies:

      go mod download

- Run tests:

      make test

- Run locally against your cluster:

      make install run

    # Build the binary
    make build
    # Build the Docker image
    make docker-build
    # Build and push Docker image
    make docker-build-push
    # Run tests with coverage
    make test
    # Run linting
    make lint

The project includes comprehensive testing:

    # Unit tests
    make test
    # End-to-end tests
    make test-e2e
    # Integration tests with coverage
    make test-integration

    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │   HelmRelease   │    │   HelmReboot    │    │      Flux       │
    │    (Failed)     │───▶│    Operator     │───▶│   Controller    │
    └─────────────────┘    └─────────────────┘    └─────────────────┘
                                    │
                                    ▼
                           ┌─────────────────┐
                           │  Add reconcile  │
                           │   annotation    │
                           └─────────────────┘

- Watch Phase: Monitor all HelmRelease resources for status changes
- Analysis Phase: Check if the failure matches known recoverable patterns
- Action Phase: Add `fluxcd.io/reconcileAt` annotation to trigger Flux retry
- Monitoring Phase: Log actions and update metrics
- Helm Chart for easy installation
- Support for additional error patterns
- Configurable retry delays and limits
- Dashboard for monitoring restart actions
- Integration with popular monitoring systems
- Multi-cluster support
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Flux CD for the excellent GitOps toolkit
- Kubebuilder for the operator framework
- Controller Runtime for the underlying controller libraries