Enterprise-grade ETL pipeline orchestration for Kubernetes with seamless deduplication and joins
🚀 Quick Start • 📖 Documentation • 🏗️ Architecture • 🤝 Contributing
The GlassFlow ETL Kubernetes Operator is a production-ready Kubernetes operator that enables scalable, cloud-native data pipeline deployments. Built as a companion to the GlassFlow ClickHouse ETL project, it provides enterprise-grade data processing capabilities with advanced features like deduplication, temporal joins, and seamless pause/resume functionality.
- 🔄 Pipeline Lifecycle Management - Create, pause, resume, and terminate data pipelines
- 🎯 Advanced Deduplication - Dedicated dedup component with persistent storage and configurable time windows
- 🔗 Stream Joins - Seamless joining of multiple data streams
- ⚡ Kubernetes Native - Full CRD-based pipeline management
- 🛡️ Production Ready - Enterprise-grade reliability and monitoring
- 📊 Scalable Ingestor - Efficiently reads from multiple Kafka partitions with horizontal scaling
- 💾 Stateful Deduplication - StatefulSet-based dedup components with persistent BadgerDB storage
- 🔧 Helm Charts - Easy deployment and configuration management
```mermaid
graph LR
KAFKA[Kafka Cluster]
subgraph "Kubernetes Cluster"
subgraph "GlassFlow ETL"
subgraph "Operator"
OP[Operator Controller]
CRD[Pipeline CRD]
end
subgraph "Data Pipeline"
ING[Ingestor Pods]
DEDUP[Dedup StatefulSets]
JOIN[Join Pod]
SINK[Sink Pod]
end
subgraph NATS_JETSTREAM["NATS JetStream"]
NATS[NATS]
DLQ[DLQ]
end
end
end
CH[ClickHouse]
subgraph "External"
API[GlassFlow API]
UI[Web UI]
end
API --> CRD
CRD --> OP
OP --> ING
OP --> DEDUP
OP --> JOIN
OP --> SINK
ING <--> NATS_JETSTREAM
DEDUP <--> NATS_JETSTREAM
JOIN <--> NATS_JETSTREAM
SINK <--> NATS_JETSTREAM
KAFKA --> ING
SINK --> CH
UI --> API
```
- Kubernetes 1.19+ cluster
- Helm 3.2.0+
- kubectl configured for your cluster
- Kafka (optional - can use external setup for development)
- ClickHouse (optional - can use external setup for development)
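A quick way to confirm the tooling and cluster meet these requirements before installing anything (standard `kubectl`/`helm` commands, nothing GlassFlow-specific):

```bash
# Check client/server versions against the minimums above
kubectl version
helm version

# Make sure kubectl points at the intended cluster
kubectl cluster-info
kubectl get nodes
```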
Deploy using the complete GlassFlow ETL stack from the GlassFlow Charts repository:
```bash
# Add GlassFlow Helm repository
helm repo add glassflow https://glassflow.github.io/charts
helm repo update
# Install complete GlassFlow ETL stack
helm install glassflow-etl glassflow/glassflow-etl
```

Deploy just the operator as a dependency:

```bash
# Install operator chart
helm install glassflow-operator glassflow/glassflow-operator
```

The operator includes automatic cleanup functionality that ensures all pipelines are immediately terminated when uninstalling:

```bash
helm uninstall glassflow-operator
```

This will:
- ✅ Terminate all existing pipelines ungracefully
- ✅ Clean up all resources (namespaces, deployments, NATS streams)
- ✅ Remove Pipeline CRD definitions
- ✅ Remove the operator deployment
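To verify that the cleanup completed, check that the CRD and pipeline resources are gone. A minimal sketch, assuming the CRD is registered under the plural name `pipelines.etl.glassflow.io` (as implied by the example CRD below) and using a simple name-based heuristic for leftover namespaces:

```bash
# The Pipeline CRD should no longer be registered
kubectl get crd pipelines.etl.glassflow.io

# Heuristic check: no leftover GlassFlow namespaces
kubectl get namespaces | grep -i glassflow
```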
```bash
# Clone the repository
git clone https://github.com/glassflow/glassflow-etl-k8s-operator.git
cd glassflow-etl-k8s-operator
# Install CRDs
make install
# Deploy operator
make deploy IMG=ghcr.io/glassflow/glassflow-etl-k8s-operator:latest
```

Create pipelines using the GlassFlow ClickHouse ETL backend API. The operator will automatically create the corresponding Pipeline CRDs. Here's an example of what the generated CRD will look like:

```yaml
apiVersion: etl.glassflow.io/v1alpha1
kind: Pipeline
metadata:
name: user-events-pipeline
spec:
pipeline_id: "user-events-v1"
config: "pipeline-config"
dlq: "dead-letter-queue"
sources:
type: kafka
topics:
- topic_name: "user-events"
stream: "users"
dedup_window: 60000000000 # 1 minute in nanoseconds
replicas: 2
deduplication:
enabled: true
stream: "users-deduped" # NATS stream after deduplication
storage_size: "10Gi" # Persistent storage for BadgerDB
storage_class: "standard" # Optional: storage class name
join:
type: "temporal"
stream: "joined-users"
enabled: true
replicas: 1
sink:
type: "clickhouse"
    replicas: 1
```

| Feature | Status | Description |
|---|---|---|
| Pipeline Creation | ✅ | Deploy new ETL pipelines via CRD |
| Pipeline Termination | ✅ | Graceful shutdown and cleanup |
| Pipeline Pausing | ✅ | Temporarily halt data processing |
| Pipeline Resuming | ✅ | Resume paused pipelines |
| Deduplication Component | ✅ | StatefulSet-based dedup with persistent storage and configurable time windows |
| Stream Joins | ✅ | Multi-stream data joining |
| Auto-scaling | ✅ | Horizontal pod autoscaling / ingestor replicas support |
| Monitoring | ✅ | Prometheus metrics integration |
| Helm Uninstall Cleanup | ✅ | Automatic pipeline termination and CRD cleanup on uninstall |
The operator manages several components for each pipeline:
- Ingestor: Reads data from Kafka topics and publishes to NATS streams. Supports horizontal scaling with multiple replicas.
- Dedup (Optional): StatefulSet-based component that performs deduplication using BadgerDB for persistent storage. Created per ingestor stream when deduplication is enabled. Configurable storage size and storage class.
- Join (Optional): Joins multiple data streams using temporal join logic. Supports configurable buffer TTLs.
- Sink: Writes processed data to ClickHouse. Supports multiple replicas for high availability.
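To see these components for a live pipeline, list the resources the operator created. A sketch only: the namespace placeholder below is an assumption, since the operator decides how pipeline workloads are namespaced and labeled:

```bash
# Pipeline custom resources across the cluster
kubectl get pipelines.etl.glassflow.io -A

# Ingestor/join/sink workloads and dedup StatefulSets
# (replace <pipeline-namespace> with the namespace created for your pipeline)
kubectl get deployments,statefulsets,pods -n <pipeline-namespace>

# Persistent volume claims backing BadgerDB dedup state
kubectl get pvc -n <pipeline-namespace>
```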
The operator manages pipeline lifecycle through a comprehensive state machine that ensures reliable and predictable pipeline operations.
| Status | Type | Description |
|---|---|---|
| Created | Core | Pipeline created and ready to start |
| Running | Core | Pipeline is actively processing data |
| Stopped | Core | Pipeline has been stopped |
| Resuming | Transition | Pipeline is being resumed (temporary) |
| Stopping | Transition | Pipeline is being stopped (temporary) |
| Terminating | Transition | Pipeline is being terminated (temporary) |
```mermaid
stateDiagram
direction LR
[*] --> Created
Created --> Running
Running --> Stopping
Stopping --> Stopped
Stopped --> Resuming
Terminating --> Stopped
Resuming --> Running
Any --> Terminating
```
- Start: Created → Running
- Stop: Running → Stopping → Stopped
- Resume: Stopped → Resuming → Running
- Edit: Stopped → Resuming → Running
- Terminate: Any → Terminating → Stopped
- Delete: Stopped → [deleted]
- Edit Operation: Requires pipeline to be in Stopped state, then follows same path as Resume
- Transition States: Temporary states during async operations
- Terminate: Can interrupt any operation and force pipeline to Stopped state
- Failed State: Excluded from diagrams (handled separately in error scenarios)
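These states are reflected on the Pipeline resource itself, so transitions can be observed directly with kubectl. A sketch, assuming the example pipeline above and that the current phase is exposed under `.status` (exact field names depend on the CRD version):

```bash
# Watch the pipeline object as it moves through Stopping -> Stopped, etc.
kubectl get pipeline user-events-pipeline -o yaml --watch

# Or dump the status block once
kubectl get pipeline user-events-pipeline -o jsonpath='{.status}'
```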
- Go 1.23+
- Docker 17.03+
- kubectl v1.11.3+
- Kind (for local testing)
- NATS (for messaging)
- Clone and set up:

  ```bash
  git clone https://github.com/glassflow/glassflow-etl-k8s-operator.git
  cd glassflow-etl-k8s-operator
  make help  # See all available targets
  ```

- Install dependencies:

  ```bash
  # Install development tools
  make controller-gen
  make kustomize
  make golangci-lint
  ```

- Start local infrastructure:

  ```bash
  # Start NATS with JetStream (must run inside the cluster)
  helm repo add nats https://nats-io.github.io/k8s/helm/charts/
  helm install nats nats/nats --set nats.jetstream.enabled=true

  # Start Kafka (using Helm)
  helm repo add bitnami https://charts.bitnami.com/bitnami
  helm install kafka bitnami/kafka

  # Start ClickHouse (using Helm)
  helm install clickhouse bitnami/clickhouse

  # Or use external Kafka/ClickHouse for development
  ```

- Run the operator:

  ```bash
  # Run locally (requires NATS running inside the cluster)
  make run
  ```
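Because `make run` starts the controller on your workstation while NATS runs inside the cluster, you will likely need to expose the NATS client port locally first. A sketch, assuming the default `nats` service name and client port from the chart above:

```bash
# Forward the in-cluster NATS client port to localhost:4222
kubectl port-forward svc/nats 4222:4222
```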
This project was built using Kubebuilder v4 and follows Kubernetes operator best practices:

```
├── api/v1alpha1/ # CRD definitions
├── internal/controller/ # Operator controller logic
├── internal/nats/ # NATS client integration
├── charts/ # Helm charts
├── config/ # Kustomize configurations
└── test/ # Unit and e2e tests
```

- Kubebuilder - Operator framework and scaffolding
- Kustomize - Kubernetes configuration management
- Helmify - Automatic Helm chart generation
- GolangCI-Lint - Code quality and linting
```bash
# Run e2e tests (requires Kind cluster) - Primary testing method
make test-e2e
# Run unit tests (coverage being improved)
make test
# Run linter
make lint
```

| Chart | Purpose | Components | Use Case |
|---|---|---|---|
| glassflow-etl | Complete ETL Platform | UI, API, Operator, NATS | Full-featured deployment |
| glassflow-operator | Operator Only | Operator, CRDs | Dependency for custom setups |
The glassflow-etl chart includes the complete platform with web UI, backend API, NATS, and the operator as dependencies. The glassflow-operator chart is designed as a dependency for the main chart or custom deployments.
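Before installing either chart, the defaults can be inspected and overridden with standard Helm commands (`my-values.yaml` below is just a placeholder for your own overrides):

```bash
# Show the configurable values of each chart
helm show values glassflow/glassflow-etl
helm show values glassflow/glassflow-operator

# Install with custom overrides
helm install glassflow-etl glassflow/glassflow-etl -f my-values.yaml
```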
- GlassFlow ClickHouse ETL - Core ETL engine and API
- GlassFlow Charts - Helm charts repository
- GlassFlow Documentation - Complete documentation
- Demo Video: Coming soon
- Live Demo: demo.glassflow.dev
- Documentation: docs.glassflow.dev
We welcome contributions!
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `make test`
- Run linter: `make lint`
- Submit a pull request
This project is licensed under the Apache License 2.0 - see the clickhouse-etl LICENSE file for details.
- Slack Community: GlassFlow Hub
- Documentation: docs.glassflow.dev
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: help@glassflow.dev
Built by GlassFlow Team
