diff --git a/docs/modules/ROOT/pages/adr/0049-strimzi-operator-for-kafka.adoc b/docs/modules/ROOT/pages/adr/0049-strimzi-operator-for-kafka.adoc new file mode 100644 index 00000000..26bb3254 --- /dev/null +++ b/docs/modules/ROOT/pages/adr/0049-strimzi-operator-for-kafka.adoc @@ -0,0 +1,446 @@ = ADR 0049 - Strimzi Operator for Apache Kafka
:adr_author: Simon Hofer
:adr_owner: Schedar
:adr_reviewers: Schedar
:adr_date: 2025-12-05
:adr_upd_date: 2025-12-05
:adr_status: draft
:adr_tags: kafka,service
:page-aliases: explanations/decisions/kafka.adoc

include::partial$adr-meta.adoc[]

[NOTE]
.Summary
====
We use the Strimzi Kafka Operator to provide Apache Kafka as a managed service on Kubernetes, complemented by Apicurio Registry for schema management and AKHQ for operational visibility.
====

== Problem

We need to provide Apache Kafka on Kubernetes as a managed service with the following features:

* Production-ready Kafka cluster management with KRaft mode
* High availability across multiple availability zones
* Schema registry for managing Avro/Protobuf/JSON schemas
* Operational visibility and monitoring tooling
* TLS-encrypted connections and authentication
* Automated backup and disaster recovery
* Topic and user self-service management
* Comprehensive metrics and monitoring with Grafana dashboards
* Regular maintenance and version upgrades
* Support for various sizing options (1 or 3 replicas)

The solution must align with VSHN's "more buy than make" approach, leveraging existing open-source operators rather than building custom solutions.
+
== Evaluated Solutions

For Apache Kafka on Kubernetes, the following production-ready solutions were evaluated:

[cols="1,1,1"]
|===
|Requirements |https://strimzi.io/[Strimzi] |https://github.com/confluentinc/confluent-kubernetes-examples[Confluent Operator]

|KRaft Mode Support |✅ |✅

|Schema Registry Integration |❌ (Apicurio) |✅ (Confluent)

|Topic and User Management |✅ (Native CRDs) |✅

|Metrics Export |✅ |✅

|Multi-AZ Support |✅ (Rack awareness) |✅ (Rack awareness)

|2-AZ Support |❌ (reconfiguration needed) |✅ (automatic observer promotion)

|Automated Rebalancing |✅ (Cruise Control) |✅ (Self-balancing clusters)

|TLS/mTLS Support |✅ |✅

|Open Source |✅ (Apache 2.0) |❌ (Proprietary)

|Active Development |✅ |✅

|Grafana Dashboards |✅ (Community maintained) |✅ (Community maintained)

|Commercial Support Available |✅ (Red Hat, SPOUD) |✅ (Confluent)

|Maturity |✅ CNCF Sandbox, Production-ready since 2018 |✅ Widely used in enterprise

|===

Additional Notes:

* **Strimzi** is a CNCF sandbox project with strong community backing and Red Hat support. It has been production-ready since 2018 and is used by numerous enterprises worldwide. It provides the most comprehensive and mature Kubernetes-native approach, with extensive open-source CRDs covering all Kafka resources.
* **Confluent Operator** requires a license for production use and ties users to Confluent's ecosystem, which conflicts with our preference for open-source flexibility. If customers specifically require Confluent features, we can consider it as a separate offering.

**Not Considered:**

* **https://github.com/adobe/koperator[Koperator]** (Banzai Cloud): Limited recent development activity; the project is no longer actively maintained.
+* **https://github.com/bitnami/charts/tree/main/bitnami/kafka[Bitnami Helm Chart]**: Lacks advanced operational features like topic/user management CRDs and automated rebalancing. More suitable for development environments than production. + + +== Decision + +We will use https://strimzi.io/[Strimzi Kafka Operator] to provide Apache Kafka as a managed service. + +The platform will consist of the following components: + +=== Core Components + +**Kafka Cluster (Strimzi Operator)**:: + +* Manages Kafka brokers and KRaft controllers +* Provides Kubernetes-native resources (CRDs) for Kafka, KafkaTopic, KafkaUser +* Integrates Cruise Control for automated cluster rebalancing +* Supports rolling updates with zero downtime +* Version support follows https://github.com/strimzi/strimzi-kafka-operator/blob/main/KAFKA_VERSION_SUPPORT.md[Strimzi's version support policy] + +**Schema Registry (Apicurio Registry)**:: + +* Manages Avro, Protobuf, and JSON schemas +* Compatible with Confluent Schema Registry API +* Integrated with Kafka for schema storage backend + +**Web UI (AKHQ)**:: + +* Provides operational visibility into Kafka clusters +* Allows browsing topics, consumer groups, and messages +* Facilitates debugging and troubleshooting +* Read-only mode for production environments + +=== Supporting Components + +**Monitoring and Observability**:: + +* Prometheus for metrics collection +* JMX Exporter for Kafka broker metrics +* Grafana dashboards based on Strimzi community templates +* AppCat SLI Exporter for basic availability checks (broker/controller health) +* https://github.com/spoud/kafka-synth-client[Kafka Synth Client] for continuous end-to-end latency monitoring +* AlertManager for capacity and availability alerts + +**Operational Tools**:: +* Strimzi Drain Cleaner for safe node maintenance +* Cruise Control for partition rebalancing +* Entity Operators for Topic and User management + +=== Deployment Architecture + +**Node Pools**:: + +* Operations Node Pool: Strimzi 
Operators, Drain Cleaner
* Kafka Broker Node Pool: Kafka broker pods with high-throughput storage
* KRaft Controller Node Pool: KRaft controllers with low-latency storage

**Storage**::

* Premium SSDs with zone-pinned volumes
* High throughput for broker data
* Low latency NVMe for KRaft metadata
* Expandable disks with `volumeBindingMode: WaitForFirstConsumer`

**High Availability**::

Ideally deployed across 3 availability zones with the following configuration:

* Minimum 3 availability zones for fault tolerance
* 2 brokers per availability zone (6 total for production)
* 3 KRaft controllers (1 per AZ)
* 3 Schema Registry instances (1 per AZ)
* Replication factor: 5, min.insync.replicas: 3

If 3 AZs are not available, we need to choose the best possible alternative (a 2 or 2.5 DC setup, 3 different racks, or at least 3 different nodes) to ensure some level of fault tolerance.

The possibilities heavily depend on the underlying infrastructure capabilities found at customer sites.

The solution must allow a parameter to specify the rack/zone topology, both for pod distribution and for Kafka's internal rack awareness configuration.


**Authentication and Authorization**::

* mTLS for inter-broker communication
* mTLS for internal clients via Strimzi KafkaUser resources
* ACL-based authorization managed through KafkaUser CRDs
* Optional OAuth integration via https://github.com/strimzi/strimzi-kafka-oauth[strimzi-kafka-oauth] for external clients

**Apicurio Registry Auth**::

* Authentication requires Keycloak integration using OpenID Connect
* Authorization managed via Keycloak roles and groups

As the requirement to have Keycloak as identity provider is already in place for other services, this fits well into the existing architecture.
On the other hand, not all users have Keycloak, so we should also consider alternatives.
+
**Alternative Authentication Options**::

* Disable schema registration for clients by default
* Allow all reads on schemas (no authentication on the Apicurio Registry)
* Prohibit writes at the Ingress level for public read-only access
* Provide an extra Ingress for internal access with static authentication, e.g. for CI/CD systems that need to register schemas automatically

=== Rationale

**Why Strimzi?**

1. **Kubernetes-Native Design**: Strimzi is designed from the ground up for Kubernetes, using operators and CRDs extensively. This aligns with our infrastructure-as-code approach and enables declarative management.

2. **Open Source and Flexibility**: Apache License 2.0 ensures no vendor lock-in. We can use any Kafka distribution and aren't tied to proprietary licensing models based on node count or throughput.

3. **Operational Maturity**: Strimzi provides battle-tested operational features:
 * Zero-downtime rolling upgrades
 * Integrated Cruise Control for rebalancing
 * Drain Cleaner for safe node maintenance
 * Comprehensive monitoring integration

4. **Community and Support**: Strong CNCF community backing, Red Hat support, and SPOUD partnership for third-level Kafka expertise.

5. **Feature Completeness**: Native support for all requirements including KRaft mode, schema registry integration (Apicurio), metrics export, and multi-AZ deployments.

6. **Proven Track Record**: Strimzi is used in production by numerous organizations and has demonstrated long-term stability and reliability.
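The declarative, CRD-based management described in the rationale can be sketched as follows. This is a minimal illustration, not a final instance template: the cluster name `my-cluster`, the topic and user names, and the concrete ACLs are all assumptions.

[source,yaml]
----
# Hypothetical example: a topic and a consumer user managed declaratively
# via Strimzi CRDs. All names (my-cluster, orders, order-consumer) are
# illustrative only.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  labels:
    strimzi.io/cluster: my-cluster   # binds the topic to a Kafka cluster
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000          # 7 days
    min.insync.replicas: 2
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: order-consumer
  labels:
    strimzi.io/cluster: my-cluster
spec:
  authentication:
    type: tls                        # the User Operator issues a client certificate
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: orders
        operations:
          - Read
          - Describe
      - resource:
          type: group
          name: order-consumer-group
        operations:
          - Read
----

Because these are ordinary Kubernetes resources, GitOps tooling can reconcile topics and users the same way as any other manifest, which is what makes the self-service model possible.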
+ +**Why Apicurio Registry?** + +* Open-source and vendor-neutral +* API-compatible with Confluent Schema Registry +* Managed via Kubernetes operator +* Multiple storage backend options (Kafka, PostgreSQL) +* Active development and Red Hat support + +**Why AKHQ?** + +* Provides essential operational visibility without vendor lock-in +* Lightweight and easy to deploy +* Read-only mode prevents accidental changes in production +* Complements command-line tools with visual interface + +== Advantages + +* **Kubernetes-Native Operations**: Declarative management through CRDs enables GitOps workflows and automation +* **Zero-Downtime Maintenance**: Rolling updates for brokers and configuration changes +* **Automated Rebalancing**: Cruise Control integration enables intelligent partition distribution +* **Comprehensive Monitoring**: Pre-built Prometheus exporters and Grafana dashboards +* **High Availability**: Multi-AZ support with rack awareness ensures resilience +* **Schema Management**: Integrated schema registry prevents breaking changes +* **Security**: Built-in mTLS support and ACL management +* **Cost Efficiency**: No licensing fees based on cluster size or throughput +* **Expert Support**: Partnership with SPOUD provides third-level Kafka expertise +* **Community Backing**: CNCF project with active development and contributions + +== Disadvantages + +* **Kafka Version Dependency**: Limited to Kafka versions supported by Strimzi operator +* **Operator Complexity**: Additional layer that requires understanding of operator patterns +* **Potential Breaking Changes**: Operator upgrades could introduce breaking changes requiring manual intervention +* **Learning Curve**: Teams need to learn Strimzi-specific CRDs and operational patterns +* **Opinionated Decisions**: Some configurations are opinionated by the operator design + +== Risks and Mitigation + +=== Operator Becomes Unmaintained + +Risk:: Strimzi development stagnates or the project is abandoned + 
+Mitigation:: +* **Option 1**: Fork and maintain internally or through SPOUD partnership +* **Option 2**: Migrate to alternative solution (e.g., Confluent Operator with licensing) +* **Likelihood**: Low - Strong CNCF backing and Red Hat support reduce this risk + +=== Breaking Changes in CRDs + +Risk:: Strimzi upgrades introduce breaking changes in CRDs requiring manual migration + +Mitigation:: +* **Option 1**: Handle migrations transparently in Crossplane compositions where possible +* **Option 2**: Develop automated migration tooling for customer instances +* **Process**: Thoroughly test operator upgrades in staging environments before production rollout + +=== Performance Issues + +Risk:: Default configurations don't meet performance requirements for high-throughput use cases + +Mitigation:: +* Leverage Cruise Control for continuous optimization +* Implement synthetic monitoring for continuous latency tracking +* Regular performance testing in pre-production environments + +=== Capacity Management + +Risk:: Insufficient planning leads to storage or throughput limitations + +Mitigation:: +* Implement capacity alerting with 80% thresholds for disk and network +* Use Cruise Control for proactive rebalancing when scaling +* Regular capacity review process with stakeholders +* Optimize applications for efficient Kafka usage +* Fine-tune Kafka producer/consumer configurations + +== Consequences + +=== Positive + +* Customers get a production-ready, highly available Kafka service +* Operations team has comprehensive monitoring and operational tooling +* Automated maintenance reduces operational burden +* Schema registry prevents data quality issues +* Self-service topic and user provisioning empowers development teams + +=== Negative + +* Initial setup complexity higher than simple Helm chart deployment +* Requires training for operations team on Strimzi-specific concepts +* Dependency on Strimzi release cycle for Kafka version updates +* Additional components 
(Apicurio, AKHQ) increase the overall system complexity + +== Monitoring and SLA + +Following xref:adr/0015-metrics-and-monitoring-of-services.adoc[ADR 0015], we will implement a two-tier monitoring approach: + +=== Basic Availability Monitoring + +* Use the https://github.com/vshn/appcat/tree/master/pkg/sliexporter[AppCat SLI Exporter] to perform basic health checks: +** Verify KRaft controller pods are healthy and reachable +** Verify Kafka broker pods are healthy and reachable +** Verify Schema Registry pods are healthy and reachable +** This provides the foundational "service is running" metrics + +=== End-to-End Latency Monitoring + +* Deploy https://github.com/spoud/kafka-synth-client[Kafka Synth Client] for continuous synthetic monitoring: +** Produces canary messages to a dedicated heartbeat topic at regular intervals +** Consumes the same messages and measures end-to-end latency (time from produce to consume) +** Exports latency metrics (p50, p95, p99) to Prometheus +** Detects silent degradation that basic health checks might miss +** Validates the entire data path: producer → broker → consumer +** Low-volume traffic that doesn't impact production workloads + +=== Additional Monitoring + +* Export JMX metrics from Kafka brokers using Prometheus JMX Exporter +* Deploy pre-configured Grafana dashboards from Strimzi community +* Configure capacity alerts for disk usage, network throughput, and consumer lag +* Route SLO alerts to VSHN operations team +* Route capacity alerts to customer's chosen alerting channels + +=== Service Level Indicator (SLI) + +The service is considered "Up" when: + +* KRaft controller nodes are reachable and healthy in Kubernetes (AppCat SLI Exporter) +* Kafka brokers are reachable and healthy in Kubernetes (AppCat SLI Exporter) +* End-to-end latency p95 is below 5 seconds (Kafka Synth Client) +* Schema registry cluster is reachable and healthy in Kubernetes (AppCat SLI Exporter) + +=== Key Metrics + +* Broker availability and health 
+* Under-replicated partitions +* Offline partitions +* End-to-end message latency (producer → consumer) +* Consumer lag per consumer group +* Disk usage per broker +* Network throughput +* JVM heap usage +* KRaft controller active status + +== Implementation Notes + +=== Operator and CRD Installation + +Following xref:adr/0014-commodore-component-to-deploy-compositions-and-xrds.adoc[ADR 0014], operators and their CRDs are deployed via Project Syn Commodore Components: + +**Strimzi Operator**:: +* Deployed via Commodore component (to be created: `component-strimzi-kafka-operator`) +* Alternatively, use the https://strimzi.io/docs/operators/latest/deploying.html#deploying-cluster-operator-helm-chart-str[Strimzi Helm Chart] directly in Commodore component +* Installs the Strimzi Cluster Operator and all required CRDs (`Kafka`, `KafkaTopic`, `KafkaUser`, `KafkaConnect`, etc.) +* Operator runs in dedicated namespace (e.g., `syn-strimzi-kafka-operator`) +* Manages Kafka clusters across all namespaces (cluster-scoped deployment) +* Configure with appropriate resource limits and RBAC permissions + +**Supporting Components**:: +* Strimzi Drain Cleaner: Deployed via `component-strimzi-kafka-operator` or separate component +* Cruise Control: Integrated into Kafka cluster via Strimzi `Kafka` CR (not a separate operator) +* Kafka Synth Client: Deployed per instance in the instance namespace (not an operator) +* Apicurio Registry: Deployed per instance in the instance namespace (not with an operator) + +The Commodore components handle: + +* Operator deployment and lifecycle management +* CRD installation and version management +* Namespace creation and RBAC setup +* Configuration of operator-wide settings +* Integration with Project Syn configuration management + +=== Initial Instance Deployment + +* Use Crossplane Provider Kubernetes to create Strimzi `Kafka` CRs +* Use Crossplane compositions to abstract Kafka cluster configuration +* Configure proper resource requests and 
limits based on expected load +* Implement topologySpreadConstraints and Kafka Rack awareness for multi-AZ distribution +* Enable Prometheus ServiceMonitors / PodMonitors for all components +* Deploy with KRaft mode (ZooKeeper-less) + +=== Client Configuration + +* Provide starter templates for Java applications (Quarkus, Spring Boot) +* Include sample KafkaUser and KafkaTopic resources that can be used with the cluster +* Document best practices for consumer lag monitoring +* Provide guidance on replication factor and min.insync.replicas configuration + + +=== Backup and Recovery + +Backup and recovery of Kafka require careful consideration because Kafka is a distributed system where durability is primarily provided by replication. The platform objective is to provide a reliable emergency restore capability while encouraging application teams to adopt patterns that reduce reliance on point-in-time restores (idempotent processing, event sourcing, tiered retention, compacted topics, etc.). + +Scope and Responsibility:: +*Platform*: provide PV snapshot capability for Kafka PVCs, configuration for snapshot retention/rotation, a documented emergency restore playbook, and tools for cluster-level restores. The platform will also snapshot Schema Registry storage and configuration where possible. +*Application Teams*: own application-level recovery requirements (single-topic or message-level restores), define RPO/RTO needs, and adopt application patterns that avoid the need for frequent restores. + +Snapshot Frequency & RPO Options:: +* **Daily snapshots (default)** — RPO ≈ 24 hours. Reasonable for most use cases; minimal storage cost. +* **Hourly snapshots (optional)** — RPO ≈ 1 hour. Enable for mission-critical clusters where acceptable. + + +RTO Expectations (high level):: +* **Full cluster restore**: dependent on cluster size and I/O performance — expect several hours for small clusters, longer for larger clusters. 
Restoration includes PV restore time, broker startup, partition leader elections and potential rebalancing.
* **Single-topic / message restores**: may require custom tooling or manual processes; expect non-trivial effort and longer RTOs.


Notes on Single-Topic / Message-Level Recovery::
* This is not natively supported by PV snapshots. Typical approaches include:
  - Restoring an entire broker state and extracting topic data via consumer tooling (time-consuming).
  - Using MirrorMaker/replication clusters to copy topic data before destructive maintenance (proactive strategy).
  - Exporting topic data into an external store for ad-hoc restore (requires per-application design).
* These approaches have trade-offs in cost, complexity, and RTO; platform-level support is limited to emergency workflows and guidance.


Limitations and Caveats::
* PV snapshots are **platform-level emergency** tools — they are suitable for full-cluster rollback but are poor at single-message or point-in-time restores for specific topics.
* Snapshot consistency across multiple broker PVs depends on snapshot provider capabilities (some providers offer volume group snapshots; others do not). Document the guarantees for each CSP.
* Restores may cause data loss up to the snapshot age (RPO) and require manual reconciliation in applications.

Operational Recommendations::
* **Enable PV snapshots** (with sensible retention) for Kafka PVCs by default and document the provider-specific semantics.
* **Test restores regularly** (quarterly) and maintain a runbook with exact steps and expected timings.
* **Expose schema registry backup** — ensure Apicurio/Schema Registry storage is included in snapshots or export schema artifacts periodically.
* **Encourage application patterns** that minimize platform restore needs (idempotent consumers, compacted topics for critical keys, event sourcing patterns, consumer-side checkpoints).
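As a sketch of what a platform-level PV snapshot could look like with a CSI snapshot driver: the snippet below assumes the PVC naming of a Strimzi broker volume and an available `VolumeSnapshotClass`; all names are illustrative assumptions, and the actual PVC names depend on the node pool configuration.

[source,yaml]
----
# Hypothetical example: on-demand CSI snapshot of one broker's data PVC.
# PVC and class names are illustrative; actual names depend on the
# Strimzi node pool configuration and the CSP's CSI driver.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: kafka-broker-0-daily
  namespace: customer-kafka
spec:
  volumeSnapshotClassName: csi-premium-snapclass
  source:
    persistentVolumeClaimName: data-0-my-cluster-broker-0
----

One such snapshot is needed per broker PVC. Since the snapshots are not coordinated across volumes, a restore relies on Kafka's own replication to bring the brokers back to a consistent state, which is why this remains an emergency tool rather than a point-in-time recovery mechanism.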
+ + + +NOTE: If higher frequency or point-in-time recovery is required, an additional application-level solution can be implemented (e.g., MirrorMaker replication, topic export tools, Kafka Connect sink to S3 etc.). Such solutions are outside the platform scope. + + +=== Operator Read-Only / UI Safeguards + +* Deploy AKHQ (or any Kafka UI) in **read-only** mode for production clusters to prevent accidental destructive operations (topic deletion, ACL changes) from the UI. +* Restrict UI access with authentication, network policies and/or ingress restrictions; grant UI access only to troubleshooting groups. + + +== Further Reading + +* https://strimzi.io/docs/operators/latest/overview.html[Strimzi Documentation] +* https://www.apicur.io/registry/docs/[Apicurio Registry Documentation] +* https://akhq.io/[AKHQ Documentation] +* https://products.vshn.ch/appcat/kafka.html[VSHN Kafka Service Description] +* xref:adr/0015-metrics-and-monitoring-of-services.adoc[ADR 0015: Metrics and Monitoring] +* xref:adr/0014-commodore-component-to-deploy-compositions-and-xrds.adoc[ADR 0014: Commodore Component] diff --git a/docs/modules/ROOT/pages/adr/index.adoc b/docs/modules/ROOT/pages/adr/index.adoc index aa39fb1f..435840e3 100644 --- a/docs/modules/ROOT/pages/adr/index.adoc +++ b/docs/modules/ROOT/pages/adr/index.adoc @@ -197,4 +197,8 @@ `database,service` |draft | |2026-01-14 +|xref:adr/0049-strimzi-operator-for-kafka.adoc[] + +`kafka,service` +|draft |2025-12-05 |2025-12-05 |=== diff --git a/docs/modules/ROOT/partials/nav-adrs.adoc b/docs/modules/ROOT/partials/nav-adrs.adoc index cd51872e..d0b6518d 100644 --- a/docs/modules/ROOT/partials/nav-adrs.adoc +++ b/docs/modules/ROOT/partials/nav-adrs.adoc @@ -45,4 +45,5 @@ ** xref:adr/0045-service-orchestration-crossplane-2-0.adoc[] ** xref:adr/0046-secret-management-framework-2-0.adoc[] ** xref:adr/0047-service-maintenance-and-upgrades-framework-2-0.adoc[] -** xref:adr/0048-evaluating-vector-databases-as-appcat-services.adoc[] \ No 
newline at end of file +** xref:adr/0048-evaluating-vector-databases-as-appcat-services.adoc[] +** xref:adr/0049-strimzi-operator-for-kafka.adoc[]