@spike83 spike83 commented Dec 5, 2025

Summary

Checklist

  • Try to isolate changes into separate PRs (to build a better changelog).
  • Categorize the PR by setting a good title and adding one of the labels:
    change, decision, requirement/quality, requirement/functional, dependency
    as they show up in the changelog
  • Link this PR to related issues if applicable.

@Kidswiss Kidswiss requested review from a team, Kidswiss, TheBigLee, mdnix, mikeshootzz and zugao and removed request for a team December 5, 2025 13:50
Member

@TheBigLee TheBigLee left a comment

A few small nitpicks, other than that LGTM

Comment on lines 391 to 397
=== Backup and Recovery

Backup and especially restoration of Kafka clusters is complex due to the distributed nature of Kafka. However, we implement a minimal backup strategy.

* Regular snapshots of Kafka data volumes
* Snapshots include backups of the Schema Registry topics
* Documentation of the backup and restore procedures
Member

Since backup and restoration of a Kafka cluster is very complex, this section should probably be a bit more detailed.
Specifically:

  • How regularly do we take snapshots? Hourly? Daily?
  • What are the RPO and RTO expectations?
  • What would a restore look like? (It might make sense to give a high-level restore overview and the pros and cons of various backup/restore approaches.)

Author

I agree that we need to work out the details. The question is whether we can already do that here or need to postpone it a bit.

What is typically already available for snapshotting volumes in those installations? How regularly are snapshots usually taken?

I was thinking of at least "turning on" PV snapshots for the Kafka PVs. With this, a full cluster restore could be achieved, with some data loss at the end of course. This solution is not well suited to single-topic or single-message restores, but in an emergency these could be achieved as well with some manual effort.
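
For illustration only, here is a minimal Python sketch of what "turning on" PV snapshots could amount to: one Kubernetes `VolumeSnapshot` object per Kafka data PVC. The namespace, PVC names, snapshot class name, and name suffix are hypothetical placeholders, and scheduling and retention are left out entirely.

```python
import json


def volume_snapshot_manifest(pvc_name, namespace, snapshot_class, suffix):
    """Build a VolumeSnapshot manifest (snapshot.storage.k8s.io/v1) for one PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {
            "name": f"{pvc_name}-{suffix}",
            "namespace": namespace,
        },
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


# Hypothetical PVC names; the real names depend on how the Kafka cluster is deployed.
kafka_pvcs = ["data-kafka-0", "data-kafka-1", "data-kafka-2"]

manifests = [
    volume_snapshot_manifest(pvc, "kafka", "example-snapclass", "20251205")
    for pvc in kafka_pvcs
]

# kubectl accepts JSON, so the List object can be piped to `kubectl apply -f -`.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Each snapshot here is requested per volume, not atomically across brokers, so the snapshots of different brokers will not describe exactly the same point in time.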

In general, I think the backup and restore strategy needs to be defined together with the applications using Kafka, as it usually involves more than just the messages, and in many cases it's not needed at all. This cannot be done in a generic way and is therefore, IMO, not the platform's responsibility.
We will do our best to optimize high availability and placement so that we can rely mostly on Kafka-internal replication. This type of backup is therefore more of an emergency solution that the platform can offer.

If you agree, I can document it like this to be a bit more specific.

Contributor

@Kidswiss Kidswiss left a comment

Just one comment about the snapshots. Rest LGTM

* These approaches have trade-offs in cost, complexity and RTO; platform-level support is limited to emergency workflows and guidance.


Limitations and Caveats::
Contributor

IMHO a pretty important caveat:

Snapshots can't protect against complete loss of the underlying cluster/storage. Some CSPs support off-site snapshots, but that's usually a feature of the big ones (AWS, GCP, and Azure).

Author

@spike83 spike83 Dec 9, 2025

OK, OK, but Kafka is surely not the first service to have this problem. Sounds like another ADR then?

Contributor

We're not relying on snapshots for other services. Every other service has a proper off-site S3 backup solution, either built in via operators or via k8up.

Author

I think k8up will work here as well, thanks for the hint. As we can have huge amounts of data going through Kafka, we will need to be able to tune it on a per-installation basis, I guess.

Contributor

@zugao zugao left a comment

LGTM

4 participants