Add ADR for Kafka with Strimzi #171
base: master
Conversation
TheBigLee
left a comment
A few small nitpicks, other than that LGTM
=== Backup and Recovery

Backup and especially restoration of Kafka clusters is complex due to the distributed nature of Kafka. However, we implement a minimal backup strategy.

* Regular snapshots of Kafka data volumes
* Snapshots include backups of the Schema Registry topics
* Documentation of the backup and restore procedures

Since backup and restoration of Kafka clusters is very complex, this section should probably be a bit more detailed.
Specifically:
- How regularly do we take snapshots? Hourly? Daily?
- What are the RPO and RTO expectations?
- What would a restore look like? (It might make sense to give a high-level restore overview and the pros and cons of various backup/restore approaches.)
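To make the frequency question concrete: if periodic snapshots are the only backup, the worst-case RPO is roughly one snapshot interval. A minimal sketch of that arithmetic follows. The function names and the ingest rate are hypothetical and only for illustration; they are not figures from any installation.

```python
from datetime import timedelta


def worst_case_rpo(snapshot_interval: timedelta) -> timedelta:
    """Worst case: the failure happens just before the next snapshot is taken."""
    return snapshot_interval


def data_at_risk_gib(snapshot_interval: timedelta, ingest_mib_per_s: float) -> float:
    """Upper bound on data written since the last snapshot, in GiB."""
    return snapshot_interval.total_seconds() * ingest_mib_per_s / 1024


# Hypothetical ingest rate of 5 MiB/s, for illustration only.
for name, interval in [("hourly", timedelta(hours=1)), ("daily", timedelta(days=1))]:
    rpo = worst_case_rpo(interval)
    at_risk = data_at_risk_gib(interval, ingest_mib_per_s=5.0)
    print(f"{name}: worst-case RPO {rpo}, up to {at_risk:.1f} GiB not in the latest snapshot")
```

This ignores the time a snapshot takes to complete and the fact that snapshots of different brokers are not taken at the same instant, so the real exposure may be somewhat larger.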
I agree that we need to work out the details. The question is whether we can already do that here or need to postpone it a bit.
What is typically already available for snapshotting volumes in those installations? How regularly are snapshots usually taken?
I was thinking of at least turning on PV snapshots for the Kafka PVs. With this, a full cluster restore could be achieved, with some data loss at the end of course. This solution is not well suited to single-topic or single-message restores, but in an emergency those could be achieved as well with some manual effort.
In general, I think the backup and restore strategy needs to be defined together with the applications using Kafka, as it usually involves more than just the messages, and in lots of cases it's not needed at all. This cannot possibly be done in a generic way and thus, IMO, is not the platform's responsibility.
We will try our best to optimize high availability and placement so that we can rely mostly on Kafka-internal replication. This type of backup is therefore more of an emergency solution that the platform can offer.
If you agree, I can document it like this to be a bit more specific.
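As a rough illustration of what turning on PV snapshots for the Kafka PVs could involve, the sketch below builds one Kubernetes `VolumeSnapshot` manifest per broker volume. It assumes a CSI driver with snapshot support is installed and that the broker PVCs are named `data-<cluster>-kafka-<n>`; that naming, the cluster name, the namespace and the snapshot class are all placeholders to be checked against the actual installation.

```python
import json


def kafka_volume_snapshots(cluster, namespace, replicas, snapshot_class, suffix):
    """Build one VolumeSnapshot manifest per broker PVC."""
    manifests = []
    for broker in range(replicas):
        # Assumed PVC naming; verify it against the PVCs in the target namespace.
        pvc = f"data-{cluster}-kafka-{broker}"
        manifests.append({
            "apiVersion": "snapshot.storage.k8s.io/v1",
            "kind": "VolumeSnapshot",
            "metadata": {"name": f"{pvc}-{suffix}", "namespace": namespace},
            "spec": {
                "volumeSnapshotClassName": snapshot_class,
                "source": {"persistentVolumeClaimName": pvc},
            },
        })
    return manifests


# All values below are hypothetical placeholders.
snapshots = kafka_volume_snapshots("my-cluster", "kafka", 3, "csi-snapclass", "20240101")
for manifest in snapshots:
    print(json.dumps(manifest, indent=2))
```

Each broker volume is snapshotted separately, so the snapshots are not consistent with each other, which matches the expectation of some data loss at the end of the log after a full cluster restore.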
Kidswiss
left a comment
Just one comment about the snapshots. Rest LGTM
* These approaches have trade-offs in cost, complexity and RTO; platform-level support is limited to emergency workflows and guidance.

Limitations and Caveats::

IMHO a pretty important caveat:
Snapshots can't protect against complete loss of the underlying cluster/storage. Some CSPs support off-site snapshots; however, that's usually a feature of the big ones (AWS, GCP and Azure).
OK, OK, but surely Kafka is not the first service to have this problem. Sounds like another ADR then?
We're not relying on snapshots for other services. Every other service has a proper off-site S3 backup solution. Either built-in via operators or via k8up.
I think k8up will do as well, thanks for the hint. As we can have huge amounts of data going through Kafka, we need to be able to tune it on a per-installation basis, I guess.
zugao
left a comment
LGTM
Summary
Checklist
change, decision, requirement/quality, requirement/functional, dependency as they show up in the changelog