Replies: 5 comments 2 replies
-
@ethervoid Thanks for your feedback. Could you paste more INFO logs, along with the CPU and network metrics from the period when the instance became unreachable? I can't tell what's wrong with your instance from the information above.
-
@git-hulk Thank you for your answer! Sadly, I can't provide more logs than those because we lost the rest :( but I can provide network data from CloudWatch and CPU usage from the Prometheus metrics provided by the exporter. Let me know if anything else would be useful.
-
Thanks @ethervoid
This log entry means the replica's replication offset was too old, so it had to do a full sync with the master DB. The exporter failed to fetch metrics at the same point in time, so my guess is that the gap was caused by the high network bandwidth and CPU usage during the full sync.
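To illustrate why an old offset forces a full sync, here is a minimal sketch of the usual log-based replication rule. It is a simplified model and not kvrocks's actual implementation: the function name and the three sequence parameters are hypothetical, and it assumes the master retains only a bounded window of its write log.

```python
def needs_full_sync(replica_seq: int, oldest_retained_seq: int, latest_seq: int) -> bool:
    """Return True when a replica cannot resume incrementally (simplified model).

    replica_seq:         last sequence number the replica has applied
    oldest_retained_seq: oldest sequence number the master still keeps in its log
    latest_seq:          newest sequence number on the master
    All three names are hypothetical and used for illustration only.
    """
    if replica_seq > latest_seq:
        # The replica claims to be ahead of the master, so the histories diverged.
        return True
    # The replica needs entry replica_seq + 1 next. If the master has already
    # discarded it, the only option left is to copy the whole dataset again.
    return replica_seq + 1 < oldest_retained_seq


if __name__ == "__main__":
    # A replica that was offline long enough for the master to drop old entries.
    print(needs_full_sync(replica_seq=90, oldest_retained_seq=100, latest_seq=200))
    # A replica that is only slightly behind and can catch up incrementally.
    print(needs_full_sync(replica_seq=150, oldest_retained_seq=100, latest_seq=200))
```

Under this model, a replica that stays down while the master keeps taking writes is more likely to fall outside the retained window.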
-
So restarting a replica triggered a full sync, and that full sync consumed enough of the master's network and CPU to block it. Is that correct? If so, is there any recommendation for avoiding it?
-
config set max-replication-mb xxx
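As a rough way to pick a value for that setting, the sketch below estimates how long a full sync would take under a given cap. It assumes the cap is expressed in MB/s and that throughput stays at the cap for the whole transfer. The helper names and the sample figures are hypothetical, not measurements from this thread.

```python
def config_set_command(max_replication_mb: int) -> str:
    """Build the command line for a chosen cap (assumed unit: MB/s)."""
    if max_replication_mb < 0:
        raise ValueError("the cap cannot be negative")
    return f"config set max-replication-mb {max_replication_mb}"


def estimate_full_sync_seconds(dataset_mb: float, max_replication_mb: float) -> float:
    """Estimate the full sync duration if the transfer runs exactly at the cap."""
    if max_replication_mb <= 0:
        raise ValueError("a positive cap is needed to estimate a duration")
    return dataset_mb / max_replication_mb


if __name__ == "__main__":
    dataset_mb = 20480  # hypothetical 20 GiB dataset
    cap = 50            # hypothetical cap in MB/s
    print(config_set_command(cap))
    print(f"estimated full sync: {estimate_full_sync_seconds(dataset_mb, cap):.1f} s")
```

A lower cap leaves more bandwidth for client traffic on the master but keeps the replica out of service for longer, so the value is a trade-off rather than a fixed recommendation.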
-
Hello everyone!
We've been using kvrocks for a while. To give a bit of context on how we work with it: our system has 2 replicas and 1 master node as part of a Sentinel cluster.
When we want to release an update, we first release to the replicas and restart them, then do a manual failover with Sentinel, and finally release to the former master node.
Following this workflow today, we found that, for some reason, our master became unresponsive after we released the changes to the first replica. We started to see gaps in our Grafana metrics, as you can see
We connected to the machine and checked the Docker image; it was running, with the following logs for that time frame
We stopped writing to that node and, after a few minutes, it recovered and became responsive again without any other intervention.
Could this be a bug, an issue, or a misconfiguration?