Master node got unresponsive after restart one of the replicas #728

ethervoid · 2022-07-13T17:09:32Z

Hello everyone!

We've been using kvrocks for a while. For context, our system has 1 master node and 2 replicas running as part of a Sentinel cluster.

When we release an update, we first deploy to the replicas and restart them, then do a manual failover with Sentinel, and finally deploy to the former master node.

Following this workflow today, we found that after releasing the changes on the first replica, our master became unresponsive. We started to see gaps in our Grafana metrics.

We connected to the machine and checked the Docker container. It was running, with the following logs for that time window:

E0713 12:22:43.009356 14447 replication.cc:111] Write error while sending batch to slave: Broken pipe. batches: 0x243130360D0A7B11FE880D0000000200000003013201250B5F5F6E616D6573706163650000000C735F363737333231333331342F1B266EB733CEAC30060181F782F22601250B5F5F6E616D6573706163650000000C735F363737333231333331342F1B266EB733CEAC6105302E3735300D0A
E0713 12:23:08.211652    32 redis_cmd.cc:3533] checkWALBoundary with sequence: 58132926866, but GetWALIter return older sequence: 58132926860
E0713 12:43:06.192111  9671 replication.cc:111] Write error while sending batch to slave: Broken pipe. batches: 0x2431350D0A510D26890D000000000000000301320D0A

We stopped writing to that node, and after a few minutes it recovered and became responsive again without any further action on our side.

Could this be a bug, a known issue, or a misconfiguration?

git-hulk · 2022-07-14T02:11:39Z

Collaborator

@ethervoid Thanks for your feedback. Can you paste more INFO logs and the CPU/network metrics from the time the node became unreachable? I can't tell what's wrong with your instance from the information above.


ethervoid · 2022-07-14T08:44:27Z

Author

@git-hulk Thank you for your answer! Sadly I can't provide more logs than those, as we lost the rest :( But I can provide network data from CloudWatch and CPU usage from the Prometheus metrics provided by the exporter. Let me know if anything else would be useful.


git-hulk · 2022-07-14T09:51:46Z

Collaborator

Thanks @ethervoid

E0713 12:23:08.211652 32 redis_cmd.cc:3533] checkWALBoundary with sequence: 58132926866, but GetWALIter return older sequence: 58132926860

This log entry means the replica's replication offset was too old, so it had to do a full sync with the master DB. The exporter could not fetch metrics at that same point in time, so my guess is that the unresponsiveness was caused by high network bandwidth and CPU usage during the full sync.
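To watch for this condition in future incidents, here is a minimal Python sketch that pulls the two sequence numbers out of such a log line. It assumes the message format matches the line quoted above exactly; the function name `parse_wal_boundary` and the returned dictionary keys are made up for illustration and are not part of kvrocks.

```python
import re

# The log line quoted above, copied verbatim from the thread.
LINE = (
    "E0713 12:23:08.211652    32 redis_cmd.cc:3533] checkWALBoundary with "
    "sequence: 58132926866, but GetWALIter return older sequence: 58132926860"
)

# Assumption: the wording is stable; only the two numbers change.
PATTERN = re.compile(
    r"checkWALBoundary with sequence: (\d+), "
    r"but GetWALIter return \w+ sequence: (\d+)"
)


def parse_wal_boundary(line):
    """Return the requested and returned sequences, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    requested = int(match.group(1))
    returned = int(match.group(2))
    return {
        "requested": requested,
        "returned": returned,
        "gap": requested - returned,
    }
```

Counting how often this line appears around a replica restart gives a rough signal that a full sync, rather than an incremental one, was triggered.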


ethervoid · 2022-07-14T10:24:10Z

Author

So restarting a replica triggered a full sync, and that full sync used so much of the master's network and CPU that the master was effectively blocked. Is that correct?

If that's the case, any recommendation to avoid that?

2 replies

git-hulk Jul 14, 2022
Collaborator

You can increase rocksdb.wal_size_limit_mb to reduce the likelihood of a full sync, but it will also use more disk space. For network usage, you can use max-replication-mb to limit the replication speed; this setting can be changed online with config set max-replication-mb xxx.
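As a rough illustration of changing the limit online, the Python sketch below sends `CONFIG SET max-replication-mb <value>` over a plain socket using the Redis wire protocol (RESP), which kvrocks speaks. The helper names `encode_resp` and `config_set`, and the host, port and value in the usage comment, are assumptions for illustration only. If the instance requires a password, an `AUTH` command would have to be sent first, which this sketch does not do.

```python
import socket


def encode_resp(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def config_set(host: str, port: int, key: str, value: str,
               timeout: float = 5.0) -> bytes:
    """Send CONFIG SET and return the raw bytes of the server's reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_resp("CONFIG", "SET", key, value))
        return sock.recv(4096)


# Usage (hypothetical host, port and value; adjust to your deployment):
#   config_set("127.0.0.1", 6666, "max-replication-mb", "50")
```

Any Redis-compatible client or `redis-cli` can issue the same command; the raw socket is used here only to keep the sketch free of third-party dependencies.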

ethervoid Jul 14, 2022
Author

Understood! Thank you very much Hulk 🙇

adangadang · 2024-10-28T12:34:30Z

config set max-replication-mb xxx

