Durable queue down or inaccessible after two out of three nodes are forcefully shut down #2454

jshen28 · 2020-09-28T09:20:46Z

jshen28
Sep 28, 2020

we are seeing a lot of durable queues are not working after force reboot pod, logs look like below

operation queue.declare caused a channel exception not_found: home node 'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local' of durable queue 'q-acl-plugin' in vhost 'neutron' is down or inaccessible
2020-09-28 09:17:03.117 [warning] <0.16320.8> closing AMQP connection <0.16320.8> (100.101.182.60:51060 -> 100.101.90.68:5672 - neutron-server:17:c951f597-baf5-45de-9b2c-001d5bf8f1bd, vhost: 'neutron', user: 'neutron'):
client unexpectedly closed TCP connection
2020-09-28 09:17:04.153 [info] <0.16346.8> accepting AMQP connection <0.16346.8> (100.101.182.60:51202 -> 100.101.90.68:5672)
2020-09-28 09:17:04.161 [info] <0.16346.8> Connection <0.16346.8> (100.101.182.60:51202 -> 100.101.90.68:5672) has a client-provided name: neutron-server:17:c951f597-baf5-45de-9b2c-001d5bf8f1bd
2020-09-28 09:17:04.163 [info] <0.16346.8> connection <0.16346.8> (100.101.182.60:51202 -> 100.101.90.68:5672 - neutron-server:17:c951f597-baf5-45de-9b2c-001d5bf8f1bd): user 'neutron' authenticated and granted access to vhost 'neutron'
2020-09-28 09:17:04.171 [error] <0.16354.8> Channel error on connection <0.16346.8> (100.101.182.60:51202 -> 100.101.90.68:5672, vhost: 'neutron', user: 'neutron'), channel 1:
operation queue.declare caused a channel exception not_found: home node 'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local' of durable queue 'q-acl-plugin' in vhost 'neutron' is down or inaccessible
2020-09-28 09:17:04.184 [error] <0.16360.8> Channel error on connection <0.16346.8> (100.101.182.60:51202 -> 100.101.90.68:5672, vhost: 'neutron', user: 'neutron'), channel 1:
operation queue.declare caused a channel exception not_found: home node 'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local' of durable queue 'q-acl-plugin' in vhost 'neutron' is down or inaccessible
2020-09-28 09:17:05.191 [warning] <0.16346.8> closing AMQP connection <0.16346.8> (100.101.182.60:51202 -> 100.101.90.68:5672 - neutron-server:17:c951f597-baf5-45de-9b2c-001d5bf8f1bd, vhost: 'neutron', user: 'neutron'):
client unexpectedly closed TCP connection
2020-09-28 09:17:06.226 [info] <0.16371.8> accepting AMQP connection <0.16371.8> (100.101.182.60:51316 -> 100.101.90.68:5672)
2020-09-28 09:17:06.229 [info] <0.16371.8> Connection <0.16371.8> (100.101.182.60:51316 -> 100.101.90.68:5672) has a client-provided name: neutron-server:17:c951f597-baf5-45de-9b2c-001d5bf8f1bd
2020-09-28 09:17:06.231 [info] <0.16371.8> connection <0.16371.8> (100.101.182.60:51316 -> 100.101.90.68:5672 - neutron-server:17:c951f597-baf5-45de-9b2c-001d5bf8f1bd): user 'neutron' authenticated and granted access to vhost 'neutron'
2020-09-28 09:17:06.239 [error] <0.16379.8> Channel error on connection <0.16371.8> (100.101.182.60:51316 -> 100.101.90.68:5672, vhost: 'neutron', user: 'neutron'), channel 1:
operation queue.declare caused a channel exception not_found: home node 'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local' of durable queue 'q-acl-plugin' in vhost 'neutron' is down or inaccessible
2020-09-28 09:17:06.244 [error] <0.16385.8> Channel error on connection <0.16371.8> (100.101.182.60:51316 -> 100.101.90.68:5672, vhost: 'neutron', user: 'neutron'), channel 1:
operation queue.declare caused a channel exception not_found: home node 'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local' of durable queue 'q-acl-plugin' in vhost 'neutron' is down or inaccessible
2020-09-28 09:17:06.873 [warning] <0.15552.8> closing AMQP connection <0.15552.8> (100.101.182.34:44654 -> 100.101.90.68:5672 - health-probe.py:156:f738a36c-fa06-4b62-9dc0-bf0377058411, vhost: 'nova', user: 'nova'):
client unexpectedly closed TCP connection
2020-09-28 09:17:07.250 [warning] <0.16371.8> closing AMQP connection <0.16371.8> (100.101.182.60:51316 -> 100.101.90.68:5672 - neutron-server:17:c951f597-baf5-45de-9b2c-001d5bf8f1bd, vhost: 'neutron', user: 'neutron'):
client unexpectedly closed TCP connection

evaluate on the different nodes shows

root@cve-rabbitmq-0:/# rabbitmqctl eval 'rabbit_misc:dirty_read({rabbit_queue, rabbit_misc:r(<<"nova">>, queue, <<"compute">>)}).'
{error,not_found}

root@cve-rabbitmq-2:/# rabbitmqctl eval 'rabbit_misc:dirty_read({rabbit_queue, rabbit_misc:r(<<"nova">>, queue, <<"compute">>)}).'
{ok,{amqqueue,{resource,<<"nova">>,queue,<<"compute">>},
              true,false,none,
              [{<<"x-ha-policy">>,longstr,<<"all">>}],
              <10614.7269.0>,[],[],[],
              [{vhost,<<"nova">>},
               {name,<<"ha_ttl_nova">>},
               {pattern,<<"^(?!(amq\\.|reply_)).*">>},
               {'apply-to',<<"all">>},
               {definition,[{<<"ha-mode">>,<<"all">>},
                            {<<"ha-sync-mode">>,<<"automatic">>},
                            {<<"message-ttl">>,70000}]},
               {priority,0}],
              undefined,
              [{<10614.7270.0>,<10614.7269.0>}],
              [],live,0,[],<<"nova">>,
              #{user => <<"nova">>}}}

rabbit_queue table are not synced.

There are around 1k queues & exchanges, the way I am using to reproducing problem is to first

# kubectl delete pod -n openstack cve-rabbitmq-1 --force --grace-period 0
# sleep 8
# kubectl delete pod -n openstack cve-rabbitmq-0 --force --grace-period 0

The cluster status yields OK on both nodes

Cluster status of node rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local ...
[{nodes,
     [{disc,
          ['rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local']},
 {cluster_name,
     <<"rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local">>},
 {partitions,[]},
 {alarms,
     [{'rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []}]}]

root@cve-rabbitmq-0:/# rabbitmqctl cluster_status
Cluster status of node rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local ...
[{nodes,
     [{disc,
          ['rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local']},
 {cluster_name,
     <<"rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local">>},
 {partitions,[]},
 {alarms,
     [{'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []}]}]

We are using rabbitmq 3.7.26 and OTP 22.3.4.4

And for clearance the policies are listed below

root@mgt01:~/rabbitmq_cmd# ./rabbitmqadmin -N test list policies
+----------+-----------------+----------+---------------------------------------------------------------------------------------------+-----------------------+----------+
|  vhost   |      name       | apply-to |                                         definition                                          |        pattern        | priority |
+----------+-----------------+----------+---------------------------------------------------------------------------------------------+-----------------------+----------+
| /        | ha-all          | all      | {"ha-sync-mode": "automatic", "ha-mode": "all"}                                             | .*                    | 0        |
| cinder   | ha_ttl_cinder   | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "message-ttl": 70000}                       | ^(?!(amq\.|reply_)).* | 0        |
| glance   | ha_ttl_glance   | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "message-ttl": 70000}                       | ^(?!(amq\.|reply_)).* | 0        |
| heat     | ha_ttl_heat     | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "queue-mode": "lazy", "message-ttl": 70000} | ^(?!(amq\.|reply_)).* | 0        |
| ironic   | ha_ttl_ironic   | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "queue-mode": "lazy", "message-ttl": 70000} | ^(?!(amq\.|reply_)).* | 0        |
| keystone | ha_ttl_keystone | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "message-ttl": 70000}                       | ^(?!(amq\.|reply_)).* | 0        |
| masakari | ha_ttl_masakari | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "message-ttl": 70000}                       | ^(?!(amq\.|reply_)).* | 0        |
| neutron  | ha_ttl_neutron  | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "message-ttl": 70000}                       | ^(?!(amq\.|reply_)).* | 0        |
| nova     | ha_ttl_nova     | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "message-ttl": 70000}                       | ^(?!(amq\.|reply_)).* | 0        |
| watcher  | ha_ttl_watcher  | all      | {"ha-sync-mode": "automatic", "ha-mode": "all", "message-ttl": 70000}                       | ^(?!(amq\.|reply_)).* | 0        |
+----------+-----------------+----------+---------------------------------------------------------------------------------------------+-----------------------+----------+

Answered by michaelklishin

Sep 30, 2020

I am closing this as resolved since this test

Kills the majority of nodes, which is guaranteed to be problematic one way or another (for classic mirrored queues, there are options, schema table sync will be interrupted, quorum queues will lose availability by design)
Assumes that when a pod is reported as started, so is the RabbitMQ running on it. Sadly this is not the case because node startup takes time and so is syncing from peers. We currently do not have a readiness probe which satisfies both this condition and the stateful ordered set restart (see Node Restarts and Restarts and Health Checks (Readiness Probes)).
Likely kills the source node when another one is syncing the schema da…

View full answer

michaelklishin · 2020-09-28T09:44:19Z

michaelklishin
Sep 28, 2020
Maintainer

We don't know how many replicas there are in total but two of them are force-deleted. The loss of a majority of replicas for classic mirrored queues is covered in a separate section.
A node will take more than eight seconds to start and sync, so your commands remove two nodes from the cluster at once, not just restart one of the nodes.

operation queue.declare caused a channel exception not_found: home node 'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local' of durable queue 'q-acl-plugin' in vhost 'neutron' is down or inaccessible

is a pretty specific message covered in a dedicated section in the classic queue mirroring guide. As far as the node that's handling your Nova connection is, the queue q-acl-plugin was not mirrored but was durable and its hosting node was not accessible.

The direct table row lookup uses a dirty (node-local) lookup for a different queue named compute. So I'm not really sure what the question here is: the message and the lookup are not contradicting each other since different queues can be in different states.

With OpenStack in the past, we have seen clients connecting and immediately starting to perform operations on nodes that are not yet 100% booted and synchronized. In part, this is because client operations are concurrent to everything else the node may be doing, and in part because client connection listeners were started in the middle of the boot sequence. In 3.8, this has been addressed as part of a group of changes to address #2384 (#2406).

RabbitMQ 3.7 is three days away from going entirely out of support. Consider upgrading to 3.8.9.

3.8 introduced a new replicated queue type which is focusses on data safety and predictable recovery. However, when the majority of replicas are offline, which very well may be the case in this example,
they will also become unavailable to client operations until a majority of replicas are back online.

1 reply

michaelklishin Sep 28, 2020
Maintainer

#1869 was an earlier attempt at delaying listener startup. It was enough for some cases but not others.

jshen28 · 2020-09-28T13:10:35Z

jshen28
Sep 28, 2020
Author

First thank you for reply. But I am sorryI cannot agree with your opinion, q-acl-plugin is a actually both mirrored & durable, so I do not agree it is really the answer. Besides, I am wondering if there is any way client could actually know what the state current server is in, or I have no choice but add some fixed delay for reconnection? IMO, this delay is hard if there are a lot of clients connecting at the same time as cluster scales up.

2 replies

michaelklishin Sep 28, 2020
Maintainer

I cannot have an "opinion" here: I work with the data that you provide. There is no evidence that q-acl-plugin is mirrored and the error message is what an attempt to use a non-mirrored durable queue on an unaccessible node would produce.

michaelklishin Sep 28, 2020
Maintainer

A client cannot know what the state of the queue or replication is, which is why #1869 and #2406 were introduced.

jshen28 · 2020-09-28T13:38:42Z

jshen28
Sep 28, 2020
Author

The question is if there is a difference in rabbit_queue table, publish to the same behaves differently on different node. Publish message to one node is fine, but not OK if publish to another node.

1 reply

michaelklishin Sep 28, 2020
Maintainer

rabbit_queue is a table used for transient queues. Durable queues are stored in rabbit_durable_queue. The same goes for bindings. This is because transient queues are all removed on node boot, by definition.

There are many unknowns here. You forcefully shut down two cluster nodes, so how many are there left? If it is just one then you are very likely to have some queues that may or may not have a promotable master. We don't have any logs to verify whether that is or isn't the case.

What node are these logs for? What's in the logs of other nodes? Which node out of them has no queue rows in its schema database? What's in that node's logs? We do not guess in this community, it's too time-consuming.

michaelklishin · 2020-09-28T16:50:45Z

michaelklishin
Sep 28, 2020
Maintainer

@jshen28 can full logs from all nodes be shared from the start of your test to the end of it? Without the logs we won't be able to offer a meaningful theory. We also would investigate in detail only if this can be reproduced with 3.8.9, as 3.7.x goes completely out of support on October 1st and there won't be any new 3.7 releases.

0 replies

jshen28 · 2020-09-29T01:19:34Z

jshen28
Sep 29, 2020
Author

Thank you for reply. Unfortunately, I do not keep the log. Using the procedure, unroutable publishes could be consistently reproduced but internal behavior is different. Following log is error from a different test run.

After some debugging, I found rabbit_topic_trie_edge table is inconsistent on different nodes

root@mgt01:~# kubectl -n openstack exec -it cve-rabbitmq-0 -c rabbitmq-ha -- rabbitmqctl eval 'length(mnesia:dirty_all_keys(rabbit_topic_trie_edge)).'
12
root@mgt01:~# kubectl -n openstack exec -it cve-rabbitmq-2 -c rabbitmq-ha -- rabbitmqctl eval 'length(mnesia:dirty_all_keys(rabbit_topic_trie_edge)).'
720
root@mgt01:~# kubectl -n openstack exec -it cve-rabbitmq-1 -c rabbitmq-ha -- rabbitmqctl eval 'length(mnesia:dirty_all_keys(rabbit_topic_trie_edge)).'
19

while cluster is still fine

root@mgt01:~# kubectl -n openstack exec -it cve-rabbitmq-1 -c rabbitmq-ha -- rabbitmqctl cluster_status
Cluster status of node rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local ...
[{nodes,
     [{disc,
          ['rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local']},
 {cluster_name,
     <<"rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local">>},
 {partitions,[]},
 {alarms,
     [{'rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []}]}]
root@mgt01:~# kubectl -n openstack exec -it cve-rabbitmq-2 -c rabbitmq-ha -- rabbitmqctl cluster_status
Cluster status of node rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local ...
[{nodes,
     [{disc,
          ['rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local']},
 {cluster_name,
     <<"rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local">>},
 {partitions,[]},
 {alarms,
     [{'rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []}]}]
root@mgt01:~# kubectl -n openstack exec -it cve-rabbitmq-0 -c rabbitmq-ha -- rabbitmqctl cluster_status
Cluster status of node rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local ...
[{nodes,
     [{disc,
          ['rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
           'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local']}]},
 {running_nodes,
     ['rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
      'rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local']},
 {cluster_name,
     <<"rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local">>},
 {partitions,[]},
 {alarms,
     [{'rabbit@cve-rabbitmq-2.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-1.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []},
      {'rabbit@cve-rabbitmq-0.cve-rabbitmq-discovery.openstack.svc.cluster.local',
          []}]}]

Test script looks like this

kubectl delete pod -n openstack --force --grace-period 0 cve-rabbitmq-1
while true; do
	ret=`kubectl get pods -n openstack -owide | grep cve-rabbitmq-1 | grep -i Running`
	if $? -eq 0; then
	  sleep .05
	  kubectl delete pod -n openstack --force --grace-period 0 cve-rabbitmq-0
	  break
	fi
done

5 replies

michaelklishin Sep 29, 2020
Maintainer

There are no errors of anything suspicious in this log. The node mostly starts up, then adds classic mirrored queue replicas for hundreds or so queues.

The test, as I previously expected, kills two out of three nodes in the cluster: as soon as it sees the pod to be up, it kills the next one.
However, a pod being up does not mean that the RabbitMQ node in the pod is fully booted and synced. So the process must be killed mid-sync, which means some tables may be missing some rows; table sync does not happen in a single giant transaction for all kinds of reasons. On top of that, in one example, one of the queue tables was empty, now a topic exchange topology (trie edges) table has different contents.

Node restarts and syncing are documented in a dedicated section. Note that we have a section dedicated to health checks and Kubernetes deployment next to it. This is because on Kubernetes specifically, how exactly node startup/readiness is determined matters a lot.

rabbitmqctl await_startup would wait for actual node startup but it does not coordinate with table syncing, so in theory, there will be a time window where it would already have returned but the syncing would still be finishing. It would be longer for environments that have more queues (exchanges, bindings). In 3.7, every queue has explicit (actual binding table rows) to the default exchange, while in 3.8 they are implicit for the default exchange (special-cased in the code), which means there are fewer bindings to transfer.

So as a summary, if nodes are killed mid-table sync, tables can have different contents, in particular, if the source node is killed. This test kills two nodes out of three before they finish syncing, which is guaranteed to be problematic one way or another.

jshen28 Sep 29, 2020
Author

thank you for reply. Is there any way to determine if a rabbitmq node is in problematic state? Besides, Is mnesia syncing table contents from a single node? If I kill that master node, will mnesia try to sync from node, or it will just stops there and runs with an inconsistent state?

michaelklishin Sep 29, 2020
Maintainer

It should be syncing from the peer this node could connect to. There are no master nodes in Mnesia or RabbitMQ.

There is no easy way to determine if some tables may be missing some data since, well, a table can naturally be empty. Schema data store tables can have node-specific rows and tables (e.g. connection tracking uses node-local tables), too.

michaelklishin Sep 29, 2020
Maintainer

I can't tell if a node can observe that its peer goes down (it's a Mnesia "feature" we use and wait for it to return). In theory retries from another node could be an option but as I said earlier, it is not always obvious if a node is in sync or not. This is very much a Mnesia error handling question. Our team would primarily be interested in investigating this for Mnevis, our Raft-based distributed Mnesia replacement for future RabbitMQ versions.

michaelklishin Sep 30, 2020
Maintainer

One known approach to efficiently compute a delta between two data sets is Merkle trees. Mnevis or future RabbitMQ releases could potentially adopt it but given that the data set is volatile by nature, there is a serious risk of false positives if this comparison is done naively.

jshen28 · 2020-09-29T01:23:07Z

jshen28
Sep 29, 2020
Author

Besides，in kuberntes delete pod will change container IP, will it cause some unexpected behavior?

1 reply

michaelklishin Sep 29, 2020
Maintainer

Nodes use hostnames, so as long as those still resolve, and the data directory computed (or explicitly configured) by the nodes is still the same, and, of course, the data is on a persistent volume, the fact that the pod's containers were recreated won't make any difference.

jshen28 · 2020-09-30T08:00:46Z

jshen28
Sep 30, 2020
Author

So while a node is starting up, and I kill the pod immediately, should it receive a nodedown signal from net_kernel? Currently the net tick time is default. And force shutdown with grace period of 0 could potentially startup the pod in a very short period of which could be less than 1/2 of tick time, will it also cause some issues?

5 replies

michaelklishin Sep 30, 2020
Maintainer

If the source node is killed, how the sync process handles that is handled by Mnesia. If it simply returns and local tables only have a subset of data, RabbitMQ has no way of knowing that (a replica addition/sync return is a return).

jshen28 Sep 30, 2020
Author

hmm... how can I tell if mnesia returns directly with continue syncing? Will debug logs help?

michaelklishin Sep 30, 2020
Maintainer

See rabbit_mnesia:create_schema/0 and rabbit_table:wait_for_replicated/2, which will lead you to mnesia:wait_for_tables/2. This is where RabbitMQ waits for the table syncing to finish. What exactly is involved and how peer failures are handled is has to be tracked in the OTP codebase (Mnesia specifically).

With any schema data store, losing the majority of nodes when the cluster is still [re-]forming will be problematic one way or another (you either lose consistency like in the examples above or availability).

jshen28 Sep 30, 2020
Author

right, but I am also wondering why there is no nodedown events got logged? IMO, while node starting up, it is still forms a cluster, so one node got kill should notify other nodes.

michaelklishin Sep 30, 2020
Maintainer

RabbitMQ node monitor is one of the boot steps and it is started without coordination the schema database startup as far as I can tell.

-rabbit_boot_step({database,
                   [{mfa,         {rabbit_mnesia, init, []}},
                    {requires,    file_handle_cache},
                    {enables,     external_infrastructure}]}).

-rabbit_boot_step({database_sync,
                   [{description, "database sync"},
                    {mfa,         {rabbit_sup, start_child, [mnesia_sync]}},
                    {requires,    database},
                    {enables,     external_infrastructure}]}).

-rabbit_boot_step({rabbit_node_monitor,
                   [{description, "node monitor"},
                    {mfa,         {rabbit_sup, start_restartable_child,
                                   [rabbit_node_monitor]}},
                    {requires,    [rabbit_alarm, guid_generator]},
                    {enables,     core_initialized}]}).

There can be other reasons, such as peer unavailability detecting taking a while, and Mnesia doing its own error handling.

michaelklishin · 2020-09-30T08:10:17Z

michaelklishin
Sep 30, 2020
Maintainer

I am closing this as resolved since this test

Kills the majority of nodes, which is guaranteed to be problematic one way or another (for classic mirrored queues, there are options, schema table sync will be interrupted, quorum queues will lose availability by design)
Assumes that when a pod is reported as started, so is the RabbitMQ running on it. Sadly this is not the case because node startup takes time and so is syncing from peers. We currently do not have a readiness probe which satisfies both this condition and the stateful ordered set restart (see Node Restarts and Restarts and Health Checks (Readiness Probes)).
Likely kills the source node when another one is syncing the schema database from it; a subset of data won't be transferred and depending on what it is, you will get different outcomes (a queue won't be found, routing will not work the same across nodes, and so on)

There are doc links in this discussion that explain all the important bits about how nodes rejoin a cluster and what that means specifically on Kubernetes.

1 reply

michaelklishin Sep 30, 2020
Maintainer

Added some links to the above comment.

gerhard · 2020-10-01T11:00:45Z

gerhard
Oct 1, 2020

Adding extra insights in response to: https://www.youtube.com/watch?v=I02oKJlOnR4&lc=Ugyei_s5o0sVFfKce-t4AaABAg

The important ones:

Don't use mirrored classic queues, use quorum queues. They are not safe by design and there is nothing that we can do about it. This explains why.
Run a supported RabbitMQ version. RabbitMQ 3.7.x is EOL.
RabbitMQ v3.8.9 ships an important bug fix that may affect you, make sure that you run that.

The less important ones:

By design, mnesia syncs from peer nodes on startup
A mnesia sync implies an exact copy, nothing is merged
It is possible for a clustered node to boot without mnesia sync-ing (the logs will confirm and explain why that happened). In this case, the node will not join the cluster, even though one might expect it to.
It is possible for a sync to hang, get interrupted, or simply fail
When a sync completes successfully, all nodes in the cluster will share some of the mnesia data, except the data which is node-local
All mnesia operations are synchronous & replicated. Given a healthy cluster, all nodes will either confirm a cluster-wide mnesia transaction or not, there is no "maybe" or eventual consistency.
Dirty mnesia reads are node local and are not representative across the cluster.
rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl startp_app keeps the Erlang VM running, but forces mnesia to re-sync.
When mnesia goes out of sync across nodes, your cluster is considered partitioned. Force re-sync-ing will fix this.

In conclusion, there is a limit to how much free support we can provide. Considering how much @michaelklishin helped with this, and the extra insights provided above, if you feel @jshen28 that you need more, I would recommend looking at https://www.rabbitmq.com/services.html

Thank you!

0 replies

Durable queue down or inaccessible after two out of three nodes are forcefully shut down #2454

jshen28 Sep 28, 2020

Replies: 9 comments · 16 replies

michaelklishin Sep 28, 2020 Maintainer

michaelklishin Sep 28, 2020 Maintainer

jshen28 Sep 28, 2020 Author

michaelklishin Sep 28, 2020 Maintainer

michaelklishin Sep 28, 2020 Maintainer

jshen28 Sep 28, 2020 Author

michaelklishin Sep 28, 2020 Maintainer

michaelklishin Sep 28, 2020 Maintainer

jshen28 Sep 29, 2020 Author

michaelklishin Sep 29, 2020 Maintainer

jshen28 Sep 29, 2020 Author

michaelklishin Sep 29, 2020 Maintainer

michaelklishin Sep 29, 2020 Maintainer

michaelklishin Sep 30, 2020 Maintainer

jshen28 Sep 29, 2020 Author

michaelklishin Sep 29, 2020 Maintainer

jshen28 Sep 30, 2020 Author

michaelklishin Sep 30, 2020 Maintainer

jshen28 Sep 30, 2020 Author

michaelklishin Sep 30, 2020 Maintainer

jshen28 Sep 30, 2020 Author

michaelklishin Sep 30, 2020 Maintainer

michaelklishin Sep 30, 2020 Maintainer

michaelklishin Sep 30, 2020 Maintainer

gerhard Oct 1, 2020

jshen28
Sep 28, 2020

Replies: 9 comments 16 replies

michaelklishin
Sep 28, 2020
Maintainer

michaelklishin Sep 28, 2020
Maintainer

jshen28
Sep 28, 2020
Author

michaelklishin Sep 28, 2020
Maintainer

michaelklishin Sep 28, 2020
Maintainer

jshen28
Sep 28, 2020
Author

michaelklishin Sep 28, 2020
Maintainer

michaelklishin
Sep 28, 2020
Maintainer

jshen28
Sep 29, 2020
Author

michaelklishin Sep 29, 2020
Maintainer

jshen28 Sep 29, 2020
Author

michaelklishin Sep 29, 2020
Maintainer

michaelklishin Sep 29, 2020
Maintainer

michaelklishin Sep 30, 2020
Maintainer

jshen28
Sep 29, 2020
Author

michaelklishin Sep 29, 2020
Maintainer

jshen28
Sep 30, 2020
Author

michaelklishin Sep 30, 2020
Maintainer

jshen28 Sep 30, 2020
Author

michaelklishin Sep 30, 2020
Maintainer

jshen28 Sep 30, 2020
Author

michaelklishin Sep 30, 2020
Maintainer

michaelklishin
Sep 30, 2020
Maintainer

michaelklishin Sep 30, 2020
Maintainer

gerhard
Oct 1, 2020