Rabbit MQTT Plugin failing to start #2665
-
We have set up a cluster of 3 nodes on VMs behind a load balancer. We are using the RabbitMQ MQTT plugin for the messaging protocol and then running a load test. After some time, client connections start to drop and one or more nodes crash. When they reboot, they fail to work at all and just go into a loop of rebooting. We looked at the logs and can see that the rabbitmq_mqtt plugin specifically is failing to start, giving the following error:
The MQTT plugin then never starts again, which brings down our entire cluster. We are not sure how to fix this other than clearing everything from the database and setting up the cluster from scratch.
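For reference, "clearing everything and setting up from scratch" can be done per node with the stock `rabbitmqctl` commands rather than by deleting files by hand. This is a sketch, and it is destructive: it wipes the node's schema database, queues, users and vhosts.

```shell
# WARNING: destructive -- run on each node you want to blank out
rabbitmqctl stop_app       # stop the RabbitMQ application, keep the Erlang runtime up
rabbitmqctl force_reset    # wipe node state even if cluster peers are unreachable
rabbitmqctl start_app      # start again; re-join the cluster afterwards with join_cluster
```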
Replies: 9 comments 7 replies
-
Thank you for your time. Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes that we have a certain amount of information to work with. Getting all the details necessary to reproduce an issue, make a conclusion, or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing.
Feel free to edit out hostnames and other potentially sensitive information. When/if we have a complete enough understanding of what's going on, a recommendation will be provided or a new issue with more context will be filed. Thank you.
-
We cannot suggest much without a way to reproduce, logs from all nodes, and information about which RabbitMQ version was used. All we know is that the plugin failed to initiate a Raft leader election for its client ID tracker. It could be a known scenario addressed in rabbitmq/rabbitmq-mqtt#235 or something else.
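As a starting point for gathering that information, the stock CLI tools can report the version and node state. This is a sketch; all three are standard `rabbitmqctl`/`rabbitmq-diagnostics` commands and require a reachable, running node.

```shell
rabbitmqctl version                  # broker version
rabbitmq-diagnostics status          # memory breakdown, alarms, file descriptor usage
rabbitmq-diagnostics cluster_status  # cluster membership as this node sees it
```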
-
Here is all of the info that we can provide.

Server & client: the server is RabbitMQ version 3.8.9. We are running a client app for load testing that just fires off connections, each with a unique client ID (just a GUID); this is a .NET Core app using the MqttNet client library. The test app is at this repo: https://github.com/tiaan-lg/MQTT-Test-App. All the logs from the RabbitMQ logs folder, and additional info:
The rabbitmq app is not running, so the status cannot be retrieved.
Transcript to reproduce
High-Level Explanation and other relevant information

Server environment:
Client setup:
Thank you for being so prompt in your response; feel free to DM me and we can also give you access to our system. We can run another setup and show how it fails in real time.
-
Hi, we have additional info: the server starts up fine without the MQTT plugin, and then when we try to enable it we get this error:
When the broker crashed there were about 20k queues active.
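For what it's worth, the plugin is toggled with the stock `rabbitmq-plugins` tool; its `--offline` flag is useful when a node refuses to boot because a plugin fails to start. A sketch:

```shell
rabbitmq-plugins enable rabbitmq_mqtt             # on a running node
# If the node cannot boot with the plugin enabled, disable it
# without contacting the node, then start the node again:
rabbitmq-plugins disable rabbitmq_mqtt --offline
```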
-
@michaelklishin Could a limit on Disk IOPS be causing the…
-
Is there any update on this? We are busy testing the MQTT plugin, and we are just having issues with it. It seems the plugin crashes and then cannot restart, causing the entire cluster to go down. We are running 3 Windows virtual machines in a cluster with the MQTT plugin enabled. rabbitmq_mqtt plugin version 3.8.9
-
According to the provided logs, this cluster experienced all kinds of highly unusual events, such as the following.

Node Fails to Rejoin the Cluster Because Its Cluster Identity Is Different

This error means that a node tried to join a cluster it was not previously a member of. The "cookie" here refers to a schema database identity and has nothing to do with the Erlang cookie. I fail to see how this can happen given the set of steps above, but logs don't lie.

Nodes Stop Responding to Inter-Node Heartbeats

This can be just an indication that a peer node has gone down.

MQTT Plugin Timing Out on Triggering Elections
@tiaan-lg do you monitor the nodes to have at least some idea of what makes them run out of memory? TCP connections, and the MQTT state on top of them, are not free, so at the very least see Tuning for a Large Number of Connections. There is evidence of memory alarms going off, suggesting that the node's memory usage grows over time:
This is not by itself evidence of a memory leak, since every connection consumes resources.
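That tuning guide largely boils down to shrinking the per-connection footprint. A sketch of the relevant `rabbitmq.conf` settings follows; the key names come from the RabbitMQ networking documentation, but the values are illustrative, not recommendations for this workload:

```ini
# Smaller TCP buffers per connection (defaults are considerably larger)
tcp_listen_options.sndbuf = 32768
tcp_listen_options.recbuf = 32768

# The MQTT listener has its own copies of these options
mqtt.tcp_listen_options.sndbuf = 32768
mqtt.tcp_listen_options.recbuf = 32768

# A larger accept backlog helps absorb connection floods
tcp_listen_options.backlog = 4096
```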
-
So my conclusion so far is that the node runs out of resources under the connection flood test. It certainly hits its memory alarm limit. I cannot know what uses the memory, but a node can report a reasonably detailed breakdown.

Nodes also seem to run out of file descriptors, or hit a socket limit of some kind, or both, likely because the test creates a high connection churn scenario, which is a very problematic workload that requires optimization of certain settings. A node that has run out of file descriptors and cannot open a connection to its peers will fail in all kinds of weird ways.

Supporting 200K connections is unrealistic with all defaults for both the RabbitMQ TCP buffer size and kernel TCP settings (that can keep connections in…
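On the kernel side, high connection churn usually involves the file descriptor ceiling, the ephemeral port range, and how long closed sockets linger. A sketch for Linux; this cluster runs on Windows, where the equivalent knobs live elsewhere, so treat the keys and values below as illustrative only:

```ini
# /etc/sysctl.d/99-rabbitmq.conf -- Linux only, illustrative values
fs.file-max = 500000                        # system-wide file descriptor ceiling
net.ipv4.ip_local_port_range = 10240 65535  # more ephemeral ports for churn
net.ipv4.tcp_fin_timeout = 15               # release FIN_WAIT_2 sockets sooner
net.ipv4.tcp_tw_reuse = 1                   # reuse TIME_WAIT sockets for outgoing connections
```

Per-process limits (e.g. `LimitNOFILE` in a systemd unit) would also need raising alongside these.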
-
For the record, we are also seeing this on clusters with a lot of MQTT connections. Our "solution" has been to delete the quorum dir in the mnesia directory (1-node clusters).
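That workaround can be sketched as follows. The data directory path is an assumption for a typical Linux install, and deleting the quorum directory discards all Raft state on the node (quorum queues and the MQTT client ID tracker), so it is a last resort:

```shell
# WARNING: destructive; only reported as a workaround for single-node clusters
rabbitmqctl stop_app
# Raft/quorum state lives under the node's data (mnesia) directory;
# the exact path below is an assumption -- adjust node name and base dir
rm -rf /var/lib/rabbitmq/mnesia/rabbit@HOSTNAME/quorum
rabbitmqctl start_app
```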