Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate sudden disconnects #8

Open
bmario opened this issue Jul 13, 2020 · 4 comments
Open

Investigate sudden disconnects #8

bmario opened this issue Jul 13, 2020 · 4 comments
Labels
bug Something isn't working

Comments

@bmario
Copy link
Member

bmario commented Jul 13, 2020

Sometimes we see disconnects in the C++ Agents, especially in transformers. What I've learned so far from investigating this issue:

Conditions:

  • It always happens on the data connection
  • It only happens with a good chunk of throughput
  • The node setup doesn't seem to influence the setup. I had source connections to both nodes in the occurrence below.

Procedure:

  • It seems to happen, whenever there are no more data messages coming from the rabbitmq
  • The data connection switches over to receiving and sending heartbeats
  • After a few minutes, the rabbitmq complains about missing heartbeats and closes the connection
  • The client receives an EndOfFile on the data connection socket and throws exceptions until it dies.
@bmario bmario added the bug Something isn't working label Jul 13, 2020
@bmario
Copy link
Member Author

bmario commented Jul 13, 2020

This is one observed occurrence. From the RabbitMQ log:

2020-07-10 14:55:30.992 [info] <0.22574.62> accepting AMQP connection <0.22574.62> (141.30.75.111:41474 -> 141.76.19.152:5671)
2020-07-10 14:55:31.038 [info] <0.22574.62> connection <0.22574.62> (141.30.75.111:41474 -> 141.76.19.152:5671): user 'scorep-elab' authenticated and granted access to vhost '/'
2020-07-10 14:55:31.292 [info] <0.22593.62> accepting AMQP connection <0.22593.62> (141.30.75.111:41476 -> 141.76.19.152:5671)
2020-07-10 14:55:31.338 [info] <0.22593.62> connection <0.22593.62> (141.30.75.111:41476 -> 141.76.19.152:5671): user 'scorep-elab' authenticated and granted access to vhost 'data'
...
2020-07-11 09:13:32.481 [error] <0.22593.62> closing AMQP connection <0.22593.62> (141.30.75.111:41476 -> 141.76.19.152:5671): missed heartbeats from client, timeout: 60s
2020-07-11 09:13:32.484 [warning] <0.22574.62> closing AMQP connection <0.22574.62> (141.30.75.111:41474 -> 141.76.19.152:5671, vhost: '/', user: 'scorep-elab'): client unexpectedly closed TCP connection

The Aggregator log (only data connection):

[2020-07-11 09:11:03.401007192 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.401132822 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.401246010 CEST][metricq][TRACE]: [Data connection] Completed to send 101 bytes through the socket
[2020-07-11 09:11:03.401330995 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.401455096 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.401568134 CEST][metricq][TRACE]: [Data connection] Completed to send 21 bytes through the socket
[2020-07-11 09:11:03.401653052 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.401762736 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.401900988 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.402040104 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.402151961 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.402238906 CEST][metricq][TRACE]: [Data connection] Successfully received 541 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.408529285 CEST][metricq][TRACE]: [Data connection] Consumed 115113 of 541 bytes.
[2020-07-11 09:11:03.408626532 CEST][metricq][TRACE]: [Data connection] Completed to send 72 bytes through the socket
[2020-07-11 09:11:03.408692019 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.408792483 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 7 bytes.
[2020-07-11 09:11:03.408872053 CEST][metricq][TRACE]: [Data connection] Consumed 113 of 16384 bytes.
[2020-07-11 09:11:03.408966093 CEST][metricq][TRACE]: [Data connection] Completed to send 22 bytes through the socket
[2020-07-11 09:11:03.409035163 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.409136262 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.409228842 CEST][metricq][TRACE]: [Data connection] Completed to send 113 bytes through the socket
[2020-07-11 09:11:03.409297148 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.409395942 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.409491414 CEST][metricq][TRACE]: [Data connection] Completed to send 21 bytes through the socket
[2020-07-11 09:11:03.409560919 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.409650019 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.409743820 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.409854962 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.409996884 CEST][metricq][TRACE]: [Data connection] Successfully received 16384 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.410104494 CEST][metricq][TRACE]: [Data connection] Successfully received 538 bytes through the socket. Waiting for at least 115113 bytes.
[2020-07-11 09:11:03.417882746 CEST][metricq][TRACE]: [Data connection] Consumed 115113 of 538 bytes.
[2020-07-11 09:11:03.417966672 CEST][metricq][TRACE]: [Data connection] Completed to send 69 bytes through the socket
[2020-07-11 09:11:03.418032231 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.418116327 CEST][metricq][TRACE]: [Data connection] Completed to send 22 bytes through the socket
[2020-07-11 09:11:03.418185885 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.418266849 CEST][metricq][TRACE]: [Data connection] Completed to send 113 bytes through the socket
[2020-07-11 09:11:03.418336382 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:03.418416807 CEST][metricq][TRACE]: [Data connection] Completed to send 21 bytes through the socket
[2020-07-11 09:11:03.418485428 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:33.418568258 CEST][metricq][TRACE]: [Data connection] Sending heartbeat to server
[2020-07-11 09:11:33.418870960 CEST][metricq][TRACE]: [Data connection] Completed to send 8 bytes through the socket
[2020-07-11 09:11:33.419029837 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:11:34.055835506 CEST][metricq][TRACE]: [Data connection] Successfully received 8 bytes through the socket. Waiting for at least 7 bytes.
[2020-07-11 09:11:34.056048599 CEST][metricq][TRACE]: [Data connection] Received heartbeat from server
[2020-07-11 09:11:34.056201686 CEST][metricq][TRACE]: [Data connection] Consumed 8 of 8 bytes.
[2020-07-11 09:12:03.419137751 CEST][metricq][TRACE]: [Data connection] Sending heartbeat to server
[2020-07-11 09:12:03.419429523 CEST][metricq][TRACE]: [Data connection] Completed to send 8 bytes through the socket
[2020-07-11 09:12:03.419588943 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:12:04.056685559 CEST][metricq][TRACE]: [Data connection] Successfully received 8 bytes through the socket. Waiting for at least 7 bytes.
[2020-07-11 09:12:04.056900847 CEST][metricq][TRACE]: [Data connection] Received heartbeat from server
[2020-07-11 09:12:04.057053882 CEST][metricq][TRACE]: [Data connection] Consumed 8 of 8 bytes.
[2020-07-11 09:12:33.419690817 CEST][metricq][TRACE]: [Data connection] Sending heartbeat to server
[2020-07-11 09:12:33.419971803 CEST][metricq][TRACE]: [Data connection] Completed to send 8 bytes through the socket
[2020-07-11 09:12:33.420131313 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:12:34.057698486 CEST][metricq][TRACE]: [Data connection] Successfully received 8 bytes through the socket. Waiting for at least 7 bytes.
[2020-07-11 09:12:34.057911503 CEST][metricq][TRACE]: [Data connection] Received heartbeat from server
[2020-07-11 09:12:34.058062466 CEST][metricq][TRACE]: [Data connection] Consumed 8 of 8 bytes.
[2020-07-11 09:13:03.420223259 CEST][metricq][TRACE]: [Data connection] Sending heartbeat to server
[2020-07-11 09:13:03.420512489 CEST][metricq][TRACE]: [Data connection] Completed to send 8 bytes through the socket
[2020-07-11 09:13:03.420671144 CEST][metricq][TRACE]: [Data connection] Schedule heartbeat timer in flush callback
[2020-07-11 09:13:04.058858528 CEST][metricq][TRACE]: [Data connection] Successfully received 8 bytes through the socket. Waiting for at least 7 bytes.
[2020-07-11 09:13:04.059063837 CEST][metricq][TRACE]: [Data connection] Received heartbeat from server
[2020-07-11 09:13:04.059221648 CEST][metricq][TRACE]: [Data connection] Consumed 8 of 8 bytes.
[2020-07-11 09:13:32.483074157 CEST][metricq][ERROR]: [Data connection] read failed: End of file

@bmario
Copy link
Member Author

bmario commented Jul 24, 2020

Looks like an SSL issue. Will try with snakeoil.

@bmario
Copy link
Member Author

bmario commented Feb 2, 2021

The issue wasn't seen in the wild for quite some time. Possibly an update of RabbitMQ, MetricQ, asio fixed a thing here.
Naturally, I'm going to close this for now.

@bmario bmario closed this as completed Feb 2, 2021
@bmario
Copy link
Member Author

bmario commented Apr 16, 2021

Looks like the undead is back.

Apr 16 11:54:42 igel metricq-db-hta[91234]: [2021-04-16 11:54:42.892137934 CEST][metricq][ERROR]: [Data connection] write failed: stream truncated
Apr 16 11:54:42 igel metricq-db-hta[91234]: [2021-04-16 11:54:42.893894823 CEST][metricq][ERROR]: data channel error: stream truncated
Apr 16 11:55:12 igel metricq-db-hta[91234]: [2021-04-16 11:55:12.005071924 CEST][metricq][ INFO]: sink data queue consume finalize
Apr 16 11:55:12 igel metricq-db-hta[91234]: [2021-04-16 11:55:12.005162176 CEST][metricq][ INFO]: sink history queue consume finalize
Apr 16 11:55:12 igel metricq-db-hta[91234]: [2021-04-16 11:55:12.007889857 CEST][ERROR]: Unhandled exception: ConnectionHandler::onError: write failed

@bmario bmario reopened this Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant