Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Receiver node from "develop" branch dies due to accessing nullptr #17

Open
marco-tranzatto opened this issue Aug 12, 2021 · 4 comments · May be fixed by #18
Open

Receiver node from "develop" branch dies due to accessing nullptr #17

marco-tranzatto opened this issue Aug 12, 2021 · 4 comments · May be fixed by #18

Comments

@marco-tranzatto
Copy link

Hi @xqms

We've got an issue with the receiver node dying because of segfault (exit code -11) when using the "develop" branch.

@tomlankhorst and I tried to debug this issue and could get a backtrace using gdb (see end of the message). We were wondering if you could provide us with your feedback about this issue. This is our setup:

  • a base station runs the receiver node
  • an agent runs the sender node
  • the agent sends both tcp (sporadic messages) and udp (images and tf) topics

When the agent goes out of connection range and comes back after some minutes, we often get the receiver node dying on the base station. We suspect that Depacketizer::addPacket is called asynchronously on the same object (for example when 2 UDP packets are received) and so addPacket is called twice. But if it's called async, then it will mess up because the value of the iterator will be changing.

Do you have any hint or suggest on how we could fix this issue?

Best,
Marco

Backtrace:

#0  0x00007ffff7d6ce72 in nimbro_topic_transport::Depacketizer::handleMessagePacket (this=0x7fffffffbfb0, it=
    {id = 24190, complete = false, received_symbols = 10, packets = std::vector of length 10, capacity 10 = {std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd48a92b0}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd48b0120}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd48b06f0}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd48b0cc0}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 2, weak count 0) = {get() = 0x7fffd48b1290}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd48accd0}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd48a9e50}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd48a8140}, std::shared_ptr<nimbro_topic_transport::Packet> (empty) = {get() = 0x0}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd48aefb0}}, decoder = std::shared_ptr<WirehairCodec_t> (empty) = {get() = 0x0}}, packet=...) at /usr/include/c++/9/bits/shared_ptr_base.h:1020
#1  0x00007ffff7d6df79 in nimbro_topic_transport::Depacketizer::addPacket (this=0x7fffffffbfb0,
    packet=std::shared_ptr<nimbro_topic_transport::Packet> (use count 2, weak count 0) = {...})
    at /home/subt/catkin_ws/src/nimbro_network/nimbro_topic_transport/src/receiver/depacketizer.cpp:44
#2  0x00007ffff7d71432 in std::function<void (std::shared_ptr<nimbro_topic_transport::Packet> const&)>::operator()(std::shared_ptr<nimbro_topic_transport::Packet> const&) const (__args#0=std::shared_ptr<nimbro_topic_transport::Packet> (use count 2, weak count 0) = {...},

xqms added a commit that referenced this issue Aug 13, 2021
This could potentially trigger a segfault when all messages are older
than 5s.

Maybe related: #17
@xqms xqms linked a pull request Aug 13, 2021 that will close this issue
@xqms
Copy link
Member

xqms commented Aug 13, 2021

Hey, thanks for the report! I think you uncovered a pretty serious bug - I'm not sure how this went unnoticed for so long :/

I'm pretty sure it's not a concurrency issue, since Depacketizer::addPacket is protected with a mutex.

But there's a bug hidden in addPacket(): It first saves the iterator, and then calls pruneMessages(), which potentially removes all messages.

Could you please try the fix in #18?

marco-tranzatto pushed a commit to leggedrobotics/nimbro_network that referenced this issue Aug 13, 2021
This could potentially trigger a segfault when all messages are older
than 5s.

Maybe related: AIS-Bonn#17
@marco-tranzatto
Copy link
Author

@xqms thanks for your quick reply and fix ... amazing!

I did try the proposed fix a few times and could not reproduce the error anymore. I'm gonna try again next week with more extensive tests and let you know.

Once again, thanks a lot for the fix!
Best,
Marco

@marco-tranzatto
Copy link
Author

marco-tranzatto commented Aug 18, 2021

Hi @xqms

Unfortunately the issue happens again even after using the proposed fix. I got another backtrack:

(gdb) bt
#0  0x00007ffff7d6ce72 in nimbro_topic_transport::Depacketizer::handleMessagePacket (
    this=0x7fffffffbed0, it=
    {id = 6619, complete = false, received_symbols = 11, packets = std::vector of length 11, capacity 11 = {std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd420e940}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd420e370}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd420ef10}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd420fab0}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd4210650}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd42111f0}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd4211d90}, std::shared_ptr<nimbro_topic_transport::Packet> (use count 1, weak count 0) = {get() = 0x7fffd4212930}, std::shared_pt

FYI @tomlankhorst

@xqms
Copy link
Member

xqms commented Aug 18, 2021

That's unfortunate... Could you please compile nimbro_network in Debug mode (cmake arg -DCMAKE_BUILD_TYPE=Debug) so that we can get a more complete backtrace?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants