Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rclcpp] C++ lib leaking rclcpp::Node() can cause segfaults at dlclose() #2567

Open
EricCousineau-TRI opened this issue Jun 21, 2024 · 3 comments

Comments

@EricCousineau-TRI
Copy link

Bug report

Required Info:

  • Operating System: Ubuntu 22.04
  • Installation type: binaries
  • Version or commit hash: humble
  • DDS implementation: rmw_cyclonedds_cpp
  • Client library (if applicable): rclcpp

Steps to reproduce issue

See https://github.com/EricCousineau-TRI/rclcpp_dlclose_segfault

Expected behavior

dlclose() does not trigger race conditions with threads -- ideally, shutdowns existing workers, or tries to order atexit operations manually.

Actual behavior

dlclose() appears to trigger race conditions when user leaks rclcpp::Node.

Additional information

See README in https://github.com/EricCousineau-TRI/rclcpp_dlclose_segfault

@EricCousineau-TRI
Copy link
Author

Inline copy of README:

Relates to Drake, drake-ros, and pydrake.

Due to a workaround in pydrake RobotLocomotion/drake#14356,
we may not always release certain objects. (Non-ideal, but will take some effort to fix.)

If we have components that use rclcpp::Node, but are not deleted, we can encounter segfaults.

Example minimal code (from src/ros_segfault_min_lib.cc) capturing the core issue:

// Expected case: rclcpp::Node will be freed.
void InitNode() {
  RclcppInit();
  rclcpp::Node("node");
}

// Bad case: rclcpp::Node is leaked :(
void InitAndLeakNode() {
  RclcppInit();
  new rclcpp::Node("node");
}

This shows three cases, running 200 trials, 50 in parallel

  • direct_leak: Link a shared lib directly, call InitAndLeakNode(). No segfaults.
  • dlopen_noleak: dlopen() shared lib, call InitNode(). No segfaults.
  • dlopen_leak: dlopen() lib, call InitAndLeakNode(). Many segfaults.

The ideal workaround is that dlclose() ideally does not trigger segfaults when (poorly)
leaking rclcpp::Node instances.

Output

Errors w/ dlopen_leak are non-deterministic

# Non-standard build, but just want to focus on issue.
$ source /opt/ros/humble/setup.bash
$ cmake -S . -B build
$ cmake --build build -j

$ python ./src/ros_segfault_test.py
[ direct_leak ]
num_fail: 0 / 200
returncodes: {0}

[ dlopen_noleak ]
num_fail: 0 / 200
returncodes: {0}

[ dlopen_leak ]
num_fail: 185 / 200
returncodes: {0, -11, 127}

./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_avl_swap_node
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_ehh_new
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsi_update_proxy_writer
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsi_update_proxy_reader
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: plist_fini_generic
./build/ros_segfault_min_dlopen: symbol lookup error: /opt/ros/humble/lib/x86_64-linux-gnu/libddsc.so.0: undefined symbol: ddsrt_avl_free_arg

Attaching GDB

# Terminal 1: Run background stuff to incite segfault.
$ python ./src/ros_segfault_test.py --modes dlopen_leak --count 0

# Terminal 2: Run with gdb.
$ gdb --args ./build/ros_segfault_min_dlopen ./build/libros_segfault_min_lib.so InitAndLeakNode
# show backtrace on all threads
(gdb) thread apply all bt

Example stacktrace from main thread:

Thread 1 (Thread 0x7ffff7f8c740 (LWP 207729) "ros_segfault_mi"):
#0  __GI__dl_debug_state () at ./elf/dl-debug.c:117
#1  0x00007ffff7fc5c24 in _dl_close_worker (force=force@entry=false, map=<optimized out>, map=<optimized out>) at ./elf/dl-close.c:769
#2  0x00007ffff7fc62a2 in _dl_close_worker (force=false, map=0x555555579330) at ./elf/dl-close.c:150
#3  _dl_close (_map=0x555555579330) at ./elf/dl-close.c:818
#4  0x00007ffff7d74a98 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffffd090, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:208
#5  0x00007ffff7d74b63 in __GI__dl_catch_error (objname=0x7fffffffd0e8, errstring=0x7fffffffd0f0, mallocedp=0x7fffffffd0e7, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#6  0x00007ffff7c9012e in _dlerror_run (operate=<optimized out>, args=<optimized out>) at ./dlfcn/dlerror.c:138
#7  0x00007ffff7c8fe58 in __dlclose (handle=<optimized out>) at ./dlfcn/dlclose.c:31
#8  0x00007ffff7fa021d in rcutils_unload_shared_library () from /opt/ros/humble/lib/librcutils.so
#9  0x00007ffff7ea9370 in rcpputils::SharedLibrary::~SharedLibrary() () from /opt/ros/humble/lib/librcpputils.so
#10 0x00007ffff7ed0a66 in ?? () from /opt/ros/humble/lib/librmw_implementation.so
#11 0x00007ffff7c45495 in __run_exit_handlers (status=0, listp=0x7ffff7e1a838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#12 0x00007ffff7c45610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#13 0x00007ffff7c29d97 in __libc_start_call_main (main=main@entry=0x5555555551a9 <main>, argc=argc@entry=3, argv=argv@entry=0x7fffffffd748) --Type <RET> for more, q to quit, c to continue without paging--
at ../sysdeps/nptl/libc_start_call_main.h:74
#14 0x00007ffff7c29e40 in __libc_start_main_impl (main=0x5555555551a9 <main>, argc=3, argv=0x7fffffffd748, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd738) at ../csu/libc-start.c:392
#15 0x00005555555550e5 in _start ()

@mjcarroll
Copy link
Member

Thanks for the detailed analysis and reproduction.

My initial hunch would be that the Node is holding a function pointer to something that is getting unloaded before the library in question.

Have you tried running this under asan or valgrind to see if there is more detailed stack traces?

@EricCousineau-TRI
Copy link
Author

I ran it under asan, ubsan, and valgrind, but didn't see anything obvious leap out:
EricCousineau-TRI/rclcpp_dlclose_segfault@425f602

Perhaps rcl_node_init -> pthread_create is something that could have function pointer / some other data race condition when dlcose() is triggered, like you said?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants