Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add image cache to avoid calling dladdr() and add libbacktrace elf image list refresh #674

Conversation

tiago-rodrigues
Copy link

@tiago-rodrigues tiago-rodrigues commented Nov 29, 2023

On glibc platforms, add an image cache to avoid calling dladdr() when doing offline symbol resolution, as dladdr() can cause contention with other dl* calls (see: #665 (comment)).

Add support for libbacktrace to detect new elfs have been dynamically loaded after backtrace_initialize() has been called, and consider them for further symbol resolution.

…esolution. This cache can also be used in the runtime symbol resolution case to detect we should recreate "backtrace_state" when new images have been loaded
@YaLTeR
Copy link
Contributor

YaLTeR commented Dec 2, 2023

Gave this a try and it seems to work very well! I am able to record GNOME Shell with offline symbol resolving without much extra lag seemingly.

}
};
#endif //#ifdef TRACY_USE_IMAGE_CACHE

namespace tracy
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for keeping the above code outside the tracy namespace?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, moved it under

if( !entry )
{
//printf("* addr not found: %p (%d entries) refreshing (m_numberOfRefreshes: %d)\n",
// address, m_images->size(), m_numberOfRefreshes);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove debug code or convert to TracyDebug().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


if( it != m_images->end() && address < it->m_endAddress )
{
return &(*it);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return it?

CMakeLists.txt Outdated
@@ -86,6 +86,7 @@ set_option(TRACY_NO_CRASH_HANDLER "Disable crash handling" OFF)
set_option(TRACY_TIMER_FALLBACK "Use lower resolution timers" OFF)
set_option(TRACE_CLIENT_LIBUNWIND_BACKTRACE "Use libunwind backtracing where supported" OFF)
set_option(TRACY_SYMBOL_OFFLINE_RESOLVE "Instead of full runtime symbol resolution, only resolve the image path and offset to enable offline symbol resolution" OFF)
set_option(TRACY_ENABLE_IMAGE_CACHE "On glibc platforms, when doing offline symbol resolution, use an cache to determine the image path and offset instead of dladdr()" OFF)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for keeping this as an option?

Copy link
Author

@tiago-rodrigues tiago-rodrigues Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description of that option should have included the runtime resolve case. For offline symbol resolving, having the image cache be used by default sounds ok to me.
I wasn't sure what to do with the runtime resolve case, where the image cache is used to detect new images were loaded and trigger libbacktrace state recreation. Maybe the option should be renamed something like "TRACY_ENABLE_LIBBACKTRACE_DYNAMIC_LOADED_IMAGES" and only control enabling that code for the runtime resolve case. It could then be disable by default as maybe not too many users needed (if they don't do dynamic loading of shared libs) and also knowing that it comes with libbacktrace state leaking?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every program, even helloworldware, will do dynamic loading of shared libs. For example, when I put a breakpoint on elf_add in libbacktrace in the tracy test application, I see hits on /proc/self/exe, /usr/lib/libdebuginfod.so.1, /usr/lib/libstdc++.so.6, /usr/lib/libm.so.6, /usr/lib/libgcc_s.so.1, /usr/lib/libc.so.6, /usr/lib/libcurl.so.4, /usr/lib/libelf.so.1, /lib64/ld-linux-x86-64.so.2, and so on.

The call stack for each of these library loads is:

#0  tracy::elf_add (state=0x7fffee270080, filename=0x7fffee260120 "/usr/lib/libgcc_s.so.1", descriptor=232, memory=0x0, memory_size=0, base_address=140737350017024, error_callback=0x555555603dd0 <tracy::CallstackErrorCb(void*, char const*, int)>, data=0x0, fileline_fn=0x7fffef27e608, found_sym=0x7fffef27e754, found_dwarf=0x7fffef27e604, fileline_entry=0x0, exe=0, debuginfo=0, with_buildid_data=0x0, with_buildid_size=0) at ../public/libbacktrace/elf.cpp:6578
#1  0x000055555560f2a1 in tracy::phdr_callback (info=0x7fffee3c00d0, pdata=0x7fffef27e708) at ../public/libbacktrace/elf.cpp:7411
#2  0x000055555560c80c in tracy::backtrace_initialize (state=0x7fffee270080, filename=0x555555633d60 "/proc/self/exe", descriptor=232, error_callback=0x555555603dd0 <tracy::CallstackErrorCb(void*, char const*, int)>, data=0x0, fileline_fn=0x7fffef27e7f8) at ../public/libbacktrace/elf.cpp:7459
#3  0x0000555555609bc8 in tracy::fileline_initialize (state=0x7fffee270080, error_callback=0x555555603dd0 <tracy::CallstackErrorCb(void*, char const*, int)>, data=0x0) at ../public/libbacktrace/fileline.cpp:264
#4  0x000055555560340d in tracy::backtrace_pcinfo (state=0x7fffee270080, pc=93824992316766, callback=0x555555603780 <tracy::CallstackDataCb(void*, unsigned long, unsigned long, char const*, int, char const*)>, error_callback=0x555555603dd0 <tracy::CallstackErrorCb(void*, char const*, int)>, data=0x0) at ../public/libbacktrace/fileline.cpp:298
#5  0x00005555555f2eea in tracy::DecodeCallstackPtr (ptr=93824992316766) at ../public/client/TracyCallstack.cpp:1075
#6  0x00005555555ede59 in tracy::Profiler::HandleSymbolQueueItem (this=0x555555644700 <tracy::s_profiler>, si=...) at ../public/client/TracyProfiler.cpp:3266
#7  0x00005555555f4b3a in tracy::Profiler::SymbolWorker (this=0x555555644700 <tracy::s_profiler>) at ../public/client/TracyProfiler.cpp:3388
#8  0x000055555562afc5 in tracy::Profiler::LaunchSymbolWorker (ptr=0x555555644700 <tracy::s_profiler>) at ../public/common/../client/TracyProfiler.hpp:796
#9  0x000055555562c6ad in tracy::Thread::Launch (ptr=0x7ffff63600e0) at ../public/common/../client/TracyThread.hpp:80

Note that DecodeCallstackPtr is on the stack here, so I'm not sure of your reasoning in that sentence:

This cache can also be used in the runtime symbol resolution case to detect we should recreate "backtrace_state" when new images have been loaded, as libbacktrace symbol resolution will only consider images loaded before the first symbol resolve operation.

Can you tell me why you think the libbacktrace state should be reset when a new library appears? The trace above show that the library should handle this out of the box.

Copy link
Author

@tiago-rodrigues tiago-rodrigues Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, that is strange. I'll unoptimize the code and do some stepping again in my case to make sure what I was seeing was actually correct (I'll hopefully get to it later today). But after the first symbol resolve was hit, I was missing symbol resolving for shared libraries loaded after that. Looking through the libbacktrace code I could see it was using backtrace_state::fileline_fn as the control value to already called backtrace_initialize() (which was what was responsible for the shared library indexing), and that was set to something non-null after the call. That doesn't appear to be the case for you though, maybe I misinterpreted the results in the debugger with optimized code and something else was causing the issues.

Copy link
Author

@tiago-rodrigues tiago-rodrigues Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to preface this comment, by admitting that the code path I've been testing with out main app is still using a 0.9.1 client code branch with extra cherry-picks on top (we haven't been able to move all workflows to use 0.10 just yet), so it's not really the same test with the HEAD code you did (I'll try to do this with HEAD code tomorrow), however I don't see many changes on the libbacktrace code that would have changed this behavior.
I only see it hit fileline_initialize then backtrace_initialize once on the first resolve, because it will then set state->fileline_fn at the end of fileline_initialize() and that will prevent it from calling backtrace_initialize on any subsequent calls.
Maybe you recall if something was changed between 0.9.1 and HEAD that would have allowed backtrace_initialize be called multiple times?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as I understand it, the problem you have is that shared objects that are dlopened at a later time are not visible because libbacktrace has already collected a list of shared objects that the executable is linking to and will not look for more

exactly.

This feels like something libbacktrace should handle itself (in an ideal world).

I guess before I started looking more closely at the code I didn't realize you had already modified that libbacktrace code quite a bit and it wasn't just a dump from upstream.

you already know when you want to reset libbacktrace state. Have you tried running dl_iterate_phdr to call phdr_callback instead?

Yes, libacktrace code is already doing pretty much what I was doing once, so I can try to expose an update method I can call from outside - I'll try to do this this week and update this PR.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess before I started looking more closely at the code I didn't realize you had already modified that libbacktrace code quite a bit and it wasn't just a dump from upstream.

There have been some changes that add some sorely missing features, but there have been no substantial changes. The changes to dl_iterate_phdr are technical: 7e8961d

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, libacktrace code is already doing pretty much what I was doing once, so I can try to expose an update method I can call from outside - I'll try to do this this week and update this PR.

Please check if it would be feasible to implement this update functionality within libbacktrace itself.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to dl_iterate_phdr are technical: 7e8961d

I think this was needed because elf_add can now call debuginfod, which might issue network requests, and this didn't play well with the requirements of dl_iterate_phdr.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if it would be feasible to implement this update functionality within libbacktrace itself.

I made an attempt at it in the latest commit

Tiago Rodrigues added 4 commits December 3, 2023 09:23
return ret;
}

// if we failed to obtain an entry in range, it can mean that the address map has been cahnges and new entries
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole file is marked as changed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry line endings shenanigans - should be fixed now.
The indentation in those libbacktrace files is also somewhat "special"

}
}
image->m_name = cache->m_imageName;
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is image->m_name valid after this function returns? My guess would be that the info argument is only valid inside this function, and the dlInfo provided filename will be invalidated after the next call to dladdr.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point - fixed.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this memory released?

[]( const ImageEntry& lhs, const ImageEntry& rhs ) { return lhs.m_startAddress > rhs.m_startAddress; } );
}

const ImageEntry* GetImageEntryForAddress( void* address ) const
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to GetImageForAddressImpl.

const ImageEntry* entry = GetImageEntryForAddress( address );
if( !entry )
{
Refresh();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a mechanism that will prevent constant reloads when a large set of unmapped addresses is queried?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't one atm. The expectation would be that it wouldn't actually be called that often in real scenarios. In my case with a few hundred dynamic loaded shared libs, and instrumenting all memory allocations with a callstack captures I only get it called ~5 times until it actually has the full map as symbol resolution is relatively slow in itself.
What would be your suggestion, a "min ms between refreshes" and either skip (and return unknown symbols for that period) or "stall" the resolution thread for a period or time?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stalling to allow batch processing of a larger set may be a solution. But I was more interested in whether you have considered this case. I don't think this is immediately actionable.

@@ -7366,6 +7393,12 @@ phdr_callback_mock (struct dl_phdr_info *info, size_t size ATTRIBUTE_UNUSED,
}
else ptr->dlpi_name = nullptr;
ptr->dlpi_addr = info->dlpi_addr;

// calculate the address range so we can quickly determine is a PC is within the range of this image
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.

@tiago-rodrigues tiago-rodrigues changed the title Add image cache to avoid calling dladdr() and enable detecting image loading and recreate "backtrace_state" Add image cache to avoid calling dladdr() and add libbacktrace elf image list refresh Dec 9, 2023
// so we must get it in an alternative way and cache it
if( info->dlpi_name && info->dlpi_name[0] != '\0' )
{
image->m_name = info->dlpi_name;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this assignment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point - copied it there as well.
I now free and rebuild images names with every refresh for simplicity - would probably be better if I just recreated the new image entries, although it would require searching and comparing the paths to be sure so not sure it's worth it.

@wolfpld wolfpld merged commit 9bc014b into wolfpld:master Dec 11, 2023
5 checks passed
@YaLTeR
Copy link
Contributor

YaLTeR commented Dec 12, 2023

I'm testing libunwind bt + dynload and I'm getting funny strings in the backtrace occasionally, could be memory corruption?

image

@YaLTeR
Copy link
Contributor

YaLTeR commented Dec 12, 2023

Okay, I get this without dynload too, so I'll make an issue instead.

@YaLTeR
Copy link
Contributor

YaLTeR commented Dec 12, 2023

Made an issue at #684; I somewhat suspect the latter changes in this PR though, because when I tested it in the beginning it seemed to work fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants