Stress testing causes PDBGroupMonitor to segfault #64

Open
akete opened this issue Jan 23, 2025 · 5 comments

@akete

akete commented Jan 23, 2025

Background: this was first observed with a custom-built IOC that uses our custom record type support, with Phoebus CSS as the client. Later, I was able to reproduce the problem entirely in softIocPVA with multiple pvget -m clients.

Problem

With QSRV grouping multiple PVs into an NTTable (complete setup attached), and with either a CSS screen displaying all components in an X/Y Plot or multiple pvget -m clients (*), the IOC crashes with:

Core was generated by `../../softIocPVA -D /test/softIoc/dbd/softIocPVA.dbd st.cmd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  epicsMutex::lock (this=this@entry=0x48) at ../osi/epicsMutex.cpp:276
276	../osi/epicsMutex.cpp: No such file or directory.
[Current thread is 1 (Thread 0x7f3fe75fa700 (LWP 23040))]
(gdb) bt
#0  epicsMutex::lock (this=this@entry=0x48) at ../osi/epicsMutex.cpp:276
#1  0x000055e51b191a62 in epicsGuard<epicsMutex>::epicsGuard (mutexIn=..., this=<synthetic pointer>) at /build/3rdparty/epics/base-7/include/epicsGuard.h:143
#2  PDBGroupMonitor::requestUpdate (this=0x7f3fc800d630) at ../pdbgroup.cpp:472
#3  0x000055e51b18c6d9 in BaseMonitor::release (this=<optimized out>, elem=...) at ../../common/pvahelper.h:338
#4  0x000055e51b26e999 in epics::pvAccess::MonitorElement::Ref::reset (this=0x7f3fe75f9ab0) at ../../src/client/pv/monitor.h:186
#5  epics::pvAccess::ServerMonitorRequesterImpl::send (this=0x7f3fc800d450, buffer=<optimized out>, control=<optimized out>) at ../../src/server/responseHandlers.cpp:2039
#6  0x000055e51b1f1db0 in epics::pvAccess::detail::AbstractCodec::processSender (this=this@entry=0x7f3fd0000b20, sender=std::shared_ptr<epics::pvAccess::TransportSender> (use count 3, weak count 2) = {...}) at ../../src/remote/codec.cpp:884
#7  0x000055e51b1f31ec in epics::pvAccess::detail::AbstractCodec::processSendQueue (this=0x7f3fd0000b20) at ../../src/remote/codec.cpp:844
#8  0x000055e51b1f37e5 in epics::pvAccess::detail::AbstractCodec::processWrite (this=this@entry=0x7f3fd0000b20) at ../../src/remote/codec.cpp:754
#9  0x000055e51b1f6438 in epics::pvAccess::detail::BlockingTCPTransportCodec::sendThread (this=0x7f3fd0000b20) at ../../src/remote/codec.cpp:1157
#10 0x000055e51b3d9899 in epicsThreadCallEntryPoint (pPvt=0x7f3fd0000ca0) at ../osi/epicsThread.cpp:95
#11 0x000055e51b3dea7a in start_routine (arg=0x7f3fd001a9b0) at ../osi/os/posix/osdThread.c:442
#12 0x00007f40310666db in start_thread (arg=0x7f3fe75fa700) at pthread_create.c:463
#13 0x00007f402fdfb61f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Steps to reproduce

  • start softIocPVA serving the structured PV,
  • start triggering PV processing as fast as possible, e.g. while true; do pvput ....; done (sketched below), and keep the loop running until the IOC crashes,
  • start multiple (7+) monitoring clients or open the GUI (see the image below or the attached .bob file),
  • wait for a minute,
  • kill the clients or close the GUI.
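
For illustration, the write loop and the monitoring clients can be as simple as the following (myioc:hist:trigger is a placeholder name; write to whichever record triggers processing of the group in your setup):

$ while true; do pvput myioc:hist:trigger 1; done                    # terminal 1: hammer the group
$ for i in $(seq 8); do pvget -m myioc:hist:pva > /dev/null & done   # terminal 2: 7+ monitors
$ sleep 60 && kill -9 $(jobs -p)                                     # then kill them all at once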

Test setup

NTTable_example.zip

softIocPVA with PV structure:

$ pvget -v  myioc:hist:pva | cut -c 1-50
myioc:hist:pva epics:nt/NTTable:1.0 
    structure record
        structure _options
            uint queueSize 0
            boolean atomic true
    alarm_t alarm 
        int severity 0
        int status 0
        string message NO_ALARM
    structure timeStamp
        long secondsPastEpoch 1737632421
        int nanoseconds 721458675
        int userTag 0
    structure value
        uint[] a [1,0,1,1,1,1,0,1,3,0,0,2,0,1,1,0,
        uint[] b [0,0,2,1,0,0,1,0,0,0,0,1,0,2,4,0,
        uint[] c [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        uint[] d [0,1,0,0,0,2,2,2,0,1,1,0,0,0,1,0,
        uint[] e [1,0,0,1,3,0,0,1,0,0,3,1,1,0,0,0,
        uint[] f [0,1,2,0,0,1,1,0,0,1,1,0,2,1,1,0,
        uint[] g [0,0,2,1,0,0,1,0,0,0,0,1,0,2,4,0,

CSS GUI:

(*) I was able to reproduce this using 7 pvget -m clients and killing them (kill -9) at the same time after a while.

Initial analysis

By adding additional traces to modules/pva2pva/pdbApp/pdbgroup.cpp, I was able to confirm that PDBGroupMonitor::release() is being called after PDBGroupMonitor::destroy() has already been called. This happens once the IOC becomes overwhelmed and calls to PDBGroupMonitor::release() keep arriving for the already-destroyed monitor.
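
For illustration only (not the actual pdbgroup.cpp code; std::mutex and std::shared_ptr stand in for epicsMutex and the PVA smart pointers), the sketch below shows the suspected shape of the problem, a late release() reaching a monitor whose storage may already be gone, and one way such a call could be made harmless by letting the element hold only a weak reference to its monitor. The small faulting this (0x48) in frame #0 also looks consistent with the mutex member being reached through a stale or null monitor pointer.

#include <cstdio>
#include <memory>
#include <mutex>

// Illustrative sketch only; the names are hypothetical stand-ins for
// PDBGroupMonitor / MonitorElement, not the real pva2pva classes.
struct GroupMonitor {
    std::mutex lock;            // stands in for the epicsMutex locked in frame #0
    bool running = true;

    void destroy() {            // client connection going away
        std::lock_guard<std::mutex> g(lock);
        running = false;        // the real destroy() also drops queues and record refs
    }
    void requestUpdate() {      // what a release() on the sender thread ends up calling
        std::lock_guard<std::mutex> g(lock);   // only safe if *this is still alive
        if (!running) return;
        std::puts("re-arm monitor");
    }
};

struct Element {
    // Holding only a weak_ptr means a late release() can no longer reach a
    // monitor that has already been deleted on another thread.
    std::weak_ptr<GroupMonitor> owner;

    void release() {
        if (std::shared_ptr<GroupMonitor> mon = owner.lock())
            mon->requestUpdate();
        // else: the monitor is gone; drop the element instead of crashing
    }
};

int main() {
    std::shared_ptr<GroupMonitor> mon = std::make_shared<GroupMonitor>();
    Element elem{mon};

    mon->destroy();
    elem.release();     // harmless: monitor still alive, just stopped
    mon.reset();        // last strong reference dropped, monitor freed
    elem.release();     // late release is now a no-op instead of a segfault
}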

@ralphlange
Contributor

Is there any chance you could test this setup against QSRV2/PVXS in the IOC?

@akete
Author

akete commented Jan 24, 2025

Using the above setup adapted for softIocPVX, I was not able to reproduce this problem (tried with 20 clients). The IOC reports the client disconnects and does not crash:

epics> 
epics> 2025-01-24T08:50:30.158497784 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:37828 closed with socket error 104 : Connection reset by peer
2025-01-24T08:50:30.176984026 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48118 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188412817 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48026 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188547299 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48142 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188602057 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48060 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188656139 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:60618 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188714453 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48208 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188752460 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48140 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188814922 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48152 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188862806 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48102 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188913788 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48160 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.188955986 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48116 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.189003952 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48024 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.189050189 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48130 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.189079446 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48056 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.189120974 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48206 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.189166080 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48182 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.189207024 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48086 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.189250416 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48076 closed with socket error 32 : Broken pipe
2025-01-24T08:50:30.189291046 ERR pvxs.tcp.io connection to Client [::ffff:10.0.1.106]:48028 closed with socket error 32 : Broken pipe

epics> 

@mdavidsaver
Member

Using the above setup adapted for softIocPVX, I was not able to reproduce this problem ...

Race conditions / ownership bugs like this one are what motivated the creation of PVXS and QSRV2, so I would recommend migrating your IOCs.

  • kill clients or close GUI

I think this makes the circumstance clear enough.

If practical, please attach the result of running thread apply all backtrace in gdb to print the stack traces of all threads. This crash is likely the result of a race condition between the sender thread (shown) and the receiver thread (not shown) servicing one TCP socket. Knowing the state of the receiver thread would help unravel where the shared_ptr ownership is going wrong.
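
For reference, one way to capture that from the existing core file is something like the following (the core file name and the logging syntax vary; newer gdb spells it set logging enabled on):

$ gdb ../../softIocPVA core
(gdb) set logging file all_threads.log
(gdb) set logging on
(gdb) thread apply all backtrace
(gdb) set logging off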

@akete
Author

akete commented Jan 29, 2025

File with stack traces from all threads: all_threads.log

@mdavidsaver
Member

Well this is puzzling. The associated receiveThread() is still running, and apparently idly blocking in recv().

There is also a second pair of threads for another client where both threads also appear to be idle.

So whatever corruption there is has already occurred by this point.

@akete If you are feeling adventurous, or are interested in learning about a neat Linux-only tool: the rr reverse-execution debugger can be very effective at finding these sorts of use-after-free errors. Run to the point of the crash, set a watchpoint on this, and run backwards to the point where it is freed.
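
For illustration, a session for this crash might look roughly like the following (addresses are specific to one run; the watched address here is just the PDBGroupMonitor pointer from frame #2 of the backtrace above):

$ rr record ../../softIocPVA -D /test/softIoc/dbd/softIocPVA.dbd st.cmd   # reproduce the crash under rr
$ rr replay                            # replays the recorded run inside gdb
(rr) continue                          # run forward to the SIGSEGV
(rr) frame 2                           # PDBGroupMonitor::requestUpdate, note the value of 'this'
(rr) watch -l *(void **)0x7f3fc800d630 # use the 'this' from frame 2 of your run
(rr) reverse-continue                  # run backwards to where that memory was last written/freed
(rr) bt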

I might also suggest valgrind, but in this situation I think the performance hit of full emulation might change behavior.
