VPP crashing when clients are connecting/disconnecting from the API socket #3668

@okanaganrusty

Summary

VPP crashes when clients connect to and disconnect from the API socket. We can reproduce this by connecting to the Unix API socket, creating a govpp l2fib events watcher, and then restarting our agent, which subscribes to these l2fib events.

The crash occurs in the function vl_mem_api_can_send in memory_shared.c at line 795. It is reached through vl_api_can_send_msg when l2fib_scan checks whether VPP can send a message to a client. The gdb session below shows that the message queue pointer (q) is NULL at that point, so dereferencing it causes a segmentation fault. To capture this, we attached gdb to the VPP process and set a breakpoint at line 781 in memory_shared.c, shortly before the crashing line, so that the program state and the variables involved could be inspected when the crash happens.
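
The failure mode described above can be illustrated outside of VPP with a minimal sketch. The type `demo_queue_t` and the `demo_*` functions are hypothetical stand-ins, not VPP code; only the two fields read by the check are modelled, and the real svm_queue_t is assumed to have more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for VPP's svm_queue_t; only the two fields
   read by the capacity check are modelled here. */
typedef struct
{
  int cursize; /* elements currently queued */
  int maxsize; /* queue capacity */
} demo_queue_t;

/* Mirrors the unguarded check: q is dereferenced unconditionally,
   so passing NULL would fault, as in frame #0 of the backtrace. */
static int
demo_can_send_unguarded (demo_queue_t *q)
{
  return (q->cursize < q->maxsize);
}

/* Guarded variant: a missing queue is reported as "cannot send". */
static int
demo_can_send_guarded (demo_queue_t *q)
{
  return q ? (q->cursize < q->maxsize) : 0;
}

/* Exercises only the safe paths; the unguarded function is never
   called with NULL here. */
static void
demo_self_check (void)
{
  demo_queue_t q = { .cursize = 1, .maxsize = 4 };

  assert (demo_can_send_unguarded (&q) == 1);
  assert (demo_can_send_guarded (&q) == 1);
  assert (demo_can_send_guarded (NULL) == 0);
}
```

Calling `demo_self_check ()` passes all three assertions. Calling `demo_can_send_unguarded (NULL)` would be expected to fault in the same way as the reported crash.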

Steps to Reproduce

  1. Start VPP with the necessary configuration to enable API socket communication.
  2. Connect to the VPP API socket using a client (e.g., govpp).
  3. Create an l2fib events watcher in the client.
  4. Restart the client to trigger disconnection and reconnection to the API socket.
  5. Create two VNFs in the same bridge domain and give them both the same IP address and the same virtual MAC. Have both VNFs constantly send ping requests so that they generate a steady stream of l2fib path updates.

Environment

  • VPP Version: vpp v24.02
  • OS: Flatcar 4230.2.4
  • Kernel 6.6.110-flatcar
  • Container Runtime: containerd 1.7.23
  • Kubernetes: v1.32.2

GDB Output

Attaching to process 42
[New LWP 43]
[New LWP 44]
[New LWP 45]
[New LWP 46]
[New LWP 47]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f0fc38d0bd8 in epoll_pwait () from /lib/x86_64-linux-gnu/libc.so.6
Breakpoint 1 at 0x7f0fc3da3cf4: file /root/vpp/src/vlibapi/memory_shared.c, line 781.
Continuing.

Thread 1 "vpp_main" received signal SIGSEGV, Segmentation fault.
vl_mem_api_can_send (q=0x0) at /root/vpp/src/vlibapi/memory_shared.c:795
795     /root/vpp/src/vlibapi/memory_shared.c: No such file or directory.

(gdb) bt full
#0  vl_mem_api_can_send (q=0x0) at /root/vpp/src/vlibapi/memory_shared.c:795
No locals.
#1  0x00007f0fc3f08b10 in vl_api_can_send_msg (rp=0x7f0f444bc878) at /root/vpp/src/vlibmemory/api.h:53
No locals.
#2  l2fib_scan (vm=0x7f0f40000700, start_time=<optimized out>, event_only=1 '\001') at /root/vpp/src/vnet/l2/l2_fib.c:1291
        last_start = <optimized out>
        accum_t = <optimized out>
        delta_t = <optimized out>
        evt_idx = <optimized out>
        learn_count = <optimized out>
        client = 1
        cl_idx = 117440640
        mp = 0x1301c7810
        reg = 0x7f0f444bc878
        i = <optimized out>
        j = <optimized out>
        k = <optimized out>
        bd_index = <optimized out>
        fm = <optimized out>
        lm = <optimized out>
        h = <optimized out>
        v = <optimized out>
        b = <optimized out>
        doublebreak = <optimized out>
        b = <optimized out>
        v = <optimized out>
        kv = <optimized out>
        key = <optimized out>
        result = <optimized out>
        bd_index = <optimized out>
        sw_if_index = <optimized out>
        sn = <optimized out>
        bd_config = <optimized out>
        delta = <optimized out>
        age_out = <optimized out>
        kv = <optimized out>
#3  l2fib_mac_age_scanner_process (vm=0x7f0f40000700, rt=<optimized out>, f=<optimized out>) at /root/vpp/src/vnet/l2/l2_fib.c:1363
        scan = <optimized out>
        SCAN_MAC_AGE = SCAN_MAC_AGE
        SCAN_MAC_EVENT = SCAN_MAC_EVENT
        SCAN_DISABLE = SCAN_DISABLE
        event_data = 0x7f0f443af618
        enabled = <optimized out>
        next_age_scan_time = <optimized out>
        start_time = <optimized out>
        event_type = <optimized out>
        fm = <optimized out>
        lm = <optimized out>
#4  0x00007f0fc3ba9557 in vlib_process_bootstrap (_a=<optimized out>) at /root/vpp/src/vlib/main.c:1221
        a = <optimized out>
        vm = 0x3
        p = 0x7f0f405e7980
        f = 0x3c
        node = 0x7f0f405e7980
        n = <optimized out>
#5  0x00007f0fc3b2d4f4 in clib_calljmp () at /root/vpp/src/vppinfra/longjmp.S:123
No locals.
#6  0x00007f0fc2ddfd50 in ?? ()
No symbol table info available.
#7  0x00007f0fc3ba011a in vlib_process_startup (vm=0x7f0f40000700, p=0x7f0f405e7980, f=0x0) at /root/vpp/src/vlib/main.c:1246
        a = {vm = 0x737365, process = 0x0, frame = 0x0}
        r = 139703545623496
#8  dispatch_process (vm=0x7f0f40000700, p=0x7f0f405e7980, f=0x0, last_time_stamp=4975705906266793) at /root/vpp/src/vlib/main.c:1302
        nm = 0x7f0f40000858
        node_runtime = 0x7f0f405e7980
        node = 0x7f0f405e7830
        t = 4975705906266793
        old_process_index = 4294967295
        n_vectors = <optimized out>
        is_suspend = <optimized out>
#9  0x0000000000000000 in ?? ()
No symbol table info available.

Interim Workaround

With the patch below applied, the SIGSEGV crashes caused by the null pointer dereference no longer occur. Further investigation is ongoing. Although the patch addresses the immediate crash, there may be additional issues, such as missed L2FIB event notifications during L2FIB scans.

diff --git a/src/vlibapi/memory_shared.c b/src/vlibapi/memory_shared.c
index a2ea50deb..22772cb93 100644
--- a/src/vlibapi/memory_shared.c
+++ b/src/vlibapi/memory_shared.c
@@ -778,7 +778,7 @@ vl_msg_api_send_shmem (svm_queue_t * q, u8 * elem)
 int
 vl_mem_api_can_send (svm_queue_t * q)
 {
-  return (q->cursize < q->maxsize);
+  return q ? (q->cursize < q->maxsize) : 0;
 }
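
One way to make the missed-notification concern visible is to count events that are skipped because the client's queue is missing or full. This is a sketch under stated assumptions, not the VPP implementation: `evt_queue_t`, `evt_registration_t`, `evt_stats_t`, and `evt_try_send` are hypothetical names introduced for illustration, and the enqueue is simulated by incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real VPP types. */
typedef struct
{
  int cursize; /* elements currently queued */
  int maxsize; /* queue capacity */
} evt_queue_t;

typedef struct
{
  evt_queue_t *q; /* NULL once the client's queue is gone (assumption) */
} evt_registration_t;

typedef struct
{
  unsigned sent;             /* events handed to the queue */
  unsigned skipped_no_queue; /* registration or queue was NULL */
  unsigned skipped_full;     /* queue had no free slot */
} evt_stats_t;

/* Try to deliver one event. Returns 1 if it was queued, 0 if it was
   skipped; every skip is recorded so dropped events can be observed. */
static int
evt_try_send (evt_registration_t *reg, evt_stats_t *st)
{
  if (reg == NULL || reg->q == NULL)
    {
      st->skipped_no_queue++;
      return 0;
    }
  if (reg->q->cursize >= reg->q->maxsize)
    {
      st->skipped_full++;
      return 0;
    }
  reg->q->cursize++; /* stands in for the real enqueue */
  st->sent++;
  return 1;
}
```

With counters like these, a restart of the client would show up as a non-zero `skipped_no_queue` value instead of a crash or a silent drop.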
