Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert error in VBP_Control(): Condition(vt->heap_idx == VBH_NOIDX) not true. #4108

Closed
delthas opened this issue May 21, 2024 · 6 comments · Fixed by #4115
Closed

Assert error in VBP_Control(): Condition(vt->heap_idx == VBH_NOIDX) not true. #4108

delthas opened this issue May 21, 2024 · 6 comments · Fixed by #4115

Comments

@delthas
Copy link
Contributor

delthas commented May 21, 2024

I'm using Varnish 7.5.0 and the 7.5 branch of libvmod-dynamic.

I'm stressing a Varnish with a simple VCL with many dynamic backends with probes, reloading and restarting many times over multiples hours and gathering any issues. I've had one repeated panic:

Assert error in VBP_Control(), cache/cache_backend_probe.c line 661:
  Condition(vt->heap_idx == VBH_NOIDX) not true.
Backtrace:
  0x55fb3cd793c7: /usr/sbin/varnishd(+0x5d3c7) [0x55fb3cd793c7]        // pan_ic
  0x55fb3cdf9dc5: /usr/sbin/varnishd(VAS_Fail+0x45) [0x55fb3cdf9dc5]   // VAS_Fail
  0x55fb3cd4f703: /usr/sbin/varnishd(+0x33703) [0x55fb3cd4f703]        // VBP_Control
  0x55fb3cd88dbb: /usr/sbin/varnishd(+0x6cdbb) [0x55fb3cd88dbb]        // vcl_BackendEvent
  0x55fb3cd8a335: /usr/sbin/varnishd(+0x6e335) [0x55fb3cd8a335]        // vcl_set_state
  0x55fb3cd8b369: /usr/sbin/varnishd(+0x6f369) [0x55fb3cd8b369]        // vcl_cli_load
  0x55fb3cdfc405: /usr/sbin/varnishd(+0xe0405) [0x55fb3cdfc405]        // cls_exec
  0x55fb3cdfc9f3: /usr/sbin/varnishd(VCLS_Poll+0x333) [0x55fb3cdfc9f3]
  0x55fb3cd55a94: /usr/sbin/varnishd(CLI_Run+0x64) [0x55fb3cd55a94]
  0x55fb3cd743ba: /usr/sbin/varnishd(child_main+0x22a) [0x55fb3cd743ba]
  0x55fb3cdbe7fd: /usr/sbin/varnishd(+0xa27fd) [0x55fb3cdbe7fd]
  0x55fb3cdbf38c: /usr/sbin/varnishd(+0xa338c) [0x55fb3cdbf38c]
  0x55fb3cdbf56e: /usr/sbin/varnishd(+0xa356e) [0x55fb3cdbf56e]
  0x55fb3cdfe4c6: /usr/sbin/varnishd(VEV_Once+0x1e6) [0x55fb3cdfe4c6]
  0x55fb3cdfe6c8: /usr/sbin/varnishd(VEV_Loop+0x28) [0x55fb3cdfe6c8]
  0x55fb3cd48071: /usr/sbin/varnishd(main+0x2001) [0x55fb3cd48071]
  0x7fdb47bd1b45: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fdb47bd1b45]
  0x55fb3cd48711: /usr/sbin/varnishd(+0x2c711) [0x55fb3cd48711]

This looks very close to #3957 (itself fixing nigoroll/libvmod-dynamic#100), but even with a version including the fix, I'm getting this assert. In that PR the crash is Assert error in VBP_Remove(): Condition(vt->heap_idx == VBH_NOIDX) not true., here it's Assert error in VBP_Control(): Condition(vt->heap_idx == VBH_NOIDX) not true., so in a slightly different place.

@delthas
Copy link
Contributor Author

delthas commented May 21, 2024

cc @nigoroll

@nigoroll
Copy link
Member

Yes, I think it basically is the same race as the one closed in #3957

@delthas
Copy link
Contributor Author

delthas commented May 24, 2024

Does that mean the fix would just be the following?

--- a/bin/varnishd/cache/cache_backend_probe.c
+++ b/bin/varnishd/cache/cache_backend_probe.c
@@ -658,6 +658,9 @@ VBP_Control(const struct backend *be, int enable)
 
        Lck_Lock(&vbp_mtx);
        if (enable) {
+               if (vt->heap_idx != VBH_NOIDX) {
+                       VBH_delete(vbp_heap, vt->heap_idx);
+               }
                assert(vt->heap_idx == VBH_NOIDX);
                vt->due = VTIM_real();
                vbp_heap_insert(vt);

@nigoroll
Copy link
Member

Does that mean the fix would just be the following?

FTR: @delthas reported on IRC that the patch avoids the panic, which is helpful information, thank you.

@nigoroll
Copy link
Member

nigoroll commented Jun 2, 2024

@delthas could you share more details about your tests, like, for example, a script you are using?

nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jun 3, 2024
Every time I looked at the probe code, my mind ended up twisted and confused. A
probe could change the "enabled" state (tracking the temperature) and be removed
at any time (unless the mtx is held), yet the code did not seem to reflect this.

We un-twist my mind by completing the transition to probe states and adding a
chain of two states for the case that a probe is controlled/deleted while its
task is running:

cooling: running probe disabled
deleted: running probe removed (while cooling only)

With this new scheme, we can now have (I think) a clean state diagram (see dot
file):

- a probe begins in the cold state
- from cold, it can either get removed or scheduled via VBP_Control()
- from scheduled, it can go back to cold (via VBP_Control()) or
  be picked up by vbp_thread() to change to running (aka task started)
- once the task finishes, it normally goes back to scheduled,
  but in the meantime it could have changed to cooling or deleted,
  so vbp_task_comple() hadles these cases and either transitions to cold
  or deletes the probe
- if the task can not be scheduled, the same handling happens

We now also remove running probes from the binheap to remove complexity.

Fixes varnishcache#4108 for good
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jun 3, 2024
Every time I looked at the probe code, my mind ended up twisted and confused. A
probe could change the "enabled" state (tracking the temperature) and be removed
at any time (unless the mtx is held), yet the code did not seem to reflect this.

We un-twist my mind by completing the transition to probe states and adding a
chain of two states for the case that a probe is controlled/deleted while its
task is running:

cooling: running probe disabled
deleted: running probe removed (while cooling only)

With this new scheme, we can now have (I think) a clean state diagram (see dot
file):

- a probe begins in the cold state
- from cold, it can either get removed or scheduled via VBP_Control()
- from scheduled, it can go back to cold (via VBP_Control()) or
  be picked up by vbp_thread() to change to running (aka task started)
- once the task finishes, it normally goes back to scheduled,
  but in the meantime it could have changed to cooling or deleted,
  so vbp_task_comple() hadles these cases and either transitions to cold
  or deletes the probe
- if the task can not be scheduled, the same handling happens

We now also remove running probes from the binheap to remove complexity.

Fixes varnishcache#4108 for good
@nigoroll
Copy link
Member

nigoroll commented Jun 3, 2024

@delthas could you try #4115 ?

nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jun 3, 2024
Every time I looked at the probe code, my mind ended up twisted and confused. A
probe could change the "enabled" state (tracking the temperature) and be removed
at any time (unless the mtx is held), yet the code did not seem to reflect this.

We un-twist my mind by completing the transition to probe states and adding a
chain of two states for the case that a probe is controlled/deleted while its
task is running:

cooling: running probe disabled
deleted: running probe removed (while cooling only)

With this new scheme, we can now have (I think) a clean state diagram (see dot
file):

- a probe begins in the cold state
- from cold, it can either get removed or scheduled via VBP_Control()
- from scheduled, it can go back to cold (via VBP_Control()) or
  be picked up by vbp_thread() to change to running (aka task started)
- once the task finishes, it normally goes back to scheduled,
  but in the meantime it could have changed to cooling or deleted,
  so vbp_task_comple() hadles these cases and either transitions to cold
  or deletes the probe
- if the task can not be scheduled, the same handling happens

We now also remove running probes from the binheap to remove complexity.

Fixes varnishcache#4108 for good
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jul 8, 2024
Every time I looked at the probe code, my mind ended up twisted and confused. A
probe could change the "enabled" state (tracking the temperature) and be removed
at any time (unless the mtx is held), yet the code did not seem to reflect this.

We un-twist my mind by completing the transition to probe states and adding a
chain of two states for the case that a probe is controlled/deleted while its
task is running:

cooling: running probe disabled
deleted: running probe removed (while cooling only)

With this new scheme, we can now have (I think) a clean state diagram (see dot
file):

- a probe begins in the cold state
- from cold, it can either get removed or scheduled via VBP_Control()
- from scheduled, it can go back to cold (via VBP_Control()) or
  be picked up by vbp_thread() to change to running (aka task started)
- once the task finishes, it normally goes back to scheduled,
  but in the meantime it could have changed to cooling or deleted,
  so vbp_task_comple() hadles these cases and either transitions to cold
  or deletes the probe
- if the task can not be scheduled, the same handling happens

We now also remove running probes from the binheap to remove complexity.

Fixes varnishcache#4108 for good
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jul 8, 2024
Every time I looked at the probe code, my mind ended up twisted and confused. A
probe could change the "enabled" state (tracking the temperature) and be removed
at any time (unless the mtx is held), yet the code did not seem to reflect this.

We un-twist my mind by completing the transition to probe states and adding a
chain of two states for the case that a probe is controlled/deleted while its
task is running:

cooling: running probe disabled
deleted: running probe removed (while cooling only)

With this new scheme, we can now have (I think) a clean state diagram (see dot
file):

- a probe begins in the cold state
- from cold, it can either get removed or scheduled via VBP_Control()
- from scheduled, it can go back to cold (via VBP_Control()) or
  be picked up by vbp_thread() to change to running (aka task started)
- once the task finishes, it normally goes back to scheduled,
  but in the meantime it could have changed to cooling or deleted,
  so vbp_task_comple() hadles these cases and either transitions to cold
  or deletes the probe
- if the task can not be scheduled, the same handling happens

We now also remove running probes from the binheap to remove complexity.

Fixes varnishcache#4108 for good
nigoroll added a commit to nigoroll/varnish-cache that referenced this issue Jul 15, 2024
Every time I looked at the probe code, my mind ended up twisted and confused. A
probe could change the "enabled" state (tracking the temperature) and be removed
at any time (unless the mtx is held), yet the code did not seem to reflect this.

We un-twist my mind by completing the transition to probe states and adding a
chain of two states for the case that a probe is controlled/deleted while its
task is running:

cooling: running probe disabled
deleted: running probe removed (while cooling only)

With this new scheme, we can now have (I think) a clean state diagram (see dot
file):

- a probe begins in the cold state
- from cold, it can either get removed or scheduled via VBP_Control()
- from scheduled, it can go back to cold (via VBP_Control()) or
  be picked up by vbp_thread() to change to running (aka task started)
- once the task finishes, it normally goes back to scheduled,
  but in the meantime it could have changed to cooling or deleted,
  so vbp_task_comple() hadles these cases and either transitions to cold
  or deletes the probe
- if the task can not be scheduled, the same handling happens

We now also remove running probes from the binheap to remove complexity.

Fixes varnishcache#4108 for good
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants