
test/0100-api.bats fails on recent systems #76

Open
wentasah opened this issue Nov 12, 2022 · 11 comments

@wentasah (Member)

When I run the tests on current nixos-unstable with Linux 6.0.7, the test sometimes fails as follows:

./0100-api.bats
 ✓ running api-test outside of demos-sched fails
 ✓ demos schedules first two iterations of the sub-process
 ✓ demos doesn't hang when process exits during initialization
 ✗ each process has opportunity to initialize before window scheduler starts
   (from function `assert_output' in file /nix/store/s0bj2xi9wv9axqi8amgyrcklin7vq47r-bats-with-libraries-1.8.2/share/bats/bats-assert/src/assert.bash, line 255,
    in test file ./0100-api.bats, line 56)
     `assert_output - <<EOF' failed
   
   -- output differs --
   expected (5 lines):
     init start
     init start
     init done
     init done
     <test>
   actual (6 lines):
     init start
     init start
     init done
     init done
     <test>
     <test>
   --   

I was not able to reproduce the failure on nixos-22.05 with the same kernel.

@MatejKafka (Collaborator)

MatejKafka commented Nov 13, 2022

My guess would be that the api-init-test binary spends 50 ms after initialization doing something, thus spilling over into a second window (maybe cleanup from ASan, or writing the coverage report?). Just to check, could you try running the test manually with trace logs and see if it also happens? Also try running in release mode, to rule out ASan.

@wentasah (Member, Author)

Release mode helps - the test always passes.

The trace log from the test with demos compiled in debug mode is below. Do you see anything suspicious there? Or is it just that ASan makes everything much slower?

>>> 11:51:22.574 [info] Spawning scheduled processes
>>> 11:51:22.575 [debug] Running 'api-init-test' as PID '54244' (partition 'SC1')
>>> 11:51:22.575 [trace] Adding process '54244' to cgroup '/sys/fs/cgroup/unified/user.slice/user-1000.slice/session-2.scope/demos/SC1/proc0'
>>> 11:51:22.576 [trace] Adding process '54244' to cgroup '/sys/fs/cgroup/freezer/user/wsh/0/demos/SC1/proc0'
>>> 11:51:22.576 [trace] Adding process '54244' to cgroup '/sys/fs/cgroup/cpuset/user/wsh/0/demos/SC1'
>>> 11:51:22.577 [debug] Running 'api-init-test' as PID '54245' (partition 'SC2')
>>> 11:51:22.578 [trace] Adding process '54245' to cgroup '/sys/fs/cgroup/unified/user.slice/user-1000.slice/session-2.scope/demos/SC2/proc0'
>>> 11:51:22.578 [trace] Adding process '54245' to cgroup '/sys/fs/cgroup/freezer/user/wsh/0/demos/SC2/proc0'
>>> 11:51:22.578 [trace] Adding process '54245' to cgroup '/sys/fs/cgroup/cpuset/user/wsh/0/demos/SC2'
>>> 11:51:22.578 [info] Scheduled processes spawned
>>> 11:51:22.578 [debug] DEmOS is running on CPUs '0-7'
>>> 11:51:22.578 [debug] Starting event loop
>>> 11:51:22.578 [debug] Starting process initialization
>>> 11:51:22.579 [trace] Initializing process '54244'
>>> 11:51:22.579 [trace] Resuming process '54244' (partition 'SC1')
>>> 11:51:22.579 [trace] Initializing process '54245'
>>> 11:51:22.579 [trace] Resuming process '54245' (partition 'SC2')
>>> 11:51:22.579 [info] Initializing 2 processes
[DEMOS] Process has an initialization window
[DEMOS] Process library set up
init start
[DEMOS] Ignoring call to `demos_completed()` during initialization window
[DEMOS] Process has an initialization window
[DEMOS] Process library set up
init start
[DEMOS] Ignoring call to `demos_completed()` during initialization window
init done
[DEMOS] Initialization completed
[DEMOS] Notifying scheduler of completion and suspending process...
>>> 11:51:22.651 [trace] Process '54244' completed (cmd: 'api-init-test')
>>> 11:51:22.651 [trace] Suspended process '54244' (partition 'SC1')
init done
[DEMOS] Initialization completed
[DEMOS] Notifying scheduler of completion and suspending process...
>>> 11:51:22.652 [trace] Process '54245' completed (cmd: 'api-init-test')
>>> 11:51:22.652 [trace] Suspended process '54245' (partition 'SC2')
>>> 11:51:22.653 [debug] Process initialization finished
>>> 11:51:22.653 [info] Starting scheduler
<test>
>>> 11:51:22.653 [trace] Starting window
>>> 11:51:22.654 [trace] Running process '54244' for '50 milliseconds'
>>> 11:51:22.654 [trace] Resuming process '54244' (partition 'SC1')
[DEMOS] Process resumed
>>> 11:51:22.654 [trace] Running process '54245' for '50 milliseconds'
>>> 11:51:22.655 [trace] Resuming process '54245' (partition 'SC2')
[DEMOS] Process resumed
>>> 11:51:22.701 [debug] Process '54245' exited (partition 'SC2', cmd: 'api-init-test')
>>> 11:51:22.703 [trace] Process ran out of budget exactly at the window end
>>> 11:51:22.703 [trace] Suspended process '54244' (partition 'SC1')
>>> 11:51:22.703 [trace] Process ran out of budget exactly at the window end
<test>
>>> 11:51:22.703 [trace] Starting window
>>> 11:51:22.703 [trace] Running process '54244' for '50 milliseconds'
>>> 11:51:22.703 [trace] Resuming process '54244' (partition 'SC1')
>>> 11:51:22.707 [debug] Process '54244' exited (partition 'SC1', cmd: 'api-init-test')
>>> 11:51:22.707 [info] All processes exited, stopping scheduler

@MatejKafka (Collaborator)

I'm not sure about ASan; IMO it shouldn't make that much of a difference for such a short program. But looking at strace, the compiled-in coverage tool seems to do some file writes to dump the coverage statistics, and it also does some processing after exit is called to calculate the coverage. My guess would be that this is the cause of the delay.

Try commenting out -Db_coverage=true in make debug and see if it still happens.
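
For context, -Db_coverage=true makes Meson pass --coverage to GCC, and gcov-instrumented binaries register an exit-time hook that writes the .gcda data files. Here is a minimal standalone sketch (illustration only, not DEmOS code) that makes this visible:

// cov_demo.cpp - build with: g++ --coverage cov_demo.cpp -o cov_demo
// (--coverage is roughly what Meson's b_coverage option enables for GCC)
#include <cstdio>

int main()
{
    std::puts("work done");
    return 0;
    // after main() returns, libgcov's exit handler opens and writes
    // cov_demo.gcda; running `strace -e trace=openat,write ./cov_demo`
    // shows this I/O happening after the program's own last output
}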

@wentasah (Member, Author)

wentasah commented Nov 14, 2022 via email

@MatejKafka (Collaborator)

This patch should also fix the issue: it adds a second, longer window, which gives the process enough time to finish while still correctly testing that the initialization occurred.

diff --git a/test/0100-api.bats b/test/0100-api.bats
index 35994392..9bb47eec 100644
--- a/test/0100-api.bats
+++ b/test/0100-api.bats
@@ -46,6 +46,10 @@ windows:
     slices:
       - {cpu: 0, sc_partition: SC1}
       - {cpu: 1, sc_partition: SC2}
+  - length: 1000
+    slices:
+      - {cpu: 0, sc_partition: SC1}
+      - {cpu: 1, sc_partition: SC2}

 partitions:
   - name: SC1

@wentasah (Member, Author)

Why do we need a second window? If I make the first window longer (I used 10 seconds), that is sometimes sufficient. I mean that sometimes demos ends immediately when both processes finish, and sometimes it waits for the end of the window, which is annoying. It seems like there is some bug in the termination logic.

@wentasah (Member, Author)

It seems that the termination of the second process is detected only after the window ends. I'm not sure whether the process really ends that late, but it seems unlikely. Maybe we have a bug in ev::child handling. Do you have time to look at it? Or it might be a libev bug; from the libev documentation, I remember that SIGCHLD handling is tricky.
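
For reference, here is a minimal standalone sketch (illustration only, not the DEmOS code in question) of libev's ev::child C++ API; per the libev documentation, child watchers only work in the default event loop, and SIGCHLD handling has some ordering caveats:

// child_demo.cpp - build with: g++ child_demo.cpp -lev -o child_demo
#include <ev++.h>
#include <unistd.h>
#include <cstdio>

static void child_cb(ev::child &w, int)
{
    // rpid/rstatus hold the pid and waitpid() status of the exited child
    std::printf("child %d exited, status %d\n", (int)w.rpid, w.rstatus);
    w.stop();
    ev_break(EV_DEFAULT, EVBREAK_ALL); // leave the event loop
}

int main()
{
    ev::default_loop loop; // ev::child requires the default loop

    pid_t pid = fork();
    if (pid == 0)
        _exit(0); // child exits immediately

    ev::child watcher;
    watcher.set<child_cb>();
    watcher.start(pid, 0); // watch this pid; trace=0 reports exits only
    loop.run();            // child_cb should fire promptly on SIGCHLD
}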

@MatejKafka (Collaborator)

MatejKafka commented Nov 16, 2022

Why do we need a second window? If I make the first window longer (I used 10 seconds), that is sometimes sufficient.

Because by making the first window longer, the test no longer actually tests that the process has time to complete initialization. The original logic behind the test case is that during initialization, api-init-test sleeps for 60 ms, while the first window is only 50 ms long. Thus, if the process does not get an initialization window, the sleep overflows into the second window and changes the order of the prints. By making the first window longer, this is no longer the case, and the test would pass even if initialization were not working correctly.

Currently, the test would fail anyway, because the process library detects an attempt to call demos_initialization_completed() when there's no pending initialization, but the test is primarily supposed to ensure that we correctly handle initialization on the scheduler side.
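
To make the timing concrete, the shape of api-init-test is roughly the following (a sketch only, not the real source; the demos_* functions are stand-ins for the process API calls named in this thread):

// rough sketch of api-init-test's timing (illustration only)
#include <unistd.h>
#include <cstdio>

// stand-in: the real call tells the scheduler init is done and suspends
static void demos_initialization_completed() {}
// stand-in: the real call marks the current iteration as completed
static void demos_completed() {}

int main()
{
    std::puts("init start");
    usleep(60 * 1000);                // 60 ms of "init" work: longer than the
                                      // 50 ms first window, so without a
                                      // dedicated init window it spills into
                                      // the next window and reorders the output
    std::puts("init done");
    demos_initialization_completed();
    std::puts("<test>");              // must appear only after all inits finish
    demos_completed();
    return 0;
}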

@MatejKafka (Collaborator)

MatejKafka commented Nov 16, 2022

I bisected master, and all these weird issues with slow startup, delays, etc. started happening after this commit, so I'd say that we have the suspect:

97860ef258fa43e04b19c41b7c79570fc3d2069b is the first bad commit
commit 97860ef258fa43e04b19c41b7c79570fc3d2069b
Author: Michal Sojka <michal.sojka@cvut.cz>
Date:   Sun Nov 21 16:22:01 2021 +0100

    Allow generating test coverage reports in debug builds

@MatejKafka (Collaborator)

MatejKafka commented Nov 16, 2022

For the whole duration where the process is unresponsive, I see it in the D state (disk sleep) in htop, which would also indicate that the coverage tool is doing something with the dumped coverage files and thus blocking the exit. That would explain the delayed shutdown of api-init-test. I also suspect (and strace seems to confirm) that the coverage reports are also dumped on execve in demos-sched, before the process image is replaced, which would explain the delay on process startup.
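
As a standalone illustration of that exec-time dump (assuming a gcov-instrumented build, which is what b_coverage produces): GCC redirects exec-family calls in instrumented code to libgcov wrappers that flush the profile data before the process image is replaced.

// exec_demo.cpp - illustration only (not demos-sched)
// build with: g++ --coverage exec_demo.cpp -o exec_demo
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    pid_t pid = fork();
    if (pid == 0) {
        char *const argv[] = { const_cast<char *>("/bin/true"), nullptr };
        // with --coverage, libgcov dumps coverage data here, before the
        // image is replaced; `strace -f -e trace=openat,write ./exec_demo`
        // shows the .gcda I/O just before the execv
        execv("/bin/true", argv);
        _exit(127);
    }
    waitpid(pid, nullptr, 0); // parent waits for the child
    std::puts("child reaped");
    return 0;
}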

It's possible that on your kernel version this is compounded by some other effect, but after a clean rebuild (rm -r build && make debug && make, which seems necessary because Meson does not seem to pick up the Makefile option changes otherwise, even with make debug), I cannot reproduce the issue on any of the tests, and it reliably shows up when I add the -Db_coverage=true option back.

@wentasah (Member, Author)

wentasah commented Nov 29, 2022 via email
