
test/0100-api.bats fails on recent systems #76

Open
wentasah opened this issue Nov 12, 2022 · 11 comments

@wentasah (Member)

When I run the tests on current nixos-unstable with Linux 6.0.7, the test sometimes fails as follows:

./0100-api.bats
 ✓ running api-test outside of demos-sched fails
 ✓ demos schedules first two iterations of the sub-process
 ✓ demos doesn't hang when process exits during initialization
 ✗ each process has opportunity to initialize before window scheduler starts
   (from function `assert_output' in file /nix/store/s0bj2xi9wv9axqi8amgyrcklin7vq47r-bats-with-libraries-1.8.2/share/bats/bats-assert/src/assert.bash, line 255,
    in test file ./0100-api.bats, line 56)
     `assert_output - <<EOF' failed
   
   -- output differs --
   expected (5 lines):
     init start
     init start
     init done
     init done
     <test>
   actual (6 lines):
     init start
     init start
     init done
     init done
     <test>
     <test>
   --   

I was not able to reproduce the failure on nixos-22.05 with the same kernel.

@MatejKafka (Collaborator)

MatejKafka commented Nov 13, 2022

My guess would be that the api-init-test binary spends 50 ms after initialization doing something, thus spilling over into a second window (maybe cleanup from ASan, or writing the coverage report?). Just to check, could you try running the test manually with trace logs and see if it also happens? Also try running in release mode, to rule out ASan.

@wentasah (Member, Author)

Release mode helps - the test always passes.

The trace log from the test with demos compiled in debug mode is below. Do you see anything suspicious there? Or is it just that ASan makes everything much slower?

>>> 11:51:22.574 [info] Spawning scheduled processes
>>> 11:51:22.575 [debug] Running 'api-init-test' as PID '54244' (partition 'SC1')
>>> 11:51:22.575 [trace] Adding process '54244' to cgroup '/sys/fs/cgroup/unified/user.slice/user-1000.slice/session-2.scope/demos/SC1/proc0'
>>> 11:51:22.576 [trace] Adding process '54244' to cgroup '/sys/fs/cgroup/freezer/user/wsh/0/demos/SC1/proc0'
>>> 11:51:22.576 [trace] Adding process '54244' to cgroup '/sys/fs/cgroup/cpuset/user/wsh/0/demos/SC1'
>>> 11:51:22.577 [debug] Running 'api-init-test' as PID '54245' (partition 'SC2')
>>> 11:51:22.578 [trace] Adding process '54245' to cgroup '/sys/fs/cgroup/unified/user.slice/user-1000.slice/session-2.scope/demos/SC2/proc0'
>>> 11:51:22.578 [trace] Adding process '54245' to cgroup '/sys/fs/cgroup/freezer/user/wsh/0/demos/SC2/proc0'
>>> 11:51:22.578 [trace] Adding process '54245' to cgroup '/sys/fs/cgroup/cpuset/user/wsh/0/demos/SC2'
>>> 11:51:22.578 [info] Scheduled processes spawned
>>> 11:51:22.578 [debug] DEmOS is running on CPUs '0-7'
>>> 11:51:22.578 [debug] Starting event loop
>>> 11:51:22.578 [debug] Starting process initialization
>>> 11:51:22.579 [trace] Initializing process '54244'
>>> 11:51:22.579 [trace] Resuming process '54244' (partition 'SC1')
>>> 11:51:22.579 [trace] Initializing process '54245'
>>> 11:51:22.579 [trace] Resuming process '54245' (partition 'SC2')
>>> 11:51:22.579 [info] Initializing 2 processes
[DEMOS] Process has an initialization window
[DEMOS] Process library set up
init start
[DEMOS] Ignoring call to `demos_completed()` during initialization window
[DEMOS] Process has an initialization window
[DEMOS] Process library set up
init start
[DEMOS] Ignoring call to `demos_completed()` during initialization window
init done
[DEMOS] Initialization completed
[DEMOS] Notifying scheduler of completion and suspending process...
>>> 11:51:22.651 [trace] Process '54244' completed (cmd: 'api-init-test')
>>> 11:51:22.651 [trace] Suspended process '54244' (partition 'SC1')
init done
[DEMOS] Initialization completed
[DEMOS] Notifying scheduler of completion and suspending process...
>>> 11:51:22.652 [trace] Process '54245' completed (cmd: 'api-init-test')
>>> 11:51:22.652 [trace] Suspended process '54245' (partition 'SC2')
>>> 11:51:22.653 [debug] Process initialization finished
>>> 11:51:22.653 [info] Starting scheduler
<test>
>>> 11:51:22.653 [trace] Starting window
>>> 11:51:22.654 [trace] Running process '54244' for '50 milliseconds'
>>> 11:51:22.654 [trace] Resuming process '54244' (partition 'SC1')
[DEMOS] Process resumed
>>> 11:51:22.654 [trace] Running process '54245' for '50 milliseconds'
>>> 11:51:22.655 [trace] Resuming process '54245' (partition 'SC2')
[DEMOS] Process resumed
>>> 11:51:22.701 [debug] Process '54245' exited (partition 'SC2', cmd: 'api-init-test')
>>> 11:51:22.703 [trace] Process ran out of budget exactly at the window end
>>> 11:51:22.703 [trace] Suspended process '54244' (partition 'SC1')
>>> 11:51:22.703 [trace] Process ran out of budget exactly at the window end
<test>
>>> 11:51:22.703 [trace] Starting window
>>> 11:51:22.703 [trace] Running process '54244' for '50 milliseconds'
>>> 11:51:22.703 [trace] Resuming process '54244' (partition 'SC1')
>>> 11:51:22.707 [debug] Process '54244' exited (partition 'SC1', cmd: 'api-init-test')
>>> 11:51:22.707 [info] All processes exited, stopping scheduler

@MatejKafka (Collaborator)

I'm not sure about ASan; IMO it shouldn't make that much of a difference for such a short program. But looking at strace, the compiled-in coverage tool seems to do some file writes to dump the coverage statistics, and it also does some processing after exit is called to calculate the coverage. My guess would be that this is the cause of the delay.

Try commenting out -Db_coverage=true in make debug and see if it still happens.
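
For context, -Db_coverage=true makes Meson pass --coverage to GCC, and gcov-instrumented binaries register an exit-time hook that writes the .gcda data files. Here is a minimal standalone sketch (illustration only, not DEmOS code) that makes this visible:

// cov_demo.cpp - build with: g++ --coverage cov_demo.cpp -o cov_demo
// (--coverage is roughly what Meson's b_coverage option enables for GCC)
#include <cstdio>

int main()
{
    std::puts("work done");
    return 0;
    // after main() returns, libgcov's exit handler opens and writes
    // cov_demo.gcda; running `strace -e trace=openat,write ./cov_demo`
    // shows this I/O happening after the program's own last output
}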

@wentasah (Member, Author)

wentasah commented Nov 14, 2022 via email

@MatejKafka (Collaborator)

This patch should also fix the issue: it adds a second, longer window, which gives the process enough time to finish while still correctly testing that the initialization occurred.

diff --git a/test/0100-api.bats b/test/0100-api.bats
index 35994392..9bb47eec 100644
--- a/test/0100-api.bats
+++ b/test/0100-api.bats
@@ -46,6 +46,10 @@ windows:
     slices:
       - {cpu: 0, sc_partition: SC1}
       - {cpu: 1, sc_partition: SC2}
+  - length: 1000
+    slices:
+      - {cpu: 0, sc_partition: SC1}
+      - {cpu: 1, sc_partition: SC2}

 partitions:
   - name: SC1

@wentasah (Member, Author)

Why do we need a second window? If I make the first window longer (I used 10 seconds), that is sometimes sufficient. I mean that sometimes demos ends immediately when both processes finish, and sometimes it waits for the end of the window, which is annoying. It seems like there is some bug in the termination logic.

@wentasah (Member, Author)

It seems that the termination of the second process is detected only after the window ends. I'm not sure whether the process really ends that late, but it seems unlikely. Maybe we have a bug in ev::child handling. Do you have time to look at it? Or it might be a libev bug; from the libev documentation, I remember that SIGCHLD handling is tricky.
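
For reference, here is a minimal standalone sketch (illustration only, not the DEmOS code in question) of libev's ev::child C++ API; per the libev documentation, child watchers only work in the default event loop, and SIGCHLD handling has some ordering caveats:

// child_demo.cpp - build with: g++ child_demo.cpp -lev -o child_demo
#include <ev++.h>
#include <unistd.h>
#include <cstdio>

static void child_cb(ev::child &w, int)
{
    // rpid/rstatus hold the pid and waitpid() status of the exited child
    std::printf("child %d exited, status %d\n", (int)w.rpid, w.rstatus);
    w.stop();
    ev_break(EV_DEFAULT, EVBREAK_ALL); // leave the event loop
}

int main()
{
    ev::default_loop loop; // ev::child requires the default loop

    pid_t pid = fork();
    if (pid == 0)
        _exit(0); // child exits immediately

    ev::child watcher;
    watcher.set<child_cb>();
    watcher.start(pid, 0); // watch this pid; trace=0 reports exits only
    loop.run();            // child_cb should fire promptly on SIGCHLD
}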

@MatejKafka (Collaborator)

MatejKafka commented Nov 16, 2022

Why do we need a second window? If I make the first window longer (I used 10 seconds), that is sometimes sufficient.

Because by making the first window longer, the test no longer actually tests that the process has time to complete initialization. The original logic behind the test case is that during initialization, api-init-test sleeps for 60 ms, while the first window is only 50 ms long. Thus, if the process does not get an initialization window, the sleep overflows into the second window and changes the order of the prints. By making the first window longer, this is no longer the case, and the test would pass even if initialization were not working correctly.

Currently, the test would fail anyway, because the process library detects an attempt to call demos_initialization_completed() when there's no pending initialization, but the test is primarily supposed to ensure that we correctly handle initialization on the scheduler side.
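
To make the timing concrete, the shape of api-init-test is roughly the following (a sketch only, not the real source; the demos_* functions are stand-ins for the process API calls named in this thread):

// rough sketch of api-init-test's timing (illustration only)
#include <unistd.h>
#include <cstdio>

// stand-in: the real call tells the scheduler init is done and suspends
static void demos_initialization_completed() {}
// stand-in: the real call marks the current iteration as completed
static void demos_completed() {}

int main()
{
    std::puts("init start");
    usleep(60 * 1000);                // 60 ms of "init" work: longer than the
                                      // 50 ms first window, so without a
                                      // dedicated init window it spills into
                                      // the next window and reorders the output
    std::puts("init done");
    demos_initialization_completed();
    std::puts("<test>");              // must appear only after all inits finish
    demos_completed();
    return 0;
}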

@MatejKafka (Collaborator)

MatejKafka commented Nov 16, 2022

I bisected master, and all these weird issues with slow startup, delays, etc. started happening after this commit, so I'd say that we have the suspect:

97860ef258fa43e04b19c41b7c79570fc3d2069b is the first bad commit
commit 97860ef258fa43e04b19c41b7c79570fc3d2069b
Author: Michal Sojka <michal.sojka@cvut.cz>
Date:   Sun Nov 21 16:22:01 2021 +0100

    Allow generating test coverage reports in debug builds

@MatejKafka (Collaborator)

MatejKafka commented Nov 16, 2022

For the whole duration where the process is unresponsive, I see it in the D state (disk sleep) in htop, which would also indicate that the coverage tool is doing something with the dumped coverage files and thus blocking the exit. That would explain the delayed shutdown of api-init-test. I also suspect (and strace seems to confirm) that the coverage reports are also dumped on execve in demos-sched, before the process image is replaced, which would explain the delay on process startup.
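
As a standalone illustration of that exec-time dump (assuming a gcov-instrumented build, which is what b_coverage produces): GCC redirects exec-family calls in instrumented code to libgcov wrappers that flush the profile data before the process image is replaced.

// exec_demo.cpp - illustration only (not demos-sched)
// build with: g++ --coverage exec_demo.cpp -o exec_demo
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    pid_t pid = fork();
    if (pid == 0) {
        char *const argv[] = { const_cast<char *>("/bin/true"), nullptr };
        // with --coverage, libgcov dumps coverage data here, before the
        // image is replaced; `strace -f -e trace=openat,write ./exec_demo`
        // shows the .gcda I/O just before the execv
        execv("/bin/true", argv);
        _exit(127);
    }
    waitpid(pid, nullptr, 0); // parent waits for the child
    std::puts("child reaped");
    return 0;
}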

It's possible that on your kernel version this is compounded by some other effect, but after a clean rebuild (rm -r build && make debug && make, which seems necessary because Meson does not seem to pick up the Makefile option changes otherwise, even with make debug), I cannot reproduce the issue on any of the tests, and it reliably shows up when I add the -Db_coverage=true option back.

@wentasah (Member, Author)

wentasah commented Nov 29, 2022 via email
