Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ASAN/UBSAN recipe to just, make a CI that runs it + adding GDB #1474

Merged
merged 15 commits into from
Sep 27, 2024

Conversation

tvami
Copy link
Member

@tvami tvami commented Sep 24, 2024

I am updating ldmx-sw, here are the details.

What are the issues that this addresses?

Resolves #1043 (the last bit of it)

Check List

  • I successfully compiled ldmx-sw with my developments
  • I ran my developments and the following shows that they are successful.

I did

just configure-asan-ubsan
just fire .github/validation_samples/kaon_enhanced/config.py 

give a lot of indirect memory leaks, which is fine, none of them are big. This is kinda expected since we fixed the ASAN / UBSAN problems in previous PRs.

Added it to the CI with the option w/o the leak sanitizer, so currently there is no output from ASAN/UBSAN (given that we cleaned all the issues up already).

Then

just configure-gdb
denv gdb build/run_test 

this runs too, although I'm not sure how to actually make use of it for a real config file with fire.
I introduced

just debug

as well.

@tvami tvami marked this pull request as draft September 24, 2024 04:31
@tvami tvami marked this pull request as ready for review September 24, 2024 04:31
@tvami tvami marked this pull request as draft September 24, 2024 14:30
@tvami tvami marked this pull request as ready for review September 24, 2024 14:30
@tvami tvami marked this pull request as draft September 24, 2024 18:05
@tvami tvami marked this pull request as ready for review September 25, 2024 04:29
@tvami tvami marked this pull request as draft September 25, 2024 04:31
@tvami tvami marked this pull request as ready for review September 25, 2024 04:36
@EinarElen
Copy link
Contributor

Thinking a bit more about this, do we need a separate sanitizer build? Why not just run all of the CI with a sanitized build?

@tvami
Copy link
Member Author

tvami commented Sep 25, 2024

Thinking a bit more about this, do we need a separate sanitizer build? Why not just run all of the CI with a sanitized build?

It takes forever... we usual test 10k events, I tried with 7.5k for the ecal pn, and was killed at 6 hours. I'm now converging to the point of only testing 1500 events which will take about an hour. (But for some reason I cant keep the ASAN output in the CI)

Copy link
Member

@tomeichlersmith tomeichlersmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One DRY comment that is optional and some guidance on writing the workflow.

.github/workflows/asan_ubsan_build.yml Outdated Show resolved Hide resolved
.github/workflows/asan_ubsan_build.yml Show resolved Hide resolved
justfile Outdated Show resolved Hide resolved
@EinarElen
Copy link
Contributor

Thinking a bit more about this, do we need a separate sanitizer build? Why not just run all of the CI with a sanitized build?

It takes forever... we usual test 10k events, I tried with 7.5k for the ecal pn, and was killed at 6 hours. I'm now converging to the point of only testing 1500 events which will take about an hour. (But for some reason I cant keep the ASAN output in the CI)

This sounds really really strange. The sanitizers should have an extremely small impact on performance

@tvami
Copy link
Member Author

tvami commented Sep 25, 2024

Thinking a bit more about this, do we need a separate sanitizer build? Why not just run all of the CI with a sanitized build?

It takes forever... we usual test 10k events, I tried with 7.5k for the ecal pn, and was killed at 6 hours. I'm now converging to the point of only testing 1500 events which will take about an hour. (But for some reason I cant keep the ASAN output in the CI)

This sounds really really strange. The sanitizers should have an extremely small impact on performance

Hmm, I dont know, here is the action that was killed after 6 hours https://github.com/LDMX-Software/ldmx-sw/actions/runs/11007030465 if it helps

@tvami tvami marked this pull request as draft September 25, 2024 18:16
@tvami tvami marked this pull request as ready for review September 25, 2024 18:16
@tvami tvami marked this pull request as draft September 25, 2024 19:17
@tvami
Copy link
Member Author

tvami commented Sep 25, 2024

Ok the output is being saved now
https://github.com/LDMX-Software/ldmx-sw/actions/runs/11038784203/artifacts/1978699120
but it still doesnt have what I want... I'll play with it more later

@tvami tvami marked this pull request as ready for review September 26, 2024 01:25
@tvami tvami marked this pull request as draft September 26, 2024 01:28
@tvami
Copy link
Member Author

tvami commented Sep 26, 2024

(sorry I'm going back and forth between draft and ready -- that's my only way to trigger tests until it's merged [after that we'll be able to do this from the Actions with the dispatch options])

@tvami tvami marked this pull request as ready for review September 26, 2024 02:00
@tvami tvami marked this pull request as draft September 26, 2024 02:00
@tvami
Copy link
Member Author

tvami commented Sep 26, 2024

Hmm, I dont understand, in local the ASAN output is in stderr. But in the action I only get

https://github.com/LDMX-Software/ldmx-sw/actions/runs/11044190876

denv fire /home/runner/work/ldmx-sw/ldmx-sw/.github/validation_samples/ecal_pn/config.py 
==5==LeakSanitizer has encountered a fatal error.
==5==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==5==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
error: Recipe `fire` failed on line 95 with exit code 1

even tho LSAN_OPTIONS=verbosity=1:log_threads=1 are enabled... (which btw I dont need when I run locally)

@EinarElen
Copy link
Contributor

That seems strange... But the simulation finishes and the leak sanitizer issue is only at the end? I'm still a bit skeptical to the value of running leaksanitizer in CI for us so I might just turn it off isntead

@tvami
Copy link
Member Author

tvami commented Sep 26, 2024

But the simulation finishes and the leak sanitizer issue is only at the end?

I'm running ASAN + UBSAN, and they somehow invoke LSAN too. And yes, they only show up in the end, bc of the setting to dont stop on issues (which is a default set by you @EinarElen I believe, but correct me if I'm wrong [given that we fixed all the issues, it's hard to tell]).

I can be convinced either way about the leak part. But the issue about github dealing with stderr from ASAN/UBSAN is not going to be connected to the leak part. Maybe @tomeichlersmith has an idea about the environment settings on the machines used by the actions.

@tomeichlersmith
Copy link
Member

The runners do not swallow anything from stdout or stderr, but they do not have a TTY and so programs that are "smart" and disable certain behaviors in a non-interactive environment (mainly determined by detecting if a TTY is present) will disable those behaviors.

You can mimic running without a TTY to see if that is the issue.

https://superuser.com/a/1430883

@EinarElen
Copy link
Contributor

LSAN is a part of ASAN so it gets enabled by default. You turn if off with ASAN_OPTIONS=detect_leaks=0

@tvami
Copy link
Member Author

tvami commented Sep 26, 2024

You can mimic running without a TTY to see if that is the issue.

so none of this managed to reproduce what I see in the actions, but I think I found a workaround with specifying the ASAN output, I'll test that now

@tvami tvami marked this pull request as ready for review September 26, 2024 17:48
@tvami tvami marked this pull request as draft September 26, 2024 18:02
@tvami tvami marked this pull request as draft September 26, 2024 18:30
@tvami tvami marked this pull request as ready for review September 26, 2024 19:01
@tvami
Copy link
Member Author

tvami commented Sep 26, 2024

OK I think I'm ready now, this is the module operandi I propose: we dont save the log for now (it didnt work even with specifying the output for ASAN), we dont run the leak part --> this means currently there is no issues. Then we run this every time a new PR comes in, and if any of those introduce a problem, this test will fail. It will not show what the issue is (given the output issue), but it will show that there is an issue, so we can check out the branch locally and run this and we'll see the problem.

@tvami tvami marked this pull request as draft September 26, 2024 19:23
@tvami tvami marked this pull request as ready for review September 26, 2024 19:23
@tvami
Copy link
Member Author

tvami commented Sep 26, 2024

Screenshot 2024-09-26 at 13 40 41

ok so w/o the the leak part, it doesnt actually take extremely long: 1500 events took 30 min, so the 10k would take 3.3 hours. That's still more than the usual 1.5-2h time. I could increase this to 7500. I think the ecal_pn at this point runs all kinda processors so we are safe to just run that with ASAN/UBSAN.

@tvami
Copy link
Member Author

tvami commented Sep 26, 2024

OK, tests are fine, @tomeichlersmith @EinarElen I wont push anymore, please approve if you agree with the changes

.github/workflows/asan_ubsan_build.yml Outdated Show resolved Hide resolved
justfile Show resolved Hide resolved
.github/workflows/asan_ubsan_build.yml Show resolved Hide resolved
justfile Outdated Show resolved Hide resolved
@tvami tvami merged commit 8efa112 into trunk Sep 27, 2024
3 checks passed
@tvami tvami deleted the iss1043-ASAN-UBSAN-GDB branch September 27, 2024 00:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improved debugging QOL
3 participants