Update and standardize implementation of packages, in sync with spack update #593

Merged 34 commits into master from woptim/spack-update on Nov 14, 2024

@adrienbernede
Member

adrienbernede commented Sep 25, 2024

Summary

Supersedes #588

This PR:

  • migrates CARE and Caliper to CachedCMakePackage, reducing the gap with implementations found in llnl/radiuss-spack-configs.
  • improves coherency in version constraints across RADIUSS packages.
  • updates Spack.

⚠️ TODO Before Merge:

@adrienbernede
Member Author

@daboehme It appears that recent changes in the Caliper main branch fixed the issues we were seeing with cce compilers.
One issue remains with rocm 6.2.0 that I would like you to look at:
https://lc.llnl.gov/gitlab/radiuss/Caliper/-/jobs/2148980
Thank you.

@adrienbernede
Member Author

@daboehme any idea what could be causing this?

5/5 Test #5: CI_app_tests .....................***Failed   45.26 sec
..................................Efree(): double free detected in tcache 2
Efree(): double free detected in tcache 2
E......................cali-query: Error reading stdin: Unknown/invalid record: __rec=n
E............EEEE....E.....E...

@daboehme
Member

Hi @adrienbernede, where did you see this happening? Can't find it in any of the recent CI results.

@adrienbernede
Member Author

@daboehme I think you just missed it. It is right after the test summary in the logs of the only failing job:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ 2024-10-08 10:15:13-07:00 ~ Testing Caliper
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cannot find file: /dev/shm/tioga14-2161536/build_caliper-linux-rhel8-zen2-rocmcc@6.2.0/DartConfiguration.tcl
   Site: 
   Build name: (empty)
Create new tag: 20241008-1715 - Experimental
Cannot find file: /dev/shm/tioga14-2161536/build_caliper-linux-rhel8-zen2-rocmcc@6.2.0/DartConfiguration.tcl
Test project /dev/shm/tioga14-2161536/build_caliper-linux-rhel8-zen2-rocmcc@6.2.0
    Start 1: test-caliper-common
1/5 Test #1: test-caliper-common ..............   Passed    0.01 sec
    Start 2: test-caliper-reader
2/5 Test #2: test-caliper-reader ..............   Passed    0.01 sec
    Start 3: test-adiak-services
3/5 Test #3: test-adiak-services ..............   Passed    1.13 sec
    Start 4: test-caliper
4/5 Test #4: test-caliper .....................   Passed    0.75 sec
    Start 5: CI_app_tests
5/5 Test #5: CI_app_tests .....................***Failed   45.26 sec
..................................Efree(): double free detected in tcache 2
Efree(): double free detected in tcache 2
E......................cali-query: Error reading stdin: Unknown/invalid record: __rec=n
E............EEEE....E.....E...
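
To triage a log like the one above without scrolling through it, a small shell helper can pull out the failed test names and count the double-free reports. This is only a minimal sketch: the function name `summarize_ctest_log` is hypothetical (it is not part of the Caliper CI scripts), and it assumes the saved log keeps the ctest line format shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Caliper CI scripts): summarize a saved ctest log.
# Assumes failed tests appear as "N/M Test #N: <name> ...***Failed <time> sec".
summarize_ctest_log() {
    log_file="$1"

    # Print the name of each test that ctest marked as failed.
    grep -F '***Failed' "$log_file" \
        | sed -E 's/^.*Test +#[0-9]+: +([^ ]+) .*$/\1/' \
        | while read -r name; do
            echo "failed: ${name}"
        done

    # Count the lines that carry a "double free detected" abort message.
    echo "double-free reports: $(grep -c 'double free detected' "$log_file")"
}
```

Calling it as `summarize_ctest_log job.log` (with `job.log` standing in for wherever the job output was saved) lists each failed test on its own line, followed by the number of double-free messages.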

@daboehme
Member

Hi @adrienbernede, thanks, I found it. I tried building Caliper with the same compiler and libraries, but I can't reproduce these issues. All tests are running fine for me. It also doesn't seem like the CI has been running this particular configuration lately. Can we simply retry running this config? Maybe it was a HW issue or something.

@adrienbernede
Member Author

adrienbernede commented Nov 4, 2024

Hello @daboehme

I ran the job again and it failed the same way.

The easiest way to reproduce the issue is by using the in-log reproducer.
In each job the CI is set to print a reproducer script. Here it looks like:

working_dir="/usr/workspace/${USER}/Caliper/2222036-$(date +%s)" 
mkdir -p ${working_dir} && cd ${working_dir} 
git clone https://github.com/LLNL/Caliper.git --single-branch --depth=1 
cd Caliper 
git fetch origin --depth=1 c634187441c3ad88420de7d00ca642b78dd14da5 
git checkout c634187441c3ad88420de7d00ca642b78dd14da5 
git submodule update --init --recursive 
# Required variables 
export MODULE_LIST="" 
export SPEC="+tests +rocm amdgpu_target=gfx90a %rocmcc@=6.2.0 ^hip@6.2.0 " 
# Allow to set job script for debugging (only this differs from CI) 
export DEBUG_MODE=true
flux watch $(flux batch -o output.stdout.type=kvs --nodes=1 --begin-time=+5s ./scripts/gitlab/build-and-test.sh)

Please note that the failing job is new: we were previously testing with rocm@6.1.1 and this PR updates rocm to 6.2.0.
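
Since the ROCm version is the change this PR makes to that job, one way to narrow things down is to run the same reproducer with both versions. The sketch below builds the `SPEC` value from a version number. `make_rocm_spec` is a hypothetical helper, and it assumes the earlier job used the same spec with 6.1.1 in place of 6.2.0, which this thread does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: build the SPEC string used by the reproducer for a given ROCm version.
# Assumption: only the rocmcc and hip versions differ between the two jobs.
make_rocm_spec() {
    rocm_version="$1"
    # "%%" emits a literal "%" in printf.
    printf '+tests +rocm amdgpu_target=gfx90a %%rocmcc@=%s ^hip@%s' \
        "$rocm_version" "$rocm_version"
}

# Example: export the spec for the previously tested version before launching the job script.
SPEC="$(make_rocm_spec 6.1.1)"
export SPEC
echo "SPEC=${SPEC}"
```

The rest of the reproducer stays unchanged. Only the `export SPEC=...` line would be replaced by the call above.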

@daboehme
Member

Hi @adrienbernede,

Thanks, that's very helpful. I was finally able to reproduce the failing tests, which did actually catch a real issue. This should now be fixed in the current Caliper master branch. Can you rebase your branch and try it again? Thanks!

@adrienbernede
Member Author

@daboehme glad it helped.
I merged master and pushed it. Were you actually expecting a rebase?

@daboehme
Member

Hi @adrienbernede, a merge is fine. Are we good to merge this in then?

@adrienbernede
Member Author

@daboehme yes, we are!

@adrienbernede changed the title from "[WIP] Update and standardize implementation of packages, in sync with spack update" to "Update and standardize implementation of packages, in sync with spack update" on Nov 14, 2024
@daboehme merged commit 53071d8 into master on Nov 14, 2024
3 checks passed
@adrienbernede deleted the woptim/spack-update branch on November 14, 2024 at 17:45