Allow for tracer to not be advected by SHOC #3085

Closed

Conversation

@tcclevenger (author) commented Oct 31, 2024:

Changes here are very simple: if we want a tracer advected by SHOC, add_tracer() adds it to both the "turbulence_advected_tracers" and "tracers" groups; if not, it adds it only to "tracers". Almost all of the complexity of the PR comes from the error message emitted when we find conflicting tracer requests w.r.t. SHOC advection.

The name `turbulence_advected_tracers` is not final; I will consult with others before merging.

@bartgol I opted against an enum for turbulence_advected with a "don't care" option. I think it makes things more confusing, and processes that want this represent more of a special case. Maybe that is an argument for defaulting this value to true? Feel free to push back, though.
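A minimal sketch of the add_tracer() logic described above (names, types, and signature are illustrative, not the exact EAMxx API):

#include <set>
#include <string>

// Illustrative sketch: every tracer joins the "tracers" group; only tracers
// that SHOC should advect also join "turbulence_advected_tracers".
struct TracerGroupRequest {
  std::string name;
  std::set<std::string> groups;
};

inline TracerGroupRequest add_tracer (const std::string& name,
                                      const bool turbulence_advected = true)
{
  TracerGroupRequest req{name, {"tracers"}};
  if (turbulence_advected) {
    req.groups.insert("turbulence_advected_tracers");
  }
  return req;
}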

@tcclevenger added the BFB (Bit for bit) label Oct 31, 2024
@tcclevenger requested a review from bartgol October 31, 2024 18:16
@tcclevenger self-assigned this Oct 31, 2024
"turbulence_advected_tracers") != req.groups.end();
ss << " - (" + req.calling_process + ", " + grid_name + ", " + (turb_advect ? "true" : "false") + ")\n";
}
EKAT_ERROR_MSG(ss.str());
tcclevenger (author) commented:

Sample output for the case where P3 requests that qv not be advected by SHOC:

Error! Incompatible tracer request. Turbulence advection requests not
  consistent among processes.
    - Tracer name: qv
    - Requests (process name, grid name, is tracers turbulence advected):
      - (homme, Physics GLL, true)
      - (mam4_aci, Physics GLL, true)
      - (p3, Physics GLL, false)
      - (homme, Physics GLL, true)

Contributor commented:

Why is there a repeated entry for homme?

tcclevenger (author) replied:

I think it is because there exist both a computed and a required group request (since "qv" is "updated" in homme), but then I don't know why p3 does not have two requests as well. Let me investigate.

tcclevenger (author) replied:

So yes, the double entries are due to "updated" fields, since a request exists for both "required" and "computed".

Also, the field requests in ATMProcGroup are stored in a set, so only one version of each request is used. By chance, homme's "qv" was the version of "qv" with packsize>1; P3 was picked up since its req.groups was unique (i.e., no turbulence advection); and mam4_aci was picked up as the packsize=1 request (all other MAM "qv" requests were ignored).

@bartgol I've pushed a solution that does the following

  1. Makes tracer_requests a map of sets of field requests, which removes the duplicate computed/required entries.
  2. Uses req.calling_process in FieldRequest::operator< so that requests with different calling_process values are distinct (when loaded into a std::set<FieldRequest>). The m_*_field_requests sets in AtmosphereProcess are now larger, but this does not result in more allocations, since the FM skips the extra requests when it allocates.

The other options for 2., I think, are:
a. remove all calling-process info from FieldRequest and simplify the error message (no calling process, just "qv has incompatible group requests"), making the user find the offending processes; or
b. leave it as in the output above, which is not complete but gives the user the names of the conflicting processes.
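A minimal sketch of item 2 above, using a hypothetical, stripped-down FieldRequest (the real type carries a full field identifier and more members):

#include <string>
#include <tuple>

struct FieldRequest {
  std::string fid;             // stand-in for the field identifier
  std::string calling_process; // process that made the request
  int pack_size = 1;

  // Including calling_process in the ordering keeps otherwise-identical
  // requests from different processes distinct inside std::set<FieldRequest>,
  // so the error message can list every conflicting process.
  bool operator< (const FieldRequest& rhs) const {
    return std::tie(fid, calling_process, pack_size)
         < std::tie(rhs.fid, rhs.calling_process, rhs.pack_size);
  }
};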

@bartgol commented Nov 18, 2024:

Maybe we could put an enum in the request that specifies whether the req is for input, output, or input-output. It may be helpful for debugging purposes.

That said, I don't think seeing duplicated processes in the output is a big deal. What I would do, maybe, is explain why the requests are incompatible. That is, you could add a line at the bottom of the error msg, like:

Error! Incompatible tracer request. Turbulence advection requests not
  consistent among processes.
    - Tracer name: qv
    - Requests (process name, grid name, is tracers turbulence advected):
      - (homme, Physics GLL, true)
      - (mam4_aci, Physics GLL, true)
      - (p3, Physics GLL, false)
      - (homme, Physics GLL, true)
  All processes MUST agree on whether this tracer is advected by the turbulence scheme.
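A sketch of the enum idea mentioned above (hypothetical, not part of this PR):

// Tagging each request with its direction would let diagnostics distinguish
// the two entries that an "updated" (input-output) field generates.
enum class RequestDirection { Input, Output, InputOutput };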

@E3SM-Bot (Collaborator) commented:

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6267
  • Status: STARTED

Jenkins Parameters

Parameter Name Value
PR_LABELS BFB
PULLREQUESTNUM 3085
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA 14a4fd7
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 91a2410
TEST_REPO_ALIAS SCREAM

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: tcclevenger/add_tracer_with_turbulence_advection
  • SHA: 14a4fd7
  • Mode: TEST_REPO

Pull Request Author: tcclevenger

@E3SM-Bot (Collaborator) commented:

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6267
  • Status: FAILED

Jenkins Parameters

Parameter Name Value
PR_LABELS BFB
PULLREQUESTNUM 3085
SCREAM_SOURCE_REPO https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA 14a4fd7
SCREAM_TARGET_BRANCH master
SCREAM_TARGET_REPO https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA 91a2410
TEST_REPO_ALIAS SCREAM
SCREAM_PullRequest_Autotester_Weaver # 6267 FAILED (last 100 lines of console output below)

===============================================================================
Testing '14a4fd78b76804917b0000c9427f62da90804733' for test 'full_sp_debug'
===============================================================================
RUN: taskset -c 52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 sh -c 'SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/full_sp_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/full_sp_debug -DBUILD_NAME_MOD=full_sp_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DSCREAM_DOUBLE_PRECISION=False -DEKAT_DISABLE_TPL_WARNINGS='ON' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_sp_debug"'
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/full_sp_debug
===============================================================================
Testing '14a4fd78b76804917b0000c9427f62da90804733' for test 'release'
===============================================================================
RUN: taskset -c 104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155 sh -c 'SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/release/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/release -DBUILD_NAME_MOD=release -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Release -DEKAT_DISABLE_TPL_WARNINGS='ON' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/release"'
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/release
===============================================================================
Testing '14a4fd78b76804917b0000c9427f62da90804733' for test 'full_debug'
===============================================================================
RUN: taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51 sh -c 'SCREAM_BUILD_PARALLEL_LEVEL=52 CTEST_PARALLEL_LEVEL=1 ctest -V --output-on-failure --resource-spec-file /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/full_debug/ctest_resource_file.json -DNO_SUBMIT=True -DBUILD_WORK_DIR=/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/full_debug -DBUILD_NAME_MOD=full_debug -S /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/cmake/ctest_script.cmake -DCTEST_SITE=weaver -DCMAKE_COMMAND="-C /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/cmake/machine-files/weaver.cmake -DNetCDF_Fortran_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-fortran/4.6.1/gcc/11.3.0/openmpi/4.1.6/5tv5psl -DNetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/netcdf-c/4.9.2/gcc/11.3.0/openmpi/4.1.6/pyuuqd3 -DPnetCDF_C_PATH=/projects/ppc64le-pwr9-rhel8/tpls/parallel-netcdf/1.12.3/gcc/11.3.0/openmpi/4.1.6/2s52shy -DCMAKE_BUILD_TYPE=Debug -DEKAT_DEFAULT_BFB=True -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=True -DEKAT_DISABLE_TPL_WARNINGS='ON' -DCMAKE_CXX_COMPILER=mpicxx -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpifort -DSCREAM_DYNAMICS_DYCORE=HOMME -DSCREAM_TEST_MAX_TOTAL_THREADS=1 -DSCREAM_BASELINES_DIR=/home/projects/e3sm/scream/pr-autotester/master-baselines/weaver/full_debug"'
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx/ctest-build/full_debug
Build type full_debug failed at testing time. Here's a list of failed tests:
153:homme_shoc_cld_p3_mam_optics_rrtmgp_baseline_cmp
155:homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_baseline_cmp
157:homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_baseline_cmp

Build type release failed at testing time. Here's a list of failed tests:
152:homme_shoc_cld_p3_mam_optics_rrtmgp_baseline_cmp
154:homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_baseline_cmp
156:homme_shoc_cld_spa_p3_rrtmgp_mam4_wetscav_baseline_cmp

Error(s) occurred during test phase
OVERALL STATUS: FAIL
Starting analysis on weaver with cmd: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
RUN: cd /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx && source /etc/profile.d/modules.sh && module purge && module load cmake/3.25.1 git/2.39.1 python/3.10.8 py-netcdf4/1.5.8 gcc/11.3.0 cuda/11.8.0 openmpi netcdf-c netcdf-fortran parallel-netcdf netlib-lapack && export HDF5_USE_FILE_LOCKING=FALSE && true && bsub -I -q rhel8 -n 4 -gpu num=4 ./scripts/test-all-scream --baseline-dir AUTO $compiler -p -c EKAT_DISABLE_TPL_WARNINGS=ON -m weaver
FROM: /home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6267/scream/components/eamxx
weaver failed

  • [[ 1 == 0 ]]
  • [[ weaver == \m\a\p\p\y ]]
  • set +x
    ######################################################
    FAILS DETECTED:
    SCREAM STANDALONE TESTING FAILED!
######################################################
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh
[SCREAM_PullRequest_Autotester_Weaver] $ /bin/bash -le /tmp/jenkins9763932841219945050.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: lbertag@sandia.gov
Finished: FAILURE

@tcclevenger force-pushed the tcclevenger/add_tracer_with_turbulence_advection branch from 14a4fd7 to 55b1441 November 4, 2024 21:56
@tcclevenger changed the title to "[WIP] Allow for tracer to not be advected by SHOC" Nov 4, 2024
@tcclevenger changed the title back to "Allow for tracer to not be advected by SHOC" Nov 4, 2024
@tcclevenger (author) commented:

SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.ghci-snl-cpu_gnu.scream-mam4xx-all_mam4xx_procs fails in the run phase. Output includes

rrtmgp_main:aerosol_sw.tau: 10038 values outside range [0,1000]; minval = 0; maxval = 1.55476e+16
rrtmgp_main:aerosol_lw.tau: 11472 values outside range [0,1000]; minval = 0; maxval = 6.0161e+14

...


FAIL:
false
/home/tccleve/E3SM-Project/SCREAM/scream/components/eamxx/src/share/atm_process/atmosphere_process.cpp:455
Error! Failed post-condition property check (cannot be repaired).
  - Atmosphere process name: rrtmgp
  - Property check name: T_mid within interval [100, 500]
  - Atmosphere process MPI Rank: 18
  - Message: Check failed.
  - check name: T_mid within interval [100, 500]
  - field id: T_mid[Physics PG2] <double:ncol,lev>(16,72) [K]
  - minimum:
    - value: -1473.27
    - indices (w/ global column index): (171,0)
    - lat/lon: (16.8077, 185.652)
    - additional data (w/ local column index):

@singhbalwinder @odiazib Any thoughts on these variables (rrtmgp_main:aerosol_sw.tau, rrtmgp_main:aerosol_lw.tau, T_mid(lev=0)) in relation to MAM? My guess is that the rrtmgp aerosols are affecting T_mid?

@bartgol left a review:

I think this looks great! Did you verify that in mam tests the turb-advected tracers are all indeed first in the list? And that shoc does not pick up the others?

I have a few comments, but they are not requests for changes.

components/eamxx/src/control/atmosphere_driver.cpp (outdated, resolved)
"turbulence_advected_tracers") != req.groups.end();
ss << " - (" + req.calling_process + ", " + grid_name + ", " + (turb_advect ? "true" : "false") + ")\n";
}
EKAT_ERROR_MSG(ss.str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a repeated entry for homme?

@@ -534,6 +534,66 @@ void AtmosphereDriver::create_fields()
m_field_mgrs[grid->name()]->registration_begins();
}

// Before registering fields, check that Field Requests for tracers are compatible
{
Contributor commented:

I wonder if this snippet should be moved somewhere, like in share/util, or maybe as a free function in field_request.hpp, something like

// ensure all requests are on the same page on whether the field belongs to $group_name
bool compatible_groups (std::vector<FieldRequest>& reqs, std::string& group_name);

(maybe with a better name)
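A possible body for that helper, as a sketch only (it uses a minimal stand-in for the real FieldRequest and assumes its groups member is an iterable container of group names; this is not code from the PR):

#include <algorithm>
#include <string>
#include <vector>

// Minimal stand-in for the real FieldRequest (illustrative only).
struct FieldRequest {
  std::vector<std::string> groups;
};

bool compatible_groups (const std::vector<FieldRequest>& reqs,
                        const std::string& group_name)
{
  if (reqs.empty()) {
    return true;
  }
  auto in_group = [&](const FieldRequest& r) {
    return std::find(r.groups.begin(), r.groups.end(), group_name)
           != r.groups.end();
  };
  // All requests must agree on whether the field belongs to group_name.
  const bool first = in_group(reqs.front());
  return std::all_of(reqs.begin(), reqs.end(),
                     [&](const FieldRequest& r) { return in_group(r) == first; });
}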

components/eamxx/src/share/field/field_request.hpp (outdated, resolved)
@bartgol previously approved these changes Nov 7, 2024
@bartgol commented Nov 7, 2024:

@tcclevenger @singhbalwinder @odiazib any thought on the eamxx-sa fails? They're all baseline_cmp for mam tests. Was shoc advecting ALL tracers before, so we were doing the wrong thing before and the right thing now?

@singhbalwinder commented:

I expected the MAM4xx tests to fail as SHOC was originally advecting (vertical mixing) the interstitial aerosols, but now it is not (as far as I understood). This should cause all such tests to fail.

@bartgol added the Non-B4B (Not bit for bit) label and removed the BFB (Bit for bit) label Nov 7, 2024
@bartgol commented Nov 7, 2024:

I correct myself: the FPE run shows an error for the homme_shoc_cld_mam_aci_p3_mam_optics_rrtmgp_mam_drydep_npX tests. It is hard to tell from the logs where the error is. Conrad will have to run manually.

@tcclevenger (author) commented:

Yeah, let me run the FPE test, it may be related to the CIME case error I posted above.

@tcclevenger (author) commented:

> Did you verify that in mam tests the turb-advected tracers are all indeed first in the list? And that shoc does not pick up the others?

Here are the fields in each tracer group (printed from SHOC's and HOMME's set_computed_group_impl(), using the field group info):

tracers: 
  qv(0) tke(1) SOAG(2) SO2(3) qr(4) qm(5) qi(6) qc(7) O3(8) nr(9) ni(10) nc(11) H2SO4(12) 
  H2O2(13) DMS(14) bm(15) soa_a3(16) soa_a2(17) soa_a1(18) so4_a3(19) so4_a2(20) so4_a1(21) 
  pom_a4(22) pom_a3(23) pom_a1(24) num_a4(25) num_a3(26) num_a2(27) num_a1(28) nacl_a3(29)
  nacl_a2(30) nacl_a1(31) mom_a4(32) mom_a3(33) mom_a2(34) mom_a1(35) dst_a3(36) dst_a1(37) 
  bc_a4(38) bc_a3(39) bc_a1(40)
turbulence_advected_tracers: 
  qv(0) tke(1) SOAG(2) SO2(3) qr(4) qm(5) qi(6) qc(7) O3(8) nr(9) ni(10) nc(11) H2SO4(12)
  H2O2(13) DMS(14) bm(15)

So the turbulence_advected_tracers are a subset of tracers, with qv first.

@singhbalwinder Do these groups look right in terms of interstitial and gas aerosols?

@singhbalwinder commented:

Looks great! Yes, the species are grouped correctly. `tracers` should be advected by HOMME, but SHOC should only advect `turbulence_advected_tracers`.

@tcclevenger (author) commented:

@singhbalwinder correct on the tracer names. I've narrowed down the issues in the FPE tests and the SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.ghci-snl-cpu_gnu.scream-mam4xx-all_mam4xx_procs case to the "all-mam4xx-procs" runs (both standalone and CIME cases). Is there a major difference added in the "all-procs" cases? My initial guess is that SHOC might have been initializing some of those tracers that a MAM process then uses?

mergify bot commented Nov 12, 2024:

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟠 Enforce checks passing

Waiting checks: cpu-gcc / ${{ matrix.test.short_name }}, cpu-gcc / ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.scream-small_kernels--scream-output-preset-5, cpu-gcc / ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.scream-output-preset-2, cpu-gcc / ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.scream-dpxx-arm97, cpu-gcc / SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.scream-mam4xx-all_mam4xx_procs, gcc-cuda / ${{ matrix.build_type }}, gcc-cuda / dbg, gcc-cuda / opt, gcc-cuda / sp, gcc-openmp / ${{ matrix.build_type }}, gcc-openmp / dbg, gcc-openmp / fpe, gcc-openmp / opt, gcc-openmp / sp.

Make sure that checks are not failing on the PR, and reviewers approved

  • any of:
    • check-skipped={% raw %}gcc-openmp / ${{ matrix.build_type }}{% endraw %}
    • all of:
      • check-success="gcc-openmp / dbg"
      • check-success="gcc-openmp / fpe"
      • check-success="gcc-openmp / opt"
      • check-success="gcc-openmp / sp"
  • any of:
    • check-skipped={% raw %}gcc-cuda / ${{ matrix.build_type }}{% endraw %}
    • all of:
      • check-success="gcc-cuda / dbg"
      • check-success="gcc-cuda / opt"
      • check-success="gcc-cuda / sp"
  • any of:
    • check-skipped={% raw %}cpu-gcc / ${{ matrix.test.short_name }}{% endraw %}
    • all of:
      • check-success="cpu-gcc / ERS_Ln22.ne4pg2_ne4pg2.F2010-SCREAMv1.scream-small_kernels--scream-output-preset-5"
      • check-success="cpu-gcc / ERS_Ln9.ne4_ne4.F2000-SCREAMv1-AQP1.scream-output-preset-2"
      • check-success="cpu-gcc / ERS_P16_Ln22.ne30pg2_ne30pg2.FIOP-SCREAMv1-DP.scream-dpxx-arm97"
      • check-success="cpu-gcc / SMS_D_Ln5.ne4pg2_oQU480.F2010-SCREAMv1-MPASSI.scream-mam4xx-all_mam4xx_procs"
  • #approved-reviews-by >= 1
  • #changes-requested-reviews-by == 0
  • any of:
    • check-skipped=cpu-gcc
    • check-success=cpu-gcc

@singhbalwinder commented:

@tcclevenger and @bartgol : I have added a fix for the failing tests. Can one of you trigger the tests again to see if the fix fixes all the issues? Thanks!

@bartgol previously approved these changes Nov 13, 2024
@tcclevenger (author) commented Nov 13, 2024:

@singhbalwinder seems your commit fixed the FPE fail, but not the CIME case fail (where rrtmgp aerosols blow up). I was hopeful they would be related. I won't be able to look at this until next week, but I can put it at the top of my list.

@singhbalwinder commented:

@tcclevenger: I will look into the CIME case as well to see if I can fix this error.

@bartgol previously approved these changes Nov 15, 2024
tcclevenger and others added 6 commits November 18, 2024 13:41
Increases the number of entries in the AtmosphereProcess set m_*_field_requests, but not the number of allocations. Allows us to output all instances of field requests of the same name.
@tcclevenger force-pushed the tcclevenger/add_tracer_with_turbulence_advection branch from 06d1637 to 457bc78 November 18, 2024 20:42
@tcclevenger (author) commented:

Update: The issue currently is that for PG2, the homme interface adds tracers on the GLL grid as an "import" group, but that group does not have the same ordering of tracers. So when homme remaps from GLL to PG2, the values do not match the tracer names.

@bartgol commented Nov 20, 2024:

> Update: The issue currently is that for PG2, the homme interface adds tracers on the GLL grid as an "import" group, but that group does not have the same ordering of tracers. So when homme remaps from GLL to PG2, the values do not match the tracer names.

Ah, that's annoying. Indeed, the tracers are re-ordered when the FM allocates the fields, while the imported group is set up after requests have been gathered but before the FMs allocate the fields... @tcclevenger maybe we should handle grids in such a way that (see the sketch after this comment):

  • we process one grid at a time: first import groups from other grids, then allocate fields
  • if grid X imports a group from grid Y, then grid Y is processed first

This makes me really want to get rid of having multiple FMs, and store ALL fields in the same FM...
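A minimal sketch of that ordering rule (hypothetical names; it assumes the group-import graph between grids is acyclic):

#include <algorithm>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Returns grids ordered so that every grid comes after the grids it imports
// groups from; imported groups then see the final (allocated) field ordering.
std::vector<std::string>
grid_processing_order (const std::map<std::string,std::set<std::string>>& imports_from)
{
  std::vector<std::string> order;
  std::set<std::string> done;
  while (done.size() < imports_from.size()) {
    const auto num_done_before = done.size();
    for (const auto& [grid, deps] : imports_from) {
      if (done.count(grid) > 0) {
        continue;
      }
      const bool ready = std::all_of(deps.begin(), deps.end(),
          [&](const std::string& d) { return done.count(d) > 0; });
      if (ready) {
        order.push_back(grid);
        done.insert(grid);
      }
    }
    if (done.size() == num_done_before) {
      throw std::runtime_error("Cyclic group imports between grids");
    }
  }
  return order;
}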

@rljacob (Member) commented Dec 12, 2024:

Moved to E3SM-Project/E3SM#6789

@rljacob closed this Dec 12, 2024
Labels: Non-B4B (Not bit for bit)