Failure in atlas_fctest_trans_unstructured #249

Closed
DJDavies2 opened this issue Dec 15, 2024 · 3 comments · Fixed by #254

Comments

@DJDavies2
Contributor

What happened?

Running atlas_fctest_trans_unstructured with certain configs gives this failure:

183/216 Test #183: atlas_fctest_trans_unstructured ...........................Subprocess aborted***Exception: 0.23 sec
Runtime Error: *** Arithmetic exception: Floating overflow - aborting
/home/users/david.davies/cylc-run/mi-bg671/work/1/get_source_atlas/atlas/src/atlas_f/trans/atlas_Trans_module.F90, line 352: Error occurred in ATLAS_TRANS_MODULE:INVTRANS_VORDIV2WIND_FIELD
/data/users/david.davies/cylc-run/mi-bg671/work/1/get_source_atlas/atlas/src/tests/trans/fctest_trans_unstructured.F90, line 84: Called by FCTEST_ATLAS_TRANS_UNSTR:TEST_TRANS
/home/users/david.davies/cylc-run/mi-bg671/work/1/build_atlas_nag/build/src/tests/trans/fctest_trans_unstructured_main.F90, line 21: Called by RUN_FCTEST_ATLAS_TRANS_UNSTR

What are the steps to reproduce the bug?

Building and running with NAG/GCC, with ectrans support enabled. However, this failure only occurs with some configs. I don't think NAG is particularly relevant here (see below).

Version

Head

Platform (OS and architecture)

Linux

Relevant log output

No response

Accompanying data

No response

Organisation

Met Office

@DJDavies2
Contributor Author

I dug around a bit and printed some stuff out. I think the failure occurs in this line in src/atlas/trans/local/VorDivToUVLocal.cc:

                    rv[ir + ji]  = -chiIm * rvor[ii + ji] - psiM1 * rdiv[ir + ji + 1] + psiP1 * rdiv[ir + ji - 1];

However, I think the root of the problem lies earlier than that. In src/tests/trans/fctest_trans_unstructured.F90, the calls to spectral%create_field for sp_div_field and friends produce arrays that have 6 elements. I have traced the path of sp_div_field down into the code and I believe it ends up in extend_truncation in the file src/atlas/trans/local/TransLocal.cc as the old_spectra parameter. Printing out some values and array indices shows that this line:

                      new_spectra[k++] = old_spectra[k_old++];

is going out of bounds in its access to old_spectra. This results in undefined values being copied into new_spectra, which in some circumstances results in arithmetic exceptions in subsequent calculations such as the line noted above.

I have checked the hypothesis by adding some special code to extend_truncation so that only the first 6 elements of old_spectra are used (0 otherwise); this seems to fix atlas_fctest_trans_unstructured but is of course unacceptable as it breaks other tests.

I don't know what to do next about this. There is no way in extend_truncation of knowing how many elements old_spectra has, so there would be no way of generalizing my hack even if it were considered okay in principle.
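
To illustrate the point about the missing size information, here is a minimal sketch of a size-aware variant. It is not the atlas implementation and not the fix from #254. The names spectral_size and extend_truncation_checked are hypothetical, and the storage layout (triangular truncation, real and imaginary parts stored consecutively, loop order m, n, real/imag, field) is an assumption made for the example. The idea is simply that if the caller passes a container that knows its own length, a mismatch between the truncation and the number of values can be reported instead of being read past the end.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Number of real values for triangular truncation T and nb_fields fields,
// assuming one (real, imag) pair per (m, n) with 0 <= m <= n <= T.
std::size_t spectral_size(int truncation, int nb_fields) {
    const std::size_t t = static_cast<std::size_t>(truncation);
    return (t + 1) * (t + 2) * static_cast<std::size_t>(nb_fields);
}

// Hypothetical size-aware variant: extends truncation T to T+1 by zero padding,
// and throws instead of reading past the end of old_spectra.
std::vector<double> extend_truncation_checked(int old_truncation, int nb_fields,
                                              const std::vector<double>& old_spectra) {
    const std::size_t expected = spectral_size(old_truncation, nb_fields);
    if (old_spectra.size() != expected) {
        throw std::invalid_argument("old_spectra has " + std::to_string(old_spectra.size()) +
                                    " values, expected " + std::to_string(expected));
    }
    const int new_truncation = old_truncation + 1;
    std::vector<double> new_spectra(spectral_size(new_truncation, nb_fields), 0.0);
    std::size_t k = 0;
    std::size_t k_old = 0;
    for (int m = 0; m <= new_truncation; ++m) {
        for (int n = m; n <= new_truncation; ++n) {
            for (int imag = 0; imag < 2; ++imag) {
                for (int jfld = 0; jfld < nb_fields; ++jfld, ++k) {
                    // Coefficients with m or n equal to T+1 are new and stay zero.
                    if (m != new_truncation && n != new_truncation) {
                        new_spectra[k] = old_spectra.at(k_old++);
                    }
                }
            }
        }
    }
    return new_spectra;
}
```

Under these assumptions, an array of 6 values is only consistent with truncation 1 and a single field, so calling the function with 6 values and a larger truncation would raise an error rather than copy undefined values into new_spectra.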

@wdeconinck
Member

Thank you @DJDavies2 for digging! I will dig myself a bit further in the New Year.
I can refactor this and add some assertions to make sure we don't go out of bounds silently, and then see how to prevent it.
These functions are implementation details.

@wdeconinck
Member

@DJDavies2 I have nailed down the bug and fixed it in #254.
Thanks once more for the discovery and report!

wdeconinck added a commit that referenced this issue Feb 10, 2025
* release/0.41.0: (44 commits)
  Update changelog
  Version 0.41.0
  Suppress more Intel warnings
  Suppress more Intel warnings
  Fix more warnings with gnu 13.2 Release build
  Fix warnings (#256)
  Move SparseMatrixStorage template functions in .cc
  Add atlas-global-matrix sandbox
  Add AssembleGlobalMatrix
  Add SparseMatrixToTriplets
  Fix SparseMatrixStorage constructor from SparseMatrixView
  Fix creation of spectral function space from TransLocal (fixes #249) (#254)
  Fortran: Allow to create MatchingPartitioner with derived FunctionSpaces
  Remove spurios debugging output
  Mesh constructor with Distribution
  New simplified cubed sphere grid (#245)
  Fix for StructuredPartitionPolygon inner bounding box
  Add hicSparse backend to sparse matrix multiply. (#246)
  Update atlas-run runner: remove old aprun command and update slurm (#251)
  Avoid no_discard warning
  ...