Fix ScaLAPACK not linked in MPI builds with meson #717

Open
jameskermode wants to merge 3 commits into public from fix-scalapack-mpi-linking

Conversation

@jameskermode
Member

Summary

  • Add the missing mpi_dep to the gap_fit_lib dependencies so that libgap_fit.so records ScaLAPACK/MPI as needed libraries. On Linux toolchains where --as-needed is the default linker behaviour, these libraries were being silently dropped from the final link.
  • Add a find_library('scalapack') fallback for ScaLAPACK detection, matching the existing OpenBLAS pattern. Environments with broken or missing pkg-config files (conda, HPC modules) now fall back to a direct library search.

Fixes #715
Fixes #716

Test plan

  • MPI build (meson setup builddir -Dmpi=true) configures and compiles successfully
  • Non-MPI build still works
  • Verified libgap_fit link command in build.ninja now includes ScaLAPACK and MPI flags
  • Linux: ldd builddir/src/Programs/gap_fit | grep scalapack should show the library
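
The `ldd` check above can also be scripted. The following is a minimal sketch, not part of this PR: `linked_libraries`, `links_against` and `check_binary` are hypothetical helper names, and the parsing assumes the usual `name => path (address)` layout of `ldd` output on Linux.

```python
import subprocess


def linked_libraries(ldd_output: str) -> list[str]:
    """Return the library names listed in the text printed by ldd."""
    names = []
    for line in ldd_output.splitlines():
        fields = line.split()
        if fields:
            # The first field on each line is the library name.
            names.append(fields[0])
    return names


def links_against(ldd_output: str, fragment: str) -> bool:
    """True if any listed library name contains `fragment` (case-insensitive)."""
    wanted = fragment.lower()
    return any(wanted in name.lower() for name in linked_libraries(ldd_output))


def check_binary(path: str, fragment: str = "scalapack") -> bool:
    """Run ldd on `path` (Linux only) and look for `fragment` in its output."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=True
    )
    return links_against(result.stdout, fragment)
```

With an MPI build in place, `check_binary("builddir/src/Programs/gap_fit")` would be expected to return `True` once ScaLAPACK is recorded as a needed library.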

🤖 Generated with Claude Code

Two issues caused gap_fit to fail with "LA_Matrix_Factorise: cannot
factorise" when built with meson and MPI enabled:

1. gap_fit_lib was missing mpi_dep in its dependencies, so
   libgap_fit.so didn't record ScaLAPACK/MPI as needed libraries.
   On Linux where --as-needed is the default, the linker dropped
   ScaLAPACK from the final executable since gap_fit.o doesn't
   directly reference ScaLAPACK symbols.

2. ScaLAPACK detection had no find_library fallback, unlike OpenBLAS.
   Environments with broken or missing pkg-config files (conda, HPC
   modules) would silently produce wrong linker flags, and
   b_lundef=false masked the resulting undefined symbols.

Fixes #715, fixes #716

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jameskermode and others added 2 commits March 31, 2026 09:53

The old Makefile build enabled HAVE_QR by default (make config
defaulted to 'y'). Without this flag, gap_fit uses the Cholesky
factorisation code path (LA_Matrix_Factorise/dpotrf) instead of QR
decomposition. The Cholesky path is both numerically less stable
(requires positive definiteness) and lacks MPI/ScaLAPACK support,
causing failures in both serial and parallel gap_fit runs.
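
To illustrate why a Cholesky path is fragile, here is a pure-Python sketch of an unblocked Cholesky factorisation. It is not QUIP code: `cholesky_info` is a hypothetical name, and the return value only mimics the INFO convention of LAPACK's dpotrf (0 on success, i > 0 if the factorisation breaks down at step i because the matrix is not positive definite).

```python
import math


def cholesky_info(a: list[list[float]]) -> int:
    """Attempt an in-place lower Cholesky factorisation of a symmetric matrix.

    Returns 0 on success, or the 1-based step index at which the pivot
    is not positive, mirroring the INFO convention of dpotrf.
    """
    n = len(a)
    for j in range(n):
        # Pivot: diagonal entry minus the squares of the row computed so far.
        pivot = a[j][j] - sum(a[j][k] ** 2 for k in range(j))
        if pivot <= 0.0:
            return j + 1  # not positive definite: factorisation fails here
        a[j][j] = math.sqrt(pivot)
        # Fill in the rest of column j of the lower-triangular factor.
        for i in range(j + 1, n):
            partial = sum(a[i][k] * a[j][k] for k in range(j))
            a[i][j] = (a[i][j] - partial) / a[j][j]
    return 0
```

A symmetric positive-definite input such as `[[4.0, 2.0], [2.0, 3.0]]` returns 0, while the singular matrix `[[1.0, 1.0], [1.0, 1.0]]` returns 2 because the second pivot is exactly zero. A QR-based least-squares solve does not need the matrix to be positive definite, which is one reason it is the more robust default.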

Fixes #715, fixes #716

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The missing -DHAVE_QR flag wasn't caught because:
1. Serial tests never checked which solve method was used
2. CI never built with MPI, so all ScaLAPACK tests were skipped

Fix both gaps:
- Add check_qr_path() assertion to all serial gap_fit tests that
  perform a fit, verifying "Using LAPACK to solve QR" appears in the
  log. This would have immediately caught the missing -DHAVE_QR.
- Add a build-mpi CI job on ubuntu-latest that installs OpenMPI +
  ScaLAPACK, builds with -Dmpi=true, and runs the full test suite
  with HAVE_SCALAPACK=1, enabling the 6 existing ScaLAPACK tests.
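
A log assertion of the kind described above might look like the following sketch. Only the name `check_qr_path` and the marker string come from this commit message; the signature, the `log_uses_qr` helper and the file-based interface are assumptions made for illustration, not the code added in this PR.

```python
from pathlib import Path

# Marker line quoted in the commit message above.
QR_MARKER = "Using LAPACK to solve QR"


def log_uses_qr(log_text: str) -> bool:
    """True if the gap_fit log text reports the QR solve path."""
    return QR_MARKER in log_text


def check_qr_path(log_file: str) -> None:
    """Raise AssertionError unless the log at `log_file` shows the QR path."""
    log_text = Path(log_file).read_text()
    assert log_uses_qr(log_text), (
        f"'{QR_MARKER}' not found in {log_file}; "
        "was gap_fit built without -DHAVE_QR?"
    )
```

Calling such a helper at the end of every serial test that performs a fit turns a silent fallback to the Cholesky path into an immediate test failure.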

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@jameskermode
Member Author

@albapa do you want to check this before I merge?

Development

Successfully merging this pull request may close these issues.

LA_Matrix_Factorise: cannot factorise, error: 28
Problem in compiling gap_fit MPI with meson
