1 | 1 | OpenBLAS ChangeLog |
| 2 | +==================================================================== |
| 3 | +Version 0.3.31 |
| 4 | +15-Jan-2026 |
| 5 | + |
| 6 | +general: |
| 7 | + - reverted a matrix partitioning optimization from 0.3.30 that could lead to |
| 8 | + race conditions and subsequent invalid results in GEMM |
| 9 | + - added the bfloat16 extensions BGEMM and BGEMV |
| 10 | + - added a BLAS interface for the ?GEMM_BATCH extensions |
| 11 | + - added the BLAS extensions ?GEMM_BATCH_STRIDED and their CBLAS interface |
| 12 | + - added the basic infrastructure for half-precision float (FP16) format |
| 13 | + using SH prefix |
| 14 | + - reimplemented the LAPACK SLAED3/DLAED3 function using multithreading, thereby |
| 15 | + improving the performance of the SSYEVD/DSYEVD eigensolver for symmetric matrices |
| 16 | + on all platforms |
| 17 | + - limited the number of retries for initial memory allocation to avoid infinite |
| 18 | + hanging on low-memory systems |
| 19 | + - fixed a thread lockup situation encountered with Python 3.9 or older and NumPy |
| 20 | + - introduced a problem size threshold for multithreading in STRMV/DTRMV |
| 21 | + - introduced a problem size threshold for multithreading in CHER/CHER2/CHPR/CHPR2 |
| 22 | + and ZHER/ZHER2/ZHPR/ZHPR2 |
| 23 | + - improved the problem size thresholds for multithreading in SGER/DGER |
| 24 | + - improved autodetection of the Fortran compiler |
| 25 | + - fixed passing of the INTERFACE64=1 option to the flang-new compiler |
| 26 | + - fixed a potential deadlock in multithreaded code after calling fork() |
| 27 | + - fixed builds using CMake on FreeBSD |
| 28 | + - fixed builds using CMake from within Cygwin on Windows |
| 29 | + - fixed builds using CMake and the NVHPC compiler on ARM64 |
| 30 | + - fixed CMake build error from misdetecting compiler or OpenMP versions |
| 31 | + - improved contents of the CMake-generated OpenBLASConfig.cmake file |
| 32 | + - added support for cross-compilation to RISCV targets via CMake |
| 33 | + - fixed cross-compilation to x86 targets from non-x86 architectures |
| 34 | + - fixed failure to install cblas.h if NO_CBLAS=0 was specified |
| 35 | + - fixed missing user-defined pre- and postfixes on functions in lapack.h and lapacke.h |
| 36 | + - included fixes from the Reference-LAPACK project: |
| 37 | + - fix ordering bug in ?LAED/?LASD (Reference-LAPACK PR 1140) |
| 38 | + - revert changes in ?GEEV from PR 1129 (Reference-LAPACK PR 1142) |
| 39 | + - fix workspace allocation in LAPACKE_?TRSEN (Reference-LAPACK PR 1144) |
| 40 | + |
| 41 | +riscv: |
| 42 | + - added optimized SBGEMM kernels for ZVL128B and ZVL256B targets |
| 43 | + - added optimized SHGEMM kernels for ZVL128B and ZVL256B targets |
| 44 | + - added optimized SBGEMV and SHGEMV kernels for ZVL128B/ZVL256B |
| 45 | + - improved performance of the GEMV kernel for ZVL256B |
| 46 | + - improved the performance of the CROT and ZROT kernels for ZVL128B and x280 |
| 47 | + - improved the detection of RVV1.0 capability |
| 48 | + - improved performance of the matrix packing helper functions for ZVL128B and ZVL256B |
| 49 | + - improved performance of OMATCOPY for ZVL128B and ZVL256B |
| 50 | + |
| 51 | +arm: |
| 52 | + - fixed spurious executable stack in the getarch utility |
| 53 | + |
| 54 | +arm64: |
| 55 | + - fixed spurious executable stack in the getarch utility |
| 56 | + - fixed compiler warnings arising from the timer macro RPCC |
| 57 | + - fixed cache size detection for Qualcomm Oryon under Windows on Arm |
| 58 | + - fixed argument handling in the default SVE kernel for SDOT/DDOT |
| 59 | + - building the BFLOAT16 kernels is now enabled by default |
| 60 | + - improved the overall performance of GEMM, SYMM and HEMM on A64FX |
| 61 | + - improved the performance of SDOT/DDOT on A64FX |
| 62 | + - improved the multithreading performance of SDOT/DDOT on A64FX by |
| 63 | + introduction of a throttling table matching thread count to problem size |
| 64 | + - improved the performance of SGER/DGER on A64FX and NEOVERSEV1 |
| 65 | + - improved the multithreading performance of GEMM on A64FX and NEOVERSEV1 |
| 66 | + - improved the performance of the GEMV kernel for SVE-capable targets |
| 67 | + - improved the multithreading performance of SGEMM on NEOVERSEV1 and V2 |
| 68 | + - added optimized SAXPY/DAXPY SVE kernels for A64FX and NEOVERSEV1 |
| 69 | + - added optimized BGEMM and BGEMV kernels for NEOVERSEV1 |
| 70 | + - added an optimized BGEMM kernel for NEOVERSEN2 |
| 71 | + - added support for the NEOVERSEV2 cpu |
| 72 | + - added dedicated support for the Apple M4 cpu as VORTEXM4 |
| 73 | + - added optimized SGEMM/SSYMM/STRMM/SSYRK/SSYR2K for SME-capable targets |
| 74 | + (ARMV9SME and VORTEXM4) |
| 75 | + - improved the precision of the SNRM2 kernel |
| 76 | + - added cpu autodetection and compiler settings for Ampere One processors |
| 77 | + - fixed cpu autodetection for Apple M systems running Linux |
| 78 | + - fixed building on macOS with AppleClang, gfortran and Xcode v16 or newer |
| 79 | + - fixed several errors in the C code replacements for the complex and double |
| 80 | + precision complex LAPACK functions that get used (only) when compiling with |
| 81 | + Microsoft C and NOFORTRAN=1 under MS Windows |
| 82 | + |
| 83 | +power: |
| 84 | + - added initial support for the POWER11 architecture |
| 85 | + - improved performance of DGEMM and DGEMV on POWER10 |
| 86 | + - fixed the default compiler flags to use "-O3" instead of the possibly unsafe |
| 87 | + "-Ofast" |
| 88 | + - fixed building under MacOS (for old G4 Macs) with CMake |
| 89 | + - fixed potential miscompilation of DGEMV and other assembly kernels by GCC 15.1 |
| 90 | + - fixed compilation with recent versions of flang |
| 91 | + |
| 92 | +loongarch64: |
| 93 | + - fixed warnings and potential inaccuracies arising from incorrect saving of registers |
| 94 | + - fixed enumeration of logical cores on big NUMA servers |
| 95 | + - fixed building with LLVM and the INTERFACE64=1 option |
| 96 | + |
| 97 | +x86: |
| 98 | + - fixed building the GEMM3M kernels for the GENERIC target |
| 99 | + - fixed several errors in the C code replacements for the complex and double |
| 100 | + precision complex LAPACK functions that get used (only) when compiling with |
| 101 | + Microsoft C and NOFORTRAN=1 under MS Windows |
| 102 | + |
| 103 | +x86_64: |
| 104 | + - added cpu autodetection for Intel Lunar Lake (Core Ultra 200V) |
| 105 | + - changed all ?MIN and ?MAX assembly kernels to use unaligned operations |
| 106 | + - fixed several errors in the C code replacements for the complex and double |
| 107 | + precision complex LAPACK functions that get used (only) when compiling with |
| 108 | + Microsoft C and NOFORTRAN=1 under MS Windows |
| 109 | + - fixed potential crashes in builds for Cooper Lake, Sapphire Rapids or Zen5 cpus |
| 110 | + under MS Windows |
| 111 | + |
| 112 | +zarch: |
| 113 | + - added support for building with CMake |
| 114 | + |
| 115 | +sparc: |
| 116 | + - fixed a potential crash in the DNRM2 kernel |
| 117 | + |
2 | 118 | ==================================================================== |
3 | 119 | Version 0.3.30 |
4 | 120 | 19-Jun-2025 |