Commit 5bb7ef1

Update the Changelog for version 0.3.31
1 parent 4cd575c commit 5bb7ef1

1 file changed, +116 -0 lines changed

Changelog.txt

Lines changed: 116 additions & 0 deletions
@@ -1,4 +1,120 @@
 OpenBLAS ChangeLog
+====================================================================
+Version 0.3.31
+15-Jan-2026
+
+general:
+- reverted a matrix partitioning optimization from 0.3.30 that could lead to
+  race conditions and subsequent invalid results in GEMM
+- added the bfloat16 extensions BGEMM and BGEMV
+- added a BLAS interface for the ?GEMM_BATCH extensions
+- added the BLAS extensions ?GEMM_BATCH_STRIDED and their CBLAS interface
+- added the basic infrastructure for half-precision float (FP16) format
+  using SH prefix
+- reimplemented the LAPACK SLAED3/DLAED3 function using multithreading, thereby
+  improving the performance of the SSYEVD/DSYEVD eigensolver for symmetric matrices
+  on all platforms
+- limited the number of retries for initial memory allocation to avoid infinite
+  hanging on low-memory systems
+- fixed a thread lockup situation encountered with python 3.9 or older and numpy
+- introduced a problem size threshold for multithreading in STRMV/DTRMV
+- introduced a problem size threshold for multithreading in CHER/CHER2/CHPR/CHPR2
+  and ZHER/ZHER2/ZHPR/ZHPR2
+- improved the problem size thresholds for multithreading in SGER/DGER
+- improved autodetection of the Fortran compiler
+- fixed passing of the INTERFACE64=1 option to the flang-new compiler
+- fixed a potential deadlock in multithreaded code after calling fork()
+- fixed builds using CMake on FreeBSD
+- fixed builds using CMake from within Cygwin on Windows
+- fixed builds using CMake and the NVHPC compiler on ARM64
+- fixed CMake build error from misdetecting compiler or OpenMP versions
+- improved contents of the CMake-generated OpenBLASConfig.cmake file
+- added support for cross-compilation to RISCV targets via CMake
+- fixed cross-compilation to x86 targets from non-x86 architectures
+- fixed failure to install cblas.h if NO_CBLAS=0 was specified
+- fixed missing user-defined pre- and postfixes on functions in lapack.h,lapacke.h
+- included fixes from the Reference-LAPACK project:
+  - fix ordering bug in ?LAED/?LASD (Reference-LAPACK PR 1140)
+  - revert changes in ?GEEV from PR 1129 (Reference-LAPACK PR 1142)
+  - fix workspace allocation in LAPACKE_?TRSEN (Reference-LAPACK PR 1144)
+
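Reviewer note on the ?GEMM_BATCH_STRIDED entry in the general list: the changelog does not show the argument list, so the following is only a sketch of what a strided batched GEMM usually means, not the OpenBLAS interface. The function name `gemm_batch_strided`, its parameter order, and the row-major flat-list storage are all assumptions made for illustration. The idea is that the batch lives in one contiguous buffer per operand, and matrix `i` starts at offset `i * stride`.

```python
def gemm_batch_strided(m, n, k, alpha, a, stride_a, b, stride_b,
                       beta, c, stride_c, batch_count):
    """Reference loop: C_i = alpha * A_i @ B_i + beta * C_i for every i.

    Hypothetical helper for illustration only (not the OpenBLAS signature).
    a, b and c are flat lists holding row-major, densely packed matrices;
    matrix i of each operand starts at index i * stride_x.
    """
    for i in range(batch_count):
        off_a = i * stride_a
        off_b = i * stride_b
        off_c = i * stride_c
        for row in range(m):
            for col in range(n):
                acc = 0.0
                for p in range(k):
                    acc += a[off_a + row * k + p] * b[off_b + p * n + col]
                idx = off_c + row * n + col
                c[idx] = alpha * acc + beta * c[idx]
    return c


# Two 2x2 products stored back to back, so every stride is 4 elements.
a = [1.0, 2.0, 3.0, 4.0,   1.0, 0.0, 0.0, 1.0]
b = [1.0, 0.0, 0.0, 1.0,   5.0, 6.0, 7.0, 8.0]
c = [0.0] * 8
gemm_batch_strided(2, 2, 2, 1.0, a, 4, b, 4, 0.0, c, 4, 2)
```

A single call replaces a loop of independent GEMM calls, which is what gives an optimized library room to schedule the whole batch at once.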
+riscv:
+- added optimized SBGEMM kernels for ZVL128B and ZVL256B targets
+- added optimized SHGEMM kernels for ZVL128B and ZVL256B targets
+- added optimized SBGEMV and SHGEMV kernels for ZVL128B/ZVL256B
+- improved performance of the GEMV kernel for ZVL256B
+- improved the performance of the CROT and ZROT kernels for ZVL128B and x280
+- improved the detection of RVV1.0 capability
+- improved performance of the matrix packing helper functions for ZVL128B and ZVL256B
+- improved performance of OMATCOPY for ZVL128B and ZVL256B
+
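Reviewer note on the SBGEMM/SBGEMV entries: bfloat16 keeps the sign bit and 8-bit exponent of an IEEE float32 and stores only the top 7 significand bits, so a value converts by rounding away the low 16 bits. The sketch below shows that conversion and a dot product that widens each bfloat16 operand before accumulating. It assumes the usual convention of narrow inputs with a wider accumulator; the helper names are hypothetical, and Python's double-precision float stands in for the float32 accumulator a real kernel would use.

```python
import struct


def float_to_bf16_bits(x):
    """Round x (must fit in float32 range) to bfloat16; return the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:
        # NaN: keep it a NaN by forcing a significand bit in the kept half.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Round to nearest, ties to even, on the 16 bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(h):
    """Widen a bfloat16 bit pattern back to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


def bf16_dot(x_bits, y_bits):
    """Dot product of two bfloat16 vectors with a wider accumulator."""
    acc = 0.0
    for xb, yb in zip(x_bits, y_bits):
        acc += bf16_bits_to_float(xb) * bf16_bits_to_float(yb)
    return acc


x = [float_to_bf16_bits(v) for v in (1.0, 2.0)]
y = [float_to_bf16_bits(v) for v in (3.0, 4.0)]
result = bf16_dot(x, y)
```

Widening is exact because every bfloat16 value is also a float32 value; only the narrowing step loses precision.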
+arm:
+- fixed spurious executable stack in the getarch utility
+
+arm64:
+- fixed spurious executable stack in the getarch utility
+- fixed compiler warnings arising from the timer macro RPCC
+- fixed cache size detection for Qualcomm Oryon under Windows on Arm
+- fixed argument handling in the default SVE kernel for SDOT/DDOT
+- building the BFLOAT16 kernels is now enabled by default
+- improved the overall performance of GEMM,SYMM and HEMM on A64FX
+- improved the performance of SDOT/DDOT on A64FX
+- improved the multithreading performance of SDOT/DDOT on A64FX by
+  introduction of a throttling table matching thread count to problem size
+- improved the performance of SGER/DGER on A64FX and NEOVERSEV1
+- improved the multithreading performance of GEMM on A64FX and NEOVERSEV1
+- improved the performance of the GEMV kernel for SVE-capable targets
+- improved the multithreading performance of SGEMM on NEOVERSEV1 and V2
+- added optimized SAXPY/DAXPY SVE kernels for A64FX and NEOVERSEV1
+- added optimized BGEMM and BGEMV kernels for NEOVERSEV1
+- added an optimized BGEMM kernel for NEOVERSEN2
+- added support for the NEOVERSEV2 cpu
+- added dedicated support for the Apple M4 cpu as VORTEXM4
+- added optimized SGEMM/SSYMM/STRMM/SSYRK/SSYR2K for SME-capable targets
+  (ARMV9SME and VORTEXM4)
+- improved the precision of the SNRM2 kernel
+- added cpu autodetection and compiler settings for Ampere One processors
+- fixed cpu autodetection for Apple M systems running Linux
+- fixed building on MacOS with AppleClang,gfortran and xcode v16 or newer
+- fixed several errors in the C code replacements for the complex and double
+  precision complex LAPACK functions that get used (only) when compiling with
+  Microsoft C and NOFORTRAN=1 under MS Windows
+
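Reviewer note on the SDOT/DDOT throttling table and on the multithreading thresholds in the general list: for cheap level-1 and level-2 routines, waking extra threads can cost more than the work they take over, so small problems are better served by fewer threads. The sketch below shows one way such a table lookup could work. The table contents, the cut-off values, and the name `pick_threads` are invented for illustration and are not taken from OpenBLAS.

```python
import bisect

# Hypothetical table: (largest problem size served, thread count to use).
# Entries must be sorted by problem size. Real values would be tuned per CPU.
THROTTLE_TABLE = [(10_000, 1), (100_000, 2), (1_000_000, 4)]


def pick_threads(n, max_threads, table=THROTTLE_TABLE):
    """Return the thread count for a problem of size n, capped at max_threads."""
    limits = [limit for limit, _ in table]
    # Index of the first row whose size limit is >= n.
    i = bisect.bisect_left(limits, n)
    # Past the last row, the problem is large enough to use every thread.
    wanted = table[i][1] if i < len(table) else max_threads
    return max(1, min(wanted, max_threads))


small = pick_threads(500, 8)        # stays single-threaded
large = pick_threads(5_000_000, 8)  # uses all available threads
```

A single-row table reduces to the plain size threshold described for STRMV/DTRMV: below the limit run on one thread, above it use them all.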
+power:
+- added initial support for the POWER11 architecture
+- improved performance of DGEMM and DGEMV on POWER10
+- fixed the default compiler flags to use "-O3" instead of the possibly unsafe
+  "-Ofast"
+- fixed building under MacOS (for old G4 Macs) with CMake
+- fixed potential miscompilation of DGEMV and other assembly kernels by gcc15.1
+- fixed compilation with recent versions of flang
+
+loongarch64:
+- fixed warnings and potential inaccuracies arising from incorrect saving of registers
+- fixed enumeration of logical cores on big NUMA servers
+- fixed building with LLVM and the INTERFACE64=1 option
+
+x86:
+- fixed building the GEMM3M kernels for the GENERIC target
+- fixed several errors in the C code replacements for the complex and double
+  precision complex LAPACK functions that get used (only) when compiling with
+  Microsoft C and NOFORTRAN=1 under MS Windows
+
+x86_64:
+- added cpu autodetection for Intel Lunar Lake (Core Ultra 200V)
+- changed all ?MIN and ?MAX assembly kernels to use unaligned operations
+- fixed several errors in the C code replacements for the complex and double
+  precision complex LAPACK functions that get used (only) when compiling with
+  Microsoft C and NOFORTRAN=1 under MS Windows
+- fixed potential crashes in builds for Cooper Lake, Sapphire Rapids or Zen5 cpus
+  under MS Windows
+
+zarch:
+- added support for building with CMake
+
+sparc:
+- fixed a potential crash in the DNRM2 kernel
+
 ====================================================================
 Version 0.3.30
 19-Jun-2025
