apple-bottom

FP64-class BLAS on Apple Silicon GPU — 618 GFLOP/s, ~10⁻¹⁵ precision, zero CUDA dependency

Apple Silicon GPUs have no native FP64. apple-bottom fixes that — double-float (DD) arithmetic on Metal compute shaders gives you FP64-class matrix operations at GPU throughput. Production-validated with Quantum ESPRESSO (22% faster, 11 decimal places of agreement).

git clone https://github.com/grantdh/apple-bottom.git
cd apple-bottom
make && make test   # 48/48 tests pass

Performance

Benchmarked on M2 Max (38-core GPU, 64 GB unified memory):

Operation	GPU (apple-bottom)	CPU (Accelerate AMX)	Speedup	Precision
DGEMM 2048²	552 GFLOP/s	468 GFLOP/s	+18%	~10⁻¹⁵
DGEMM 4096²	643 GFLOP/s	566 GFLOP/s	+14%	~10⁻¹⁵
ZGEMM (QE Si64)	—	—	+22% wall	11 decimal places
ZGEMM (QE Si128)	—	—	+12% wall	exact energy match

Quantum ESPRESSO production benchmark (Si64, 64 atoms, DFT):

Configuration          Wall Time    CPU Usage    Energy (Ry)
──────────────────────────────────────────────────────────────
OpenBLAS 6-thread         2:28        5.3×       -2990.44276157
apple-bottom GPU          2:01        3.4×       -2990.44276157  ✓

22% faster wall time, 47% less CPU usage, bit-identical energy.

When to Use apple-bottom

Use it for: iterative solvers (SCF, CG, GMRES, Lanczos), large matrices (N ≥ 2048), repeated GEMM in loops — anywhere GPU upload/download overhead amortizes across iterations.

Don't use it for: single small GEMM calls (N < 1024), element-sensitive algorithms (pivoted LU, eigensolvers), or when IEEE 754 bit-exact FP64 is required.

API

#include "apple_bottom.h"

ab_init();

// Create and upload matrices
ABMatrix A = ab_matrix_create(2048, 2048);
ABMatrix B = ab_matrix_create(2048, 2048);
ABMatrix C = ab_matrix_create(2048, 2048);
ab_matrix_upload(A, data_A, true);
ab_matrix_upload(B, data_B, true);

// C = A × B  (FP64-class precision on GPU)
ab_dgemm(A, B, C);

// Download result
ab_matrix_download(C, result, true);

// Cleanup
ab_matrix_destroy(A);
ab_matrix_destroy(B);
ab_matrix_destroy(C);
ab_shutdown();

Full API Surface

Category	Functions
BLAS	`ab_dgemm`, `ab_dgemm_scaled`, `ab_zgemm`, `ab_zgemm_ex`, `ab_dsyrk`
Matrix	`ab_matrix_create`, `ab_matrix_upload`, `ab_matrix_download`, `ab_matrix_zero`, `ab_matrix_copy`
Element-wise	`ab_matrix_add`, `ab_matrix_sub`, `ab_matrix_scale`
Async	`ab_dgemm_async`, `ab_zgemm_async`, `ab_future_wait`
Pool	`ab_pool_create`, `ab_pool_get_matrix`, `ab_pool_reset` (reduces allocation overhead in iterative codes)
Session	`ab_session_create`, `ab_session_dgemm`, `ab_session_zgemm` (named matrix management)

Fortran Integration (Quantum ESPRESSO)

Drop-in replacement via EXTERNAL declaration — no module dependencies:

EXTERNAL :: ab_zgemm
CALL ab_zgemm('N', 'C', m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)

The Fortran bridge auto-routes small calls (< 100M FLOPs) to OpenBLAS and large calls to GPU. See Integration Guide for C, Python, and Fortran setup.

Architecture

Double-Float (DD) Arithmetic

Each FP64 value is stored as two FP32 values (hi, lo) using Dekker/Knuth error-free transformations. This gives ~48-bit effective mantissa (~10⁻¹⁵ relative error) — not full IEEE 754 FP64 (53-bit, ~10⁻¹⁶), but sufficient for scientific computing where accumulated solver errors dominate.

Key implementation details: MTLMathModeSafe prevents the Metal compiler from reordering FMA operations that would destroy DD precision. Complex GEMM uses Gauss's 3-multiply algorithm (3 real GEMM instead of 4, 25% compute reduction). The GPU kernel uses 4×4 register blocking with BM=BN=64 tiling, optimized for Apple Silicon GPU occupancy.

Precision Guarantees

Per-element DD precision is ~10⁻¹⁵. DGEMM Frobenius error scales as ~√N×10⁻¹⁵ (statistical cancellation in random matrices). For N=4096, empirical Frobenius error is ~5×10⁻¹⁴. Full analysis in Precision Envelope.

Fortran Bridge Architecture

Fortran (QE)  →  fortran_bridge.c  →  < 100M FLOPs? → cblas (OpenBLAS/Accelerate)
                                    →  ≥ 100M FLOPs? → blas_wrapper.c → Metal GPU

Verification & Validation

Production-validated following NASA-STD-7009A methodology:

V-2 Convergence Study: Frobenius error 6.5×10⁻¹⁵ to 5.1×10⁻¹⁴ for N ∈ {64, 128, ..., 4096}
VAL-1 Production: Quantum ESPRESSO Si64 DFT — 11 decimal place agreement (-2990.44276157 Ry)
48/48 tests: 6 precision tests + 42 correctness tests (regression coverage for 7 critical bug fixes)

Documentation: V&V Report · Precision Envelope · QE Validation

Requirements

macOS 14+ (Sonoma) with Xcode 16+ SDK (for MTLMathModeSafe)
Apple Silicon (M1, M2, M3, M4 — any variant)
Older SDKs compile but achieve only ~10⁻⁸ precision (fast-math breaks DD arithmetic)

Prior Art

Project	Approach	Throughput	Status
metal-float64 (Turner)	Integer FP64 emulation via SoftFloat/LLVM	~24 GFLOP/s	Archived 2024
AppleNumericalComputing (Yamanishi)	Educational benchmarks, FP32 GPU	N/A	Research
MLX (Apple)	ML framework, FP64 on CPU only	N/A	Active, no GPU FP64
Accelerate (Apple)	AMX hardware, IEEE FP64	536 GFLOP/s	CPU-only, thread-hostile
apple-bottom	DD arithmetic on Metal FP32 ALUs	643 GFLOP/s	Active, production-validated

apple-bottom takes a fundamentally different approach from metal-float64: native FP32 ALU arithmetic with error-free transformations (Dekker 1971) instead of integer bit manipulation. This trades strict IEEE 754 compliance for ~25× higher throughput at the BLAS level, which is the right tradeoff for scientific iterative solvers where norm-averaged convergence dominates.

Research

Triple-double (TD) emulation achieves faithfully-rounded FP64 (99.5% correctly rounded, max 1 ULP error) at 148 GFLOP/s — a new point on the precision-performance Pareto frontier. See research/td-dgemm/.

Limitations

Rectangular matrices (aspect ratio > 10:1): known correctness issues, being addressed
Single GEMM calls: ~100 μs GPU overhead dominates for one-shot operations
ZHERK: deprecated (20× slower than CPU AMX) — use cblas_zherk instead
Thread safety: Metal command queue serializes; use separate contexts for concurrency (planned)
Max dimension: 46,340 × 46,340 (overflow protection)

Project Structure

apple-bottom/
├── src/apple_bottom.m          # Core Metal GPU implementation (DD kernels)
├── src/blas_wrapper.c          # BLAS-compatible C API
├── src/fortran_bridge.c        # Fortran ABI bridge for QE/VASP/CP2K
├── include/apple_bottom.h      # Public API header (v1.2.0)
├── tests/                      # 48 tests (precision + correctness + V&V)
├── benchmarks/                 # DGEMM, ZGEMM, DSYRK, pool, async benchmarks
├── examples/01_basic_dgemm/    # Runnable example
├── docs/                       # Integration guide + V&V documentation
└── research/td-dgemm/          # Triple-double faithfully-rounded FP64 research

Contributing

See CONTRIBUTING.md. Bug reports and feature requests welcome via Issues.

Citation

@software{heileman2026applebottom,
  author    = {Heileman, Grant David},
  title     = {apple-bottom: FP64-class BLAS for Apple Silicon GPU},
  year      = {2026},
  url       = {https://github.com/grantdh/apple-bottom},
  version   = {1.2.0}
}

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 85 Commits
.github		.github
benchmarks		benchmarks
docs		docs
examples/01_basic_dgemm		examples/01_basic_dgemm
include		include
python		python
research		research
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
applebottom.pc.in		applebottom.pc.in

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

apple-bottom

Performance

When to Use apple-bottom

API

Full API Surface

Fortran Integration (Quantum ESPRESSO)

Architecture

Double-Float (DD) Arithmetic

Precision Guarantees

Fortran Bridge Architecture

Verification & Validation

Requirements

Prior Art

Research

Limitations

Project Structure

Contributing

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

apple-bottom

Performance

When to Use apple-bottom

API

Full API Surface

Fortran Integration (Quantum ESPRESSO)

Architecture

Double-Float (DD) Arithmetic

Precision Guarantees

Fortran Bridge Architecture

Verification & Validation

Requirements

Prior Art

Research

Limitations

Project Structure

Contributing

Citation

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages