Skip to content

grantdh/apple-bottom

apple-bottom

FP64-class BLAS on Apple Silicon GPU — 618 GFLOP/s, ~10⁻¹⁵ precision, zero CUDA dependency

CI Tests License Platform Precision


Apple Silicon GPUs have no native FP64. apple-bottom fixes that — double-float (DD) arithmetic on Metal compute shaders gives you FP64-class matrix operations at GPU throughput. Production-validated with Quantum ESPRESSO (22% faster, 11 decimal places of agreement).

git clone https://github.com/grantdh/apple-bottom.git
cd apple-bottom
make && make test   # 48/48 tests pass

Performance

Benchmarked on M2 Max (38-core GPU, 64 GB unified memory):

Operation GPU (apple-bottom) CPU (Accelerate AMX) Speedup Precision
DGEMM 2048² 552 GFLOP/s 468 GFLOP/s +18% ~10⁻¹⁵
DGEMM 4096² 643 GFLOP/s 566 GFLOP/s +14% ~10⁻¹⁵
ZGEMM (QE Si64) +22% wall 11 decimal places
ZGEMM (QE Si128) +12% wall exact energy match

Quantum ESPRESSO production benchmark (Si64, 64 atoms, DFT):

Configuration          Wall Time    CPU Usage    Energy (Ry)
──────────────────────────────────────────────────────────────
OpenBLAS 6-thread         2:28        5.3×       -2990.44276157
apple-bottom GPU          2:01        3.4×       -2990.44276157  ✓

22% faster wall time, 47% less CPU usage, bit-identical energy.

When to Use apple-bottom

Use it for: iterative solvers (SCF, CG, GMRES, Lanczos), large matrices (N ≥ 2048), repeated GEMM in loops — anywhere GPU upload/download overhead amortizes across iterations.

Don't use it for: single small GEMM calls (N < 1024), element-sensitive algorithms (pivoted LU, eigensolvers), or when IEEE 754 bit-exact FP64 is required.

API

#include "apple_bottom.h"

ab_init();

// Create and upload matrices
ABMatrix A = ab_matrix_create(2048, 2048);
ABMatrix B = ab_matrix_create(2048, 2048);
ABMatrix C = ab_matrix_create(2048, 2048);
ab_matrix_upload(A, data_A, true);
ab_matrix_upload(B, data_B, true);

// C = A × B  (FP64-class precision on GPU)
ab_dgemm(A, B, C);

// Download result
ab_matrix_download(C, result, true);

// Cleanup
ab_matrix_destroy(A);
ab_matrix_destroy(B);
ab_matrix_destroy(C);
ab_shutdown();

Full API Surface

Category Functions
BLAS ab_dgemm, ab_dgemm_scaled, ab_zgemm, ab_zgemm_ex, ab_dsyrk
Matrix ab_matrix_create, ab_matrix_upload, ab_matrix_download, ab_matrix_zero, ab_matrix_copy
Element-wise ab_matrix_add, ab_matrix_sub, ab_matrix_scale
Async ab_dgemm_async, ab_zgemm_async, ab_future_wait
Pool ab_pool_create, ab_pool_get_matrix, ab_pool_reset (reduces allocation overhead in iterative codes)
Session ab_session_create, ab_session_dgemm, ab_session_zgemm (named matrix management)

Fortran Integration (Quantum ESPRESSO)

Drop-in replacement via EXTERNAL declaration — no module dependencies:

EXTERNAL :: ab_zgemm
CALL ab_zgemm('N', 'C', m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)

The Fortran bridge auto-routes small calls (< 100M FLOPs) to OpenBLAS and large calls to GPU. See Integration Guide for C, Python, and Fortran setup.

Architecture

Double-Float (DD) Arithmetic

Each FP64 value is stored as two FP32 values (hi, lo) using Dekker/Knuth error-free transformations. This gives ~48-bit effective mantissa (~10⁻¹⁵ relative error) — not full IEEE 754 FP64 (53-bit, ~10⁻¹⁶), but sufficient for scientific computing where accumulated solver errors dominate.

Key implementation details: MTLMathModeSafe prevents the Metal compiler from reordering FMA operations that would destroy DD precision. Complex GEMM uses Gauss's 3-multiply algorithm (3 real GEMM instead of 4, 25% compute reduction). The GPU kernel uses 4×4 register blocking with BM=BN=64 tiling, optimized for Apple Silicon GPU occupancy.

Precision Guarantees

Per-element DD precision is ~10⁻¹⁵. DGEMM Frobenius error scales as ~√N×10⁻¹⁵ (statistical cancellation in random matrices). For N=4096, empirical Frobenius error is ~5×10⁻¹⁴. Full analysis in Precision Envelope.

Fortran Bridge Architecture

Fortran (QE)  →  fortran_bridge.c  →  < 100M FLOPs? → cblas (OpenBLAS/Accelerate)
                                    →  ≥ 100M FLOPs? → blas_wrapper.c → Metal GPU

Verification & Validation

Production-validated following NASA-STD-7009A methodology:

  • V-2 Convergence Study: Frobenius error 6.5×10⁻¹⁵ to 5.1×10⁻¹⁴ for N ∈ {64, 128, ..., 4096}
  • VAL-1 Production: Quantum ESPRESSO Si64 DFT — 11 decimal place agreement (-2990.44276157 Ry)
  • 48/48 tests: 6 precision tests + 42 correctness tests (regression coverage for 7 critical bug fixes)

Documentation: V&V Report · Precision Envelope · QE Validation

Requirements

  • macOS 14+ (Sonoma) with Xcode 16+ SDK (for MTLMathModeSafe)
  • Apple Silicon (M1, M2, M3, M4 — any variant)
  • Older SDKs compile but achieve only ~10⁻⁸ precision (fast-math breaks DD arithmetic)

Prior Art

Project Approach Throughput Status
metal-float64 (Turner) Integer FP64 emulation via SoftFloat/LLVM ~24 GFLOP/s Archived 2024
AppleNumericalComputing (Yamanishi) Educational benchmarks, FP32 GPU N/A Research
MLX (Apple) ML framework, FP64 on CPU only N/A Active, no GPU FP64
Accelerate (Apple) AMX hardware, IEEE FP64 536 GFLOP/s CPU-only, thread-hostile
apple-bottom DD arithmetic on Metal FP32 ALUs 643 GFLOP/s Active, production-validated

apple-bottom takes a fundamentally different approach from metal-float64: native FP32 ALU arithmetic with error-free transformations (Dekker 1971) instead of integer bit manipulation. This trades strict IEEE 754 compliance for ~25× higher throughput at the BLAS level, which is the right tradeoff for scientific iterative solvers where norm-averaged convergence dominates.

Research

Triple-double (TD) emulation achieves faithfully-rounded FP64 (99.5% correctly rounded, max 1 ULP error) at 148 GFLOP/s — a new point on the precision-performance Pareto frontier. See research/td-dgemm/.

Limitations

  • Rectangular matrices (aspect ratio > 10:1): known correctness issues, being addressed
  • Single GEMM calls: ~100 μs GPU overhead dominates for one-shot operations
  • ZHERK: deprecated (20× slower than CPU AMX) — use cblas_zherk instead
  • Thread safety: Metal command queue serializes; use separate contexts for concurrency (planned)
  • Max dimension: 46,340 × 46,340 (overflow protection)

Project Structure

apple-bottom/
├── src/apple_bottom.m          # Core Metal GPU implementation (DD kernels)
├── src/blas_wrapper.c          # BLAS-compatible C API
├── src/fortran_bridge.c        # Fortran ABI bridge for QE/VASP/CP2K
├── include/apple_bottom.h      # Public API header (v1.2.0)
├── tests/                      # 48 tests (precision + correctness + V&V)
├── benchmarks/                 # DGEMM, ZGEMM, DSYRK, pool, async benchmarks
├── examples/01_basic_dgemm/    # Runnable example
├── docs/                       # Integration guide + V&V documentation
└── research/td-dgemm/          # Triple-double faithfully-rounded FP64 research

Contributing

See CONTRIBUTING.md. Bug reports and feature requests welcome via Issues.

Citation

@software{heileman2026applebottom,
  author    = {Heileman, Grant David},
  title     = {apple-bottom: FP64-class BLAS for Apple Silicon GPU},
  year      = {2026},
  url       = {https://github.com/grantdh/apple-bottom},
  version   = {1.2.0}
}

License

MIT — see LICENSE.

About

BLASphemy: Metal-native linear algebra for Apple Silicon...because cuBLAS isn't coming to Cupertino

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors