Author: Zhang Xiaolong (Lord of the Manifolds / 流形之主:张小龙)
A high-performance PyTorch implementation for computing the inverse square root of a symmetric positive-definite matrix (
This engine provides extreme acceleration for large-scale matrices commonly found in Second-Order Deep Learning Optimization (K-FAC), Fully Homomorphic Encryption (FHE) whitening, and High-Frequency Trading covariance mapping.
The rigorous algebraic derivation of the Nyström-based manifold sketching and the subspace projection used in this engine has been documented separately.
Please refer to math.md for the complete mathematical proof and geometric interpretation.
- Python 3.8+
- PyTorch
- NumPy
- SciPy (Required for CPU baseline benchmarking only)
import torch
from randnla_yan import inv_sqrt_yan
# Define target dimensions
N, K = 2048, 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Construct a simulated covariance matrix: M ≈ I + L * L^T
I_N = torch.eye(N, device=device)
low_rank_base = torch.randn(N, K, device=device)
M = I_N + 0.1 * (low_rank_base @ low_rank_base.T)
# Compute M^{-1/2} instantly
M_inv_sqrt = inv_sqrt_yan(M, k=K)Traditional sqrtm and inv) suffer from severe performance degradation, cache pollution, and thermal throttling as matrix dimensions scale and sustained loads increase. The inv_sqrt_yan engine maintains stable, low-millisecond execution times regardless of continuous throughput.
Tested on standard consumer-grade CPU architecture under 100 continuous iterations, eliminating L3 cache interference between baseline and the optimized engine.
| Matrix Size (N) | Sketch Size (K) | SciPy Baseline (Avg) | Yan Engine (Avg) | Absolute Physical Limit | Absolute Acceleration | Max Error |
|---|---|---|---|---|---|---|
| 1024x1024 | 64 | ~465.34 ms | ~1.51 ms | 1.42ms | ~308.1x | 1.16e-02 |
| 2048x2048 | 64 | ~2140.42 ms | ~5.60 ms | 4.68 ms | ~382.2x | 7.31e-03 |
Note: The discrepancy between 1024x1024 and 2048x2048 performance scaling highlights cache miss penalties vs pure computational complexity. The 2048 scale demonstrates the true theoretical capability of the algorithm after bypassing memory I/O bottlenecks.
