Rewrite README

Shnatsel · Shnatsel · commit bdc185f7a617 · 2024-01-30T22:24:34.000Z
diff --git a/README.md b/README.md
@@ -1,25 +1,37 @@
 # PHFT
 
 **PH**ast**FT** (PHFT) is a high-performance, "quantum-inspired" Fast Fourier Transform (FFT) library written in pure
-and
-safe Rust.
+and safe Rust. It is the fastest pure-Rust FFT library according to our benchmarks.
 
-What's with the name? Great question!
+## Features
 
-The name, **PHFT**, is derived from the implementation of the
-[Quantum Fourier Transform](https://en.wikipedia.org/wiki/Quantum_Fourier_transform) (QFT). Namely, the
-[quantum circuit implementation of QFT](https://en.wikipedia.org/wiki/Quantum_Fourier_transform#Circuit_implementation)
-consists of the **P**hase gates and **H**adamard gates. Hence, **PH**ast**FT**.
+- Takes advantage of the latest CPU features up to and including AVX-512, but performs well even without them.
+- Zero `unsafe` code.
+- Python bindings (via [PyO3](https://github.com/PyO3/pyo3)).
+- Optional parallelization of some steps across 2 threads (with even more parallelization planned).
+- Did we mention it is really fast?!
 
-In general, the FFT is equivalent to applying gates to all qubits in `[0, n)`. This approach creates to oppurtunity to
-leverage the same memory access patterns as high-performance quantum state simulator. This results in a fast and
-efficient FFT implementation that surpasses the performance of existing Rust FFT crates, including RustFFT.
+## Limitations
 
-## Features
+ - No runtime CPU feature detection (yet). Right now, achieving the highest performance requires compiling with `-C target-cpu=native` or using [`cargo multivers`](https://github.com/ronnychevalier/cargo-multivers).
+ - Requires a nightly Rust compiler due to the use of portable SIMD.
+
+## How is it so fast?
+
+PHFT is designed around the capabilities and limitations of modern hardware (that is, anything made in the last 10 years or so).
+
+The two major bottlenecks in FFT are **CPU cycles** and **memory accesses**.
 
-- Performance ...
-- Python bindings (via PyO3) ...
-- Safety ...
+We picked an FFT algorithm that maps well to modern CPUs. The implementation can make use of the latest CPU features such as AVX-512, but performs well even without them.
+
+Our key insight for speeding up memory accesses is that the FFT is equivalent to applying gates to all qubits in `[0, n)`.
+This creates the opportunity to leverage the same memory access patterns as a [high-performance quantum state simulator](https://github.com/QuState/spinoza).
+
+We also use the Cache-Optimal Bit Reversal Algorithm ([COBRA](https://csaws.cs.technion.ac.il/~itai/Courses/Cache/bit.pdf))
+on large datasets and optionally run it on 2 parallel threads, accelerating it even further.
+
+All of this combined results in a fast and efficient FFT implementation that surpasses the performance of existing Rust FFT crates,
+including [RustFFT](https://crates.io/crates/rustfft/), on both large and small inputs, while using significantly less memory.
 
 ## Getting Started
 
@@ -88,3 +100,10 @@ Finally, run:
 ```bash
 ./profile.sh
 ```
+
+## What's with the name?
+
+The name, **PHFT**, is derived from the implementation of the
+[Quantum Fourier Transform](https://en.wikipedia.org/wiki/Quantum_Fourier_transform) (QFT). Namely, the
+[quantum circuit implementation of QFT](https://en.wikipedia.org/wiki/Quantum_Fourier_transform#Circuit_implementation)
+consists of **P**hase gates and **H**adamard gates. Hence, **PH**ast**FT**.