SLOTHY - Super (Lazy) Optimization of Tricky Handwritten assemblY - is an assembly-level superoptimizer for:
- Instruction scheduling
- Register allocation
- Software pipelining (= periodic loop interleaving)
SLOTHY is generic in the target architecture and microarchitecture. This repository provides instantiations for:
- Armv8.1-M+Helium: Cortex-M55, Cortex-M85
- AArch64: Cortex-A55, and experimentally Cortex-A72, Cortex-X/Neoverse-V, Apple M1 (Firestorm, Icestorm)
SLOTHY is discussed in Fast and Clean: Auditable high-performance assembly via constraint solving.
SLOTHY enables a development workflow where developers write 'clean' assembly by hand, emphasizing the logic of the computation, while SLOTHY automates microarchitecture-specific micro-optimizations. This accelerates development, keeps manually written code artifacts maintainable, and allows to split efforts for formal verification into the separate verification of the clean code and the micro-optimizations.
SLOTHY is essentially a constraint solver frontend: It converts the input source into a data flow graph and builds a constraint model capturing valid instruction schedulings, register renamings, and periodic loop interleavings. The model is passed to an external constraint solver and, upon success, a satisfying assignment converted back into the final code. Currently, SLOTHY uses Google OR-Tools as its constraint solver backend.
As a rough rule of thumb, SLOTHY typically optimizes workloads of <50 instructions in seconds to minutes, workloads up to 150 instructions in minutes to hours, while for larger kernels some heuristics are necessary.
SLOTHY has been used to provide the fastest known implementations of various cryptographic and DSP primitives:
For example, the SLOTHY paper discusses the NTTs underlying ML-KEM and ML-DSA for
Cortex-{A55, A72, M55, M85}, the FFT for Cortex-{M55,M85}, and the X25519 scalar multiplication for Cortex-A55. You find
the clean and optimized source code for those examples in paper/
.
Have a look at the SLOTHY tutorial for a hands-on and example-based introduction to SLOTHY.
-
Arm EndpointAI: SLOTHY-optimized code has been deployed to the CMSIS DSP Library for the radix-4 CFFT routines as part of the Arm EndpointAI project in this commit.
-
AWS libcrypto (AWS-LC): SLOTHY-optimized X25519 code based on our un-interleaved form of the original code by Emil Lenngren has been formally verified and included in s2n-bignum and merged into AWS-LC.
SLOTHY has been successfully used on
- Ubuntu-21.10 and up (64-bit),
- macOS Monterey 12.6 and up.
SLOTHY requires Python >= 3.10. See requirements.txt for package requirements, and install via pip install -r requirements.txt
.
Note: requirements.txt
pins versions for reproducibility. If you already have newer versions of some dependencies
installed and don't want them downgraded, consider using a virtual environment:
python3 -m venv venv
./venv/bin/python3 -m pip install -r requirements.txt
Then, enter the virtual environment via source venv/bin/activate
prior to running SLOTHY.
A dockerfile for an Ubuntu-22.04 based Docker image with all dependencies of SLOTHY and the PQMX+PQAX test environments setup can be found in paper/artifact/slothy.dockerfile. See paper/artifact/README.md for instructions.
To check that your setup is complete, try the following from the base directory:
% python3 example.py --examples aarch64_simple0_a55
You should see something like the following:
* Example: aarch64_simple0_a55...
INFO:aarch64_simple0_a55:Instructions in body: 20
INFO:aarch64_simple0_a55.slothy:Perform internal binary search for minimal number of stalls...
INFO:aarch64_simple0_a55.slothy:Attempt optimization with max 32 stalls...
INFO:aarch64_simple0_a55.slothy:Objective: minimize number of stalls
INFO:aarch64_simple0_a55.slothy:Invoking external constraint solver (OR-Tools CP-SAT v9.7.2996) ...
INFO:aarch64_simple0_a55.slothy:[0.0721s]: Found 1 solutions so far... objective 19.0, bound 8.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:[0.0765s]: Found 2 solutions so far... objective 18.0, bound 12.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:OPTIMAL, wall time: 0.155224 s
INFO:aarch64_simple0_a55.slothy:Booleans in result: 509
INFO:aarch64_simple0_a55.slothy.selfcheck:OK!
INFO:aarch64_simple0_a55.slothy:Minimum number of stalls: 18
The SLOTHY Tutorial and the examples directory contain numerous exemplary
assembly snippets. To try them, use python3 example.py --examples={YOUR_EXAMPLE}
. See python3 example.py --help
for
the list of all available examples.
The use of SLOTHY from the command line is illustrated in scripts/ supporting the real-world optimizations for the NTT, FFT and X25519 discussed in Fast and Clean: Auditable high-performance assembly via constraint solving.