This is the development repository for ISAAC, an input-aware auto-tuning framework and code-generator for HPC/DL. This version is only compatible with NVIDIA hardware (it generates PTX source code). For OpenCL/CUDA compatibility, visit the Intel fork ( or the v1.0 branch (deprecated) or the
ISAAC is distributed under the MIT/X11 license.
Compiling and using ISAAC requires only a proprietary NVIDIA driver. The CUDA SDK is not required, except for testing and benchmarking against cuBLAS/cuDNN.
git clone
cd isaac;
mkdir build;
cd build;
cmake ../ ; make -j8;
./examples/isaac-tools --gemm --bench --suite deepbench --dtype float32
./examples/isaac-tools --conv --bench --suite deepbench --dtype float32
The TensorFlow wrapper can be installed as follows, in an environment where TensorFlow is already present.
cd python;
python build;
python install;
You can test the installation by executing:
python ./python/examples/
What the script does is pretty straightforward:
import tensorflow as tf
import isaac as sc
# Load the ISAAC custom ops into TensorFlow
isaac = tf.load_op_library(sc.tensorflow)
This exposes isaac.conv2d and isaac.conv3d. You can use them like you'd use tf.nn.conv2d and tf.nn.conv3d.
If you don't want to use TensorFlow, you can use the Python bindings directly. See the "tune/" folder for an example.
Basic benchmarks for GEMM and CONV for DeepBench can be obtained using the isaac-tools binary interface:
Note that only float32 and float64 are supported at the moment.
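To run both benchmarks for both supported data types from a script, the benchmark commands shown in this README can be assembled programmatically. The following is a minimal sketch, not part of ISAAC: the helper names `bench_command` and `run_all` and the default binary path are illustrative assumptions, while the flags are the ones used in this README.

```python
import subprocess
from itertools import product

SUPPORTED_DTYPES = ("float32", "float64")  # the only dtypes listed as supported
OPERATIONS = ("gemm", "conv")


def bench_command(op, dtype, binary="./examples/isaac-tools", suite="deepbench"):
    """Build the argv list for one isaac-tools benchmark run (hypothetical helper)."""
    if op not in OPERATIONS:
        raise ValueError(f"unknown operation: {op!r}")
    if dtype not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    return [binary, f"--{op}", "--bench", "--suite", suite, "--dtype", dtype]


def run_all(dry_run=True):
    """Print every op/dtype combination and, unless dry_run is set, execute it."""
    commands = [bench_command(op, dt)
                for op, dt in product(OPERATIONS, SUPPORTED_DTYPES)]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Assumes the current directory is the build directory.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_all()` only prints the four command lines; pass `dry_run=False` from the build directory to actually execute them.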
If you want, you can also dump the PTX source code generated by ISAAC for some shapes:
./examples/isaac-tools --gemm --dump --format ptx --shape 2048,2048,2048 --layout NT --dtype float32
If you really know what you're doing, you can also capture the tiling parameters found by ISAAC:
./examples/isaac-tools --gemm --dump --format params --shape 2048,2048,2048 --layout NT --dtype float32
You will get the following output:
Tuning parameters: 4, 16, 8, 8, 8, 8, 16, 8, 16, 8, 1, 1, 1
The parameters mean, in order: (1) shared memory loads have a width of 4; (2) each block comprises 16x8 threads; (3) each thread computes a tile of 8x8 elements; (4) each loop iteration processes 8 elements along the K axis; (5) threads are rearranged as a 16x8 block for loading A and a 16x8 block for loading B; (6) the reduction is split across 1, 1 and 1 independent batches within each thread, thread-block and grid, and the results are accumulated after the inner loop.
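For readers who want to inspect these values in a script, here is a minimal sketch that splits the output line into named groups. It assumes the 13 values appear in exactly the order described above; the class `GemmTuning`, its field names and `parse_tuning` are illustrative and are not ISAAC identifiers.

```python
from typing import NamedTuple, Tuple


class GemmTuning(NamedTuple):
    """Named view of the 13 tuning values (field names are hypothetical)."""
    load_width: int                        # (1) shared memory load width
    block_threads: Tuple[int, int]         # (2) threads per block
    thread_tile: Tuple[int, int]           # (3) elements computed per thread
    k_step: int                            # (4) K elements per loop iteration
    load_block_a: Tuple[int, int]          # (5) thread arrangement for loading A
    load_block_b: Tuple[int, int]          # (5) thread arrangement for loading B
    reduction_split: Tuple[int, int, int]  # (6) thread, thread-block, grid

    @property
    def threads_per_block(self):
        return self.block_threads[0] * self.block_threads[1]


def parse_tuning(line):
    """Parse a 'Tuning parameters: ...' line, assuming the order given above."""
    prefix = "Tuning parameters:"
    if not line.startswith(prefix):
        raise ValueError("not a tuning-parameter line")
    v = [int(x) for x in line[len(prefix):].split(",")]
    if len(v) != 13:
        raise ValueError(f"expected 13 values, got {len(v)}")
    return GemmTuning(v[0], (v[1], v[2]), (v[3], v[4]), v[5],
                      (v[6], v[7]), (v[8], v[9]), (v[10], v[11], v[12]))


params = parse_tuning("Tuning parameters: 4, 16, 8, 8, 8, 8, 16, 8, 16, 8, 1, 1, 1")
```

With the example output, `params.block_threads` is `(16, 8)` and `params.threads_per_block` is 128.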
ISAAC often provides
Tesla P100 - SGEMM:
I consider both GEMM and CONV production-ready. Kernel selection is performed for each new shape, and the best kernel is cached in RAM. I wouldn't advise this library for applications that use thousands of different shapes exactly once each (e.g., Blocked SVD).
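The trade-off can be illustrated with a small sketch of a shape-keyed, in-memory cache. This is not ISAAC's implementation: `KernelCache` and `fake_tune` are hypothetical stand-ins that only show why repeated shapes amortize the selection cost while one-off shapes do not.

```python
class KernelCache:
    """Toy in-memory cache: one kernel selection per distinct shape."""

    def __init__(self, tune):
        self._tune = tune        # callable that selects a kernel for a shape
        self._best = {}          # shape -> selected kernel, kept in RAM only
        self.tuning_runs = 0     # how many times selection actually ran

    def get(self, shape):
        key = tuple(shape)
        if key not in self._best:
            self._best[key] = self._tune(key)
            self.tuning_runs += 1
        return self._best[key]


def fake_tune(shape):
    """Stand-in for kernel selection; real selection is far more costly."""
    return "kernel_" + "x".join(str(d) for d in shape)


# Same shape reused many times: selection runs once.
reused = KernelCache(fake_tune)
for _ in range(1000):
    reused.get((2048, 2048, 2048))

# A thousand shapes used once each: selection runs every time.
one_off = KernelCache(fake_tune)
for n in range(1, 1001):
    one_off.get((n, n, n))
```

In the first loop `reused.tuning_runs` ends at 1, while in the second `one_off.tuning_runs` ends at 1000, which is the pattern the advice above warns against.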
This work was partially supported by the National Science Foundation (IIS 1409097) and by IARPA (contract D16PC00002).