Commit

Copied some BMAD documentation to this branch.
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
JackAKirk committed Jan 21, 2024
1 parent 42e2b17 commit 66968d4
Showing 1 changed file with 74 additions and 0 deletions.
74 changes: 74 additions & 0 deletions sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc
@@ -164,6 +164,80 @@ namespace sycl::ext::oneapi::experimental::matrix {
```
The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result.

### Bitwise Multiply and Add - `joint_matrix_bmad` (Nvidia® only)

```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename Group, std::size_t M, std::size_t K, std::size_t N,
          matrix_layout LayoutC, class BinaryOperation>
joint_matrix<int32_t, matrix_use::accumulator, M, N, LayoutC, Group>
joint_matrix_bmad(
    Group sg,
    joint_matrix<uint32_t, matrix_use::a, M, K, matrix_layout::row_major, Group> A,
    joint_matrix<uint32_t, matrix_use::b, K, N, matrix_layout::col_major, Group> B,
    joint_matrix<int32_t, matrix_use::accumulator, M, N, LayoutC, Group> C,
    BinaryOperation Op);
}
```

Bitwise Multiply and Add (BMAD) operations replace the usual dot product of a row of matrix A (M by K) with a column of matrix B (K by N), for which the programmer could construct e.g. a standard C++ (M by K) array of a specified type T to represent matrix A. Instead, a sequence of logical operations is performed: the AND or XOR logical operation combines the ith bit of a (K * 32)-bit row of matrix A with the ith bit of a (K * 32)-bit column of matrix B, producing a 128-bit intermediate output.
The Population Count (popc) operator then operates on this intermediate output, and the result is added to the (M, N)th element of the accumulator matrix C. Currently only the shape M = 8, N = 8, K = 4 is supported (K = 4 corresponds to 128 single-bit matrix elements).
An important difference with respect to the `joint_matrix_mad` interface is the addition of the `BinaryOperation Op` parameter; a scalar sketch of the per-element computation is given below, after the list of supported operators. `Op` may be either:

`sycl::bit_and<uint32_t>()`

or

`sycl::bit_xor<uint32_t>()`
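
The following is a minimal, illustrative host-side sketch of the computation performed for a single accumulator element with the supported shape M = 8, N = 8, K = 4. The helper `bmad_reference` and its argument names are hypothetical and are not part of this extension; the sketch only shows how the AND/XOR, popc, and accumulate steps combine.

```c++
#include <bitset>
#include <cstdint>

// Hypothetical scalar reference for one accumulator element C[m][n].
// a_row: K = 4 packed 32-bit words holding the 128 single-bit elements of one row of A.
// b_col: K = 4 packed 32-bit words holding the 128 single-bit elements of one column of B.
// op:    either sycl::bit_and<uint32_t>() or sycl::bit_xor<uint32_t>().
template <class BinaryOperation>
int32_t bmad_reference(const uint32_t (&a_row)[4], const uint32_t (&b_col)[4],
                       int32_t c, BinaryOperation op) {
  for (int k = 0; k < 4; ++k) {
    // Combine corresponding bits, then count the set bits of the result.
    c += static_cast<int32_t>(std::bitset<32>(op(a_row[k], b_col[k])).count());
  }
  return c;
}
```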

The A, B, and C `joint_matrix` objects are constructed and loaded/stored in the normal way, using the previously defined `joint_matrix`, `joint_matrix_load`, and `joint_matrix_store` interfaces respectively.
The C matrix must be loaded from an array of signed 32-bit integers, and the single-bit A and B matrices must be loaded from arrays of unsigned 32-bit integers.
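
As an illustration of this storage, the following sketch packs 32 single-bit values into one `uint32_t` word. The helper name and the bit ordering (element `i` in bit `i`) are assumptions made for this example only; what matters for the AND/XOR and popc computation is that the A and B matrices use the same element-to-bit convention.

```c++
#include <cstdint>

// Hypothetical helper: packs 32 single-bit values (each 0 or 1) into one uint32_t word.
// The bit ordering (element i -> bit i) is an assumption for this sketch only.
inline uint32_t pack_bits(const uint8_t (&bits)[32]) {
  uint32_t word = 0;
  for (int i = 0; i < 32; ++i)
    word |= static_cast<uint32_t>(bits[i] & 1u) << i;
  return word;
}
```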

IMPORTANT: When using Bitwise Multiply and Add, joint_matrix A must be in row-major layout and joint_matrix B must be in column-major layout.

IMPORTANT: Bitwise Multiply and Add operations are an experimental hardware feature and all implementation details are subject to change.

#### Motivation for BMAD

Single-bit MADs can be used as part of Binarized Neural Networks (BNNs) in the case that *both* the activations *and* weights are binarized. "Quantizing" a network to form a BNN represents the extreme limit of reducing the precision of the network degrees of freedom in order to gain performance and improve efficiency.
Hubara et al. (I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks, Advances in Neural Information Processing Systems 29 (NIPS 2016)) first demonstrated the utility of an algorithm that could use both binarized activations and weights with backpropagation, by keeping track of real valued weights which are mapped to the binarized weights. In the backwards pass the real valued weights are updated according to a heuristic named the "Straight Through Estimator", whereby the gradient of the loss function with respect to the real weights is set equal to the gradient of the loss function with respect to the binarized weights.
This implies that the precision of the data type used in the matrix multiplications can be single bit, with the necessary addition of forward and backward element wise mappings between binarized and real valued representations of the matrices.
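
A common formulation of this heuristic, shown here only for illustration and not defined by this extension, uses real-valued weights w, binarized weights w_b, and loss L:

```latex
w_b = \operatorname{sign}(w) \quad \text{(forward)}, \qquad
\frac{\partial L}{\partial w} := \frac{\partial L}{\partial w_b} \quad \text{(backward)}
```
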
This could prove a significant advantage for large models, since the computational cost of matrix multiplication scales with the number of elements per dimension, N, as O(N^3) for square matrices, whereas the corresponding element-wise operations scale as O(N^2).
Further algorithms based on this binarized approach have been proposed; see e.g. Rastegari et al. (M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Computer Vision – ECCV 2016, 525-542), who compare a binarized version of a CNN (using an XNOR binary dot product) with corresponding full-precision models, in terms of both the accuracy and the performance of image classification on the ImageNet data set.
The outlook for BNNs appears to be that they may be useful for edge devices with memory/compute constraints, targeting applications for which the drop in accuracy is acceptable.

For an example of how bitwise MADs can be leveraged on current Nvidia® hardware, see A. Li and S. Su. IEEE Transactions on Parallel and Distributed Systems, 32(7):1878-1891, 2021.

#### Example using bitwise operations with `joint_matrix_bmad`

```c++
using namespace sycl::ext::oneapi::experimental::matrix;

// Assumed context for this example: M = 8, K = 4, N = 8 (the supported shape);
// Big_M = Sub_Tiles_M * M, Big_K = Sub_Tiles_K * K, and Big_N = Sub_Tiles_N * N are the
// dimensions of the full matrices; Op is sycl::bit_and<uint32_t>() or sycl::bit_xor<uint32_t>().
queue q;
q.submit([&](handler &cgh) {
   auto accC = bufC.template get_access<access::mode::read_write>(cgh);
   auto accA = bufA.template get_access<access::mode::read_write>(cgh);
   auto accB = bufB.template get_access<access::mode::read_write>(cgh);
   auto accD = bufD.template get_access<access::mode::read_write>(cgh);

   range<2> LocalRange = {1, N_THREADS_PER_MATRIX_OP};
   range<2> GlobalRange = {Sub_Tiles_M, Sub_Tiles_N * N_THREADS_PER_MATRIX_OP};

   cgh.parallel_for<KernelName<M, K, N, BinaryOperation>>(
       nd_range<2>(GlobalRange, LocalRange), [=](nd_item<2> item) {
         sycl::sub_group sg = item.get_sub_group();
         // Row and column ids of the current sub-matrix of the big C matrix.
         const auto m = item.get_group().get_id()[0];
         const auto n = item.get_group().get_id()[1];

         joint_matrix<uint32_t, matrix_use::a, 8, 4, matrix_layout::row_major> sub_a;
         joint_matrix<uint32_t, matrix_use::b, 4, 8, matrix_layout::col_major> sub_b;
         joint_matrix<int32_t, matrix_use::accumulator, 8, 8, matrix_layout::row_major> sub_c;

         joint_matrix_load(sg, sub_c, accC.get_pointer() + (m * M) * Big_N + n * N, Big_N);
         // k is the row/column id of the current sub-matrix of the big A/B matrices.
         for (int k = 0; k < Sub_Tiles_K; k++) {
           joint_matrix_load(sg, sub_a, accA.get_pointer() + (k * K) + (m * M * Big_K), Big_K);
           joint_matrix_load(sg, sub_b, accB.get_pointer() + (n * N * Big_K) + (k * K), Big_K);
           sub_c = joint_matrix_bmad(sg, sub_a, sub_b, sub_c, Op);
         }
         joint_matrix_store(sg, sub_c, accD.get_pointer() + (m * M) * Big_N + n * N, Big_N);
       });
 }).wait();
```
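
For completeness, a host-side check of the result computed by the example above might look as follows. This is only an illustrative sketch: the helper name `check_bmad_result` is hypothetical, `A`, `B`, `C`, and `D` stand for the host arrays backing `bufA`, `bufB`, `bufC`, and `bufD`, A is assumed row-major and B column-major as required above, and `Op` is the same function object passed to the kernel.

```c++
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side check: D should equal C plus the bitwise matrix product of A and B.
// A is row-major and B is column-major; each row of A / column of B is Big_K packed 32-bit words.
template <class BinaryOperation>
void check_bmad_result(const uint32_t *A, const uint32_t *B, const int32_t *C,
                       const int32_t *D, std::size_t Big_M, std::size_t Big_N,
                       std::size_t Big_K, BinaryOperation Op) {
  for (std::size_t m = 0; m < Big_M; ++m) {
    for (std::size_t n = 0; n < Big_N; ++n) {
      int32_t acc = C[m * Big_N + n];
      for (std::size_t k = 0; k < Big_K; ++k) {
        // Combine the k-th packed word of A's row m with that of B's column n, then popcount.
        acc += static_cast<int32_t>(
            std::bitset<32>(Op(A[m * Big_K + k], B[n * Big_K + k])).count());
      }
      assert(acc == D[m * Big_N + n]);
    }
  }
}
```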

## VNNI/Packed Layout
Intel AMX and DPAS compute assume the register for the B tile (src1) to be in VNNI format, as they need 32 bits of K-data in A and B to be contiguous in memory.
