July 2024: To tile or not to tile

The common wisdom for efficient matrix multiplications is to use block tiling, and this is also used here for fp16/fp32 matrices. But block tiling does not somehow magically reduce the amount of computation that needs to get done; the performance gains come simply from better utilization of memory caches.

When dealing with quantized matrix multiplications, an additional factor comes into play: the quantized data needs to be unpacked to 8-bit integers before it can be used in the multiply-add operations. Depending on the quantization type, this unpacking can represent a significant fraction of the overall computation cost. Hence, for best performance, one wants to reuse the unpacked quants as much as possible, which means spending some fraction of the available vector registers on holding the unpacked data. But with block tiling one also needs a certain number of vector registers for accumulating results. For instance, on AVX2 (16 vector registers available), the best performance for fp16/fp32 models is achieved with 2 x 6 tiles (where the 2 refers to rows in the left matrix and is measured in units of the vector register size, i.e., 16/8 floats for fp16/fp32, and the 6 is the number of columns in the right matrix). Unpacking quantized data works best when done in blocks of 128 or 256 quants, so if we wanted to keep unpacked quants for 2 rows, we would need at least 8 vector registers, leaving fewer than 8 registers for result accumulation, i.e., at best 2 x 3 tiles. In practice one also needs additional vector registers for the various constants typically required for de-quantization, so in the end it becomes better to use 1 x N "tiles", i.e., a row-wise multiplication where each row of the left matrix is multiplied with N columns of the right matrix, thus reusing the unpacked data N times. This amortization of the de-quantization cost is the main mechanism for speeding up quantized matrix multiplications.

Having started with quantized matrices, and having gone from tiles to a row-wise implementation after some experimentation, I also tried row-wise multiplication for float matrices first. Performance was not quite as good as with block tiling, but I did get up to 90-95% of the speed of tinyBLAS that way, before switching the fp16/fp32 implementation to 2 x 6 (AVX2) or 5 x 5 (AVX512 and ARM_NEON) block tiles. But even for Q8_0 x Q8_0 multiplications, where there is basically no de-quantization cost, row-wise multiplication is faster than tiling (and hence this implementation beats tinyBLAS, which uses block tiling also for Q8_0).
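
To make the structure of the 1 x N approach concrete, here is a minimal scalar sketch. It is not the actual (vectorized) implementation; the block layout is a simplified Q8_0-like format, the right matrix is kept in plain floats, and all names (`BlockQ8`, `mul_row_times_N_cols`, etc.) are illustrative. The only point it demonstrates is the one made above: each block of the left-matrix row is unpacked once and then reused for all N columns, so the de-quantization cost is amortized N times.

```cpp
#include <cstdint>

constexpr int kBlockSize = 32;   // quants per block (as in Q8_0)

struct BlockQ8 {                 // simplified Q8_0-like block: one scale + 32 int8 quants
    float  scale;
    int8_t qs[kBlockSize];
};

// C[0..N-1] += (one quantized row of A, nBlocks blocks long) . (N columns of B)
// B is stored column-major here for simplicity; each column holds nBlocks*kBlockSize floats.
template <int N>
void mul_row_times_N_cols(const BlockQ8 * aRow, int nBlocks,
                          const float * B, int bStride, float * C) {
    float acc[N] = {};                       // the 1 x N accumulator "tile"
    float unpacked[kBlockSize];              // holds the de-quantized block
    for (int ib = 0; ib < nBlocks; ++ib) {
        // Unpack this block of the left row once ...
        for (int j = 0; j < kBlockSize; ++j)
            unpacked[j] = aRow[ib].scale * aRow[ib].qs[j];
        // ... and reuse it for all N right-matrix columns.
        for (int col = 0; col < N; ++col) {
            const float * b = B + col*bStride + ib*kBlockSize;
            for (int j = 0; j < kBlockSize; ++j)
                acc[col] += unpacked[j] * b[j];
        }
    }
    for (int col = 0; col < N; ++col) C[col] += acc[col];
}
```

In the real SIMD kernels the `unpacked` buffer lives in vector registers and `acc` is the 1 x N accumulator row, which is exactly the register-budget trade-off discussed above: the registers freed by not holding a second row of unpacked quants (as a 2 x N tile would require) go towards a larger N, i.e., more reuse per unpacking.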