
Conversation


@vraspar vraspar commented Dec 1, 2025

Description

This PR introduces a new experimental lookup-table (LUT) based matrix multiplication method, inspired by the T-MAC paper and the T-MAC repository, to speed up low-bit LLM inference.

Unlike the existing quantize-dequantize methods, the LUT-based method supports mixed-precision GEMM directly, without dequantization. It uses bit-wise table lookups to eliminate multiplications and reduce the number of additions required in matrix multiplication.
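The bit-wise lookup idea can be illustrated with a small NumPy sketch (an illustration of the T-MAC-style scheme with hypothetical names, not the actual MLAS kernel): decompose the 2-bit weight matrix into two binary bit planes, precompute per-group partial sums of the activations for every possible bit pattern, then replace multiply-accumulate with gather-accumulate.

```python
import numpy as np

def lut_gemv_2bit(W, x, g=4):
    """Compute W @ x for a 2-bit weight matrix (integer values 0..3)
    using bit-plane decomposition and table lookup instead of
    multiplications. Illustrative sketch only; scales/zero-points
    and tiling are omitted."""
    M, K = W.shape
    assert K % g == 0
    # Split W into two binary bit planes so that W = 2*B1 + B0.
    B0 = W & 1
    B1 = (W >> 1) & 1
    # Precompute, per group of g activations, the sum of every subset:
    # table[c, idx] = sum of x-elements in group c whose bit is set in idx.
    x_groups = x.reshape(K // g, g)
    masks = np.array([[(idx >> b) & 1 for b in range(g)]
                      for idx in range(1 << g)])
    table = x_groups @ masks.T          # shape (K // g, 2**g)
    out = np.zeros(M, dtype=x.dtype)
    for plane, weight in ((B0, 1.0), (B1, 2.0)):
        bits = plane.reshape(M, K // g, g)
        # Pack each group of g weight bits into a table index.
        idx = (bits * (1 << np.arange(g))).sum(axis=2)   # (M, K // g)
        # Gather-accumulate: only table lookups and additions remain.
        out += weight * table[np.arange(K // g), idx].sum(axis=1)
    return out
```

Real kernels tile this and implement the per-group lookup with SIMD byte-shuffle instructions (e.g. `vpshufb` on AVX2), which is what makes the approach fast in practice.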

This PR:

  • Adds the mlas.use_lut_gemm session option, which lets MatMulNBits use LUT GEMM when it is available
  • Adds an initial AVX2 kernel for 2-bit weights

How to test

Perf

Future Work

  • Support MLFloat16
  • Add a NEON kernel
  • Add kernels for 4-bit weights and a BitNet kernel

liqunfu and others added 30 commits January 29, 2025 19:11
Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
…as kernel not implemented for fp32. Also, I need to write the packing logic for the scales as well.
…ssert issue with the data shuffling in prepack

5 participants