G2 is a TPC (Tensor Processor Core) optimized attention library for the transformer decoder on Intel Gaudi2.
- clone the source code inside an Intel Gaudi2 docker

```bash
git clone https://github.com/ZhaiFeiyue/g2.git
```

- build

```bash
pip install .
```

  or install directly from git:

```bash
pip install git+https://github.com/ZhaiFeiyue/g2.git#egg=g2attn
```

- run the tests

```bash
cd tests
./run.sh
```

The QK BMM of the transformer decoder can be illustrated by the following picture.

The Q shape is [B, M, 1, H], the K shape is [B, M, T, H], and the output shape is [B, M, 1, T],
where:
- B is the batch size.
- M is the number of heads.
- H is the head dimension.
- T is the number of cached tokens for K.
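For reference, these shapes correspond to a plain batched matmul; below is a minimal PyTorch sketch (not the g2 kernel itself, and the sizes are only illustrative):

```python
import torch

B, M, T, H = 64, 32, 1024, 128   # batch size, heads, cached tokens, head dim

q = torch.randn(B, M, 1, H, dtype=torch.bfloat16)  # one new query token per sequence
k = torch.randn(B, M, T, H, dtype=torch.bfloat16)  # cached keys

# QK BMM: [B, M, 1, H] x [B, M, H, T] -> [B, M, 1, T]
scores = torch.matmul(q, k.transpose(-1, -2))
print(scores.shape)  # torch.Size([64, 32, 1, 1024])
```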
Intel Gaudi2 is a systolic-array-based AI accelerator with a peak throughput of roughly 410 TOPS for BF16. However, when computing the QK BMM (and likewise the ScoreV BMM) in the decode phase, the effective throughput is only about 3.2 TOPS, because Q has only one valid row, as illustrated below.
This project aims to tackle that problem by leveraging the compute throughput (TOPS) of the TPCs instead.
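A back-of-the-envelope check of that figure, assuming the MME consumes Q in tiles of 128 rows (the tile height is an assumption for illustration, not a number taken from this README):

```python
peak_bf16_tops = 410        # peak BF16 throughput quoted above, in TOPS
assumed_tile_rows = 128     # assumed MME tile height (hypothetical value)

# In the decode phase Q has only one valid row, so only 1 of the
# assumed 128 rows in each MME tile does useful work.
utilization = 1 / assumed_tile_rows
print(f"effective throughput ~ {peak_bf16_tops * utilization:.1f} TOPS")  # ~3.2
```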
QK BMM latency, TPC vs. MME:

| BS | Head | KV length | Head Dim | Dtype | TPC latency (us) | MME latency (us) |
|---|---|---|---|---|---|---|
| 64 | 32 | 128 | 128 | BF16 | 63 | 41 |
| 64 | 32 | 256 | 128 | BF16 | 62 | 77 |
| 64 | 32 | 512 | 128 | BF16 | 126 | 69 |
| 64 | 32 | 1024 | 128 | BF16 | 244 | 192 |
| 64 | 32 | 2048 | 128 | BF16 | 498 | 467 |
| 64 | 32 | 4096 | 128 | BF16 | 1007 | 935 |
| 64 | 32 | 8192 | 128 | BF16 | 2011 | 1894 |

ScoreV BMM latency, TPC vs. MME:

| BS | Head | KV length | Head Dim | Dtype | TPC latency (us) | MME latency (us) |
|---|---|---|---|---|---|---|
| 64 | 32 | 128 | 128 | BF16 | 63 | 42 |
| 64 | 32 | 256 | 128 | BF16 | 123 | 81 |
| 64 | 32 | 512 | 128 | BF16 | 247 | 160 |
| 64 | 32 | 1024 | 128 | BF16 | 482 | 311 |
| 64 | 32 | 2048 | 128 | BF16 | 981 | 639 |
| 64 | 32 | 4096 | 128 | BF16 | 1965 | 1279 |
| 64 | 32 | 8192 | 128 | BF16 | 3911 | 2539 |
Known limitations:
- releases newer than 1.12 are not supported
- performance of the ScoreV BMM is still poor
