multi-query-attention

Here are 3 public repositories matching this topic...

knotgrass / attention

Several types of attention modules written in PyTorch for learning purposes

transformers pytorch transformer attention attention-mechanism softmax-layer multi-head-attention multi-query-attention grouped-query-attention scale-dot-product-attention

Updated Oct 1, 2024
Python

M-e-r-c-u-r-y / pytorch-transformers

Collection of different types of transformers for learning purposes

transformers pytorch multi-head-attention einsum-notation multi-query-attention

Updated Jan 30, 2020
Jupyter Notebook

CUDA implementation of Multi-Query Attention achieving 97% KV-cache memory reduction for LLM inference, enabling 32x larger batch sizes. Educational project demonstrating CUDA kernel development with PyTorch integration and Llama model benchmarks.

cuda attention-mechanism gpu-programming multi-query-attention llm-inference

Updated Sep 10, 2025
Python
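
The 97% and 32x figures in the last description are consistent with a model that has 32 attention heads: multi-query attention keeps one shared key/value head instead of 32, so the KV cache shrinks by 1 - 1/32, or about 96.9%. The sketch below works through that arithmetic. The shape values (32 layers, 32 heads, head dimension 128, fp16, 2048 tokens) are assumptions chosen to resemble a 7B Llama-style model and are not taken from the repository; `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer (factor 2 = K plus V)."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed shape, roughly Llama-7B-like; fp16 gives 2 bytes per element.
cfg = dict(n_layers=32, head_dim=128, seq_len=2048, batch_size=1)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # multi-head: one K/V head per query head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # multi-query: a single shared K/V head

reduction = 1 - mqa / mha
print(f"MHA cache: {mha / 2**20:.0f} MiB, MQA cache: {mqa / 2**20:.0f} MiB")
print(f"reduction: {reduction:.1%}, cache ratio: {mha // mqa}x")
```

The 32x ratio translates into 32x larger batches only when the KV cache is the limiting factor; model weights and activations also occupy GPU memory, so the practical gain may be smaller.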
