Decoding Attention is specially optimized for multi-head attention (MHA), using CUDA cores for the decoding stage of LLM inference.
gpu cuda inference nvidia mha multi-head-attention llm large-language-model flash-attention cuda-core decoding-attention flashinfer
Updated Feb 26, 2025 · C++
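The sketch below illustrates what decoding-stage MHA computes: a single query token per head attending to a cached K/V of length `seq_len`. It is not the Decoding Attention kernel itself; the kernel name, memory layout, and launch parameters are illustrative assumptions only.

```cuda
// Minimal sketch of decoding-stage multi-head attention on CUDA cores.
// Assumed layouts: q[num_heads][head_dim], k/v[num_heads][seq_len][head_dim],
// out[num_heads][head_dim]. One block per head.
#include <cuda_runtime.h>
#include <math.h>

__global__ void single_query_mha_kernel(const float* __restrict__ q,
                                        const float* __restrict__ k,
                                        const float* __restrict__ v,
                                        float* __restrict__ out,
                                        int seq_len, int head_dim) {
    extern __shared__ float scores[];            // one score per key position
    const int head = blockIdx.x;
    const float scale = rsqrtf((float)head_dim);

    const float* qh = q + (size_t)head * head_dim;
    const float* kh = k + (size_t)head * seq_len * head_dim;
    const float* vh = v + (size_t)head * seq_len * head_dim;

    // 1) Scaled dot-product scores: each thread handles a strided set of keys.
    for (int pos = threadIdx.x; pos < seq_len; pos += blockDim.x) {
        float dot = 0.f;
        for (int d = 0; d < head_dim; ++d)
            dot += qh[d] * kh[pos * head_dim + d];
        scores[pos] = dot * scale;
    }
    __syncthreads();

    // 2) Softmax statistics (computed redundantly per thread for brevity).
    float max_s = -INFINITY;
    for (int pos = 0; pos < seq_len; ++pos) max_s = fmaxf(max_s, scores[pos]);
    float sum = 0.f;
    for (int pos = 0; pos < seq_len; ++pos) sum += expf(scores[pos] - max_s);

    // 3) Weighted sum of values: each thread handles a strided set of dims.
    for (int d = threadIdx.x; d < head_dim; d += blockDim.x) {
        float acc = 0.f;
        for (int pos = 0; pos < seq_len; ++pos)
            acc += (expf(scores[pos] - max_s) / sum) * vh[pos * head_dim + d];
        out[head * head_dim + d] = acc;
    }
}

// Example launch (assumed shapes): one block per head, dynamic shared memory
// sized to hold seq_len scores.
// single_query_mha_kernel<<<num_heads, 128, seq_len * sizeof(float)>>>(
//     d_q, d_k, d_v, d_out, seq_len, head_dim);
```

Because the decoding step processes only one query token, the work is dominated by reading the K/V cache rather than by matrix multiplication, which is why CUDA cores (rather than tensor cores) are a reasonable fit for this stage.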