I'm obsessed with one thing: making LLMs ridiculously fast.
Every wasted microsecond on a GPU is a personal offense to me. I work at the intersection of LLM inference systems, GPU kernel wizardry, and AI compilers, turning "that's theoretically possible" into shipped code.
Most inference engines optimize for throughput. We chose the harder problem: per-request latency.
TileRT is a tile-based runtime built for scenarios where every millisecond counts: AI-assisted coding, real-time conversation, high-frequency decision making. No batching tricks, no latency hiding. Just raw speed.
- 600 tok/s on DeepSeek-V3.2 | 500 tok/s on GLM-5-FP8
- Multi-Token Prediction: why generate one token when you can do three?
- Compiler-driven tile-level scheduling with dynamic rescheduling across devices
- `pip install tilert` | Try it live at tilert.ai
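To make the Multi-Token Prediction bullet concrete, here is a toy sketch of the acceptance rule that multi-token (speculative-style) decoding schemes generally rely on: propose several tokens at once, keep the longest prefix the main model agrees with, and take the model's own token at the first disagreement. This is a generic illustration, not TileRT's actual implementation; the function name and list-based interface are made up for the example.

```python
def accept_draft(draft_tokens, target_preds):
    """Toy acceptance rule for multi-token prediction.

    draft_tokens: tokens proposed ahead of time (e.g. by an MTP head).
    target_preds: the main model's own next-token choices at each position.

    Accept draft tokens while they match the main model; at the first
    mismatch, emit the main model's token instead and stop. Every accepted
    token saves one full sequential decode step.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, target_preds):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)  # correction replaces the rejected draft
            break
    return accepted


# If the draft is fully correct, all three tokens land in one step:
print(accept_draft([7, 4, 2], [7, 4, 2]))  # [7, 4, 2]
# A mismatch at position 2 still yields two useful tokens:
print(accept_draft([7, 4, 2], [7, 9, 2]))  # [7, 9]
```

The win is that verification of all drafted positions happens in one forward pass, so accepted tokens are nearly free compared to decoding them one by one.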
TileRT doesn't exist in a vacuum. It's part of tile-ai, a full stack we're building from scratch around one simple idea: tiles are the right abstraction for AI compute.
| Project | What it does |
|---|---|
| tilelang | The language. Write tile programs, get optimized GPU kernels. Simple as that. |
| TileScale | The scale-out. Multi-GPU, multi-node: one mega-device, zero headaches. |
| TileOPs | The operators. FlashAttention, MLA, DSA: battle-tested, auto-tuned. |
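For readers unfamiliar with why "tiles are the right abstraction," here is a minimal NumPy sketch of the core idea: decompose a large matmul into small sub-block (tile) operations that fit in fast memory and can be scheduled independently. This illustrates the general blocked-computation pattern, not tilelang's syntax or TileRT's scheduler; the tile size and function are assumptions for the example.

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    """Blocked matrix multiply: C = A @ B computed tile by tile.

    Each (i, j) output tile is accumulated from tile-sized chunks of A and B.
    On a GPU, each such tile update maps naturally onto one thread block
    working out of shared memory, which is the intuition behind tile-level
    programming and scheduling. Assumes square matrices divisible by `tile`.
    """
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):          # rows of output tiles
        for j in range(0, n, tile):      # cols of output tiles
            for k in range(0, n, tile):  # reduction over inner tiles
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C


A = np.arange(16, dtype=np.float64).reshape(4, 4)
B = np.eye(4)
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```

Because each output tile depends only on a row of A-tiles and a column of B-tiles, a compiler is free to reorder, fuse, or distribute tile updates across devices, which is exactly the degree of freedom a tile-level scheduler exploits.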
- NNFusion: A DNN compiler that turns model descriptions into framework-free, high-performance executables. Built at Microsoft Research. We were doing AI compilers before it was cool. 1000+ stars.
CUDA · C++ · Python · PyTorch · CUTLASS
Building something latency-critical? Want to push LLM inference to the edge? Let's talk.
- Email: xiayuqing0622@outlook.com | xiayq001@gmail.com
- GitHub: @xiayuqing0622