
# Hey, I'm Yuqing Xia 👋

I'm obsessed with one thing: making LLMs ridiculously fast.

Every wasted microsecond on the GPU is a personal offense to me. I work at the intersection of LLM inference systems, GPU kernel wizardry, and AI compilers — turning "that's theoretically possible" into shipped code.


## 🔥 TileRT — LLM Inference, Absurdly Fast


Most inference engines optimize for throughput. We chose the harder problem: per-request latency.

TileRT is a tile-based runtime built for scenarios where every millisecond counts — AI-assisted coding, real-time conversation, high-frequency decision making. No batching tricks, no latency hiding. Just raw speed.

- ⚡ 600 tok/s on DeepSeek-V3.2 | 500 tok/s on GLM-5-FP8
- 🧠 Multi-Token Prediction — why generate one token when you can do three?
- 🧩 Compiler-driven tile-level scheduling with dynamic rescheduling across devices
- 🚀 `pip install tilert` | Try it live at tilert.ai
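To put those throughput numbers in latency terms (simple arithmetic, assuming a single request decoding tokens strictly one step at a time):

```python
# Per-token decode latency implied by single-stream throughput.
# Assumes sequential generation: latency (ms) = 1000 / tokens-per-second.
def ms_per_token(tokens_per_second):
    return 1000.0 / tokens_per_second

print(round(ms_per_token(600), 2))  # DeepSeek-V3.2 at 600 tok/s -> 1.67 ms/token
print(round(ms_per_token(500), 2))  # GLM-5-FP8 at 500 tok/s -> 2.0 ms/token
```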

## 🧱 The tile-ai Ecosystem

TileRT doesn't exist in a vacuum. It's part of tile-ai — a full stack we're building from scratch around one simple idea: tiles are the right abstraction for AI compute.

| Project | What it does |
| --- | --- |
| 🗣️ tilelang | The language. Write tile programs, get optimized GPU kernels. Simple as that. |
| 🌐 TileScale | The scale-out. Multi-GPU, multi-node — one mega-device, zero headaches. |
| ⚙️ TileOPs | The operators. FlashAttention, MLA, DSA — battle-tested, auto-tuned. |

πŸ›οΈ Previously

- NNFusion — A DNN compiler that turns model descriptions into framework-free, high-performance executables. Built at Microsoft Research. We were doing AI compilers before it was cool. ⭐ 1000+

πŸ› οΈ Tech Stack

CUDA · C++ · Python · PyTorch · CUTLASS


## 📫 Get in Touch

Building something latency-critical? Want to push LLM inference to the edge? Let's talk.
