
Releases: tile-ai/TileRT

TileRT v0.1.3 – GLM-5 Lands

14 Feb 11:59
deeedc3


🚀 TileRT v0.1.3 – GLM-5 Lands. Two Frontier Models, One Ultra-Low-Latency Runtime.

We're thrilled to announce TileRT v0.1.3, a major release that brings full GLM-5 support to TileRT alongside DeepSeek-V3.2, making TileRT a multi-model ultra-low-latency inference runtime.

With this release, GLM-5 delivers the same class of ultra-low-latency performance as DeepSeek-V3.2 on 8× NVIDIA B200 GPUs. Both models now benefit from Multi-Token Prediction (MTP), Top-P sampling, and extended context lengths — all accessible through the same Python API. Don't take our word for it — try it yourself:

🌐 Try It Now — Live Online Demo

👉 https://www.tilert.ai

✨ Key Highlights

This release adds GLM-5 as a first-class supported model, launches a public online demo, and introduces new sampling and generation capabilities.

🧠 GLM-5 Model Support

TileRT now provides full end-to-end inference support for GLM-5, bringing a second frontier-class model into the runtime.

This includes:

  • Multi-Token Prediction (MTP) for accelerated decode
  • Thinking mode for extended reasoning workloads
  • Up to 200K context length
  • Per-request dynamic sampling parameter control

GLM-5 is fully integrated into TileRT's unified Python API, sharing the same interface and workflow as DeepSeek-V3.2.

🎯 Top-P (Nucleus) Sampling

TileRT now supports Top-P sampling for both DeepSeek-V3.2 and GLM-5, in addition to the existing Top-K strategy.

Top-P sampling is available through the standard generation API and requires no additional setup.
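For intuition, Top-P (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, renormalizes over that set, and samples from it. The sketch below is a minimal pure-Python illustration of the idea — not TileRT's implementation, and the function names are ours:

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest prefix of tokens (by descending probability)
    whose cumulative mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_top_p(probs, p, rng=random.random):
    """Draw one token index from the renormalized nucleus."""
    filtered = top_p_filter(probs, p)
    r, acc = rng(), 0.0
    for i, q in filtered.items():
        acc += q
        if r <= acc:
            return i
    return next(reversed(filtered))  # guard against float rounding
```

With probabilities [0.5, 0.3, 0.1, 0.1] and p = 0.8, only the first two tokens survive the filter; lowering p toward 0 degenerates into greedy decoding, while p = 1.0 recovers full sampling.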

📏 Extended Context Length

Both supported models now handle significantly longer sequences:

  • DeepSeek-V3.2: up to 160K tokens
  • GLM-5: up to 200K tokens

Longer context support is transparent to users and enabled by default.

🔧 Unified Python Interface & Weight Conversion

The Python frontend has been restructured around a unified, model-agnostic interface, with a consistent workflow shared across all supported models.

  • Direct Hugging Face compatibility — Load and convert official model weights from Hugging Face without manual preprocessing
  • Clean, model-agnostic abstractions designed to streamline onboarding of future architectures

Users can now go from a Hugging Face model checkpoint to TileRT inference in a single step.

🔮 What's Next

TileRT is evolving quickly. Upcoming areas of focus include:

  • Further MTP and decode throughput optimization across both models
  • Expanded PD disaggregation for production-scale serving
  • Ongoing kernel-level performance tuning for new GPU features

🤝 Join the Community

TileRT is developed in the open, and user feedback plays a key role in shaping its evolution ❤️

If you're interested in:

  • ultra-low-latency LLM inference,
  • multi-model serving on cutting-edge GPUs,
  • or production-oriented inference optimization,

we invite you to try this release, share your experiences, and join the discussion.

  • ⭐ Star the repo to show your interest and support
  • 🐞 Open issues to report bugs, share feedback, or request features
  • 💬 Start discussions around use cases, performance observations, and integration experiences

Let's move toward faster and more scalable inference together with TileRT 🚀

v0.1.2-alpha.1: Multi-Token Prediction Lands, Batch Support Expanded

26 Jan 06:06
d18b3ef


🚀 TileRT v0.1.2-alpha.1 – Multi-Token Prediction Is Here. Faster Inference Starts Now.

We’re excited to introduce TileRT v0.1.2-alpha.1, an alpha release that marks TileRT’s first step toward Multi-Token Prediction (MTP), which reduces sequential decoding depth in autoregressive inference.

This release adds initial support for MTP, enabling multiple tokens to be generated per forward pass. With mtp=3, we observe decoding rates up to 590 tokens/s on synthetic workloads and ~440 tokens/s on real generation tasks. These results establish an early reference point as we continue exploring and refining MTP.

✨ Key Highlights

This release focuses on expanding TileRT’s inference capabilities, improving scalability, and strengthening the foundation for future performance work.

🧠 Multi-Token Prediction (MTP)

TileRT now provides end-to-end support for Multi-Token Prediction (MTP), enabling multiple-token generation per forward pass.

This includes:

  • DSA MTP model integration
  • End-to-end execution flow
  • Weight conversion tooling

Together, these components form a complete and practical foundation for experimenting with and evaluating multi-token generation workflows through TileRT’s Python API.
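Conceptually, decoding with MTP can be pictured as drafting several tokens per forward pass and keeping the prefix that a verification pass agrees with. The toy loop below illustrates that accept-the-agreeing-prefix idea in plain Python; it is a conceptual sketch, not TileRT's DSA MTP pipeline, and `draft`/`verify` are hypothetical stand-ins for model calls:

```python
def accepted_prefix(drafted, verified):
    """Length of the longest prefix where the draft agrees with verification."""
    n = 0
    for d, v in zip(drafted, verified):
        if d != v:
            break
        n += 1
    return n

def mtp_decode(draft, verify, prompt, max_tokens, mtp=3):
    """Generate up to max_tokens tokens, drafting `mtp` tokens per step.

    draft(seq, k)  -> k speculative next tokens for `seq` (stand-in)
    verify(seq, k) -> k tokens the full model would emit (stand-in)
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_tokens:
        k = min(mtp, max_tokens - (len(seq) - len(prompt)))
        drafted = draft(seq, k)
        verified = verify(seq, k)
        n = accepted_prefix(drafted, verified)
        # Accept the agreeing prefix, plus the first verified token after
        # a mismatch, so every step makes progress.
        seq += verified[: min(n + 1, k)]
    return seq[len(prompt):]
```

When the draft agrees often, each forward pass advances by up to `mtp` tokens instead of one, which is where the observed decode-rate gains come from; a bad draft degrades gracefully to one token per step.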

⚡ Performance, Scalability, and Execution Improvements

This release includes broad internal improvements aimed at better scalability and more efficient inference execution:

  • Expanded support for batched execution across key inference paths, including attention, projection, normalization, and MoE-related operators.
  • Continued internal optimizations targeting compute efficiency, operator fusion, and reduced overhead during token generation.

These enhancements are delivered through prebuilt binaries and Python APIs, and do not require user-side code changes.

🏗️ Architecture & Maintainability

To support faster iteration and long-term evolution, TileRT’s internal architecture has been further refined:

  • Operator inputs are unified through a consistent argument abstraction.
  • Operator interfaces are simplified with compile-time batch and sequence-length specialization.

These changes improve maintainability today and prepare the codebase for future feature expansion.

🔮 What’s Next

TileRT is evolving quickly, and this alpha release sets the stage for upcoming work, including:

  • Further MTP refinements.
  • Improved weight conversion workflows to enable more flexible optimization strategies.
  • Ongoing latency improvements across inference pipelines.

🤝 Join the Community

TileRT is developed in the open, and user feedback plays a key role in shaping its evolution ❤️

If you’re interested in:

  • multi-token generation,
  • high-performance inference runtimes,
  • or production-oriented inference optimization,

we invite you to try this alpha release, share your experiences, and join the discussion.

  • ⭐ Star the repo to show your interest and support
  • 🐞 Open issues to report bugs, share feedback, or request features
  • 💬 Start discussions around use cases, performance observations, and integration experiences

Let’s move toward faster and more scalable inference together with TileRT 🚀

v0.1.1: Faster Token Generation ⚡

23 Dec 13:06
20a862c


🚀 TileRT v0.1.1 – Ultra-Low-Latency Token Generation

TileRT v0.1.1 delivers a significant boost in token generation performance, reducing latency by 35% compared to the previous release.

This improvement is achieved through optimizations to core operators and enhancements to the tile-level runtime engine. Key updates include faster GEMV kernels, expanded FP8/BF16 support across multiple kernels, and improved runtime scheduling and memory behavior.

✨ Highlights

  • Performance Boost: Token generation is now significantly faster, with latency reduced by around 35%. See our latest speed tests for exact figures.
  • Operator & Precision Optimizations: Faster GEMV, RMSNorm, and MMA-based operators with expanded FP8/BF16 support.
  • Runtime Enhancements: Improved tile-level scheduling, prefetching, memory alignment, and multi-device task handling.
  • Stability Fixes: Resolved issues affecting runtime stability and memory behavior.

🔧 What’s Changed

🚀 Performance & Operators

  • Optimized GEMV and RMSNorm operators for improved performance.
  • Expanded FP8/BF16 support across multiple kernels.
  • Improved expert selection performance.
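As a reference point for what these kernels compute (not how TileRT implements them), GEMV is the matrix-vector product at the heart of decode-time linear layers at batch size 1, and RMSNorm rescales activations by their root mean square. Plain-Python definitions, for illustration only:

```python
import math

def gemv(A, x):
    """y = A @ x for an m x n matrix A (list of rows) and a length-n vector x."""
    assert all(len(row) == len(x) for row in A)
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmsnorm(x, weight, eps=1e-6):
    """Scale each element of x by weight / RMS(x)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]
```

The optimized kernels operate on FP8/BF16 tiles with fused memory access patterns, but they compute these same quantities.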

⚙️ Runtime & Kernel Execution

  • Enhanced tile-level runtime engine for better scheduling, prefetching, and memory management.
  • Fixed shared memory alignment issues and inter-operator dependencies.

🔮 Looking Ahead

TileRT is under active development. The next release and upcoming work will focus on:

  • Further latency reductions in token generation.
  • Introduction of new features, including MTP support.
  • Opening the weight converter, enabling decoupled layouts and more flexible kernel optimizations.

With ongoing refactoring and continuous enhancements to operators and the runtime engine, we invite the community to follow our progress, test new features, and provide feedback to help shape the future development of TileRT.

v0.1.0-alpha.1 Release Notes

22 Nov 09:09
8b5225a


TileRT: Pushing the Boundaries of Low-Latency LLM Inference


We’re excited to announce the first preview release of TileRT (v0.1.0-alpha.1). This initial exploration version introduces an experimental runtime that investigates tile-level compilation techniques for ultra-low-latency LLM inference. It serves as a starting point for evaluating TileRT’s potential to reduce end-to-end latency while maintaining compatibility with large-scale models and supporting future integration with TileLang and TileScale.

🚀 Overview

The goal of the TileRT project is to push the latency boundaries of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to run at millisecond-level TPOT (time per output token). TileRT addresses these challenges with a new tile-level runtime engine. It uses a compiler-driven approach to decompose LLM operators into fine-grained tile-level tasks, and a tile-level runtime that reschedules compute, I/O, and communication across multiple devices in a highly overlapped manner. This allows TileRT to minimize idle time and maximize hardware utilization. These compiler techniques will be incorporated into TileLang and TileScale.
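To make "tile-level tasks" concrete, the sketch below decomposes a matmul into independent output-tile/K-slice tasks that a scheduler is then free to interleave with I/O and communication. This is purely illustrative and not TileRT's actual task representation:

```python
def tile_tasks(M, N, K, tm, tn, tk):
    """Enumerate tile-level tasks for an M x K @ K x N matmul.

    Each task computes one tm x tn output tile's partial product over a
    tk-wide slice of K; the tasks can run in any order or be overlapped
    with data movement, since each only accumulates into its own tile.
    """
    for i in range(0, M, tm):
        for j in range(0, N, tn):
            for k in range(0, K, tk):
                yield (i, j, k)

def run_tiled_matmul(A, B, M, N, K, tm, tn, tk):
    """Execute the tile tasks; results accumulate into C regardless of order."""
    C = [[0.0] * N for _ in range(M)]
    for (i, j, k) in tile_tasks(M, N, K, tm, tn, tk):
        for ii in range(i, min(i + tm, M)):
            for jj in range(j, min(j + tn, N)):
                C[ii][jj] += sum(A[ii][kk] * B[kk][jj]
                                 for kk in range(k, min(k + tk, K)))
    return C
```

Because each task touches a bounded tile of inputs and outputs, a runtime can schedule them to hide memory and communication latency behind compute — the property the tile-level engine exploits across devices.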

We evaluated TileRT’s preliminary performance using the DeepSeek-V3.2-Exp model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT significantly outperforms existing inference systems:

Fig. TileRT benchmark. Evaluation setup: input/output sequence length 1K/1K; SGLang 0.5.5, vLLM 0.11.0, CUDA 12.9.

TileRT is a continuously evolving project. Our ongoing plans include pursuing more aggressive optimizations, supporting various batch sizes, more model families and more hardware, and establishing a new foundation for low-latency AI inference. Stay tuned for updates!

Installation

Before installing the TileRT wheel package, please ensure your environment meets the following requirements:

Supported Environment

This wheel is built and tested under the following conditions:

  • Hardware: 8× NVIDIA B200 GPUs
  • Operating System: Linux x86_64 (Ubuntu 20.04+ recommended)
  • Python Versions: 3.11 – 3.12
  • CUDA Version: 12.9
  • CUDA Driver: Compatible with the B200 runtime environment
  • PyTorch Build: PyTorch wheels compiled for CUDA 12.8 or 12.9 (matching the driver/runtime above for B200)

Python Package Installation

Important

Disclaimer: TileRT is an experimental project. The current preview build supports the 8-GPU B200 setup. For the most reliable experience, we strongly recommend installing the package within the provided Docker image.
For more details on the Docker environment and usage instructions, please refer to the TileRT project homepage on GitHub.

Docker Installation

To get started, pull the Docker image:

docker pull tileai/tilert:v0.1.0

Then, launch a Docker container using the following command:

IMAGE_NAME="tileai/tilert:v0.1.0"
WORKSPACE_PATH="xxx"  # Path to the workspace you want to mount

docker run --gpus all -it \
    -v $WORKSPACE_PATH:/workspace/ \
    $IMAGE_NAME

After the container starts, install the TileRT package:

pip install tilert

🌟 Join the Journey

TileRT is developed and maintained by the TileRT team. This preview release marks just the beginning, and we’re continuing to explore new compiler techniques, improve runtime performance, and expand multi-device support.

If you’re interested in ultra-low-latency LLM inference, we invite you to follow the project, share feedback, and join us as TileRT evolves.