68 changes: 41 additions & 27 deletions README.md
______________________________________________________________________

## 📰 News

- :fire: **2026-02-14 · [Try the Online Demo](https://www.tilert.ai/)**. Our online demo is now live! Experience ultra-low-latency inference with **GLM-5** and **DeepSeek-V3.2**. [Try it now!](https://www.tilert.ai)

- 🎉 **2026-02-14 · [v0.1.3](https://github.com/tile-ai/TileRT/releases/tag/v0.1.3) Released**. The v0.1.3 release introduces full support for the latest GLM-5 model, achieving up to 500 tokens/s on GLM-5-FP8 and up to 600 tokens/s on DeepSeek-V3.2.

- 🚀 **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP)** is now available in TileRT! With mtp=3, we achieve decoding rates of up to **590 tokens/s** under synthetic workloads.

<details>
<summary>Key Milestones</summary>

- ⚡ **2025-12-23 · [v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)**. Achieved a further ~**35% reduction** (3-4x speedup over the baseline) in end-to-end token generation latency on a single node with **8× NVIDIA B200**.

- 🚀 **2025-11-20 · [v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)**. Initial public release for **DeepSeek-V3.2-Exp**, targeting **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).

</details>

______________________________________________________________________

<a id="overview"></a>

## TileRT: Pushing LLM Latency to the Limit

In our latest **v0.1.3** release, we tested **TileRT's** performance on the newest [**GLM-5**](https://huggingface.co/zai-org/GLM-5-FP8) model, demonstrating the effectiveness of our approach in real-world applications. We were among the first to support this model.

TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
Using the [**GLM-5**](https://huggingface.co/zai-org/GLM-5-FP8) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs, we evaluated TileRT’s preliminary performance. As shown in the benchmarks below, TileRT demonstrates substantial improvements over existing inference systems.

<p align="center">
<img src="assets/glm5-mtp.png" alt="TileRT Benchmark" width="800"><br>
Figure 1. Evaluation setup. Batch size: 1; input sequence lengths: 1K, 16K, 32K, 64K, 128K, 150K, 192K; output sequence length: 1K; benchmarked with <a href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataset/prepare_synthetic_data.py">synthetic data</a>. SGLang v0.5.9.dev0 with MTP=3; vLLM v0.16.0rc2.dev173 with MTP=1 (vLLM failed with MTP=3, so we set MTP=1 following the <a href="https://docs.vllm.ai/projects/recipes/en/latest/GLM/GLM5.html">vLLM GLM-5 recipe</a>); TileRT v0.1.3 with MTP=3.
</p>


<p align="center">
<img src="assets/glm5-without-mtp.png" alt="TileRT Benchmark" width="800"><br>
Figure 2. Evaluation setup. Batch size: 1; Input sequence length: 1K, 16K, 32K, 64K, 128K, 150K, 192K; Output sequence length: 1K; Benchmark with <a href="https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/bench/dataset/prepare_synthetic_data.py">synthetic data</a>. SGLang v0.5.9.dev0; vLLM v0.16.0rc2.dev173; TileRT v0.1.3.
</p>

Unlike traditional inference systems optimized for high-throughput batch processing, TileRT prioritizes **responsiveness**, which is critical for applications such as high-frequency trading, interactive AI, real-time decision-making, long-running agents, and AI-assisted coding, where the latency of individual requests matters most.
You're now ready to use TileRT! Proceed to [Getting Started](#getting-started).

## Getting Started

### Step 1: Download Official Model Weights

Starting from release v0.1.3, TileRT no longer requires downloading pre-converted weights from Hugging Face. Instead, you can download the official model weights directly from the model's source (e.g., Hugging Face), and then convert them using the weight converter script included with the latest TileRT release.

### Step 2: Convert Weights Using `weight_converter.py`

After downloading the official model weights, you can use the following command to convert them into a format compatible with TileRT:

For **DeepSeek-V3.2**, run:

```bash
python -m tilert.models.preprocess.weight_converter \
--model_type deepseek-v32 \
--model_dir "/path/to/DeepSeek-V3.2" \
--save_dir "/path/to/DeepSeek-V3.2-TileRT"
```

Replace `/path/to/DeepSeek-V3.2` with the directory where you've downloaded the model weights, and `/path/to/DeepSeek-V3.2-TileRT` with the directory where you'd like the converted weights to be saved.

Similarly, for **GLM-5**, run:

```bash
python -m tilert.models.preprocess.weight_converter \
--model_type glm-5 \
--model_dir "/path/to/GLM-5-FP8" \
--save_dir "/path/to/GLM-5-FP8-TileRT"
```

Replace `/path/to/GLM-5-FP8` with the directory containing the downloaded GLM-5 model weights, and `/path/to/GLM-5-FP8-TileRT` with the desired location for saving the converted weights.

### Step 3: Set the Converted Weights Directory

Once the weights are converted, set the environment variable to point TileRT to the directory containing the converted weights:

```bash
export MODEL_WEIGHTS_DIR=/path/to/converted-weights  # the --save_dir from Step 2
```

Now you're ready to use TileRT with the converted weights!

### Running the Generation Example

After converting the model weights, you can run the generation example within the Docker environment as follows:
This example demonstrates basic single-step autoregressive generation using the model.

### Running the Generation Example with Multi-Token Prediction (MTP)


TileRT also supports Multi-Token Prediction (MTP), which allows the model to generate multiple tokens per forward pass and reduces sequential decoding depth.
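
As back-of-envelope arithmetic (illustrative only, not TileRT internals): the more tokens committed per forward pass, the shallower the sequential decode. Assuming an average of 2.8 committed tokens per pass at mtp=3 (an invented figure), a minimal sketch:

```python
import math

def decode_steps(total_tokens: int, avg_tokens_per_pass: float) -> int:
    """Forward passes needed to emit total_tokens at a given acceptance level."""
    return math.ceil(total_tokens / avg_tokens_per_pass)

baseline = decode_steps(1000, 1.0)  # no MTP: one token per forward pass
with_mtp = decode_steps(1000, 2.8)  # invented 2.8 tokens/pass average at mtp=3
print(baseline, with_mtp)  # → 1000 358
```

Fewer passes for the same output length is what shows up as the higher tokens/s in the MTP benchmarks.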

To better illustrate MTP behavior, we use a longer prompt that encourages extended generation:
Binary file removed assets/generate.gif
Binary file added assets/glm5-mtp.png
Binary file added assets/glm5-without-mtp.png
Binary file modified assets/logo.png
Binary file removed assets/perf.png
2 changes: 0 additions & 2 deletions python/__init__.py
@@ -50,7 +50,6 @@ def _load_library(filename: str) -> Any:


from . import models # noqa: E402
from .generate import ShowHandsGenerator # noqa: E402
from .models import deepseek_v3_2 # noqa: E402
from .tilert_init import tilert_init # noqa: E402

@@ -59,6 +58,5 @@ def _load_library(filename: str) -> Any:
"tilert_init",
"models",
"deepseek_v3_2",
"ShowHandsGenerator",
"__version__",
]
129 changes: 129 additions & 0 deletions python/benchmark/__init__.py
@@ -0,0 +1,129 @@
"""Benchmark suite for TileRT generation."""

from dataclasses import dataclass
from typing import TypeAlias

from tilert.models.deepseek_v3_2.generator import DSAv32Generator
from tilert.models.glm_5.generator import GLM5Generator

Generator: TypeAlias = DSAv32Generator | GLM5Generator


@dataclass
class BenchMode:
"""Configuration for a single benchmark mode."""

with_mtp: bool
label: str
# Sampling parameters — None means keep current generator defaults (top-k1 argmax).
use_topp: bool = False
top_p: float = 1.0
top_k: int = 256
temperature: float = 1.0


@dataclass
class CellStats:
"""Stats for a single table cell (one mode x one benchmark column)."""

tok_s: float = 0.0
ms: float = 0.0
acc_rate: str = "-"


BenchStats = dict[str, dict[str, CellStats]]


def apply_mode(generator: Generator, mode: BenchMode) -> None:
"""Apply sampling parameters for a benchmark mode."""
generator.update_sampling_params(
temperature=mode.temperature,
top_p=mode.top_p,
top_k=mode.top_k,
use_topp=mode.use_topp,
)


def merge_stats(stats_list: list[BenchStats]) -> BenchStats:
"""Merge multiple benchmark stats dicts by mode label."""
merged: BenchStats = {}
for stats in stats_list:
for mode, cols in stats.items():
merged.setdefault(mode, {}).update(cols)
return merged


def _fmt(number: float, suffix: str) -> str:
return f"{number:.3f} {suffix}"


def print_summary_table(
all_stats: BenchStats,
model_name: str,
) -> None:
"""Print a markdown summary table from merged benchmark stats.

Each mode occupies 3 rows: tok/s, ms, acc_rate.
"""
if not all_stats:
return

# Collect column keys in insertion order (preserves benchmark ordering)
col_keys: list[str] = []
for cols in all_stats.values():
for k in cols:
if k not in col_keys:
col_keys.append(k)

ROW_LABELS = ["tok/s", "ms", "acc"]

# Build formatted cell strings: {mode: {col: [row0, row1, row2]}}
formatted: dict[str, dict[str, list[str]]] = {}
for mode, cols in all_stats.items():
formatted[mode] = {}
for k in col_keys:
cell = cols.get(k)
if cell is None:
formatted[mode][k] = ["-", "-", "-"]
else:
formatted[mode][k] = [
_fmt(cell.tok_s, "tok/s"),
_fmt(cell.ms, "ms"),
cell.acc_rate,
]

# Compute column widths
col_widths: dict[str, int] = {}
for k in col_keys:
w = len(k)
for mode_cells in formatted.values():
for row_str in mode_cells.get(k, ["-"]):
w = max(w, len(row_str))
col_widths[k] = w

mode_width = max(len("Mode"), max(len(m) for m in all_stats))
# Row label column shares the mode column; pick wider of mode names vs row labels
mode_width = max(mode_width, max(len(r) for r in ROW_LABELS))

print(f"\n## Benchmark Summary ({model_name})\n")

# Header
hdr = [f" {'Mode':<{mode_width}} "]
hdr += [f" {k:<{col_widths[k]}} " for k in col_keys]
print("|" + "|".join(hdr) + "|")

# Separator
sep = ["-" * (mode_width + 2)]
sep += ["-" * (col_widths[k] + 2) for k in col_keys]
print("|" + "|".join(sep) + "|")

# Data rows: 3 rows per mode
mode_list = list(all_stats.keys())
for _, mode in enumerate(mode_list):
for row_idx, _row_label in enumerate(ROW_LABELS):
label = mode if row_idx == 0 else ""
cells = [f" {label:<{mode_width}} "]
for k in col_keys:
cell_text = formatted[mode][k][row_idx]
cells.append(f" {cell_text:<{col_widths[k]}} ")
print("|" + "|".join(cells) + "|")
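
As a standalone illustration of how the helpers above combine per-benchmark results, the snippet below re-declares the minimal pieces (`CellStats`, `merge_stats`) so it runs without `tilert`; all stats values are invented:

```python
from dataclasses import dataclass

# Minimal standalone mirror of CellStats / merge_stats above, so this
# sketch runs without tilert installed. The stats values are invented.

@dataclass
class CellStats:
    tok_s: float = 0.0
    ms: float = 0.0
    acc_rate: str = "-"

def merge_stats(stats_list):
    """Merge per-benchmark stats dicts by mode label (same logic as above)."""
    merged = {}
    for stats in stats_list:
        for mode, cols in stats.items():
            merged.setdefault(mode, {}).update(cols)
    return merged

# Each benchmark module returns {mode_label: {column: CellStats}}.
coding = {"MTP": {"Coding": CellStats(tok_s=590.0, ms=1.7, acc_rate="2.80/1/4")}}
long_run = {"MTP": {"Long": CellStats(tok_s=540.0, ms=1.9, acc_rate="2.60/1/4")}}

merged = merge_stats([coding, long_run])
print(sorted(merged["MTP"]))  # → ['Coding', 'Long']
```

Merging by mode label is what lets each benchmark module contribute its own column to the shared summary table.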
46 changes: 46 additions & 0 deletions python/benchmark/coding_prompt.py
@@ -0,0 +1,46 @@
"""Coding-prompt benchmark: single generation, measures coding task throughput."""

from typing import cast

import numpy as np
from benchmark import BenchMode, BenchStats, CellStats, Generator, apply_mode

PROMPT = "Hi, can you write a sort program in C for me?"


def run(generator: Generator, modes: list[BenchMode]) -> BenchStats:
"""Run the coding-prompt benchmark for each mode.

Returns stats with column: Coding.
"""
stats: BenchStats = {}

for mode in modes:
apply_mode(generator, mode)
print(f"\n--- Coding-prompt benchmark ({mode.label}) ---")
print(f"Prompt: {PROMPT}")
print("Completion:")

_, time_list, accepted_counts = cast(
tuple[str, list[float], list[int]],
generator.generate(PROMPT, True, with_mtp=mode.with_mtp),
)

mode_stats: dict[str, CellStats] = {}

if mode.with_mtp and accepted_counts:
total_tokens = sum(accepted_counts)
total_time = sum(time_list)
speed = total_tokens / total_time if total_time > 0 else 0
avg_ms = total_time / len(time_list) * 1000
avg_a = total_tokens / len(accepted_counts)
acc_rate = f"{avg_a:.2f}/{min(accepted_counts)}/{max(accepted_counts)}"
mode_stats["Coding"] = CellStats(tok_s=speed, ms=avg_ms, acc_rate=acc_rate)
elif time_list:
mean_time = float(np.mean(time_list))
speed = 1 / mean_time
mode_stats["Coding"] = CellStats(tok_s=speed, ms=mean_time * 1000)

stats[mode.label] = mode_stats

return stats
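
The MTP branch of `run()` above reduces to simple arithmetic over the per-pass timings and accepted-token counts; a standalone check with made-up measurements:

```python
# Standalone mirror of the MTP stats computed in run() above, with made-up data.
time_list = [0.010, 0.012, 0.011]      # seconds per forward pass
accepted_counts = [3, 2, 4]            # tokens committed per pass

total_tokens = sum(accepted_counts)            # 9 tokens in total
total_time = sum(time_list)                    # ~0.033 s in total
speed = total_tokens / total_time              # tokens per second
avg_ms = total_time / len(time_list) * 1000    # average ms per pass
avg_a = total_tokens / len(accepted_counts)    # average accepted per pass
acc_rate = f"{avg_a:.2f}/{min(accepted_counts)}/{max(accepted_counts)}"
print(acc_rate)  # → 3.00/2/4
```

The `acc_rate` string thus reads "average/min/max" accepted tokens per pass, matching the `acc` row of the summary table.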
46 changes: 46 additions & 0 deletions python/benchmark/long_prompt.py
@@ -0,0 +1,46 @@
"""Long-prompt benchmark: single generation, measures long-form throughput."""

from typing import cast

import numpy as np
from benchmark import BenchMode, BenchStats, CellStats, Generator, apply_mode

PROMPT = "Hi, can you tell me a very long story, with roughly 3000 words?"


def run(generator: Generator, modes: list[BenchMode]) -> BenchStats:
"""Run the long-prompt benchmark for each mode.

Returns stats with column: Long.
"""
stats: BenchStats = {}

for mode in modes:
apply_mode(generator, mode)
print(f"\n--- Long-prompt benchmark ({mode.label}) ---")
print(f"Prompt: {PROMPT}")
print("Completion:")

_, time_list, accepted_counts = cast(
tuple[str, list[float], list[int]],
generator.generate(PROMPT, True, with_mtp=mode.with_mtp),
)

mode_stats: dict[str, CellStats] = {}

if mode.with_mtp and accepted_counts:
total_tokens = sum(accepted_counts)
total_time = sum(time_list)
speed = total_tokens / total_time if total_time > 0 else 0
avg_ms = total_time / len(time_list) * 1000
avg_a = total_tokens / len(accepted_counts)
acc_rate = f"{avg_a:.2f}/{min(accepted_counts)}/{max(accepted_counts)}"
mode_stats["Long"] = CellStats(tok_s=speed, ms=avg_ms, acc_rate=acc_rate)
elif time_list:
mean_time = float(np.mean(time_list))
speed = 1 / mean_time
mode_stats["Long"] = CellStats(tok_s=speed, ms=mean_time * 1000)

stats[mode.label] = mode_stats

return stats