Merged
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -36,6 +36,6 @@ jobs:
- name: Install lint dependencies
run: |
python -m pip install --upgrade pip
pip install --no-cache-dir -r requirements-ci.txt
pip install --no-cache-dir -r requirements-dev.txt
- name: Run all linting checks
run: ./scripts/lint.sh
122 changes: 105 additions & 17 deletions README.md
@@ -6,24 +6,37 @@
<a href="https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-1E90FF"></a>
</p>
<p>
<a href="#python-package-installation"><b>Installation</b></a> |
<a href="#getting-started"><b>Getting Started</b></a>
<a href="#overview"><b>Overview</b></a> ·
<a href="#running-the-generation-example"><b>Generation</b></a> ·
<a href="#running-the-generation-example-with-multi-token-prediction-mtp"><b>MTP Generation</b></a> ·
<a href="#installation"><b>Installation</b></a> ·
<a href="#news"><b>News</b></a>
</p>
</div>

## News
______________________________________________________________________

- **\[2025-12-23\]****[v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)** — Achieved ~35% reduction in end-to-end token generation latency on a single node with 8× NVIDIA B200. See our latest benchmarks for detailed measurements.
<a id="news"></a>

- **\[2025-11-20\]** 🚀 **[v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)** — Initial release of TileRT for DeepSeek-V3.2-Exp, designed for **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).
## 📰 News

- :fire: **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP) lands in TileRT**. With `mtp=3`, we observe decoding rates up to **590 tokens/s** under synthetic workloads.

- **2025-12-23 · [v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)**. Achieved a further ~**35% reduction** (3-4x speedup over baseline) in end-to-end token generation latency on a single node with **8× NVIDIA B200**.

- 🚀 **2025-11-20 · [v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)**. Initial public release for **DeepSeek-V3.2-Exp**, targeting **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).

______________________________________________________________________

<a id="overview"></a>

## TileRT: Pushing LLM Latency to the Limit

TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
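
For intuition, TPOT converts directly into a decode rate. A quick sketch (illustrative arithmetic only, not TileRT code):

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Decode rate implied by a given time per output token (batch size 1)."""
    return 1000.0 / tpot_ms

# Millisecond-level TPOT corresponds to hundreds of tokens per second:
print(tokens_per_second(2.0))  # prints 500.0
```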

<p align="center">
<img src="assets/generate.gif" alt="TileRT Benchmark"><br>
Figure 1. Sequence generation with TileRT.
Figure 1. Sequence generation with TileRT, now enhanced with Multi-Token Prediction (MTP) to accelerate inference.
</p>

We evaluated TileRT’s preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.
@@ -39,6 +52,8 @@ To achieve this, TileRT introduces a **tile-level runtime engine**. Leveraging a

The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into **TileLang** and **TileScale**.

______________________________________________________________________

## Installation

- [Prerequisites](#prerequisites)
@@ -145,39 +160,112 @@ docker run --gpus all -it \
tilert:v0.1.0
```

Once inside the container, you can run the following Python script:
Once inside the container, run the following Python script to perform text generation:

```python
from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator

generator: ShowHandsGenerator = ShowHandsGenerator(
max_new_tokens=1000,
model_weights_dir=MODEL_WEIGHTS_DIR,
with_mtp=False, # Disable MTP
)
generator.from_pretrained()

prompt = """Tell me three jokes:
1. A dad joke,
2. A programmer joke,
3. A joke that only makes sense if you've ever tried to train a large language model.
Keep each joke under 15 words.
"""
prompt = (
"Tell me three jokes:\n\n"
"1. A dad joke,\n"
"2. A programmer joke,\n"
"3. A joke that only makes sense if you've ever tried "
"to train a large language model.\n"
"Keep each joke under 15 words."
)

print("Prompt:", prompt)
print("Completion:")
completion: generator.generate(prompt)
completion, _, _ = generator.generate(prompt)
```

For instance, using the above prompt, TileRT might generate:
For example, TileRT may generate:

<details>
<summary><b>Sample output (click to expand)</b></summary>

```text
1. I'm afraid for the calendar. Its days are numbered.
2. There are only 10 kinds of people: those who understand binary and those who don't.
3. My model's loss is low, but its answers are still nonsense. Overfitting.
```

This example gives you a quick idea of the type of output you can expect from the precompiled model.
</details>

This example demonstrates basic single-step autoregressive generation using the precompiled model.

### Running the Generation Example with Multi-Token Prediction (MTP)

> \[!IMPORTANT\]
> **Weights update required for MTP.** Multi-Token Prediction (MTP) introduces additional **MTP heads** in the model weights.
> If you were using TileRT **before v0.1.1**, please make sure you download the **latest weights** from Hugging Face.
> Older weights do not include the required MTP heads and will fail to run when MTP is enabled.

TileRT also supports Multi-Token Prediction (MTP), which allows the model to generate multiple tokens per forward pass, reducing sequential decoding depth.
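
To make the idea concrete, here is a toy sketch of multi-token decoding (illustrative only; not TileRT internals, and the acceptance model is invented for the example):

```python
import random

def mtp_decode_steps(total_tokens: int, mtp: int = 3, seed: int = 0) -> int:
    """Toy model of MTP decoding: each forward pass emits the base token
    plus 0..mtp accepted draft tokens, so fewer sequential steps are
    needed to produce the same number of tokens."""
    rng = random.Random(seed)
    produced, steps = 0, 0
    while produced < total_tokens:
        accepted = 1 + rng.randint(0, mtp)  # base token + accepted drafts
        produced += accepted
        steps += 1
    return steps

print(mtp_decode_steps(100, mtp=3))  # far fewer than 100 sequential steps
print(mtp_decode_steps(100, mtp=0))  # degenerates to standard decoding: 100
```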

To better illustrate MTP behavior, we use a longer prompt that encourages extended generation:

```python
from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator

generator: ShowHandsGenerator = ShowHandsGenerator(
max_new_tokens=1000,
model_weights_dir=MODEL_WEIGHTS_DIR,
with_mtp=True, # Enable MTP
)
generator.from_pretrained()
prompt = "Tell me 10 jokes, keep them all under 100 words."

print("Prompt:", prompt)
print("Completion:")
completion, _, _ = generator.generate(prompt)
print(completion)
```

When MTP is enabled, TileRT may report statistics similar to the following during generation:

```text
Accepted length: mean=2.77, min=1, max=4
```

This indicates that, on average, multiple tokens are accepted per decoding step under MTP.
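
As a rough back-of-envelope check (assuming per-step latency is roughly unchanged under MTP, which the extra head work makes optimistic), the mean accepted length approximates the speedup over one-token-per-step decoding:

```python
def mtp_speedup_estimate(mean_accepted: float) -> float:
    """If each decoding step costs about the same wall time, emitting
    `mean_accepted` tokens per step cuts sequential steps by that factor."""
    return mean_accepted

baseline_tokens_per_s = 200.0  # hypothetical non-MTP decode rate
print(baseline_tokens_per_s * mtp_speedup_estimate(2.77))  # prints 554.0
```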

<details>
<summary><b>Sample output (click to expand)</b></summary>

```text
Of course! Here are 10 short jokes for you.
1. I told my wife she was drawing her eyebrows too high. She looked surprised.
2. I invented a new word: Plagiarism.
3. Why don't scientists trust atoms? Because they make up everything.
4. I'm reading a book on anti-gravity. It's impossible to put down.
5. What's the best thing about Switzerland? I don't know, but the flag is a big plus.
6. I told my computer I needed a break, and now it won't stop sending me vacation ads.
7. Why did the scarecrow win an award? He was outstanding in his field.
8. What do you call a fake noodle? An impasta.
9. I told my suitcase there's no vacation, and now it has a lot of baggage.
10. Why don't skeletons fight each other? They don't have the guts.
```

</details>

This example highlights how MTP enables TileRT to efficiently generate longer outputs by accepting multiple tokens per decoding step.

For more details, please refer to the [generation script](https://github.com/tile-ai/TileRT/blob/main/python/generate.py).

Binary file modified assets/generate.gif
3 changes: 2 additions & 1 deletion python/__init__.py
@@ -40,7 +40,8 @@ def _load_library(filename: str) -> Any:
lib_path = Path(__file__).parent / filename

try:
return ctypes.CDLL(str(lib_path))
torch.ops.load_library(str(lib_path))
return lib_path
except Exception as e:
> **Copilot AI (Jan 26, 2026), on lines 42 to 45:** `_load_library` now uses `torch.ops.load_library(...)` and returns `lib_path`, but the docstring says it returns "the loaded library". Since the return value is unused (the module-level call does not capture it), consider returning `None` and updating the docstring and annotation accordingly, or returning a meaningful handle consistently.
raise RuntimeError(f"Failed to load library from {lib_path}") from e

99 changes: 88 additions & 11 deletions python/generate.py
@@ -1,6 +1,9 @@
"""Text generation script for TileRT."""

from argparse import ArgumentParser
from typing import cast

import numpy as np

from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator

@@ -16,7 +19,16 @@ def parse_args():  # type: ignore
parser.add_argument("--max-new-tokens", type=int, default=4000, help="Max tokens to generate")
parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature")
parser.add_argument("--interactive", action="store_true")
parser.add_argument("--fp8", action="store_true")
parser.add_argument(
"--with-mtp",
action="store_true",
help="Enable MTP (Multi-Token Prediction) for speculative decoding",
)
parser.add_argument(
"--use-random-weights",
action="store_true",
help="Use random weights instead of pretrained (for testing MTP without real weights)",
)
return parser.parse_args()


@@ -25,22 +37,31 @@ def parse_args():  # type: ignore
usage:
execute below command under tilert root directory:

# Standard generation with pretrained weights:
python python/generate.py --model-weights-dir "xxxx" 2>&1 | tee test.log

# MTP generation with random weights (for testing):
python python/generate.py --model-weights-dir "xxxx" --with-mtp \
--use-random-weights 2>&1 | tee test.log

# MTP generation with pretrained weights (when available):
python python/generate.py --model-weights-dir "xxxx" --with-mtp 2>&1 | tee test.log
"""
args = parse_args()

generator: ShowHandsGenerator = ShowHandsGenerator(
max_new_tokens=args.max_new_tokens,
temperature=args.temperature,
model_weights_dir=args.model_weights_dir,
enable_fp8_ops=args.fp8,
with_mtp=args.with_mtp,
)

# uncomment to use random weights
# generator.init_random_weights()

# use pretrained weights
generator.from_pretrained()
if args.use_random_weights:
print("Initializing with random weights...")
generator.init_random_weights()
else:
print("Loading pretrained weights...")
generator.from_pretrained()

# simple memoryless interactive mode
if args.interactive:
@@ -53,14 +74,70 @@ def parse_args():  # type: ignore
else:
# This prompt is to test the model’s ability to follow instructions
# (in terms of quantity, type, and length) while keeping it fun.
print("==== Performance ====")
prompt = "Tell me 10 jokes, keep them all under 100 words."

print("Prompt:", prompt)
print("Completion:")
completion: str = generator.generate(prompt) # type: ignore[has-type]
all_times = []
all_accepted = []
for _iter in range(20):
if _iter % 5 == 0:
print(f"Executing iter {_iter}...")
results, time_list, accepted_counts = cast(
tuple[str, list[float], list[int]],
generator.generate(prompt, False), # type: ignore[has-type]
)
all_times.append(time_list)
all_accepted.append(accepted_counts)

if args.with_mtp:
for token_num in range(100, 300, 100):
times_to_token_num = []
for time_list, accepted_list in zip(all_times, all_accepted):
if len(time_list) > 5 and len(accepted_list) > 5:
times = time_list[5:]
accepted = accepted_list[5:]
cumsum_tokens = np.cumsum(accepted)
cumsum_times = np.cumsum(times)
# Find index where we reach token_num tokens
idx = np.searchsorted(cumsum_tokens, token_num)
if idx < len(cumsum_times):
times_to_token_num.append(cumsum_times[idx])
if times_to_token_num:
mean_total_time = np.mean(times_to_token_num)
mean_time = mean_total_time / token_num
speed = 1 / mean_time
out_str = (
f"**Perf@{token_num}: {speed:.3f} tokens/s & "
f"{(mean_time * 1000):.3f} ms**"
)
print(out_str)

# Print accepted tokens statistics
flat_accepted = [a for accepted_list in all_accepted for a in accepted_list]
if flat_accepted:
avg_accepted = sum(flat_accepted) / len(flat_accepted)
min_accepted = min(flat_accepted)
max_accepted = max(flat_accepted)
print(
f"**Accepted length: mean={avg_accepted:.2f}, "
f"min={min_accepted}, max={max_accepted}**"
)
else:
# Per-iteration time_list lengths may differ (e.g. if EOS is hit early),
# so compute per-run metrics first, then aggregate across runs.
for token_num in range(100, 300, 100):
per_run_means = []
for time_list in all_times:
# Require enough tokens to compute stats from token 5 up to token_num
if len(time_list) >= token_num:
slice_times = time_list[5:token_num]
if slice_times:
per_run_means.append(float(np.mean(slice_times)))
if per_run_means:
mean_time = float(np.mean(per_run_means))
speed = 1 / mean_time
out_str = (
f"**Perf@{token_num}: {speed:.3f} tokens/s & {(mean_time * 1000):.3f} ms**"
)
print(out_str)
print(results)

# This prompt is used to test long sequence generation
prompt = "Hi, can you tell me a very long story, with roughly 3000 words?"
print("Prompt:", prompt)
print("Completion:")
completion = generator.generate(prompt) # type: ignore[has-type]
completion, _, _ = generator.generate(prompt) # type: ignore[has-type]

print("Cleaning up...")
generator.cleanup()
8 changes: 5 additions & 3 deletions python/models/base.py
@@ -9,6 +9,7 @@

from tilert import logger
from tilert.models.deepseek_config import get_rank, get_world_size
from tilert.models.deepseek_v3_2.params import BaseParams
from tilert.models.preprocess import WeightLoader
from tilert.utils import get_profile_log_tensor

@@ -52,9 +53,10 @@ def __init__(

self.flag_enable_tilert = False

if compute_kernel_type not in ["bf16", "fp8"]:
if compute_kernel_type not in ["bf16", "fp8", "fp8mma"]:
raise ValueError(
f"Invalid compute kernel type: {compute_kernel_type}, must be one of bf16, fp8."
f"Invalid compute kernel type: {compute_kernel_type}, "
"must be one of bf16, fp8, fp8mma."
)
self.compute_kernel_type = compute_kernel_type

@@ -215,7 +217,7 @@ def tilert_forward(self, *args: Any, **kwargs: Any) -> Any:  # noqa: U100
raise NotImplementedError("Tilert forward not implemented")

@abstractmethod
def to_tilert_weights(self, *args: Any, **kwargs: Any) -> None:
def to_tilert_weights(self, *args: Any, **kwargs: Any) -> BaseParams | None:
"""Convert weights to tilert.

Args:
1 change: 1 addition & 0 deletions python/models/deepseek_v3_2/__init__.py
@@ -0,0 +1 @@
"""DeepSeek v3.2 model package."""