Merged
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -36,6 +36,6 @@ jobs:
- name: Install lint dependencies
run: |
python -m pip install --upgrade pip
pip install --no-cache-dir -r requirements-ci.txt
pip install --no-cache-dir -r requirements-dev.txt
- name: Run all linting checks
run: ./scripts/lint.sh
122 changes: 105 additions & 17 deletions README.md
@@ -6,24 +6,37 @@
<a href="https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-1E90FF"></a>
</p>
<p>
<a href="#python-package-installation"><b>Installation</b></a> |
<a href="#getting-started"><b>Getting Started</b></a>
<a href="#overview"><b>Overview</b></a> ·
<a href="#running-the-generation-example"><b>Generation</b></a> ·
<a href="#running-the-generation-example-with-multi-token-prediction-mtp"><b>MTP Generation</b></a> ·
<a href="#installation"><b>Installation</b></a> ·
<a href="#news"><b>News</b></a>
</p>
</div>

## News
______________________________________________________________________

- **\[2025-12-23\]****[v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)** — Achieved ~35% reduction in end-to-end token generation latency on a single node with 8× NVIDIA B200. See our latest benchmarks for detailed measurements.
<a id="news"></a>

- **\[2025-11-20\]** 🚀 **[v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)** — Initial release of TileRT for DeepSeek-V3.2-Exp, designed for **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).
## 📰 News

- :fire: **2026-01-26 · [v0.1.2-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.2-alpha.1)**. **Multi-Token Prediction (MTP) lands in TileRT**. With `mtp=3`, we observe decoding rates up to **590 tokens/s** under synthetic workloads.

- **2025-12-23 · [v0.1.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.1)**. Achieved a further ~**35% reduction** (3-4x speedup over baseline) in end-to-end token generation latency on a single node with **8× NVIDIA B200**.

- 🚀 **2025-11-20 · [v0.1.0-alpha.1](https://github.com/tile-ai/TileRT/releases/tag/v0.1.0-alpha.1)**. Initial public release for **DeepSeek-V3.2-Exp**, targeting **ultra-low-latency** inference. Available on [PyPI](https://pypi.org/project/tilert) and [HuggingFace](https://huggingface.co/Tile-AI/DeepSeek-V3.2-Exp-TileRT).

______________________________________________________________________

<a id="overview"></a>

## TileRT: Pushing LLM Latency to the Limit

TileRT is an experimental project exploring core compiler techniques for serving large language models (LLMs) in **ultra-low-latency** scenarios. Its goal is to push the latency limits of LLMs without compromising model size or quality—for example, enabling models with hundreds of billions of parameters to achieve millisecond-level **time per output token (TPOT)**.
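
For intuition, TPOT converts directly into a decode rate. A quick sketch (illustrative arithmetic only, not TileRT code):

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Decode rate implied by a given time per output token (batch size 1)."""
    return 1000.0 / tpot_ms

# Millisecond-level TPOT corresponds to hundreds of tokens per second:
print(tokens_per_second(2.0))  # prints 500.0
```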

<p align="center">
<img src="assets/generate.gif" alt="TileRT Benchmark"><br>
Figure 1. Sequence generation with TileRT.
Figure 1. Sequence generation with TileRT, now enhanced with Multi-Token Prediction (MTP) to accelerate inference.
</p>

We evaluated TileRT’s preliminary performance using the [**DeepSeek-V3.2-Exp**](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model (without lossy optimizations such as quantization or distillation) with a batch size of 1 on 8× NVIDIA B200 GPUs. As shown in the benchmark below, TileRT demonstrates substantial improvements over existing inference systems.
@@ -39,6 +52,8 @@ To achieve this, TileRT introduces a **tile-level runtime engine**. Leveraging a

The project is actively evolving, and the underlying compiler techniques will be gradually shared with the community as they are integrated into **TileLang** and **TileScale**.

______________________________________________________________________

## Installation

- [Prerequisites](#prerequisites)
@@ -145,39 +160,112 @@ docker run --gpus all -it \
tilert:v0.1.0
```

Once inside the container, you can run the following Python script:
Once inside the container, run the following Python script to perform text generation:

```python
from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator

generator: ShowHandsGenerator = ShowHandsGenerator(
max_new_tokens=1000,
model_weights_dir=MODEL_WEIGHTS_DIR,
with_mtp=False, # Disable MTP
)
generator.from_pretrained()

prompt = """Tell me three jokes:
1. A dad joke,
2. A programmer joke,
3. A joke that only makes sense if you've ever tried to train a large language model.
Keep each joke under 15 words.
"""
prompt = (
"Tell me three jokes:\n\n"
"1. A dad joke,\n"
"2. A programmer joke,\n"
"3. A joke that only makes sense if you've ever tried "
"to train a large language model.\n"
"Keep each joke under 15 words."
)

print("Prompt:", prompt)
print("Completion:")
completion: generator.generate(prompt)
completion, _, _ = generator.generate(prompt)
```

For instance, using the above prompt, TileRT might generate:
For example, TileRT may generate:

<details>
<summary><b>Sample output (click to expand)</b></summary>

```text
1. I'm afraid for the calendar. Its days are numbered.
2. There are only 10 kinds of people: those who understand binary and those who don't.
3. My model's loss is low, but its answers are still nonsense. Overfitting.
```

This example gives you a quick idea of the type of output you can expect from the precompiled model.
</details>

This example demonstrates basic single-step autoregressive generation using the precompiled model.

### Running the Generation Example with Multi-Token Prediction (MTP)

> \[!IMPORTANT\]
> **Weights update required for MTP.** Multi-Token Prediction (MTP) introduces additional **MTP heads** in the model weights.
> If you were using TileRT **before v0.1.1**, please make sure you download the **latest weights** from Hugging Face.
> Older weights do not include the required MTP heads and will fail to run when MTP is enabled.

TileRT also supports Multi-Token Prediction (MTP), which allows the model to generate multiple tokens per forward pass, reducing sequential decoding depth.
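
To make the idea concrete, here is a toy sketch of multi-token decoding (illustrative only; not TileRT internals, and the acceptance model is invented for the example):

```python
import random

def mtp_decode_steps(total_tokens: int, mtp: int = 3, seed: int = 0) -> int:
    """Toy model of MTP decoding: each forward pass emits the base token
    plus 0..mtp accepted draft tokens, so fewer sequential steps are
    needed to produce the same number of tokens."""
    rng = random.Random(seed)
    produced, steps = 0, 0
    while produced < total_tokens:
        accepted = 1 + rng.randint(0, mtp)  # base token + accepted drafts
        produced += accepted
        steps += 1
    return steps

print(mtp_decode_steps(100, mtp=3))  # far fewer than 100 sequential steps
print(mtp_decode_steps(100, mtp=0))  # degenerates to standard decoding: 100
```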

To better illustrate MTP behavior, we use a longer prompt that encourages extended generation:

```python
from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator

generator: ShowHandsGenerator = ShowHandsGenerator(
max_new_tokens=1000,
model_weights_dir=MODEL_WEIGHTS_DIR,
with_mtp=True, # Enable MTP
)
generator.from_pretrained()
prompt = "Tell me 10 jokes, keep them all under 100 words."

print("Prompt:", prompt)
print("Completion:")
completion, _, _ = generator.generate(prompt)
print(completion)
```

When MTP is enabled, TileRT may report statistics similar to the following during generation:

```text
Accepted length: mean=2.77, min=1, max=4
```

This indicates that, on average, multiple tokens are accepted per decoding step under MTP.
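
As a rough back-of-envelope check (assuming per-step latency is roughly unchanged under MTP, which the extra head work makes optimistic), the mean accepted length approximates the speedup over one-token-per-step decoding:

```python
def mtp_speedup_estimate(mean_accepted: float) -> float:
    """If each decoding step costs about the same wall time, emitting
    `mean_accepted` tokens per step cuts sequential steps by that factor."""
    return mean_accepted

baseline_tokens_per_s = 200.0  # hypothetical non-MTP decode rate
print(baseline_tokens_per_s * mtp_speedup_estimate(2.77))  # prints 554.0
```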

<details>
<summary><b>Sample output (click to expand)</b></summary>

```text
Of course! Here are 10 short jokes for you.
1. I told my wife she was drawing her eyebrows too high. She looked surprised.
2. I invented a new word: Plagiarism.
3. Why don't scientists trust atoms? Because they make up everything.
4. I'm reading a book on anti-gravity. It's impossible to put down.
5. What's the best thing about Switzerland? I don't know, but the flag is a big plus.
6. I told my computer I needed a break, and now it won't stop sending me vacation ads.
7. Why did the scarecrow win an award? He was outstanding in his field.
8. What do you call a fake noodle? An impasta.
9. I told my suitcase there's no vacation, and now it has a lot of baggage.
10. Why don't skeletons fight each other? They don't have the guts.
```

</details>

This example highlights how MTP enables TileRT to efficiently generate longer outputs by accepting multiple tokens per decoding step.

For more details, please refer to the [generation script](https://github.com/tile-ai/TileRT/blob/main/python/generate.py).

Binary file modified assets/generate.gif
3 changes: 2 additions & 1 deletion python/__init__.py
@@ -40,7 +40,8 @@ def _load_library(filename: str) -> Any:
lib_path = Path(__file__).parent / filename

try:
return ctypes.CDLL(str(lib_path))
torch.ops.load_library(str(lib_path))
return lib_path
except Exception as e:
> **Copilot AI (Jan 26, 2026), on lines 42 to 45:** `_load_library` now uses `torch.ops.load_library(...)` and returns `lib_path`, but the docstring says it returns "the loaded library". Since the return value is unused (the module-level call does not capture it), consider returning `None` and updating the docstring and annotation accordingly, or returning a meaningful handle consistently.
raise RuntimeError(f"Failed to load library from {lib_path}") from e

99 changes: 88 additions & 11 deletions python/generate.py
@@ -1,6 +1,9 @@
"""Text generation script for TileRT."""

from argparse import ArgumentParser
from typing import cast

import numpy as np

from tilert.models.deepseek_v3_2.dsa_show_hands import ShowHandsGenerator

@@ -16,7 +19,16 @@ def parse_args():  # type: ignore
parser.add_argument("--max-new-tokens", type=int, default=4000, help="Max tokens to generate")
parser.add_argument("--temperature", type=float, default=0.0, help="Sampling temperature")
parser.add_argument("--interactive", action="store_true")
parser.add_argument("--fp8", action="store_true")
parser.add_argument(
"--with-mtp",
action="store_true",
help="Enable MTP (Multi-Token Prediction) for speculative decoding",
)
parser.add_argument(
"--use-random-weights",
action="store_true",
help="Use random weights instead of pretrained (for testing MTP without real weights)",
)
return parser.parse_args()


@@ -25,22 +37,31 @@ def parse_args():  # type: ignore
usage:
execute below command under tilert root directory:

# Standard generation with pretrained weights:
python python/generate.py --model-weights-dir "xxxx" 2>&1 | tee test.log

# MTP generation with random weights (for testing):
python python/generate.py --model-weights-dir "xxxx" --with-mtp \
--use-random-weights 2>&1 | tee test.log

# MTP generation with pretrained weights (when available):
python python/generate.py --model-weights-dir "xxxx" --with-mtp 2>&1 | tee test.log
"""
args = parse_args()

generator: ShowHandsGenerator = ShowHandsGenerator(
max_new_tokens=args.max_new_tokens,
temperature=args.temperature,
model_weights_dir=args.model_weights_dir,
enable_fp8_ops=args.fp8,
with_mtp=args.with_mtp,
)

# uncomment to use random weights
# generator.init_random_weights()

# use pretrained weights
generator.from_pretrained()
if args.use_random_weights:
print("Initializing with random weights...")
generator.init_random_weights()
else:
print("Loading pretrained weights...")
generator.from_pretrained()

# simple memoryless interactive mode
if args.interactive:
@@ -53,14 +74,70 @@ def parse_args():  # type: ignore
else:
# This prompt is to test the model’s ability to follow instructions
# (in terms of quantity, type, and length) while keeping it fun.
print("==== Performance ====")
prompt = "Tell me 10 jokes, keep them all under 100 words."

print("Prompt:", prompt)
print("Completion:")
completion: str = generator.generate(prompt) # type: ignore[has-type]
all_times = []
all_accepted = []
for _iter in range(20):
if _iter % 5 == 0:
print(f"Executing iter {_iter}...")
results, time_list, accepted_counts = cast(
tuple[str, list[float], list[int]],
generator.generate(prompt, False), # type: ignore[has-type]
)
all_times.append(time_list)
all_accepted.append(accepted_counts)

if args.with_mtp:
for token_num in range(100, 300, 100):
times_to_token_num = []
for time_list, accepted_list in zip(all_times, all_accepted):
if len(time_list) > 5 and len(accepted_list) > 5:
times = time_list[5:]
accepted = accepted_list[5:]
cumsum_tokens = np.cumsum(accepted)
cumsum_times = np.cumsum(times)
# Find index where we reach token_num tokens
idx = np.searchsorted(cumsum_tokens, token_num)
if idx < len(cumsum_times):
times_to_token_num.append(cumsum_times[idx])
if times_to_token_num:
mean_total_time = np.mean(times_to_token_num)
mean_time = mean_total_time / token_num
speed = 1 / mean_time
out_str = (
f"**Perf@{token_num}: {speed:.3f} tokens/s & "
f"{(mean_time * 1000):.3f} ms**"
)
print(out_str)

# Print accepted tokens statistics
flat_accepted = [a for accepted_list in all_accepted for a in accepted_list]
if flat_accepted:
avg_accepted = sum(flat_accepted) / len(flat_accepted)
min_accepted = min(flat_accepted)
max_accepted = max(flat_accepted)
print(
f"**Accepted length: mean={avg_accepted:.2f}, "
f"min={min_accepted}, max={max_accepted}**"
)
else:
# Per-iteration time_list lengths may differ (e.g. if EOS is hit early),
# so compute per-run metrics first, then aggregate across runs.
for token_num in range(100, 300, 100):
per_run_means = []
for time_list in all_times:
# Require enough tokens to compute stats from token 5 up to token_num
if len(time_list) >= token_num:
slice_times = time_list[5:token_num]
if slice_times:
per_run_means.append(float(np.mean(slice_times)))
if per_run_means:
mean_time = float(np.mean(per_run_means))
speed = 1 / mean_time
out_str = (
f"**Perf@{token_num}: {speed:.3f} tokens/s & {(mean_time * 1000):.3f} ms**"
)
print(out_str)
print(results)

# This prompt is used to test long sequence generation
prompt = "Hi, can you tell me a very long story, with roughly 3000 words?"
print("Prompt:", prompt)
print("Completion:")
completion = generator.generate(prompt) # type: ignore[has-type]
completion, _, _ = generator.generate(prompt) # type: ignore[has-type]

print("Cleaning up...")
generator.cleanup()
8 changes: 5 additions & 3 deletions python/models/base.py
@@ -9,6 +9,7 @@

from tilert import logger
from tilert.models.deepseek_config import get_rank, get_world_size
from tilert.models.deepseek_v3_2.params import BaseParams
from tilert.models.preprocess import WeightLoader
from tilert.utils import get_profile_log_tensor

@@ -52,9 +53,10 @@ def __init__(

self.flag_enable_tilert = False

if compute_kernel_type not in ["bf16", "fp8"]:
if compute_kernel_type not in ["bf16", "fp8", "fp8mma"]:
raise ValueError(
f"Invalid compute kernel type: {compute_kernel_type}, must be one of bf16, fp8."
f"Invalid compute kernel type: {compute_kernel_type}, "
"must be one of bf16, fp8, fp8mma."
)
self.compute_kernel_type = compute_kernel_type

@@ -215,7 +217,7 @@ def tilert_forward(self, *args: Any, **kwargs: Any) -> Any:  # noqa: U100
raise NotImplementedError("Tilert forward not implemented")

@abstractmethod
def to_tilert_weights(self, *args: Any, **kwargs: Any) -> None:
def to_tilert_weights(self, *args: Any, **kwargs: Any) -> BaseParams | None:
"""Convert weights to tilert.

Args:
1 change: 1 addition & 0 deletions python/models/deepseek_v3_2/__init__.py
@@ -0,0 +1 @@
"""DeepSeek v3.2 model package."""