MT-TransformerEngine
Introduction

MT-TransformerEngine is a high-performance deep learning framework developed by the Moore Threads AI-Infra Team. Built upon TransformerEngine and torch_musa, MT-TransformerEngine delivers optimized support for FP8 training on Moore Threads GPUs. When integrated with MT-Megatron, MT-TransformerEngine enables:

  • FP8 training recipes on Moore Threads GPUs, including the same block-scaling FP8 strategy used by DeepSeek-V3, implemented via MTFP8BlockScalingRecipeState in transformer_engine/musa/pytorch/fp8.py (see the sketch after this list).
  • Scalable large-model training across clusters of thousands of GPUs. For a detailed introduction to large-model training, refer to MT-Megatron.
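
For orientation, the short sketch below only checks that the block-scaling recipe state referenced in the first bullet is importable. The dotted module path is inferred from the file path transformer_engine/musa/pytorch/fp8.py and is an assumption, as is whether the class is intended for direct use by end users.

# Hedged sketch: confirm that the MUSA FP8 module and the block-scaling
# recipe state mentioned above are present in the installed package.
# The dotted module path is an assumption inferred from the file path.
import torch
import torch_musa
from transformer_engine.musa.pytorch import fp8 as mt_fp8

# MTFP8BlockScalingRecipeState implements the DeepSeek-V3-style
# block-scaling FP8 strategy; inspect the module to see what it exposes.
print(mt_fp8.MTFP8BlockScalingRecipeState)
print([name for name in dir(mt_fp8) if "Recipe" in name])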

Installation

Install MT-TransformerEngine via the provided installation script.

bash install.sh

The script compiles the MUSA kernels and C++ sources under transformer_engine/musa/common and transformer_engine/musa/pytorch/csrc.
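
After the script finishes, a quick import check can confirm that the package loads. This is a minimal sketch; it assumes install.sh installed the build into the currently active Python environment.

# Post-install sanity check (assumes the active Python environment is the
# one install.sh built into).
import torch
import torch_musa
import transformer_engine.pytorch as te

print(torch.__version__)
print(te.__file__)  # location of the installed Transformer Engine package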

MUSA Example

To run CUDA-style training code on Moore Threads GPUs:

  1. Import torch and torch_musa.
  2. Replace "cuda" device strings with "musa".
import torch
import torch_musa
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Set dimensions.
in_features = 768
out_features = 3072
hidden_size = 2048

# Initialize model and inputs.
model = te.Linear(in_features, out_features, bias=True)
inp = torch.randn(hidden_size, in_features, device="musa")

# Create an FP8 recipe. Note: All input args are optional.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

# Enable autocasting for the forward pass
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)

# Compute a scalar loss and backpropagate; the backward pass runs
# outside the fp8_autocast context.
loss = out.sum()
loss.backward()
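
As a concrete illustration of step 2 above, the following sketch selects the device at runtime. It assumes torch_musa registers a torch.musa namespace analogous to torch.cuda (including torch.musa.is_available()); verify this against the installed torch_musa version.

# Sketch of porting existing CUDA code by swapping the device string.
# Assumes torch_musa exposes torch.musa.is_available() like torch.cuda.
import torch
import torch_musa

device = "musa" if torch.musa.is_available() else "cpu"

x = torch.randn(16, 768, device=device)
w = torch.randn(768, 768, device=device)
y = x @ w  # runs on the Moore Threads GPU when device == "musa"
print(y.device)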

Features

Feature                  Availability
per-tensor fp8           Supported
per-block fp8            Supported
tp overlap (with fp8)    Supported
moe recompute            Supported
zero bubble              Supported
fp8 alltoall             Coming Soon

Community

Issue Reporting

If you encounter any problems when training large models with MT-TE, please open an issue.

Contributions

Contributions of code and documentation in any form are welcome!
