5 changes: 2 additions & 3 deletions .gitignore
@@ -31,7 +31,7 @@ __pycache__/

# Distribution / packaging
.Python
build/
docs/build/
develop-eggs/
dist/
downloads/
@@ -68,5 +68,4 @@ pip-delete-this-directory.txt
*.pyc
*.json
*.jsonl
*_ignore.py
.idea
.idea
51 changes: 26 additions & 25 deletions README.md
@@ -1,6 +1,6 @@


<h1 align="center"> Linghe </h1>
<h1 align="center"> linghe </h1>

<div style="text-align: center;">
<img src="docs/linghe.png" alt="Logo" width="200">
@@ -20,42 +20,43 @@

## *News or Update* 🔥
---
- [2025/07] We implement multiple kernels for fp8 training with `Megatron-LM` blockwise quantization.
- [2025/07] We implement multiple kernels for FP8 training with `Megatron-LM` blockwise quantization.


## Introduction
---
Our repo, FLOPS, is designed for LLM training, especially for MoE training with fp8 quantization. It provides 3 main categories of kernels:
Our repo, linghe, is designed for LLM training, especially for MoE training with FP8 quantization. It provides 3 main categories of kernels:

- **Fused quantization kernels**: fuse quantization with the preceding layer, e.g., RMSNorm and SiLU.
- **Memory-friendly kernels**: use dtype casts inside kernels instead of casting outside them, e.g., softmax cross entropy and MoE router gemm.
- **Other fused kernels**: fuse multiple IO-intensive operations, e.g., RoPE with qk-norm and transpose, permute and padding, group RMSNorm with sigmoid gate.
- **Memory-efficient kernels**: fuse multiple IO-intensive operations, e.g., RoPE with qk-norm.
- **Implementation-optimized kernels**: use efficient Triton implementations, e.g., routing-map padding instead of activation padding.


## Benchmark
---
We benchmark on an H800 with batch size 8192, hidden size 2048, 256 experts, and 8 activated experts.

| Kernel | Baseline (us) | linghe (us) | Speedup |
|--------|---------------|-------------|---------|
| RMSNorm+quantization (forward) | 159.3 | 72.4 | 2.20 |
| Split+qk-norm+RoPE+transpose (forward) | 472.0 | 59.1 | 7.99 |
| Split+qk-norm+RoPE+transpose (backward) | 645.0 | 107.5 | 6.00 |
| FP32 router gemm (forward) | 242.3 | 61.6 | 3.93 |
| FP32 router gemm (backward) | 232.7 | 78.1 | 2.98 |
| Permute with padded indices | 388.0 | 229.4 | 1.69 |
| Unpermute with padded indices | 988.6 | 806.9 | 1.23 |
| Batch SiLU+quantization (forward) | 6241.7 | 1181.7 | 5.28 |
| Batch SiLU+quantization (backward) | 7147.7 | 2317.9 | 3.08 |
| SiLU+quantization (forward) | 144.9 | 58.2 | 2.48 |
| SiLU+quantization (backward) | 163.4 | 74.2 | 2.20 |
| Fused linear gate (forward) | 160.4 | 46.9 | 3.42 |
| Fused linear gate (backward) | 572.9 | 81.1 | 7.06 |
| Cross entropy (forward) | 2780.8 | 818.2 | 3.40 |
| Cross entropy (backward) | 7086.3 | 1781.0 | 3.98 |
| Batch grad norm | 1733.7 | 1413.7 | 1.23 |
| Batch count zero | 4997.9 | 746.8 | 6.69 |

Other benchmark results can be obtained by running the scripts in the `tests` and `benchmark` folders.
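The numbers above follow the usual warmup-then-measure pattern. A minimal sketch of such a harness (the function name and the CPU stand-in workload are illustrative, not the repo's benchmark code):

```python
import time

def benchmark_us(fn, *args, warmup=10, iters=100):
    """Return the mean latency of fn(*args) in microseconds.

    For CUDA kernels you would synchronize the device around the timed
    region (e.g. torch.cuda.synchronize()); this sketch times on the
    CPU, which is enough to show the warmup-then-measure structure.
    """
    for _ in range(warmup):  # warm up caches / lazy compilation before timing
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters * 1e6

# Example: time a cheap stand-in workload
latency = benchmark_us(sorted, list(range(1000)))
```

Averaging over many iterations after a warmup phase keeps one-time costs (allocator warmup, kernel compilation) out of the reported latency.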

## Examples
---
@@ -65,4 +66,4 @@ Examples can be found in tests.
## API Reference
---

Please refer to [API doc](docs/api.md)
Please refer to the [API documentation](https://inclusionai.github.io/linghe/)
Binary file added asserts/linghe.png
4 changes: 3 additions & 1 deletion build.sh
@@ -2,4 +2,6 @@ rm -rf build &&
rm -rf dist &&
rm -rf linghe.egg-info &&
python setup.py develop &&
python setup.py bdist_wheel &&
python setup.py bdist_wheel

# pdoc --output-dir docs -d google --no-include-undocumented --no-search --no-show-source linghe
212 changes: 0 additions & 212 deletions docs/api.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/index.html
@@ -0,0 +1,7 @@
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="refresh" content="0; url=./linghe.html"/>
</head>
</html>