5 changes: 2 additions & 3 deletions .gitignore
@@ -31,7 +31,7 @@ __pycache__/

# Distribution / packaging
.Python
build/
docs/build/
develop-eggs/
dist/
downloads/
@@ -68,5 +68,4 @@ pip-delete-this-directory.txt
*.pyc
*.json
*.jsonl
*_ignore.py
.idea
.idea
51 changes: 26 additions & 25 deletions README.md
@@ -1,6 +1,6 @@


<h1 align="center"> Linghe </h1>
<h1 align="center"> linghe </h1>

<div style="text-align: center;">
<img src="docs/linghe.png" alt="Logo" width="200">
@@ -20,42 +20,43 @@

## *News or Update* 🔥
---
- [2025/07] We implement multiple kernels for fp8 training with `Megatron-LM` blockwise quantization.
- [2025/07] We implement multiple kernels for FP8 training with `Megatron-LM` blockwise quantization.


## Introduction
---
Our repo, FLOPS, is designed for LLM training, especially for MoE training with fp8 quantization. It provides 3 main categories of kernels:
Our repo, linghe, is designed for LLM training, especially for MoE training with FP8 quantization. It provides 3 main categories of kernels:

- **Fused quantization kernels**: fuse quantization with the preceding layer, e.g., RMSNorm and SiLU.
- **Memory-friendly kernels**: use dtype casts inside kernels instead of casting outside them, e.g., softmax cross entropy and MoE router gemm.
- **Other fused kernels**: fuse multiple IO-intensive operations, e.g., RoPE with qk-norm and transpose, permute and padding, group RMSNorm with sigmoid gate.
- **Memory-efficient kernels**: fuse multiple IO-intensive operations, e.g., RoPE with qk-norm.
- **Implementation-optimized kernels**: use efficient Triton implementations, e.g., routing-map padding instead of activation padding.


## Benchmark
---
We benchmark on an H800 with batch size 8192, hidden size 2048, 256 experts, and 8 activated experts.

| Kernel | Baseline (us) | linghe (us) | Speedup |
|--------|---------------|-------------|---------|
| RMSNorm+quantization (forward) | 159.3 | 72.4 | 2.20 |
| Split+qk-norm+RoPE+transpose (forward) | 472.0 | 59.1 | 7.99 |
| Split+qk-norm+RoPE+transpose (backward) | 645.0 | 107.5 | 6.00 |
| FP32 router gemm (forward) | 242.3 | 61.6 | 3.93 |
| FP32 router gemm (backward) | 232.7 | 78.1 | 2.98 |
| Permute with padded indices | 388.0 | 229.4 | 1.69 |
| Unpermute with padded indices | 988.6 | 806.9 | 1.23 |
| Batch SiLU+quantization (forward) | 6241.7 | 1181.7 | 5.28 |
| Batch SiLU+quantization (backward) | 7147.7 | 2317.9 | 3.08 |
| SiLU+quantization (forward) | 144.9 | 58.2 | 2.48 |
| SiLU+quantization (backward) | 163.4 | 74.2 | 2.20 |
| Fused linear gate (forward) | 160.4 | 46.9 | 3.42 |
| Fused linear gate (backward) | 572.9 | 81.1 | 7.06 |
| Cross entropy (forward) | 2780.8 | 818.2 | 3.40 |
| Cross entropy (backward) | 7086.3 | 1781.0 | 3.98 |
| Batch grad norm | 1733.7 | 1413.7 | 1.23 |
| Batch count zero | 4997.9 | 746.8 | 6.69 |

Other benchmark results can be obtained by running the scripts in the `tests` and `benchmark` folders.
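The numbers above follow the usual warmup-then-measure pattern. A minimal sketch of such a harness (the function name and the CPU stand-in workload are illustrative, not the repo's benchmark code):

```python
import time

def benchmark_us(fn, *args, warmup=10, iters=100):
    """Return the mean latency of fn(*args) in microseconds.

    For CUDA kernels you would synchronize the device around the timed
    region (e.g. torch.cuda.synchronize()); this sketch times on the
    CPU, which is enough to show the warmup-then-measure structure.
    """
    for _ in range(warmup):  # warm up caches / lazy compilation before timing
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters * 1e6

# Example: time a cheap stand-in workload
latency = benchmark_us(sorted, list(range(1000)))
```

Averaging over many iterations after a warmup phase keeps one-time costs (allocator warmup, kernel compilation) out of the reported latency.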

## Examples
---
@@ -65,4 +66,4 @@ Examples can be found in tests.
## API Reference
---

Please refer to [API doc](docs/api.md)
Please refer to the [API documentation](https://inclusionai.github.io/linghe/)
Binary file added asserts/linghe.png
4 changes: 3 additions & 1 deletion build.sh
@@ -2,4 +2,6 @@ rm -rf build &&
rm -rf dist &&
rm -rf linghe.egg-info &&
python setup.py develop &&
python setup.py bdist_wheel &&
python setup.py bdist_wheel

# pdoc --output-dir docs -d google --no-include-undocumented --no-search --no-show-source linghe
212 changes: 0 additions & 212 deletions docs/api.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/index.html
@@ -0,0 +1,7 @@
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="refresh" content="0; url=./linghe.html"/>
</head>
</html>