From 7f5c2f4990f4f645453cd40b0afcd854c1c7b07d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=8D=97=E9=9C=84?= Date: Wed, 15 Oct 2025 16:46:00 +0800 Subject: [PATCH 1/7] add pdoc --- .gitignore | 5 +- README.md | 51 +-- {docs => asserts}/api.md | 44 +-- {docs => asserts}/linghe.png | Bin docs/api/index.html | 7 + docs/api/linghe.html | 52 +++ docs/api/linghe/facade.html | 59 ++++ docs/api/linghe/facade/add.html | 209 ++++++++++++ docs/api/linghe/facade/fp32_linear.html | 209 ++++++++++++ docs/api/linghe/facade/loss.html | 362 +++++++++++++++++++++ docs/api/linghe/facade/norm.html | 362 +++++++++++++++++++++ docs/api/linghe/facade/rope.html | 209 ++++++++++++ docs/api/linghe/facade/transpose.html | 209 ++++++++++++ docs/api/linghe/gemm.html | 54 +++ docs/api/linghe/gemm/fp32_gemm.html | 56 ++++ docs/api/linghe/quant.html | 55 ++++ docs/api/linghe/quant/block.html | 55 ++++ docs/api/linghe/quant/block/block.html | 56 ++++ docs/api/linghe/quant/block/group.html | 56 ++++ docs/api/linghe/quant/channel.html | 54 +++ docs/api/linghe/quant/channel/channel.html | 56 ++++ docs/api/linghe/utils.html | 64 ++++ docs/api/linghe/utils/add.html | 84 +++++ docs/api/linghe/utils/dot.html | 56 ++++ docs/api/linghe/utils/gather.html | 56 ++++ docs/api/linghe/utils/loss.html | 56 ++++ docs/api/linghe/utils/norm.html | 128 ++++++++ docs/api/linghe/utils/rearange.html | 56 ++++ docs/api/linghe/utils/reduce.html | 56 ++++ docs/api/linghe/utils/rope.html | 56 ++++ docs/api/linghe/utils/scatter.html | 56 ++++ docs/api/linghe/utils/silu.html | 56 ++++ docs/api/linghe/utils/transpose.html | 56 ++++ docs/api/search.js | 46 +++ linghe/__init__.py | 1 + linghe/utils/add.py | 73 +---- linghe/utils/norm.py | 49 ++- setup.py | 2 +- 38 files changed, 3049 insertions(+), 122 deletions(-) rename {docs => asserts}/api.md (85%) rename {docs => asserts}/linghe.png (100%) create mode 100644 docs/api/index.html create mode 100644 docs/api/linghe.html create mode 100644 docs/api/linghe/facade.html create mode 100644 docs/api/linghe/facade/add.html create mode 100644 docs/api/linghe/facade/fp32_linear.html create mode 100644 docs/api/linghe/facade/loss.html create mode 100644 docs/api/linghe/facade/norm.html create mode 100644 docs/api/linghe/facade/rope.html create mode 100644 docs/api/linghe/facade/transpose.html create mode 100644 docs/api/linghe/gemm.html create mode 100644 docs/api/linghe/gemm/fp32_gemm.html create mode 100644 docs/api/linghe/quant.html create mode 100644 docs/api/linghe/quant/block.html create mode 100644 docs/api/linghe/quant/block/block.html create mode 100644 docs/api/linghe/quant/block/group.html create mode 100644 docs/api/linghe/quant/channel.html create mode 100644 docs/api/linghe/quant/channel/channel.html create mode 100644 docs/api/linghe/utils.html create mode 100644 docs/api/linghe/utils/add.html create mode 100644 docs/api/linghe/utils/dot.html create mode 100644 docs/api/linghe/utils/gather.html create mode 100644 docs/api/linghe/utils/loss.html create mode 100644 docs/api/linghe/utils/norm.html create mode 100644 docs/api/linghe/utils/rearange.html create mode 100644 docs/api/linghe/utils/reduce.html create mode 100644 docs/api/linghe/utils/rope.html create mode 100644 docs/api/linghe/utils/scatter.html create mode 100644 docs/api/linghe/utils/silu.html create mode 100644 docs/api/linghe/utils/transpose.html create mode 100644 docs/api/search.js diff --git a/.gitignore b/.gitignore index aaa3d00..2f5fb50 100644 --- a/.gitignore +++ b/.gitignore @@ -31,7 +31,7 @@ __pycache__/ # Distribution / 
packaging
 .Python
-build/
+docs/build/
 develop-eggs/
 dist/
 downloads/
@@ -68,5 +68,4 @@ pip-delete-this-directory.txt
 *.pyc
 *.json
 *.jsonl
-*_ignore.py
-.idea
\ No newline at end of file
+.idea
diff --git a/README.md b/README.md
index e7aedcd..3ad17b6 100644
--- a/README.md
+++ b/README.md
@@ -20,42 +20,43 @@
 ## *News or Update* 🔥
 ---
-- [2025/07] We implement multiple kernels for fp8 training with `Megatron-LM` blockwise quantization.
+- [2025/07] We implement multiple kernels for FP8 training with `Megatron-LM` blockwise quantization.
 ## Introduction
 ---
-Our repo, FLOPS, is designed for LLM training, especially for MoE training with fp8 quantizaiton. It provides 3 main categories of kernels:
+Our repo, Linghe, is designed for LLM training, especially for MoE training with FP8 quantization. It provides 3 main categories of kernels:
 - **Fused quantization kernels**: fuse quantization with previous layer, e.g., RMS norm and Silu.
-- **Memory-friendly kernels**: use dtype cast in kernels instead of casting out kernels, e.g., softmax cross entropy and moe router gemm.
-- **Other fused kernels**: fuse multiple IO-itensive operations, e.g., ROPE with qk-norm and transpose, permute and padding, group RMS norm with sigmoid gate.
+- **Memory-efficient kernels**: fuse multiple IO-intensive operations, e.g., ROPE with qk-norm.
+- **Implementation-optimized kernels**: use efficient Triton implementations, e.g., routing-map padding instead of activation padding.
 ## Benchmark
 ---
 We benchmark on H800 with batch size 8192, hidden size 2048, num experts 256, activation experts 8.
-| kernel | baseline(us) | linghe(us) | speedup |
-|--------|--------------|-----------|---------|
-| RMSNorm+Quantization(forward) | 159.3 us | 72.4 us | 2.2 |
-| Split+qk-norm+rope+transpose(forward) | 472 us | 59.1 us | 7.99 |
-| Split+qk-norm+rope+transpose(backward) | 645 us | 107.5 us | 6.0 |
-| Fp32 router gemm(forward) | 242.3 us | 61.6 us | 3.931 |
-| Fp32 router gemm(backward) | 232.7 us | 78.1 us | 2.979 |
-| Permute with padded indices | 388 us | 229.4 us | 1.69 |
-| Unpermute with padding indices | 988.6 us | 806.9 us | 1.23 |
-| Batch Silu+quantization(forward) | 6241.7 us | 1181.7 us | 5.28 |
-| Batch Silu+quantization(backward) | 7147.7 us | 2317.9 us | 3.08 |
-| Silu+quantization(forward) | 144.9 us | 58.2 us | 2.48 |
-| Silu+quantization(backward) | 163.4 us | 74.2 us | 2.2 |
-| fused linear gate(forward) | 160.4 us | 46.9 us | 3.42 |
-| fused linear gate(backward) | 572.9 us | 81.1 us | 7.06 |
-| Cross entropy(forward) | 2780.8 us | 818.2 us | 3.4 |
-| Cross entropy(backward) | 7086.3 us | 1781.0 us | 3.98 |
-| batch grad norm | 1733.7 us | 1413.7 us | 1.23 |
-| Batch count zero | 4997.9 us | 746.8 us | 6.69 |
-
+| kernel | baseline(us) | Linghe(us) | speedup |
+|--------|--------------|------------|---------|
+| RMSNorm+Quantization(forward) | 159.3 us | 72.4 us | 2.2 |
+| Split+qk-norm+rope+transpose(forward) | 472 us | 59.1 us | 7.99 |
+| Split+qk-norm+rope+transpose(backward) | 645 us | 107.5 us | 6.0 |
+| Fp32 router gemm(forward) | 242.3 us | 61.6 us | 3.931 |
+| Fp32 router gemm(backward) | 232.7 us | 78.1 us | 2.979 |
+| Permute with padded indices | 388 us | 229.4 us | 1.69 |
+| Unpermute with padding indices | 988.6 us | 806.9 us | 1.23 |
+| Batch Silu+quantization(forward) | 6241.7 us | 1181.7 us | 5.28 |
+| Batch Silu+quantization(backward) | 7147.7 us | 2317.9 us | 3.08 |
+| Silu+quantization(forward) | 144.9 us | 58.2 us | 2.48 |
+| Silu+quantization(backward) | 163.4 us | 74.2 us | 2.2 |
+| fused linear gate(forward) | 160.4 us | 46.9 us | 3.42 |
+| fused linear gate(backward) | 572.9 us | 81.1 us | 7.06 |
+| Cross entropy(forward) | 2780.8 us | 818.2 us | 3.4 |
+| Cross entropy(backward) | 7086.3 us | 1781.0 us | 3.98 |
+| batch grad norm | 1733.7 us | 1413.7 us | 1.23 |
+| Batch count zero | 4997.9 us | 746.8 us | 6.69 |
+
+Other benchmark results can be obtained by running the scripts in the tests and benchmark folders.
 ## Examples
 ---
@@ -65,4 +66,4 @@ Examples can be found in tests.
 ## Api Reference
 ---
-Please refer to [API doc](docs/api.md)
\ No newline at end of file
+Please refer to [API doc](asserts/api.md)
\ No newline at end of file
diff --git a/docs/api.md b/asserts/api.md
similarity index 85%
rename from docs/api.md
rename to asserts/api.md
index 000bfc2..a6ff272 100644
--- a/docs/api.md
+++ b/asserts/api.md
@@ -133,23 +133,25 @@ Fused op for batched weighted SiLU and block quant.
 ```
 linghe.util.silu.triton_batch_weighted_silu_and_block_quant_backward(g, x, weight, counts, splits:Optional[List]=None, round_scale:Optional[bool]=False)
 ```
+Return the blockwise quantized gradient of the SiLU backward pass.
+The quantized tensor is a tuple of ()
 **Parameters:**
 - g(*torch.Tensor*) - Input gradient tensor.
 - x(*torch.Tensor*) - Input tensor.
-- weight(*torch.Tensor*) - Permuted probs
-- couts(*torch.Tensor*) - Tokens per expert cuda tensor.
-- splits(*List[int]*) - List of tokens per expert. If compute in batch mode should not be None.
-
+- weight(*torch.Tensor*) - Permuted probs.
+- counts(*torch.Tensor*) - Tokens per expert, as a CUDA tensor.
+- splits(*List[int]*) - Tokens per expert, as a list of ints.
+- round_scale(*bool*) - Round the scale to an integer power of 2.
 ---
 **` Class linghe.facade.loss.SoftmaxCrossEntropyFunction `**
-Prallel version of SoftmaxCrossEntropy.
+Fused implementation of softmax cross entropy.
 ```
-forward(logits, labels, inplace:Optional[bool]=False)
+forward(logits, labels, inplace: Optional[bool]=False)
 ```
 Fast impl of softmax cross entropy.
 **Parameters:**
 - logits(*torch.Tensor*) - Input logits.
 - labels(*torch.Tensor*) - Input labels.
-- inplace(*bool*) - Flag save for backward, whether logits ptr should replaced by grads tensor ptr.
+- inplace(*bool*) - Reuse the `logits` tensor as the gradient tensor if inplace=True, otherwise allocate a new tensor.
-```
-backward(grad_output)
```
-**Parameters:**
-- grad_output(*torch.Tensor*) - Gradients tensor.
-
----
 ```
 linghe.util.reduce.triton_batch_sum_with_ord(xs, ord:Optional[int]=2)
 ```
-Square sum the gards of all the experts. All the experts grads are applied simultaneously.
+Return sum(abs(x)**ord), accumulated over all tensors in the list.
 **Parameters:**
-- xs(*List[torch.Tensor]*) - Grads lists.
-- ord(*int*) - Sum type. 1 for abs add and 2 for square add.
+- xs(*List[torch.Tensor]*) - List of tensors.
+- ord(*int*) - Order of the norm: 1 for absolute sum, 2 for squared sum.
 ---
 ```
 linghe.util.reduce.triton_batch_count_zero(xs)
 ```
-Prallel cout zeros in all the given grads lists.
+Count zeros in the given tensors in parallel and return the total number of zeros.
 **Parameters:**
-- xs(*List[torch.Tensor]*) - Grads lists.
+- xs(*List[torch.Tensor]*) - List of tensors.
 ---
 ```
 ```
 Note that the output shape is transposed [S, B, dim]
 **Parameters:**
-- x(*torch.Tensor*) - [B, S, dim] Input tensor.
-- gate(*torch.Tensor*) - [S, B, dim]
-- weight(*torch.Tensor*) - [dim]
-
+- x(*torch.Tensor*) - [B, S, dim], output tensor of the attention kernel.
+- gate(*torch.Tensor*) - [S, B, dim], gate tensor.
+- weight(*torch.Tensor*) - [dim], RMSNorm weight tensor.
+- group_size(*int*) - Group size of the RMSNorm.
 ```
-backward(grad)
-```
-**Parameters:**
-- grad(*torch.Tensor*) - [S, B, dim] Grads of input tensor.
diff --git a/docs/linghe.png b/asserts/linghe.png
similarity index 100%
rename from docs/linghe.png
rename to asserts/linghe.png
diff --git a/docs/api/index.html b/docs/api/index.html
new file mode 100644
index 0000000..514a509
--- /dev/null
+++ b/docs/api/index.html
@@ -0,0 +1,7 @@
+ + + + + + +
diff --git a/docs/api/linghe.html b/docs/api/linghe.html
new file mode 100644
index 0000000..a7cef32
--- /dev/null
+++ b/docs/api/linghe.html
@@ -0,0 +1,52 @@
+ + + + + + + linghe API documentation + + + + + + + + +
+
+

+linghe

+ + + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/facade.html b/docs/api/linghe/facade.html new file mode 100644 index 0000000..ffe1d6e --- /dev/null +++ b/docs/api/linghe/facade.html @@ -0,0 +1,59 @@ + + + + + + + linghe.facade API documentation + + + + + + + + + +
+
+

+linghe.facade

+ + + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/facade/add.html b/docs/api/linghe/facade/add.html new file mode 100644 index 0000000..203f198 --- /dev/null +++ b/docs/api/linghe/facade/add.html @@ -0,0 +1,209 @@ + + + + + + + linghe.facade.add API documentation + + + + + + + + + +
+
+

+linghe.facade.add

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+
+ + class + InplaceAddFunction(torch.autograd.function.Function): + + +
+ + +

Base class to create custom autograd.Function.

+ +

To create a custom autograd.Function, subclass this class and implement +the forward() and backward() static methods. Then, to use your custom +op in the forward pass, call the class method apply. Do not call +forward() directly.

+ +

To ensure correctness and best performance, make sure you are calling the +correct methods on ctx and validating your backward function using +torch.autograd.gradcheck().

+ +

See :ref:extending-autograd for more details on how to use this class.

+ +

Examples::

+ +
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
+>>> class Exp(Function):
+>>>     @staticmethod
+>>>     def forward(ctx, i):
+>>>         result = i.exp()
+>>>         ctx.save_for_backward(result)
+>>>         return result
+>>>
+>>>     @staticmethod
+>>>     def backward(ctx, grad_output):
+>>>         result, = ctx.saved_tensors
+>>>         return grad_output * result
+>>>
+>>> # Use it by calling the apply method:
+>>> # xdoctest: +SKIP
+>>> output = Exp.apply(input)
+
+
+ + +
+
+
@staticmethod
+ + def + forward(ctx, x, y): + + +
+ + +

Define the forward of the custom autograd Function.

+ +

This function is to be overridden by all subclasses. +There are two ways to define forward:

+ +

Usage 1 (Combined forward and ctx)::

+ +
@staticmethod
+def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
+    pass
+
+ +
    +
  • It must accept a context ctx as the first argument, followed by any +number of arguments (tensors or other types).
  • +
  • See :ref:combining-forward-context for more details
  • +
+ +

Usage 2 (Separate forward and ctx)::

+ +
@staticmethod
+def forward(*args: Any, **kwargs: Any) -> Any:
+    pass
+
+@staticmethod
+def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
+    pass
+
+ +
    +
  • The forward no longer accepts a ctx argument.
  • +
  • Instead, you must also override the torch.autograd.Function.setup_context() +staticmethod to handle setting up the ctx object. +output is the output of the forward, inputs are a Tuple of inputs +to the forward.
  • +
  • See :ref:extending-autograd for more details
  • +
+ +

The context can be used to store arbitrary data that can be then +retrieved during the backward pass. Tensors should not be stored +directly on ctx (though this is not currently enforced for +backward compatibility). Instead, tensors should be saved either with +ctx.save_for_backward() if they are intended to be used in +backward (equivalently, vjp) or ctx.save_for_forward() +if they are intended to be used for in jvp.

+
+ + +
+
+
+
@staticmethod
+ + def + backward(ctx, grad_output): + + +
+ + +

Define a formula for differentiating the operation with backward mode automatic differentiation.

+ +

This function is to be overridden by all subclasses. +(Defining this function is equivalent to defining the vjp function.)

+ +

It must accept a context ctx as the first argument, followed by +as many outputs as the forward() returned (None will be passed in +for non tensor outputs of the forward function), +and it should return as many tensors, as there were inputs to +forward(). Each argument is the gradient w.r.t the given output, +and each returned value should be the gradient w.r.t. the +corresponding input. If an input is not a Tensor or is a Tensor not +requiring grads, you can just pass None as a gradient for that input.

+ +

The context can be used to retrieve tensors saved during the forward +pass. It also has an attribute ctx.needs_input_grad as a tuple +of booleans representing whether each input needs gradient. E.g., +backward() will have ctx.needs_input_grad[0] = True if the +first input to forward() needs gradient computed w.r.t. the +output.

+
+ + +
+
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/facade/fp32_linear.html b/docs/api/linghe/facade/fp32_linear.html new file mode 100644 index 0000000..43c7dfc --- /dev/null +++ b/docs/api/linghe/facade/fp32_linear.html @@ -0,0 +1,209 @@ + + + + + + + linghe.facade.fp32_linear API documentation + + + + + + + + + +
+
+

+linghe.facade.fp32_linear

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+
+ + class + FusedFp32GEMM(torch.autograd.function.Function): + + +
+ + +

Base class to create custom autograd.Function.

+ +

To create a custom autograd.Function, subclass this class and implement +the forward() and backward() static methods. Then, to use your custom +op in the forward pass, call the class method apply. Do not call +forward() directly.

+ +

To ensure correctness and best performance, make sure you are calling the +correct methods on ctx and validating your backward function using +torch.autograd.gradcheck().

+ +

See :ref:extending-autograd for more details on how to use this class.

+ +

Examples::

+ +
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
+>>> class Exp(Function):
+>>>     @staticmethod
+>>>     def forward(ctx, i):
+>>>         result = i.exp()
+>>>         ctx.save_for_backward(result)
+>>>         return result
+>>>
+>>>     @staticmethod
+>>>     def backward(ctx, grad_output):
+>>>         result, = ctx.saved_tensors
+>>>         return grad_output * result
+>>>
+>>> # Use it by calling the apply method:
+>>> # xdoctest: +SKIP
+>>> output = Exp.apply(input)
+
+
+ + +
+
+
@staticmethod
+ + def + forward(ctx, input, weight): + + +
+ + +

Define the forward of the custom autograd Function.

+ +

This function is to be overridden by all subclasses. +There are two ways to define forward:

+ +

Usage 1 (Combined forward and ctx)::

+ +
@staticmethod
+def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
+    pass
+
+ +
    +
  • It must accept a context ctx as the first argument, followed by any +number of arguments (tensors or other types).
  • +
  • See :ref:combining-forward-context for more details
  • +
+ +

Usage 2 (Separate forward and ctx)::

+ +
@staticmethod
+def forward(*args: Any, **kwargs: Any) -> Any:
+    pass
+
+@staticmethod
+def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
+    pass
+
+ +
    +
  • The forward no longer accepts a ctx argument.
  • +
  • Instead, you must also override the torch.autograd.Function.setup_context() +staticmethod to handle setting up the ctx object. +output is the output of the forward, inputs are a Tuple of inputs +to the forward.
  • +
  • See :ref:extending-autograd for more details
  • +
+ +

The context can be used to store arbitrary data that can be then +retrieved during the backward pass. Tensors should not be stored +directly on ctx (though this is not currently enforced for +backward compatibility). Instead, tensors should be saved either with +ctx.save_for_backward() if they are intended to be used in +backward (equivalently, vjp) or ctx.save_for_forward() +if they are intended to be used for in jvp.

+
+ + +
+
+
+
@staticmethod
+ + def + backward(ctx, grad_output): + + +
+ + +

Define a formula for differentiating the operation with backward mode automatic differentiation.

+ +

This function is to be overridden by all subclasses. +(Defining this function is equivalent to defining the vjp function.)

+ +

It must accept a context ctx as the first argument, followed by +as many outputs as the forward() returned (None will be passed in +for non tensor outputs of the forward function), +and it should return as many tensors, as there were inputs to +forward(). Each argument is the gradient w.r.t the given output, +and each returned value should be the gradient w.r.t. the +corresponding input. If an input is not a Tensor or is a Tensor not +requiring grads, you can just pass None as a gradient for that input.

+ +

The context can be used to retrieve tensors saved during the forward +pass. It also has an attribute ctx.needs_input_grad as a tuple +of booleans representing whether each input needs gradient. E.g., +backward() will have ctx.needs_input_grad[0] = True if the +first input to forward() needs gradient computed w.r.t. the +output.
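The pdoc page above only carries the generic `torch.autograd.Function` docstring, so here is a minimal usage sketch for the FP32 router GEMM based solely on the `forward(ctx, input, weight)` signature. The shapes, dtypes, and the `[num_experts, hidden]` weight layout are illustrative assumptions, not part of the documented API.

```python
import torch
from linghe.facade.fp32_linear import FusedFp32GEMM

# Assumed MoE router setup: 8192 tokens, hidden size 2048, 256 experts
# (sizes borrowed from the README benchmark section).
hidden_states = torch.randn(8192, 2048, device="cuda", dtype=torch.bfloat16, requires_grad=True)
router_weight = torch.nn.Parameter(torch.randn(256, 2048, device="cuda", dtype=torch.float32))

# Custom autograd Functions are invoked via .apply(); forward() is never called directly.
logits = FusedFp32GEMM.apply(hidden_states, router_weight)

# Gradients flow through the Function's backward() like any other op.
logits.float().sum().backward()
print(hidden_states.grad.shape, router_weight.grad.shape)
```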

+
+ + +
+
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/facade/loss.html b/docs/api/linghe/facade/loss.html new file mode 100644 index 0000000..6229b49 --- /dev/null +++ b/docs/api/linghe/facade/loss.html @@ -0,0 +1,362 @@ + + + + + + + linghe.facade.loss API documentation + + + + + + + + + +
+
+

+linghe.facade.loss

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+
+ + class + SoftmaxCrossEntropyFunction(torch.autograd.function.Function): + + +
+ + +

Base class to create custom autograd.Function.

+ +

To create a custom autograd.Function, subclass this class and implement +the forward() and backward() static methods. Then, to use your custom +op in the forward pass, call the class method apply. Do not call +forward() directly.

+ +

To ensure correctness and best performance, make sure you are calling the +correct methods on ctx and validating your backward function using +torch.autograd.gradcheck().

+ +

See :ref:extending-autograd for more details on how to use this class.

+ +

Examples::

+ +
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
+>>> class Exp(Function):
+>>>     @staticmethod
+>>>     def forward(ctx, i):
+>>>         result = i.exp()
+>>>         ctx.save_for_backward(result)
+>>>         return result
+>>>
+>>>     @staticmethod
+>>>     def backward(ctx, grad_output):
+>>>         result, = ctx.saved_tensors
+>>>         return grad_output * result
+>>>
+>>> # Use it by calling the apply method:
+>>> # xdoctest: +SKIP
+>>> output = Exp.apply(input)
+
+
+ + +
+
+
@staticmethod
+ + def + forward(ctx, logits, labels, inplace=False): + + +
+ + +

Define the forward of the custom autograd Function.

+ +

This function is to be overridden by all subclasses. +There are two ways to define forward:

+ +

Usage 1 (Combined forward and ctx)::

+ +
@staticmethod
+def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
+    pass
+
+ +
    +
  • It must accept a context ctx as the first argument, followed by any +number of arguments (tensors or other types).
  • +
  • See :ref:combining-forward-context for more details
  • +
+ +

Usage 2 (Separate forward and ctx)::

+ +
@staticmethod
+def forward(*args: Any, **kwargs: Any) -> Any:
+    pass
+
+@staticmethod
+def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
+    pass
+
+ +
    +
  • The forward no longer accepts a ctx argument.
  • +
  • Instead, you must also override the torch.autograd.Function.setup_context() +staticmethod to handle setting up the ctx object. +output is the output of the forward, inputs are a Tuple of inputs +to the forward.
  • +
  • See :ref:extending-autograd for more details
  • +
+ +

The context can be used to store arbitrary data that can be then +retrieved during the backward pass. Tensors should not be stored +directly on ctx (though this is not currently enforced for +backward compatibility). Instead, tensors should be saved either with +ctx.save_for_backward() if they are intended to be used in +backward (equivalently, vjp) or ctx.save_for_forward() +if they are intended to be used for in jvp.

+
+ + +
+
+
+
@staticmethod
+ + def + backward(ctx, grad_output): + + +
+ + +

Define a formula for differentiating the operation with backward mode automatic differentiation.

+ +

This function is to be overridden by all subclasses. +(Defining this function is equivalent to defining the vjp function.)

+ +

It must accept a context ctx as the first argument, followed by +as many outputs as the forward() returned (None will be passed in +for non tensor outputs of the forward function), +and it should return as many tensors, as there were inputs to +forward(). Each argument is the gradient w.r.t the given output, +and each returned value should be the gradient w.r.t. the +corresponding input. If an input is not a Tensor or is a Tensor not +requiring grads, you can just pass None as a gradient for that input.

+ +

The context can be used to retrieve tensors saved during the forward +pass. It also has an attribute ctx.needs_input_grad as a tuple +of booleans representing whether each input needs gradient. E.g., +backward() will have ctx.needs_input_grad[0] = True if the +first input to forward() needs gradient computed w.r.t. the +output.
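A minimal usage sketch for the fused softmax cross entropy, combining the `forward(ctx, logits, labels, inplace=False)` signature above with the notes in asserts/api.md; the token count, vocabulary size, and the shape of the returned loss are assumptions for illustration.

```python
import torch
from linghe.facade.loss import SoftmaxCrossEntropyFunction

# Assumed shapes: 8192 tokens, vocabulary of 32768.
logits = torch.randn(8192, 32768, device="cuda", dtype=torch.float32, requires_grad=True)
labels = torch.randint(0, 32768, (8192,), device="cuda")

# inplace=True would reuse the logits buffer as gradient storage (see asserts/api.md);
# keep the default here so logits stays readable after backward.
loss = SoftmaxCrossEntropyFunction.apply(logits, labels, False)

# Reduce before backward; sum() works whether the op returns per-token losses or a scalar.
loss.sum().backward()
print(logits.grad.shape)
```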

+
+ + +
+
+
+
+ + class + GradScalingFunction(torch.autograd.function.Function): + + +
+ + +

Base class to create custom autograd.Function.

+ +

To create a custom autograd.Function, subclass this class and implement +the forward() and backward() static methods. Then, to use your custom +op in the forward pass, call the class method apply. Do not call +forward() directly.

+ +

To ensure correctness and best performance, make sure you are calling the +correct methods on ctx and validating your backward function using +torch.autograd.gradcheck().

+ +

See :ref:extending-autograd for more details on how to use this class.

+ +

Examples::

+ +
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
+>>> class Exp(Function):
+>>>     @staticmethod
+>>>     def forward(ctx, i):
+>>>         result = i.exp()
+>>>         ctx.save_for_backward(result)
+>>>         return result
+>>>
+>>>     @staticmethod
+>>>     def backward(ctx, grad_output):
+>>>         result, = ctx.saved_tensors
+>>>         return grad_output * result
+>>>
+>>> # Use it by calling the apply method:
+>>> # xdoctest: +SKIP
+>>> output = Exp.apply(input)
+
+
+ + +
+
+
@staticmethod
+ + def + forward(ctx, x, coef=0.2): + + +
+ + +

Define the forward of the custom autograd Function.

+ +

This function is to be overridden by all subclasses. +There are two ways to define forward:

+ +

Usage 1 (Combined forward and ctx)::

+ +
@staticmethod
+def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
+    pass
+
+ +
    +
  • It must accept a context ctx as the first argument, followed by any +number of arguments (tensors or other types).
  • +
  • See :ref:combining-forward-context for more details
  • +
+ +

Usage 2 (Separate forward and ctx)::

+ +
@staticmethod
+def forward(*args: Any, **kwargs: Any) -> Any:
+    pass
+
+@staticmethod
+def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
+    pass
+
+ +
    +
  • The forward no longer accepts a ctx argument.
  • +
  • Instead, you must also override the torch.autograd.Function.setup_context() +staticmethod to handle setting up the ctx object. +output is the output of the forward, inputs are a Tuple of inputs +to the forward.
  • +
  • See :ref:extending-autograd for more details
  • +
+ +

The context can be used to store arbitrary data that can be then +retrieved during the backward pass. Tensors should not be stored +directly on ctx (though this is not currently enforced for +backward compatibility). Instead, tensors should be saved either with +ctx.save_for_backward() if they are intended to be used in +backward (equivalently, vjp) or ctx.save_for_forward() +if they are intended to be used for in jvp.

+
+ + +
+
+
+
@staticmethod
+ + def + backward(ctx, grad_output): + + +
+ + +

Define a formula for differentiating the operation with backward mode automatic differentiation.

+ +

This function is to be overridden by all subclasses. +(Defining this function is equivalent to defining the vjp function.)

+ +

It must accept a context ctx as the first argument, followed by +as many outputs as the forward() returned (None will be passed in +for non tensor outputs of the forward function), +and it should return as many tensors, as there were inputs to +forward(). Each argument is the gradient w.r.t the given output, +and each returned value should be the gradient w.r.t. the +corresponding input. If an input is not a Tensor or is a Tensor not +requiring grads, you can just pass None as a gradient for that input.

+ +

The context can be used to retrieve tensors saved during the forward +pass. It also has an attribute ctx.needs_input_grad as a tuple +of booleans representing whether each input needs gradient. E.g., +backward() will have ctx.needs_input_grad[0] = True if the +first input to forward() needs gradient computed w.r.t. the +output.

+
+ + +
+
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/facade/norm.html b/docs/api/linghe/facade/norm.html new file mode 100644 index 0000000..89b4a1c --- /dev/null +++ b/docs/api/linghe/facade/norm.html @@ -0,0 +1,362 @@ + + + + + + + linghe.facade.norm API documentation + + + + + + + + + +
+
+

+linghe.facade.norm

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+
+ + class + RMSNormFunction(torch.autograd.function.Function): + + +
+ + +

Base class to create custom autograd.Function.

+ +

To create a custom autograd.Function, subclass this class and implement +the forward() and backward() static methods. Then, to use your custom +op in the forward pass, call the class method apply. Do not call +forward() directly.

+ +

To ensure correctness and best performance, make sure you are calling the +correct methods on ctx and validating your backward function using +torch.autograd.gradcheck().

+ +

See :ref:extending-autograd for more details on how to use this class.

+ +

Examples::

+ +
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
+>>> class Exp(Function):
+>>>     @staticmethod
+>>>     def forward(ctx, i):
+>>>         result = i.exp()
+>>>         ctx.save_for_backward(result)
+>>>         return result
+>>>
+>>>     @staticmethod
+>>>     def backward(ctx, grad_output):
+>>>         result, = ctx.saved_tensors
+>>>         return grad_output * result
+>>>
+>>> # Use it by calling the apply method:
+>>> # xdoctest: +SKIP
+>>> output = Exp.apply(input)
+
+
+ + +
+
+
@staticmethod
+ + def + forward(ctx, x, weight, eps=1e-06): + + +
+ + +

Define the forward of the custom autograd Function.

+ +

This function is to be overridden by all subclasses. +There are two ways to define forward:

+ +

Usage 1 (Combined forward and ctx)::

+ +
@staticmethod
+def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
+    pass
+
+ +
    +
  • It must accept a context ctx as the first argument, followed by any +number of arguments (tensors or other types).
  • +
  • See :ref:combining-forward-context for more details
  • +
+ +

Usage 2 (Separate forward and ctx)::

+ +
@staticmethod
+def forward(*args: Any, **kwargs: Any) -> Any:
+    pass
+
+@staticmethod
+def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
+    pass
+
+ +
    +
  • The forward no longer accepts a ctx argument.
  • +
  • Instead, you must also override the torch.autograd.Function.setup_context() +staticmethod to handle setting up the ctx object. +output is the output of the forward, inputs are a Tuple of inputs +to the forward.
  • +
  • See :ref:extending-autograd for more details
  • +
+ +

The context can be used to store arbitrary data that can be then +retrieved during the backward pass. Tensors should not be stored +directly on ctx (though this is not currently enforced for +backward compatibility). Instead, tensors should be saved either with +ctx.save_for_backward() if they are intended to be used in +backward (equivalently, vjp) or ctx.save_for_forward() +if they are intended to be used for in jvp.

+
+ + +
+
+
+
@staticmethod
+ + def + backward(ctx, dy): + + +
+ + +

Define a formula for differentiating the operation with backward mode automatic differentiation.

+ +

This function is to be overridden by all subclasses. +(Defining this function is equivalent to defining the vjp function.)

+ +

It must accept a context ctx as the first argument, followed by +as many outputs as the forward() returned (None will be passed in +for non tensor outputs of the forward function), +and it should return as many tensors, as there were inputs to +forward(). Each argument is the gradient w.r.t the given output, +and each returned value should be the gradient w.r.t. the +corresponding input. If an input is not a Tensor or is a Tensor not +requiring grads, you can just pass None as a gradient for that input.

+ +

The context can be used to retrieve tensors saved during the forward +pass. It also has an attribute ctx.needs_input_grad as a tuple +of booleans representing whether each input needs gradient. E.g., +backward() will have ctx.needs_input_grad[0] = True if the +first input to forward() needs gradient computed w.r.t. the +output.
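A usage sketch for the RMSNorm facade, based only on the `forward(ctx, x, weight, eps=1e-06)` signature above; the activation shape and dtypes are assumptions.

```python
import torch
from linghe.facade.norm import RMSNormFunction

# Assumed activation layout [tokens, hidden] with hidden size 2048.
x = torch.randn(8192, 2048, device="cuda", dtype=torch.bfloat16, requires_grad=True)
weight = torch.nn.Parameter(torch.ones(2048, device="cuda", dtype=torch.bfloat16))

y = RMSNormFunction.apply(x, weight, 1e-6)  # invoked via .apply(), not forward()
y.sum().backward()
print(y.shape, x.grad.shape)
```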

+
+ + +
+
+
+
+ + class + GroupNormGateFunction(torch.autograd.function.Function): + + +
+ + +

Base class to create custom autograd.Function.

+ +

To create a custom autograd.Function, subclass this class and implement +the forward() and backward() static methods. Then, to use your custom +op in the forward pass, call the class method apply. Do not call +forward() directly.

+ +

To ensure correctness and best performance, make sure you are calling the +correct methods on ctx and validating your backward function using +torch.autograd.gradcheck().

+ +

See :ref:extending-autograd for more details on how to use this class.

+ +

Examples::

+ +
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
+>>> class Exp(Function):
+>>>     @staticmethod
+>>>     def forward(ctx, i):
+>>>         result = i.exp()
+>>>         ctx.save_for_backward(result)
+>>>         return result
+>>>
+>>>     @staticmethod
+>>>     def backward(ctx, grad_output):
+>>>         result, = ctx.saved_tensors
+>>>         return grad_output * result
+>>>
+>>> # Use it by calling the apply method:
+>>> # xdoctest: +SKIP
+>>> output = Exp.apply(input)
+
+
+ + +
+
+
@staticmethod
+ + def + forward(ctx, x, gate, weight, eps=1e-06, group_size=4): + + +
+ + +

Define the forward of the custom autograd Function.

+ +

This function is to be overridden by all subclasses. +There are two ways to define forward:

+ +

Usage 1 (Combined forward and ctx)::

+ +
@staticmethod
+def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
+    pass
+
+ +
    +
  • It must accept a context ctx as the first argument, followed by any +number of arguments (tensors or other types).
  • +
  • See :ref:combining-forward-context for more details
  • +
+ +

Usage 2 (Separate forward and ctx)::

+ +
@staticmethod
+def forward(*args: Any, **kwargs: Any) -> Any:
+    pass
+
+@staticmethod
+def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
+    pass
+
+ +
    +
  • The forward no longer accepts a ctx argument.
  • +
  • Instead, you must also override the torch.autograd.Function.setup_context() +staticmethod to handle setting up the ctx object. +output is the output of the forward, inputs are a Tuple of inputs +to the forward.
  • +
  • See :ref:extending-autograd for more details
  • +
+ +

The context can be used to store arbitrary data that can be then +retrieved during the backward pass. Tensors should not be stored +directly on ctx (though this is not currently enforced for +backward compatibility). Instead, tensors should be saved either with +ctx.save_for_backward() if they are intended to be used in +backward (equivalently, vjp) or ctx.save_for_forward() +if they are intended to be used for in jvp.

+
+ + +
+
+
+
@staticmethod
+ + def + backward(ctx, dy): + + +
+ + +

Define a formula for differentiating the operation with backward mode automatic differentiation.

+ +

This function is to be overridden by all subclasses. +(Defining this function is equivalent to defining the vjp function.)

+ +

It must accept a context ctx as the first argument, followed by +as many outputs as the forward() returned (None will be passed in +for non tensor outputs of the forward function), +and it should return as many tensors, as there were inputs to +forward(). Each argument is the gradient w.r.t the given output, +and each returned value should be the gradient w.r.t. the +corresponding input. If an input is not a Tensor or is a Tensor not +requiring grads, you can just pass None as a gradient for that input.

+ +

The context can be used to retrieve tensors saved during the forward +pass. It also has an attribute ctx.needs_input_grad as a tuple +of booleans representing whether each input needs gradient. E.g., +backward() will have ctx.needs_input_grad[0] = True if the +first input to forward() needs gradient computed w.r.t. the +output.

+
+ + +
+
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/facade/rope.html b/docs/api/linghe/facade/rope.html new file mode 100644 index 0000000..8cf1b21 --- /dev/null +++ b/docs/api/linghe/facade/rope.html @@ -0,0 +1,209 @@ + + + + + + + linghe.facade.rope API documentation + + + + + + + + + +
+
+

+linghe.facade.rope

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+
+ + class + QkNormHalfRopeFunction(torch.autograd.function.Function): + + +
+ + +

Base class to create custom autograd.Function.

+ +

To create a custom autograd.Function, subclass this class and implement +the forward() and backward() static methods. Then, to use your custom +op in the forward pass, call the class method apply. Do not call +forward() directly.

+ +

To ensure correctness and best performance, make sure you are calling the +correct methods on ctx and validating your backward function using +torch.autograd.gradcheck().

+ +

See :ref:extending-autograd for more details on how to use this class.

+ +

Examples::

+ +
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
+>>> class Exp(Function):
+>>>     @staticmethod
+>>>     def forward(ctx, i):
+>>>         result = i.exp()
+>>>         ctx.save_for_backward(result)
+>>>         return result
+>>>
+>>>     @staticmethod
+>>>     def backward(ctx, grad_output):
+>>>         result, = ctx.saved_tensors
+>>>         return grad_output * result
+>>>
+>>> # Use it by calling the apply method:
+>>> # xdoctest: +SKIP
+>>> output = Exp.apply(input)
+
+
+ + +
+
+
@staticmethod
+ + def + forward(ctx, qkv, q_norm_weight, k_norm_weight, freqs, H=32, h=4, eps=1e-06): + + +
+ + +

Define the forward of the custom autograd Function.

+ +

This function is to be overridden by all subclasses. +There are two ways to define forward:

+ +

Usage 1 (Combined forward and ctx)::

+ +
@staticmethod
+def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
+    pass
+
+ +
    +
  • It must accept a context ctx as the first argument, followed by any +number of arguments (tensors or other types).
  • +
  • See :ref:combining-forward-context for more details
  • +
+ +

Usage 2 (Separate forward and ctx)::

+ +
@staticmethod
+def forward(*args: Any, **kwargs: Any) -> Any:
+    pass
+
+@staticmethod
+def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
+    pass
+
+ +
    +
  • The forward no longer accepts a ctx argument.
  • +
  • Instead, you must also override the torch.autograd.Function.setup_context() +staticmethod to handle setting up the ctx object. +output is the output of the forward, inputs are a Tuple of inputs +to the forward.
  • +
  • See :ref:extending-autograd for more details
  • +
+ +

The context can be used to store arbitrary data that can be then +retrieved during the backward pass. Tensors should not be stored +directly on ctx (though this is not currently enforced for +backward compatibility). Instead, tensors should be saved either with +ctx.save_for_backward() if they are intended to be used in +backward (equivalently, vjp) or ctx.save_for_forward() +if they are intended to be used for in jvp.

+
+ + +
+
+
+
@staticmethod
+ + def + backward(ctx, grad_q, grad_k, grad_v): + + +
+ + +

Define a formula for differentiating the operation with backward mode automatic differentiation.

+ +

This function is to be overridden by all subclasses. +(Defining this function is equivalent to defining the vjp function.)

+ +

It must accept a context ctx as the first argument, followed by +as many outputs as the forward() returned (None will be passed in +for non tensor outputs of the forward function), +and it should return as many tensors, as there were inputs to +forward(). Each argument is the gradient w.r.t the given output, +and each returned value should be the gradient w.r.t. the +corresponding input. If an input is not a Tensor or is a Tensor not +requiring grads, you can just pass None as a gradient for that input.

+ +

The context can be used to retrieve tensors saved during the forward +pass. It also has an attribute ctx.needs_input_grad as a tuple +of booleans representing whether each input needs gradient. E.g., +backward() will have ctx.needs_input_grad[0] = True if the +first input to forward() needs gradient computed w.r.t. the +output.
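A usage sketch for the fused split + qk-norm + half-RoPE op. Only the argument list above and the fact that backward() receives (grad_q, grad_k, grad_v), i.e. forward returns three tensors, come from the docs; the packed qkv layout, head dim, and freqs shape below are illustrative assumptions.

```python
import torch
from linghe.facade.rope import QkNormHalfRopeFunction

B, S, H, h, D = 2, 4096, 32, 4, 128  # assumed: 32 query heads, 4 kv heads, head dim 128
qkv = torch.randn(B, S, (H + 2 * h) * D, device="cuda", dtype=torch.bfloat16, requires_grad=True)
q_norm_weight = torch.ones(D, device="cuda", dtype=torch.bfloat16)
k_norm_weight = torch.ones(D, device="cuda", dtype=torch.bfloat16)
freqs = torch.randn(S, D // 2, device="cuda", dtype=torch.float32)  # assumed rotary angles

# forward returns q, k, v (its backward takes grad_q, grad_k, grad_v).
q, k, v = QkNormHalfRopeFunction.apply(qkv, q_norm_weight, k_norm_weight, freqs, H, h, 1e-6)
(q.sum() + k.sum() + v.sum()).backward()
```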

+
+ + +
+
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/facade/transpose.html b/docs/api/linghe/facade/transpose.html new file mode 100644 index 0000000..9efef88 --- /dev/null +++ b/docs/api/linghe/facade/transpose.html @@ -0,0 +1,209 @@ + + + + + + + linghe.facade.transpose API documentation + + + + + + + + + +
+
+

+linghe.facade.transpose

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+
+ + class + TransposeDim01Function(torch.autograd.function.Function): + + +
+ + +

Base class to create custom autograd.Function.

+ +

To create a custom autograd.Function, subclass this class and implement +the forward() and backward() static methods. Then, to use your custom +op in the forward pass, call the class method apply. Do not call +forward() directly.

+ +

To ensure correctness and best performance, make sure you are calling the +correct methods on ctx and validating your backward function using +torch.autograd.gradcheck().

+ +

See :ref:extending-autograd for more details on how to use this class.

+ +

Examples::

+ +
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
+>>> class Exp(Function):
+>>>     @staticmethod
+>>>     def forward(ctx, i):
+>>>         result = i.exp()
+>>>         ctx.save_for_backward(result)
+>>>         return result
+>>>
+>>>     @staticmethod
+>>>     def backward(ctx, grad_output):
+>>>         result, = ctx.saved_tensors
+>>>         return grad_output * result
+>>>
+>>> # Use it by calling the apply method:
+>>> # xdoctest: +SKIP
+>>> output = Exp.apply(input)
+
+
+ + +
+
+
@staticmethod
+ + def + forward(ctx, x): + + +
+ + +

Define the forward of the custom autograd Function.

+ +

This function is to be overridden by all subclasses. +There are two ways to define forward:

+ +

Usage 1 (Combined forward and ctx)::

+ +
@staticmethod
+def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
+    pass
+
+ +
    +
  • It must accept a context ctx as the first argument, followed by any +number of arguments (tensors or other types).
  • +
  • See :ref:combining-forward-context for more details
  • +
+ +

Usage 2 (Separate forward and ctx)::

+ +
@staticmethod
+def forward(*args: Any, **kwargs: Any) -> Any:
+    pass
+
+@staticmethod
+def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
+    pass
+
+ +
    +
  • The forward no longer accepts a ctx argument.
  • +
  • Instead, you must also override the torch.autograd.Function.setup_context() +staticmethod to handle setting up the ctx object. +output is the output of the forward, inputs are a Tuple of inputs +to the forward.
  • +
  • See :ref:extending-autograd for more details
  • +
+ +

The context can be used to store arbitrary data that can be then +retrieved during the backward pass. Tensors should not be stored +directly on ctx (though this is not currently enforced for +backward compatibility). Instead, tensors should be saved either with +ctx.save_for_backward() if they are intended to be used in +backward (equivalently, vjp) or ctx.save_for_forward() +if they are intended to be used for in jvp.

+
+ + +
+
+
+
@staticmethod
+ + def + backward(ctx, grad_output): + + +
+ + +

Define a formula for differentiating the operation with backward mode automatic differentiation.

+ +

This function is to be overridden by all subclasses. +(Defining this function is equivalent to defining the vjp function.)

+ +

It must accept a context ctx as the first argument, followed by +as many outputs as the forward() returned (None will be passed in +for non tensor outputs of the forward function), +and it should return as many tensors, as there were inputs to +forward(). Each argument is the gradient w.r.t the given output, +and each returned value should be the gradient w.r.t. the +corresponding input. If an input is not a Tensor or is a Tensor not +requiring grads, you can just pass None as a gradient for that input.

+ +

The context can be used to retrieve tensors saved during the forward +pass. It also has an attribute ctx.needs_input_grad as a tuple +of booleans representing whether each input needs gradient. E.g., +backward() will have ctx.needs_input_grad[0] = True if the +first input to forward() needs gradient computed w.r.t. the +output.
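A usage sketch for the dim-0/1 transpose facade; only the `forward(ctx, x)` signature is documented above, and the [B, S, hidden] input layout is an assumption.

```python
import torch
from linghe.facade.transpose import TransposeDim01Function

x = torch.randn(8, 4096, 2048, device="cuda", dtype=torch.bfloat16, requires_grad=True)
y = TransposeDim01Function.apply(x)  # expected to swap dims 0 and 1, giving [4096, 8, 2048]
y.sum().backward()
print(y.shape, x.grad.shape)
```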

+
+ + +
+
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/gemm.html b/docs/api/linghe/gemm.html new file mode 100644 index 0000000..de3721c --- /dev/null +++ b/docs/api/linghe/gemm.html @@ -0,0 +1,54 @@ + + + + + + + linghe.gemm API documentation + + + + + + + + + +
+
+

+linghe.gemm

+ + + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/gemm/fp32_gemm.html b/docs/api/linghe/gemm/fp32_gemm.html new file mode 100644 index 0000000..50a667e --- /dev/null +++ b/docs/api/linghe/gemm/fp32_gemm.html @@ -0,0 +1,56 @@ + + + + + + + linghe.gemm.fp32_gemm API documentation + + + + + + + + + +
+
+

+linghe.gemm.fp32_gemm

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/quant.html b/docs/api/linghe/quant.html new file mode 100644 index 0000000..7de4b97 --- /dev/null +++ b/docs/api/linghe/quant.html @@ -0,0 +1,55 @@ + + + + + + + linghe.quant API documentation + + + + + + + + + +
+
+

+linghe.quant

+ + + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/quant/block.html b/docs/api/linghe/quant/block.html new file mode 100644 index 0000000..6465535 --- /dev/null +++ b/docs/api/linghe/quant/block.html @@ -0,0 +1,55 @@ + + + + + + + linghe.quant.block API documentation + + + + + + + + + +
+
+

+linghe.quant.block

+ + + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/quant/block/block.html b/docs/api/linghe/quant/block/block.html new file mode 100644 index 0000000..58f0ab6 --- /dev/null +++ b/docs/api/linghe/quant/block/block.html @@ -0,0 +1,56 @@ + + + + + + + linghe.quant.block.block API documentation + + + + + + + + + +
+
+

+linghe.quant.block.block

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/quant/block/group.html b/docs/api/linghe/quant/block/group.html new file mode 100644 index 0000000..11958c6 --- /dev/null +++ b/docs/api/linghe/quant/block/group.html @@ -0,0 +1,56 @@ + + + + + + + linghe.quant.block.group API documentation + + + + + + + + + +
+
+

+linghe.quant.block.group

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/quant/channel.html b/docs/api/linghe/quant/channel.html new file mode 100644 index 0000000..54287c2 --- /dev/null +++ b/docs/api/linghe/quant/channel.html @@ -0,0 +1,54 @@ + + + + + + + linghe.quant.channel API documentation + + + + + + + + + +
+
+

+linghe.quant.channel

+ + + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/quant/channel/channel.html b/docs/api/linghe/quant/channel/channel.html new file mode 100644 index 0000000..7030e74 --- /dev/null +++ b/docs/api/linghe/quant/channel/channel.html @@ -0,0 +1,56 @@ + + + + + + + linghe.quant.channel.channel API documentation + + + + + + + + + +
+
+

+linghe.quant.channel.channel

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils.html b/docs/api/linghe/utils.html new file mode 100644 index 0000000..e37e3fe --- /dev/null +++ b/docs/api/linghe/utils.html @@ -0,0 +1,64 @@ + + + + + + + linghe.utils API documentation + + + + + + + + + +
+
+

+linghe.utils

+ + + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/add.html b/docs/api/linghe/utils/add.html new file mode 100644 index 0000000..10a2108 --- /dev/null +++ b/docs/api/linghe/utils/add.html @@ -0,0 +1,84 @@ + + + + + + + linghe.utils.add API documentation + + + + + + + + + +
+
+

+linghe.utils.add

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+
+ + def + triton_inplace_add(x: torch.Tensor, y: torch.Tensor, accum: bool = True): + + +
+ + +

inplace add y to x

+ +
Arguments:
+ +
    +
  • x: Tensor
  • +
  • y: Tensor
  • +
  • accum: whether accum y to x
  • +
+ +

Returns: x += y if accum=True else x.copy_(y)
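A small usage sketch of `triton_inplace_add`, following the docstring above; tensor sizes and the CUDA placement are illustrative.

```python
import torch
from linghe.utils.add import triton_inplace_add

x = torch.randn(8192, 2048, device="cuda")
y = torch.randn(8192, 2048, device="cuda")

expected = x + y
triton_inplace_add(x, y, accum=True)   # x += y, performed by the Triton kernel
print(torch.allclose(x, expected))

triton_inplace_add(x, y, accum=False)  # x.copy_(y)
print(torch.allclose(x, y))
```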

+
+ + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/dot.html b/docs/api/linghe/utils/dot.html new file mode 100644 index 0000000..dc32301 --- /dev/null +++ b/docs/api/linghe/utils/dot.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.dot API documentation + + + + + + + + + +
+
+

+linghe.utils.dot

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/gather.html b/docs/api/linghe/utils/gather.html new file mode 100644 index 0000000..17a5e5d --- /dev/null +++ b/docs/api/linghe/utils/gather.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.gather API documentation + + + + + + + + + +
+
+

+linghe.utils.gather

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/loss.html b/docs/api/linghe/utils/loss.html new file mode 100644 index 0000000..98e8f96 --- /dev/null +++ b/docs/api/linghe/utils/loss.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.loss API documentation + + + + + + + + + +
+
+

+linghe.utils.loss

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/norm.html b/docs/api/linghe/utils/norm.html new file mode 100644 index 0000000..8c606d3 --- /dev/null +++ b/docs/api/linghe/utils/norm.html @@ -0,0 +1,128 @@ + + + + + + + linghe.utils.norm API documentation + + + + + + + + + +
+
+

+linghe.utils.norm

+ + + + + +
+
+
+ + def + triton_rms_norm_and_block_quant_forward( x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-06, out: Optional[torch.Tensor] = None, scale: Optional[torch.Tensor] = None, rms: Optional[torch.Tensor] = None, round_scale: bool = False, output_mode: int = 2): + + +
+ + +

Fused RMSNorm forward and block quantization.

+ +
Arguments:
+ +
    +
  • x: Input tensor, shape [M, N]
  • +
  • weight: RMSNorm weight, shape [N]
  • +
  • eps: epsilon value for L2 normalization.
  • +
  • out: output of quantization data
  • +
  • scale: output of quantization scale.
  • +
  • rms: output of rms
  • +
  • round_scale: Set whether to force power of 2 scales.
  • +
  • output_mode: one of {0, 1, 2}. +0: only output non-transpose tensor +1: only output transposed tensor +2: return both
  • +
+ +
Returns:
+ +
+

out: quantization data + scale: quantization scale + rms: Reciprocal of the root mean square of the input calculated over the last dimension. + transpose_output: quantization data of transposed gradient + transpose_scale: quantization scale of transposed gradient
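A usage sketch for the fused RMSNorm + block quantization forward, assuming it returns the five items listed under Returns when output_mode=2; sizes, dtypes, and the exact shapes of the scale tensors are illustrative assumptions.

```python
import torch
from linghe.utils.norm import triton_rms_norm_and_block_quant_forward

M, N = 8192, 2048  # assumed token count and hidden size
x = torch.randn(M, N, device="cuda", dtype=torch.bfloat16)
weight = torch.ones(N, device="cuda", dtype=torch.bfloat16)

# Buffers are left as None so the kernel allocates its own outputs;
# output_mode=2 requests both the quantized tensor and its transposed counterpart.
out, scale, rms, t_out, t_scale = triton_rms_norm_and_block_quant_forward(
    x, weight, eps=1e-6, round_scale=True, output_mode=2
)
print(out.dtype, scale.shape, rms.shape)
```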

+
+
+ + +
+
+
+ + def + triton_group_norm_gate_forward(x: torch.Tensor, gate, weight, eps=1e-06, group_size=4): + + +
+ + +

norm and gate in linear attention

+ +
Arguments:
+ +
    +
  • x:
  • +
  • gate:
  • +
  • weight:
  • +
  • eps:
  • +
  • group_size:
  • +
+ +

Returns:
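A usage sketch for the grouped RMSNorm + gate kernel used after linear attention, combining the signature above with the shape notes in asserts/api.md (x is the [B, S, dim] attention output, gate is [S, B, dim], and the result comes back as [S, B, dim]); concrete sizes and dtypes are assumptions.

```python
import torch
from linghe.utils.norm import triton_group_norm_gate_forward

B, S, dim = 8, 4096, 2048  # assumed sizes
x = torch.randn(B, S, dim, device="cuda", dtype=torch.bfloat16)     # attention output
gate = torch.randn(S, B, dim, device="cuda", dtype=torch.bfloat16)  # gate tensor
weight = torch.ones(dim, device="cuda", dtype=torch.bfloat16)       # RMSNorm weight

# Per asserts/api.md, the output is transposed to [S, B, dim].
y = triton_group_norm_gate_forward(x, gate, weight, eps=1e-6, group_size=4)
print(y.shape)
```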

+
+ + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/rearange.html b/docs/api/linghe/utils/rearange.html new file mode 100644 index 0000000..ff1bdb1 --- /dev/null +++ b/docs/api/linghe/utils/rearange.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.rearange API documentation + + + + + + + + + +
+
+

+linghe.utils.rearange

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/reduce.html b/docs/api/linghe/utils/reduce.html new file mode 100644 index 0000000..06ac7f5 --- /dev/null +++ b/docs/api/linghe/utils/reduce.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.reduce API documentation + + + + + + + + + +
+
+

+linghe.utils.reduce

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/rope.html b/docs/api/linghe/utils/rope.html new file mode 100644 index 0000000..49aefc3 --- /dev/null +++ b/docs/api/linghe/utils/rope.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.rope API documentation + + + + + + + + + +
+
+

+linghe.utils.rope

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/scatter.html b/docs/api/linghe/utils/scatter.html new file mode 100644 index 0000000..469c703 --- /dev/null +++ b/docs/api/linghe/utils/scatter.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.scatter API documentation + + + + + + + + + +
+
+

+linghe.utils.scatter

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/silu.html b/docs/api/linghe/utils/silu.html new file mode 100644 index 0000000..3b5a58c --- /dev/null +++ b/docs/api/linghe/utils/silu.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.silu API documentation + + + + + + + + + +
+
+

+linghe.utils.silu

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
+ + \ No newline at end of file diff --git a/docs/api/linghe/utils/transpose.html b/docs/api/linghe/utils/transpose.html new file mode 100644 index 0000000..3512573 --- /dev/null +++ b/docs/api/linghe/utils/transpose.html @@ -0,0 +1,56 @@ + + + + + + + linghe.utils.transpose API documentation + + + + + + + + + +
+
+

+linghe.utils.transpose

+ +

Copyright (c) Ant Financial Service Group and its affiliates.

+
+ + + + +
+
\ No newline at end of file
diff --git a/docs/api/search.js b/docs/api/search.js
new file mode 100644
index 0000000..23741f0
--- /dev/null
+++ b/docs/api/search.js
@@ -0,0 +1,46 @@
+window.pdocSearch = (function(){
+/** elasticlunr - http://weixsong.github.io * Copyright (C) 2017 Oliver Nightingale * Copyright (C) 2017 Wei Song * MIT Licensed */
+[minified elasticlunr 0.9.5 body and the opening entries of the pdoc search index were garbled in extraction; the recoverable index entries continue below]
\n"}, {"fullname": "linghe.facade", "modulename": "linghe.facade", "kind": "module", "doc": "

\n"}, {"fullname": "linghe.facade.add", "modulename": "linghe.facade.add", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.facade.add.InplaceAddFunction", "modulename": "linghe.facade.add", "qualname": "InplaceAddFunction", "kind": "class", "doc": "

Base class to create custom autograd.Function.

\n\n

To create a custom autograd.Function, subclass this class and implement\nthe :meth:forward and :meth:backward static methods. Then, to use your custom\nop in the forward pass, call the class method apply. Do not call\n:meth:forward directly.

\n\n

To ensure correctness and best performance, make sure you are calling the\ncorrect methods on ctx and validating your backward function using\n:func:torch.autograd.gradcheck.

\n\n

See :ref:extending-autograd for more details on how to use this class.

\n\n

Examples::

\n\n
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)\n>>> class Exp(Function):\n>>>     @staticmethod\n>>>     def forward(ctx, i):\n>>>         result = i.exp()\n>>>         ctx.save_for_backward(result)\n>>>         return result\n>>>\n>>>     @staticmethod\n>>>     def backward(ctx, grad_output):\n>>>         result, = ctx.saved_tensors\n>>>         return grad_output * result\n>>>\n>>> # Use it by calling the apply method:\n>>> # xdoctest: +SKIP\n>>> output = Exp.apply(input)\n
\n", "bases": "torch.autograd.function.Function"}, {"fullname": "linghe.facade.add.InplaceAddFunction.forward", "modulename": "linghe.facade.add", "qualname": "InplaceAddFunction.forward", "kind": "function", "doc": "

Define the forward of the custom autograd Function.

\n\n

This function is to be overridden by all subclasses.\nThere are two ways to define forward:

\n\n

Usage 1 (Combined forward and ctx)::

\n\n
@staticmethod\ndef forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:\n    pass\n
\n\n
    \n
  • It must accept a context ctx as the first argument, followed by any\nnumber of arguments (tensors or other types).
  • \n
  • See :ref:combining-forward-context for more details
  • \n
\n\n

Usage 2 (Separate forward and ctx)::

\n\n
@staticmethod\ndef forward(*args: Any, **kwargs: Any) -> Any:\n    pass\n\n@staticmethod\ndef setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:\n    pass\n
\n\n
    \n
  • The forward no longer accepts a ctx argument.
  • \n
  • Instead, you must also override the :meth:torch.autograd.Function.setup_context\nstaticmethod to handle setting up the ctx object.\noutput is the output of the forward, inputs are a Tuple of inputs\nto the forward.
  • \n
  • See :ref:extending-autograd for more details
  • \n
\n\n

The context can be used to store arbitrary data that can be then\nretrieved during the backward pass. Tensors should not be stored\ndirectly on ctx (though this is not currently enforced for\nbackward compatibility). Instead, tensors should be saved either with\n:func:ctx.save_for_backward if they are intended to be used in\nbackward (equivalently, vjp) or :func:ctx.save_for_forward\nif they are intended to be used for in jvp.

\n", "signature": "(ctx, x, y):", "funcdef": "def"}, {"fullname": "linghe.facade.add.InplaceAddFunction.backward", "modulename": "linghe.facade.add", "qualname": "InplaceAddFunction.backward", "kind": "function", "doc": "

Define a formula for differentiating the operation with backward mode automatic differentiation.

\n\n

This function is to be overridden by all subclasses.\n(Defining this function is equivalent to defining the vjp function.)

\n\n

It must accept a context :attr:ctx as the first argument, followed by\nas many outputs as the :func:forward returned (None will be passed in\nfor non tensor outputs of the forward function),\nand it should return as many tensors, as there were inputs to\n:func:forward. Each argument is the gradient w.r.t the given output,\nand each returned value should be the gradient w.r.t. the\ncorresponding input. If an input is not a Tensor or is a Tensor not\nrequiring grads, you can just pass None as a gradient for that input.

\n\n

The context can be used to retrieve tensors saved during the forward\npass. It also has an attribute :attr:ctx.needs_input_grad as a tuple\nof booleans representing whether each input needs gradient. E.g.,\n:func:backward will have ctx.needs_input_grad[0] = True if the\nfirst input to :func:forward needs gradient computed w.r.t. the\noutput.

\n", "signature": "(ctx, grad_output):", "funcdef": "def"}, {"fullname": "linghe.facade.fp32_linear", "modulename": "linghe.facade.fp32_linear", "kind": "module", "doc": "
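The autograd.Function contract documented above (forward builds the output and stashes what backward needs on ctx; backward returns one gradient per forward input) is the pattern every linghe facade class follows. As an illustration only, assuming nothing about the actual Triton kernel behind linghe.facade.add, a minimal custom Function with the same two-tensor signature could look like this:

    import torch

    class AddExample(torch.autograd.Function):
        # Illustrative sketch, not the linghe implementation (which presumably wraps a Triton kernel).
        @staticmethod
        def forward(ctx, x, y):
            # Nothing needs to be saved: d(x + y)/dx = d(x + y)/dy = 1.
            return x + y

        @staticmethod
        def backward(ctx, grad_output):
            # One gradient per forward input, returned in the same order.
            return grad_output, grad_output

    x = torch.randn(4, requires_grad=True)
    y = torch.randn(4, requires_grad=True)
    AddExample.apply(x, y).sum().backward()  # use apply(); never call forward() directly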

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.facade.fp32_linear.FusedFp32GEMM", "modulename": "linghe.facade.fp32_linear", "qualname": "FusedFp32GEMM", "kind": "class", "doc": "

Base class to create custom autograd.Function.

\n\n

To create a custom autograd.Function, subclass this class and implement\nthe :meth:forward and :meth:backward static methods. Then, to use your custom\nop in the forward pass, call the class method apply. Do not call\n:meth:forward directly.

\n\n

To ensure correctness and best performance, make sure you are calling the\ncorrect methods on ctx and validating your backward function using\n:func:torch.autograd.gradcheck.

\n\n

See :ref:extending-autograd for more details on how to use this class.

\n\n

Examples::

\n\n
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)\n>>> class Exp(Function):\n>>>     @staticmethod\n>>>     def forward(ctx, i):\n>>>         result = i.exp()\n>>>         ctx.save_for_backward(result)\n>>>         return result\n>>>\n>>>     @staticmethod\n>>>     def backward(ctx, grad_output):\n>>>         result, = ctx.saved_tensors\n>>>         return grad_output * result\n>>>\n>>> # Use it by calling the apply method:\n>>> # xdoctest: +SKIP\n>>> output = Exp.apply(input)\n
\n", "bases": "torch.autograd.function.Function"}, {"fullname": "linghe.facade.fp32_linear.FusedFp32GEMM.forward", "modulename": "linghe.facade.fp32_linear", "qualname": "FusedFp32GEMM.forward", "kind": "function", "doc": "

Define the forward of the custom autograd Function.

\n\n

This function is to be overridden by all subclasses.\nThere are two ways to define forward:

\n\n

Usage 1 (Combined forward and ctx)::

\n\n
@staticmethod\ndef forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:\n    pass\n
\n\n
    \n
  • It must accept a context ctx as the first argument, followed by any\nnumber of arguments (tensors or other types).
  • \n
  • See :ref:combining-forward-context for more details
  • \n
\n\n

Usage 2 (Separate forward and ctx)::

\n\n
@staticmethod\ndef forward(*args: Any, **kwargs: Any) -> Any:\n    pass\n\n@staticmethod\ndef setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:\n    pass\n
\n\n
    \n
  • The forward no longer accepts a ctx argument.
  • \n
  • Instead, you must also override the :meth:torch.autograd.Function.setup_context\nstaticmethod to handle setting up the ctx object.\noutput is the output of the forward, inputs are a Tuple of inputs\nto the forward.
  • \n
  • See :ref:extending-autograd for more details
  • \n
\n\n

The context can be used to store arbitrary data that can be then\nretrieved during the backward pass. Tensors should not be stored\ndirectly on ctx (though this is not currently enforced for\nbackward compatibility). Instead, tensors should be saved either with\n:func:ctx.save_for_backward if they are intended to be used in\nbackward (equivalently, vjp) or :func:ctx.save_for_forward\nif they are intended to be used for in jvp.

\n", "signature": "(ctx, input, weight):", "funcdef": "def"}, {"fullname": "linghe.facade.fp32_linear.FusedFp32GEMM.backward", "modulename": "linghe.facade.fp32_linear", "qualname": "FusedFp32GEMM.backward", "kind": "function", "doc": "

Define a formula for differentiating the operation with backward mode automatic differentiation.

\n\n

This function is to be overridden by all subclasses.\n(Defining this function is equivalent to defining the vjp function.)

\n\n

It must accept a context :attr:ctx as the first argument, followed by\nas many outputs as the :func:forward returned (None will be passed in\nfor non tensor outputs of the forward function),\nand it should return as many tensors, as there were inputs to\n:func:forward. Each argument is the gradient w.r.t the given output,\nand each returned value should be the gradient w.r.t. the\ncorresponding input. If an input is not a Tensor or is a Tensor not\nrequiring grads, you can just pass None as a gradient for that input.

\n\n

The context can be used to retrieve tensors saved during the forward\npass. It also has an attribute :attr:ctx.needs_input_grad as a tuple\nof booleans representing whether each input needs gradient. E.g.,\n:func:backward will have ctx.needs_input_grad[0] = True if the\nfirst input to :func:forward needs gradient computed w.r.t. the\noutput.

\n", "signature": "(ctx, grad_output):", "funcdef": "def"}, {"fullname": "linghe.facade.loss", "modulename": "linghe.facade.loss", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.facade.loss.SoftmaxCrossEntropyFunction", "modulename": "linghe.facade.loss", "qualname": "SoftmaxCrossEntropyFunction", "kind": "class", "doc": "

Base class to create custom autograd.Function.

\n\n

To create a custom autograd.Function, subclass this class and implement\nthe :meth:forward and :meth:backward static methods. Then, to use your custom\nop in the forward pass, call the class method apply. Do not call\n:meth:forward directly.

\n\n

To ensure correctness and best performance, make sure you are calling the\ncorrect methods on ctx and validating your backward function using\n:func:torch.autograd.gradcheck.

\n\n

See :ref:extending-autograd for more details on how to use this class.

\n\n

Examples::

\n\n
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)\n>>> class Exp(Function):\n>>>     @staticmethod\n>>>     def forward(ctx, i):\n>>>         result = i.exp()\n>>>         ctx.save_for_backward(result)\n>>>         return result\n>>>\n>>>     @staticmethod\n>>>     def backward(ctx, grad_output):\n>>>         result, = ctx.saved_tensors\n>>>         return grad_output * result\n>>>\n>>> # Use it by calling the apply method:\n>>> # xdoctest: +SKIP\n>>> output = Exp.apply(input)\n
\n", "bases": "torch.autograd.function.Function"}, {"fullname": "linghe.facade.loss.SoftmaxCrossEntropyFunction.forward", "modulename": "linghe.facade.loss", "qualname": "SoftmaxCrossEntropyFunction.forward", "kind": "function", "doc": "

Define the forward of the custom autograd Function.

\n\n

This function is to be overridden by all subclasses.\nThere are two ways to define forward:

\n\n

Usage 1 (Combined forward and ctx)::

\n\n
@staticmethod\ndef forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:\n    pass\n
\n\n
    \n
  • It must accept a context ctx as the first argument, followed by any\nnumber of arguments (tensors or other types).
  • \n
  • See :ref:combining-forward-context for more details
  • \n
\n\n

Usage 2 (Separate forward and ctx)::

\n\n
@staticmethod\ndef forward(*args: Any, **kwargs: Any) -> Any:\n    pass\n\n@staticmethod\ndef setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:\n    pass\n
\n\n
    \n
  • The forward no longer accepts a ctx argument.
  • \n
  • Instead, you must also override the :meth:torch.autograd.Function.setup_context\nstaticmethod to handle setting up the ctx object.\noutput is the output of the forward, inputs are a Tuple of inputs\nto the forward.
  • \n
  • See :ref:extending-autograd for more details
  • \n
\n\n

The context can be used to store arbitrary data that can be then\nretrieved during the backward pass. Tensors should not be stored\ndirectly on ctx (though this is not currently enforced for\nbackward compatibility). Instead, tensors should be saved either with\n:func:ctx.save_for_backward if they are intended to be used in\nbackward (equivalently, vjp) or :func:ctx.save_for_forward\nif they are intended to be used for in jvp.

\n", "signature": "(ctx, logits, labels, inplace=False):", "funcdef": "def"}, {"fullname": "linghe.facade.loss.SoftmaxCrossEntropyFunction.backward", "modulename": "linghe.facade.loss", "qualname": "SoftmaxCrossEntropyFunction.backward", "kind": "function", "doc": "

Define a formula for differentiating the operation with backward mode automatic differentiation.

\n\n

This function is to be overridden by all subclasses.\n(Defining this function is equivalent to defining the vjp function.)

\n\n

It must accept a context :attr:ctx as the first argument, followed by\nas many outputs as the :func:forward returned (None will be passed in\nfor non tensor outputs of the forward function),\nand it should return as many tensors, as there were inputs to\n:func:forward. Each argument is the gradient w.r.t the given output,\nand each returned value should be the gradient w.r.t. the\ncorresponding input. If an input is not a Tensor or is a Tensor not\nrequiring grads, you can just pass None as a gradient for that input.

\n\n

The context can be used to retrieve tensors saved during the forward\npass. It also has an attribute :attr:ctx.needs_input_grad as a tuple\nof booleans representing whether each input needs gradient. E.g.,\n:func:backward will have ctx.needs_input_grad[0] = True if the\nfirst input to :func:forward needs gradient computed w.r.t. the\noutput.

\n", "signature": "(ctx, grad_output):", "funcdef": "def"}, {"fullname": "linghe.facade.loss.GradScalingFunction", "modulename": "linghe.facade.loss", "qualname": "GradScalingFunction", "kind": "class", "doc": "
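SoftmaxCrossEntropyFunction.forward takes (logits, labels, inplace=False); its docstring above is inherited autograd.Function boilerplate. Assuming from the name that it fuses a standard softmax cross-entropy loss (the exact reduction and dtype handling are not stated in the generated docs), the unfused PyTorch baseline it would stand in for is:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(8192, 32000, device="cuda", requires_grad=True)
    labels = torch.randint(0, 32000, (8192,), device="cuda")

    # Per-token softmax cross entropy with the softmax computed in fp32.
    loss = F.cross_entropy(logits.float(), labels, reduction="none")
    loss.sum().backward()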

Base class to create custom autograd.Function.

\n\n

To create a custom autograd.Function, subclass this class and implement\nthe :meth:forward and :meth:backward static methods. Then, to use your custom\nop in the forward pass, call the class method apply. Do not call\n:meth:forward directly.

\n\n

To ensure correctness and best performance, make sure you are calling the\ncorrect methods on ctx and validating your backward function using\n:func:torch.autograd.gradcheck.

\n\n

See :ref:extending-autograd for more details on how to use this class.

\n\n

Examples::

\n\n
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)\n>>> class Exp(Function):\n>>>     @staticmethod\n>>>     def forward(ctx, i):\n>>>         result = i.exp()\n>>>         ctx.save_for_backward(result)\n>>>         return result\n>>>\n>>>     @staticmethod\n>>>     def backward(ctx, grad_output):\n>>>         result, = ctx.saved_tensors\n>>>         return grad_output * result\n>>>\n>>> # Use it by calling the apply method:\n>>> # xdoctest: +SKIP\n>>> output = Exp.apply(input)\n
\n", "bases": "torch.autograd.function.Function"}, {"fullname": "linghe.facade.loss.GradScalingFunction.forward", "modulename": "linghe.facade.loss", "qualname": "GradScalingFunction.forward", "kind": "function", "doc": "

Define the forward of the custom autograd Function.

\n\n

This function is to be overridden by all subclasses.\nThere are two ways to define forward:

\n\n

Usage 1 (Combined forward and ctx)::

\n\n
@staticmethod\ndef forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:\n    pass\n
\n\n
    \n
  • It must accept a context ctx as the first argument, followed by any\nnumber of arguments (tensors or other types).
  • \n
  • See :ref:combining-forward-context for more details
  • \n
\n\n

Usage 2 (Separate forward and ctx)::

\n\n
@staticmethod\ndef forward(*args: Any, **kwargs: Any) -> Any:\n    pass\n\n@staticmethod\ndef setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:\n    pass\n
\n\n
    \n
  • The forward no longer accepts a ctx argument.
  • \n
  • Instead, you must also override the :meth:torch.autograd.Function.setup_context\nstaticmethod to handle setting up the ctx object.\noutput is the output of the forward, inputs are a Tuple of inputs\nto the forward.
  • \n
  • See :ref:extending-autograd for more details
  • \n
\n\n

The context can be used to store arbitrary data that can be then\nretrieved during the backward pass. Tensors should not be stored\ndirectly on ctx (though this is not currently enforced for\nbackward compatibility). Instead, tensors should be saved either with\n:func:ctx.save_for_backward if they are intended to be used in\nbackward (equivalently, vjp) or :func:ctx.save_for_forward\nif they are intended to be used for in jvp.

\n", "signature": "(ctx, x, coef=0.2):", "funcdef": "def"}, {"fullname": "linghe.facade.loss.GradScalingFunction.backward", "modulename": "linghe.facade.loss", "qualname": "GradScalingFunction.backward", "kind": "function", "doc": "

Define a formula for differentiating the operation with backward mode automatic differentiation.

\n\n

This function is to be overridden by all subclasses.\n(Defining this function is equivalent to defining the vjp function.)

\n\n

It must accept a context :attr:ctx as the first argument, followed by\nas many outputs as the :func:forward returned (None will be passed in\nfor non tensor outputs of the forward function),\nand it should return as many tensors, as there were inputs to\n:func:forward. Each argument is the gradient w.r.t the given output,\nand each returned value should be the gradient w.r.t. the\ncorresponding input. If an input is not a Tensor or is a Tensor not\nrequiring grads, you can just pass None as a gradient for that input.

\n\n

The context can be used to retrieve tensors saved during the forward\npass. It also has an attribute :attr:ctx.needs_input_grad as a tuple\nof booleans representing whether each input needs gradient. E.g.,\n:func:backward will have ctx.needs_input_grad[0] = True if the\nfirst input to :func:forward needs gradient computed w.r.t. the\noutput.

\n", "signature": "(ctx, grad_output):", "funcdef": "def"}, {"fullname": "linghe.facade.norm", "modulename": "linghe.facade.norm", "kind": "module", "doc": "
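GradScalingFunction.forward takes (x, coef=0.2). Assuming from the name that it is an identity in the forward pass and scales the incoming gradient by coef in the backward pass (an inference, not something the generated docs state), an equivalent pure-PyTorch sketch is:

    import torch

    class GradScalingExample(torch.autograd.Function):
        # Hypothetical stand-in for illustration, not the linghe implementation.
        @staticmethod
        def forward(ctx, x, coef=0.2):
            ctx.coef = coef
            return x

        @staticmethod
        def backward(ctx, grad_output):
            # Gradient w.r.t. x is scaled; coef is a non-tensor input, so its gradient is None.
            return grad_output * ctx.coef, None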

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.facade.norm.RMSNormFunction", "modulename": "linghe.facade.norm", "qualname": "RMSNormFunction", "kind": "class", "doc": "

Base class to create custom autograd.Function.

\n\n

To create a custom autograd.Function, subclass this class and implement\nthe :meth:forward and :meth:backward static methods. Then, to use your custom\nop in the forward pass, call the class method apply. Do not call\n:meth:forward directly.

\n\n

To ensure correctness and best performance, make sure you are calling the\ncorrect methods on ctx and validating your backward function using\n:func:torch.autograd.gradcheck.

\n\n

See :ref:extending-autograd for more details on how to use this class.

\n\n

Examples::

\n\n
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)\n>>> class Exp(Function):\n>>>     @staticmethod\n>>>     def forward(ctx, i):\n>>>         result = i.exp()\n>>>         ctx.save_for_backward(result)\n>>>         return result\n>>>\n>>>     @staticmethod\n>>>     def backward(ctx, grad_output):\n>>>         result, = ctx.saved_tensors\n>>>         return grad_output * result\n>>>\n>>> # Use it by calling the apply method:\n>>> # xdoctest: +SKIP\n>>> output = Exp.apply(input)\n
\n", "bases": "torch.autograd.function.Function"}, {"fullname": "linghe.facade.norm.RMSNormFunction.forward", "modulename": "linghe.facade.norm", "qualname": "RMSNormFunction.forward", "kind": "function", "doc": "

Define the forward of the custom autograd Function.

\n\n

This function is to be overridden by all subclasses.\nThere are two ways to define forward:

\n\n

Usage 1 (Combined forward and ctx)::

\n\n
@staticmethod\ndef forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:\n    pass\n
\n\n
    \n
  • It must accept a context ctx as the first argument, followed by any\nnumber of arguments (tensors or other types).
  • \n
  • See :ref:combining-forward-context for more details
  • \n
\n\n

Usage 2 (Separate forward and ctx)::

\n\n
@staticmethod\ndef forward(*args: Any, **kwargs: Any) -> Any:\n    pass\n\n@staticmethod\ndef setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:\n    pass\n
\n\n
    \n
  • The forward no longer accepts a ctx argument.
  • \n
  • Instead, you must also override the :meth:torch.autograd.Function.setup_context\nstaticmethod to handle setting up the ctx object.\noutput is the output of the forward, inputs are a Tuple of inputs\nto the forward.
  • \n
  • See :ref:extending-autograd for more details
  • \n
\n\n

The context can be used to store arbitrary data that can be then\nretrieved during the backward pass. Tensors should not be stored\ndirectly on ctx (though this is not currently enforced for\nbackward compatibility). Instead, tensors should be saved either with\n:func:ctx.save_for_backward if they are intended to be used in\nbackward (equivalently, vjp) or :func:ctx.save_for_forward\nif they are intended to be used for in jvp.

\n", "signature": "(ctx, x, weight, eps=1e-06):", "funcdef": "def"}, {"fullname": "linghe.facade.norm.RMSNormFunction.backward", "modulename": "linghe.facade.norm", "qualname": "RMSNormFunction.backward", "kind": "function", "doc": "

Define a formula for differentiating the operation with backward mode automatic differentiation.

\n\n

This function is to be overridden by all subclasses.\n(Defining this function is equivalent to defining the vjp function.)

\n\n

It must accept a context :attr:ctx as the first argument, followed by\nas many outputs as the :func:forward returned (None will be passed in\nfor non tensor outputs of the forward function),\nand it should return as many tensors, as there were inputs to\n:func:forward. Each argument is the gradient w.r.t the given output,\nand each returned value should be the gradient w.r.t. the\ncorresponding input. If an input is not a Tensor or is a Tensor not\nrequiring grads, you can just pass None as a gradient for that input.

\n\n

The context can be used to retrieve tensors saved during the forward\npass. It also has an attribute :attr:ctx.needs_input_grad as a tuple\nof booleans representing whether each input needs gradient. E.g.,\n:func:backward will have ctx.needs_input_grad[0] = True if the\nfirst input to :func:forward needs gradient computed w.r.t. the\noutput.

\n", "signature": "(ctx, dy):", "funcdef": "def"}, {"fullname": "linghe.facade.norm.GroupNormGateFunction", "modulename": "linghe.facade.norm", "qualname": "GroupNormGateFunction", "kind": "class", "doc": "
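RMSNormFunction.forward takes (x, weight, eps=1e-06). Assuming it implements the standard RMSNorm used in LLM blocks (inferred from the name rather than stated in the generated docs), the reference computation the fused kernel replaces is:

    import torch

    def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # y = x / sqrt(mean(x^2) + eps) * weight, normalized over the last dimension.
        variance = x.float().pow(2).mean(dim=-1, keepdim=True)
        return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight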

Base class to create custom autograd.Function.

\n\n

To create a custom autograd.Function, subclass this class and implement\nthe :meth:forward and :meth:backward static methods. Then, to use your custom\nop in the forward pass, call the class method apply. Do not call\n:meth:forward directly.

\n\n

To ensure correctness and best performance, make sure you are calling the\ncorrect methods on ctx and validating your backward function using\n:func:torch.autograd.gradcheck.

\n\n

See :ref:extending-autograd for more details on how to use this class.

\n\n

Examples::

\n\n
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)\n>>> class Exp(Function):\n>>>     @staticmethod\n>>>     def forward(ctx, i):\n>>>         result = i.exp()\n>>>         ctx.save_for_backward(result)\n>>>         return result\n>>>\n>>>     @staticmethod\n>>>     def backward(ctx, grad_output):\n>>>         result, = ctx.saved_tensors\n>>>         return grad_output * result\n>>>\n>>> # Use it by calling the apply method:\n>>> # xdoctest: +SKIP\n>>> output = Exp.apply(input)\n
\n", "bases": "torch.autograd.function.Function"}, {"fullname": "linghe.facade.norm.GroupNormGateFunction.forward", "modulename": "linghe.facade.norm", "qualname": "GroupNormGateFunction.forward", "kind": "function", "doc": "

Define the forward of the custom autograd Function.

\n\n

This function is to be overridden by all subclasses.\nThere are two ways to define forward:

\n\n

Usage 1 (Combined forward and ctx)::

\n\n
@staticmethod\ndef forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:\n    pass\n
\n\n
    \n
  • It must accept a context ctx as the first argument, followed by any\nnumber of arguments (tensors or other types).
  • \n
  • See :ref:combining-forward-context for more details
  • \n
\n\n

Usage 2 (Separate forward and ctx)::

\n\n
@staticmethod\ndef forward(*args: Any, **kwargs: Any) -> Any:\n    pass\n\n@staticmethod\ndef setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:\n    pass\n
\n\n
    \n
  • The forward no longer accepts a ctx argument.
  • \n
  • Instead, you must also override the :meth:torch.autograd.Function.setup_context\nstaticmethod to handle setting up the ctx object.\noutput is the output of the forward, inputs are a Tuple of inputs\nto the forward.
  • \n
  • See :ref:extending-autograd for more details
  • \n
\n\n

The context can be used to store arbitrary data that can be then\nretrieved during the backward pass. Tensors should not be stored\ndirectly on ctx (though this is not currently enforced for\nbackward compatibility). Instead, tensors should be saved either with\n:func:ctx.save_for_backward if they are intended to be used in\nbackward (equivalently, vjp) or :func:ctx.save_for_forward\nif they are intended to be used for in jvp.

\n", "signature": "(ctx, x, gate, weight, eps=1e-06, group_size=4):", "funcdef": "def"}, {"fullname": "linghe.facade.norm.GroupNormGateFunction.backward", "modulename": "linghe.facade.norm", "qualname": "GroupNormGateFunction.backward", "kind": "function", "doc": "

Define a formula for differentiating the operation with backward mode automatic differentiation.

\n\n

This function is to be overridden by all subclasses.\n(Defining this function is equivalent to defining the vjp function.)

\n\n

It must accept a context :attr:ctx as the first argument, followed by\nas many outputs as the :func:forward returned (None will be passed in\nfor non tensor outputs of the forward function),\nand it should return as many tensors, as there were inputs to\n:func:forward. Each argument is the gradient w.r.t the given output,\nand each returned value should be the gradient w.r.t. the\ncorresponding input. If an input is not a Tensor or is a Tensor not\nrequiring grads, you can just pass None as a gradient for that input.

\n\n

The context can be used to retrieve tensors saved during the forward\npass. It also has an attribute :attr:ctx.needs_input_grad as a tuple\nof booleans representing whether each input needs gradient. E.g.,\n:func:backward will have ctx.needs_input_grad[0] = True if the\nfirst input to :func:forward needs gradient computed w.r.t. the\noutput.

\n", "signature": "(ctx, dy):", "funcdef": "def"}, {"fullname": "linghe.facade.rope", "modulename": "linghe.facade.rope", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.facade.rope.QkNormHalfRopeFunction", "modulename": "linghe.facade.rope", "qualname": "QkNormHalfRopeFunction", "kind": "class", "doc": "

Base class to create custom autograd.Function.

\n\n

To create a custom autograd.Function, subclass this class and implement\nthe :meth:forward and :meth:backward static methods. Then, to use your custom\nop in the forward pass, call the class method apply. Do not call\n:meth:forward directly.

\n\n

To ensure correctness and best performance, make sure you are calling the\ncorrect methods on ctx and validating your backward function using\n:func:torch.autograd.gradcheck.

\n\n

See :ref:extending-autograd for more details on how to use this class.

\n\n

Examples::

\n\n
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)\n>>> class Exp(Function):\n>>>     @staticmethod\n>>>     def forward(ctx, i):\n>>>         result = i.exp()\n>>>         ctx.save_for_backward(result)\n>>>         return result\n>>>\n>>>     @staticmethod\n>>>     def backward(ctx, grad_output):\n>>>         result, = ctx.saved_tensors\n>>>         return grad_output * result\n>>>\n>>> # Use it by calling the apply method:\n>>> # xdoctest: +SKIP\n>>> output = Exp.apply(input)\n
\n", "bases": "torch.autograd.function.Function"}, {"fullname": "linghe.facade.rope.QkNormHalfRopeFunction.forward", "modulename": "linghe.facade.rope", "qualname": "QkNormHalfRopeFunction.forward", "kind": "function", "doc": "

Define the forward of the custom autograd Function.

\n\n

This function is to be overridden by all subclasses.\nThere are two ways to define forward:

\n\n

Usage 1 (Combined forward and ctx)::

\n\n
@staticmethod\ndef forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:\n    pass\n
\n\n
    \n
  • It must accept a context ctx as the first argument, followed by any\nnumber of arguments (tensors or other types).
  • \n
  • See :ref:combining-forward-context for more details
  • \n
\n\n

Usage 2 (Separate forward and ctx)::

\n\n
@staticmethod\ndef forward(*args: Any, **kwargs: Any) -> Any:\n    pass\n\n@staticmethod\ndef setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:\n    pass\n
\n\n
    \n
  • The forward no longer accepts a ctx argument.
  • \n
  • Instead, you must also override the :meth:torch.autograd.Function.setup_context\nstaticmethod to handle setting up the ctx object.\noutput is the output of the forward, inputs are a Tuple of inputs\nto the forward.
  • \n
  • See :ref:extending-autograd for more details
  • \n
\n\n

The context can be used to store arbitrary data that can be then\nretrieved during the backward pass. Tensors should not be stored\ndirectly on ctx (though this is not currently enforced for\nbackward compatibility). Instead, tensors should be saved either with\n:func:ctx.save_for_backward if they are intended to be used in\nbackward (equivalently, vjp) or :func:ctx.save_for_forward\nif they are intended to be used for in jvp.

\n", "signature": "(ctx, qkv, q_norm_weight, k_norm_weight, freqs, H=32, h=4, eps=1e-06):", "funcdef": "def"}, {"fullname": "linghe.facade.rope.QkNormHalfRopeFunction.backward", "modulename": "linghe.facade.rope", "qualname": "QkNormHalfRopeFunction.backward", "kind": "function", "doc": "

Define a formula for differentiating the operation with backward mode automatic differentiation.

\n\n

This function is to be overridden by all subclasses.\n(Defining this function is equivalent to defining the vjp function.)

\n\n

It must accept a context :attr:ctx as the first argument, followed by\nas many outputs as the :func:forward returned (None will be passed in\nfor non tensor outputs of the forward function),\nand it should return as many tensors, as there were inputs to\n:func:forward. Each argument is the gradient w.r.t the given output,\nand each returned value should be the gradient w.r.t. the\ncorresponding input. If an input is not a Tensor or is a Tensor not\nrequiring grads, you can just pass None as a gradient for that input.

\n\n

The context can be used to retrieve tensors saved during the forward\npass. It also has an attribute :attr:ctx.needs_input_grad as a tuple\nof booleans representing whether each input needs gradient. E.g.,\n:func:backward will have ctx.needs_input_grad[0] = True if the\nfirst input to :func:forward needs gradient computed w.r.t. the\noutput.

\n", "signature": "(ctx, grad_q, grad_k, grad_v):", "funcdef": "def"}, {"fullname": "linghe.facade.transpose", "modulename": "linghe.facade.transpose", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.facade.transpose.TransposeDim01Function", "modulename": "linghe.facade.transpose", "qualname": "TransposeDim01Function", "kind": "class", "doc": "

Base class to create custom autograd.Function.

\n\n

To create a custom autograd.Function, subclass this class and implement\nthe :meth:forward and :meth:backward static methods. Then, to use your custom\nop in the forward pass, call the class method apply. Do not call\n:meth:forward directly.

\n\n

To ensure correctness and best performance, make sure you are calling the\ncorrect methods on ctx and validating your backward function using\n:func:torch.autograd.gradcheck.

\n\n

See :ref:extending-autograd for more details on how to use this class.

\n\n

Examples::

\n\n
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)\n>>> class Exp(Function):\n>>>     @staticmethod\n>>>     def forward(ctx, i):\n>>>         result = i.exp()\n>>>         ctx.save_for_backward(result)\n>>>         return result\n>>>\n>>>     @staticmethod\n>>>     def backward(ctx, grad_output):\n>>>         result, = ctx.saved_tensors\n>>>         return grad_output * result\n>>>\n>>> # Use it by calling the apply method:\n>>> # xdoctest: +SKIP\n>>> output = Exp.apply(input)\n
\n", "bases": "torch.autograd.function.Function"}, {"fullname": "linghe.facade.transpose.TransposeDim01Function.forward", "modulename": "linghe.facade.transpose", "qualname": "TransposeDim01Function.forward", "kind": "function", "doc": "

Define the forward of the custom autograd Function.

\n\n

This function is to be overridden by all subclasses.\nThere are two ways to define forward:

\n\n

Usage 1 (Combined forward and ctx)::

\n\n
@staticmethod\ndef forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:\n    pass\n
\n\n
    \n
  • It must accept a context ctx as the first argument, followed by any\nnumber of arguments (tensors or other types).
  • \n
  • See :ref:combining-forward-context for more details
  • \n
\n\n

Usage 2 (Separate forward and ctx)::

\n\n
@staticmethod\ndef forward(*args: Any, **kwargs: Any) -> Any:\n    pass\n\n@staticmethod\ndef setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:\n    pass\n
\n\n
    \n
  • The forward no longer accepts a ctx argument.
  • \n
  • Instead, you must also override the :meth:torch.autograd.Function.setup_context\nstaticmethod to handle setting up the ctx object.\noutput is the output of the forward, inputs are a Tuple of inputs\nto the forward.
  • \n
  • See :ref:extending-autograd for more details
  • \n
\n\n

The context can be used to store arbitrary data that can be then\nretrieved during the backward pass. Tensors should not be stored\ndirectly on ctx (though this is not currently enforced for\nbackward compatibility). Instead, tensors should be saved either with\n:func:ctx.save_for_backward if they are intended to be used in\nbackward (equivalently, vjp) or :func:ctx.save_for_forward\nif they are intended to be used for in jvp.

\n", "signature": "(ctx, x):", "funcdef": "def"}, {"fullname": "linghe.facade.transpose.TransposeDim01Function.backward", "modulename": "linghe.facade.transpose", "qualname": "TransposeDim01Function.backward", "kind": "function", "doc": "

Define a formula for differentiating the operation with backward mode automatic differentiation.

\n\n

This function is to be overridden by all subclasses.\n(Defining this function is equivalent to defining the vjp function.)

\n\n

It must accept a context :attr:ctx as the first argument, followed by\nas many outputs as the :func:forward returned (None will be passed in\nfor non tensor outputs of the forward function),\nand it should return as many tensors, as there were inputs to\n:func:forward. Each argument is the gradient w.r.t the given output,\nand each returned value should be the gradient w.r.t. the\ncorresponding input. If an input is not a Tensor or is a Tensor not\nrequiring grads, you can just pass None as a gradient for that input.

\n\n

The context can be used to retrieve tensors saved during the forward\npass. It also has an attribute :attr:ctx.needs_input_grad as a tuple\nof booleans representing whether each input needs gradient. E.g.,\n:func:backward will have ctx.needs_input_grad[0] = True if the\nfirst input to :func:forward needs gradient computed w.r.t. the\noutput.

\n", "signature": "(ctx, grad_output):", "funcdef": "def"}, {"fullname": "linghe.gemm", "modulename": "linghe.gemm", "kind": "module", "doc": "

\n"}, {"fullname": "linghe.gemm.fp32_gemm", "modulename": "linghe.gemm.fp32_gemm", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.gemm.fp32_gemm.fp32_gemm_kernel", "modulename": "linghe.gemm.fp32_gemm", "qualname": "fp32_gemm_kernel", "kind": "function", "doc": "

\n", "signature": "(\ta_ptr,\tb_ptr,\tc_ptr,\tM,\tN: int,\tK: int,\tBLOCK_SIZE_K: int,\tBLOCK_SIZE_M: int,\tBLOCK_SIZE_N: int):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.triton_fp32_gemm", "modulename": "linghe.gemm.fp32_gemm", "qualname": "triton_fp32_gemm", "kind": "function", "doc": "

\n", "signature": "(a: torch.Tensor, b: torch.Tensor):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.scaled_fp32_gemm_kernel", "modulename": "linghe.gemm.fp32_gemm", "qualname": "scaled_fp32_gemm_kernel", "kind": "function", "doc": "

\n", "signature": "(\ta_ptr,\tb_ptr,\tscale_ptr,\tc_ptr,\tM,\tN: int,\tK: int,\tBLOCK_SIZE_K: int,\tBLOCK_SIZE_M: int,\tBLOCK_SIZE_N: int):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.triton_scaled_fp32_gemm", "modulename": "linghe.gemm.fp32_gemm", "qualname": "triton_scaled_fp32_gemm", "kind": "function", "doc": "

\n", "signature": "(a: torch.Tensor, b: torch.Tensor, scale: torch.Tensor):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.fp32_gemm_for_backward_kernel", "modulename": "linghe.gemm.fp32_gemm", "qualname": "fp32_gemm_for_backward_kernel", "kind": "function", "doc": "

\n", "signature": "(\ta_ptr,\tb_ptr,\tc_ptr,\tM,\tN: int,\tK: int,\tACCUM: int,\tBLOCK_SIZE_K: int,\tBLOCK_SIZE_M: int,\tBLOCK_SIZE_N: int):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.triton_fp32_gemm_for_backward", "modulename": "linghe.gemm.fp32_gemm", "qualname": "triton_fp32_gemm_for_backward", "kind": "function", "doc": "

\n", "signature": "(\ta: torch.Tensor,\tb: torch.Tensor,\tc: Optional[torch.Tensor] = None,\taccum=False):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.fp32_gemm_for_update_kernel", "modulename": "linghe.gemm.fp32_gemm", "qualname": "fp32_gemm_for_update_kernel", "kind": "function", "doc": "

\n", "signature": "(\ta_ptr,\tb_ptr,\tc_ptr,\tM,\tN: int,\tK: int,\tBLOCK_SIZE_K: int,\tBLOCK_SIZE_M: int,\tBLOCK_SIZE_N: int):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.triton_fp32_gemm_for_update", "modulename": "linghe.gemm.fp32_gemm", "qualname": "triton_fp32_gemm_for_update", "kind": "function", "doc": "

\n", "signature": "(a: torch.Tensor, b: torch.Tensor):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.scaled_fp32_gemm_for_update_kernel", "modulename": "linghe.gemm.fp32_gemm", "qualname": "scaled_fp32_gemm_for_update_kernel", "kind": "function", "doc": "

\n", "signature": "(\ta_ptr,\tb_ptr,\tscale_ptr,\tc_ptr,\tM,\tN: int,\tK: int,\tBLOCK_SIZE_K: int,\tBLOCK_SIZE_M: int,\tBLOCK_SIZE_N: int):", "funcdef": "def"}, {"fullname": "linghe.gemm.fp32_gemm.triton_scaled_fp32_gemm_for_update", "modulename": "linghe.gemm.fp32_gemm", "qualname": "triton_scaled_fp32_gemm_for_update", "kind": "function", "doc": "

\n", "signature": "(a: torch.Tensor, b: torch.Tensor, scale: torch.Tensor):", "funcdef": "def"}, {"fullname": "linghe.quant", "modulename": "linghe.quant", "kind": "module", "doc": "
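The linghe.gemm.fp32_gemm helpers are indexed with signatures only. Assuming triton_fp32_gemm(a, b) computes an (M, K) @ (K, N) product with fp32 accumulation (inferred from the name, not documented above), it could be sanity-checked against torch.matmul like this:

    import torch
    from linghe.gemm.fp32_gemm import triton_fp32_gemm  # import path taken from the index above

    a = torch.randn(4096, 2048, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(2048, 256, device="cuda", dtype=torch.bfloat16)

    out = triton_fp32_gemm(a, b)
    ref = torch.matmul(a.float(), b.float())
    torch.testing.assert_close(out.float(), ref, rtol=1e-3, atol=1e-3)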

\n"}, {"fullname": "linghe.quant.block", "modulename": "linghe.quant.block", "kind": "module", "doc": "

\n"}, {"fullname": "linghe.quant.block.block", "modulename": "linghe.quant.block.block", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.quant.block.block.block_quant_kernel", "modulename": "linghe.quant.block.block", "qualname": "block_quant_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, y_ptr, s_ptr, M, N, BLOCK_SIZE: int, ROUND: int):", "funcdef": "def"}, {"fullname": "linghe.quant.block.block.block_quant", "modulename": "linghe.quant.block.block", "qualname": "block_quant", "kind": "function", "doc": "

\n", "signature": "(x, block_size=128, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.quant.block.group", "modulename": "linghe.quant.block.group", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.quant.block.group.group_quant_kernel", "modulename": "linghe.quant.block.group", "qualname": "group_quant_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, y_ptr, s_ptr, N, BLOCK_SIZE: int, K: int, ROUND: int):", "funcdef": "def"}, {"fullname": "linghe.quant.block.group.triton_group_quant", "modulename": "linghe.quant.block.group", "qualname": "triton_group_quant", "kind": "function", "doc": "

\n", "signature": "(x, dtype=torch.float8_e4m3fn, group_size=128, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.quant.block.group.persist_group_quant_kernel", "modulename": "linghe.quant.block.group", "qualname": "persist_group_quant_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, y_ptr, s_ptr, N, BLOCK_SIZE: int, B: int, K: int, ROUND: int):", "funcdef": "def"}, {"fullname": "linghe.quant.block.group.triton_persist_group_quant", "modulename": "linghe.quant.block.group", "qualname": "triton_persist_group_quant", "kind": "function", "doc": "

\n", "signature": "(x, dtype=torch.float8_e4m3fn, group_size=128, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.quant.channel", "modulename": "linghe.quant.channel", "kind": "module", "doc": "
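triton_group_quant and triton_persist_group_quant take (x, dtype=torch.float8_e4m3fn, group_size=128, round_scale=False). A plain-PyTorch reference for 1x128 groupwise FP8 quantization, assuming per-group scales of amax/448 and power-of-two scales when round_scale=True (both inferred, not documented above), is:

    import torch

    FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

    def group_quant_reference(x: torch.Tensor, group_size: int = 128, round_scale: bool = False):
        # Quantize each contiguous group of `group_size` values along the last dimension.
        M, N = x.shape
        g = x.float().view(M, N // group_size, group_size)
        scale = (g.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp_min(1e-12)
        if round_scale:
            scale = torch.exp2(torch.ceil(torch.log2(scale)))  # round scales up to a power of two
        q = (g / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return q.view(M, N), scale.view(M, N // group_size)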

\n"}, {"fullname": "linghe.quant.channel.channel", "modulename": "linghe.quant.channel.channel", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.quant.channel.channel.row_quant_kernel", "modulename": "linghe.quant.channel.channel", "qualname": "row_quant_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, q_ptr, s_ptr, M, N, BLOCK_SIZE: int, ROUND: int):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.triton_row_quant", "modulename": "linghe.quant.channel.channel", "qualname": "triton_row_quant", "kind": "function", "doc": "

\n", "signature": "(x, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.deprecated_tokenwise_row_quant_kernel", "modulename": "linghe.quant.channel.channel", "qualname": "deprecated_tokenwise_row_quant_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, out_ptr, scale_ptr, M, T: int, N: int, ROUND: int):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.triton_deprecated_tokenwise_row_quant", "modulename": "linghe.quant.channel.channel", "qualname": "triton_deprecated_tokenwise_row_quant", "kind": "function", "doc": "

\n", "signature": "(x, out=None, scale=None, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.tokenwise_row_quant_kernel", "modulename": "linghe.quant.channel.channel", "qualname": "tokenwise_row_quant_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, out_ptr, scale_ptr, N: int, ROUND: int):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.triton_tokenwise_row_quant", "modulename": "linghe.quant.channel.channel", "qualname": "triton_tokenwise_row_quant", "kind": "function", "doc": "

\n", "signature": "(x, out=None, scale=None, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.transpose_row_quant_kernel", "modulename": "linghe.quant.channel.channel", "qualname": "transpose_row_quant_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, q_ptr, s_ptr, M, N, H: int, W: int, ROUND: int):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.triton_transpose_row_quant", "modulename": "linghe.quant.channel.channel", "qualname": "triton_transpose_row_quant", "kind": "function", "doc": "

\n", "signature": "(x, side=0, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.triton_channel_quant_nt", "modulename": "linghe.quant.channel.channel", "qualname": "triton_channel_quant_nt", "kind": "function", "doc": "

\n", "signature": "(x, w):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.triton_channel_quant_nn", "modulename": "linghe.quant.channel.channel", "qualname": "triton_channel_quant_nn", "kind": "function", "doc": "

\n", "signature": "(y, w):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.triton_channel_quant_tn", "modulename": "linghe.quant.channel.channel", "qualname": "triton_channel_quant_tn", "kind": "function", "doc": "

\n", "signature": "(y, x):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.channel_quant_forward", "modulename": "linghe.quant.channel.channel", "qualname": "channel_quant_forward", "kind": "function", "doc": "

\n", "signature": "(x, w):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.channel_quant_backward", "modulename": "linghe.quant.channel.channel", "qualname": "channel_quant_backward", "kind": "function", "doc": "

\n", "signature": "(y, w):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.channel_quant_update", "modulename": "linghe.quant.channel.channel", "qualname": "channel_quant_update", "kind": "function", "doc": "

\n", "signature": "(y, x):", "funcdef": "def"}, {"fullname": "linghe.quant.channel.channel.fp8_channel_f_and_b", "modulename": "linghe.quant.channel.channel", "qualname": "fp8_channel_f_and_b", "kind": "function", "doc": "

\n", "signature": "(x, w, y):", "funcdef": "def"}, {"fullname": "linghe.utils", "modulename": "linghe.utils", "kind": "module", "doc": "

\n"}, {"fullname": "linghe.utils.add", "modulename": "linghe.utils.add", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.add.inplace_add_kernel", "modulename": "linghe.utils.add", "qualname": "inplace_add_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, y_ptr, M, N, H: int, W: int, EVEN: int, ACCUM: int):", "funcdef": "def"}, {"fullname": "linghe.utils.add.triton_inplace_add", "modulename": "linghe.utils.add", "qualname": "triton_inplace_add", "kind": "function", "doc": "

In-place add of y into x.\nArgs:\n x: Tensor\n y: Tensor\n accum: whether to accumulate y into x

\n\n

Returns: x += y if accum=True else x.copy_(y)

\n", "signature": "(x: torch.Tensor, y: torch.Tensor, accum: bool = True):", "funcdef": "def"}, {"fullname": "linghe.utils.dot", "modulename": "linghe.utils.dot", "kind": "module", "doc": "
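Per the docstring above, triton_inplace_add(x, y, accum=True) accumulates y into x, and with accum=False overwrites x with y. A minimal usage sketch (CUDA tensors assumed, since the kernel is written in Triton):

    import torch
    from linghe.utils.add import triton_inplace_add  # import path taken from the index above

    x = torch.randn(8192, 2048, device="cuda")
    y = torch.randn(8192, 2048, device="cuda")

    triton_inplace_add(x, y, accum=True)   # x now holds the old x plus y
    triton_inplace_add(x, y, accum=False)  # x now holds a copy of y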

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.dot.dot_kernel", "modulename": "linghe.utils.dot", "qualname": "dot_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, y_ptr, sum_ptr, M, N, H: int, W: int):", "funcdef": "def"}, {"fullname": "linghe.utils.dot.triton_dot", "modulename": "linghe.utils.dot", "qualname": "triton_dot", "kind": "function", "doc": "

\n", "signature": "(x, y):", "funcdef": "def"}, {"fullname": "linghe.utils.dot.mix_precise_dot_kernel", "modulename": "linghe.utils.dot", "qualname": "mix_precise_dot_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tq_ptr,\tsum_ptr,\tsmooth_scale_ptr,\tquant_scale_ptr,\tM,\tN,\tH: int,\tW: int):", "funcdef": "def"}, {"fullname": "linghe.utils.dot.triton_mix_precise_dot", "modulename": "linghe.utils.dot", "qualname": "triton_mix_precise_dot", "kind": "function", "doc": "

\n", "signature": "(x, q, smooth_scale, quant_scale, reverse=False):", "funcdef": "def"}, {"fullname": "linghe.utils.gather", "modulename": "linghe.utils.gather", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.gather.block_count_kernel", "modulename": "linghe.utils.gather", "qualname": "block_count_kernel", "kind": "function", "doc": "

\n", "signature": "(map_ptr, count_ptr, M, B, T: int, b: int, E: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.make_row_id_map_kernel", "modulename": "linghe.utils.gather", "qualname": "make_row_id_map_kernel", "kind": "function", "doc": "

\n", "signature": "(map_ptr, count_ptr, output_ptr, M, B, P, T: int, b: int, E: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_make_row_id_map", "modulename": "linghe.utils.gather", "qualname": "triton_make_row_id_map", "kind": "function", "doc": "

\n", "signature": "(routing_map: torch.Tensor, multiple_of: int = 1):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.make_row_id_map_and_indices_kernel", "modulename": "linghe.utils.gather", "qualname": "make_row_id_map_and_indices_kernel", "kind": "function", "doc": "

\n", "signature": "(\tmap_ptr,\tcount_ptr,\trow_map_ptr,\trow_indices_ptr,\tM,\tB,\tP,\tT: int,\tb: int,\tE: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_make_row_id_map_and_indices", "modulename": "linghe.utils.gather", "qualname": "triton_make_row_id_map_and_indices", "kind": "function", "doc": "

\n", "signature": "(routing_map: torch.Tensor, num_out_tokens: int, multiple_of: int = 1):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.index_select_kernel", "modulename": "linghe.utils.gather", "qualname": "index_select_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tout_ptr,\tscale_ptr,\tscale_out_ptr,\tindex_ptr,\tM,\tT,\tN: int,\tSCALE: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_index_select", "modulename": "linghe.utils.gather", "qualname": "triton_index_select", "kind": "function", "doc": "

\n", "signature": "(x, indices, scale=None, out=None, scale_out=None):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.permute_with_mask_map_kernel", "modulename": "linghe.utils.gather", "qualname": "permute_with_mask_map_kernel", "kind": "function", "doc": "

\n", "signature": "(\tdata_ptr,\tscale_ptr,\tprobs_ptr,\tmask_map_ptr,\toutput_data_ptr,\toutput_scale_ptr,\toutput_probs_ptr,\tnum_experts: int,\tN: int,\ths: int,\tSCALE: int,\tPROB: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.fill_padded_token_with_zero_kernel", "modulename": "linghe.utils.gather", "qualname": "fill_padded_token_with_zero_kernel", "kind": "function", "doc": "

\n", "signature": "(\tdata_ptr,\tscale_ptr,\tprobs_ptr,\tmax_indices_ptr,\ttoken_per_expert_ptr,\tN: int,\ths: int,\tSCALE: int,\tPROB: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_permute_with_mask_map", "modulename": "linghe.utils.gather", "qualname": "triton_permute_with_mask_map", "kind": "function", "doc": "

\n", "signature": "(\tinp: torch.Tensor,\tscale: torch.Tensor,\tprobs: torch.Tensor,\trow_id_map: torch.Tensor,\tnum_out_tokens: int,\tcontiguous: bool = True,\ttokens_per_expert: Optional[torch.Tensor] = None):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.batch_smooth_transpose_smooth_permute_kernel", "modulename": "linghe.utils.gather", "qualname": "batch_smooth_transpose_smooth_permute_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tscale_ptr,\toss_ptr,\tss_ptr,\tindex_ptr,\tcount_ptr,\taccum_ptr,\tq_ptr,\tqs_ptr,\tN: int,\tE: int,\tH: int,\tW: int,\tSMOOTHED: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_batch_transpose_smooth_permute_with_indices", "modulename": "linghe.utils.gather", "qualname": "triton_batch_transpose_smooth_permute_with_indices", "kind": "function", "doc": "

\n", "signature": "(\tx,\tscale,\torg_smooth_scale,\tsmooth_scales,\tindices,\ttoken_count_per_expert,\tsplits,\tx_q=None,\tx_scale=None,\tround_scale=False):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.smooth_weighted_permute_with_indices_kernel", "modulename": "linghe.utils.gather", "qualname": "smooth_weighted_permute_with_indices_kernel", "kind": "function", "doc": "

\n", "signature": "(\tgrads_ptr,\ttokens_ptr,\tq_ptr,\tss_ptr,\tqs_ptr,\tcount_ptr,\taccum_ptr,\tindex_ptr,\tsum_ptr,\tM,\tN: int,\tREVERSE: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_smooth_weighted_permute_with_indices", "modulename": "linghe.utils.gather", "qualname": "triton_smooth_weighted_permute_with_indices", "kind": "function", "doc": "

\n", "signature": "(\tgrads,\ttokens,\tsmooth_scales,\ttoken_count_per_expert,\tindices,\tx_q=None,\tx_scale=None,\tx_sum=None,\treverse=False,\tround_scale=False):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.smooth_permute_with_indices_kernel", "modulename": "linghe.utils.gather", "qualname": "smooth_permute_with_indices_kernel", "kind": "function", "doc": "

\n", "signature": "(\tgrads_data_ptr,\tgrads_scale_ptr,\tq_ptr,\tss_ptr,\tqs_ptr,\tcount_ptr,\taccum_ptr,\tindex_ptr,\tN: int,\ths: int,\tREVERSE: int,\tROUND: int,\tGROUP: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_smooth_permute_with_indices", "modulename": "linghe.utils.gather", "qualname": "triton_smooth_permute_with_indices", "kind": "function", "doc": "

\n", "signature": "(\tgrad_data,\tgrad_scale,\tsmooth_scales,\ttoken_count_per_expert,\tindices,\tx_q=None,\tx_scale=None,\treverse=False,\tround_scale=False):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.smooth_permute_with_mask_map_kernel", "modulename": "linghe.utils.gather", "qualname": "smooth_permute_with_mask_map_kernel", "kind": "function", "doc": "

\n", "signature": "(\tgrads_data_ptr,\tquant_data_ptr,\tmask_map_ptr,\tgrads_scale_ptr,\tsmooth_scale_ptr,\tquant_scale_ptr,\tM,\tT,\tN: int,\ths: int,\tREVERSE: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_smooth_permute_with_mask_map", "modulename": "linghe.utils.gather", "qualname": "triton_smooth_permute_with_mask_map", "kind": "function", "doc": "

\n", "signature": "(\tinp: torch.Tensor,\trow_id_map: torch.Tensor,\tscale: torch.Tensor,\tnum_tokens: int,\tnum_experts: int,\tnum_out_tokens: int,\thidden_size: int,\tsmooth_scales: torch.Tensor,\treverse=True,\tround_scale=False):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.deprecated_smooth_permute_with_mask_map_kernel", "modulename": "linghe.utils.gather", "qualname": "deprecated_smooth_permute_with_mask_map_kernel", "kind": "function", "doc": "

\n", "signature": "(\tgrads_data_ptr,\tquant_data_ptr,\tmask_map_ptr,\tsmooth_scale_ptr,\tquant_scale_ptr,\tM,\tT,\tN: int,\tREVERSE: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.gather.triton_deprecated_smooth_permute_with_mask_map", "modulename": "linghe.utils.gather", "qualname": "triton_deprecated_smooth_permute_with_mask_map", "kind": "function", "doc": "

\n", "signature": "(\tinp: torch.Tensor,\trow_id_map: torch.Tensor,\tnum_tokens: int,\tnum_experts: int,\tnum_out_tokens: int,\thidden_size: int,\tsmooth_scales: torch.Tensor,\treverse=True,\tround_scale=False):", "funcdef": "def"}, {"fullname": "linghe.utils.loss", "modulename": "linghe.utils.loss", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.loss.softmax_cross_entropy_forward_kernel", "modulename": "linghe.utils.loss", "qualname": "softmax_cross_entropy_forward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tlogit_ptr,\tlabel_ptr,\tloss_ptr,\tsum_exp_ptr,\tmax_logit_ptr,\tN,\tB: int):", "funcdef": "def"}, {"fullname": "linghe.utils.loss.triton_softmax_cross_entropy_forward", "modulename": "linghe.utils.loss", "qualname": "triton_softmax_cross_entropy_forward", "kind": "function", "doc": "

\n", "signature": "(logits, labels):", "funcdef": "def"}, {"fullname": "linghe.utils.loss.softmax_cross_entropy_backward_kernel", "modulename": "linghe.utils.loss", "qualname": "softmax_cross_entropy_backward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tlogit_ptr,\tlabel_ptr,\tsum_exp_ptr,\tmax_logit_ptr,\tinput_grad_ptr,\toutput_grad_ptr,\tN,\tB: int):", "funcdef": "def"}, {"fullname": "linghe.utils.loss.triton_softmax_cross_entropy_backward", "modulename": "linghe.utils.loss", "qualname": "triton_softmax_cross_entropy_backward", "kind": "function", "doc": "

\n", "signature": "(logits, labels, sum_exp, max_logit, input_grad, output_grad=None):", "funcdef": "def"}, {"fullname": "linghe.utils.norm", "modulename": "linghe.utils.norm", "kind": "module", "doc": "

\n"}, {"fullname": "linghe.utils.norm.rms_norm_forward_kernel", "modulename": "linghe.utils.norm", "qualname": "rms_norm_forward_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, weight_ptr, out_ptr, eps, M, T, N: int, W: int):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.triton_rms_norm_forward", "modulename": "linghe.utils.norm", "qualname": "triton_rms_norm_forward", "kind": "function", "doc": "

\n", "signature": "(x, weight, eps=1e-06, out=None):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.rms_norm_backward_kernel", "modulename": "linghe.utils.norm", "qualname": "rms_norm_backward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tgrad_output_ptr,\tx_ptr,\tw_ptr,\tdx_ptr,\tdw_ptr,\teps,\tM,\tT,\tN: int,\tW: int):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.triton_rms_norm_backward", "modulename": "linghe.utils.norm", "qualname": "triton_rms_norm_backward", "kind": "function", "doc": "

\n", "signature": "(grad_output, x, w, eps=1e-06):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.rms_norm_and_block_quant_forward_kernel", "modulename": "linghe.utils.norm", "qualname": "rms_norm_and_block_quant_forward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tweight_ptr,\tout_ptr,\tscale_ptr,\ttranspose_output_ptr,\ttranspose_scale_ptr,\trms_ptr,\teps,\tM,\tT: int,\tN: int,\tnb: int,\tW: int,\tH: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.rms_norm_and_block_quant_forward_n_kernel", "modulename": "linghe.utils.norm", "qualname": "rms_norm_and_block_quant_forward_n_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tweight_ptr,\tout_ptr,\tscale_ptr,\trms_ptr,\teps,\tM: int,\tT: int,\tN: int,\tnb: int,\tW: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.rms_norm_and_block_quant_forward_t_kernel", "modulename": "linghe.utils.norm", "qualname": "rms_norm_and_block_quant_forward_t_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tweight_ptr,\ttranspose_output_ptr,\ttranspose_scale_ptr,\trms_ptr,\tM,\tN,\tW: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.triton_rms_norm_and_block_quant_forward", "modulename": "linghe.utils.norm", "qualname": "triton_rms_norm_and_block_quant_forward", "kind": "function", "doc": "

Fused RMSNorm forward and block quantization.\nArgs:\n x: Input tensor, shape [M, N]\n weight: RMSNorm weight, shape [N]\n eps: epsilon value for L2 normalization.\n out: output of quantization data\n scale: output of quantization scale.\n rms: output of rms\n round_scale: Set whether to force power of 2 scales.\n output_mode: one of {0, 1, 2}.\n 0: only output non-transpose tensor\n 1: only output transposed tensor\n 2: return both\nReturns:\n out: quantization data\n scale: quantization scale\n rms: Reciprocal of the root mean square of the input calculated over the last dimension.\n transpose_output: quantization data of transposed gradient\n transpose_scale: quantization scale of transposed gradient

\n", "signature": "(\tx: torch.Tensor,\tweight: torch.Tensor,\teps: float = 1e-06,\tout: Optional[torch.Tensor] = None,\tscale: Optional[torch.Tensor] = None,\trms: Optional[torch.Tensor] = None,\tround_scale: bool = False,\toutput_mode: int = 2):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.group_norm_gate_forward_kernel", "modulename": "linghe.utils.norm", "qualname": "group_norm_gate_forward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tgate_ptr,\tweight_ptr,\tout_ptr,\teps,\tbs,\tlength,\tDIM: int,\tD: int,\tGROUP_SIZE: int):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.triton_group_norm_gate_forward", "modulename": "linghe.utils.norm", "qualname": "triton_group_norm_gate_forward", "kind": "function", "doc": "

norm and gate in linear attention\nArgs:\n x:\n gate:\n weight:\n eps:\n group_size:

\n\n

Returns:

\n", "signature": "(x: torch.Tensor, gate, weight, eps=1e-06, group_size=4):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.group_rms_gate_backward_kernel", "modulename": "linghe.utils.norm", "qualname": "group_rms_gate_backward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tgrad_output_ptr,\tx_ptr,\tgate_ptr,\tw_ptr,\tdx_ptr,\tdg_ptr,\tdw_ptr,\teps,\tbs,\tlength,\tDIM: int,\tD: int,\tGROUP_SIZE: int,\tT: int):", "funcdef": "def"}, {"fullname": "linghe.utils.norm.triton_group_norm_gate_backward", "modulename": "linghe.utils.norm", "qualname": "triton_group_norm_gate_backward", "kind": "function", "doc": "

\n", "signature": "(grad_output, x, gate, weight, eps=1e-06, group_size=4):", "funcdef": "def"}, {"fullname": "linghe.utils.rearange", "modulename": "linghe.utils.rearange", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.rearange.split_and_cat_kernel", "modulename": "linghe.utils.rearange", "qualname": "split_and_cat_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\ty_ptr,\tscale_ptr,\tscale_output_ptr,\tcount_ptr,\taccum_ptr,\trev_accum_ptr,\tindex_ptr,\tM,\tN: int,\tSCALE: int,\tK: int):", "funcdef": "def"}, {"fullname": "linghe.utils.rearange.triton_split_and_cat", "modulename": "linghe.utils.rearange", "qualname": "triton_split_and_cat", "kind": "function", "doc": "

\n", "signature": "(x, counts, indices, scales=None):", "funcdef": "def"}, {"fullname": "linghe.utils.reduce", "modulename": "linghe.utils.reduce", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.reduce.abs_max_kernel", "modulename": "linghe.utils.reduce", "qualname": "abs_max_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tscale_ptr,\tsmooth_scale_ptr,\toutput_ptr,\tmin_value,\tM,\tN,\tH: int,\tW: int,\tEVEN: int,\tQUANTIZED: int):", "funcdef": "def"}, {"fullname": "linghe.utils.reduce.triton_abs_max", "modulename": "linghe.utils.reduce", "qualname": "triton_abs_max", "kind": "function", "doc": "

\n", "signature": "(x, scale=None, smooth_scale=None, min_value=1e-30, axis=0):", "funcdef": "def"}, {"fullname": "linghe.utils.reduce.batch_count_zero_kernel", "modulename": "linghe.utils.reduce", "qualname": "batch_count_zero_kernel", "kind": "function", "doc": "

\n", "signature": "(input_ptrs, size_ptr, count_ptr, B: int):", "funcdef": "def"}, {"fullname": "linghe.utils.reduce.triton_batch_count_zero", "modulename": "linghe.utils.reduce", "qualname": "triton_batch_count_zero", "kind": "function", "doc": "

\n", "signature": "(xs):", "funcdef": "def"}, {"fullname": "linghe.utils.reduce.batch_sum_with_ord_kernel", "modulename": "linghe.utils.reduce", "qualname": "batch_sum_with_ord_kernel", "kind": "function", "doc": "

\n", "signature": "(input_ptrs, size_ptr, count_ptr, B: int, ORD: int):", "funcdef": "def"}, {"fullname": "linghe.utils.reduce.triton_batch_sum_with_ord", "modulename": "linghe.utils.reduce", "qualname": "triton_batch_sum_with_ord", "kind": "function", "doc": "

\n", "signature": "(xs, ord=2):", "funcdef": "def"}, {"fullname": "linghe.utils.rope", "modulename": "linghe.utils.rope", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.rope.half_rope_forward_kernel", "modulename": "linghe.utils.rope", "qualname": "half_rope_forward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tq_ptr,\tk_ptr,\tfreqs_ptr,\tqo_ptr,\tko_ptr,\tB,\tq_stride,\tk_stride,\tH: int,\th: int,\tD: int,\td: int):", "funcdef": "def"}, {"fullname": "linghe.utils.rope.triton_half_rope_forward", "modulename": "linghe.utils.rope", "qualname": "triton_half_rope_forward", "kind": "function", "doc": "

\n", "signature": "(q, k, freqs):", "funcdef": "def"}, {"fullname": "linghe.utils.rope.half_rope_backward_kernel", "modulename": "linghe.utils.rope", "qualname": "half_rope_backward_kernel", "kind": "function", "doc": "

\n", "signature": "(q_ptr, k_ptr, freqs_ptr, B, H: int, h: int, D: int, d: int):", "funcdef": "def"}, {"fullname": "linghe.utils.rope.triton_half_rope_backward", "modulename": "linghe.utils.rope", "qualname": "triton_half_rope_backward", "kind": "function", "doc": "

\n", "signature": "(q_grad, k_grad, freqs, inplace=False):", "funcdef": "def"}, {"fullname": "linghe.utils.rope.qk_norm_and_half_rope_forward_kernel", "modulename": "linghe.utils.rope", "qualname": "qk_norm_and_half_rope_forward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tqkv_ptr,\tq_norm_weight_ptr,\tk_norm_weight_ptr,\tfreqs_ptr,\tqo_ptr,\tko_ptr,\tvo_ptr,\tB,\tstride,\teps,\tH: int,\th: int,\tD: int,\td: int,\tinterleave: int):", "funcdef": "def"}, {"fullname": "linghe.utils.rope.triton_qk_norm_and_half_rope_forward", "modulename": "linghe.utils.rope", "qualname": "triton_qk_norm_and_half_rope_forward", "kind": "function", "doc": "

\n", "signature": "(\tqkv,\tq_norm_weight,\tk_norm_weight,\tfreqs,\tH=32,\th=4,\teps=1e-06,\tinterleave=True,\ttranspose=False):", "funcdef": "def"}, {"fullname": "linghe.utils.rope.qk_norm_and_half_rope_backward_kernel", "modulename": "linghe.utils.rope", "qualname": "qk_norm_and_half_rope_backward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tgq_ptr,\tgk_ptr,\tgv_ptr,\tqkv_ptr,\tq_norm_weight_ptr,\tk_norm_weight_ptr,\tfreqs_ptr,\tdqkv_ptr,\tdqw_ptr,\tdkw_ptr,\tB,\tstride,\teps,\tH: int,\th: int,\tD: int,\td: int,\tinterleave: int):", "funcdef": "def"}, {"fullname": "linghe.utils.rope.triton_qk_norm_and_half_rope_backward", "modulename": "linghe.utils.rope", "qualname": "triton_qk_norm_and_half_rope_backward", "kind": "function", "doc": "

\n", "signature": "(\tgq,\tgk,\tgv,\tqkv,\tq_norm_weight,\tk_norm_weight,\tfreqs,\teps=1e-06,\ttranspose=False,\tinterleave=True):", "funcdef": "def"}, {"fullname": "linghe.utils.scatter", "modulename": "linghe.utils.scatter", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.scatter.aligned_scatter_add_kernel", "modulename": "linghe.utils.scatter", "qualname": "aligned_scatter_add_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\to_ptr,\tindices_ptr,\tweights_ptr,\tM,\tN: int,\tK: int,\tSCALE: int):", "funcdef": "def"}, {"fullname": "linghe.utils.scatter.triton_aligned_scatter_add", "modulename": "linghe.utils.scatter", "qualname": "triton_aligned_scatter_add", "kind": "function", "doc": "

\n", "signature": "(x, outputs, indices, weights=None):", "funcdef": "def"}, {"fullname": "linghe.utils.scatter.scatter_add_kernel", "modulename": "linghe.utils.scatter", "qualname": "scatter_add_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, o_ptr, indices_ptr, M, T, N: int):", "funcdef": "def"}, {"fullname": "linghe.utils.scatter.fp32_to_bf16_kernel", "modulename": "linghe.utils.scatter", "qualname": "fp32_to_bf16_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, o_ptr, M, T, N: int):", "funcdef": "def"}, {"fullname": "linghe.utils.scatter.triton_scatter_add", "modulename": "linghe.utils.scatter", "qualname": "triton_scatter_add", "kind": "function", "doc": "

\n", "signature": "(x, outputs, indices):", "funcdef": "def"}, {"fullname": "linghe.utils.scatter.unpermute_with_mask_map_kernel", "modulename": "linghe.utils.scatter", "qualname": "unpermute_with_mask_map_kernel", "kind": "function", "doc": "

\n", "signature": "(\tgrads_ptr,\tprobs_ptr,\tmask_map_ptr,\toutput_ptr,\toutput_probs_ptr,\tnum_experts: int,\tN: int,\tPROB: int):", "funcdef": "def"}, {"fullname": "linghe.utils.scatter.triton_unpermute_with_mask_map", "modulename": "linghe.utils.scatter", "qualname": "triton_unpermute_with_mask_map", "kind": "function", "doc": "

\n", "signature": "(grad: torch.Tensor, row_id_map: torch.Tensor, probs: torch.Tensor):", "funcdef": "def"}, {"fullname": "linghe.utils.silu", "modulename": "linghe.utils.silu", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.silu.silu_and_block_quant_forward_kernel", "modulename": "linghe.utils.silu", "qualname": "silu_and_block_quant_forward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tout_ptr,\tscale_ptr,\ttranspose_output_ptr,\ttranspose_scale_ptr,\tM,\tn: int,\tROUND: int,\tOUTPUT_MODE: int):", "funcdef": "def"}, {"fullname": "linghe.utils.silu.triton_silu_and_block_quant_forward", "modulename": "linghe.utils.silu", "qualname": "triton_silu_and_block_quant_forward", "kind": "function", "doc": "

\n", "signature": "(x, out=None, scale=None, round_scale=False, output_mode=2):", "funcdef": "def"}, {"fullname": "linghe.utils.silu.silu_and_block_quant_backward_kernel", "modulename": "linghe.utils.silu", "qualname": "silu_and_block_quant_backward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tg_ptr,\tx_ptr,\tdx_ptr,\tdx_scale_ptr,\ttranspose_dx_ptr,\ttranspose_dx_scale_ptr,\tM,\tn: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.silu.triton_silu_and_block_quant_backward", "modulename": "linghe.utils.silu", "qualname": "triton_silu_and_block_quant_backward", "kind": "function", "doc": "

\n", "signature": "(g, x, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.utils.silu.batch_weighted_silu_and_block_quant_forward_kernel", "modulename": "linghe.utils.silu", "qualname": "batch_weighted_silu_and_block_quant_forward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tx_ptr,\tweight_ptr,\tout_ptr,\tscale_ptr,\ttranspose_output_ptr,\ttranspose_scale_ptr,\tcount_ptr,\taccum_ptr,\tn: int,\tE: int,\tROUND: int,\tOUTPUT_MODE: int):", "funcdef": "def"}, {"fullname": "linghe.utils.silu.triton_batch_weighted_silu_and_block_quant_forward", "modulename": "linghe.utils.silu", "qualname": "triton_batch_weighted_silu_and_block_quant_forward", "kind": "function", "doc": "

\n", "signature": "(\tx,\tweight,\tcounts,\tsplits=None,\tout=None,\tscale=None,\tround_scale=False,\toutput_mode=2):", "funcdef": "def"}, {"fullname": "linghe.utils.silu.batch_weighted_silu_and_block_quant_backward_kernel", "modulename": "linghe.utils.silu", "qualname": "batch_weighted_silu_and_block_quant_backward_kernel", "kind": "function", "doc": "

\n", "signature": "(\tg_ptr,\tx_ptr,\tweight_ptr,\tcount_ptr,\taccum_ptr,\tdx_ptr,\tdx_scale_ptr,\ttranspose_dx_ptr,\ttranspose_dx_scale_ptr,\tdw_ptr,\tn: int,\tE: int,\tROUND: int):", "funcdef": "def"}, {"fullname": "linghe.utils.silu.triton_batch_weighted_silu_and_block_quant_backward", "modulename": "linghe.utils.silu", "qualname": "triton_batch_weighted_silu_and_block_quant_backward", "kind": "function", "doc": "

\n", "signature": "(g, x, weight, counts, splits=None, round_scale=False):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose", "modulename": "linghe.utils.transpose", "kind": "module", "doc": "

Copyright (c) Ant Financial Service Group and its affiliates.

\n"}, {"fullname": "linghe.utils.transpose.deprecated_transpose_kernel", "modulename": "linghe.utils.transpose", "qualname": "deprecated_transpose_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, t_ptr, M, N, H: int, W: int, EVEN: int):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.triton_depracated_transpose", "modulename": "linghe.utils.transpose", "qualname": "triton_depracated_transpose", "kind": "function", "doc": "

\n", "signature": "(x):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.transpose_kernel", "modulename": "linghe.utils.transpose", "qualname": "transpose_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, t_ptr, M, N, H: int, W: int, EVEN: int):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.transpose_dim_0_1_kernel", "modulename": "linghe.utils.transpose", "qualname": "transpose_dim_0_1_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, t_ptr, B, M, b_stride, m_stride, N: int):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.triton_transpose", "modulename": "linghe.utils.transpose", "qualname": "triton_transpose", "kind": "function", "doc": "

\n", "signature": "(x, dim0=None, dim1=None):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.transpose_and_pad_kernel", "modulename": "linghe.utils.transpose", "qualname": "transpose_and_pad_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, t_ptr, M, N, P, H: int, W: int, EVEN: int):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.triton_transpose_and_pad", "modulename": "linghe.utils.transpose", "qualname": "triton_transpose_and_pad", "kind": "function", "doc": "

\n", "signature": "(x, out=None, pad=True):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.batch_transpose_kernel", "modulename": "linghe.utils.transpose", "qualname": "batch_transpose_kernel", "kind": "function", "doc": "

\n", "signature": "(xs_ptr, xts_ptr, M, N, H: int, W: int):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.triton_batch_transpose", "modulename": "linghe.utils.transpose", "qualname": "triton_batch_transpose", "kind": "function", "doc": "

\n", "signature": "(xs, xts=None):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.batch_transpose_and_pad_kernel", "modulename": "linghe.utils.transpose", "qualname": "batch_transpose_and_pad_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, t_ptr, count_ptr, accum_ptr, pad_accum_ptr, N, H: int, W: int):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.triton_batch_transpose_and_pad", "modulename": "linghe.utils.transpose", "qualname": "triton_batch_transpose_and_pad", "kind": "function", "doc": "

\n", "signature": "(x, count_list, x_t=None, pad=True):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.configs", "modulename": "linghe.utils.transpose", "qualname": "configs", "kind": "variable", "doc": "

\n", "default_value": "[<triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, 
<triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>, <triton.Config object>]"}, {"fullname": "linghe.utils.transpose.opt_transpose_kernel", "modulename": "linghe.utils.transpose", "qualname": "opt_transpose_kernel", "kind": "function", "doc": "

\n", "signature": "(x_ptr, t_ptr, M, N, D, H: int, W: int):", "funcdef": "def"}, {"fullname": "linghe.utils.transpose.triton_opt_transpose", "modulename": "linghe.utils.transpose", "qualname": "triton_opt_transpose", "kind": "function", "doc": "

\n", "signature": "(x):", "funcdef": "def"}]; + + // mirrored in build-search-index.js (part 1) + // Also split on html tags. this is a cheap heuristic, but good enough. + elasticlunr.tokenizer.setSeperator(/[\s\-.;&_'"=,()]+|<[^>]*>/); + + let searchIndex; + if (docs._isPrebuiltIndex) { + console.info("using precompiled search index"); + searchIndex = elasticlunr.Index.load(docs); + } else { + console.time("building search index"); + // mirrored in build-search-index.js (part 2) + searchIndex = elasticlunr(function () { + this.pipeline.remove(elasticlunr.stemmer); + this.pipeline.remove(elasticlunr.stopWordFilter); + this.addField("qualname"); + this.addField("fullname"); + this.addField("annotation"); + this.addField("default_value"); + this.addField("signature"); + this.addField("bases"); + this.addField("doc"); + this.setRef("fullname"); + }); + for (let doc of docs) { + searchIndex.addDoc(doc); + } + console.timeEnd("building search index"); + } + + return (term) => searchIndex.search(term, { + fields: { + qualname: {boost: 4}, + fullname: {boost: 2}, + annotation: {boost: 2}, + default_value: {boost: 2}, + signature: {boost: 2}, + bases: {boost: 2}, + doc: {boost: 1}, + }, + expand: true + }); +})(); \ No newline at end of file diff --git a/linghe/__init__.py b/linghe/__init__.py index e69de29..8b13789 100644 --- a/linghe/__init__.py +++ b/linghe/__init__.py @@ -0,0 +1 @@ + diff --git a/linghe/utils/add.py b/linghe/utils/add.py index 907c852..18201a6 100644 --- a/linghe/utils/add.py +++ b/linghe/utils/add.py @@ -3,6 +3,8 @@ Copyright (c) Ant Financial Service Group and its affiliates. """ +import torch +from typing import Iterable, Optional, Tuple import triton import triton.language as tl @@ -43,7 +45,16 @@ def inplace_add_kernel(x_ptr, y_ptr, M, N, H: tl.constexpr, W: tl.constexpr, rid * H + tl.arange(0, H)[None, :] < M)) -def triton_inplace_add(x, y, accum=True): +def triton_inplace_add(x: torch.Tensor, y: torch.Tensor, accum : bool = True): + """ + inplace add y to x + Args: + x: Tensor + y: Tensor + accum: whether accum y to x + + Returns: x += y if accum=True else x.copy_(y) + """ N = x.shape[-1] M = x.numel() // N # M, N = x.shape @@ -64,63 +75,3 @@ def triton_inplace_add(x, y, accum=True): num_warps=num_warps ) return x - - -@triton.jit -def block_add_kernel(x_ptr, y_ptr, M, N, H: tl.constexpr, W: tl.constexpr, - EVEN: tl.constexpr, ACCUM: tl.constexpr): - rid = tl.program_id(axis=0) - cid = tl.program_id(axis=1) - offs = rid * H * N + cid * W + tl.arange(0, H)[:, None] * N + tl.arange(0, - W)[ - None, :] - if ACCUM: - if EVEN: - x = tl.load(x_ptr + offs) - y = tl.load(y_ptr + offs).to(tl.float32) - tl.store(x_ptr + offs, x + y) - else: - x = tl.load(x_ptr + offs, - mask=(cid * W + tl.arange(0, W)[None, :] < N) & ( - rid * H + tl.arange(0, H)[:, None] < M)) - y = tl.load(y_ptr + offs, - mask=(cid * W + tl.arange(0, W)[None, :] < N) & ( - rid * H + tl.arange(0, H)[:, None] < M)) - tl.store(x_ptr + offs, x + y, - mask=(cid * W + tl.arange(0, W)[:, None] < N) & ( - rid * H + tl.arange(0, H)[None, :] < M)) - else: - if EVEN: - y = tl.load(y_ptr + offs).to(tl.float32) - tl.store(x_ptr + offs, y) - else: - y = tl.load(y_ptr + offs, - mask=(cid * W + tl.arange(0, W)[None, :] < N) & ( - rid * H + tl.arange(0, H)[:, None] < M)) - tl.store(x_ptr + offs, y, - mask=(cid * W + tl.arange(0, W)[:, None] < N) & ( - rid * H + tl.arange(0, H)[None, :] < M)) - - -def triton_block_add(x, y, accum=True): - shape = x.shape[-1] - N = shape - M = x.numel() // N - # M, N = x.shape - H = 128 - W = 128 
- EVEN = M % H == 0 and N % W == 0 - num_stages = 2 - num_warps = 8 - - grid = (triton.cdiv(M, H), triton.cdiv(N, W)) - block_add_kernel[grid]( - x, y, - M, N, - H, W, - EVEN, - accum, - num_stages=num_stages, - num_warps=num_warps - ) - return x diff --git a/linghe/utils/norm.py b/linghe/utils/norm.py index 085af0a..a627c78 100644 --- a/linghe/utils/norm.py +++ b/linghe/utils/norm.py @@ -2,7 +2,7 @@ import torch import triton import triton.language as tl - +from typing import Optional @triton.jit @@ -256,10 +256,35 @@ def rms_norm_and_block_quant_forward_t_kernel(x_ptr, -def triton_rms_norm_and_block_quant_forward(x, weight, eps=1e-6, - out=None, scale=None, rms=None, - round_scale=False, - output_mode=2): +def triton_rms_norm_and_block_quant_forward(x: torch.Tensor, + weight: torch.Tensor, + eps: float = 1e-6, + out: Optional[torch.Tensor] = None, + scale: Optional[torch.Tensor] = None, + rms: Optional[torch.Tensor] = None, + round_scale: bool = False, + output_mode: int = 2): + """ + Fused RMSNorm forward and block quantization. + Args: + x: Input tensor, shape [M, N] + weight: RMSNorm weight, shape [N] + eps: epsilon value for L2 normalization. + out: output of quantization data + scale: output of quantization scale. + rms: output of rms + round_scale: Set whether to force power of 2 scales. + output_mode: one of {0, 1, 2}. + 0: only output non-transpose tensor + 1: only output transposed tensor + 2: return both + Returns: + out: quantization data + scale: quantization scale + rms: Reciprocal of the root mean square of the input calculated over the last dimension. + transpose_output: quantization data of transposed gradient + transpose_scale: quantization scale of transposed gradient + """ # row-wise read, row-wise write M, N = x.shape assert N <= 8192 and 8192 % N == 0 @@ -375,7 +400,19 @@ def group_norm_gate_forward_kernel(x_ptr, gate_ptr, weight_ptr, out_ptr, eps, bs weight: [dim] output: [length, bs, dim] """ -def triton_group_norm_gate_forward(x, gate, weight, eps=1e-6, group_size=4): +def triton_group_norm_gate_forward(x: torch.Tensor, gate, weight, eps=1e-6, group_size=4): + """ + norm and gate in linear attention + Args: + x: + gate: + weight: + eps: + group_size: + + Returns: + + """ # row-wise read, row-wise write length, bs, dim = gate.shape assert dim <= 8192 and triton.next_power_of_2(dim) == dim and triton.next_power_of_2(group_size) == group_size diff --git a/setup.py b/setup.py index cd094eb..bc21ae2 100644 --- a/setup.py +++ b/setup.py @@ -18,7 +18,7 @@ license="MIT", license_files=("LICENSE",), description="LLM traning kernels", - URL="https://code.alipay.com/pia/linghe", + URL="https://github.com/inclusionAI/linghe", packages=find_packages(), install_requires=[], python_requires=">=3.8", From 9aeb62b296215f18005f0ab84f3596557bc64aeb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=8D=97=E9=9C=84?= Date: Wed, 15 Oct 2025 16:51:38 +0800 Subject: [PATCH 2/7] add pdoc --- docs/{api => }/index.html | 0 docs/{api => }/linghe.html | 0 docs/{api => }/linghe/facade.html | 0 docs/{api => }/linghe/facade/add.html | 0 docs/{api => }/linghe/facade/fp32_linear.html | 0 docs/{api => }/linghe/facade/loss.html | 0 docs/{api => }/linghe/facade/norm.html | 0 docs/{api => }/linghe/facade/rope.html | 0 docs/{api => }/linghe/facade/transpose.html | 0 docs/{api => }/linghe/gemm.html | 0 docs/{api => }/linghe/gemm/fp32_gemm.html | 0 docs/{api => }/linghe/quant.html | 0 docs/{api => }/linghe/quant/block.html | 0 docs/{api => }/linghe/quant/block/block.html | 0 docs/{api => 
}/linghe/quant/block/group.html | 0 docs/{api => }/linghe/quant/channel.html | 0 docs/{api => }/linghe/quant/channel/channel.html | 0 docs/{api => }/linghe/utils.html | 0 docs/{api => }/linghe/utils/add.html | 0 docs/{api => }/linghe/utils/dot.html | 0 docs/{api => }/linghe/utils/gather.html | 0 docs/{api => }/linghe/utils/loss.html | 0 docs/{api => }/linghe/utils/norm.html | 0 docs/{api => }/linghe/utils/rearange.html | 0 docs/{api => }/linghe/utils/reduce.html | 0 docs/{api => }/linghe/utils/rope.html | 0 docs/{api => }/linghe/utils/scatter.html | 0 docs/{api => }/linghe/utils/silu.html | 0 docs/{api => }/linghe/utils/transpose.html | 0 docs/{api => }/search.js | 0 30 files changed, 0 insertions(+), 0 deletions(-) rename docs/{api => }/index.html (100%) rename docs/{api => }/linghe.html (100%) rename docs/{api => }/linghe/facade.html (100%) rename docs/{api => }/linghe/facade/add.html (100%) rename docs/{api => }/linghe/facade/fp32_linear.html (100%) rename docs/{api => }/linghe/facade/loss.html (100%) rename docs/{api => }/linghe/facade/norm.html (100%) rename docs/{api => }/linghe/facade/rope.html (100%) rename docs/{api => }/linghe/facade/transpose.html (100%) rename docs/{api => }/linghe/gemm.html (100%) rename docs/{api => }/linghe/gemm/fp32_gemm.html (100%) rename docs/{api => }/linghe/quant.html (100%) rename docs/{api => }/linghe/quant/block.html (100%) rename docs/{api => }/linghe/quant/block/block.html (100%) rename docs/{api => }/linghe/quant/block/group.html (100%) rename docs/{api => }/linghe/quant/channel.html (100%) rename docs/{api => }/linghe/quant/channel/channel.html (100%) rename docs/{api => }/linghe/utils.html (100%) rename docs/{api => }/linghe/utils/add.html (100%) rename docs/{api => }/linghe/utils/dot.html (100%) rename docs/{api => }/linghe/utils/gather.html (100%) rename docs/{api => }/linghe/utils/loss.html (100%) rename docs/{api => }/linghe/utils/norm.html (100%) rename docs/{api => }/linghe/utils/rearange.html (100%) rename docs/{api => }/linghe/utils/reduce.html (100%) rename docs/{api => }/linghe/utils/rope.html (100%) rename docs/{api => }/linghe/utils/scatter.html (100%) rename docs/{api => }/linghe/utils/silu.html (100%) rename docs/{api => }/linghe/utils/transpose.html (100%) rename docs/{api => }/search.js (100%) diff --git a/docs/api/index.html b/docs/index.html similarity index 100% rename from docs/api/index.html rename to docs/index.html diff --git a/docs/api/linghe.html b/docs/linghe.html similarity index 100% rename from docs/api/linghe.html rename to docs/linghe.html diff --git a/docs/api/linghe/facade.html b/docs/linghe/facade.html similarity index 100% rename from docs/api/linghe/facade.html rename to docs/linghe/facade.html diff --git a/docs/api/linghe/facade/add.html b/docs/linghe/facade/add.html similarity index 100% rename from docs/api/linghe/facade/add.html rename to docs/linghe/facade/add.html diff --git a/docs/api/linghe/facade/fp32_linear.html b/docs/linghe/facade/fp32_linear.html similarity index 100% rename from docs/api/linghe/facade/fp32_linear.html rename to docs/linghe/facade/fp32_linear.html diff --git a/docs/api/linghe/facade/loss.html b/docs/linghe/facade/loss.html similarity index 100% rename from docs/api/linghe/facade/loss.html rename to docs/linghe/facade/loss.html diff --git a/docs/api/linghe/facade/norm.html b/docs/linghe/facade/norm.html similarity index 100% rename from docs/api/linghe/facade/norm.html rename to docs/linghe/facade/norm.html diff --git a/docs/api/linghe/facade/rope.html 
b/docs/linghe/facade/rope.html similarity index 100% rename from docs/api/linghe/facade/rope.html rename to docs/linghe/facade/rope.html diff --git a/docs/api/linghe/facade/transpose.html b/docs/linghe/facade/transpose.html similarity index 100% rename from docs/api/linghe/facade/transpose.html rename to docs/linghe/facade/transpose.html diff --git a/docs/api/linghe/gemm.html b/docs/linghe/gemm.html similarity index 100% rename from docs/api/linghe/gemm.html rename to docs/linghe/gemm.html diff --git a/docs/api/linghe/gemm/fp32_gemm.html b/docs/linghe/gemm/fp32_gemm.html similarity index 100% rename from docs/api/linghe/gemm/fp32_gemm.html rename to docs/linghe/gemm/fp32_gemm.html diff --git a/docs/api/linghe/quant.html b/docs/linghe/quant.html similarity index 100% rename from docs/api/linghe/quant.html rename to docs/linghe/quant.html diff --git a/docs/api/linghe/quant/block.html b/docs/linghe/quant/block.html similarity index 100% rename from docs/api/linghe/quant/block.html rename to docs/linghe/quant/block.html diff --git a/docs/api/linghe/quant/block/block.html b/docs/linghe/quant/block/block.html similarity index 100% rename from docs/api/linghe/quant/block/block.html rename to docs/linghe/quant/block/block.html diff --git a/docs/api/linghe/quant/block/group.html b/docs/linghe/quant/block/group.html similarity index 100% rename from docs/api/linghe/quant/block/group.html rename to docs/linghe/quant/block/group.html diff --git a/docs/api/linghe/quant/channel.html b/docs/linghe/quant/channel.html similarity index 100% rename from docs/api/linghe/quant/channel.html rename to docs/linghe/quant/channel.html diff --git a/docs/api/linghe/quant/channel/channel.html b/docs/linghe/quant/channel/channel.html similarity index 100% rename from docs/api/linghe/quant/channel/channel.html rename to docs/linghe/quant/channel/channel.html diff --git a/docs/api/linghe/utils.html b/docs/linghe/utils.html similarity index 100% rename from docs/api/linghe/utils.html rename to docs/linghe/utils.html diff --git a/docs/api/linghe/utils/add.html b/docs/linghe/utils/add.html similarity index 100% rename from docs/api/linghe/utils/add.html rename to docs/linghe/utils/add.html diff --git a/docs/api/linghe/utils/dot.html b/docs/linghe/utils/dot.html similarity index 100% rename from docs/api/linghe/utils/dot.html rename to docs/linghe/utils/dot.html diff --git a/docs/api/linghe/utils/gather.html b/docs/linghe/utils/gather.html similarity index 100% rename from docs/api/linghe/utils/gather.html rename to docs/linghe/utils/gather.html diff --git a/docs/api/linghe/utils/loss.html b/docs/linghe/utils/loss.html similarity index 100% rename from docs/api/linghe/utils/loss.html rename to docs/linghe/utils/loss.html diff --git a/docs/api/linghe/utils/norm.html b/docs/linghe/utils/norm.html similarity index 100% rename from docs/api/linghe/utils/norm.html rename to docs/linghe/utils/norm.html diff --git a/docs/api/linghe/utils/rearange.html b/docs/linghe/utils/rearange.html similarity index 100% rename from docs/api/linghe/utils/rearange.html rename to docs/linghe/utils/rearange.html diff --git a/docs/api/linghe/utils/reduce.html b/docs/linghe/utils/reduce.html similarity index 100% rename from docs/api/linghe/utils/reduce.html rename to docs/linghe/utils/reduce.html diff --git a/docs/api/linghe/utils/rope.html b/docs/linghe/utils/rope.html similarity index 100% rename from docs/api/linghe/utils/rope.html rename to docs/linghe/utils/rope.html diff --git a/docs/api/linghe/utils/scatter.html 
b/docs/linghe/utils/scatter.html similarity index 100% rename from docs/api/linghe/utils/scatter.html rename to docs/linghe/utils/scatter.html diff --git a/docs/api/linghe/utils/silu.html b/docs/linghe/utils/silu.html similarity index 100% rename from docs/api/linghe/utils/silu.html rename to docs/linghe/utils/silu.html diff --git a/docs/api/linghe/utils/transpose.html b/docs/linghe/utils/transpose.html similarity index 100% rename from docs/api/linghe/utils/transpose.html rename to docs/linghe/utils/transpose.html diff --git a/docs/api/search.js b/docs/search.js similarity index 100% rename from docs/api/search.js rename to docs/search.js From dd1ca44760bba40cff005a8498267f4d446a8372 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=8D=97=E9=9C=84?= Date: Thu, 16 Oct 2025 17:09:41 +0800 Subject: [PATCH 3/7] add doc --- README.md | 4 +- asserts/linghe.png | Bin 63376 -> 61328 bytes build.sh | 2 + docs/linghe/facade/add.html | 117 +-- docs/linghe/facade/fp32_linear.html | 105 +- linghe/facade/add.py | 15 +- .../facade/{fp32_linear.py => fp32_gemm.py} | 19 +- linghe/facade/loss.py | 16 + linghe/facade/norm.py | 45 +- linghe/facade/rope.py | 32 + linghe/facade/transpose.py | 12 + linghe/gemm/blockwise_fp8_gemm.py | 242 +++++ linghe/gemm/channelwise_fp8_gemm.py | 130 +++ linghe/gemm/fp32_gemm.py | 211 ++-- linghe/quant/{block => }/block.py | 13 +- linghe/quant/block/group.py | 107 --- linghe/quant/{channel => }/channel.py | 42 +- linghe/quant/group.py | 66 ++ linghe/quant/{block => hadamard}/__init__.py | 0 linghe/quant/hadamard/seperate_hadamard.py | 316 ++++++ linghe/quant/{channel => smooth}/__init__.py | 0 linghe/quant/smooth/reused_smooth.py | 899 ++++++++++++++++++ linghe/quant/smooth/seperate_smooth.py | 133 +++ linghe/utils/add.py | 5 +- linghe/utils/dot.py | 63 +- linghe/utils/gather.py | 300 +++--- linghe/utils/loss.py | 21 + linghe/utils/norm.py | 27 +- linghe/utils/rearange.py | 21 +- linghe/utils/reduce.py | 30 +- linghe/utils/rope.py | 81 +- linghe/utils/scatter.py | 46 +- linghe/utils/silu.py | 77 +- linghe/utils/transpose.py | 114 +-- tests/test_group_quant.py | 4 +- 35 files changed, 2562 insertions(+), 753 deletions(-) rename linghe/facade/{fp32_linear.py => fp32_gemm.py} (67%) create mode 100644 linghe/gemm/blockwise_fp8_gemm.py create mode 100644 linghe/gemm/channelwise_fp8_gemm.py rename linghe/quant/{block => }/block.py (85%) delete mode 100644 linghe/quant/block/group.py rename linghe/quant/{channel => }/channel.py (88%) create mode 100644 linghe/quant/group.py rename linghe/quant/{block => hadamard}/__init__.py (100%) create mode 100644 linghe/quant/hadamard/seperate_hadamard.py rename linghe/quant/{channel => smooth}/__init__.py (100%) create mode 100644 linghe/quant/smooth/reused_smooth.py create mode 100644 linghe/quant/smooth/seperate_smooth.py diff --git a/README.md b/README.md index 3ad17b6..277206e 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ -

Linghe

+

linghe

Logo @@ -66,4 +66,4 @@ Examples can be found in tests. ## Api Reference --- -Please refer to [API doc](asserts/api.md) \ No newline at end of file +Please refer to [API](https://inclusionai.github.io/linghe/) \ No newline at end of file diff --git a/asserts/linghe.png b/asserts/linghe.png index fc1f51c173a79586100e2e0480a64f7d212a2ba4..c1778e43919880f41f19e1cb3a1886548b21063c 100644 GIT binary patch literal 61328 zcmeFZgQVb=(mwXYt_+X+bZ7MI1!T^58M8QDCN4bW03-}KOl@tZz;%5{T z8C0_W{H%=n82KJF6cj%Q3Oe#V+Td@*k1~iq{`2>>6x9EDVhY;TyKms6T)X=D+Qr;X zWJ_YeC$U~jY1yNoa5N(RM}7QCdkh6d6h->^Q`NVqD-$;=$<;8rQj5Mr`S*u^H{oAP_}2^m^@4xB;9oEJ z*9-pjf`7f>UoZIA3;y+jf4$&eFZkCB{(t2K-o@CZ0i>qApQ}_S;Bh7e@)-ZXl9aJ* zs3-7P-!5?ckXzPpdXBcFwSUr+i}8r^|IMv>{=my>FPbtSRfB*0w8JK=FROx1- zUn5cfz|n}}q0{07^#_gq(=nxBphfsHHC69z?L`lyriN&K&>y(|Popt|(X3FD^sxS6D*;Ic zE5xkJ#QCSuw83aU(FAMo|A7>H{MWlQLp;|PWd9TxOGUwG7;l~E{^3ZJ(819@*!jxy z53|NK_XeY3Z>-4t14);H3656$kGSCONd)K%eXiOZ#irSNQ`pWo;3`nHwulasgNhzv5ATPv*MW~=N$j*Zw{S6l{!Yc zS&Crt_}h*{YeI5*!c;=E@PV9fVah^=DIT?G$~?Myo-E~B`76JjiFARqrxQ!XfqzfF zGEnzFT*{Pf%2VIpvBFrgd_fq_RC|<|wW!gjD4i-C_v5SaNZf^UzS5#@BU=+p@!)rw z>xJ|rMfhp0F&2T*{>xCfv|c)Z0HtobrSJ4ON7N?5H+30GHE|x-vG^qn8bjFf?ZRtB zqtPGez5Ue~x|pUWkYVGf636g2j&c+DN9;F+?$#!0FZa`u(zDaMw6{3lqG1If=G4iE zThV#47j^D=J2bn-TjKtI&jqo=qI~EaiCn>pQeONPZ;rmfBjK+~!lP5qDIyV%c=KRm z9g7ntaYsSQIV3JR(%(n=zt(f{j3mS}1Ygf)EC$DjbVX)pUanyTT{x?Y*ygw|U~!LK zrAyuK+TH(aKX9p)642u0NzaxJ72z%g(j-IkkgP$(S_`SY`%IywUB;2HNMm`mYHFH- zzx#NZbUh#mSJ?^SMY=|595fDKAu9RX|LSoA&jN zee#f!CHLRu7w+5pz@Fko27k{@a)cv~N-LyFC`Ld))lNE(yv<-TE5EgcFg*%tt`0vfQn9A9s zk8x?!&m1i|=r8SDbQ_)W?E8ZG_UwxI-<+H%CYam7Iu?r)Ki#FV*<&Ofw=9_3udYb` z-TNgXe5pKmXhL`evbMt!+lI-d^Q%mnGrq2X;&1cR$AEBOd|!p<7u(s-ftsAoyj^m! z1ONL-y|szIKZm`BmBy7#-WgTA7+af`4N~0)BSIa$3=aS66QUv5uO2`&XFO64FP56p zil*$#{LlgAZw`)|8KhhIM8W3YIRAe(Upa!yf#`n?ak0Mt`Aw!B$ZVt{s9eea^$-7F zZ+w^b#r|(L?mwRZ)0QF~I>18C`v37(TAajd|1?(%U66>|6j6E}{ar9Z4%SbJiZqgx zXbjWt*zif$<-DCCvui7juO!DVMpDw&&rTN_!}UWz|?|B64bMvcD#rr{rzx59bA^=GsCj z66W{yY&t*D+q=gk9J}eB2sf<1d12_@5VUS*@0GmV|K-`D!1&74g6r;*+7S!!)WYwg z0cF{!`mWafsw5$|-oU9l$o=xp1^CM}{ZY%)%e_^VdpVH#VsYHQPyUX`$!pimR$>+1 zC!Hz**lIZi^{%6@BIRLVPd@wJx)+hL(RzhQ%)>>;|2WRanL&@&YFLGiS4tqiTLPH* z^|_t&P2)^p5@gn&oF*KHODCMh(iIZ<)o5-xUg4{GT=TmPgA;UR+7s{+`&yN61`=V( zN$L&GVi=aHgcoP>IJ(h7XcTI8WB;*wYG;2uiTK;ua7id*e&tt({kQkBGdu5B{4@%3 zI6GX3S<@|$L1DPUT&kikmj<#H2UHZ9g^bV*U&yRa*1?O~DHCoBAN+!0wtX#?6}Yfz z(Q9a_%cjQKb>7ysdWNW$r0OKPj+dJx>WFT{kR`M~&Z*}#V&Wr1;ya({JQ+8nYv1EA zq-CadLnH5MH*aEocHin)j5Q9Wm*>;=USd!emkaUwY3P^U!#gGD%=uFFmb5tXsp0@a zGdAVoUwUB5C|bk$={j7`c|vWKtyBGcwj&a*=(!cS)6LoUDev};he+F1m3=w;J$x}O zQXyM?moN($=eqn2^4Fr`OUhVC1AgfVu;8}JCh6sCr6=BZ6`p)~Z5?0pls%pUTCRNM z#o~9Ltd_!+*pEKyHJ=X&w5kJr3^q)IMO zcSie>Oflxy?TtoLeaE|+M-Q&~RIta&Rkr*d)o9-*6qEB-hjbyA{N+rB~&Q?b+{ z-^no8QIwcxDleC*z?d!dE@s`>LZ2u5{&&<-1vy6W5%= zn=VGT*zxwK$kc((qmZ7}P))zr>vj7I*KXXFdz9#)taaDdzMaI8%46q8dUc)#S4HU5 zMG9*?!T1eLAgMB7q#IIHzdF%DP>tpKR`XwGIq$%mOl1Sggn0#{QQ-5EgSYRO>`u9F z_dr6phopebxWofv8T=jGVk4(<6q$TNE1$rVXJMBadl3O=#Z#hf(2h5n?`*@Ef74z% z!p(;uOBLQ8B)?qHOMP$DYuCGvjl2!wp4b_A>kQXI8*pvysoNCJ@jpHtirZ+ zIZxfE$Tw-LN|g3R1qf#ji|H|*XGXrBft3nEy92kxX$dYZUdUUU6hd-T-)FK;PJnme zikTQ7Cu;vOu|}>z#`62%&1T@~mg!kI<(w>))(xk7i{3XV&0mjM*t{NiA(=ki?5Dwr z?sx%0L0xi-l-r9row3zb#wu}l?u0!mJ(WQ;W-OWs2f8%ZI3hPe%3+=(hE_2@p zn|ya7Hyk+-ZJ0&6T(WSTdVF=iQ<8lz-(sl1n&uxKE!S-j;N%vPy*#7E@Ubrz`#rev zXt^NS0;2b9YJtOk3G2hfVGeu(8!PWv(ROK2k{jJZm8QP`XEhq`20>pq1F zAN?6;|NN$t9@6vyt13vR@zN{C(LkJ$7FaD+smkf--lW6fgcG7&mKfZ0V5%w;c}17+ zEXS}gDg6>8M~_3n&oyVqoWgG0mX=^somGDR#X{&}f|~C{^W0;C=8uId-bFTcA(yCy zL5tJXx%> 
zN8-fb(wCtu-t}d)c?d<@&?39!<>Er_norexxB`DP_3+C%S0L`QZ8&vWgnsd(bbah` zw3rbhkdk-hd+5lFT37mC8FhG+nUGfFBBc_@y)2`&hL^l|Bgj=7g!KL_GKPZ;7r9ou z-OFY5vk>N!we%czq8$TL-p;6H0rxoy-z4XeZR0!C^x4;;3E@d@YgMJ43^}t^SVoty zN)6Bd!}R+*!kR}cyzU1ZJ5AWEs&t{z>Pt1EvKXru7?lr~M6behbBWKt1S-ZQ1o@8{ zprYYmF%^G*09i1faKJv{$i6k{&PWEYujZTS8}|4UG* zDxPA$&^7?wSm|kF}(m%HA?T9pl|FUDKuOIwc}Eq&Uh{KVDmdM6<4!tz!1UxFm72r{;< zw)1T26{98Za;qQho>XT{I>8_h|48MMXuSUDfwvbB17x z52)`L23{5vvl1_VHF_^PHyrEP{3d*SnejzaNJ07-RML2K`TC=*cch{)-|p%dZ%om! zSU)ghn|LmW^8I*z0;KH@>6p>e7F}U^4&Yg_S!=krkPkEQ7WOzw zdw-k%ao7DJ2P-PC6FbAvsVe)0rc&`|R}TVlp_0jmhUm%_pddtX4pRJuPT~p@iE=$I{C1rzZBWCY?%S%>6>!fKM?}ySM{6V;if1nD z=IBcM_WlO1o0mYeVv2El=$yb?pRihBo2r!(gBQN%J&N}eL_>>=BJC@iFh?+WcMGkq zUSoQ%R!mVozm5@DIyu|PSn&C&;r+|c8ev+M^zgKr`>S7Vh_)7}C!(HGgkIwF$MuaG z^;=QDc{%);SB47=Zl@;k&H|VXrHj@yn&hoTrj*h^(cnJ@4nM?o<}`0#rJ2VOU>f1B zDmq9D3dC&Tf0D@Wm@TEp0I3xPDIEI;Yl2m=^bDD`EYdhSE`lI%1iQ%nk}GmG24`C_3DrTXS9a(_ z3@BT7VoU1FqgJ}@t)`(@M(*d>O?2}XrC_CxWair1M75(VI9+4M_eug6@5@9*otm>q ziR8XCNnG<{u&Bt6m8(OOg7^@sH-3ET&0I0g@%{z3a2mkReB!^AkjBu-SO}A0jH*DB zyeiOUl|*4|eu)oXA!mAo_Wn9fwkPg`Sjm^)Ki<#Mt8v<{P4Xk=cV7D(Dspxp_<=#< z;;kRuyboB0%k#R7s@A1&kl;yMt+66iea?Xm`80L(Xf4w?ZHPtw=tz!6LFwt<*aE}+ z(Mzwh?c@yaYd40@myjj&<5W>li@}S#SajM*$>T52oBE>&Fv64FIDaI%xrK>9H~0L- zFt_hM0m%Ff?@5d-NJPr+743ma13W zIU4(7=;}JwWu?RVZp^QOtm+0+CBqzc2Q@NclT?xW#qg-`;CGNlJ_q+zW_g3))4p%GeUF27!mgbH0-#v0 zoC7F9ZCybv3VWDf6%iMUe(7r#FR1UR#0?uYm~$eHTL?UUX(R`E)ZeQ_lZ8FrO#U^B zdCq8XC_~Ek=GD<&MT6ep^@6e?ZOFmK^kSITi7_-oK93=y^Wx}5F)-w~4W3%5A*WOY zP;a+5FIL2*p?v=;3d+qAlIYlZn!@i7g3?6J5BnAuGm;fR@-(xUsy}rvw>XWdc3h6G z2en@Y!v4~K_z0{NK3mrK?T^e>CFrhuD7&c?O*ADCjonqgBSSRT?(Hw{8_|>iI!YrllIye&F>Jd} z#K5%{>^lOjAu9j$1rqirAg&L9GHd|27yVPbP3y8jnSdrK87IJlBtQOo8N|5} zNZyztzY4kF_aKtt;Ey0zOU>qmN*q7G9#Nqo;JEaOXZ&?DjK;Po=z@j>g-gAtoW^9C z<*<5!UCLqj&kF__?8`?SP{W(1QawkelUbz{$uLU3VD0ahqkt3IMYBQ6iP?}t4kRTF z>({ypS>{j>!Z@c_E;&m~-0%M2wFJKy3384P5$Ws)&R@s~v-KGGU%UQYq-lx%vPYd- z5U$v`dueK*5!BE3KaWO)^@OU7iOb^yZz-nkPnV5h`f&fM!zKEEJ-;6;QP7v!^fFw$ zL-9Vxa#$OW!$1!3uJIhqDDft8ao(mkrAT@1_S#&%4o1*CX*3^@yYWg4obKHTYRt>W zQHP;{px$Nt)98w2z&%I?+?>i)4{)REUE1^lk>_Pq0BB6tkj~^D!}mKmTrGT#z8gUR=IC(WF{d8s6ga`?Fp10 zR~-Qh%@o-f7VdG^t1&RU>8T-4GW@2lPF5ZtEKypQ=W1JD$uvqe%Q^bJM0zd*pf}a4 zuBl!!xR|p#DbM_FAKJY8bCr;}nSgIJCl)MTPPb#E>gAsvD8{%^fP%@ zSM8VY@$a-~iic=^BZGN1SgoeK$7OMfidF@{L$?R9#pQNUD>b97j+W(24_WdfPxPe% zcs`%d$(2)?M~EKpttOP#gMZ%pKb$-#cjS8ji}Xk4IU z4j?AjlZ=Lf{O1v1L1xVxG*=dsg$8|OogMohcr$`$J~otIw6fzXh5UpC_A4XdJSXey zo)piEP55Yw%2=dYa30K+n^YH%w9I)dcfkFE?4mCN_>UJB@c9N}liyWmc&Y2AUM>5H zp%enpN-c(@8iz%Ws5~`YYqvTX z?a7VPqts>Y2yTedhd*g2yLTiGV|$i(|C09*O<6*W=9EVUseHA-v$(3Lv%pf3opTdu zk(w`CZHA1p*bZ}-bj{<%r*9zVcnu9vl~k`3cgchS9-%%PuWq9W8v^P&nJ~dEDYY3* z&AIw6X1R8^OIHxh!vY-*l0TL^<{B|vpVckTMlYe2w)o4QBz0~p!tbW!EmF;D>dugkZ91qwSV1XXwR5loS6p}-Iy_VM_W2^nl8Zm`bSenK4m>>j zWSNSdL$Z6;CTc2_Tju!W0TEv|YW z$tW(X`zaGx1OAPbb~DPsP)|_Q!jLTkSWCcNGa=J+Atsi1W#{!!A~2iBlnp%!N1iF&}o zD-so5$K!Fdqs$@F+X0Ym-lzTo#r%Os>ZV?y$*EKiifU8Z9|SQeyRx1!-zH&8F+m+o zXmm&aJs*Uiq(6}auS$g#%|ZpWDS6rVf%$|+J7AY z>-gSH*p_!(p(sL?cUh03ctdaIs`7F#ih^hvl{Z-K{3UDfF{evPW&gc&d9r0+)8h9G zxT>G8xvZa)*b!jG;L%8=#WEBMOOp^c8SSX?$@C=&sRZ|Hd!6pWp$RteJY!ZR7wpGm zF}Zr^VU=hwf%{Tc#{AosL+$g5gKcKUbNKOkL-KDtOY~m7-w)QabMoVPM$J=q00C6B zLvDE(wQbuz6DE?>3s{$(iiXoXqX2l-*u5kJT)#j!>t<2Hhs!qJk z}`U%i&8Hr;~NW8HR|RtnnraMy>Ev9aAk?O%Zs zFrEHf`_WP-Io%n>$#(Mj=Usn1;UDud(IgZ^VVa7oV80pED9#@~@YXy(Xd( zK9Ss`Iw$&#M<>xJ`MdCJN{1GIrj5|2aRtrpwxrPpVC@*_ux1{9lnGH!77Cp|BcIZm zu{~e$I-e1oG74gb-4F*TAF1dC$-+P8uxDtRO+lyfSq}sm0;p+S&>qv#H73@eeX&FH@(K38NH}!D<0i znus-^$s?9r{*K<^WBN&dHU8GF7h`kYW3*e1n 
[GIT binary patch data omitted: base85-encoded blob, literal 63376]
zQ$oO6Au88n;bU)LoZFHy@lh|dXBC>5TXg-^QlU_>l%m~x^p)rjCHC3x&)@tQAo!9c zcE%_DE=s~!#&uAJOl?i*Btzb#OBL&R0_#O&SZ5pK29Evti#&XD?pX%h;1GDTu(PP3tTEab`QvW;0A%X`F zMMDvJ!O4DQxvV$vJ5Sd2$@~JkS?c7K@#Gcts64l2Ba>U-On!?PdeFB7TQ+LpJRZLM zspig3O~}{uR)_WnYv(!nmTBKc?zU$YM zZ|?QM5GxeYPAJKI{u!OO?MAS86dfjAiuO%z~l`Qz_fI9 zM23d@aA2juXy1vHq`>o+WD7Ce#j(WkeQ09WVf7(tCU)r6;p!G1bX_P#F*~M)aTA`s2 z1sCgHx5XN`crDiK7coHN15*_#xFm1jQLhEqx*@Q>?&Z&x+d5=^P&EgOSzOv17K$FC z%`0EhcCfE${BW%Hm*tv*S?7%P#-LPAf!8p-ZZjD^^JT$F8567ERKRg?s=3E|ab2*| zofr6#I5;v@*59VnQSg2LL&iciw@xpEJ7WmLE(2T0-Wf=E)c}|!o==;9uwuUjm@*t_ z**R>A9Y9E12i3mi3mv+f9-3(awvI`5@RBcQa3^e?)`X`Lxz4LEW-KkU{MDw)>AM{) zCsrF_Cx;RGGxD(+KjmXLb}M1&dJy+CB$c1Ewjxf(bvlKNjtJKQhL%{UUXFJ2B;fB27W;bUcU;oq`BexFd)LYCh`XL~@0k~$ zQEvakXznT)@)2L2ySY`cp!u#?rrq_+LT*}@8Emh_d=`CHoqug2PXEL$Ae%lAKEehZR-q1D3hInEq=let2*<+GW zd;t#iUOTV>uet+DPep}7&N&TOn~>KW#O+EQ22qD;$Nk+f#a5R(=j}Iz|{A~eeLHZ34HSvv$C=}0IRyLEerF#%R~AMu-(;iNxif40m&M&*b|-3gqITWIt`+QB!LAkTTEVUr>{`LD73^BUt`+QB!LAkTTEVUr>{`LD y73^BUt`+QB!LAkTTEVUr>{`LD6=3~IR(LhCO@DX$J($}Af8?aDNM>EqzxRJ0d=qN` diff --git a/build.sh b/build.sh index 904a121..e2c9f7d 100644 --- a/build.sh +++ b/build.sh @@ -3,3 +3,5 @@ rm -rf dist && rm -rf linghe.egg-info && python setup.py develop && python setup.py bdist_wheel && + +#pdoc --output-dir docs -d google --no-include-undocumented --no-search --no-show-source linghe \ No newline at end of file diff --git a/docs/linghe/facade/add.html b/docs/linghe/facade/add.html index 203f198..10e2ec9 100644 --- a/docs/linghe/facade/add.html +++ b/docs/linghe/facade/add.html @@ -73,38 +73,7 @@

-

Base class to create custom autograd.Function.

- -

To create a custom autograd.Function, subclass this class and implement -the forward() and backward() static methods. Then, to use your custom -op in the forward pass, call the class method apply. Do not call -forward() directly.

- -

To ensure correctness and best performance, make sure you are calling the -correct methods on ctx and validating your backward function using -torch.autograd.gradcheck().

- -

See :ref:extending-autograd for more details on how to use this class.

- -

Examples::

- -
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
->>> class Exp(Function):
->>>     @staticmethod
->>>     def forward(ctx, i):
->>>         result = i.exp()
->>>         ctx.save_for_backward(result)
->>>         return result
->>>
->>>     @staticmethod
->>>     def backward(ctx, grad_output):
->>>         result, = ctx.saved_tensors
->>>         return grad_output * result
->>>
->>> # Use it by calling the apply method:
->>> # xdoctest: +SKIP
->>> output = Exp.apply(input)
-
+

in-place add with mixed precision

@@ -113,57 +82,27 @@

@staticmethod
def - forward(ctx, x, y): + forward(ctx, x: torch.Tensor, y: torch.Tensor):

-

Define the forward of the custom autograd Function.

+

in-place add of y to x with mixed precision

-

This function is to be overridden by all subclasses. -There are two ways to define forward:

- -

Usage 1 (Combined forward and ctx)::

- -
@staticmethod
-def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
-    pass
-
+
Arguments:
    -
  • It must accept a context ctx as the first argument, followed by any -number of arguments (tensors or other types).
  • -
  • See :ref:combining-forward-context for more details
  • +
  • ctx: autograd context
  • +
  • x: tensor to be updated in place
  • +
  • y: tensor added to x
-

Usage 2 (Separate forward and ctx)::

- -
@staticmethod
-def forward(*args: Any, **kwargs: Any) -> Any:
-    pass
-
-@staticmethod
-def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
-    pass
-
- -
    -
  • The forward no longer accepts a ctx argument.
  • -
  • Instead, you must also override the torch.autograd.Function.setup_context() -staticmethod to handle setting up the ctx object. -output is the output of the forward, inputs are a Tuple of inputs -to the forward.
  • -
  • See :ref:extending-autograd for more details
  • -
+
Returns:
-

The context can be used to store arbitrary data that can be then -retrieved during the backward pass. Tensors should not be stored -directly on ctx (though this is not currently enforced for -backward compatibility). Instead, tensors should be saved either with -ctx.save_for_backward() if they are intended to be used in -backward (equivalently, vjp) or ctx.save_for_forward() -if they are intended to be used for in jvp.

+
+

output: the updated tensor x

+
@@ -179,26 +118,20 @@

-

Define a formula for differentiating the operation with backward mode automatic differentiation.

- -

This function is to be overridden by all subclasses. -(Defining this function is equivalent to defining the vjp function.)

- -

It must accept a context ctx as the first argument, followed by -as many outputs as the forward() returned (None will be passed in -for non tensor outputs of the forward function), -and it should return as many tensors, as there were inputs to -forward(). Each argument is the gradient w.r.t the given output, -and each returned value should be the gradient w.r.t. the -corresponding input. If an input is not a Tensor or is a Tensor not -requiring grads, you can just pass None as a gradient for that input.

- -

The context can be used to retrieve tensors saved during the forward -pass. It also has an attribute ctx.needs_input_grad as a tuple -of booleans representing whether each input needs gradient. E.g., -backward() will have ctx.needs_input_grad[0] = True if the -first input to forward() needs gradient computed w.r.t. the -output.

+

backward of inplace add

+ +
Arguments:
+ +
    +
  • ctx: autograd context
  • +
  • grad_output: input gradient
  • +
+ +
Returns:
+ +
+

tuple of gradients
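
For reference, a minimal usage sketch of the in-place add facade documented above, assuming the inplace_add wrapper added to linghe/facade/add.py later in this patch; the shapes, dtypes and device below are illustrative only and need a CUDA device with Triton available:

    import torch

    from linghe.facade.add import inplace_add  # wrapper added by this patch

    # accumulate a bf16 tensor into an fp32 buffer with mixed precision
    x = torch.zeros(4096, 2048, device="cuda", dtype=torch.float32)
    y = torch.randn(4096, 2048, device="cuda", dtype=torch.bfloat16)
    x = inplace_add(x, y)  # x is updated in place and also returned
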

+
diff --git a/docs/linghe/facade/fp32_linear.html b/docs/linghe/facade/fp32_linear.html index 43c7dfc..09c0fd7 100644 --- a/docs/linghe/facade/fp32_linear.html +++ b/docs/linghe/facade/fp32_linear.html @@ -73,38 +73,8 @@

-

Base class to create custom autograd.Function.

- -

To create a custom autograd.Function, subclass this class and implement -the forward() and backward() static methods. Then, to use your custom -op in the forward pass, call the class method apply. Do not call -forward() directly.

- -

To ensure correctness and best performance, make sure you are calling the -correct methods on ctx and validating your backward function using -torch.autograd.gradcheck().

- -

See :ref:extending-autograd for more details on how to use this class.

- -

Examples::

- -
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
->>> class Exp(Function):
->>>     @staticmethod
->>>     def forward(ctx, i):
->>>         result = i.exp()
->>>         ctx.save_for_backward(result)
->>>         return result
->>>
->>>     @staticmethod
->>>     def backward(ctx, grad_output):
->>>         result, = ctx.saved_tensors
->>>         return grad_output * result
->>>
->>> # Use it by calling the apply method:
->>> # xdoctest: +SKIP
->>> output = Exp.apply(input)
-
+

gemm with bf16/fp16 inputs and float32 output, +currently used in MoE router gemm.

@@ -113,57 +83,27 @@

@staticmethod
def - forward(ctx, input, weight): + forward(ctx, input: torch.Tensor, weight: torch.Tensor):

-

Define the forward of the custom autograd Function.

+

gemm forward with bf16/fp16 inputs and float32 output.

-

This function is to be overridden by all subclasses. -There are two ways to define forward:

- -

Usage 1 (Combined forward and ctx)::

- -
@staticmethod
-def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
-    pass
-
+
Arguments:
    -
  • It must accept a context ctx as the first argument, followed by any -number of arguments (tensors or other types).
  • -
  • See :ref:combining-forward-context for more details
  • +
  • ctx: autograd context
  • +
  • input: bf16/fp16 activation tensor
  • +
  • weight: bf16/fp16 weight tensor
-

Usage 2 (Separate forward and ctx)::

- -
@staticmethod
-def forward(*args: Any, **kwargs: Any) -> Any:
-    pass
-
-@staticmethod
-def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
-    pass
-
- -
    -
  • The forward no longer accepts a ctx argument.
  • -
  • Instead, you must also override the torch.autograd.Function.setup_context() -staticmethod to handle setting up the ctx object. -output is the output of the forward, inputs are a Tuple of inputs -to the forward.
  • -
  • See :ref:extending-autograd for more details
  • -
+
Returns:
-

The context can be used to store arbitrary data that can be then -retrieved during the backward pass. Tensors should not be stored -directly on ctx (though this is not currently enforced for -backward compatibility). Instead, tensors should be saved either with -ctx.save_for_backward() if they are intended to be used in -backward (equivalently, vjp) or ctx.save_for_forward() -if they are intended to be used for in jvp.

+
+

output of gemm
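
For reference, a minimal usage sketch of the fp32 router GEMM facade documented above, assuming the fp32_gemm wrapper added to linghe/facade/fp32_gemm.py later in this patch; the MoE-style shapes below are illustrative only and need a CUDA device with Triton available:

    import torch

    from linghe.facade.fp32_gemm import fp32_gemm  # wrapper added by this patch

    seq, batch, hidden, n_experts = 4096, 2, 2048, 256
    x = torch.randn(seq, batch, hidden, device="cuda",
                    dtype=torch.bfloat16, requires_grad=True)
    router_weight = torch.randn(n_experts, hidden, device="cuda",
                                dtype=torch.bfloat16, requires_grad=True)

    logits = fp32_gemm(x, router_weight)  # router logits in float32
    logits.sum().backward()               # dx and dw computed by the backward defined in this patch
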

+
@@ -179,26 +119,7 @@

-

Define a formula for differentiating the operation with backward mode automatic differentiation.

- -

This function is to be overridden by all subclasses. -(Defining this function is equivalent to defining the vjp function.)

- -

It must accept a context ctx as the first argument, followed by -as many outputs as the forward() returned (None will be passed in -for non tensor outputs of the forward function), -and it should return as many tensors, as there were inputs to -forward(). Each argument is the gradient w.r.t the given output, -and each returned value should be the gradient w.r.t. the -corresponding input. If an input is not a Tensor or is a Tensor not -requiring grads, you can just pass None as a gradient for that input.

- -

The context can be used to retrieve tensors saved during the forward -pass. It also has an attribute ctx.needs_input_grad as a tuple -of booleans representing whether each input needs gradient. E.g., -backward() will have ctx.needs_input_grad[0] = True if the -first input to forward() needs gradient computed w.r.t. the -output.

+

backward of fp32 gemm

diff --git a/linghe/facade/add.py b/linghe/facade/add.py index 945ad0e..d477b14 100644 --- a/linghe/facade/add.py +++ b/linghe/facade/add.py @@ -10,9 +10,22 @@ class InplaceAddFunction(torch.autograd.Function): @staticmethod - def forward(ctx, x, y): + def forward(ctx, x: torch.Tensor, y: torch.Tensor): return triton_inplace_add(x, y) @staticmethod def backward(ctx, grad_output): return grad_output, grad_output + + +def inplace_add(x: torch.Tensor, y: torch.Tensor): + """ + inplace add y to x with mix precise + Args: + ctx: autograd context + x: to be updated + y: add to x + Returns: + return updated x tensor + """ + return InplaceAddFunction.apply(x, y) \ No newline at end of file diff --git a/linghe/facade/fp32_linear.py b/linghe/facade/fp32_gemm.py similarity index 67% rename from linghe/facade/fp32_linear.py rename to linghe/facade/fp32_gemm.py index d356856..5bc1e0f 100644 --- a/linghe/facade/fp32_linear.py +++ b/linghe/facade/fp32_gemm.py @@ -10,9 +10,9 @@ triton_fp32_gemm_for_update) -class FusedFp32GEMM(torch.autograd.Function): +class Fp32GEMM(torch.autograd.Function): @staticmethod - def forward(ctx, input, weight): + def forward(ctx, input: torch.Tensor, weight: torch.Tensor): shape = input.shape assert len(shape) == 3 input = input.view(shape[0] * shape[1], shape[2]) @@ -32,9 +32,22 @@ def backward(ctx, grad_output): grad_output = grad_output.view(shape[0] * shape[1], shape[2]) input, weight = ctx.saved_tensors - dx = triton_fp32_gemm_for_backward(grad_output, weight, accum=False) + dx = triton_fp32_gemm_for_backward(grad_output, weight) dx = dx.view(*ctx.shape) dw = triton_fp32_gemm_for_update(grad_output, input) return dx, dw + + +def fp32_gemm(input: torch.Tensor, weight: torch.Tensor): + """ + gemm with bf16/fp16 inputs and float32 output, + currently used in MoE router gemm. + Args: + input: bf16/fp16 activation tensor + weight: bf16/fp16 weight tensor + Returns: + output of gemm + """ + return Fp32GEMM.apply(input, weight) \ No newline at end of file diff --git a/linghe/facade/loss.py b/linghe/facade/loss.py index fff59dd..a1fe7b9 100644 --- a/linghe/facade/loss.py +++ b/linghe/facade/loss.py @@ -38,6 +38,22 @@ def backward(ctx, grad_output): return grad, None, None, None +def softmax_cross_entropy(logits: torch.Tensor, labels: torch.Tensor, inplace: bool = False): + """ + softmax cross entropy + Args: + logits: logits tensor, shape [...,dim] + labels: labels tensor, shape [...] 
+ inplace: update gradient in the `logits` tensor if True + + Returns: + per token loss + """ + assert logits.is_contiguous() + assert labels.is_contiguous() + return SoftmaxCrossEntropyFunction.apply(logits, labels, inplace) + + class GradScalingFunction(torch.autograd.Function): @staticmethod def forward(ctx, x, coef=0.2): diff --git a/linghe/facade/norm.py b/linghe/facade/norm.py index c5d31ad..7f09318 100644 --- a/linghe/facade/norm.py +++ b/linghe/facade/norm.py @@ -37,17 +37,32 @@ def backward(ctx, dy): return dx, dw, None +def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6): + """ + rms norm of x with weight + Args: + x: activation tensor + weight: weight tensor + eps: epsilon for RMS + + Returns: + rms output + """ + assert x.contiguous() + assert weight.contiguous() + return RMSNormFunction.apply(x, weight, eps) + class GroupNormGateFunction(torch.autograd.Function): @staticmethod - def forward(ctx, x, gate, weight, eps=1e-6, group_size=4): + def forward(ctx, attn_output, gate, weight, eps=1e-6, group_size=4): output = triton_group_norm_gate_forward( - x, + attn_output, gate, weight.data, eps=eps, group_size=group_size ) - ctx.save_for_backward(x, gate, weight.data) + ctx.save_for_backward(attn_output, gate, weight.data) ctx.eps = eps ctx.group_size = group_size @@ -55,11 +70,11 @@ def forward(ctx, x, gate, weight, eps=1e-6, group_size=4): @staticmethod def backward(ctx, dy): - x, gate, weight = ctx.saved_tensors + attn_output, gate, weight = ctx.saved_tensors dx, dg, dw = triton_group_norm_gate_backward( dy, - x, + attn_output, gate, weight, ctx.eps, @@ -67,3 +82,23 @@ def backward(ctx, dy): ) return dx, dg, dw, None, None + + + +def group_norm_gate(attn_output: torch.Tensor, + gate: torch.Tensor, + weight: torch.Tensor, + eps: float = 1e-6, + group_size: int = 4): + """ + return group_rms_norm(transpose(attn_output, [0,1]), weight) * sigmoid(gate) + Args: + attn_output: output of core attn, shape [bs, length, n_heads, head_dim] + gate: gate tensor for attention output, shape [length, bs, dim] + weight: weight of RMS norm, shape [dim] + eps: epsilon for RMS + group_size: group size of group RMS norm + Returns: + output with shape [length, bs, dim] + """ + return GroupNormGateFunction.apply(attn_output, gate, weight, eps, group_size) \ No newline at end of file diff --git a/linghe/facade/rope.py b/linghe/facade/rope.py index f12d518..bdf52fa 100644 --- a/linghe/facade/rope.py +++ b/linghe/facade/rope.py @@ -47,3 +47,35 @@ def backward(ctx, grad_q, grad_k, grad_v): transpose=True, interleave=True) return dqkv, dqw, dkw, None, None, None, None + + +def qk_norm_half_rope(qkv: torch.Tensor, + q_norm_weight: torch.Tensor, + k_norm_weight: torch.Tensor, + freqs: torch.Tensor, + H: int = 32, + h: int = 4, + eps: float = 1e-6): + """ + split qkv to q/k/v, apply qk norm and half rope to q/k, transpose q/k/v to flash-attention layout + Args: + qkv: QKV tensor with size of [S, B, dim], heads are interleaved + q_norm_weight: rms norm weight for query + k_norm_weight: rms norm weight for key + freqs: Freqs tensor based on half dim. + H: Number of attention heads. + h: Number of key/value heads. + eps: epsilon value for L2 normalization. 
+ + Returns: + qo: shape [B, S, H, head_dim] + ko: shape [B, S, h, head_dim] + vo: shape [B, S, h, head_dim] + """ + return QkNormHalfRopeFunction.apply(qkv, + q_norm_weight, + k_norm_weight, + freqs, + H, + h, + eps) \ No newline at end of file diff --git a/linghe/facade/transpose.py b/linghe/facade/transpose.py index d3ecfaa..6ee19b3 100644 --- a/linghe/facade/transpose.py +++ b/linghe/facade/transpose.py @@ -16,3 +16,15 @@ def forward(ctx, x): @staticmethod def backward(ctx, grad_output): return triton_transpose(grad_output, dim0=0, dim1=1) + + +def transpose_dim01(x): + """ + transpose a tensor with the first two dims, x.ndims should not greater than 4 + Args: + x: input tensor + + Returns: + a transposed tensor + """ + return TransposeDim01Function.apply(x) \ No newline at end of file diff --git a/linghe/gemm/blockwise_fp8_gemm.py b/linghe/gemm/blockwise_fp8_gemm.py new file mode 100644 index 0000000..da9416a --- /dev/null +++ b/linghe/gemm/blockwise_fp8_gemm.py @@ -0,0 +1,242 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. +""" + +import torch +import triton +import triton.language as tl +from triton import Config + +fp8_gemm_configs = [ + Config({"BLOCK_SIZE_M": block_m, "BLOCK_SIZE_N": block_n}, + num_stages=num_stages, num_warps=8) + for block_m in [32, 64, 128] + for block_n in [32, 64, 128] + for num_stages in [3, 4, 5, 6] +] + + +# @triton.autotune(configs=fp8_gemm_configs, key=["N", "K"]) +@triton.jit +def fp8_gemm_bb_kernel( + a_ptr, + b_ptr, + c_ptr, + a_s_ptr, + b_s_ptr, + M, + N: tl.constexpr, + K: tl.constexpr, + BLOCK_SIZE_K: tl.constexpr, + BLOCK_SIZE_M: tl.constexpr, + BLOCK_SIZE_N: tl.constexpr, +): + # a blockwise quantization, b blockwise quantization. + pid_m = tl.program_id(axis=0) + pid_n = tl.program_id(axis=1) + k = tl.cdiv(K, BLOCK_SIZE_K) + offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M + offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N + offs_k = tl.arange(0, BLOCK_SIZE_K) + a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :] + # b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None] + b_ptrs = b_ptr + offs_n[:, None] * K + offs_k[None, :] + nb = K // BLOCK_SIZE_K + + accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + for i in range(0, k): + a_s = tl.load(a_s_ptr + pid_m * nb + i) + b_s = tl.load(b_s_ptr + pid_n * nb + i) + a = tl.load(a_ptrs, mask=offs_k[None, :] < K - i * BLOCK_SIZE_K, + other=0.0) + b = tl.load(b_ptrs, mask=offs_k[None, :] < K - i * BLOCK_SIZE_K, + other=0.0) + accumulator += tl.dot(a, tl.trans(b)) * (a_s * b_s) + # accumulator = tl.dot(a, tl.trans(b), accumulator) + # accumulator += (accumulators-accumulator) * scale + a_ptrs += BLOCK_SIZE_K + b_ptrs += BLOCK_SIZE_K + c = accumulator.to(c_ptr.dtype.element_ty) + offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) + offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) + c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :] + mask = (offs_m[:, None] < M) & (offs_n[None, :] < N) + tl.store(c_ptrs, c, mask=mask) + + + +def triton_bb_fp8_gemm(a: torch.Tensor, + b: torch.Tensor, + a_s: torch.Tensor, + b_s: torch.Tensor, + out_dtype=torch.bfloat16, + block_size=128): + assert a.is_contiguous() and b.is_contiguous() + assert a_s.is_contiguous() and b_s.is_contiguous() + K = a.size(-1) + M = a.numel() // K + N = b.size(0) + c = torch.empty(M, N, dtype=out_dtype, device=a.device) + grid = lambda META: (triton.cdiv(M, META["BLOCK_SIZE_M"]), + triton.cdiv(N, META["BLOCK_SIZE_N"])) 
# noqa + + fp8_gemm_bb_kernel[grid](a, b, c, a_s, b_s, + M, N, K, + BLOCK_SIZE_K=block_size, + BLOCK_SIZE_M=block_size, + BLOCK_SIZE_N=block_size, + num_warps=8, + num_stages=4 + ) + return c + + + +@triton.autotune(configs=fp8_gemm_configs, key=["N", "K"]) +@triton.jit +def fp8_gemm_tb_kernel( + a_ptr, + b_ptr, + c_ptr, + a_s_ptr, + b_s_ptr, + M, + N: tl.constexpr, + K: tl.constexpr, + BLOCK_SIZE_K: tl.constexpr, + BLOCK_SIZE_M: tl.constexpr, + BLOCK_SIZE_N: tl.constexpr, +): + # a tilewise quantization, b blockwise quantization. + pid_m = tl.program_id(axis=0) + pid_n = tl.program_id(axis=1) + k = tl.cdiv(K, BLOCK_SIZE_K) + offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M + offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N + offs_k = tl.arange(0, BLOCK_SIZE_K) + a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :] + b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None] + a_s_ptrs = a_s_ptr + offs_m * k + b_s_ptrs = b_s_ptr + (offs_n // BLOCK_SIZE_K) * k + + accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + for i in range(k): + # a = tl.load(a_ptrs) + # b = tl.load(b_ptrs) + a = tl.load(a_ptrs, mask=offs_k[None, :] < K - i * BLOCK_SIZE_K, + other=0.0) + b = tl.load(b_ptrs, mask=offs_k[:, None] < K - i * BLOCK_SIZE_K, + other=0.0) + a_s = tl.load(a_s_ptrs) + b_s = tl.load(b_s_ptrs) + # accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :] + accumulators = tl.dot(a, b, accumulator) + accumulator += (accumulators - accumulator) * a_s[:, None] * b_s[None, + :] + a_ptrs += BLOCK_SIZE_K + b_ptrs += BLOCK_SIZE_K + a_s_ptrs += 1 + b_s_ptrs += 1 + c = accumulator.to(c_ptr.dtype.element_ty) + offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) + offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) + c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :] + # tl.store(c_ptrs, c) + mask = (offs_m[:, None] < M) & (offs_n[None, :] < N) + tl.store(c_ptrs, c, mask=mask) + + + + +def triton_tb_fp8_gemm(a: torch.Tensor, + b: torch.Tensor, + a_s: torch.Tensor, + b_s: torch.Tensor, + out_dtype=torch.bfloat16, + block_size=128): + assert a.is_contiguous() and b.is_contiguous() + assert a_s.is_contiguous() and b_s.is_contiguous() + K = a.size(-1) + M = a.numel() // K + N = b.size(0) + c = torch.empty(M, N, dtype=out_dtype, device=a.device) + grid = lambda META: (triton.cdiv(M, META["BLOCK_SIZE_M"]), + triton.cdiv(N, META["BLOCK_SIZE_N"])) # noqa + + fp8_gemm_tb_kernel[grid](a, b, c, + a_s, b_s, + M, N, K, + block_size + ) + return c + + +@triton.autotune(configs=fp8_gemm_configs, key=["N", "K"]) +@triton.jit +def fp8_gemm_tt_kernel( + a_ptr, + b_ptr, + c_ptr, + a_s_ptr, + b_s_ptr, + M, + N: tl.constexpr, + K: tl.constexpr, + BLOCK_SIZE_K: tl.constexpr, + BLOCK_SIZE_M: tl.constexpr, + BLOCK_SIZE_N: tl.constexpr, +): + # a and b all tilewise quantization. 
+ pid_m = tl.program_id(axis=0) + pid_n = tl.program_id(axis=1) + k = tl.cdiv(K, BLOCK_SIZE_K) + offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M + offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N + offs_k = tl.arange(0, BLOCK_SIZE_K) + a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :] + b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None] + a_s_ptrs = a_s_ptr + offs_m * k + b_s_ptrs = b_s_ptr + offs_n * k + + accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + for i in range(k): + a = tl.load(a_ptrs, mask=offs_k[None, :] < K - i * BLOCK_SIZE_K, + other=0.0) + b = tl.load(b_ptrs, mask=offs_k[:, None] < K - i * BLOCK_SIZE_K, + other=0.0) + a_s = tl.load(a_s_ptrs) + b_s = tl.load(b_s_ptrs) + accumulator += tl.dot(a, b) * a_s[:, None] * b_s[None, :] + a_ptrs += BLOCK_SIZE_K + b_ptrs += BLOCK_SIZE_K + a_s_ptrs += 1 + b_s_ptrs += 1 + + c = accumulator.to(c_ptr.dtype.element_ty) + offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) + offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) + c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :] + mask = (offs_m[:, None] < M) & (offs_n[None, :] < N) + tl.store(c_ptrs, c, mask=mask) + + +def triton_tt_fp8_gemm(a: torch.Tensor, + b: torch.Tensor, + a_s: torch.Tensor, + b_s: torch.Tensor, + out_dtype=torch.bfloat16, + block_size=128): + assert a.is_contiguous() and b.is_contiguous() + assert a_s.is_contiguous() and b_s.is_contiguous() + K = a.size(-1) + M = a.numel() // K + N = b.size(0) + c = torch.empty(*a.size()[:-1], N, dtype=out_dtype, device=a.device) + grid = lambda META: (triton.cdiv(M, META["BLOCK_SIZE_M"]), + triton.cdiv(N, META["BLOCK_SIZE_N"])) # noqa + fp8_gemm_tt_kernel[grid](a, b, c, + a_s, b_s, + M, N, K, + block_size) + return c diff --git a/linghe/gemm/channelwise_fp8_gemm.py b/linghe/gemm/channelwise_fp8_gemm.py new file mode 100644 index 0000000..a4d978f --- /dev/null +++ b/linghe/gemm/channelwise_fp8_gemm.py @@ -0,0 +1,130 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. 
+""" + +import torch +import triton +import triton.language as tl + + +# os.environ["TRITON_PRINT_AUTOTUNING"] = "1" + + +# fp8_gemm_configs = [ +# Config({"BLOCK_SIZE_K": block_k, "BLOCK_SIZE_M": block_m, "BLOCK_SIZE_N": block_n}, num_stages=num_stages, num_warps=num_warps) +# for block_k in [64, 128, 256] +# for block_m in [64, 128, 256] +# for block_n in [64, 128, 256] +# for num_stages in [2, 3, 4, 5] +# for num_warps in [4, 8, 16] +# # for num_stages in [3] +# # for num_warps in [8] +# ] + +# @triton.autotune(configs=fp8_gemm_configs, key=["M", "N", "K"]) +@triton.jit +def scaled_mm_kernel( + a_ptr, + b_ptr, + c_ptr, + a_scale_ptr, + b_scale_ptr, + N, + K, + ACCUM: tl.constexpr, + EVEN: tl.constexpr, + BLOCK_SIZE_K: tl.constexpr, + BLOCK_SIZE_M: tl.constexpr, + BLOCK_SIZE_N: tl.constexpr, +): + pid_m = tl.program_id(axis=0) + pid_n = tl.program_id(axis=1) + k = tl.cdiv(K, BLOCK_SIZE_K) + offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) + offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) + offs_k = tl.arange(0, BLOCK_SIZE_K) + a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :] + b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None] + a_scale = tl.load(a_scale_ptr + offs_m) + b_scale = tl.load(b_scale_ptr + offs_n) + + if ACCUM: + c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :] + accumulator = tl.load(c_ptrs).to(tl.float32) + a_s = 1 / tl.maximum(a_scale, 1e-30) + b_s = 1 / tl.maximum(b_scale, 1e-30) + accumulator = accumulator * a_s[:, None] * b_s[None, :] + else: + accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + + if EVEN: + for i in range(k): + a = tl.load(a_ptrs) + b = tl.load(b_ptrs) + accumulator = tl.dot(a, b, accumulator) + a_ptrs += BLOCK_SIZE_K + b_ptrs += BLOCK_SIZE_K + else: + for i in range(k): + indices = i * BLOCK_SIZE_K + offs_k + a = tl.load(a_ptrs, mask=indices[None, :] < K) + b = tl.load(b_ptrs, mask=indices[:, None] < K) + accumulator = tl.dot(a, b, accumulator) + a_ptrs += BLOCK_SIZE_K + b_ptrs += BLOCK_SIZE_K + + accumulator = accumulator * a_scale[:, None] * b_scale[None, :] + accumulator = accumulator.to(c_ptr.dtype.element_ty) + c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :] + tl.store(c_ptrs, accumulator) + + +def triton_scaled_mm(a: torch.Tensor, + b: torch.Tensor, + a_scale: torch.Tensor, + b_scale: torch.Tensor, + out_dtype=torch.float32, + c=None, + accum=True): + """ + similar to torch._scaled_mm, support accumulating gemm output to c + and low precision output tensor + Args: + a: left fp8 tensor + b: right fp8 tensor, column-major + a_scale: fp32 scale of a + b_scale: fp32 scale of b + out_dtype: output tensor dtype + c: output tensor + accum: accumulate output on c if True + + Returns: + c: output tensor + """ + assert a.is_contiguous() and b.is_contiguous() + M, K = a.size() + N, K = b.size() + ACCUM = accum and c is not None + if c is None: + c = torch.empty(M, N, dtype=out_dtype, device=a.device) + BLOCK_SIZE_K = 128 + BLOCK_SIZE_M = 128 + BLOCK_SIZE_N = 256 + EVEN = K % BLOCK_SIZE_K == 0 + grid = lambda META: ( + M // META["BLOCK_SIZE_M"], N // META["BLOCK_SIZE_N"]) # noqa + scaled_mm_kernel[grid](a, b, c, + a_scale, + b_scale, + N, K, + ACCUM, + EVEN, + BLOCK_SIZE_K, + BLOCK_SIZE_M, + BLOCK_SIZE_N, + num_stages=3, + num_warps=8 + ) + + return c diff --git a/linghe/gemm/fp32_gemm.py b/linghe/gemm/fp32_gemm.py index e4bbeb9..8f44067 100644 --- a/linghe/gemm/fp32_gemm.py +++ b/linghe/gemm/fp32_gemm.py @@ -59,10 +59,19 @@ def fp32_gemm_kernel( tl.store(c_ptrs, c) -# a, bf16 -# b, bf16 
-# c, fp32 + def triton_fp32_gemm(a: torch.Tensor, b: torch.Tensor): + """ + return fp32 gemm result with fp16/bf16 inputs, + it's mainly used for MoE router GEMM + and DO NOT suitable for large size GEMM + Args: + a: left matrix with fp16/bf16 precision + b: right matrix with fp16/bf16 precision + + Returns: + c: output with fp32 precision + """ assert a.is_contiguous() and b.is_contiguous() M, K = a.size() N, K = b.size() @@ -86,11 +95,11 @@ def triton_fp32_gemm(a: torch.Tensor, b: torch.Tensor): return c +# @triton.autotune(configs=fp32_gemm_configs, key=["M", "N", "K"]) @triton.jit -def scaled_fp32_gemm_kernel( +def fp32_gemm_for_backward_kernel( a_ptr, b_ptr, - scale_ptr, c_ptr, M, N: tl.constexpr, @@ -106,61 +115,64 @@ def scaled_fp32_gemm_kernel( offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) offs_k = tl.arange(0, BLOCK_SIZE_K) a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :] - b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None] + b_ptrs = b_ptr + offs_n[None, :] + offs_k[:, None] * N c = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + for i in range(k): - a = tl.load(a_ptrs).to(tl.float32) + a = tl.load(a_ptrs) b = tl.load(b_ptrs).to(tl.float32) # c += tl.dot(a, b) c = tl.dot(a, b, c) a_ptrs += BLOCK_SIZE_K - b_ptrs += BLOCK_SIZE_K - - scale = tl.load( - scale_ptr + pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) - c *= scale[:, None] - + b_ptrs += BLOCK_SIZE_K * N offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :] tl.store(c_ptrs, c) -def triton_scaled_fp32_gemm(a: torch.Tensor, b: torch.Tensor, - scale: torch.Tensor): +def triton_fp32_gemm_for_backward(a: torch.Tensor, + b: torch.Tensor): + """ + mix precision gemm for backward, a@b.float() + Args: + a: input gradient, fp32 + b: gemm weight, bf16/fp16 + Returns: + c: gradient of activation + """ assert a.is_contiguous() and b.is_contiguous() M, K = a.size() - N, K = b.size() - c = torch.empty(M, N, dtype=torch.float32, device=a.device) + K, N = b.size() + c = torch.empty((M, N), dtype=b.dtype, device=b.device) grid = lambda META: (triton.cdiv(M, META["BLOCK_SIZE_M"]), triton.cdiv(N, META["BLOCK_SIZE_N"])) # noqa BLOCK_SIZE_K = 128 BLOCK_SIZE_M = 32 BLOCK_SIZE_N = 128 num_warps = 4 - num_stages = 3 - scaled_fp32_gemm_kernel[grid](a, b, scale, c, - M, N, K, - BLOCK_SIZE_K, - BLOCK_SIZE_M, - BLOCK_SIZE_N, - num_warps=num_warps, - num_stages=num_stages - ) + num_stages = 2 + fp32_gemm_for_backward_kernel[grid](a, b, c, + M, N, K, + BLOCK_SIZE_K, + BLOCK_SIZE_M, + BLOCK_SIZE_N, + num_warps=num_warps, + num_stages=num_stages + ) return c # @triton.autotune(configs=fp32_gemm_configs, key=["M", "N", "K"]) @triton.jit -def fp32_gemm_for_backward_kernel( +def fp32_gemm_for_update_kernel( a_ptr, b_ptr, c_ptr, M, N: tl.constexpr, K: tl.constexpr, - ACCUM: tl.constexpr, BLOCK_SIZE_K: tl.constexpr, BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, @@ -171,63 +183,62 @@ def fp32_gemm_for_backward_kernel( offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) offs_k = tl.arange(0, BLOCK_SIZE_K) - a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :] + a_ptrs = a_ptr + offs_m[None, :] + offs_k[:, None] * M b_ptrs = b_ptr + offs_n[None, :] + offs_k[:, None] * N - if ACCUM: - c = tl.load(c_ptr + offs_m[:, None] * N + offs_n[None, :]).to( - tl.float32) - else: - c = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), 
dtype=tl.float32) - + c = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) + # c = tl.load(c_ptr + offs_m[:, None] * N + offs_n[None, :]).to(tl.float32) for i in range(k): - a = tl.load(a_ptrs) + a = tl.trans(tl.load(a_ptrs)).to(tl.float32) b = tl.load(b_ptrs).to(tl.float32) # c += tl.dot(a, b) c = tl.dot(a, b, c) - a_ptrs += BLOCK_SIZE_K + a_ptrs += BLOCK_SIZE_K * M b_ptrs += BLOCK_SIZE_K * N + offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) c_ptrs = c_ptr + offs_m[:, None] * N + offs_n[None, :] tl.store(c_ptrs, c) -# a: router output, fp32 -# b: router weight, bf16, should be transposed before calculation -# c: dy of rms, bf16, shoule be accumlated -def triton_fp32_gemm_for_backward(a: torch.Tensor, b: torch.Tensor, - c: Optional[torch.Tensor] = None, - accum=False): +def triton_fp32_gemm_for_update(a: torch.Tensor, b: torch.Tensor): + """ + mix precision gemm for updaing weight + Args: + a: gradient of output, fp32 + b: input activation, bf16/fp16 + Returns: + c: gradient of weight + """ assert a.is_contiguous() and b.is_contiguous() - M, K = a.size() + K, M = a.size() K, N = b.size() - if c is None: - c = torch.empty((M, N), dtype=b.dtype, device=b.device) - accum = False + c = torch.empty((M, N), dtype=b.dtype, device=b.device) grid = lambda META: (triton.cdiv(M, META["BLOCK_SIZE_M"]), triton.cdiv(N, META["BLOCK_SIZE_N"])) # noqa BLOCK_SIZE_K = 128 BLOCK_SIZE_M = 32 BLOCK_SIZE_N = 128 num_warps = 4 - num_stages = 2 - fp32_gemm_for_backward_kernel[grid](a, b, c, - M, N, K, accum, - BLOCK_SIZE_K, - BLOCK_SIZE_M, - BLOCK_SIZE_N, - num_warps=num_warps, - num_stages=num_stages - ) + num_stages = 3 + fp32_gemm_for_update_kernel[grid](a, b, c, + M, N, K, + BLOCK_SIZE_K, + BLOCK_SIZE_M, + BLOCK_SIZE_N, + num_warps=num_warps, + num_stages=num_stages + ) return c -# @triton.autotune(configs=fp32_gemm_configs, key=["M", "N", "K"]) + @triton.jit -def fp32_gemm_for_update_kernel( +def scaled_fp32_gemm_kernel( a_ptr, b_ptr, + scale_ptr, c_ptr, M, N: tl.constexpr, @@ -242,18 +253,21 @@ def fp32_gemm_for_update_kernel( offs_m = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) offs_n = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) offs_k = tl.arange(0, BLOCK_SIZE_K) - a_ptrs = a_ptr + offs_m[None, :] + offs_k[:, None] * M - b_ptrs = b_ptr + offs_n[None, :] + offs_k[:, None] * N + a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :] + b_ptrs = b_ptr + offs_n[None, :] * K + offs_k[:, None] c = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) - # c = tl.load(c_ptr + offs_m[:, None] * N + offs_n[None, :]).to(tl.float32) for i in range(k): - a = tl.trans(tl.load(a_ptrs)).to(tl.float32) + a = tl.load(a_ptrs).to(tl.float32) b = tl.load(b_ptrs).to(tl.float32) # c += tl.dot(a, b) c = tl.dot(a, b, c) - a_ptrs += BLOCK_SIZE_K * M - b_ptrs += BLOCK_SIZE_K * N + a_ptrs += BLOCK_SIZE_K + b_ptrs += BLOCK_SIZE_K + + scale = tl.load( + scale_ptr + pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) + c *= scale[:, None] offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M) offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N) @@ -261,13 +275,34 @@ def fp32_gemm_for_update_kernel( tl.store(c_ptrs, c) -# a: router output, fp32, should be transposed before calculation -# b: input of rms, bf16, should be transposed before calculation -def triton_fp32_gemm_for_update(a: torch.Tensor, b: torch.Tensor): +def triton_scaled_fp32_gemm(a: torch.Tensor, + b: torch.Tensor, + scale: torch.Tensor): + """ + c = 
(a*scale[:,None])*b + this kernel is used to fuse RMSNorm and quantization in MoE layer + native implementation: + y = rms_norm(x), + y_q = quantization(y), + router_logits = y@w + we can not fuse rms_norm and quantization + as we still need bf16 y for moe router gemm + fused implementation: + y_q, rms = quantization(rms_norm(x)) + router_logits = (x/rms)@y + so we need a scaled fp32 gemm kernel + Args: + a: activation tensor + b: weight tensor + scale: scale for activation tensor, 1/rms + + Returns: + + """ assert a.is_contiguous() and b.is_contiguous() - K, M = a.size() - K, N = b.size() - c = torch.empty((M, N), dtype=b.dtype, device=b.device) + M, K = a.size() + N, K = b.size() + c = torch.empty(M, N, dtype=torch.float32, device=a.device) grid = lambda META: (triton.cdiv(M, META["BLOCK_SIZE_M"]), triton.cdiv(N, META["BLOCK_SIZE_N"])) # noqa BLOCK_SIZE_K = 128 @@ -275,17 +310,20 @@ def triton_fp32_gemm_for_update(a: torch.Tensor, b: torch.Tensor): BLOCK_SIZE_N = 128 num_warps = 4 num_stages = 3 - fp32_gemm_for_update_kernel[grid](a, b, c, - M, N, K, - BLOCK_SIZE_K, - BLOCK_SIZE_M, - BLOCK_SIZE_N, - num_warps=num_warps, - num_stages=num_stages - ) + scaled_fp32_gemm_kernel[grid](a, b, + scale, + c, + M, N, K, + BLOCK_SIZE_K, + BLOCK_SIZE_M, + BLOCK_SIZE_N, + num_warps=num_warps, + num_stages=num_stages + ) return c + @triton.jit def scaled_fp32_gemm_for_update_kernel( a_ptr, @@ -309,7 +347,6 @@ def scaled_fp32_gemm_for_update_kernel( b_ptrs = b_ptr + offs_n[None, :] + offs_k[:, None] * N c = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32) - # c = tl.load(c_ptr + offs_m[:, None] * N + offs_n[None, :]).to(tl.float32) for i in range(k): scale = tl.load( scale_ptr + i * BLOCK_SIZE_K + tl.arange(0, BLOCK_SIZE_K)) @@ -326,11 +363,19 @@ def scaled_fp32_gemm_for_update_kernel( tl.store(c_ptrs, c) -# a: router output, fp32, should be transposed before calculation -# b: input of rms, bf16, should be transposed before calculation -# scale: 1/rms -def triton_scaled_fp32_gemm_for_update(a: torch.Tensor, b: torch.Tensor, +def triton_scaled_fp32_gemm_for_update(a: torch.Tensor, + b: torch.Tensor, scale: torch.Tensor): + """ + see triton_scaled_fp32_gemm + Args: + a: y + b: activation before RMS norm + scale: 1/rms + + Returns: + dw + """ assert a.is_contiguous() and b.is_contiguous() K, M = a.size() K, N = b.size() diff --git a/linghe/quant/block/block.py b/linghe/quant/block.py similarity index 85% rename from linghe/quant/block/block.py rename to linghe/quant/block.py index cdb9d16..cfb37fa 100644 --- a/linghe/quant/block/block.py +++ b/linghe/quant/block.py @@ -28,9 +28,20 @@ def block_quant_kernel(x_ptr, y_ptr, s_ptr, M, N, BLOCK_SIZE: tl.constexpr, tl.store(s_ptr + pid_m * n + pid_n, s) -def block_quant(x, +def triton_block_quant(x, block_size=128, round_scale=False): + """ + blockwise quantize x + Args: + x: input tensor + block_size: block wise + round_scale: whether round scale to power of 2 + + Returns: + y: quantized tensor, float8_e4m3fn + s: quantization scale, float32 + """ M, N = x.size() y = torch.empty((M, N), dtype=torch.float8_e4m3fn, device=x.device) s = x.new_empty(x.size(-2) // block_size, x.size(-1) // block_size, diff --git a/linghe/quant/block/group.py b/linghe/quant/block/group.py deleted file mode 100644 index de5908c..0000000 --- a/linghe/quant/block/group.py +++ /dev/null @@ -1,107 +0,0 @@ -# -*- coding: utf-8 -*- -""" -Copyright (c) Ant Financial Service Group and its affiliates. 
-""" - -import torch -import triton -import triton.language as tl - - -@triton.jit -def group_quant_kernel(x_ptr, y_ptr, s_ptr, N, BLOCK_SIZE: tl.constexpr, - K: tl.constexpr, ROUND: tl.constexpr): - pid = tl.program_id(axis=0) - offs = pid * N + tl.arange(0, K * BLOCK_SIZE) - n = tl.cdiv(N, K * BLOCK_SIZE) - soffs = pid * n * K + tl.arange(0, K) - for i in range(n): - x = tl.load(x_ptr + offs).to(tl.float32) - x = tl.reshape(x, (K, BLOCK_SIZE), can_reorder=False) - s = tl.maximum(tl.max(tl.abs(x), 1) / 448.0, 1e-30) - if ROUND: - s = tl.exp2(tl.ceil(tl.log2(s))) - y = x / s[:, None] - y = y.to(y_ptr.dtype.element_ty) - y = tl.reshape(y, (K * BLOCK_SIZE,), can_reorder=False) - tl.store(y_ptr + offs, y) - tl.store(s_ptr + soffs, s) - offs += K * BLOCK_SIZE - soffs += K - - -def triton_group_quant(x, dtype=torch.float8_e4m3fn, group_size=128, - round_scale=False): - M, N = x.shape - K = 16 - assert N % group_size == 0 and N % (group_size * K) == 0 - assert x.is_contiguous() - - y = torch.empty((M, N), device=x.device, dtype=dtype) - s = torch.empty(M, N // group_size, device=x.device, dtype=torch.float32) - grid = (M,) # noqa - group_quant_kernel[grid](x, - y, - s, - N, - group_size, - K, - round_scale, - num_stages=5, - num_warps=4) - return y, s - - -@triton.jit -def persist_group_quant_kernel(x_ptr, y_ptr, s_ptr, N, BLOCK_SIZE: tl.constexpr, - B: tl.constexpr, K: tl.constexpr, - ROUND: tl.constexpr): - pid = tl.program_id(axis=0) - offs = pid * B * N + tl.arange(0, B)[:, None] * N + tl.arange(0, - K * BLOCK_SIZE)[ - None, :] - n = tl.cdiv(N, K * BLOCK_SIZE) - soffs = pid * B * n * K + tl.arange(0, B)[:, None] * n * K + tl.arange(0, - K)[ - None, :] - - for j in range(n): - x = tl.load(x_ptr + offs).to(tl.float32) - x = tl.reshape(x, (B, K, BLOCK_SIZE)) - - s = tl.maximum(tl.max(tl.abs(x), 2) / 448.0, 1e-30) - if ROUND: - s = tl.exp2(tl.ceil(tl.log2(s))) - y = x / s[:, :, None] - y = y.to(y_ptr.dtype.element_ty) - y = tl.reshape(y, (B, K * BLOCK_SIZE)) - tl.store(y_ptr + offs, y) - tl.store(s_ptr + soffs, s) - offs += K * BLOCK_SIZE - soffs += K - - -def triton_persist_group_quant(x, dtype=torch.float8_e4m3fn, group_size=128, - round_scale=False): - M, N = x.shape - device = x.device - K = 8 - B = 8 - assert N % group_size == 0 and N % (group_size * K) == 0 - assert x.is_contiguous() - - y = torch.empty((M, N), dtype=dtype, device=device) - s = torch.empty(M, N // group_size, device=x.device, dtype=torch.float32) - - grid = (M // B,) # noqa - persist_group_quant_kernel[grid](x, - y, - s, - N, - group_size, - B, - K, - round_scale, - num_stages=3, - num_warps=8) - return y, s diff --git a/linghe/quant/channel/channel.py b/linghe/quant/channel.py similarity index 88% rename from linghe/quant/channel/channel.py rename to linghe/quant/channel.py index 4968747..01a9571 100644 --- a/linghe/quant/channel/channel.py +++ b/linghe/quant/channel.py @@ -3,6 +3,7 @@ Copyright (c) Ant Financial Service Group and its affiliates. 
""" +from typing import Optional import torch import triton import triton.language as tl @@ -36,6 +37,16 @@ def row_quant_kernel(x_ptr, q_ptr, s_ptr, M, N, BLOCK_SIZE: tl.constexpr, def triton_row_quant(x, round_scale=False): + """ + rowwise quantize x + Args: + x: input x + round_scale: whether round scale to power of 2 + + Returns: + x_q: quantized tensor + x_scale: quantization scale + """ M, N = x.shape BLOCK_SIZE = max([N % x == 0 for x in [512, 1024, 2048, 4096, 8192]]) x_q = torch.empty((M, N), dtype=torch.float8_e4m3fn, device=x.device) @@ -73,9 +84,10 @@ def deprecated_tokenwise_row_quant_kernel(x_ptr, out_ptr, scale_ptr, M, offs += N -def triton_deprecated_tokenwise_row_quant(x, out=None, scale=None, - round_scale=False): - # row-wise read, row-wise write +def triton_deprecated_tokenwise_row_quant(x: torch.Tensor, + out: Optional[torch.Tensor] = None, + scale: Optional[torch.Tensor] = None, + round_scale: bool = False): M, N = x.shape device = x.device if out is None: @@ -113,6 +125,16 @@ def tokenwise_row_quant_kernel(x_ptr, out_ptr, scale_ptr, N: tl.constexpr, def triton_tokenwise_row_quant(x, out=None, scale=None, round_scale=False): + """ + rowwise quantize x with power of 2 dim size + Args: + x: input x + round_scale: whether round scale to power of 2 + + Returns: + out: quantized tensor + scale: quantization scale + """ # row-wise read, row-wise write M, N = x.shape device = x.device @@ -169,7 +191,18 @@ def transpose_row_quant_kernel(x_ptr, q_ptr, s_ptr, M, N, H: tl.constexpr, toffs += H -def triton_transpose_row_quant(x, side=0, round_scale=False): +def triton_transpose_row_quant(x, round_scale=False): + """ + transpose x and row quantize x + Args: + x: input x + round_scale: whether round scale to power of 2 + + Returns: + x_q: quantized tensor + x_scale: quantization scale + + """ M, N = x.shape H = 1024 W = 16 @@ -218,7 +251,6 @@ def channel_quant_forward(x, w): def channel_quant_backward(y, w): y_q, y_scale, w_q, w_scale = triton_channel_quant_nn(y, w) - # print(f'{y.shape=} {w.shape=} {y_q.shape=} {y_scale.shape=} {w_q.shape=} {w_scale.shape=}') output = torch._scaled_mm(y_q, w_q.t(), scale_a=y_scale, diff --git a/linghe/quant/group.py b/linghe/quant/group.py new file mode 100644 index 0000000..9dec9b8 --- /dev/null +++ b/linghe/quant/group.py @@ -0,0 +1,66 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. 
+""" + +import torch +import triton +import triton.language as tl + + +@triton.jit +def group_quant_kernel(x_ptr, y_ptr, s_ptr, N, BLOCK_SIZE: tl.constexpr, + K: tl.constexpr, ROUND: tl.constexpr): + pid = tl.program_id(axis=0) + offs = pid * N + tl.arange(0, K * BLOCK_SIZE) + n = tl.cdiv(N, K * BLOCK_SIZE) + soffs = pid * n * K + tl.arange(0, K) + for i in range(n): + x = tl.load(x_ptr + offs).to(tl.float32) + x = tl.reshape(x, (K, BLOCK_SIZE), can_reorder=False) + s = tl.maximum(tl.max(tl.abs(x), 1) / 448.0, 1e-30) + if ROUND: + s = tl.exp2(tl.ceil(tl.log2(s))) + y = x / s[:, None] + y = y.to(y_ptr.dtype.element_ty) + y = tl.reshape(y, (K * BLOCK_SIZE,), can_reorder=False) + tl.store(y_ptr + offs, y) + tl.store(s_ptr + soffs, s) + offs += K * BLOCK_SIZE + soffs += K + + +def triton_group_quant(x, + dtype=torch.float8_e4m3fn, + group_size=128, + round_scale=False): + """ + groupwise quantize x, group is in under rowwise format + Args: + x: input tensor + group_size: group wise + round_scale: whether round scale to power of 2 + + Returns: + y: quantized tensor, float8_e4m3fn + s: quantization scale, float32 + """ + M, N = x.shape + K = 16 + assert N % group_size == 0 and N % (group_size * K) == 0 + assert x.is_contiguous() + + y = torch.empty((M, N), device=x.device, dtype=dtype) + s = torch.empty(M, N // group_size, device=x.device, dtype=torch.float32) + grid = (M,) # noqa + group_quant_kernel[grid](x, + y, + s, + N, + group_size, + K, + round_scale, + num_stages=5, + num_warps=4) + return y, s + diff --git a/linghe/quant/block/__init__.py b/linghe/quant/hadamard/__init__.py similarity index 100% rename from linghe/quant/block/__init__.py rename to linghe/quant/hadamard/__init__.py diff --git a/linghe/quant/hadamard/seperate_hadamard.py b/linghe/quant/hadamard/seperate_hadamard.py new file mode 100644 index 0000000..b52c4ef --- /dev/null +++ b/linghe/quant/hadamard/seperate_hadamard.py @@ -0,0 +1,316 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. 
+""" + +import torch +import triton +import triton.language as tl + + +@triton.jit +def hadamard_quant_row_kernel( + x_ptr, + hm_ptr, + x_q_ptr, + x_scale_ptr, + M, + N, + BLOCK_SIZE: tl.constexpr, + R: tl.constexpr, +): + pid = tl.program_id(0) + row_start = pid * R * BLOCK_SIZE + rows = row_start + tl.arange(0, R * BLOCK_SIZE) + mask_rows = rows < M + + hm = tl.load( + hm_ptr + tl.arange(0, BLOCK_SIZE)[:, None] * BLOCK_SIZE + tl.arange(0, + BLOCK_SIZE)[ + None, :]) + + max_val = tl.zeros((R * BLOCK_SIZE,), dtype=tl.float32) + 1.17e-38 + + num_col_blocks = tl.cdiv(N, BLOCK_SIZE) + for col_block in range(num_col_blocks): + col_start = col_block * BLOCK_SIZE + cols = col_start + tl.arange(0, BLOCK_SIZE) + mask_cols = cols < N + + offs = rows[:, None] * N + cols[None, :] + x = tl.load(x_ptr + offs, mask=mask_rows[:, None] & mask_cols[None, :], + other=0.0) + x_transformed = tl.dot(x, hm) + current_max = tl.max(tl.abs(x_transformed), axis=1) + max_val = tl.maximum(max_val, current_max) + + scale = max_val / 448.0 + tl.store(x_scale_ptr + rows, scale, mask=mask_rows) + s = 448.0 / tl.where(max_val > 0, max_val, 1.0) + + for col_block in range(num_col_blocks): + col_start = col_block * BLOCK_SIZE + cols = col_start + tl.arange(0, BLOCK_SIZE) + mask_cols = cols < N + + offs = rows[:, None] * N + cols[None, :] + x = tl.load(x_ptr + offs, mask=mask_rows[:, None] & mask_cols[None, :], + other=0.0) + x_transformed = tl.dot(x, hm) + quantized = (x_transformed * s[:, None]).to(x_q_ptr.dtype.element_ty) + tl.store(x_q_ptr + offs, quantized, + mask=mask_rows[:, None] & mask_cols[None, :]) + + +@triton.jit +def hadamard_quant_col_kernel( + x_ptr, + hm_ptr, + xt_q_ptr, + xt_scale_ptr, + M, + N, + BLOCK_SIZE: tl.constexpr, + R: tl.constexpr, +): + pid = tl.program_id(0) + col_start = pid * R * BLOCK_SIZE + cols = col_start + tl.arange(0, R * BLOCK_SIZE) + mask_cols = cols < N + + hm = tl.load( + hm_ptr + tl.arange(0, BLOCK_SIZE)[:, None] * BLOCK_SIZE + tl.arange(0, + BLOCK_SIZE)[ + None, :]) + + max_val = tl.zeros((R * BLOCK_SIZE,), dtype=tl.float32) + 1.17e-38 + + num_row_blocks = tl.cdiv(M, BLOCK_SIZE) + for row_block in range(num_row_blocks): + row_start = row_block * BLOCK_SIZE + rows = row_start + tl.arange(0, BLOCK_SIZE) + mask_rows = rows < M + + offs = rows[:, None] * N + cols[None, :] + x = tl.load(x_ptr + offs, mask=mask_rows[:, None] & mask_cols[None, :], + other=0.0) + x_transformed = tl.dot(hm, x) + current_max = tl.max(tl.abs(x_transformed), axis=0) + max_val = tl.maximum(max_val, current_max) + + scale = max_val / 448.0 + tl.store(xt_scale_ptr + cols, scale, mask=mask_cols) + s = 448.0 / tl.where(max_val > 0, max_val, 1.0) + + for row_block in range(num_row_blocks): + row_start = row_block * BLOCK_SIZE + rows = row_start + tl.arange(0, BLOCK_SIZE) + mask_rows = rows < M + + offs = rows[:, None] * N + cols[None, :] + x = tl.load(x_ptr + offs, mask=mask_rows[:, None] & mask_cols[None, :], + other=0.0) + x_transformed = tl.dot(hm, x) + quantized = (x_transformed * s[None, :]).to(xt_q_ptr.dtype.element_ty) + quantized_t = tl.trans(quantized) + store_offs = cols[:, None] * M + rows[None, :] + tl.store(xt_q_ptr + store_offs, quantized_t, + mask=mask_cols[:, None] & mask_rows[None, :]) + + +# y = x @ w +# dx = y @ wT +# dwT = yT @ x +def triton_hadamard_quant_x(x, hm): + # apply hadamard transformation and quantization for x + # y = x @ w: x->x@h and rowwise quant + # dwT = yT @ x: x->xT@h and rowwise quant + M, N = x.shape + device = x.device + BLOCK_SIZE = hm.size(0) + R = 1 + x_q = 
torch.empty((M, N), dtype=torch.float8_e4m3fn, device=device) + xt_q = torch.empty((N, M), dtype=torch.float8_e4m3fn, device=device) + x_scale = torch.empty((M, ), dtype=torch.float32, device=device) + xt_scale = torch.empty((N, ), dtype=torch.float32, device=device) + + grid_row = (triton.cdiv(M, R * BLOCK_SIZE),) + hadamard_quant_row_kernel[grid_row]( + x, + hm, + x_q, + x_scale, + M, + N, + BLOCK_SIZE, + R, + num_stages=6, + num_warps=4 + ) + + grid_col = (triton.cdiv(N, R * BLOCK_SIZE),) + hadamard_quant_col_kernel[grid_col]( + x, + hm, + xt_q, + xt_scale, + M, + N, + BLOCK_SIZE, + R, + num_stages=6, + num_warps=4 + ) + + return x_q, x_scale,xt_q, xt_scale + + +# y = x @ w +# dx = y @ wT +# dwT = yT @ x +def triton_hadamard_quant_w(w, hm): + # apply hadamard transformation and quantization for w + # y = x @ w: w->w@h and rowwise quant + # dx = y @ wT: w->h@wT and rowwise quant + M, N = w.shape + device = w.device + w_q = torch.empty((M, N), dtype=torch.float8_e4m3fn, device=device) + wt_q = torch.empty((N, M), dtype=torch.float8_e4m3fn, device=device) + w_scale = torch.empty((M, ), dtype=torch.float32, device=device) + wt_scale = torch.empty((N, ), dtype=torch.float32, device=device) + + BLOCK_SIZE = hm.size(0) + R = 1 + + grid_row = (triton.cdiv(M, R * BLOCK_SIZE),) + hadamard_quant_row_kernel[grid_row]( + w, + hm, + w_q, + w_scale, + M, + N, + BLOCK_SIZE, + R, + num_stages=6, + num_warps=4 + ) + + grid_col = (triton.cdiv(N, R * BLOCK_SIZE),) + hadamard_quant_col_kernel[grid_col]( + w, + hm, + wt_q, + wt_scale, + M, + N, + BLOCK_SIZE, + R, + num_stages=6, + num_warps=4 + ) + + return w_q, w_scale, wt_q, wt_scale + + +# y = x @ w +# dx = y @ wT +# dwT = yT @ x +def triton_hadamard_quant_y(y, hm): + # apply hadamard transformation and quantization for dy + # dx = y @ wT: y->y@h and rowwise quant + # dwT = yT @ x: y->h@yT and rowwise quant + M, N = y.shape + device = y.device + BLOCK_SIZE = hm.size(0) + R = 1 + y_q = torch.empty((M, N), dtype=torch.float8_e4m3fn, device=device) + yt_q = torch.empty((N, M), dtype=torch.float8_e4m3fn, device=device) + y_scale = torch.empty((M, ), dtype=torch.float32, device=device) + yt_scale = torch.empty((N, ), dtype=torch.float32, device=device) + + grid_row = (triton.cdiv(M, R * BLOCK_SIZE),) + hadamard_quant_row_kernel[grid_row]( + y, + hm, + y_q, + y_scale, + M, + N, + BLOCK_SIZE, + R, + num_stages=6, + num_warps=4 + ) + + grid_col = (triton.cdiv(N, R * BLOCK_SIZE),) + hadamard_quant_col_kernel[grid_col]( + y, + hm, + yt_q, + yt_scale, + M, + N, + BLOCK_SIZE, + R, + num_stages=6, + num_warps=4 + ) + + return y_q, y_scale, yt_q, yt_scale + + +def triton_hadamard_quant_nt_megatron(x, w, hm): + x_q, _, x_scale, _ = triton_hadamard_quant_x(x, hm) + w_q, _, w_scale, _ = triton_hadamard_quant_w(w, hm) + return x_q, x_scale, w_q, w_scale + + +def triton_hadamard_quant_nn_megatron(y, w, hm): + y_q, _, y_scale, _ = triton_hadamard_quant_y(y, hm) + _, wt_q, _, wt_scale = triton_hadamard_quant_w(w, hm) + return y_q, y_scale, wt_q, wt_scale + + +def triton_hadamard_quant_tn_megatron(y, x, hm): + _, yt_q, _, yt_scale = triton_hadamard_quant_y(y, hm) + _, xt_q, _, xt_scale = triton_hadamard_quant_x(x, hm) + return yt_q, yt_scale, xt_q, xt_scale + + +def hadamard_quant_forward_megatron(x, w, hm): + x_q, x_scale, w_q, w_scale = triton_hadamard_quant_nt_megatron(x, w, hm) + output = torch._scaled_mm(x_q, + w_q.t(), + scale_a=x_scale, + scale_b=w_scale, + out_dtype=x.dtype, + use_fast_accum=True + ) + return output, x_q, w_q, x_scale, w_scale + + +def 
hadamard_quant_backward_megatron(y, w, hm): + y_q, y_scale, wt_q, wt_scale = triton_hadamard_quant_nn_megatron(y, w, hm) + output = torch._scaled_mm( + y_q, + wt_q.t(), + scale_a=y_scale, + scale_b=wt_scale, + out_dtype=y.dtype, + use_fast_accum=True + ) + return output, y_q, wt_q.t(), y_scale, wt_scale + + +def hadamard_quant_update_megatron(y, x, hm): + yt_q, yt_scale, xt_q, xt_scale = triton_hadamard_quant_tn_megatron(y, x, hm) + output = torch._scaled_mm(yt_q, + xt_q.t(), + scale_a=yt_scale.t(), + scale_b=xt_scale, + out_dtype=x.dtype, + use_fast_accum=True + ) + return output, yt_q, xt_q, yt_scale, xt_scale + diff --git a/linghe/quant/channel/__init__.py b/linghe/quant/smooth/__init__.py similarity index 100% rename from linghe/quant/channel/__init__.py rename to linghe/quant/smooth/__init__.py diff --git a/linghe/quant/smooth/reused_smooth.py b/linghe/quant/smooth/reused_smooth.py new file mode 100644 index 0000000..14aab4e --- /dev/null +++ b/linghe/quant/smooth/reused_smooth.py @@ -0,0 +1,899 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. +""" + +import torch +import triton +import triton.language as tl + +from linghe.tools.util import round_up + + +# TODO(nanxiao): use max instead of sum +@triton.jit +def tokenwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, max_ptr, + M, T, + N: tl.constexpr, + W: tl.constexpr, + EVEN: tl.constexpr, + REVERSE: tl.constexpr, + ROUND: tl.constexpr, + CALIBRATE: tl.constexpr): + pid = tl.program_id(axis=0) + # row-wise read, row-wise write + smooth_scale = tl.load(ss_ptr + tl.arange(0, N))[None, :] + if not REVERSE: + smooth_scale = 1.0 / smooth_scale + + if CALIBRATE: + output_maxs = tl.zeros((W, N), dtype=tl.float32) + for i in range(T): + indices = pid * W * T + i * W + tl.arange(0, W) + if EVEN: + x = tl.load(x_ptr + pid * W * T * N + i * N * W + tl.arange(0, W)[:, + None] * N + tl.arange( + 0, N)[None, :]).to( + tl.float32) + else: + x = tl.load(x_ptr + pid * W * T * N + i * N * W + tl.arange(0, W)[:, + None] * N + tl.arange( + 0, N)[None, :], + mask=indices[:, None] < M).to( + tl.float32) + if CALIBRATE: + output_maxs = tl.maximum(tl.abs(x), output_maxs) + x *= smooth_scale + x_max = tl.max(tl.abs(x), axis=1) + scale = tl.maximum(x_max / 448.0, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + if EVEN: + tl.store(qs_ptr + pid * W * T + i * W + tl.arange(0, W), scale, ) + else: + tl.store(qs_ptr + pid * W * T + i * W + tl.arange(0, W), scale, + mask=indices < M) + + x /= scale[:, None] + xq = x.to(q_ptr.dtype.element_ty) + if EVEN: + tl.store(q_ptr + pid * W * T * N + i * N * W + tl.arange(0, W)[:, + None] * N + tl.arange( + 0, + N)[ + None, :], + xq) + else: + tl.store(q_ptr + pid * W * T * N + i * N * W + tl.arange(0, W)[:, + None] * N + tl.arange( + 0, + N)[ + None, :], + xq, + mask=indices[:, None] < M) + if CALIBRATE: + output_maxs = tl.max(output_maxs, 0) + tl.store(max_ptr + pid * N + tl.arange(0, N), output_maxs) + + +@triton.jit +def blockwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, max_ptr, + M, + N, + H: tl.constexpr, + W: tl.constexpr, + EVEN: tl.constexpr, + REVERSE: tl.constexpr, + ROUND: tl.constexpr, + CALIBRATE: tl.constexpr): + pid = tl.program_id(axis=0) + # row-wise read, row-wise write + offs = pid * W * N + tl.arange(0, W)[:, None] * N + tl.arange(0, H)[None, :] + soffs = tl.arange(0, H) + x_max = tl.zeros((W,), dtype=tl.float32) + n = tl.cdiv(N, H) + for i in range(n): + smooth_scale = tl.load(ss_ptr + soffs) + if EVEN: + x = 
tl.load(x_ptr + offs).to(tl.float32) + else: + x = tl.load(x_ptr + offs, + mask=pid * W + tl.arange(0, W)[:, None] < M).to( + tl.float32) + if CALIBRATE: + output_maxs = tl.max(x.abs(), 0) + tl.store(max_ptr + pid * N + i * H + tl.arange(0, H), output_maxs) + if REVERSE: + x = x * smooth_scale + else: + x = x / smooth_scale + x_max = tl.maximum(tl.max(tl.abs(x), axis=1), x_max) + offs += H + soffs += H + + scale = tl.maximum(x_max / 448, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + + tl.store(qs_ptr + pid * W + tl.arange(0, W), scale, + mask=pid * W + tl.arange(0, W) < M) + + s = (1.0 / scale)[:, None] + + offs = pid * W * N + tl.arange(0, W)[:, None] * N + tl.arange(0, H)[None, :] + soffs = tl.arange(0, H) + for i in range(n): + smooth_scale = tl.load(ss_ptr + soffs) + if EVEN: + x = tl.load(x_ptr + offs) + else: + x = tl.load(x_ptr + offs, + mask=pid * W + tl.arange(0, W)[:, None] < M) + + if REVERSE: + xq = (x.to(tl.float32) * smooth_scale * s).to( + q_ptr.dtype.element_ty) + else: + xq = (x.to(tl.float32) / smooth_scale * s).to( + q_ptr.dtype.element_ty) + + if EVEN: + tl.store(q_ptr + offs, xq) + else: + # tl.store(q_ptr+offs, xq, mask=(i*H+tl.arange(0, H)[None,:] 8192 else 4 + EVEN = M % W == 0 + T = triton.cdiv(M, W) + if calibrate: + x_maxs = torch.empty((T, N), device=device, dtype=torch.bfloat16) + else: + x_maxs = None + grid = (T,) + blockwise_reused_smooth_quant_kernel[grid]( + x, + x_q, + smooth_scale, + x_scale, + x_maxs, + M, + N, + H, + W, + EVEN, + reverse, + round_scale, + calibrate, + num_stages=3, + num_warps=4 + ) + if calibrate: + x_maxs = x_maxs.amax(0).float() + + return x_q, x_scale, x_maxs + + +@triton.jit +def subrow_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, + subrow_scales_ptr, + tail_ri, + tail_si, + head_ri, + head_ei, + size, + N, + W: tl.constexpr, + TAIL: tl.constexpr, + HEAD: tl.constexpr, + REVERSE: tl.constexpr, + ROUND: tl.constexpr): + if TAIL: + # scale is saved as max/448 + scale = tl.maximum(tl.load(subrow_scales_ptr), 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + # scale only stores in subrow with leading values + + T = tl.cdiv(N - tail_si, W) + for i in range(T): + mask = tail_si + i * W + tl.arange(0, W) < N + if REVERSE: + smooth_scale = tl.load( + ss_ptr + tail_si + i * W + tl.arange(0, W), mask=mask) + else: + smooth_scale = tl.load( + ss_ptr + tail_si + i * W + tl.arange(0, W), other=1e30, + mask=mask) + smooth_scale = 1.0 / smooth_scale + x = tl.load(x_ptr + i * W + tl.arange(0, W), mask=mask).to( + tl.float32) + x *= smooth_scale + x /= scale + xq = tl.minimum(tl.maximum(x, -448), 448) + tl.store(q_ptr + tail_ri * N + tail_si + i * W + tl.arange(0, W), + xq.to(q_ptr.dtype.element_ty), mask=mask) + + if HEAD: + # scale is saved as max/448 + scale = tl.maximum(tl.load(subrow_scales_ptr + 1), 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + tl.store(qs_ptr + head_ri, scale) + + T = tl.cdiv(head_ei, W) + for i in range(T): + mask = i * W + tl.arange(0, W) < head_ei + if REVERSE: + smooth_scale = tl.load(ss_ptr + i * W + tl.arange(0, W), + mask=mask) + else: + smooth_scale = tl.load(ss_ptr + i * W + tl.arange(0, W), + other=1e30, mask=mask) + smooth_scale = 1.0 / smooth_scale + x = tl.load(x_ptr + size - head_ei + i * W + tl.arange(0, W), + mask=mask).to(tl.float32) + x *= smooth_scale + x /= scale + xq = tl.minimum(tl.maximum(x, -448), 448) + tl.store(q_ptr + head_ri * N + i * W + tl.arange(0, W), + xq.to(q_ptr.dtype.element_ty), mask=mask) + + +def 
triton_subrow_reused_smooth_quant(x, smooth_scale, x_q, x_scale, + subrow_scales, offset, size, + reverse=False, round_scale=False): + M, N = x_q.shape + W = 128 + if offset % N == 0: + tail_ri = 0 + tail_si = 0 + TAIL = False + else: + tail_ri = offset // N + tail_si = offset % N + TAIL = True + + if (offset + size) % N == 0: + head_ri = 0 + head_ei = 0 # head_size = head_ei + HEAD = False + else: + head_ri = (offset + size) // N + head_ei = (offset + size) % N + HEAD = True + + grid = (1,) + subrow_reused_smooth_quant_kernel[grid]( + x, + x_q, + smooth_scale, + x_scale, + subrow_scales, + tail_ri, + tail_si, + head_ri, + head_ei, + size, + N, + W, + TAIL, + HEAD, + reverse, + round_scale, + num_stages=3, + num_warps=1 + ) + + +@triton.jit +def depracated_tokenwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, + qs_ptr, M, W, + N: tl.constexpr, + REVERSE: tl.constexpr, + ROUND: tl.constexpr): + pid = tl.program_id(axis=0) + # row-wise read, row-wise write + smooth_scale = tl.load(ss_ptr + tl.arange(0, N)) + if not REVERSE: + smooth_scale = 1.0 / smooth_scale + + for i in range(W): + x = tl.load(x_ptr + pid * W * N + i * N + tl.arange(0, N), + mask=pid * W + i < M).to(tl.float32) + x *= smooth_scale + x_max = tl.maximum(tl.max(tl.abs(x)), 1e-30) + + scale = x_max / 448.0 + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + tl.store(qs_ptr + pid * W + i, scale, mask=pid * W + i < M) + + x /= scale + xq = x.to(q_ptr.dtype.element_ty) + tl.store(q_ptr + pid * W * N + i * N + tl.arange(0, N), xq, + mask=pid * W + i < M) + + +def triton_depracated_tokenwise_reused_smooth_quant(x, smooth_scale, x_q=None, + x_scale=None, reverse=False, + round_scale=False): + # row-wise read, row-wise write + M, N = x.shape + device = x.device + if x_q is None: + x_q = torch.empty((M, N), device=device, dtype=torch.float8_e4m3fn) + if x_scale is None: + x_scale = torch.empty((M,), device=device, dtype=torch.float32) + sm = torch.cuda.get_device_properties(device).multi_processor_count + W = triton.cdiv(M, sm) + grid = (sm,) + depracated_tokenwise_reused_smooth_quant_kernel[grid]( + x, + x_q, + smooth_scale, + x_scale, + M, + W, + N, + reverse, + round_scale, + num_stages=3, + num_warps=8 + ) + return x_q, x_scale + + +@triton.jit +def batch_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, xm_ptr, count_ptr, + accum_ptr, T, N: tl.constexpr, + REVERSE: tl.constexpr, ROUND: tl.constexpr, + CALIBRATE: tl.constexpr): + pid = tl.program_id(axis=0) + + i_expert = pid // T + i_batch = pid % T + + # row-wise read, row-wise write + smooth_scale = tl.load(ss_ptr + i_expert * N + tl.arange(0, N)) + if not REVERSE: + smooth_scale = 1.0 / smooth_scale + + if CALIBRATE: + x_maxs = tl.zeros((N,), dtype=tl.float32) + + count = tl.load(count_ptr + i_expert) + ei = tl.load(accum_ptr + i_expert) + si = ei - count + + n = tl.cdiv(count, T) # samples for each task + for i in range(i_batch * n, min((i_batch + 1) * n, count)): + x = tl.load(x_ptr + si * N + i * N + tl.arange(0, N)).to(tl.float32) + if CALIBRATE: + x_maxs = tl.maximum(x_maxs, x.abs()) + x *= smooth_scale + scale = tl.maximum(tl.max(tl.abs(x)) / 448.0, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + + tl.store(qs_ptr + si + i, scale) + + s = 1.0 / scale + x *= s + xq = x.to(q_ptr.dtype.element_ty) + tl.store(q_ptr + si * N + i * N + tl.arange(0, N), xq) + + if CALIBRATE: + tl.store(xm_ptr + pid * N + tl.arange(0, N), x_maxs) + + +""" +select and smooth and quant +x: [bs, dim] +smooth_scales: [n_experts, dim] +token_count_per_expert: [n_experts] 
+x_q: [bs, dim] +x_scale: [bs] +""" + + +def triton_batch_smooth_quant(x, smooth_scales, token_count_per_expert, + x_q=None, x_scale=None, x_maxs=None, + reverse=False, round_scale=False, + calibrate=False): + # row-wise read, row-wise write + + M, N = x.shape + device = x.device + n_expert = token_count_per_expert.shape[0] + assert 128 % n_expert == 0 + if x_q is None: + x_q = torch.empty((M, N), device=device, dtype=torch.float8_e4m3fn) + if x_scale is None: + x_scale = torch.empty((M,), device=device, dtype=torch.float32) + accum_token_count = torch.cumsum(token_count_per_expert, 0) + T = 128 // n_expert + if calibrate and x_maxs is None: + x_maxs = torch.empty((128, N), device=device, dtype=torch.float32) + + grid = (128,) + batch_smooth_quant_kernel[grid]( + x, + x_q, + smooth_scales, + x_scale, + x_maxs, + token_count_per_expert, + accum_token_count, + T, N, + reverse, + round_scale, + calibrate, + num_stages=3, + num_warps=8 + ) + if calibrate: + x_maxs = x_maxs.view(n_expert, T, N).amax(1) + return x_q, x_scale, x_maxs + + +@triton.jit +def batch_pad_transpose_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, + count_ptr, + accum_ptr, + N, + H: tl.constexpr, + W: tl.constexpr, + E: tl.constexpr, + REVERSE: tl.constexpr, + ROUND: tl.constexpr): + eid = tl.program_id(axis=0) + bid = tl.program_id(axis=1) + + count = tl.load(count_ptr + eid) + ei = tl.load(accum_ptr + eid) + si = ei - count + round_count = tl.cdiv(count, 32) * 32 + + counts = tl.load(count_ptr + tl.arange(0, E)) + n_blocks = tl.cdiv(counts, 128) + bias = tl.sum(tl.where(tl.arange(0, E) < eid, n_blocks, 0)) + + n = tl.cdiv(count, H) + maxs = tl.zeros((H, W), dtype=tl.float32) + for i in range(n): + # col-wise read, row-wise write + indices = i * H + tl.arange(0, H) + smooth_scale = tl.load(ss_ptr + indices, mask=indices < count) + if not REVERSE: + smooth_scale = 1.0 / smooth_scale + + x = tl.load(x_ptr + si * N + i * H * N + bid * W + tl.arange(0, H)[:, + None] + tl.arange(0, + W)[ + None, :], + mask=indices[:, None] < count).to(tl.float32) + x *= smooth_scale[:, None] + maxs = tl.maximum(maxs, tl.abs(x)) + + maxs = tl.max(maxs, 0) + scale = tl.maximum(tl.max(maxs, 0) / 448.0, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + tl.store(qs_ptr + eid * N + bid * W + tl.arange(0, W), scale) + s = 1.0 / scale + + for i in range(n): + # col-wise read, row-wise write + indices = i * H + tl.arange(0, H) + smooth_scale = tl.load(ss_ptr + indices, mask=indices < count) + if not REVERSE: + smooth_scale = 1.0 / smooth_scale + + x = tl.load(x_ptr + si * N + i * H * N + bid * W + tl.arange(0, H)[:, + None] + tl.arange(0, + W)[ + None, :], + mask=indices[:, None] < count).to(tl.float32) + x *= smooth_scale[:, None] + x *= s + xq = tl.trans(x.to(q_ptr.dtype.element_ty)) + tl.store( + q_ptr + bias * N + bid * W * round_count + i * H + tl.arange(0, W)[ + :, + None] + tl.arange( + 0, H)[None, :], xq, mask=indices[None, :] < round_count) + + +""" +used in silu backward +pad to multiple of 32 and transpose and smooth quant +x: [sum(token_per_expert), dim] +smooth_scales: [sum(token_per_expert)] +token_count_per_expert: [n_experts] +splits: list of token_count_per_expert +x_q: [sum(roundup(token_per_expert)) * dim] +x_scale: [n_experts, dim] +""" + + +def triton_batch_pad_transpose_smooth_quant(x, + smooth_scales, + token_count_per_expert, + splits, + x_q=None, x_scale=None, x_maxs=None, + reverse=False, round_scale=False): + # col-wise read, row-wise write + + M, N = x.shape + device = x.device + n_expert = 
token_count_per_expert.shape[0] + round_splits = [(x + 31) // 32 * 32 for x in splits] + round_size = sum(round_splits) + if x_q is None: + x_q = torch.empty((round_size, N), device=device, + dtype=torch.float8_e4m3fn) + if x_scale is None: + x_scale = torch.empty((n_expert, N), device=device, dtype=torch.float32) + accum_token_count = torch.cumsum(token_count_per_expert, 0) + H = 128 + W = 32 + grid = (n_expert, N // W) + batch_pad_transpose_smooth_quant_kernel[grid]( + x, + x_q, + smooth_scales, + x_scale, + token_count_per_expert, + accum_token_count, + N, + H, + W, + n_expert, + reverse, + round_scale, + num_stages=3, + num_warps=8 + ) + return x_q, x_scale + + +@triton.jit +def reused_transpose_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, M, N, P, + H: tl.constexpr, W: tl.constexpr, + EVEN: tl.constexpr, + REVERSE: tl.constexpr, + ROUND: tl.constexpr): + pid = tl.program_id(axis=0) + # col-wise read, row-wise write + offs = pid * W + tl.arange(0, H)[:, None] * N + tl.arange(0, W)[None, :] + soffs = tl.arange(0, H) + x_max = tl.zeros((W,), dtype=tl.float32) + m = tl.cdiv(P, H) + for i in range(m): + if EVEN: + x = tl.load(x_ptr + offs) + smooth_scale = tl.load(ss_ptr + soffs)[:, None] + else: + x = tl.load(x_ptr + offs, + mask=(i * H + tl.arange(0, H)[:, None] < M) & ( + pid * W + tl.arange(0, W)[None, :] < N)) + other = 0.0 if REVERSE else 1e30 + smooth_scale = tl.load(ss_ptr + soffs, mask=soffs < M, other=other)[ + :, None] + if REVERSE: + x = x * smooth_scale + else: + x = x / smooth_scale + x_max = tl.maximum(tl.max(tl.abs(x), axis=0), x_max) + offs += H * N + soffs += H + + scale = tl.maximum(x_max / 448.0, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + + if EVEN: + tl.store(qs_ptr + pid * W + tl.arange(0, W), scale) + else: + tl.store(qs_ptr + pid * W + tl.arange(0, W), scale, + mask=pid * W + tl.arange(0, W) < N) + + s = (1.0 / scale)[None, :] + offs = pid * W + tl.arange(0, H)[:, None] * N + tl.arange(0, W)[None, :] + soffs = tl.arange(0, H) + toffs = pid * W * P + tl.arange(0, W)[:, None] * P + tl.arange(0, H)[None, + :] + for i in range(m): + if EVEN: + x = tl.load(x_ptr + offs).to(tl.float32) + smooth_scale = tl.load(ss_ptr + soffs)[:, None] + else: + x = tl.load(x_ptr + offs, + mask=(i * H + tl.arange(0, H)[:, None] < M)).to( + tl.float32) + other = 0.0 if REVERSE else 1e30 + smooth_scale = tl.load(ss_ptr + soffs, mask=soffs < M, other=other)[ + :, None] + + if REVERSE: + x = (x * smooth_scale * s).to(q_ptr.dtype.element_ty) + else: + x = (x / smooth_scale * s).to(q_ptr.dtype.element_ty) + if EVEN: + tl.store(q_ptr + toffs, tl.trans(x)) + else: + # mask with P instead of M + tl.store(q_ptr + toffs, tl.trans(x), + mask=(i * H + tl.arange(0, H)[None, :] < P)) + offs += H * N + toffs += H + soffs += H + + +def triton_reused_transpose_smooth_quant(x, + smooth_scale, + reverse=False, + pad=False, + round_scale=False): + # col-wise read, row-wise write + # M should be padded if M % 32 != 0 + M, N = x.shape + device = x.device + P = (M + 31) // 32 * 32 if pad else M + x_q = torch.empty((N, P), device=device, dtype=torch.float8_e4m3fn) + x_scale = torch.empty((N,), device=device, dtype=torch.float32) + H = 1024 + W = 16 # if N >= 4096 else 16 + assert N % W == 0 + EVEN = P % H == 0 and M == P + + grid = (triton.cdiv(N, W),) + reused_transpose_smooth_quant_kernel[grid]( + x, + x_q, + smooth_scale, + x_scale, + M, + N, + P, + H, + W, + EVEN, + reverse, + round_scale, + num_stages=3, + num_warps=4 if N >= 8192 else 4 + ) + return x_q, x_scale + + +@triton.jit 
+def reused_transpose_rescale_smooth_quant_kernel(x_ptr, q_ptr, + org_smooth_scale_ptr, + org_quant_scale_ptr, + transpose_smooth_scale_ptr, + transpose_quant_scale_ptr, M, + N, P, H: tl.constexpr, + W: tl.constexpr, + EVEN: tl.constexpr, + ROUND: tl.constexpr): + pid = tl.program_id(axis=0) + # col-wise read, row-wise write + offs = pid * W + tl.arange(0, H)[:, None] * N + tl.arange(0, W)[None, :] + soffs = tl.arange(0, H) + x_max = tl.zeros((W,), dtype=tl.float32) + org_smooth_scale = tl.load( + org_smooth_scale_ptr + pid * W + tl.arange(0, W))[None, :] + + m = tl.cdiv(P, H) + for i in range(m): + if EVEN: + x = tl.load(x_ptr + offs).to(tl.float32) + org_quant_scale = tl.load(org_quant_scale_ptr + soffs)[:, None] + transpose_smooth_scale = tl.load( + transpose_smooth_scale_ptr + soffs)[:, None] + else: + x = tl.load(x_ptr + offs, + mask=(i * H + tl.arange(0, H)[:, None] < M)).to( + tl.float32) + org_quant_scale = tl.load(org_quant_scale_ptr + soffs, + mask=soffs < M, other=0.0)[:, None] + transpose_smooth_scale = tl.load(transpose_smooth_scale_ptr + soffs, + mask=soffs < M, other=0.0)[:, None] + + x = x / org_smooth_scale * (org_quant_scale * transpose_smooth_scale) + x_max = tl.maximum(tl.max(tl.abs(x), axis=0), x_max) + offs += H * N + soffs += H + + scale = tl.maximum(x_max / 448.0, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + + tl.store(transpose_quant_scale_ptr + pid * W + tl.arange(0, W), scale) + + s = (1.0 / scale)[None, :] + + offs = pid * W + tl.arange(0, H)[:, None] * N + tl.arange(0, W)[None, :] + soffs = tl.arange(0, H) + toffs = pid * W * P + tl.arange(0, W)[:, None] * P + tl.arange(0, H)[None, + :] + for i in range(m): + + if EVEN: + x = tl.load(x_ptr + offs).to(tl.float32) + org_quant_scale = tl.load(org_quant_scale_ptr + soffs)[:, None] + transpose_smooth_scale = tl.load( + transpose_smooth_scale_ptr + soffs)[:, None] + else: + x = tl.load(x_ptr + offs, + mask=(i * H + tl.arange(0, H)[:, None] < M) & ( + pid * W + tl.arange(0, W)[None, :] < N)).to( + tl.float32) + org_quant_scale = tl.load(org_quant_scale_ptr + soffs, + mask=soffs < M, other=0.0)[:, None] + transpose_smooth_scale = tl.load(transpose_smooth_scale_ptr + soffs, + mask=soffs < M, other=0.0)[:, None] + + x = x * s / org_smooth_scale * ( + org_quant_scale * transpose_smooth_scale) + x = tl.trans(x.to(q_ptr.dtype.element_ty)) + if EVEN: + tl.store(q_ptr + toffs, x) + else: + tl.store(q_ptr + toffs, x, + mask=(i * H + tl.arange(0, H)[None, :] < P)) + offs += H * N + toffs += H + soffs += H + + +""" +x_q is colwise smooth and rowwise quant +org_smooth_scale and transpose_smooth_scale is reversed +smooth scale and quant scale should be power of 2 +step: dequant x_q -> apply smooth scale -> quant -> transpose -> pad +implement: x_q/org_smooth_scale*(org_quant_scale*smooth_scale) -> colwise quant and transpose +""" + + +def triton_reused_transpose_rescale_smooth_quant(x_q, org_smooth_scale, + org_quant_scale, + transpose_smooth_scale, + reverse=True, + pad=False, + round_scale=False): + # col-wise read, row-wise write + + assert reverse + M, N = x_q.shape + device = x_q.device + P = round_up(M, b=32) if pad else M + xt_q = torch.empty((N, P), device=device, dtype=torch.float8_e4m3fn) + x_scale = torch.empty((N,), device=device, dtype=torch.float32) + H = 256 + W = 16 + assert N % W == 0 + EVEN = P == M and M % H == 0 + + grid = (triton.cdiv(N, W),) + reused_transpose_rescale_smooth_quant_kernel[grid]( + x_q, + xt_q, + org_smooth_scale, + org_quant_scale, + transpose_smooth_scale, + x_scale, + M, 
N, P, + H, W, + EVEN, + round_scale, + num_stages=4, + num_warps=8 + ) + + return xt_q, x_scale + + +def triton_reused_smooth_quant_nt(x, w, smooth_scale): + x_q, x_scale, x_maxs = triton_reused_smooth_quant(x, 1 / smooth_scale) + w_q, w_scale, x_maxs = triton_reused_smooth_quant(w, smooth_scale) + return x_q, x_scale, w_q, w_scale + + +def triton_reused_smooth_quant_nn(y, w, smooth_scale): + y_q, y_scale, x_maxs = triton_reused_smooth_quant(y, smooth_scale) + w_q, w_scale = triton_reused_transpose_smooth_quant(w, 1 / smooth_scale) + return y_q, y_scale, w_q, w_scale + + +def triton_reused_smooth_quant_tn(y, x, smooth_scale): + y_q, y_scale = triton_reused_transpose_smooth_quant(y, smooth_scale) + x_q, x_scale = triton_reused_transpose_smooth_quant(x, 1 / smooth_scale) + return y_q, y_scale, x_q, x_scale + + +def reused_smooth_quant_forward(x, w, smooth_scale): + x_q, x_s, w_q, w_s = triton_reused_smooth_quant_nt(x, w, smooth_scale) + output = torch._scaled_mm(x_q, + w_q.t(), + scale_a=x_s.view(-1, 1), + scale_b=w_s.view(1, -1), + out_dtype=x.dtype, + use_fast_accum=True) + return output + + +def reused_smooth_quant_backward(y, w, smooth_scale): + y_q, y_s, w_q, w_s = triton_reused_smooth_quant_nn(y, w, smooth_scale) + output = torch._scaled_mm(y_q, + w_q.t(), + scale_a=y_s.view(-1, 1), + scale_b=w_s.view(1, -1), + out_dtype=y.dtype, + use_fast_accum=True) + return output + + +def reused_smooth_quant_update(y, x, smooth_scale): + y_q, y_s, x_q, x_s = triton_reused_smooth_quant_tn(y, x, smooth_scale) + output = torch._scaled_mm(y_q, + x_q.t(), + scale_a=y_s.view(-1, 1), + scale_b=x_s.view(1, -1), + out_dtype=y.dtype, + use_fast_accum=True) + return output + + +def reused_smooth_quant_f_and_b(x, w, y, smooth_scale): + reused_smooth_quant_forward(x, w, smooth_scale) + reused_smooth_quant_backward(y, w, smooth_scale) + reused_smooth_quant_update(y, x, smooth_scale) diff --git a/linghe/quant/smooth/seperate_smooth.py b/linghe/quant/smooth/seperate_smooth.py new file mode 100644 index 0000000..802021e --- /dev/null +++ b/linghe/quant/smooth/seperate_smooth.py @@ -0,0 +1,133 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. 
+""" + +import torch + +from linghe.quant.smooth.reused_smooth import triton_reused_smooth_quant, \ + triton_reused_transpose_smooth_quant, triton_subrow_reused_smooth_quant +from linghe.utils.transpose import triton_transpose_and_pad + +""" +megatron fp8 training steps: +step 0: init w smooth scale w_smooth +step 1: smooth and quant w after w is updated by optimizer +step 2: in forward step, columnwise smooth x and rowwise quant x, calc y=x@w; + meanwhile, record the columnwise max of x, it is used to update w_smooth +step 3: in dgrad step, columnwise smooth y and rowwise quant y, transpose x, calc dx=y@wT +step 4: in wgrad step, dequant then smooth an then quant y_q to get yt_q, calc dw=yT@x + +alternative (it's not suitable for fp8 combine): +step 4: in wgrad step, rowwise smooth y and columnwise quant y and transpose to get yt_q, calc dw=yT@x + +""" + +""" +divide x by smooth_scale and row-wise quantization +smooth scale is updated by square root of x's column-wise maxs, and set in weight's x_maxs attr + +transpose: transpose quantized x for wgrad +pad: # pad M to be multiplier of 32, including quant scales and transposed x + +""" + + +# y = x @ w +# dx = y @ wT +# dwT = yT @ x +def triton_smooth_quant_input(x, smooth_scale, x_q=None, x_scale=None, xt_q=None, + transpose=True, pad=True, round_scale=False): + x_q, x_scale, x_maxs = triton_reused_smooth_quant(x, smooth_scale, x_q=x_q, + x_scale=x_scale, reverse=False, + round_scale=round_scale) + + if transpose: + xt_q = triton_transpose_and_pad(x_q, out=xt_q, pad=pad) + else: + xt_q = None + xt_scale = smooth_scale + + return x_q, xt_q, x_scale, xt_scale + + +# y = x @ w +# dx = y @ wT +# dwT = yT @ x +def triton_smooth_quant_grad(y, smooth_scale, transpose_smooth_scale, reverse=True, + transpose=True, pad=True, round_scale=False): + assert reverse, "args `smooth_scale` and/or `transpose_smooth_scale` must be in reciprocal format in triton_smooth_quant_grad" + y_q, y_scale, _ = triton_reused_smooth_quant(y, smooth_scale, reverse=True, + round_scale=round_scale) + if transpose: + yt_q, yt_scale = triton_reused_transpose_smooth_quant(y, + transpose_smooth_scale, + reverse=True, + pad=pad, + round_scale=round_scale) + else: + yt_q, yt_scale = None, None + + return y_q, yt_q, y_scale, yt_scale + +""" +we stat the max/mean of rowwise maximums +gate: 1.15/0.14 +up: 0.34/0.14 +down 1.12/0.15 +large value may cause underflow in w, but leading to overflow in dy +however, underflow in w only influences a row of w, but will influences +all the rows in dy, therefore we use a very small value to avoid overflow in dy + +furthermore, we clip the values of the subrow within the master weight, to avoid +inconsistant values between training and evaluation. 
+ +""" +def triton_smooth_quant_w(w, smooth_scale, w_q, quant_scale, subrow_scales, offset=0, + round_scale=False): + assert w.ndim == 1 + assert w_q.size(1) == smooth_scale.size(0) + + size = w.numel() + M, N = w_q.shape + + if size == M * N: + triton_reused_smooth_quant(w.view(M, N), smooth_scale, x_q=w_q, + x_scale=quant_scale, + round_scale=round_scale) + elif offset % N == 0 and size % N == 0: + n_row = size // N + row_id = offset // N + w_q_slice = w_q[row_id:row_id + n_row] + quant_scale_slice = quant_scale[row_id:row_id + n_row] + triton_reused_smooth_quant(w.view(n_row,N), smooth_scale, x_q=w_q_slice, + x_scale=quant_scale_slice, + round_scale=round_scale) + else: + row_si = (offset - 1)//N + 1 + row_ei = (offset + size) // N + col_si = offset % N + col_ei = (offset + size ) % N + n_row = row_ei - row_si + mw_offset = 0 if col_si == 0 else N - col_si + w_q_slice = w_q[row_si:row_ei] + quant_scale_slice = quant_scale[row_si:row_ei] + w_slice = w[mw_offset:mw_offset+n_row*N].view(n_row,N) + triton_reused_smooth_quant(w_slice, + smooth_scale, + x_q=w_q_slice, + x_scale=quant_scale_slice, + round_scale=round_scale) + + # subrow scale is writed by the row with leading master weights + if col_si > 0 or col_ei > 0: + triton_subrow_reused_smooth_quant(w, + smooth_scale, + w_q, + quant_scale, + subrow_scales, + offset, + size, + reverse=False, + round_scale=round_scale) + diff --git a/linghe/utils/add.py b/linghe/utils/add.py index 18201a6..868e1f0 100644 --- a/linghe/utils/add.py +++ b/linghe/utils/add.py @@ -51,9 +51,10 @@ def triton_inplace_add(x: torch.Tensor, y: torch.Tensor, accum : bool = True): Args: x: Tensor y: Tensor - accum: whether accum y to x + accum: x += y if accum=True else x.copy_(y) - Returns: x += y if accum=True else x.copy_(y) + Returns: + updated x """ N = x.shape[-1] M = x.numel() // N diff --git a/linghe/utils/dot.py b/linghe/utils/dot.py index 6ac1ead..f57acab 100644 --- a/linghe/utils/dot.py +++ b/linghe/utils/dot.py @@ -18,14 +18,24 @@ def dot_kernel(x_ptr, y_ptr, sum_ptr, M, N, H: tl.constexpr, W: tl.constexpr): sums = tl.zeros((W,), dtype=tl.float32) for i in range(n): x = tl.load(x_ptr + offs).to(tl.float32) - q = tl.load(y_ptr + offs).to(tl.float32) - sums += tl.sum(x * q, axis=1) + y = tl.load(y_ptr + offs).to(tl.float32) + sums += tl.sum(x * y, axis=1) offs += H tl.store(sum_ptr + pid * W + tl.arange(0, W), sums) def triton_dot(x, y): + """ + vector dot multiply, output = sum(x*y, 1), + it is used to calculate gradient of router weight + Args: + x: + y: + + Returns: + output of sum(x*y, 1) + """ M, N = x.shape H = 128 W = 16 @@ -45,52 +55,3 @@ def triton_dot(x, y): ) return s - -@triton.jit -def mix_precise_dot_kernel(x_ptr, q_ptr, sum_ptr, smooth_scale_ptr, - quant_scale_ptr, M, N, H: tl.constexpr, - W: tl.constexpr): - # rowwise read, rowwise write - pid = tl.program_id(axis=0) - offs = pid * W * N + tl.arange(0, W)[:, None] * N + tl.arange(0, H)[None, :] - soffs = tl.arange(0, H) - quant_scale = tl.load(quant_scale_ptr + pid * W + tl.arange(0, W)) - - n = tl.cdiv(N, H) - sums = tl.zeros((W,), dtype=tl.float32) - for i in range(n): - x = tl.load(x_ptr + offs) - q = tl.load(q_ptr + offs) - smooth_scale = tl.load(smooth_scale_ptr + soffs)[None, :] - q = q.to(tl.float32) * smooth_scale - x = x.to(tl.float32) - sums += tl.sum(x * q, axis=1) * quant_scale - offs += H - soffs += H - - tl.store(sum_ptr + pid * W + tl.arange(0, W), sums) - - -# q should be dequant -def triton_mix_precise_dot(x, q, smooth_scale, quant_scale, reverse=False): - assert reverse - M, 
N = x.shape - device = x.device - s = torch.empty((M,), device=device, dtype=x.dtype) - - H = 128 - W = 16 - num_stages = 5 - num_warps = 8 - - grid = (triton.cdiv(M, W),) - mix_precise_dot_kernel[grid]( - x, q, s, - smooth_scale, - quant_scale, - M, N, - H, W, - num_stages=num_stages, - num_warps=num_warps - ) - return s diff --git a/linghe/utils/gather.py b/linghe/utils/gather.py index dd29c03..abba7e4 100644 --- a/linghe/utils/gather.py +++ b/linghe/utils/gather.py @@ -55,13 +55,20 @@ def make_row_id_map_kernel(map_ptr, count_ptr, output_ptr, M, B, P, offs += b * E -# """ -# make row id map, shape:[n_tokens, n_experts] -# """ + def triton_make_row_id_map( routing_map: torch.Tensor, multiple_of: int = 1 ): + """ + make row id map, values in the tensor are the row indices + Args: + routing_map: a tensor of 0/1 values, 1 indicates routed + multiple_of: padding the tokens of each expert to multiple of this value + + Returns: + row id map with shape [n_tokens, n_experts] + """ n_tokens, n_experts = routing_map.shape T = 128 block_counts = torch.empty((T, n_experts), dtype=torch.int32, @@ -137,20 +144,21 @@ def make_row_id_map_and_indices_kernel(map_ptr, count_ptr, row_map_ptr, offs += b * E -""" -routing map, shape:[n_tokens, n_experts] -num_out_tokens, shape:[sum(round(bs))] - -row id map, shape:[n_tokens, n_experts] -row id indices, shape: [sum(n_tokens_per_experts)] -""" - - def triton_make_row_id_map_and_indices( routing_map: torch.Tensor, num_out_tokens: int, multiple_of: int = 1, ): + """ + similar with triton_make_row_id_map, but output an indices tensor as well + Args: + routing_map: [n_tokens, n_experts] + num_out_tokens: sum(round_up_to(n_tokens, multiple_of)) + multiple_of: padding the tokens of each expert to this value + Returns: + row_in_map: [n_tokens, n_experts] + row_indices: [num_out_tokens] + """ n_tokens, n_experts = routing_map.shape T = 128 block_counts = torch.empty((T, n_experts), dtype=torch.int32, @@ -208,15 +216,17 @@ def index_select_kernel(x_ptr, out_ptr, scale_ptr, scale_out_ptr, index_ptr, M, tl.store(scale_out_ptr + dst_idx, scale, mask=dst_idx < M) -""" -index select for quantized tensor -x: [bs, dim] -x_scale: [bs] -indices: [K] -""" - - def triton_index_select(x, indices, scale=None, out=None, scale_out=None): + """ + index select for quantized tensor + Args: + x: [bs, dim] + indices: [K] + scale: [bs] + Returns: + out: output of selected x + scale_out: scale of selected scale + """ # row-wise read, row-wise write M, N = x.shape E = indices.shape[0] @@ -311,22 +321,6 @@ def fill_padded_token_with_zero_kernel(data_ptr, scale_ptr, probs_ptr, tl.store(probs_ptr + i, 0.0) -""" -gather with mask map -inp: [num_tokens, hidden_size], rowwise_data -scale: [num_tokens, scale_size], rowwise_scale_inv -prob: [num_tokens], router prob -row_id_map: [n_experts, num_tokens] - index >= 0: row index of output tensor - index == -1: ignore - Note: index may not be contiguous -num_out_tokens: output token count, including padding tokens -contiguous: whether indices in row_id_map is contiguous - False means padded -token_per_expert: [num_experts], token count per expert, non-blocking cuda tensor -""" - - def triton_permute_with_mask_map( inp: torch.Tensor, scale: torch.Tensor, @@ -336,6 +330,28 @@ def triton_permute_with_mask_map( contiguous: bool = True, tokens_per_expert: Optional[torch.Tensor] = None ): + """ + gather quantized tensor with row id map + Args: + inp: [num_tokens, hidden_size], rowwise quantized tensor + scale: [num_tokens], quantization scale + probs: router 
prob, used as weight + row_id_map: [n_experts, num_tokens] + index >= 0: row index of output tensor + index == -1: ignore + Note: index may not be contiguous + num_out_tokens: output token count, including padding tokens + contiguous: whether indices in row_id_map is contiguous, + False means padded + tokens_per_expert: [num_experts], token count per expert, + non-blocking cuda tensor + + Returns: + output: permuted quantized tensor + permuted_scale: permuted quantization scale + permuted_probs: permuted router prob + + """ num_tokens, hidden_size = inp.shape num_tokens_, num_experts = row_id_map.shape # not transposed assert num_tokens == num_tokens_ @@ -489,21 +505,6 @@ def batch_smooth_transpose_smooth_permute_kernel(x_ptr, scale_ptr, oss_ptr, toffs += H -""" -used for smooth backward in 0.12 -`x`: dy, may be smooth quantized, it should be gather, optional requantized, padded to multiple of 32 and tranposed -x: [bs, dim] -scale: [bs], optional -org_smooth_scale: [dim], optional -smooth_scales: [n_experts, dim], reversed -token_count_per_expert: [n_experts], tensor of token count per expert -splits: [n_experts], list of token_count_per_expert -indices: [sum(tokens_per_experts)] -x_q: [sum(roundup(tokens_per_experts)) * dim] -x_scale: [sum(roundup(tokens_per_experts))] -""" - - def triton_batch_transpose_smooth_permute_with_indices(x, scale, org_smooth_scale, @@ -511,8 +512,26 @@ def triton_batch_transpose_smooth_permute_with_indices(x, indices, token_count_per_expert, splits, - x_q=None, x_scale=None, + x_q=None, + x_scale=None, round_scale=False): + """ + used for smooth quantization backward in megatron 0.12, + x is gathered, requantized, padded to multiple of 32 and tranposed + Args: + x: dy, [bs, dim], it is smooth quantized + scale: [bs], quantized scale + org_smooth_scale: [dim] + smooth_scales: [n_experts, dim] + indices: [sum(tokens_per_experts)] + token_count_per_expert: [n_experts], tensor of token count per expert + splits: [n_experts], list of token_count_per_expert + round_scale: round quantization scale to power of 2 + + Returns: + x_q: [sum(roundup(tokens_per_experts)) * dim] + x_scale: [sum(roundup(tokens_per_experts))] + """ # row-wise read, row-wise write M, N = x.shape n_expert = len(splits) @@ -596,25 +615,32 @@ def smooth_weighted_permute_with_indices_kernel(grads_ptr, tl.store(q_ptr + si * N + i * N + tl.arange(0, N), xq) -""" -select and smooth and quant, used in 0.11 all2all moe -x: [bs, dim] -smooth_scales: [n_experts, dim] -indices: [n_experts*topk] -x_q: [bs*topk, dim] -x_scale: [bs*topk] -""" - - -def triton_smooth_weighted_permute_with_indices(grads, tokens, +def triton_smooth_weighted_permute_with_indices(grads, + tokens, smooth_scales, token_count_per_expert, - indices, x_q=None, + indices, + x_q=None, x_scale=None, x_sum=None, reverse=False, round_scale=False): - # row-wise read, row-wise write + """ + select and smooth and quant, used in megatron 0.11 all2all moe + Args: + grads: [bs, dim] + tokens: [bs, dim] + smooth_scales: [n_experts, dim] + token_count_per_expert: [n_experts] + indices: [n_experts*topk] + reverse: whether scale is 1/scale + round_scale: whether round scale to power of 2 + + Returns: + x_q: [bs*topk, dim] + x_scale: [bs*topk] + x_sum: [bs*topk] + """ M, N = grads.shape n_expert, n = smooth_scales.shape assert N == n, f'{N=} {n=}' @@ -694,23 +720,31 @@ def smooth_permute_with_indices_kernel(grads_data_ptr, tl.store(q_ptr + i * N + tl.arange(0, N), xq) -""" -select and smooth and quant -grad_data: [bs, dim] -grad_scale: [bs, dim/128] 
-smooth_scales: [n_experts, dim] -indices: [n_experts*topk] -x_q: [bs*topk, dim] -x_scale: [bs*topk] -""" - - -def triton_smooth_permute_with_indices(grad_data, grad_scale, +def triton_smooth_permute_with_indices(grad_data, + grad_scale, smooth_scales, token_count_per_expert, - indices, x_q=None, - x_scale=None, reverse=False, + indices, + x_q=None, + x_scale=None, + reverse=False, round_scale=False): + """ + select and smooth and quant + Args: + grad_data: [bs, dim] + grad_scale: [bs] + smooth_scales: [n_experts, dim] + token_count_per_expert: [n_experts] + indices: [n_experts*topk] + x_q: [bs*topk, dim] + x_scale: [bs*topk] + reverse: + round_scale: + + Returns: + + """ # row-wise read, row-wise write M, N = grad_data.shape n_expert, n = smooth_scales.shape @@ -796,14 +830,6 @@ def smooth_permute_with_mask_map_kernel(grads_data_ptr, quant_data_ptr, mask=mask) -# """ -# gather and optional dequant and smooth quant -# inp: [num_tokens, hidden_size], rowwise_data -# row_id_map: [n_experts, num_tokens], indices -# scale: [num_tokens, hs], rowwise_scale_inv, optional -# num_tokens: [n_experts] -# smooth_scale_ptrs: [n_experts, hidden_size] -# """ def triton_smooth_permute_with_mask_map( inp: torch.Tensor, row_id_map: torch.Tensor, @@ -816,6 +842,24 @@ def triton_smooth_permute_with_mask_map( reverse=True, round_scale=False ): + """ + gather and optional dequant and smooth quant + + Args: + inp: [num_tokens, hidden_size], rowwise quantized tensor + row_id_map: [n_experts, num_tokens], indices + scale: [num_tokens, hs], rowwise_scale_inv, optional + num_tokens: [n_experts] + num_experts: + num_out_tokens: + hidden_size: + smooth_scales: [n_experts, hidden_size] + reverse: + round_scale: + + Returns: + + """ assert row_id_map.shape[1] == num_experts output = torch.empty((num_out_tokens, hidden_size), dtype=torch.float8_e4m3fn, @@ -828,7 +872,6 @@ def triton_smooth_permute_with_mask_map( (num_out_tokens,), dtype=torch.float32, device=inp.device ) - # print(f'{inp.shape=} {row_id_map.shape=} {num_tokens=} {num_out_tokens=}') sm = torch.cuda.get_device_properties(inp.device).multi_processor_count T = triton.cdiv(num_tokens, sm) grid = (num_experts, sm) @@ -847,84 +890,3 @@ def triton_smooth_permute_with_mask_map( round_scale ) return output, permuted_scale - - -@triton.jit -def deprecated_smooth_permute_with_mask_map_kernel(grads_data_ptr, - quant_data_ptr, - mask_map_ptr, - smooth_scale_ptr, - quant_scale_ptr, M, T, - N: tl.constexpr, - REVERSE: tl.constexpr, - ROUND: tl.constexpr): - eid = tl.program_id(axis=0) - bid = tl.program_id(axis=1) - n_experts = tl.num_programs(axis=0) - - # smooth_scale_ptr = tl.load(smooth_scale_ptrs + eid).to(tl.pointer_type(tl.float32)) - smooth_scale = tl.load(smooth_scale_ptr + eid * N + tl.arange(0, N)) - if not REVERSE: - smooth_scale = 1.0 / smooth_scale - for i in range(bid * T, tl.minimum(bid * T + T, M)): - index = tl.load(mask_map_ptr + i * n_experts + eid) - mask = index >= 0 - if index >= 0: - x = tl.load(grads_data_ptr + i * N + tl.arange(0, N), mask=mask).to( - tl.float32) - - x *= smooth_scale - x_max = tl.max(tl.abs(x)) - - scale = tl.maximum(x_max / 448.0, 1e-30) - if ROUND: - scale = tl.exp2(tl.ceil(tl.log2(scale))) - - tl.store(quant_scale_ptr + index, scale, mask=mask) - - x /= scale - xq = x.to(quant_data_ptr.dtype.element_ty) - tl.store(quant_data_ptr + index * N + tl.arange(0, N), xq, - mask=mask) - - -# """ -# gather and smooth quant -# inp: [num_tokens, hidden_size], rowwise_data -# row_id_map: [n_experts, num_tokens], indices -# num_tokens: 
[n_experts] -# smooth_scale_ptrs: [n_experts, hidden_size] -# """ -def triton_deprecated_smooth_permute_with_mask_map( - inp: torch.Tensor, - row_id_map: torch.Tensor, - num_tokens: int, - num_experts: int, - num_out_tokens: int, - hidden_size: int, - smooth_scales: torch.Tensor, - reverse=True, - round_scale=False -): - assert row_id_map.shape[1] == num_experts - output = torch.empty((num_out_tokens, hidden_size), dtype=inp.dtype, - device=inp.device) - permuted_scale = torch.empty( - (num_out_tokens,), dtype=torch.float32, device=inp.device - ) - sm = torch.cuda.get_device_properties(inp.device).multi_processor_count - T = triton.cdiv(num_tokens, sm) - grid = (num_experts, sm) - deprecated_smooth_permute_with_mask_map_kernel[grid]( - inp, - output, - row_id_map, - smooth_scales, - permuted_scale, - num_tokens, - T, - hidden_size, - reverse, - round_scale, - ) - return output, permuted_scale diff --git a/linghe/utils/loss.py b/linghe/utils/loss.py index b8c735c..1eae2fa 100644 --- a/linghe/utils/loss.py +++ b/linghe/utils/loss.py @@ -44,6 +44,15 @@ def softmax_cross_entropy_forward_kernel(logit_ptr, label_ptr, loss_ptr, TODO: support distributed loss with pytorch ongoing nvshmem feature """ def triton_softmax_cross_entropy_forward(logits, labels): + """ + compute token-wise softmax cross entropy loss + Args: + logits: logits tensor + labels: labels tensor + + Returns: + loss of each token + """ M, N = logits.shape device = logits.device loss = torch.empty((M,), device=device, dtype=torch.float32) @@ -93,6 +102,18 @@ def softmax_cross_entropy_backward_kernel(logit_ptr, label_ptr, sum_exp_ptr, def triton_softmax_cross_entropy_backward(logits, labels, sum_exp, max_logit, input_grad, output_grad=None): + """ + backward of softmax cross entropy loss + Args: + logits: logit tensor, [bs, dim] + labels: label tensor, [bs] + sum_exp: [bs] + max_logit: [bs] + input_grad: gradient, [bs, dim] + + Returns: + output_grad: [bs, dim] + """ M, N = logits.shape device = logits.device if output_grad is None: diff --git a/linghe/utils/norm.py b/linghe/utils/norm.py index a627c78..d030abb 100644 --- a/linghe/utils/norm.py +++ b/linghe/utils/norm.py @@ -27,6 +27,15 @@ def rms_norm_forward_kernel(x_ptr, weight_ptr, out_ptr, eps, M, T, def triton_rms_norm_forward(x, weight, eps=1e-6, out=None): + """ + rms norm + Args: + x: input tensor + weight: weight of rms norm + eps: epsilon of rms norm + Returns: + out: output tensor + """ # row-wise read, row-wise write M, N = x.shape W = 8192 // N @@ -394,24 +403,18 @@ def group_norm_gate_forward_kernel(x_ptr, gate_ptr, weight_ptr, out_ptr, eps, bs tl.store(out_ptr + offs, x) -""" -x: [bs, length, n_heads, head_dim], output of attn -gate: [length, bs, dim] -weight: [dim] -output: [length, bs, dim] -""" def triton_group_norm_gate_forward(x: torch.Tensor, gate, weight, eps=1e-6, group_size=4): """ norm and gate in linear attention Args: - x: - gate: - weight: - eps: - group_size: + x: output of attn, [bs, length, n_heads, head_dim] + gate: gate tensor, [length, bs, dim] + weight: rms norm weight, [dim] + eps: epsilon of rms norm + group_size: group size of group rms norm Returns: - + output tensor """ # row-wise read, row-wise write length, bs, dim = gate.shape diff --git a/linghe/utils/rearange.py b/linghe/utils/rearange.py index c1ed887..58e5d6c 100644 --- a/linghe/utils/rearange.py +++ b/linghe/utils/rearange.py @@ -32,15 +32,20 @@ def split_and_cat_kernel(x_ptr, y_ptr, scale_ptr, scale_output_ptr, count_ptr, mask=i * K + tl.arange(0, K) < count) -""" -select and smooth 
and quant -x: [bs, dim] -counts: [n_split] -indices: [n_split] -""" - - def triton_split_and_cat(x, counts, indices, scales=None): + """ + split x to multiple tensors and cat with indices, + it is used for permutation in moe + Args: + x: [bs, dim] + counts: [n_split] + indices: [n_split] + scales: [bs] + + Returns: + y: output tensor + output_scales: output scales if scales is not None + """ M, N = x.shape n_split = counts.shape[0] device = x.device diff --git a/linghe/utils/reduce.py b/linghe/utils/reduce.py index 646d8b0..72b3b3d 100644 --- a/linghe/utils/reduce.py +++ b/linghe/utils/reduce.py @@ -46,8 +46,19 @@ def abs_max_kernel(x_ptr, tl.store(output_ptr + pid * W + tl.arange(0, W), scale) -# update weight smooth scale for next step with x input def triton_abs_max(x, scale=None, smooth_scale=None, min_value=1e-30, axis=0): + """ + columnwise abs max of x, it is used in smooth quantization + Args: + x: input tensor, may be quantized tensor + scale: quantization scale if x is quantized + smooth_scale: optional smooth scale + min_value: output = max(max(abs(x,0)), min_value) + axis: reduce axis + + Returns: + max tensor + """ assert axis == 0 N = x.size(-1) M = x.numel() // N @@ -95,6 +106,14 @@ def batch_count_zero_kernel(input_ptrs, size_ptr, count_ptr, B: tl.constexpr): def triton_batch_count_zero(xs): + """ + count zero in tensor list, it is used to monitor zeros in gradient tensor + Args: + xs: input tensors + + Returns: + a single-value int64 tensor + """ device = xs[0].device sizes = torch.tensor([x.numel() for x in xs], dtype=torch.int64, device=device) @@ -142,6 +161,15 @@ def batch_sum_with_ord_kernel(input_ptrs, size_ptr, count_ptr, B: tl.constexpr, def triton_batch_sum_with_ord(xs, ord=2): + """ + return sum(abs(x)**ord). + Args: + xs: Tensor lists. + ord: the order of tensor. 
+ + Returns: + a single-value fp32 tensor + """ assert ord in (1, 2) device = xs[0].device sizes = torch.tensor([x.numel() for x in xs], dtype=torch.int64, diff --git a/linghe/utils/rope.py b/linghe/utils/rope.py index beb881f..85f5696 100644 --- a/linghe/utils/rope.py +++ b/linghe/utils/rope.py @@ -80,15 +80,18 @@ def half_rope_forward_kernel(q_ptr, k_ptr, freqs_ptr, qo_ptr, ko_ptr, B, 0, h)[:, None] + tl.arange(0, D)[None, :], k) -""" -apply norm to qk, then apply rope to qk, then transpose qkv -q: [len, bs, q_head, head_dim] -k: [len, bs, kv_head, head_dim] -v: [len, bs, kv_head, head_dim] -""" - - def triton_half_rope_forward(q, k, freqs): + """ + apply norm to qk, then apply half rope to qk + Args: + q: query tensor, [len, bs, q_head, head_dim] + k: key tensor, [len, bs, kv_head, head_dim] + freqs: rope freqs + + Returns: + qo: + ko: + """ L, B, H, D = q.shape h = k.shape[2] assert freqs.shape[1] == D // 2 @@ -170,13 +173,6 @@ def half_rope_backward_kernel(q_ptr, k_ptr, freqs_ptr, 0, D)[None, :], k) -""" -apply norm to qk, then apply rope to qk, then transpose qkv -q: [len, bs, q_head, head_dim] -k: [len, bs, kv_head, head_dim] -v: [len, bs, kv_head, head_dim] -""" - def triton_half_rope_backward(q_grad, k_grad, freqs, inplace=False): assert inplace @@ -325,16 +321,30 @@ def qk_norm_and_half_rope_forward_kernel(qkv_ptr, 0, D)[None, :], v1) -""" -use qkv as input, to reduce redundant gradient copy in backward -split qkv, apply norm to qk, apply rope to qk -qkv: [len, bs, kv_head*(q_head//kv_head + 2 ) * head_dim)] -""" - - def triton_qk_norm_and_half_rope_forward(qkv, q_norm_weight, k_norm_weight, freqs, H=32, h=4, eps=1e-6, interleave=True, transpose=False): + + """ + split qkv to q/k/v, apply qk norm and half rope to q/k, + transpose q/k/v to flash-attention layout + Args: + qkv: QKV tensor with size of [S, B, dim], heads are interleaved + q_norm_weight: rms norm weight for query + k_norm_weight: rms norm weight for key + freqs: Freqs tensor based on half dim. + H: Number of attention heads. + h: Number of key/value heads. + eps: epsilon value for L2 normalization. 
+ interleave: whether head of qkv is interleaved, i.e., [qqkvqqkv] + transpose: whether qkv is tranposed, i.e., [S, B, dim], + only support transpose format currently + Returns: + qo: shape [B, S, H, head_dim] + ko: shape [B, S, h, head_dim] + vo: shape [B, S, h, head_dim] + """ + assert transpose L, B, Dim = qkv.shape stride = qkv.stride(1) # qkv may be a slice of a tensor @@ -532,17 +542,28 @@ def qk_norm_and_half_rope_backward_kernel(gq_ptr, gk_ptr, gv_ptr, 0, D)[None, :], v1) -""" -apply norm to qk, then apply rope to qk -q: [len, bs, q_head, head_dim] -k: [len, bs, kv_head, head_dim] -v: [len, bs, kv_head, head_dim] -""" - - def triton_qk_norm_and_half_rope_backward(gq, gk, gv, qkv, q_norm_weight, k_norm_weight, freqs, eps=1e-6, transpose=False, interleave=True): + """ + backward kernel of triton_qk_norm_and_half_rope_forward + Args: + gq: gradient of qo, [len, bs, q_head, head_dim] + gk: gradient of ko, [len, bs, q_head, head_dim] + gv: gradient of vo, [len, bs, q_head, head_dim] + qkv: input qkv + q_norm_weight: + k_norm_weight: + freqs: + eps: + transpose: + interleave: + + Returns: + dqkv: gradient of qkv + dqw: gradient of q_norm_weight + dkw: gradient of k_norm_weight + """ assert transpose B, L, H, D = gq.shape stride = qkv.stride(1) diff --git a/linghe/utils/scatter.py b/linghe/utils/scatter.py index 7549015..bb39945 100644 --- a/linghe/utils/scatter.py +++ b/linghe/utils/scatter.py @@ -3,12 +3,12 @@ Copyright (c) Ant Financial Service Group and its affiliates. """ +from typing import Optional import torch import triton import triton.language as tl -# for megatron 0.11 scatter_add @triton.jit def aligned_scatter_add_kernel(x_ptr, o_ptr, indices_ptr, weights_ptr, M, @@ -30,7 +30,21 @@ def aligned_scatter_add_kernel(x_ptr, o_ptr, indices_ptr, weights_ptr, M, tl.store(o_ptr + pid * N + offs, sums) -def triton_aligned_scatter_add(x, outputs, indices, weights=None): +def triton_aligned_scatter_add(x: torch.Tensor, + outputs: torch.Tensor, + indices: torch.Tensor, + weights: Optional[torch.Tensor] = None): + """ + scatter_add for megatron 0.11 + Args: + x: input tensor + outputs: output tensor + indices: gather indices + weights: rowwise weight, it is router prob in MoE router + + Returns: + output tensor + """ M, N = x.shape m = outputs.size(0) @@ -82,6 +96,16 @@ def fp32_to_bf16_kernel(x_ptr, o_ptr, M, T, N: tl.constexpr): def triton_scatter_add(x, outputs, indices): + """ + naive version of scatter add, very slow + Args: + x: input tensor + outputs: output tensor + indices: indices + + Returns: + outputs + """ M, N = x.shape float_outputs = torch.zeros(outputs.shape, dtype=torch.float32, @@ -149,18 +173,22 @@ def unpermute_with_mask_map_kernel(grads_ptr, probs_ptr, mask_map_ptr, tl.store(output_ptr + pid * N + tl.arange(0, N), sums) -# """ -# gather and smooth quant -# inp: [num_tokens, hidden_size], rowwise_data -# row_id_map: [n_experts, num_tokens], indices -# prob: [num_out_tokens], rowwise_scale_inv -# """ - def triton_unpermute_with_mask_map( grad: torch.Tensor, row_id_map: torch.Tensor, probs: torch.Tensor, ): + """ + scatter add with row id map + Args: + grad: gradient tensor, [num_out_tokens, hidden_size] + row_id_map: row id map, [n_experts, num_tokens] + probs: [num_out_tokens] + + Returns: + output: [num_tokens, hidden_size] + restore_probs: [num_tokens, num_experts] + """ hidden_size = grad.shape[1] num_tokens, num_experts = row_id_map.shape # not transposed diff --git a/linghe/utils/silu.py b/linghe/utils/silu.py index c0d8aa1..0d8c9c4 100644 --- 
a/linghe/utils/silu.py +++ b/linghe/utils/silu.py @@ -64,11 +64,27 @@ def silu_and_block_quant_forward_kernel(x_ptr, tl.trans(xq), mask=indices[None, :] < M) -# used in shared expert -def triton_silu_and_block_quant_forward(x, out=None, scale=None, +def triton_silu_and_block_quant_forward(x, + out=None, + scale=None, round_scale=False, output_mode=2): - # row-wise read, row-wise write + """ + fused silu and blockwise quantization, used in shared expert + Args: + x: input tensor + round_scale: whether round scale to power of 2 + output_mode: one of {0, 1, 2} + 0: only output non-transposed quantized tensor + 1: only output transposed quantized tensor + 2: output both + + Returns: + out: quantized tensor + scale: quantization scale + transpose_output: quantized tensor of transposed output + transpose_scale: quantization scale of transposed output + """ M, N = x.shape n = N // 2 device = x.device @@ -177,7 +193,19 @@ def silu_and_block_quant_backward_kernel(g_ptr, x_ptr, # used in shared expert def triton_silu_and_block_quant_backward(g, x, round_scale=False): - # row-wise read, row-wise write + """ + backward of triton_silu_and_block_quant_forward + Args: + g: gradient + x: input tensor + round_scale: whether round to power of 2 + + Returns: + dx: quantized non-transposed gradient + dx_scale: scales of quantization non-transposed gradient + transpose_dx: quantized transposed gradient + transpose_dx_scale: scales of quantization transposed gradient + """ M, N = x.shape n = N // 2 device = x.device @@ -281,7 +309,7 @@ def batch_weighted_silu_and_block_quant_forward_kernel(x_ptr, weight_ptr, mask=indices[None, :] < count) -# used in routed experts + def triton_batch_weighted_silu_and_block_quant_forward(x, weight, counts, @@ -290,7 +318,25 @@ def triton_batch_weighted_silu_and_block_quant_forward(x, scale=None, round_scale=False, output_mode=2): - # row-wise read, row-wise write + """ + silu and blockwise quantize activation in routed experts + Args: + x: activation tensor in routed experts + weight: router prob tensor + counts: cuda tensor of token count per expert + splits: python int list of token count per expert + round_scale: whether round scale to power of 2 + output_mode: one of {0, 1, 2} + 0: only output non-transposed quantized tensor + 1: only output transposed quantized tensor + 2: output both + + Returns: + out: quantized tensor + scale: quantization scale + transpose_output: quantized tensor of transposed output + transpose_scale: quantization scale of transposed output + """ M, N = x.shape n = N // 2 n_experts = counts.shape[0] @@ -307,7 +353,8 @@ def triton_batch_weighted_silu_and_block_quant_forward(x, dtype=torch.float32) # intra layout and inner layput are not consist, # tensors will be viewed after splitting - scale = torch.empty((M * n // 128,), device=device, dtype=torch.float32) + if scale is None: + scale = torch.empty((M * n // 128,), device=device, dtype=torch.float32) if M == 0: return out, scale, transpose_output, transpose_scale @@ -437,6 +484,22 @@ def triton_batch_weighted_silu_and_block_quant_backward(g, x, weight, counts, splits=None, round_scale=False): + """ + backward of triton_batch_weighted_silu_and_block_quant_forward + Args: + g: gradient + x: input tensor + weight: router prob tensor + counts: cuda tensor of token count per expert + splits: python int list of token count per expert + round_scale: whether round scale to power of 2 + Returns: + dx: quantized non-transposed gradient + dx_scale: scales of quantization non-transposed gradient + dw: gradient 
of weight + transpose_dx: quantized transposed gradient + transpose_dx_scale: scales of quantization transposed gradient + """ # row-wise read, row-wise write M, N = x.shape n = N // 2 diff --git a/linghe/utils/transpose.py b/linghe/utils/transpose.py index 4392fb3..bfc1fcf 100644 --- a/linghe/utils/transpose.py +++ b/linghe/utils/transpose.py @@ -4,7 +4,7 @@ """ import itertools - +from typing import Optional import torch import triton import triton.language as tl @@ -15,54 +15,6 @@ # os.environ["TRITON_PRINT_AUTOTUNING"] = "1" -@triton.jit -def deprecated_transpose_kernel(x_ptr, t_ptr, M, N, H: tl.constexpr, - W: tl.constexpr, EVEN: tl.constexpr): - pid = tl.program_id(axis=0) - # col-wise read, row-wise write - offs = pid * W + tl.arange(0, H)[:, None] * N + tl.arange(0, W)[None, :] - toffs = pid * W * M + tl.arange(0, W)[:, None] * M + tl.arange(0, H)[None, - :] - m = tl.cdiv(M, H) - for i in range(m): - if EVEN: - y = tl.trans(tl.load(x_ptr + offs)) - tl.store(t_ptr + toffs, y) - else: - y = tl.trans(tl.load(x_ptr + offs, mask=(pid * W + tl.arange(0, W)[ - None, :] < N) & ( - i * H + tl.arange( - 0, H)[:, - None] < M))) - tl.store(t_ptr + toffs, y, - mask=(pid * W + tl.arange(0, W)[:, None] < N) & ( - i * H + tl.arange(0, H)[None, :] < M)) - offs += H * N - toffs += H - - -def triton_depracated_transpose(x): - M, N = x.shape - device = x.device - t = torch.empty((N, M), device=device, dtype=x.dtype) - - H = 512 - W = 32 if x.dtype.itemsize == 1 else 16 - EVEN = M % H == 0 and N % W == 0 - num_stages = 3 - num_warps = 8 - - grid = (triton.cdiv(N, W),) - deprecated_transpose_kernel[grid]( - x, t, - M, N, - H, W, - EVEN, - num_stages=num_stages, - num_warps=num_warps - ) - return t - @triton.jit def transpose_kernel(x_ptr, t_ptr, M, N, H: tl.constexpr, W: tl.constexpr, @@ -99,7 +51,19 @@ def transpose_dim_0_1_kernel(x_ptr, t_ptr, B, M, b_stride, m_stride, tl.store(t_ptr + toffs, y) -def triton_transpose(x, dim0=None, dim1=None): +def triton_transpose(x: torch.Tensor, + dim0: Optional[int] = None, + dim1: Optional[int] = None): + """ + transpose x with dim0 and dim1 + Args: + x: input tensor + dim0: dim 0 + dim1: dim 1 + + Returns: + transposed tensor + """ shape = x.shape rank = len(shape) assert rank <= 4 @@ -180,13 +144,19 @@ def transpose_and_pad_kernel(x_ptr, t_ptr, mask=(rid * H + tl.arange(0, H)[None, :] < P)) -""" -pad: M will be padded to mutiplier of 32 -M is usually less than N without deepep -""" - def triton_transpose_and_pad(x, out=None, pad=True): + """ + transpose x and padding the column size to be mutiplier of 32, + it is used for calculated gradient of weight with torch._scaled__mm + Args: + x: input tensor + out: + pad: whether need padding + + Returns: + out: output tensor + """ # fat block, shape:[H,W] M, N = x.shape P = round_up(M, b=32) if pad else M @@ -228,14 +198,14 @@ def batch_transpose_kernel(xs_ptr, xts_ptr, M, N, H: tl.constexpr, toffs += H -""" -x: [M, N]*expert -x_t: [N,M]*expert -""" - - def triton_batch_transpose(xs, xts=None): - # block shape:[H,W] + """ + batch transpose x + Args: + xs: input tensor list, [M, N]*expert + Returns: + xts: output tensor list, [N,M]*expert + """ M, N = xs[0].shape n_experts = len(xs) if xts is None: @@ -286,16 +256,18 @@ def batch_transpose_and_pad_kernel(x_ptr, t_ptr, count_ptr, accum_ptr, toffs += H -""" -pad: M will be padded to mutiplier of 32 -padding should be filled with 0 -M is usually less than N -x: [sum(bs), N] -x_t: [sum(pad(bs)*N)] -""" - - def triton_batch_transpose_and_pad(x, count_list, x_t=None, 
pad=True): + """ + transpose and pad each tensor stored in x + Args: + x: [sum(bs), N] + count_list: a python list of token count + pad: whether pad to mutiplier of 32, + padding value should be filled with 0 if padded + + Returns: + x_t: output tensor + """ assert pad # block shape:[H,W] M, N = x.shape diff --git a/tests/test_group_quant.py b/tests/test_group_quant.py index c8c2af8..132de67 100644 --- a/tests/test_group_quant.py +++ b/tests/test_group_quant.py @@ -5,8 +5,8 @@ import torch -from linghe.quant.block.group import (triton_group_quant, - triton_persist_group_quant) +from linghe.quant.group import (triton_group_quant, + triton_persist_group_quant) from linghe.tools.benchmark import benchmark_func from linghe.tools.util import (output_check, torch_group_quant) From be1ec9f2a13365e0975b0973dea51a58b349569b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=8D=97=E9=9C=84?= Date: Thu, 16 Oct 2025 20:29:25 +0800 Subject: [PATCH 4/7] recover deleted code --- build.sh | 2 +- docs/linghe/facade.html | 2 +- docs/linghe/facade/add.html | 67 +- docs/linghe/facade/fp32_gemm.html | 88 ++ docs/linghe/facade/loss.html | 302 +----- docs/linghe/facade/norm.html | 300 +----- docs/linghe/facade/rope.html | 155 +-- docs/linghe/facade/transpose.html | 147 +-- docs/linghe/gemm.html | 2 + docs/linghe/gemm/blockwise_fp8_gemm.html | 56 ++ docs/linghe/gemm/channelwise_fp8_gemm.html | 93 ++ docs/linghe/gemm/fp32_gemm.html | 166 ++++ docs/linghe/quant.html | 3 + docs/linghe/quant/block.html | 46 +- docs/linghe/quant/channel.html | 108 ++- docs/linghe/quant/group.html | 89 ++ docs/linghe/quant/hadamard.html | 54 ++ .../quant/hadamard/seperate_hadamard.html | 56 ++ docs/linghe/quant/smooth.html | 55 ++ docs/linghe/quant/smooth/reused_smooth.html | 56 ++ docs/linghe/quant/smooth/seperate_smooth.html | 56 ++ docs/linghe/utils/add.html | 8 +- docs/linghe/utils/dot.html | 32 + docs/linghe/utils/gather.html | 286 ++++++ docs/linghe/utils/loss.html | 65 ++ docs/linghe/utils/norm.html | 48 +- docs/linghe/utils/rearange.html | 35 + docs/linghe/utils/reduce.html | 95 ++ docs/linghe/utils/rope.html | 116 +++ docs/linghe/utils/scatter.html | 98 ++ docs/linghe/utils/silu.html | 153 +++ docs/linghe/utils/transpose.html | 128 +++ linghe/facade/add.py | 4 +- linghe/facade/fp32_gemm.py | 3 + linghe/facade/hadamard_quant_linear.py | 163 ++++ linghe/facade/loss.py | 6 + linghe/facade/norm.py | 6 + linghe/facade/rope.py | 3 + linghe/facade/smooth_quant_linear.py | 154 +++ linghe/facade/transpose.py | 3 + .../seperate_hadamard.py => hadamard.py} | 170 +--- .../{smooth/reused_smooth.py => smooth.py} | 221 +++-- linghe/quant/smooth/__init__.py | 0 linghe/quant/smooth/seperate_smooth.py | 133 --- linghe/tools/util.py | 271 +++++- linghe/utils/norm.py | 103 +- linghe/utils/silu.py | 889 ++++++++++++++++++ tests/test_channel_quant.py | 47 + tests/{test_gemm.py => test_fp32_gemm.py} | 0 tests/test_gather.py | 361 ++++++- .../test_hadamard_quant.py | 0 tests/test_norm.py | 102 +- tests/test_silu.py | 397 +++++++- tests/test_smooth_quant.py | 328 +++++++ 54 files changed, 4952 insertions(+), 1379 deletions(-) create mode 100644 docs/linghe/facade/fp32_gemm.html create mode 100644 docs/linghe/gemm/blockwise_fp8_gemm.html create mode 100644 docs/linghe/gemm/channelwise_fp8_gemm.html create mode 100644 docs/linghe/quant/group.html create mode 100644 docs/linghe/quant/hadamard.html create mode 100644 docs/linghe/quant/hadamard/seperate_hadamard.html create mode 100644 docs/linghe/quant/smooth.html create mode 100644 
docs/linghe/quant/smooth/reused_smooth.html create mode 100644 docs/linghe/quant/smooth/seperate_smooth.html create mode 100644 linghe/facade/hadamard_quant_linear.py create mode 100644 linghe/facade/smooth_quant_linear.py rename linghe/quant/{hadamard/seperate_hadamard.py => hadamard.py} (52%) rename linghe/quant/{smooth/reused_smooth.py => smooth.py} (82%) delete mode 100644 linghe/quant/smooth/__init__.py delete mode 100644 linghe/quant/smooth/seperate_smooth.py create mode 100644 tests/test_channel_quant.py rename tests/{test_gemm.py => test_fp32_gemm.py} (100%) rename linghe/quant/hadamard/__init__.py => tests/test_hadamard_quant.py (100%) create mode 100644 tests/test_smooth_quant.py diff --git a/build.sh b/build.sh index e2c9f7d..7a6a6d5 100644 --- a/build.sh +++ b/build.sh @@ -4,4 +4,4 @@ rm -rf linghe.egg-info && python setup.py develop && python setup.py bdist_wheel && -#pdoc --output-dir docs -d google --no-include-undocumented --no-search --no-show-source linghe \ No newline at end of file +# pdoc --output-dir docs -d google --no-include-undocumented --no-search --no-show-source linghe \ No newline at end of file diff --git a/docs/linghe/facade.html b/docs/linghe/facade.html index ffe1d6e..a1dbc13 100644 --- a/docs/linghe/facade.html +++ b/docs/linghe/facade.html @@ -28,7 +28,7 @@

Submodules

  • add
  • -
  • fp32_linear
  • +
  • fp32_gemm
  • loss
  • norm
  • rope
  • diff --git a/docs/linghe/facade/add.html b/docs/linghe/facade/add.html index 10e2ec9..57a6c73 100644 --- a/docs/linghe/facade/add.html +++ b/docs/linghe/facade/add.html @@ -29,16 +29,7 @@

    API Documentation

    @@ -63,37 +54,21 @@

    -
    -
    +
    +
    - class - InplaceAddFunction(torch.autograd.function.Function): - - -
    - - -

    inplace add with mixed precision

    -
    - - -
    -
    -
    @staticmethod
    - def - forward(ctx, x: torch.Tensor, y: torch.Tensor): + inplace_add(x: torch.Tensor, y: torch.Tensor):
    - +

inplace add y to x with mixed precision</p>

    Arguments:
      -
    • ctx: autograd context
    • x: to be updated
    • y: add to x
    @@ -101,41 +76,11 @@
    Arguments:
    Returns:
    -

    output: x

    -
    -
    - - -
    -
    -
    -
    @staticmethod
    - - def - backward(ctx, grad_output): - - -
    - - -

    backward of inplace add

    - -
    Arguments:
    - -
      -
    • ctx: autograd context
    • -
    • grad_output: input gradient
    • -
    - -
    Returns:
    - -
    -

    tuple of gradients

    +

    return updated x tensor

    -
    diff --git a/docs/linghe/facade/fp32_gemm.html b/docs/linghe/facade/fp32_gemm.html new file mode 100644 index 0000000..b4858e4 --- /dev/null +++ b/docs/linghe/facade/fp32_gemm.html @@ -0,0 +1,88 @@ + + + + + + + linghe.facade.fp32_gemm API documentation + + + + + + + + + +
    +
    +

    +linghe.facade.fp32_gemm

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    +
    + + def + fp32_gemm(input: torch.Tensor, weight: torch.Tensor): + + +
    + + +

    gemm with bf16/fp16 inputs and float32 output, +currently used in MoE router gemm.

    + +
    Arguments:
    + +
      +
    • input: bf16/fp16 activation tensor
    • +
    • weight: bf16/fp16 weight tensor
    • +
    + +
    Returns:
    + +
    +

    output of gemm

    +
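A minimal usage sketch for the facade above, assuming the documented module path linghe.facade.fp32_gemm and a CUDA device; the (out_features, in_features) weight orientation is an assumption, not confirmed by the docstring:

    import torch
    from linghe.facade.fp32_gemm import fp32_gemm  # assumed import path

    hidden = torch.randn(8192, 2048, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    router_weight = torch.randn(256, 2048, dtype=torch.bfloat16, device="cuda", requires_grad=True)  # assumed (out, in) layout

    logits = fp32_gemm(hidden, router_weight)   # router logits accumulated and returned in float32
    probs = torch.softmax(logits, dim=-1)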
    +
    + + +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/facade/loss.html b/docs/linghe/facade/loss.html index 6229b49..21cda4a 100644 --- a/docs/linghe/facade/loss.html +++ b/docs/linghe/facade/loss.html @@ -29,28 +29,7 @@

    API Documentation

    @@ -75,287 +54,34 @@

    -
    -
    - - class - SoftmaxCrossEntropyFunction(torch.autograd.function.Function): - - -
    - - -

    Base class to create custom autograd.Function.

    - -

    To create a custom autograd.Function, subclass this class and implement -the forward() and backward() static methods. Then, to use your custom -op in the forward pass, call the class method apply. Do not call -forward() directly.

    - -

    To ensure correctness and best performance, make sure you are calling the -correct methods on ctx and validating your backward function using -torch.autograd.gradcheck().

    - -

    See :ref:extending-autograd for more details on how to use this class.

    - -

    Examples::

    - -
    >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
    ->>> class Exp(Function):
    ->>>     @staticmethod
    ->>>     def forward(ctx, i):
    ->>>         result = i.exp()
    ->>>         ctx.save_for_backward(result)
    ->>>         return result
    ->>>
    ->>>     @staticmethod
    ->>>     def backward(ctx, grad_output):
    ->>>         result, = ctx.saved_tensors
    ->>>         return grad_output * result
    ->>>
    ->>> # Use it by calling the apply method:
    ->>> # xdoctest: +SKIP
    ->>> output = Exp.apply(input)
    -
    -
    - - -
    -
    -
    @staticmethod
    - - def - forward(ctx, logits, labels, inplace=False): - - -
    - - -

    Define the forward of the custom autograd Function.

    - -

    This function is to be overridden by all subclasses. -There are two ways to define forward:

    - -

    Usage 1 (Combined forward and ctx)::

    - -
    @staticmethod
    -def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    - -
      -
    • It must accept a context ctx as the first argument, followed by any -number of arguments (tensors or other types).
    • -
    • See :ref:combining-forward-context for more details
    • -
    - -

    Usage 2 (Separate forward and ctx)::

    - -
    @staticmethod
    -def forward(*args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    -@staticmethod
    -def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    -    pass
    -
    - -
      -
    • The forward no longer accepts a ctx argument.
    • -
    • Instead, you must also override the torch.autograd.Function.setup_context() -staticmethod to handle setting up the ctx object. -output is the output of the forward, inputs are a Tuple of inputs -to the forward.
    • -
    • See :ref:extending-autograd for more details
    • -
    - -

    The context can be used to store arbitrary data that can be then -retrieved during the backward pass. Tensors should not be stored -directly on ctx (though this is not currently enforced for -backward compatibility). Instead, tensors should be saved either with -ctx.save_for_backward() if they are intended to be used in -backward (equivalently, vjp) or ctx.save_for_forward() -if they are intended to be used for in jvp.

    -
    - - -
    -
    -
    -
    @staticmethod
    - - def - backward(ctx, grad_output): - - -
    - - -

    Define a formula for differentiating the operation with backward mode automatic differentiation.

    - -

    This function is to be overridden by all subclasses. -(Defining this function is equivalent to defining the vjp function.)

    - -

    It must accept a context ctx as the first argument, followed by -as many outputs as the forward() returned (None will be passed in -for non tensor outputs of the forward function), -and it should return as many tensors, as there were inputs to -forward(). Each argument is the gradient w.r.t the given output, -and each returned value should be the gradient w.r.t. the -corresponding input. If an input is not a Tensor or is a Tensor not -requiring grads, you can just pass None as a gradient for that input.

    - -

    The context can be used to retrieve tensors saved during the forward -pass. It also has an attribute ctx.needs_input_grad as a tuple -of booleans representing whether each input needs gradient. E.g., -backward() will have ctx.needs_input_grad[0] = True if the -first input to forward() needs gradient computed w.r.t. the -output.

    -
    - - -
    -
    -
    -
    +
    +
    - class - GradScalingFunction(torch.autograd.function.Function): - - -
    - - -

    Base class to create custom autograd.Function.

    - -

    To create a custom autograd.Function, subclass this class and implement -the forward() and backward() static methods. Then, to use your custom -op in the forward pass, call the class method apply. Do not call -forward() directly.

    - -

    To ensure correctness and best performance, make sure you are calling the -correct methods on ctx and validating your backward function using -torch.autograd.gradcheck().

    - -

    See :ref:extending-autograd for more details on how to use this class.

    - -

    Examples::

    - -
    >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
    ->>> class Exp(Function):
    ->>>     @staticmethod
    ->>>     def forward(ctx, i):
    ->>>         result = i.exp()
    ->>>         ctx.save_for_backward(result)
    ->>>         return result
    ->>>
    ->>>     @staticmethod
    ->>>     def backward(ctx, grad_output):
    ->>>         result, = ctx.saved_tensors
    ->>>         return grad_output * result
    ->>>
    ->>> # Use it by calling the apply method:
    ->>> # xdoctest: +SKIP
    ->>> output = Exp.apply(input)
    -
    -
    - - -
    -
    -
    @staticmethod
    - def - forward(ctx, x, coef=0.2): + softmax_cross_entropy(logits: torch.Tensor, labels: torch.Tensor, inplace: bool = False):
    - + -

    Define the forward of the custom autograd Function.

    - -

    This function is to be overridden by all subclasses. -There are two ways to define forward:

    +

    softmax cross entropy

    -

    Usage 1 (Combined forward and ctx)::

    - -
    @staticmethod
    -def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    +
    Arguments:
      -
    • It must accept a context ctx as the first argument, followed by any -number of arguments (tensors or other types).
    • -
    • See :ref:combining-forward-context for more details
    • +
    • logits: logits tensor, shape [...,dim]
    • +
    • labels: labels tensor, shape [...]
    • +
    • inplace: update gradient in the logits tensor if True
    -

    Usage 2 (Separate forward and ctx)::

    - -
    @staticmethod
    -def forward(*args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    -@staticmethod
    -def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    -    pass
    -
    - -
      -
    • The forward no longer accepts a ctx argument.
    • -
    • Instead, you must also override the torch.autograd.Function.setup_context() -staticmethod to handle setting up the ctx object. -output is the output of the forward, inputs are a Tuple of inputs -to the forward.
    • -
    • See :ref:extending-autograd for more details
    • -
    - -

    The context can be used to store arbitrary data that can be then -retrieved during the backward pass. Tensors should not be stored -directly on ctx (though this is not currently enforced for -backward compatibility). Instead, tensors should be saved either with -ctx.save_for_backward() if they are intended to be used in -backward (equivalently, vjp) or ctx.save_for_forward() -if they are intended to be used for in jvp.

    -
    - - -
    -
    -
    -
    @staticmethod
    - - def - backward(ctx, grad_output): - - -
    - - -

    Define a formula for differentiating the operation with backward mode automatic differentiation.

    - -

    This function is to be overridden by all subclasses. -(Defining this function is equivalent to defining the vjp function.)

    - -

    It must accept a context ctx as the first argument, followed by -as many outputs as the forward() returned (None will be passed in -for non tensor outputs of the forward function), -and it should return as many tensors, as there were inputs to -forward(). Each argument is the gradient w.r.t the given output, -and each returned value should be the gradient w.r.t. the -corresponding input. If an input is not a Tensor or is a Tensor not -requiring grads, you can just pass None as a gradient for that input.

    +
    Returns:
    -

    The context can be used to retrieve tensors saved during the forward -pass. It also has an attribute ctx.needs_input_grad as a tuple -of booleans representing whether each input needs gradient. E.g., -backward() will have ctx.needs_input_grad[0] = True if the -first input to forward() needs gradient computed w.r.t. the -output.

    +
    +

    per token loss

    +
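A hedged usage sketch, assuming the function is importable from linghe.facade.loss; the vocabulary size is illustrative:

    import torch
    from linghe.facade.loss import softmax_cross_entropy  # assumed import path

    logits = torch.randn(4096, 32768, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    labels = torch.randint(0, 32768, (4096,), device="cuda")

    loss = softmax_cross_entropy(logits, labels)  # per-token loss, shape [4096]
    loss.mean().backward()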
    -
    diff --git a/docs/linghe/facade/norm.html b/docs/linghe/facade/norm.html index 89b4a1c..e858009 100644 --- a/docs/linghe/facade/norm.html +++ b/docs/linghe/facade/norm.html @@ -29,28 +29,10 @@

    API Documentation

    @@ -75,287 +57,65 @@

    -
    -
    +
    +
    - class - RMSNormFunction(torch.autograd.function.Function): - - -
    - - -

    Base class to create custom autograd.Function.

    - -

    To create a custom autograd.Function, subclass this class and implement -the forward() and backward() static methods. Then, to use your custom -op in the forward pass, call the class method apply. Do not call -forward() directly.

    - -

    To ensure correctness and best performance, make sure you are calling the -correct methods on ctx and validating your backward function using -torch.autograd.gradcheck().

    - -

    See :ref:extending-autograd for more details on how to use this class.

    - -

    Examples::

    - -
    >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
    ->>> class Exp(Function):
    ->>>     @staticmethod
    ->>>     def forward(ctx, i):
    ->>>         result = i.exp()
    ->>>         ctx.save_for_backward(result)
    ->>>         return result
    ->>>
    ->>>     @staticmethod
    ->>>     def backward(ctx, grad_output):
    ->>>         result, = ctx.saved_tensors
    ->>>         return grad_output * result
    ->>>
    ->>> # Use it by calling the apply method:
    ->>> # xdoctest: +SKIP
    ->>> output = Exp.apply(input)
    -
    -
    - - -
    -
    -
    @staticmethod
    - def - forward(ctx, x, weight, eps=1e-06): + rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-06):
    - + -

    Define the forward of the custom autograd Function.

    - -

    This function is to be overridden by all subclasses. -There are two ways to define forward:

    - -

    Usage 1 (Combined forward and ctx)::

    +

    rms norm of x with weight

    -
    @staticmethod
    -def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    +
    Arguments:
      -
    • It must accept a context ctx as the first argument, followed by any -number of arguments (tensors or other types).
    • -
    • See :ref:combining-forward-context for more details
    • +
    • x: activation tensor
    • +
    • weight: weight tensor
    • +
    • eps: epsilon for RMS
    -

    Usage 2 (Separate forward and ctx)::

    - -
    @staticmethod
    -def forward(*args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    -@staticmethod
    -def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    -    pass
    -
    - -
      -
    • The forward no longer accepts a ctx argument.
    • -
    • Instead, you must also override the torch.autograd.Function.setup_context() -staticmethod to handle setting up the ctx object. -output is the output of the forward, inputs are a Tuple of inputs -to the forward.
    • -
    • See :ref:extending-autograd for more details
    • -
    - -

    The context can be used to store arbitrary data that can be then -retrieved during the backward pass. Tensors should not be stored -directly on ctx (though this is not currently enforced for -backward compatibility). Instead, tensors should be saved either with -ctx.save_for_backward() if they are intended to be used in -backward (equivalently, vjp) or ctx.save_for_forward() -if they are intended to be used for in jvp.

    -
    - - -
    -
    -
    -
    @staticmethod
    - - def - backward(ctx, dy): - - -
    - - -

    Define a formula for differentiating the operation with backward mode automatic differentiation.

    - -

    This function is to be overridden by all subclasses. -(Defining this function is equivalent to defining the vjp function.)

    +
    Returns:
    -

    It must accept a context ctx as the first argument, followed by -as many outputs as the forward() returned (None will be passed in -for non tensor outputs of the forward function), -and it should return as many tensors, as there were inputs to -forward(). Each argument is the gradient w.r.t the given output, -and each returned value should be the gradient w.r.t. the -corresponding input. If an input is not a Tensor or is a Tensor not -requiring grads, you can just pass None as a gradient for that input.

    - -

    The context can be used to retrieve tensors saved during the forward -pass. It also has an attribute ctx.needs_input_grad as a tuple -of booleans representing whether each input needs gradient. E.g., -backward() will have ctx.needs_input_grad[0] = True if the -first input to forward() needs gradient computed w.r.t. the -output.

    +
    +

    rms output

    +
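A short call sketch, assuming the module path linghe.facade.norm; sizes are illustrative:

    import torch
    from linghe.facade.norm import rms_norm  # assumed import path

    x = torch.randn(8192, 2048, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    weight = torch.ones(2048, dtype=torch.bfloat16, device="cuda", requires_grad=True)

    y = rms_norm(x, weight, eps=1e-6)  # same shape as x
    y.sum().backward()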
    -
    -
    -
    +
    +
    - class - GroupNormGateFunction(torch.autograd.function.Function): - - -
    - - -

    Base class to create custom autograd.Function.

    - -

    To create a custom autograd.Function, subclass this class and implement -the forward() and backward() static methods. Then, to use your custom -op in the forward pass, call the class method apply. Do not call -forward() directly.

    - -

    To ensure correctness and best performance, make sure you are calling the -correct methods on ctx and validating your backward function using -torch.autograd.gradcheck().

    - -

    See :ref:extending-autograd for more details on how to use this class.

    - -

    Examples::

    - -
    >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
    ->>> class Exp(Function):
    ->>>     @staticmethod
    ->>>     def forward(ctx, i):
    ->>>         result = i.exp()
    ->>>         ctx.save_for_backward(result)
    ->>>         return result
    ->>>
    ->>>     @staticmethod
    ->>>     def backward(ctx, grad_output):
    ->>>         result, = ctx.saved_tensors
    ->>>         return grad_output * result
    ->>>
    ->>> # Use it by calling the apply method:
    ->>> # xdoctest: +SKIP
    ->>> output = Exp.apply(input)
    -
    -
    - - -
    -
    -
    @staticmethod
    - def - forward(ctx, x, gate, weight, eps=1e-06, group_size=4): + group_norm_gate( attn_output: torch.Tensor, gate: torch.Tensor, weight: torch.Tensor, eps: float = 1e-06, group_size: int = 4):
    - + -

    Define the forward of the custom autograd Function.

    - -

    This function is to be overridden by all subclasses. -There are two ways to define forward:

    +

    return group_rms_norm(transpose(attn_output, [0,1]), weight) * sigmoid(gate)

    -

    Usage 1 (Combined forward and ctx)::

    - -
    @staticmethod
    -def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    +
    Arguments:
      -
    • It must accept a context ctx as the first argument, followed by any -number of arguments (tensors or other types).
    • -
    • See :ref:combining-forward-context for more details
    • +
    • attn_output: output of core attn, shape [bs, length, n_heads, head_dim]
    • +
    • gate: gate tensor for attention output, shape [length, bs, dim]
    • +
    • weight: weight of RMS norm, shape [dim]
    • +
    • eps: epsilon for RMS
    • +
    • group_size: group size of group RMS norm
    -

    Usage 2 (Separate forward and ctx)::

    - -
    @staticmethod
    -def forward(*args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    -@staticmethod
    -def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    -    pass
    -
    - -
      -
    • The forward no longer accepts a ctx argument.
    • -
    • Instead, you must also override the torch.autograd.Function.setup_context() -staticmethod to handle setting up the ctx object. -output is the output of the forward, inputs are a Tuple of inputs -to the forward.
    • -
    • See :ref:extending-autograd for more details
    • -
    - -

    The context can be used to store arbitrary data that can be then -retrieved during the backward pass. Tensors should not be stored -directly on ctx (though this is not currently enforced for -backward compatibility). Instead, tensors should be saved either with -ctx.save_for_backward() if they are intended to be used in -backward (equivalently, vjp) or ctx.save_for_forward() -if they are intended to be used for in jvp.

    -
    - - -
    -
    -
    -
    @staticmethod
    - - def - backward(ctx, dy): - - -
    - - -

    Define a formula for differentiating the operation with backward mode automatic differentiation.

    - -

    This function is to be overridden by all subclasses. -(Defining this function is equivalent to defining the vjp function.)

    - -

    It must accept a context ctx as the first argument, followed by -as many outputs as the forward() returned (None will be passed in -for non tensor outputs of the forward function), -and it should return as many tensors, as there were inputs to -forward(). Each argument is the gradient w.r.t the given output, -and each returned value should be the gradient w.r.t. the -corresponding input. If an input is not a Tensor or is a Tensor not -requiring grads, you can just pass None as a gradient for that input.

    +
    Returns:
    -

    The context can be used to retrieve tensors saved during the forward -pass. It also has an attribute ctx.needs_input_grad as a tuple -of booleans representing whether each input needs gradient. E.g., -backward() will have ctx.needs_input_grad[0] = True if the -first input to forward() needs gradient computed w.r.t. the -output.

    +
    +

    output with shape [length, bs, dim]

    +
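A hedged sketch matching the documented shapes (head count and head dim are illustrative):

    import torch
    from linghe.facade.norm import group_norm_gate  # assumed import path

    bs, length, n_heads, head_dim = 2, 4096, 16, 128
    dim = n_heads * head_dim

    attn_output = torch.randn(bs, length, n_heads, head_dim, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    gate = torch.randn(length, bs, dim, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    weight = torch.ones(dim, dtype=torch.bfloat16, device="cuda", requires_grad=True)

    out = group_norm_gate(attn_output, gate, weight, eps=1e-6, group_size=4)  # [length, bs, dim]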
    -
    diff --git a/docs/linghe/facade/rope.html b/docs/linghe/facade/rope.html index 8cf1b21..0d39eca 100644 --- a/docs/linghe/facade/rope.html +++ b/docs/linghe/facade/rope.html @@ -29,16 +29,7 @@

    API Documentation

    @@ -63,146 +54,40 @@

    -
    -
    +
    +
    - class - QkNormHalfRopeFunction(torch.autograd.function.Function): - - -
    - - -

    Base class to create custom autograd.Function.

    - -

    To create a custom autograd.Function, subclass this class and implement -the forward() and backward() static methods. Then, to use your custom -op in the forward pass, call the class method apply. Do not call -forward() directly.

    - -

    To ensure correctness and best performance, make sure you are calling the -correct methods on ctx and validating your backward function using -torch.autograd.gradcheck().

    - -

    See :ref:extending-autograd for more details on how to use this class.

    - -

    Examples::

    - -
    >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
    ->>> class Exp(Function):
    ->>>     @staticmethod
    ->>>     def forward(ctx, i):
    ->>>         result = i.exp()
    ->>>         ctx.save_for_backward(result)
    ->>>         return result
    ->>>
    ->>>     @staticmethod
    ->>>     def backward(ctx, grad_output):
    ->>>         result, = ctx.saved_tensors
    ->>>         return grad_output * result
    ->>>
    ->>> # Use it by calling the apply method:
    ->>> # xdoctest: +SKIP
    ->>> output = Exp.apply(input)
    -
    -
    - - -
    -
    -
    @staticmethod
    - def - forward(ctx, qkv, q_norm_weight, k_norm_weight, freqs, H=32, h=4, eps=1e-06): + qk_norm_half_rope( qkv: torch.Tensor, q_norm_weight: torch.Tensor, k_norm_weight: torch.Tensor, freqs: torch.Tensor, H: int = 32, h: int = 4, eps: float = 1e-06):
    - + -

    Define the forward of the custom autograd Function.

    - -

    This function is to be overridden by all subclasses. -There are two ways to define forward:

    +

    split qkv to q/k/v, apply qk norm and half rope to q/k, transpose q/k/v to flash-attention layout

    -

    Usage 1 (Combined forward and ctx)::

    - -
    @staticmethod
    -def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    +
    Arguments:
      -
    • It must accept a context ctx as the first argument, followed by any -number of arguments (tensors or other types).
    • -
    • See :ref:combining-forward-context for more details
    • +
    • qkv: QKV tensor with size of [S, B, dim], heads are interleaved
    • +
    • q_norm_weight: rms norm weight for query
    • +
    • k_norm_weight: rms norm weight for key
    • +
    • freqs: Freqs tensor based on half dim.
    • +
    • H: Number of attention heads.
    • +
    • h: Number of key/value heads.
    • +
    • eps: epsilon value for L2 normalization.
    -

    Usage 2 (Separate forward and ctx)::

    - -
    @staticmethod
    -def forward(*args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    -@staticmethod
    -def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    -    pass
    -
    +
    Returns:
    -
      -
    • The forward no longer accepts a ctx argument.
    • -
    • Instead, you must also override the torch.autograd.Function.setup_context() -staticmethod to handle setting up the ctx object. -output is the output of the forward, inputs are a Tuple of inputs -to the forward.
    • -
    • See :ref:extending-autograd for more details
    • -
    - -

    The context can be used to store arbitrary data that can be then -retrieved during the backward pass. Tensors should not be stored -directly on ctx (though this is not currently enforced for -backward compatibility). Instead, tensors should be saved either with -ctx.save_for_backward() if they are intended to be used in -backward (equivalently, vjp) or ctx.save_for_forward() -if they are intended to be used for in jvp.

    -
    - - -
    -
    -
    -
    @staticmethod
    - - def - backward(ctx, grad_q, grad_k, grad_v): - - -
    - - -

    Define a formula for differentiating the operation with backward mode automatic differentiation.

    - -

    This function is to be overridden by all subclasses. -(Defining this function is equivalent to defining the vjp function.)

    - -

    It must accept a context ctx as the first argument, followed by -as many outputs as the forward() returned (None will be passed in -for non tensor outputs of the forward function), -and it should return as many tensors, as there were inputs to -forward(). Each argument is the gradient w.r.t the given output, -and each returned value should be the gradient w.r.t. the -corresponding input. If an input is not a Tensor or is a Tensor not -requiring grads, you can just pass None as a gradient for that input.

    - -

    The context can be used to retrieve tensors saved during the forward -pass. It also has an attribute ctx.needs_input_grad as a tuple -of booleans representing whether each input needs gradient. E.g., -backward() will have ctx.needs_input_grad[0] = True if the -first input to forward() needs gradient computed w.r.t. the -output.

    +
    +

    qo: shape [B, S, H, head_dim] + ko: shape [B, S, h, head_dim] + vo: shape [B, S, h, head_dim]

    +
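A sketch of the fused split + qk-norm + half-RoPE call; the freqs layout and the per-head norm-weight shape are assumptions, not taken from the docstring:

    import torch
    from linghe.facade.rope import qk_norm_half_rope  # assumed import path

    S, B, H, h, head_dim = 4096, 2, 32, 4, 128
    dim = (H + 2 * h) * head_dim  # interleaved q/k/v heads

    qkv = torch.randn(S, B, dim, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    q_norm_weight = torch.ones(head_dim, dtype=torch.bfloat16, device="cuda")  # assumed shape
    k_norm_weight = torch.ones(head_dim, dtype=torch.bfloat16, device="cuda")  # assumed shape
    freqs = torch.randn(S, 1, 1, head_dim // 2, dtype=torch.float32, device="cuda")  # assumed half-dim layout

    q, k, v = qk_norm_half_rope(qkv, q_norm_weight, k_norm_weight, freqs, H=H, h=h, eps=1e-6)
    # q: [B, S, H, head_dim]; k, v: [B, S, h, head_dim]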
    -
    diff --git a/docs/linghe/facade/transpose.html b/docs/linghe/facade/transpose.html index 9efef88..9d645db 100644 --- a/docs/linghe/facade/transpose.html +++ b/docs/linghe/facade/transpose.html @@ -29,16 +29,7 @@

    API Documentation

    @@ -63,146 +54,32 @@

    -
    -
    +
    +
    - class - TransposeDim01Function(torch.autograd.function.Function): - - -
    - - -

    Base class to create custom autograd.Function.

    - -

    To create a custom autograd.Function, subclass this class and implement -the forward() and backward() static methods. Then, to use your custom -op in the forward pass, call the class method apply. Do not call -forward() directly.

    - -

    To ensure correctness and best performance, make sure you are calling the -correct methods on ctx and validating your backward function using -torch.autograd.gradcheck().

    - -

    See :ref:extending-autograd for more details on how to use this class.

    - -

    Examples::

    - -
    >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_AUTOGRAD)
    ->>> class Exp(Function):
    ->>>     @staticmethod
    ->>>     def forward(ctx, i):
    ->>>         result = i.exp()
    ->>>         ctx.save_for_backward(result)
    ->>>         return result
    ->>>
    ->>>     @staticmethod
    ->>>     def backward(ctx, grad_output):
    ->>>         result, = ctx.saved_tensors
    ->>>         return grad_output * result
    ->>>
    ->>> # Use it by calling the apply method:
    ->>> # xdoctest: +SKIP
    ->>> output = Exp.apply(input)
    -
    -
    - - -
    -
    -
    @staticmethod
    - def - forward(ctx, x): + transpose_dim01(x):
    - + -

    Define the forward of the custom autograd Function.

    - -

    This function is to be overridden by all subclasses. -There are two ways to define forward:

    +

transpose a tensor on its first two dims, x.ndim should not be greater than 4</p>

    -

    Usage 1 (Combined forward and ctx)::

    - -
    @staticmethod
    -def forward(ctx: Any, *args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    +
    Arguments:
      -
    • It must accept a context ctx as the first argument, followed by any -number of arguments (tensors or other types).
    • -
    • See :ref:combining-forward-context for more details
    • +
    • x: input tensor
    -

    Usage 2 (Separate forward and ctx)::

    - -
    @staticmethod
    -def forward(*args: Any, **kwargs: Any) -> Any:
    -    pass
    -
    -@staticmethod
    -def setup_context(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> None:
    -    pass
    -
    +
    Returns:
    -
      -
    • The forward no longer accepts a ctx argument.
    • -
    • Instead, you must also override the torch.autograd.Function.setup_context() -staticmethod to handle setting up the ctx object. -output is the output of the forward, inputs are a Tuple of inputs -to the forward.
    • -
    • See :ref:extending-autograd for more details
    • -
    - -

    The context can be used to store arbitrary data that can be then -retrieved during the backward pass. Tensors should not be stored -directly on ctx (though this is not currently enforced for -backward compatibility). Instead, tensors should be saved either with -ctx.save_for_backward() if they are intended to be used in -backward (equivalently, vjp) or ctx.save_for_forward() -if they are intended to be used for in jvp.

    -
    - - -
    -
    -
    -
    @staticmethod
    - - def - backward(ctx, grad_output): - - -
    - - -

    Define a formula for differentiating the operation with backward mode automatic differentiation.

    - -

    This function is to be overridden by all subclasses. -(Defining this function is equivalent to defining the vjp function.)

    - -

    It must accept a context ctx as the first argument, followed by -as many outputs as the forward() returned (None will be passed in -for non tensor outputs of the forward function), -and it should return as many tensors, as there were inputs to -forward(). Each argument is the gradient w.r.t the given output, -and each returned value should be the gradient w.r.t. the -corresponding input. If an input is not a Tensor or is a Tensor not -requiring grads, you can just pass None as a gradient for that input.

    - -

    The context can be used to retrieve tensors saved during the forward -pass. It also has an attribute ctx.needs_input_grad as a tuple -of booleans representing whether each input needs gradient. E.g., -backward() will have ctx.needs_input_grad[0] = True if the -first input to forward() needs gradient computed w.r.t. the -output.

    +
    +

    a transposed tensor

    +
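For completeness, a one-line usage sketch (import path assumed):

    import torch
    from linghe.facade.transpose import transpose_dim01  # assumed import path

    x = torch.randn(4096, 2, 32, 128, device="cuda", requires_grad=True)
    y = transpose_dim01(x)  # autograd-aware transpose of dims 0 and 1 -> [2, 4096, 32, 128]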
    -
    diff --git a/docs/linghe/gemm.html b/docs/linghe/gemm.html index de3721c..3175a31 100644 --- a/docs/linghe/gemm.html +++ b/docs/linghe/gemm.html @@ -27,6 +27,8 @@

    Submodules

    diff --git a/docs/linghe/gemm/blockwise_fp8_gemm.html b/docs/linghe/gemm/blockwise_fp8_gemm.html new file mode 100644 index 0000000..69ab790 --- /dev/null +++ b/docs/linghe/gemm/blockwise_fp8_gemm.html @@ -0,0 +1,56 @@ + + + + + + + linghe.gemm.blockwise_fp8_gemm API documentation + + + + + + + + + +
    +
    +

    +linghe.gemm.blockwise_fp8_gemm

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/gemm/channelwise_fp8_gemm.html b/docs/linghe/gemm/channelwise_fp8_gemm.html new file mode 100644 index 0000000..c8c6831 --- /dev/null +++ b/docs/linghe/gemm/channelwise_fp8_gemm.html @@ -0,0 +1,93 @@ + + + + + + + linghe.gemm.channelwise_fp8_gemm API documentation + + + + + + + + + +
    +
    +

    +linghe.gemm.channelwise_fp8_gemm

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    +
    + + def + triton_scaled_mm( a: torch.Tensor, b: torch.Tensor, a_scale: torch.Tensor, b_scale: torch.Tensor, out_dtype=torch.float32, c=None, accum=True): + + +
    + + +

similar to torch._scaled_mm, supports accumulating the gemm output to c + and low-precision output tensors</p>

    + +
    Arguments:
    + +
      +
    • a: left fp8 tensor
    • +
    • b: right fp8 tensor, column-major
    • +
    • a_scale: fp32 scale of a
    • +
    • b_scale: fp32 scale of b
    • +
    • out_dtype: output tensor dtype
    • +
    • c: output tensor
    • +
    • accum: accumulate output on c if True
    • +
    + +
    Returns:
    + +
    +

    c: output tensor

    +
    +
    + + +
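A hedged sketch of the channelwise FP8 GEMM with accumulation; the per-row/per-column scale shapes and the column-major construction of b are assumptions based on the argument descriptions:

    import torch
    from linghe.gemm.channelwise_fp8_gemm import triton_scaled_mm  # assumed import path

    M, K, N = 8192, 2048, 2048
    a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # [K, N], column-major
    a_scale = torch.rand(M, 1, device="cuda", dtype=torch.float32)    # assumed rowwise scale
    b_scale = torch.rand(1, N, device="cuda", dtype=torch.float32)    # assumed columnwise scale

    c = torch.zeros(M, N, device="cuda", dtype=torch.float32)
    c = triton_scaled_mm(a, b, a_scale, b_scale, out_dtype=torch.float32, c=c, accum=True)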
    +
    + + \ No newline at end of file diff --git a/docs/linghe/gemm/fp32_gemm.html b/docs/linghe/gemm/fp32_gemm.html index 50a667e..f0f3faf 100644 --- a/docs/linghe/gemm/fp32_gemm.html +++ b/docs/linghe/gemm/fp32_gemm.html @@ -28,6 +28,21 @@

    API Documentation

    @@ -51,6 +66,157 @@

    +
    +
    + + def + triton_fp32_gemm(a: torch.Tensor, b: torch.Tensor): + + +
    + + +

return fp32 gemm result with fp16/bf16 inputs, + it is mainly used for the MoE router GEMM + and is NOT suitable for large-size GEMM</p>

    + +
    Arguments:
    + +
      +
    • a: left matrix with fp16/bf16 precision
    • +
    • b: right matrix with fp16/bf16 precision
    • +
    + +
    Returns:
    + +
    +

    c: output with fp32 precision

    +
    +
    + + +
    +
    +
    + + def + triton_fp32_gemm_for_backward(a: torch.Tensor, b: torch.Tensor): + + +
    + + +

mixed-precision gemm for backward, a@b.float()</p>

    + +
    Arguments:
    + +
      +
    • a: input gradient, fp32
    • +
    • b: gemm weight, bf16/fp16
    • +
    + +
    Returns:
    + +
    +

    c: gradient of activation

    +
    +
    + + +
    +
    +
    + + def + triton_fp32_gemm_for_update(a: torch.Tensor, b: torch.Tensor): + + +
    + + +

mixed-precision gemm for updating weight</p>

    + +
    Arguments:
    + +
      +
    • a: gradient of output, fp32
    • +
    • b: input activation, bf16/fp16
    • +
    + +
    Returns:
    + +
    +

    c: gradient of weight

    +
    +
    + + +
    +
    +
    + + def + triton_scaled_fp32_gemm(a: torch.Tensor, b: torch.Tensor, scale: torch.Tensor): + + +
    + + +

c = (a * scale[:,None]) @ b +this kernel is used to fuse RMSNorm and quantization in the MoE layer +native implementation: + y = rms_norm(x), + y_q = quantization(y), + router_logits = y@w +we cannot fuse rms_norm and quantization naively +as we still need the bf16 y for the moe router gemm +fused implementation: + y_q, rms = quantization(rms_norm(x)) + router_logits = (x/rms)@w +so we need a scaled fp32 gemm kernel</p>

    + +
    Arguments:
    + +
      +
    • a: activation tensor
    • +
    • b: weight tensor
    • +
    • scale: scale for activation tensor, 1/rms
    • +
    + +

    Returns:

    +
    + + +
    +
    +
    + + def + triton_scaled_fp32_gemm_for_update(a: torch.Tensor, b: torch.Tensor, scale: torch.Tensor): + + +
    + + +

    see triton_scaled_fp32_gemm

    + +
    Arguments:
    + +
      +
    • a: y
    • +
    • b: activation before RMS norm
    • +
    • scale: 1/rms
    • +
    + +
    Returns:
    + +
    +

    dw

    +
    +
    + + +
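To make the fused RMSNorm/quantization reasoning above concrete, here is a plain-PyTorch reference of what the scaled kernel computes according to its formula; this is the reference semantics only, not the Triton implementation:

    import torch

    def scaled_fp32_gemm_reference(a: torch.Tensor, b: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # c = (a * scale[:, None]) @ b, accumulated in float32
        return (a.float() * scale.float()[:, None]) @ b.float()

    a = torch.randn(8192, 2048, dtype=torch.bfloat16, device="cuda")  # activation before RMS norm
    b = torch.randn(2048, 256, dtype=torch.bfloat16, device="cuda")   # router weight
    scale = torch.rand(8192, device="cuda") + 0.5                     # e.g. 1/rms per token

    router_logits = scaled_fp32_gemm_reference(a, b, scale)           # fp32 output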
    \ No newline at end of file diff --git a/docs/linghe/quant.html b/docs/linghe/quant.html index 7de4b97..bd6573d 100644 --- a/docs/linghe/quant.html +++ b/docs/linghe/quant.html @@ -29,6 +29,9 @@

    Submodules

    diff --git a/docs/linghe/quant/block.html b/docs/linghe/quant/block.html index 6465535..a562b18 100644 --- a/docs/linghe/quant/block.html +++ b/docs/linghe/quant/block.html @@ -25,11 +25,13 @@ -

    Submodules

    - + +

    API Documentation

    + @@ -45,11 +47,43 @@

    Submodules

    linghe.quant.block

    - +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    +
    +
    +
    + + def + triton_block_quant(x, block_size=128, round_scale=False): + + +
    + + +

    blockwise quantize x

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
• block_size: quantization block size</li>
    • +
    • round_scale: whether round scale to power of 2
    • +
    + +
    Returns:
    + +
    +

    y: quantized tensor, float8_e4m3fn + s: quantization scale, float32

    +
    +
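A usage sketch for 128x128 blockwise quantization; the import path and the scale-grid layout are assumptions consistent with the module layout and block size:

    import torch
    from linghe.quant.block import triton_block_quant  # assumed import path

    w = torch.randn(2048, 2048, dtype=torch.bfloat16, device="cuda")
    w_q, w_scale = triton_block_quant(w, block_size=128, round_scale=True)
    # w_q: float8_e4m3fn, same shape as w
    # w_scale: float32, one scale per 128x128 block (here a 16x16 grid) -- assumed layout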
    + + +
    \ No newline at end of file diff --git a/docs/linghe/quant/channel.html b/docs/linghe/quant/channel.html index 54287c2..80243be 100644 --- a/docs/linghe/quant/channel.html +++ b/docs/linghe/quant/channel.html @@ -25,10 +25,19 @@ -

    Submodules

    - + +

    API Documentation

    + @@ -44,11 +53,100 @@

    Submodules

    linghe.quant.channel

    - +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + +
    +
    + + def + triton_row_quant(x, round_scale=False): + + +
    + + +

    rowwise quantize x

    + +
    Arguments:
    + +
      +
    • x: input x
    • +
    • round_scale: whether round scale to power of 2
    • +
    + +
    Returns:
    + +
    +

    x_q: quantized tensor + x_scale: quantization scale

    +
    +
    + + +
    +
    +
    + + def + triton_tokenwise_row_quant(x, out=None, scale=None, round_scale=False): + + +
    + + +

    rowwise quantize x with power of 2 dim size

    + +
    Arguments:
    + +
      +
    • x: input x
    • +
    • round_scale: whether round scale to power of 2
    • +
    + +
    Returns:
    + +
    +

    out: quantized tensor + scale: quantization scale

    +
    +
    + + +
    +
    +
    + + def + triton_transpose_row_quant(x, round_scale=False): + + +
    + + +

    transpose x and row quantize x

    + +
    Arguments:
    + +
      +
    • x: input x
    • +
    • round_scale: whether round scale to power of 2
    • +
    + +
    Returns:
    + +
    +

    x_q: quantized tensor + x_scale: quantization scale

    +
    +
    + + +
    \ No newline at end of file diff --git a/docs/linghe/quant/group.html b/docs/linghe/quant/group.html new file mode 100644 index 0000000..e0191d6 --- /dev/null +++ b/docs/linghe/quant/group.html @@ -0,0 +1,89 @@ + + + + + + + linghe.quant.group API documentation + + + + + + + + + +
    +
    +

    +linghe.quant.group

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    +
    + + def + triton_group_quant(x, dtype=torch.float8_e4m3fn, group_size=128, round_scale=False): + + +
    + + +

groupwise quantize x, groups are laid out rowwise</p>

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
• group_size: quantization group size</li>
    • +
    • round_scale: whether round scale to power of 2
    • +
    + +
    Returns:
    + +
    +

    y: quantized tensor, float8_e4m3fn + s: quantization scale, float32

    +
    +
    + + +
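A matching sketch for rowwise group quantization of activations; the import path follows the test update earlier in this series (linghe.quant.group), while the scale layout is an assumption:

    import torch
    from linghe.quant.group import triton_group_quant

    x = torch.randn(8192, 2048, dtype=torch.bfloat16, device="cuda")
    x_q, x_scale = triton_group_quant(x, dtype=torch.float8_e4m3fn, group_size=128, round_scale=True)
    # x_q: float8_e4m3fn, shape [8192, 2048]
    # x_scale: float32, one scale per 128-element group -- assumed [8192, 16] layout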
    +
    + + \ No newline at end of file diff --git a/docs/linghe/quant/hadamard.html b/docs/linghe/quant/hadamard.html new file mode 100644 index 0000000..e9d1157 --- /dev/null +++ b/docs/linghe/quant/hadamard.html @@ -0,0 +1,54 @@ + + + + + + + linghe.quant.hadamard API documentation + + + + + + + + + +
    +
    +

    +linghe.quant.hadamard

    + + + + + +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/quant/hadamard/seperate_hadamard.html b/docs/linghe/quant/hadamard/seperate_hadamard.html new file mode 100644 index 0000000..dce669e --- /dev/null +++ b/docs/linghe/quant/hadamard/seperate_hadamard.html @@ -0,0 +1,56 @@ + + + + + + + linghe.quant.hadamard.seperate_hadamard API documentation + + + + + + + + + +
    +
    +

    +linghe.quant.hadamard.seperate_hadamard

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/quant/smooth.html b/docs/linghe/quant/smooth.html new file mode 100644 index 0000000..1a83b50 --- /dev/null +++ b/docs/linghe/quant/smooth.html @@ -0,0 +1,55 @@ + + + + + + + linghe.quant.smooth API documentation + + + + + + + + + +
    +
    +

    +linghe.quant.smooth

    + + + + + +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/quant/smooth/reused_smooth.html b/docs/linghe/quant/smooth/reused_smooth.html new file mode 100644 index 0000000..927018e --- /dev/null +++ b/docs/linghe/quant/smooth/reused_smooth.html @@ -0,0 +1,56 @@ + + + + + + + linghe.quant.smooth.reused_smooth API documentation + + + + + + + + + +
    +
    +

    +linghe.quant.smooth.reused_smooth

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/quant/smooth/seperate_smooth.html b/docs/linghe/quant/smooth/seperate_smooth.html new file mode 100644 index 0000000..5ba5130 --- /dev/null +++ b/docs/linghe/quant/smooth/seperate_smooth.html @@ -0,0 +1,56 @@ + + + + + + + linghe.quant.smooth.seperate_smooth API documentation + + + + + + + + + +
    +
    +

    +linghe.quant.smooth.seperate_smooth

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/utils/add.html b/docs/linghe/utils/add.html index 10a2108..9c8fe47 100644 --- a/docs/linghe/utils/add.html +++ b/docs/linghe/utils/add.html @@ -71,10 +71,14 @@
    Arguments:
    • x: Tensor
    • y: Tensor
    • -
    • accum: whether accum y to x
    • +
    • accum: x += y if accum=True else x.copy_(y)
    -

    Returns: x += y if accum=True else x.copy_(y)

    +
    Returns:
    + +
    +

    updated x

    +
diff --git a/docs/linghe/utils/dot.html b/docs/linghe/utils/dot.html index dc32301..49a7b28 100644 --- a/docs/linghe/utils/dot.html +++ b/docs/linghe/utils/dot.html @@ -28,6 +28,9 @@

API Documentation

@@ -51,6 +54,35 @@

+
+
+ + def + triton_dot(x, y): + + +
+ + +

vector dot product, output = sum(x*y, 1), +it is used to calculate the gradient of the router weight</p>

+ +
Arguments:
+ +
    +
  • x:
  • +
  • y:
  • +
+ +
Returns:
+ +
+

output of sum(x*y, 1)

+
+
+ + +
\ No newline at end of file diff --git a/docs/linghe/utils/gather.html b/docs/linghe/utils/gather.html index 17a5e5d..8f18e38 100644 --- a/docs/linghe/utils/gather.html +++ b/docs/linghe/utils/gather.html @@ -28,6 +28,30 @@

API Documentation

@@ -51,6 +75,268 @@

+
+
+ + def + triton_make_row_id_map(routing_map: torch.Tensor, multiple_of: int = 1): + + +
+ + +

make row id map, values in the tensor are the row indices

+ +
Arguments:
+ +
    +
  • routing_map: a tensor of 0/1 values, 1 indicates routed
  • +
  • multiple_of: padding the tokens of each expert to multiple of this value
  • +
+ +
Returns:
+ +
+

row id map with shape [n_tokens, n_experts]

+
+
+ + +
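A small sketch of building the 0/1 routing map these helpers expect (sizes, dtype, and the top-k selection are illustrative assumptions):

    import torch
    from linghe.utils.gather import triton_make_row_id_map  # assumed import path

    num_tokens, num_experts, topk = 8192, 256, 8

    # routing_map[t, e] == 1 when token t is routed to expert e
    router_scores = torch.rand(num_tokens, num_experts, device="cuda")
    topk_idx = torch.topk(router_scores, k=topk, dim=1).indices
    routing_map = torch.zeros(num_tokens, num_experts, dtype=torch.int32, device="cuda")
    routing_map.scatter_(1, topk_idx, 1)

    # pad each expert's token count to a multiple of 32 so downstream GEMMs see aligned shapes
    row_id_map = triton_make_row_id_map(routing_map, multiple_of=32)  # [num_tokens, num_experts]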
+
+
+ + def + triton_make_row_id_map_and_indices(routing_map: torch.Tensor, num_out_tokens: int, multiple_of: int = 1): + + +
+ + +

similar to triton_make_row_id_map, but outputs an indices tensor as well</p>

+ +
Arguments:
+ +
    +
  • routing_map: [n_tokens, n_experts]
  • +
  • num_out_tokens: sum(round_up_to(n_tokens, multiple_of))
  • +
  • multiple_of: padding the tokens of each expert to this value
  • +
+ +
Returns:
+ +
+

row_in_map: [n_tokens, n_experts] + row_indices: [num_out_tokens]

+
+
+ + +
+
+
+ + def + triton_index_select(x, indices, scale=None, out=None, scale_out=None): + + +
+ + +

index select for quantized tensor

+ +
Arguments:
+ +
    +
  • x: [bs, dim]
  • +
  • indices: [K]
  • +
  • scale: [bs]
  • +
+ +
Returns:
+ +
+

out: output of selected x + scale_out: scale of selected scale

+
+
+ + +
+
+
+ + def + triton_permute_with_mask_map( inp: torch.Tensor, scale: torch.Tensor, probs: torch.Tensor, row_id_map: torch.Tensor, num_out_tokens: int, contiguous: bool = True, tokens_per_expert: Optional[torch.Tensor] = None): + + +
+ + +

gather quantized tensor with row id map

+ +
Arguments:
+ +
    +
  • inp: [num_tokens, hidden_size], rowwise quantized tensor
  • +
  • scale: [num_tokens], quantization scale
  • +
  • probs: router prob, used as weight
  • +
  • row_id_map: [n_experts, num_tokens] +index >= 0: row index of output tensor +index == -1: ignore +Note: index may not be contiguous
  • +
  • num_out_tokens: output token count, including padding tokens
  • +
• contiguous: whether indices in row_id_map are contiguous, +False means padded</li>
  • +
  • tokens_per_expert: [num_experts], token count per expert, +non-blocking cuda tensor
  • +
+ +
Returns:
+ +
+

output: permuted quantized tensor + permuted_scale: permuted quantization scale + permuted_probs: permuted router prob

+
+
+ + +
+
+
+ + def + triton_batch_transpose_smooth_permute_with_indices( x, scale, org_smooth_scale, smooth_scales, indices, token_count_per_expert, splits, x_q=None, x_scale=None, round_scale=False): + + +
+ + +

used for smooth quantization backward in megatron 0.12, +x is gathered, requantized, padded to a multiple of 32 and transposed</p>

+ +
Arguments:
+ +
    +
  • x: dy, [bs, dim], it is smooth quantized
  • +
  • scale: [bs], quantized scale
  • +
  • org_smooth_scale: [dim]
  • +
  • smooth_scales: [n_experts, dim]
  • +
  • indices: [sum(tokens_per_experts)]
  • +
  • token_count_per_expert: [n_experts], tensor of token count per expert
  • +
  • splits: [n_experts], list of token_count_per_expert
  • +
  • round_scale: round quantization scale to power of 2
  • +
+ +
Returns:
+ +
+

x_q: [sum(roundup(tokens_per_experts)) * dim] + x_scale: [sum(roundup(tokens_per_experts))]

+
+
+ + +
+
+
+ + def + triton_smooth_weighted_permute_with_indices( grads, tokens, smooth_scales, token_count_per_expert, indices, x_q=None, x_scale=None, x_sum=None, reverse=False, round_scale=False): + + +
+ + +

select, smooth and quantize, used in megatron 0.11 all2all MoE</p>

+ +
Arguments:
+ +
    +
  • grads: [bs, dim]
  • +
  • tokens: [bs, dim]
  • +
  • smooth_scales: [n_experts, dim]
  • +
  • token_count_per_expert: [n_experts]
  • +
  • indices: [n_experts*topk]
  • +
  • reverse: whether scale is 1/scale
  • +
  • round_scale: whether round scale to power of 2
  • +
+ +
Returns:
+ +
+

x_q: [bs*topk, dim] + x_scale: [bs*topk] + x_sum: [bs*topk]</p>

+
+
+ + +
+
+
+ + def + triton_smooth_permute_with_indices( grad_data, grad_scale, smooth_scales, token_count_per_expert, indices, x_q=None, x_scale=None, reverse=False, round_scale=False): + + +
+ + +

select, smooth and quantize</p>

+ +
Arguments:
+ +
    +
  • grad_data: [bs, dim]
  • +
  • grad_scale: [bs]
  • +
  • smooth_scales: [n_experts, dim]
  • +
  • token_count_per_expert: [n_experts]
  • +
  • indices: [n_experts*topk]
  • +
  • x_q: [bs*topk, dim]
  • +
  • x_scale: [bs*topk]
  • +
  • reverse:
  • +
  • round_scale:
  • +
+ +

Returns:

+
+ + +
+
+
+ + def + triton_smooth_permute_with_mask_map( inp: torch.Tensor, row_id_map: torch.Tensor, scale: torch.Tensor, num_tokens: int, num_experts: int, num_out_tokens: int, hidden_size: int, smooth_scales: torch.Tensor, reverse=True, round_scale=False): + + +
+ + +

gather, optionally dequantize, and smooth-quantize</p>

+ +
Arguments:
+ +
    +
  • inp: [num_tokens, hidden_size], rowwise quantized tensor
  • +
  • row_id_map: [n_experts, num_tokens], indices
  • +
  • scale: [num_tokens, hs], rowwise_scale_inv, optional
  • +
  • num_tokens: [n_experts]
  • +
  • num_experts:
  • +
  • num_out_tokens:
  • +
  • hidden_size:
  • +
  • smooth_scales: [n_experts, hidden_size]
  • +
  • reverse:
  • +
  • round_scale:
  • +
+ +

Returns:

+
+ + +
\ No newline at end of file diff --git a/docs/linghe/utils/loss.html b/docs/linghe/utils/loss.html index 98e8f96..73e62df 100644 --- a/docs/linghe/utils/loss.html +++ b/docs/linghe/utils/loss.html @@ -28,6 +28,12 @@

API Documentation

@@ -51,6 +57,65 @@

+
+
+ + def + triton_softmax_cross_entropy_forward(logits, labels): + + +
+ + +

compute token-wise softmax cross entropy loss

+ +
Arguments:
+ +
    +
  • logits: logits tensor
  • +
  • labels: labels tensor
  • +
+ +
Returns:
+ +
+

loss of each token

+
+
+ + +
+
+
+ + def + triton_softmax_cross_entropy_backward(logits, labels, sum_exp, max_logit, input_grad, output_grad=None): + + +
+ + +

backward of softmax cross entropy loss

+ +
Arguments:
+ +
    +
  • logits: logit tensor, [bs, dim]
  • +
  • labels: label tensor, [bs]
  • +
  • sum_exp: [bs]
  • +
  • max_logit: [bs]
  • +
  • input_grad: gradient, [bs, dim]
  • +
+ +
Returns:
+ +
+

output_grad: [bs, dim]

+
+
+ + +
\ No newline at end of file diff --git a/docs/linghe/utils/norm.html b/docs/linghe/utils/norm.html index 8c606d3..63e9fb9 100644 --- a/docs/linghe/utils/norm.html +++ b/docs/linghe/utils/norm.html @@ -28,6 +28,9 @@

API Documentation

    +
  • + triton_rms_norm_forward +
  • triton_rms_norm_and_block_quant_forward
  • @@ -55,6 +58,35 @@

    +
    +
    + + def + triton_rms_norm_forward(x, weight, eps=1e-06, out=None): + + +
    + + +

    rms norm

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
    • weight: weight of rms norm
    • +
    • eps: epsilon of rms norm
    • +
    + +
    Returns:
    + +
    +

    out: output tensor

    +
    +
    + + +
    @@ -111,14 +143,18 @@
    Returns:
    Arguments:
      -
    • x:
    • -
    • gate:
    • -
    • weight:
    • -
    • eps:
    • -
    • group_size:
    • +
    • x: output of attn, [bs, length, n_heads, head_dim]
    • +
    • gate: gate tensor, [length, bs, dim]
    • +
    • weight: rms norm weight, [dim]
    • +
    • eps: epsilon of rms norm
    • +
    • group_size: group size of group rms norm
    -

    Returns:

    +
    Returns:
    + +
    +

    output tensor

    +
    diff --git a/docs/linghe/utils/rearange.html b/docs/linghe/utils/rearange.html index ff1bdb1..dab027f 100644 --- a/docs/linghe/utils/rearange.html +++ b/docs/linghe/utils/rearange.html @@ -28,6 +28,9 @@

    API Documentation

    @@ -51,6 +54,38 @@

    +
    +
    + + def + triton_split_and_cat(x, counts, indices, scales=None): + + +
    + + +

split x into multiple tensors and concatenate them with indices, +it is used for permutation in MoE</p>

    + +
    Arguments:
    + +
      +
    • x: [bs, dim]
    • +
    • counts: [n_split]
    • +
    • indices: [n_split]
    • +
    • scales: [bs]
    • +
    + +
    Returns:
    + +
    +

    y: output tensor + output_scales: output scales if scales is not None

    +
    +
    + + +
    \ No newline at end of file diff --git a/docs/linghe/utils/reduce.html b/docs/linghe/utils/reduce.html index 06ac7f5..d000c2d 100644 --- a/docs/linghe/utils/reduce.html +++ b/docs/linghe/utils/reduce.html @@ -28,6 +28,15 @@

    API Documentation

    @@ -51,6 +60,92 @@

    +
    +
    + + def + triton_abs_max(x, scale=None, smooth_scale=None, min_value=1e-30, axis=0): + + +
    + + +

    columnwise abs max of x, it is used in smooth quantization

    + +
    Arguments:
    + +
      +
    • x: input tensor, may be quantized tensor
    • +
    • scale: quantization scale if x is quantized
    • +
    • smooth_scale: optional smooth scale
    • +
• min_value: output = max(max(abs(x), axis), min_value)</li>
    • +
    • axis: reduce axis
    • +
    + +
    Returns:
    + +
    +

    max tensor

    +
    +
    + + +
    +
    +
    + + def + triton_batch_count_zero(xs): + + +
    + + +

count zeros in a tensor list, it is used to monitor zeros in gradient tensors</p>

    + +
    Arguments:
    + +
      +
    • xs: input tensors
    • +
    + +
    Returns:
    + +
    +

    a single-value int64 tensor

    +
    +
    + + +
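A quick sketch of the gradient-monitoring use mentioned above (import path assumed):

    import torch
    from linghe.utils.reduce import triton_batch_count_zero  # assumed import path

    grads = [torch.randn(2048, 2048, device="cuda") for _ in range(4)]
    grads[0][:, :128] = 0.0                      # inject some zeros to monitor
    n_zero = triton_batch_count_zero(grads)      # single-value int64 tensor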
    +
    +
    + + def + triton_batch_sum_with_ord(xs, ord=2): + + +
    + + +

    return sum(abs(x)**ord).

    + +
    Arguments:
    + +
      +
• xs: list of tensors.</li>
    • +
• ord: the order of the norm.</li>
    • +
    + +
    Returns:
    + +
    +

    a single-value fp32 tensor

    +
    +
    + + +
    \ No newline at end of file diff --git a/docs/linghe/utils/rope.html b/docs/linghe/utils/rope.html index 49aefc3..df394a3 100644 --- a/docs/linghe/utils/rope.html +++ b/docs/linghe/utils/rope.html @@ -28,6 +28,15 @@

    API Documentation

    @@ -51,6 +60,113 @@

    +
    +
    + + def + triton_half_rope_forward(q, k, freqs): + + +
    + + +

apply half rope to q and k</p>

    + +
    Arguments:
    + +
      +
    • q: query tensor, [len, bs, q_head, head_dim]
    • +
    • k: key tensor, [len, bs, kv_head, head_dim]
    • +
    • freqs: rope freqs
    • +
    + +
    Returns:
    + +
    +

    qo: + ko:

    +
    +
    + + +
    +
    +
    + + def + triton_qk_norm_and_half_rope_forward( qkv, q_norm_weight, k_norm_weight, freqs, H=32, h=4, eps=1e-06, interleave=True, transpose=False): + + +
    + + +

    split qkv to q/k/v, apply qk norm and half rope to q/k, + transpose q/k/v to flash-attention layout

    + +
    Arguments:
    + +
      +
    • qkv: QKV tensor with size of [S, B, dim], heads are interleaved
    • +
    • q_norm_weight: rms norm weight for query
    • +
    • k_norm_weight: rms norm weight for key
    • +
    • freqs: Freqs tensor based on half dim.
    • +
    • H: Number of attention heads.
    • +
    • h: Number of key/value heads.
    • +
    • eps: epsilon value for L2 normalization.
    • +
    • interleave: whether head of qkv is interleaved, i.e., [qqkvqqkv]
    • +
• transpose: whether qkv is transposed, i.e., [S, B, dim], +only the transposed format is supported currently</li>
    • +
    + +
    Returns:
    + +
    +

    qo: shape [B, S, H, head_dim] + ko: shape [B, S, h, head_dim] + vo: shape [B, S, h, head_dim]

    +
    +
    + + +
    +
    +
    + + def + triton_qk_norm_and_half_rope_backward( gq, gk, gv, qkv, q_norm_weight, k_norm_weight, freqs, eps=1e-06, transpose=False, interleave=True): + + +
    + + +

    backward kernel of triton_qk_norm_and_half_rope_forward

    + +
    Arguments:
    + +
      +
    • gq: gradient of qo, [len, bs, q_head, head_dim]
    • +
    • gk: gradient of ko, [len, bs, q_head, head_dim]
    • +
    • gv: gradient of vo, [len, bs, q_head, head_dim]
    • +
    • qkv: input qkv
    • +
    • q_norm_weight:
    • +
    • k_norm_weight:
    • +
    • freqs:
    • +
    • eps:
    • +
    • transpose:
    • +
    • interleave:
    • +
    + +
    Returns:
    + +
    +

    dqkv: gradient of qkv + dqw: gradient of q_norm_weight + dkw: gradient of k_norm_weight

    +
    +
    + + +
    \ No newline at end of file diff --git a/docs/linghe/utils/scatter.html b/docs/linghe/utils/scatter.html index 469c703..68e5f39 100644 --- a/docs/linghe/utils/scatter.html +++ b/docs/linghe/utils/scatter.html @@ -28,6 +28,15 @@

    API Documentation

    @@ -51,6 +60,95 @@

    +
    +
    + + def + triton_aligned_scatter_add( x: torch.Tensor, outputs: torch.Tensor, indices: torch.Tensor, weights: Optional[torch.Tensor] = None): + + +
    + + +

    scatter_add for megatron 0.11

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
    • outputs: output tensor
    • +
    • indices: gather indices
    • +
    • weights: rowwise weight, it is router prob in MoE router
    • +
    + +
    Returns:
    + +
    +

    output tensor

    +
    +
    + + +
    +
    +
    + + def + triton_scatter_add(x, outputs, indices): + + +
    + + +

    naive version of scatter add, very slow

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
    • outputs: output tensor
    • +
    • indices: indices
    • +
    + +
    Returns:
    + +
    +

    outputs

    +
    +
    + + +
    +
    +
    + + def + triton_unpermute_with_mask_map(grad: torch.Tensor, row_id_map: torch.Tensor, probs: torch.Tensor): + + +
    + + +

    scatter add with row id map

    + +
    Arguments:
    + +
      +
    • grad: gradient tensor, [num_out_tokens, hidden_size]
    • +
    • row_id_map: row id map, [n_experts, num_tokens]
    • +
    • probs: [num_out_tokens]
    • +
    + +
    Returns:
    + +
    +

    output: [num_tokens, hidden_size] + restore_probs: [num_tokens, num_experts]

    +
    +
    + + +
    \ No newline at end of file diff --git a/docs/linghe/utils/silu.html b/docs/linghe/utils/silu.html index 3b5a58c..175a5c1 100644 --- a/docs/linghe/utils/silu.html +++ b/docs/linghe/utils/silu.html @@ -28,6 +28,18 @@

    API Documentation

    @@ -51,6 +63,147 @@

    +
    +
    + + def + triton_silu_and_block_quant_forward(x, out=None, scale=None, round_scale=False, output_mode=2): + + +
    + + +

    fused silu and blockwise quantization, used in shared expert

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
    • round_scale: whether round scale to power of 2
    • +
    • output_mode: one of {0, 1, 2} +0: only output non-transposed quantized tensor +1: only output transposed quantized tensor +2: output both
    • +
    + +
    Returns:
    + +
    +

    out: quantized tensor + scale: quantization scale + transpose_output: quantized tensor of transposed output + transpose_scale: quantization scale of transposed output

    +
    +
    + + +
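A hedged call sketch for the fused SiLU + blockwise quantization path; whether the kernel applies a plain or gated SiLU, and the exact scale layouts, follow the kernel's convention and are not asserted here:

    import torch
    from linghe.utils.silu import triton_silu_and_block_quant_forward  # assumed import path

    x = torch.randn(8192, 2816, dtype=torch.bfloat16, device="cuda")  # shared-expert activation, illustrative size
    out, scale, t_out, t_scale = triton_silu_and_block_quant_forward(x, round_scale=True, output_mode=2)
    # out/t_out: quantized activation in normal and transposed layouts, ready for fp8 GEMMs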
    +
    +
    + + def + triton_silu_and_block_quant_backward(g, x, round_scale=False): + + +
    + + +

    backward of triton_silu_and_block_quant_forward

    + +
    Arguments:
    + +
      +
    • g: gradient
    • +
    • x: input tensor
    • +
    • round_scale: whether round to power of 2
    • +
    + +
    Returns:
    + +
    +

    dx: quantized non-transposed gradient + dx_scale: scales of quantization non-transposed gradient + transpose_dx: quantized transposed gradient + transpose_dx_scale: scales of quantization transposed gradient

    +
    +
    + + +
    +
    +
    + + def + triton_batch_weighted_silu_and_block_quant_forward( x, weight, counts, splits=None, out=None, scale=None, round_scale=False, output_mode=2): + + +
    + + +

    silu and blockwise quantize activation in routed experts

    + +
    Arguments:
    + +
      +
    • x: activation tensor in routed experts
    • +
    • weight: router prob tensor
    • +
    • counts: cuda tensor of token count per expert
    • +
    • splits: python int list of token count per expert
    • +
• round_scale: whether to round the scale to a power of 2
    • +
    • output_mode: one of {0, 1, 2} +0: only output non-transposed quantized tensor +1: only output transposed quantized tensor +2: output both
    • +
    + +
    Returns:
    + +
    +

    out: quantized tensor + scale: quantization scale + transpose_output: quantized tensor of transposed output + transpose_scale: quantization scale of transposed output

    +
    +
    + + +
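A sketch of the batched, weighted variant; the import path, the shape of `weight`, and the relationship between `counts` and `splits` are assumptions based on the argument descriptions above.

```python
# Hypothetical sketch: splits and counts carry the same per-expert token counts,
# once as a Python list and once as a CUDA tensor.
import torch
from linghe.utils.silu import triton_batch_weighted_silu_and_block_quant_forward

splits = [2048, 1024, 512, 4608]                       # per-expert token counts
counts = torch.tensor(splits, device="cuda", dtype=torch.int32)
total, hidden = sum(splits), 4096
x = torch.randn(total, hidden, device="cuda", dtype=torch.bfloat16)
weight = torch.rand(total, device="cuda", dtype=torch.float32)  # router prob per token

out, scale, t_out, t_scale = triton_batch_weighted_silu_and_block_quant_forward(
    x, weight, counts, splits=splits, round_scale=True, output_mode=2)
```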
    +
    +
    + + def + triton_batch_weighted_silu_and_block_quant_backward(g, x, weight, counts, splits=None, round_scale=False): + + +
    + + +

    backward of triton_batch_weighted_silu_and_block_quant_forward

    + +
    Arguments:
    + +
      +
    • g: gradient
    • +
    • x: input tensor
    • +
    • weight: router prob tensor
    • +
• counts: CUDA tensor of per-expert token counts
    • +
• splits: Python int list of per-expert token counts
    • +
• round_scale: whether to round the scale to a power of 2
    • +
    + +
    Returns:
    + +
    +

dx: quantized non-transposed gradient + dx_scale: quantization scales of the non-transposed gradient + dw: gradient of weight + transpose_dx: quantized transposed gradient + transpose_dx_scale: quantization scales of the transposed gradient

    +
    +
    + + +
    \ No newline at end of file diff --git a/docs/linghe/utils/transpose.html b/docs/linghe/utils/transpose.html index 3512573..b278ac2 100644 --- a/docs/linghe/utils/transpose.html +++ b/docs/linghe/utils/transpose.html @@ -28,6 +28,18 @@

    API Documentation

    @@ -51,6 +63,122 @@

    +
    +
    + + def + triton_transpose( x: torch.Tensor, dim0: Optional[int] = None, dim1: Optional[int] = None): + + +
    + + +

transpose x by swapping dim0 and dim1

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
    • dim0: dim 0
    • +
    • dim1: dim 1
    • +
    + +
    Returns:
    + +
    +

    transposed tensor

    +
    +
    + + +
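A small sketch of `triton_transpose` on a 2-D tensor; the import path `linghe.utils.transpose` and the fact that the result is materialized rather than returned as a view are assumptions.

```python
# Hypothetical sketch: a plain 2-D transpose with explicit dims.
import torch
from linghe.utils.transpose import triton_transpose

x = torch.randn(4096, 2048, device="cuda", dtype=torch.bfloat16)
xt = triton_transpose(x, dim0=0, dim1=1)
assert xt.shape == (2048, 4096)
```

The `TransposeDim01Function` facade added in this patch calls it exactly this way, with `dim0=0, dim1=1`.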
    +
    +
    + + def + triton_transpose_and_pad(x, out=None, pad=True): + + +
    + + +

transpose x and pad the column size to a multiple of 32, +it is used for calculating the gradient of weight with torch._scaled_mm

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
• out: optional pre-allocated output tensor
    • +
• pad: whether padding is needed
    • +
    + +
    Returns:
    + +
    +

    out: output tensor

    +
    +
    + + +
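A sketch of `triton_transpose_and_pad`; the import path is the one used elsewhere in this patch (`from linghe.utils.transpose import triton_transpose_and_pad`), while the shapes and the fp8 dtype are illustrative assumptions.

```python
# Sketch: pad the transposed column count to a multiple of 32 so the result
# can feed torch._scaled_mm when computing the weight gradient.
import torch
from linghe.utils.transpose import triton_transpose_and_pad

M, N = 4100, 2048                                  # M is not a multiple of 32
x_q = torch.randn(M, N, device="cuda").to(torch.float8_e4m3fn)
xt_q = triton_transpose_and_pad(x_q, pad=True)
# expected shape: [N, 4128] (4100 rounded up to a multiple of 32), padded tail zero-filled
```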
    +
    +
    + + def + triton_batch_transpose(xs, xts=None): + + +
    + + +

    batch transpose x

    + +
    Arguments:
    + +
      +
    • xs: input tensor list, [M, N]*expert
    • +
    + +
    Returns:
    + +
    +

    xts: output tensor list, [N,M]*expert

    +
    +
    + + +
    +
    +
    + + def + triton_batch_transpose_and_pad(x, count_list, x_t=None, pad=True): + + +
    + + +

    transpose and pad each tensor stored in x

    + +
    Arguments:
    + +
      +
    • x: [sum(bs), N]
    • +
• count_list: a Python list of per-expert token counts
    • +
• pad: whether to pad to a multiple of 32, +padded positions should be filled with 0
    • +
    + +
    Returns:
    + +
    +

    x_t: output tensor

    +
    +
    + + +
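Finally, a sketch of the batched variant; the import path and shapes are assumptions based on the argument descriptions above.

```python
# Hypothetical sketch: each expert's [count, N] slice of x is transposed and
# padded independently, with padded positions filled with 0.
import torch
from linghe.utils.transpose import triton_batch_transpose_and_pad

count_list = [2048, 1000, 512]        # per-expert token counts (1000 needs padding)
N = 2048
x = torch.randn(sum(count_list), N, device="cuda", dtype=torch.bfloat16)
x_t = triton_batch_transpose_and_pad(x, count_list, pad=True)
```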
    \ No newline at end of file diff --git a/linghe/facade/add.py b/linghe/facade/add.py index d477b14..c40aca4 100644 --- a/linghe/facade/add.py +++ b/linghe/facade/add.py @@ -9,6 +9,9 @@ class InplaceAddFunction(torch.autograd.Function): + """ + + """ @staticmethod def forward(ctx, x: torch.Tensor, y: torch.Tensor): return triton_inplace_add(x, y) @@ -22,7 +25,6 @@ def inplace_add(x: torch.Tensor, y: torch.Tensor): """ inplace add y to x with mix precise Args: - ctx: autograd context x: to be updated y: add to x Returns: diff --git a/linghe/facade/fp32_gemm.py b/linghe/facade/fp32_gemm.py index 5bc1e0f..8ab9ae9 100644 --- a/linghe/facade/fp32_gemm.py +++ b/linghe/facade/fp32_gemm.py @@ -11,6 +11,9 @@ class Fp32GEMM(torch.autograd.Function): + """ + + """ @staticmethod def forward(ctx, input: torch.Tensor, weight: torch.Tensor): shape = input.shape diff --git a/linghe/facade/hadamard_quant_linear.py b/linghe/facade/hadamard_quant_linear.py new file mode 100644 index 0000000..586f104 --- /dev/null +++ b/linghe/facade/hadamard_quant_linear.py @@ -0,0 +1,163 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. +""" + +import math +from typing import Optional + +import torch + +from linghe.quant.hadamard import triton_hadamard_quant + + + +class _HadamardQuantLinear(torch.autograd.Function): + @staticmethod + def forward( + ctx, + input: torch.Tensor, + weight: torch.Tensor, + bias: Optional[torch.Tensor], + hadamard_matrix: torch.Tensor + ): + ctx.input_requires_grad = input.requires_grad + ctx.weight_requires_grad = weight.requires_grad + ctx.bias_requires_grad = bias is not None and bias.requires_grad + + ctx.out_dtype = input.dtype + ctx.input_shape = input.shape + input = input.view(-1, input.shape[-1]) + + x_q, x_scale, xt_q, xt_scale = triton_hadamard_quant(input, hadamard_matrix) + w_q, w_scale, wt_q, wt_scale = triton_hadamard_quant(weight, hadamard_matrix) + + output = torch._scaled_mm(x_q, + w_q.t(), + scale_a=x_scale, + scale_b=w_scale, + out_dtype=ctx.out_dtype, + use_fast_accum=True + ) + + if bias is not None: + output += bias + + saved_tensors = [ + xt_q if ctx.weight_requires_grad else None, + xt_scale if ctx.weight_requires_grad else None, + wt_q if ctx.input_requires_grad else None, + wt_scale if ctx.input_requires_grad else None, + hadamard_matrix if ctx.weight_requires_grad or ctx.weight_requires_grad else None + ] + + ctx.save_for_backward(*saved_tensors) + out_shape = (*ctx.input_shape[0:-1], -1) + return output.view(out_shape) + + @staticmethod + def backward( + ctx, + output_grad: torch.Tensor, + ): + xt_q, xt_scale, wt_q, wt_scale, hadamard_matrix = ctx.saved_tensors + results = [None, None, None, None] + + output_grad = output_grad.view(-1, output_grad.shape[-1]) + + y_q, y_scale, yt_q, yt_scale = triton_hadamard_quant(output_grad, hadamard_matrix) + + dx = torch._scaled_mm(y_q, + wt_q.t(), + scale_a=y_scale, + scale_b=wt_scale, + out_dtype=ctx.out_dtype, + use_fast_accum=True + ) + + # calculate input grad and assign to results[0] + results[0] = dx.view(ctx.input_shape) + + # calculate weight grad and assign to results[1] + dw = torch._scaled_mm(yt_q, + xt_q.t(), + scale_a=yt_scale, + scale_b=xt_scale, + out_dtype=ctx.out_dtype, + use_fast_accum=True + ) + results[1] = dw + + if ctx.bias_requires_grad: + # calculate bias grad and assign to results[2] + results[2] = torch.sum(output_grad, dim=0) + + return tuple(results) + +class HadamardQuantLinear(torch.nn.Module): + def __init__( + self, + in_features: int, + 
out_features: int, + bias: bool = True, + device=None, + dtype=None + ): + """ + a naive implementation of hadamard transformation and quantization + Args: + in_features: in feature number + out_features: out feature number + bias: whether use bias + device: weight device + dtype: weight dtype + impl: implementation of hadamard quantization + """ + super().__init__() + self.in_features = in_features + self.out_features = out_features + self.weight = torch.nn.parameter.Parameter( + torch.empty((out_features, in_features), device=device, + dtype=dtype)) + if bias: + self.bias = torch.nn.parameter.Parameter( + torch.empty(out_features, device=device, dtype=dtype)) + else: + self.bias = None + + size = 32 if 'H20' in torch.cuda.get_device_properties(0).name else 64 + data = self._hadamard_matrix(size, device=device, dtype=dtype, + norm=True) + self.hadamard_matrix = torch.nn.parameter.Parameter(data, + requires_grad=False) + self.reset_parameters() + + def _hadamard_matrix(self, size, device=None, dtype=None, norm=False): + assert 2 ** int(math.log2(size)) == size + m2 = torch.tensor([[1, 1], [1, -1]], device=device, dtype=torch.float32) + m = m2 + for _ in range(int(math.log2(size)) - 1): + m = torch.kron(m, m2) + if norm: + m = m / size ** 0.5 + if dtype is not None: + m = m.to(dtype) + return m + + def forward(self, input: torch.Tensor) -> torch.Tensor: + if self.training: + return _HadamardQuantLinear.apply(input, self.weight, self.bias, + self.hadamard_matrix) + else: + output = input @ self.weight.t() + if self.bias is not None: + output = output + self.bias + return output + + def extra_repr(self) -> str: + return f"in_features={self.in_features}, out_features={self.out_features}, bias={self.bias is not None}" + + def reset_parameters(self): + self.weight.data.normal_(mean=0.0, std=0.02) + if self.bias is not None: + self.bias.data.zero_() diff --git a/linghe/facade/loss.py b/linghe/facade/loss.py index a1fe7b9..1fa7294 100644 --- a/linghe/facade/loss.py +++ b/linghe/facade/loss.py @@ -10,6 +10,9 @@ class SoftmaxCrossEntropyFunction(torch.autograd.Function): + """ + + """ @staticmethod def forward(ctx, logits, labels, inplace=False): shape = logits.shape @@ -55,6 +58,9 @@ def softmax_cross_entropy(logits: torch.Tensor, labels: torch.Tensor, inplace: b class GradScalingFunction(torch.autograd.Function): + """ + + """ @staticmethod def forward(ctx, x, coef=0.2): ctx.coef = coef diff --git a/linghe/facade/norm.py b/linghe/facade/norm.py index 7f09318..435942f 100644 --- a/linghe/facade/norm.py +++ b/linghe/facade/norm.py @@ -10,6 +10,9 @@ class RMSNormFunction(torch.autograd.Function): + """ + + """ @staticmethod def forward(ctx, x, weight, eps=1e-6): output = triton_rms_norm_forward( @@ -53,6 +56,9 @@ def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6): return RMSNormFunction.apply(x, weight, eps) class GroupNormGateFunction(torch.autograd.Function): + """ + + """ @staticmethod def forward(ctx, attn_output, gate, weight, eps=1e-6, group_size=4): output = triton_group_norm_gate_forward( diff --git a/linghe/facade/rope.py b/linghe/facade/rope.py index bdf52fa..3719095 100644 --- a/linghe/facade/rope.py +++ b/linghe/facade/rope.py @@ -10,6 +10,9 @@ class QkNormHalfRopeFunction(torch.autograd.Function): + """ + + """ @staticmethod def forward(ctx, qkv, q_norm_weight, k_norm_weight, freqs, H=32, h=4, eps=1e-6): diff --git a/linghe/facade/smooth_quant_linear.py b/linghe/facade/smooth_quant_linear.py new file mode 100644 index 0000000..fccbabb --- /dev/null +++ 
b/linghe/facade/smooth_quant_linear.py @@ -0,0 +1,154 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. +""" + +from typing import Optional + +import torch + + +from linghe.quant.smooth import triton_smooth_quant, \ + triton_transpose_smooth_quant +from linghe.utils.transpose import triton_transpose_and_pad +from linghe.utils.reduce import triton_abs_max + +class _SmoothQuantLinear(torch.autograd.Function): + @staticmethod + def forward( + ctx, + input: torch.Tensor, + weight: torch.Tensor, + smooth_scale: torch.Tensor, + bias: Optional[torch.Tensor] + ): + ctx.input_requires_grad = input.requires_grad + ctx.weight_requires_grad = weight.requires_grad + ctx.bias_requires_grad = bias is not None and bias.requires_grad + + ctx.out_dtype = input.dtype + ctx.input_shape = input.shape + input = input.view(-1, input.shape[-1]) + + x_q, x_scale, x_maxs = triton_smooth_quant(input, 1 / smooth_scale) + w_q, w_scale, w_maxs = triton_smooth_quant(weight, smooth_scale) + + output = torch._scaled_mm(x_q, + w_q.t(), + scale_a=x_scale.view(-1, 1), + scale_b=w_scale.view(1, -1), + out_dtype=ctx.out_dtype, + use_fast_accum=True) + + if bias is not None: + output += bias + + saved_tensors = [ + x_q if ctx.weight_requires_grad else None, + x_scale if ctx.weight_requires_grad else None, + w_q if ctx.input_requires_grad else None, + w_scale if ctx.input_requires_grad else None, + smooth_scale if ctx.weight_requires_grad or ctx.weight_requires_grad else None + ] + + ctx.save_for_backward(*saved_tensors) + out_shape = (*ctx.input_shape[0:-1], -1) + return output.view(out_shape) + + @staticmethod + def backward( + ctx, + output_grad: torch.Tensor + ): + x_q, x_s, w_q, w_s, smooth_scale = ctx.saved_tensors + results = [None, None, None, None] + + output_grad = output_grad.view(-1, output_grad.shape[-1]) + + y_q, y_scale, y_maxs = triton_smooth_quant(output_grad, w_s) + + wt_q = triton_transpose_and_pad(w_q, pad=True) + dx = torch._scaled_mm(y_q, + wt_q.t(), + scale_a=y_scale.view(-1, 1), + scale_b=smooth_scale.view(1, -1), + out_dtype=ctx.out_dtype, + use_fast_accum=True) + + # calculate input grad and assign to results[0] + results[0] = dx.view(ctx.input_shape) + + # calculate weight grad and assign to results[1] + yt_q, yt_scale, yt_maxs = triton_transpose_smooth_quant(output_grad, x_s) + + xt_q = triton_transpose_and_pad(x_q, pad=True) + dw = torch._scaled_mm(yt_q, + xt_q.t(), + scale_a=yt_scale.view(-1, 1), + scale_b=1/smooth_scale.view(1, -1), + out_dtype=ctx.out_dtype, + use_fast_accum=True) + + results[1] = dw + + if ctx.bias_requires_grad: + # calculate bias grad and assign to results[2] + results[2] = torch.sum(output_grad, dim=0) + + return tuple(results) + + +class QuantLinear(torch.nn.Module): + def __init__( + self, + in_features: int, + out_features: int, + bias: bool = True, + device=None, + dtype=None + ): + super().__init__() + self.in_features = in_features + self.out_features = out_features + self.weight = torch.nn.parameter.Parameter( + torch.empty((out_features, in_features), device=device, + dtype=dtype)) + if bias: + self.bias = torch.nn.parameter.Parameter( + torch.empty(out_features, device=device, dtype=dtype)) + else: + self.bias = None + + self.gap_step = 16 + self.decay_coef = 0.9 + self.smooth_scale = None + self.smooth_update_step = 0 + + self.reset_parameters() + + def forward(self, input: torch.Tensor) -> torch.Tensor: + if self.training: + + if self.smooth_update_step % self.gap_step == 0: + input_maxs = triton_abs_max(input) + 
weight_maxs = triton_abs_max(self.weight.data) + self.smooth_scale = torch.sqrt(input_maxs * weight_maxs) + + output, smooth_scale = _SmoothQuantLinear.apply(input, + self.weight, + self.bias, + self.smooth_scale) + self.smooth_update_step += 1 + else: + output = input @ self.weight.t() + if self.bias is not None: + output = output + self.bias + return output + + def extra_repr(self) -> str: + return f"in_features={self.in_features}, out_features={self.out_features}, bias={self.bias is not None}" + + def reset_parameters(self): + self.weight.data.normal_(mean=0.0, std=0.02) + if self.bias is not None: + self.bias.data.zero_() diff --git a/linghe/facade/transpose.py b/linghe/facade/transpose.py index 6ee19b3..9d8de83 100644 --- a/linghe/facade/transpose.py +++ b/linghe/facade/transpose.py @@ -9,6 +9,9 @@ class TransposeDim01Function(torch.autograd.Function): + """ + + """ @staticmethod def forward(ctx, x): return triton_transpose(x, dim0=0, dim1=1) diff --git a/linghe/quant/hadamard/seperate_hadamard.py b/linghe/quant/hadamard.py similarity index 52% rename from linghe/quant/hadamard/seperate_hadamard.py rename to linghe/quant/hadamard.py index b52c4ef..1bd77bc 100644 --- a/linghe/quant/hadamard/seperate_hadamard.py +++ b/linghe/quant/hadamard.py @@ -118,13 +118,18 @@ def hadamard_quant_col_kernel( mask=mask_cols[:, None] & mask_rows[None, :]) -# y = x @ w -# dx = y @ wT -# dwT = yT @ x -def triton_hadamard_quant_x(x, hm): - # apply hadamard transformation and quantization for x - # y = x @ w: x->x@h and rowwise quant - # dwT = yT @ x: x->xT@h and rowwise quant +def triton_hadamard_quant(x, hm): + """ + apply hadamard transformation and then quantize transformed tensor + Args: + x: input tensor + hm: hamadard matrix + Returns: + x_q: rowwise quantized tensor of non-transposed x + x_scale: rowwise quantization scale of non-transposed x + xt_q: columnwise quantized tensor of transposed x + xt_scale: columnwise quantization scale of transposed x + """ M, N = x.shape device = x.device BLOCK_SIZE = hm.size(0) @@ -163,154 +168,3 @@ def triton_hadamard_quant_x(x, hm): ) return x_q, x_scale,xt_q, xt_scale - - -# y = x @ w -# dx = y @ wT -# dwT = yT @ x -def triton_hadamard_quant_w(w, hm): - # apply hadamard transformation and quantization for w - # y = x @ w: w->w@h and rowwise quant - # dx = y @ wT: w->h@wT and rowwise quant - M, N = w.shape - device = w.device - w_q = torch.empty((M, N), dtype=torch.float8_e4m3fn, device=device) - wt_q = torch.empty((N, M), dtype=torch.float8_e4m3fn, device=device) - w_scale = torch.empty((M, ), dtype=torch.float32, device=device) - wt_scale = torch.empty((N, ), dtype=torch.float32, device=device) - - BLOCK_SIZE = hm.size(0) - R = 1 - - grid_row = (triton.cdiv(M, R * BLOCK_SIZE),) - hadamard_quant_row_kernel[grid_row]( - w, - hm, - w_q, - w_scale, - M, - N, - BLOCK_SIZE, - R, - num_stages=6, - num_warps=4 - ) - - grid_col = (triton.cdiv(N, R * BLOCK_SIZE),) - hadamard_quant_col_kernel[grid_col]( - w, - hm, - wt_q, - wt_scale, - M, - N, - BLOCK_SIZE, - R, - num_stages=6, - num_warps=4 - ) - - return w_q, w_scale, wt_q, wt_scale - - -# y = x @ w -# dx = y @ wT -# dwT = yT @ x -def triton_hadamard_quant_y(y, hm): - # apply hadamard transformation and quantization for dy - # dx = y @ wT: y->y@h and rowwise quant - # dwT = yT @ x: y->h@yT and rowwise quant - M, N = y.shape - device = y.device - BLOCK_SIZE = hm.size(0) - R = 1 - y_q = torch.empty((M, N), dtype=torch.float8_e4m3fn, device=device) - yt_q = torch.empty((N, M), dtype=torch.float8_e4m3fn, device=device) - 
y_scale = torch.empty((M, ), dtype=torch.float32, device=device) - yt_scale = torch.empty((N, ), dtype=torch.float32, device=device) - - grid_row = (triton.cdiv(M, R * BLOCK_SIZE),) - hadamard_quant_row_kernel[grid_row]( - y, - hm, - y_q, - y_scale, - M, - N, - BLOCK_SIZE, - R, - num_stages=6, - num_warps=4 - ) - - grid_col = (triton.cdiv(N, R * BLOCK_SIZE),) - hadamard_quant_col_kernel[grid_col]( - y, - hm, - yt_q, - yt_scale, - M, - N, - BLOCK_SIZE, - R, - num_stages=6, - num_warps=4 - ) - - return y_q, y_scale, yt_q, yt_scale - - -def triton_hadamard_quant_nt_megatron(x, w, hm): - x_q, _, x_scale, _ = triton_hadamard_quant_x(x, hm) - w_q, _, w_scale, _ = triton_hadamard_quant_w(w, hm) - return x_q, x_scale, w_q, w_scale - - -def triton_hadamard_quant_nn_megatron(y, w, hm): - y_q, _, y_scale, _ = triton_hadamard_quant_y(y, hm) - _, wt_q, _, wt_scale = triton_hadamard_quant_w(w, hm) - return y_q, y_scale, wt_q, wt_scale - - -def triton_hadamard_quant_tn_megatron(y, x, hm): - _, yt_q, _, yt_scale = triton_hadamard_quant_y(y, hm) - _, xt_q, _, xt_scale = triton_hadamard_quant_x(x, hm) - return yt_q, yt_scale, xt_q, xt_scale - - -def hadamard_quant_forward_megatron(x, w, hm): - x_q, x_scale, w_q, w_scale = triton_hadamard_quant_nt_megatron(x, w, hm) - output = torch._scaled_mm(x_q, - w_q.t(), - scale_a=x_scale, - scale_b=w_scale, - out_dtype=x.dtype, - use_fast_accum=True - ) - return output, x_q, w_q, x_scale, w_scale - - -def hadamard_quant_backward_megatron(y, w, hm): - y_q, y_scale, wt_q, wt_scale = triton_hadamard_quant_nn_megatron(y, w, hm) - output = torch._scaled_mm( - y_q, - wt_q.t(), - scale_a=y_scale, - scale_b=wt_scale, - out_dtype=y.dtype, - use_fast_accum=True - ) - return output, y_q, wt_q.t(), y_scale, wt_scale - - -def hadamard_quant_update_megatron(y, x, hm): - yt_q, yt_scale, xt_q, xt_scale = triton_hadamard_quant_tn_megatron(y, x, hm) - output = torch._scaled_mm(yt_q, - xt_q.t(), - scale_a=yt_scale.t(), - scale_b=xt_scale, - out_dtype=x.dtype, - use_fast_accum=True - ) - return output, yt_q, xt_q, yt_scale, xt_scale - diff --git a/linghe/quant/smooth/reused_smooth.py b/linghe/quant/smooth.py similarity index 82% rename from linghe/quant/smooth/reused_smooth.py rename to linghe/quant/smooth.py index 14aab4e..4844511 100644 --- a/linghe/quant/smooth/reused_smooth.py +++ b/linghe/quant/smooth.py @@ -8,11 +8,11 @@ import triton.language as tl from linghe.tools.util import round_up +from linghe.utils.transpose import triton_transpose_and_pad -# TODO(nanxiao): use max instead of sum @triton.jit -def tokenwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, max_ptr, +def tokenwise_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, max_ptr, M, T, N: tl.constexpr, W: tl.constexpr, @@ -77,7 +77,7 @@ def tokenwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, max_ptr, @triton.jit -def blockwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, max_ptr, +def blockwise_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, max_ptr, M, N, H: tl.constexpr, @@ -147,10 +147,12 @@ def blockwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, max_ptr, soffs += H -def triton_reused_smooth_quant(x, smooth_scale, x_q=None, x_scale=None, +def triton_smooth_quant(x, smooth_scale, x_q=None, x_scale=None, reverse=False, round_scale=False, calibrate=False): - # row-wise read, row-wise write + """ + + """ M, N = x.shape device = x.device if x_q is None: @@ -168,7 +170,7 @@ def triton_reused_smooth_quant(x, smooth_scale, x_q=None, x_scale=None, x_maxs = torch.empty((g, 
N), device=device, dtype=torch.bfloat16) else: x_maxs = None - tokenwise_reused_smooth_quant_kernel[(g,)]( + tokenwise_smooth_quant_kernel[(g,)]( x, x_q, smooth_scale, @@ -197,7 +199,7 @@ def triton_reused_smooth_quant(x, smooth_scale, x_q=None, x_scale=None, else: x_maxs = None grid = (T,) - blockwise_reused_smooth_quant_kernel[grid]( + blockwise_smooth_quant_kernel[grid]( x, x_q, smooth_scale, @@ -221,7 +223,7 @@ def triton_reused_smooth_quant(x, smooth_scale, x_q=None, x_scale=None, @triton.jit -def subrow_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, +def subrow_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, subrow_scales_ptr, tail_ri, tail_si, @@ -286,9 +288,12 @@ def subrow_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, xq.to(q_ptr.dtype.element_ty), mask=mask) -def triton_subrow_reused_smooth_quant(x, smooth_scale, x_q, x_scale, +def triton_subrow_smooth_quant(x, smooth_scale, x_q, x_scale, subrow_scales, offset, size, reverse=False, round_scale=False): + """ + + """ M, N = x_q.shape W = 128 if offset % N == 0: @@ -310,7 +315,7 @@ def triton_subrow_reused_smooth_quant(x, smooth_scale, x_q, x_scale, HEAD = True grid = (1,) - subrow_reused_smooth_quant_kernel[grid]( + subrow_smooth_quant_kernel[grid]( x, x_q, smooth_scale, @@ -333,7 +338,7 @@ def triton_subrow_reused_smooth_quant(x, smooth_scale, x_q, x_scale, @triton.jit -def depracated_tokenwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, +def depracated_tokenwise_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, M, W, N: tl.constexpr, REVERSE: tl.constexpr, @@ -361,9 +366,12 @@ def depracated_tokenwise_reused_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, mask=pid * W + i < M) -def triton_depracated_tokenwise_reused_smooth_quant(x, smooth_scale, x_q=None, +def triton_depracated_tokenwise_smooth_quant(x, smooth_scale, x_q=None, x_scale=None, reverse=False, round_scale=False): + """ + + """ # row-wise read, row-wise write M, N = x.shape device = x.device @@ -374,7 +382,7 @@ def triton_depracated_tokenwise_reused_smooth_quant(x, smooth_scale, x_q=None, sm = torch.cuda.get_device_properties(device).multi_processor_count W = triton.cdiv(M, sm) grid = (sm,) - depracated_tokenwise_reused_smooth_quant_kernel[grid]( + depracated_tokenwise_smooth_quant_kernel[grid]( x, x_q, smooth_scale, @@ -447,8 +455,9 @@ def triton_batch_smooth_quant(x, smooth_scales, token_count_per_expert, x_q=None, x_scale=None, x_maxs=None, reverse=False, round_scale=False, calibrate=False): - # row-wise read, row-wise write + """ + """ M, N = x.shape device = x.device n_expert = token_count_per_expert.shape[0] @@ -569,8 +578,9 @@ def triton_batch_pad_transpose_smooth_quant(x, splits, x_q=None, x_scale=None, x_maxs=None, reverse=False, round_scale=False): - # col-wise read, row-wise write + """ + """ M, N = x.shape device = x.device n_expert = token_count_per_expert.shape[0] @@ -605,7 +615,7 @@ def triton_batch_pad_transpose_smooth_quant(x, @triton.jit -def reused_transpose_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, M, N, P, +def transpose_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, M, N, P, H: tl.constexpr, W: tl.constexpr, EVEN: tl.constexpr, REVERSE: tl.constexpr, @@ -677,13 +687,16 @@ def reused_transpose_smooth_quant_kernel(x_ptr, q_ptr, ss_ptr, qs_ptr, M, N, P, soffs += H -def triton_reused_transpose_smooth_quant(x, +def triton_transpose_smooth_quant(x, smooth_scale, reverse=False, pad=False, round_scale=False): # col-wise read, row-wise write # M should be padded if M % 32 != 0 + """ + + """ M, N = x.shape device = x.device 
P = (M + 31) // 32 * 32 if pad else M @@ -695,7 +708,7 @@ def triton_reused_transpose_smooth_quant(x, EVEN = P % H == 0 and M == P grid = (triton.cdiv(N, W),) - reused_transpose_smooth_quant_kernel[grid]( + transpose_smooth_quant_kernel[grid]( x, x_q, smooth_scale, @@ -715,7 +728,7 @@ def triton_reused_transpose_smooth_quant(x, @triton.jit -def reused_transpose_rescale_smooth_quant_kernel(x_ptr, q_ptr, +def transpose_rescale_smooth_quant_kernel(x_ptr, q_ptr, org_smooth_scale_ptr, org_quant_scale_ptr, transpose_smooth_scale_ptr, @@ -804,14 +817,15 @@ def reused_transpose_rescale_smooth_quant_kernel(x_ptr, q_ptr, """ -def triton_reused_transpose_rescale_smooth_quant(x_q, org_smooth_scale, +def triton_transpose_rescale_smooth_quant(x_q, org_smooth_scale, org_quant_scale, transpose_smooth_scale, reverse=True, pad=False, round_scale=False): - # col-wise read, row-wise write + """ + """ assert reverse M, N = x_q.shape device = x_q.device @@ -824,7 +838,7 @@ def triton_reused_transpose_rescale_smooth_quant(x_q, org_smooth_scale, EVEN = P == M and M % H == 0 grid = (triton.cdiv(N, W),) - reused_transpose_rescale_smooth_quant_kernel[grid]( + transpose_rescale_smooth_quant_kernel[grid]( x_q, xt_q, org_smooth_scale, @@ -842,58 +856,133 @@ def triton_reused_transpose_rescale_smooth_quant(x_q, org_smooth_scale, return xt_q, x_scale -def triton_reused_smooth_quant_nt(x, w, smooth_scale): - x_q, x_scale, x_maxs = triton_reused_smooth_quant(x, 1 / smooth_scale) - w_q, w_scale, x_maxs = triton_reused_smooth_quant(w, smooth_scale) - return x_q, x_scale, w_q, w_scale +""" +megatron fp8 training steps: +step 0: init w smooth scale w_smooth +step 1: smooth and quant w after w is updated by optimizer +step 2: in forward step, columnwise smooth x and rowwise quant x, calc y=x@w; + meanwhile, record the columnwise max of x, it is used to update w_smooth +step 3: in dgrad step, columnwise smooth y and rowwise quant y, transpose x, calc dx=y@wT +step 4: in wgrad step, dequant then smooth an then quant y_q to get yt_q, calc dw=yT@x -def triton_reused_smooth_quant_nn(y, w, smooth_scale): - y_q, y_scale, x_maxs = triton_reused_smooth_quant(y, smooth_scale) - w_q, w_scale = triton_reused_transpose_smooth_quant(w, 1 / smooth_scale) - return y_q, y_scale, w_q, w_scale - +alternative (it's not suitable for fp8 combine): +step 4: in wgrad step, rowwise smooth y and columnwise quant y and transpose to get yt_q, calc dw=yT@x -def triton_reused_smooth_quant_tn(y, x, smooth_scale): - y_q, y_scale = triton_reused_transpose_smooth_quant(y, smooth_scale) - x_q, x_scale = triton_reused_transpose_smooth_quant(x, 1 / smooth_scale) - return y_q, y_scale, x_q, x_scale +""" +""" +divide x by smooth_scale and row-wise quantization +smooth scale is updated by square root of x's column-wise maxs, and set in weight's x_maxs attr -def reused_smooth_quant_forward(x, w, smooth_scale): - x_q, x_s, w_q, w_s = triton_reused_smooth_quant_nt(x, w, smooth_scale) - output = torch._scaled_mm(x_q, - w_q.t(), - scale_a=x_s.view(-1, 1), - scale_b=w_s.view(1, -1), - out_dtype=x.dtype, - use_fast_accum=True) - return output +transpose: transpose quantized x for wgrad +pad: # pad M to be multiplier of 32, including quant scales and transposed x +""" -def reused_smooth_quant_backward(y, w, smooth_scale): - y_q, y_s, w_q, w_s = triton_reused_smooth_quant_nn(y, w, smooth_scale) - output = torch._scaled_mm(y_q, - w_q.t(), - scale_a=y_s.view(-1, 1), - scale_b=w_s.view(1, -1), - out_dtype=y.dtype, - use_fast_accum=True) - return output +# y = x @ w +# dx = y @ 
wT +# dwT = yT @ x +def triton_smooth_quant_input(x, smooth_scale, x_q=None, x_scale=None, xt_q=None, + transpose=True, pad=True, round_scale=False): + """ -def reused_smooth_quant_update(y, x, smooth_scale): - y_q, y_s, x_q, x_s = triton_reused_smooth_quant_tn(y, x, smooth_scale) - output = torch._scaled_mm(y_q, - x_q.t(), - scale_a=y_s.view(-1, 1), - scale_b=x_s.view(1, -1), - out_dtype=y.dtype, - use_fast_accum=True) - return output + """ + x_q, x_scale, x_maxs = triton_smooth_quant(x, smooth_scale, x_q=x_q, + x_scale=x_scale, reverse=False, + round_scale=round_scale) + if transpose: + xt_q = triton_transpose_and_pad(x_q, out=xt_q, pad=pad) + else: + xt_q = None + xt_scale = smooth_scale + + return x_q, xt_q, x_scale, xt_scale + + +# y = x @ w +# dx = y @ wT +# dwT = yT @ x +def triton_smooth_quant_gradient(y, + smooth_scale, + transpose_smooth_scale, + reverse=True, + transpose=True, + pad=True, + round_scale=False): + """ + + """ + assert reverse, ("args `smooth_scale` and/or `transpose_smooth_scale` " + "must be in reciprocal format in triton_smooth_quant_grad") + y_q, y_scale, _ = triton_smooth_quant(y, smooth_scale, reverse=True, + round_scale=round_scale) + if transpose: + yt_q, yt_scale = triton_transpose_smooth_quant(y, + transpose_smooth_scale, + reverse=True, + pad=pad, + round_scale=round_scale) + else: + yt_q, yt_scale = None, None + + return y_q, yt_q, y_scale, yt_scale + + +def triton_smooth_quant_weight(w, + smooth_scale, + w_q, + quant_scale, + subrow_scales, offset=0, + round_scale=False): + """ + + """ + assert w.ndim == 1 + assert w_q.size(1) == smooth_scale.size(0) + + size = w.numel() + M, N = w_q.shape + + if size == M * N: + triton_smooth_quant(w.view(M, N), smooth_scale, x_q=w_q, + x_scale=quant_scale, + round_scale=round_scale) + elif offset % N == 0 and size % N == 0: + n_row = size // N + row_id = offset // N + w_q_slice = w_q[row_id:row_id + n_row] + quant_scale_slice = quant_scale[row_id:row_id + n_row] + triton_smooth_quant(w.view(n_row,N), smooth_scale, x_q=w_q_slice, + x_scale=quant_scale_slice, + round_scale=round_scale) + else: + row_si = (offset - 1)//N + 1 + row_ei = (offset + size) // N + col_si = offset % N + col_ei = (offset + size ) % N + n_row = row_ei - row_si + mw_offset = 0 if col_si == 0 else N - col_si + w_q_slice = w_q[row_si:row_ei] + quant_scale_slice = quant_scale[row_si:row_ei] + w_slice = w[mw_offset:mw_offset+n_row*N].view(n_row,N) + triton_smooth_quant(w_slice, + smooth_scale, + x_q=w_q_slice, + x_scale=quant_scale_slice, + round_scale=round_scale) + + # subrow scale is writed by the row with leading master weights + if col_si > 0 or col_ei > 0: + triton_subrow_smooth_quant(w, + smooth_scale, + w_q, + quant_scale, + subrow_scales, + offset, + size, + reverse=False, + round_scale=round_scale) -def reused_smooth_quant_f_and_b(x, w, y, smooth_scale): - reused_smooth_quant_forward(x, w, smooth_scale) - reused_smooth_quant_backward(y, w, smooth_scale) - reused_smooth_quant_update(y, x, smooth_scale) diff --git a/linghe/quant/smooth/__init__.py b/linghe/quant/smooth/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/linghe/quant/smooth/seperate_smooth.py b/linghe/quant/smooth/seperate_smooth.py deleted file mode 100644 index 802021e..0000000 --- a/linghe/quant/smooth/seperate_smooth.py +++ /dev/null @@ -1,133 +0,0 @@ -# -*- coding: utf-8 -*- -""" -Copyright (c) Ant Financial Service Group and its affiliates. 
-""" - -import torch - -from linghe.quant.smooth.reused_smooth import triton_reused_smooth_quant, \ - triton_reused_transpose_smooth_quant, triton_subrow_reused_smooth_quant -from linghe.utils.transpose import triton_transpose_and_pad - -""" -megatron fp8 training steps: -step 0: init w smooth scale w_smooth -step 1: smooth and quant w after w is updated by optimizer -step 2: in forward step, columnwise smooth x and rowwise quant x, calc y=x@w; - meanwhile, record the columnwise max of x, it is used to update w_smooth -step 3: in dgrad step, columnwise smooth y and rowwise quant y, transpose x, calc dx=y@wT -step 4: in wgrad step, dequant then smooth an then quant y_q to get yt_q, calc dw=yT@x - -alternative (it's not suitable for fp8 combine): -step 4: in wgrad step, rowwise smooth y and columnwise quant y and transpose to get yt_q, calc dw=yT@x - -""" - -""" -divide x by smooth_scale and row-wise quantization -smooth scale is updated by square root of x's column-wise maxs, and set in weight's x_maxs attr - -transpose: transpose quantized x for wgrad -pad: # pad M to be multiplier of 32, including quant scales and transposed x - -""" - - -# y = x @ w -# dx = y @ wT -# dwT = yT @ x -def triton_smooth_quant_input(x, smooth_scale, x_q=None, x_scale=None, xt_q=None, - transpose=True, pad=True, round_scale=False): - x_q, x_scale, x_maxs = triton_reused_smooth_quant(x, smooth_scale, x_q=x_q, - x_scale=x_scale, reverse=False, - round_scale=round_scale) - - if transpose: - xt_q = triton_transpose_and_pad(x_q, out=xt_q, pad=pad) - else: - xt_q = None - xt_scale = smooth_scale - - return x_q, xt_q, x_scale, xt_scale - - -# y = x @ w -# dx = y @ wT -# dwT = yT @ x -def triton_smooth_quant_grad(y, smooth_scale, transpose_smooth_scale, reverse=True, - transpose=True, pad=True, round_scale=False): - assert reverse, "args `smooth_scale` and/or `transpose_smooth_scale` must be in reciprocal format in triton_smooth_quant_grad" - y_q, y_scale, _ = triton_reused_smooth_quant(y, smooth_scale, reverse=True, - round_scale=round_scale) - if transpose: - yt_q, yt_scale = triton_reused_transpose_smooth_quant(y, - transpose_smooth_scale, - reverse=True, - pad=pad, - round_scale=round_scale) - else: - yt_q, yt_scale = None, None - - return y_q, yt_q, y_scale, yt_scale - -""" -we stat the max/mean of rowwise maximums -gate: 1.15/0.14 -up: 0.34/0.14 -down 1.12/0.15 -large value may cause underflow in w, but leading to overflow in dy -however, underflow in w only influences a row of w, but will influences -all the rows in dy, therefore we use a very small value to avoid overflow in dy - -furthermore, we clip the values of the subrow within the master weight, to avoid -inconsistant values between training and evaluation. 
- -""" -def triton_smooth_quant_w(w, smooth_scale, w_q, quant_scale, subrow_scales, offset=0, - round_scale=False): - assert w.ndim == 1 - assert w_q.size(1) == smooth_scale.size(0) - - size = w.numel() - M, N = w_q.shape - - if size == M * N: - triton_reused_smooth_quant(w.view(M, N), smooth_scale, x_q=w_q, - x_scale=quant_scale, - round_scale=round_scale) - elif offset % N == 0 and size % N == 0: - n_row = size // N - row_id = offset // N - w_q_slice = w_q[row_id:row_id + n_row] - quant_scale_slice = quant_scale[row_id:row_id + n_row] - triton_reused_smooth_quant(w.view(n_row,N), smooth_scale, x_q=w_q_slice, - x_scale=quant_scale_slice, - round_scale=round_scale) - else: - row_si = (offset - 1)//N + 1 - row_ei = (offset + size) // N - col_si = offset % N - col_ei = (offset + size ) % N - n_row = row_ei - row_si - mw_offset = 0 if col_si == 0 else N - col_si - w_q_slice = w_q[row_si:row_ei] - quant_scale_slice = quant_scale[row_si:row_ei] - w_slice = w[mw_offset:mw_offset+n_row*N].view(n_row,N) - triton_reused_smooth_quant(w_slice, - smooth_scale, - x_q=w_q_slice, - x_scale=quant_scale_slice, - round_scale=round_scale) - - # subrow scale is writed by the row with leading master weights - if col_si > 0 or col_ei > 0: - triton_subrow_reused_smooth_quant(w, - smooth_scale, - w_q, - quant_scale, - subrow_scales, - offset, - size, - reverse=False, - round_scale=round_scale) - diff --git a/linghe/tools/util.py b/linghe/tools/util.py index c2486a0..a97768e 100644 --- a/linghe/tools/util.py +++ b/linghe/tools/util.py @@ -4,7 +4,6 @@ """ import math - import torch @@ -80,6 +79,45 @@ def torch_block_quant(w, B=128, dtype=torch.float8_e4m3fn, round_scale=False): return wq, scale +def torch_smooth_quant(x, smooth_scale, reverse=False, round_scale=False): + x = x.float() + x_maxs = x.abs().amax(0) + if reverse: + x_smooth = x * smooth_scale + else: + x_smooth = x / torch.maximum(smooth_scale, + 1e-30 * torch.ones_like(smooth_scale)) + scale = x_smooth.abs().amax(1) / 448 + scale = torch.maximum(scale, 1e-30 * torch.ones_like(scale)) + if round_scale: + scale = torch.exp2(torch.ceil(torch.log2(scale))) + x_q = (x_smooth / scale[:, None]).to(torch.float8_e4m3fn) + return x_q, scale, x_maxs + + +def torch_batch_smooth_quant(xs, smooth_scales, indices, token_count_per_expert, + reverse=False, round_scale=False): + q_refs = [] + scale_refs = [] + s = 0 + for i, c in enumerate(token_count_per_expert): + idx = indices[s:s + c] + y_slice = xs[idx] + if reverse: + y_smooth = y_slice * smooth_scales[i] + else: + y_smooth = y_slice / smooth_scales[i] + scale = y_smooth.abs().amax(1) / 448 + if round_scale: + scale = torch.exp2(torch.ceil(torch.log2(scale))) + q_refs.append((y_smooth / scale[:, None]).to(torch.float8_e4m3fn)) + scale_refs.append(scale) + s += c + q_ref = torch.cat(q_refs, 0) + scale_ref = torch.cat(scale_refs, 0) + return q_ref, scale_ref + + def torch_make_indices(logits, topk=8, bias=-0.01): M, n_experts = logits.shape device = logits.device @@ -105,6 +143,195 @@ def torch_make_indices(logits, topk=8, bias=-0.01): return probs, route_map, token_count_per_expert, indices, row_id_map +# quant with scaling to 448 +def torch_duplex_smooth_tensor_quant(x, w, dtype): + # w:[bs, in] w:[out, in] + x = x.clone() + w = w.clone() + fmax = torch.finfo(dtype).max + x_max = torch.max(torch.abs(x).float(), dim=0, keepdim=True)[0] + w_max = torch.max(torch.abs(w).float(), dim=0, keepdim=True)[0] + scale = (x_max / w_max) ** 0.5 + x_max_ = x_max / scale + w_max_ = w_max * scale + x_scale = x_max_ / fmax + 
w_scale = w_max_ / fmax + rescale = fmax / torch.maximum(x_max_.max(), w_max_.max()) + x_q = (x * (rescale / scale).to(x.dtype)).to(dtype) + w_q = (w * (scale * rescale).to(x.dtype)).to(dtype) + + return x_q, w_q, scale, rescale + + +def torch_duplex_smooth_quant(x, w, dtype=torch.float8_e4m3fn): + # w:[bs, in] w:[out, in] + x = x.clone() + w = w.clone() + fmax = torch.finfo(dtype).max + x_max = torch.max(torch.abs(x).float(), dim=0, keepdim=True)[0] + w_max = torch.max(torch.abs(w).float(), dim=0, keepdim=True)[0] + maxs = (x_max * w_max) ** 0.5 + x_scale = x_max / maxs + w_scale = w_max / maxs # reciprocal of x_scale + x_smooth = x / x_scale + w_smooth = w / w_scale + x_max = torch.max(torch.abs(x_smooth).float(), dim=1, keepdim=True)[0] + w_max = torch.max(torch.abs(w_smooth).float(), dim=1, keepdim=True)[0] + x_scale = x_max / fmax + w_scale = w_max / fmax + x_q = (x_smooth * (1.0 / x_scale).to(x.dtype)).to(dtype) + w_q = (w_smooth * (1.0 / w_scale).to(x.dtype)).to(dtype) + + return x_q, w_q, x_scale, w_scale + + +def torch_outlier_quant(x, w, dtype): + x = x.clone() + w = w.clone() + fmax = torch.finfo(dtype).max + max_val, max_idx = torch.topk(x.abs().float().max(dim=0)[0], 5) + # print(max_idx) + x_outlier = x[:, max_idx[:4]] + x[:, max_idx[:4]] = 0.0 + x_scale = max_val[-1] / fmax + xq = (x / x_scale.to(x.dtype)).to(dtype) + w_max = w.abs().float().max() + w_scale = w_max / fmax + wq = (w / w_scale.to(x.dtype)).to(dtype) + return xq, wq, x_scale, w_scale, max_idx[:4], x_outlier + + +def make_hadamard_matrix(n, device='cuda:0', dtype=torch.bfloat16, norm=False): + assert 2 ** int(math.log2(n)) == n + m2 = torch.tensor([[1, 1], [1, -1]], device='cpu', dtype=torch.float32) + m = m2 + for i in range(int(math.log2(n)) - 1): + m = torch.kron(m, m2) + if norm: + m = m / n ** 0.5 + return m.to(dtype=dtype, device=device) + + +def torch_hadamard_transform(x, hm, side='right'): + assert side in ('right', 'left') + x = x.clone() + hm = hm.clone() + M, K = x.shape + B = hm.size(0) + xp = torch.reshape(x, (M // B, B, K // B, B)).permute(0, 2, 1, + 3).contiguous() + if side == 'right': + xp = xp @ hm + else: + xp = hm @ xp + xp = xp.permute(0, 2, 1, 3) + xp = torch.reshape(xp, (M, K)) + return xp + + +# token-wise and channel-wise +def torch_channel_quant_f_and_b(x, w, y): + M, K = x.shape + N, K = w.shape + M, N = y.shape + x_scale = x.abs().float().amax(dim=1, keepdim=True) / 448.0 # [M,1] + w_scale = w.abs().float().amax(dim=1, keepdim=True) / 448.0 # [N,1] + xq = (x / x_scale).to(torch.float8_e4m3fn) + wq = (w / w_scale).to(torch.float8_e4m3fn) + o = torch._scaled_mm(xq, + wq.t(), + scale_a=x_scale.view(-1, 1), + scale_b=w_scale.view(1, -1), + out_dtype=torch.bfloat16, + use_fast_accum=True) + + # dx = y @ wT + # absort w quant scale to y + ys = y * w_scale.view(1, N) + y_scale = ys.abs().float().amax(dim=1, keepdim=True) / 448.0 + 1e-9 + yq = (ys / y_scale).to(torch.float8_e4m3fn) + w_dummy_scale = torch.ones((1, K), dtype=torch.float32, device=x.device) + dx = torch._scaled_mm(yq, + wq.t().contiguous().t(), + scale_a=y_scale, + scale_b=w_dummy_scale, + out_dtype=torch.bfloat16, + use_fast_accum=True) + + # dw = yT@x + yt = y.t().contiguous() + yts = yt * x_scale.view(1, M) + yt_scale = yts.abs().float().amax(dim=1, keepdim=True) / 448.0 + 1e-9 + ytq = (yts / yt_scale).to(torch.float8_e4m3fn) + dw = torch._scaled_mm(ytq, + xq.t().contiguous().t(), + scale_a=yt_scale.view(-1, 1), + scale_b=w_dummy_scale, + out_dtype=torch.bfloat16, + use_fast_accum=True) + return xq, wq, yq, ytq, o, dx, dw + 
+ +# smooth and token-wise/channel-wise +def torch_reuse_smooth_quant_f_and_b(x, w, y): + x = x.clone() + w = w.clone() + y = y.clone() + M, K = x.shape + N, K = w.shape + M, N = y.shape + x_smooth_max = torch.amax(torch.abs(x).float(), dim=0, keepdim=True) + w_smooth_max = torch.amax(torch.abs(w).float(), dim=0, keepdim=True) + maxs = (x_smooth_max * w_smooth_max) ** 0.5 + x_smooth_scale = x_smooth_max / maxs # [K, 1] + w_smooth_scale = w_smooth_max / maxs # [K, 1] reciprocal of x_scale + x_smooth = x / x_smooth_scale + w_smooth = w / w_smooth_scale + + x_quant_max = torch.amax(torch.abs(x_smooth).float(), dim=1, keepdim=True) + w_quant_max = torch.amax(torch.abs(w_smooth).float(), dim=1, keepdim=True) + + x_quant_scale = x_quant_max / 448.0 # [M, 1] + w_quant_scale = w_quant_max / 448.0 # [N, 1] + xq = (x_smooth / x_quant_scale).to(torch.float8_e4m3fn) + wq = (w_smooth / w_quant_scale).to(torch.float8_e4m3fn) + + o = torch._scaled_mm(xq, + wq.t(), + scale_a=x_quant_scale.view(-1, 1), + scale_b=w_quant_scale.view(1, -1), + out_dtype=torch.bfloat16, + use_fast_accum=True) + + # print(f'{x_smooth_scale=} {x_quant_scale[:,0]=} {w_quant_scale=}') + + # dx = y @ wT + # absort w quant scale to y + ys = y * w_quant_scale.view(1, N) + y_scale = ys.abs().float().amax(dim=1, keepdim=True) / 448.0 + 1e-9 + yq = (ys / y_scale).to(torch.float8_e4m3fn) + dx = torch._scaled_mm(yq, + wq.t().contiguous().t(), + scale_a=y_scale, + scale_b=w_smooth_scale.view(1, -1), + out_dtype=torch.bfloat16, + use_fast_accum=True) + + # dw = yT@x + yt = y.t().contiguous() # [N, M] + yts = yt * x_quant_scale.view(1, M) + yt_scale = yts.abs().amax(dim=1, keepdim=True) / 448.0 + 1e-9 + ytq = (yts / yt_scale).to(torch.float8_e4m3fn) + dw = torch._scaled_mm(ytq, + xq.t().contiguous().t(), + scale_a=yt_scale.view(-1, 1), + scale_b=x_smooth_scale.view(1, -1), + out_dtype=torch.bfloat16, + use_fast_accum=True) + + return xq, wq, yq, ytq, o, dx, dw + + def fp16_forward(x, w): return x @ w @@ -217,3 +444,45 @@ def read_and_tile(filename, tile=True): f'y.max={y.abs().max().item():.3f} y.mean={y.abs().mean().item():.3f}') return x, w, y + + +def torch_fp16_vector_scaled_mm(x, weight, x_scale, weight_scale): + output = torch._scaled_mm(x, + weight, + scale_a=x_scale, + scale_b=weight_scale, + out_dtype=torch.bfloat16, + use_fast_accum=True) + return output + + +def torch_fp32_vector_scaled_mm(x, weight, x_scale, weight_scale, ones, + out=None): + output = torch._scaled_mm(x, + weight, + scale_a=ones, + scale_b=ones, + out_dtype=torch.float32, + use_fast_accum=True, + out=out) + return output * x_scale * weight_scale + + +def torch_fp16_scaler_scaled_mm(x, weight, x_scale, weight_scale): + output = torch._scaled_mm(x, + weight, + scale_a=x_scale, + scale_b=weight_scale, + out_dtype=torch.bfloat16, + use_fast_accum=True) + return output + + +def torch_fp32_scaler_scaled_mm(x, weight, x_scale, weight_scale): + output = torch._scaled_mm(x, + weight, + scale_a=x_scale, + scale_b=weight_scale, + out_dtype=torch.float32, + use_fast_accum=True) + return output diff --git a/linghe/utils/norm.py b/linghe/utils/norm.py index d030abb..c55745c 100644 --- a/linghe/utils/norm.py +++ b/linghe/utils/norm.py @@ -525,4 +525,105 @@ def triton_group_norm_gate_backward(grad_output, x, gate, weight, eps=1e-6, grou num_warps=8 ) dw = tmp_dw.sum(dim=0).to(weight.dtype) - return dx, dg, dw \ No newline at end of file + return dx, dg, dw + + + +@triton.jit +def rms_norm_and_smooth_quant_forward_kernel(x_ptr, weight_ptr, smooth_scale_ptr, + out_ptr, scale_ptr, 
max_ptr, rms_ptr, + eps, + M, + T, + N: tl.constexpr, + W: tl.constexpr, + CALIBRATE: tl.constexpr, + OUTPUT: tl.constexpr, + ROUND: tl.constexpr): + pid = tl.program_id(axis=0) + # row-wise read, row-wise write + weight = tl.load(weight_ptr + tl.arange(0, N)).to(tl.float32)[None, :] + smooth_scale = tl.load(smooth_scale_ptr + tl.arange(0, N))[None, :] + smooth_scale = 1.0 / tl.maximum(smooth_scale, 1e-30) + if CALIBRATE: + # triton 3.3.1 has bug with N = 2048 and calibrate=True + maxs = tl.zeros((N, ), dtype=tl.float32) + offs = pid * W * T * N + tl.arange(0, W)[:, None] * N + tl.arange(0, N)[ + None, :] + for i in range(T): + indices = pid * W * T + i * W + tl.arange(0, W) + x = tl.load(x_ptr + offs, mask=indices[:, None] < M).to(tl.float32) + rms = 1/tl.sqrt(tl.sum(x * x, axis=1) / N + eps) + if OUTPUT: + tl.store(rms_ptr + indices, rms, mask=indices < M) + x = x * rms[:, None] * weight + + if CALIBRATE: + maxs = tl.maximum(maxs, tl.max(tl.abs(x),0)) + + x = x * smooth_scale + scale = tl.maximum(tl.max(tl.abs(x), 1) / 448.0, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + q = (x / scale[:, None]).to(out_ptr.dtype.element_ty) + tl.store(scale_ptr + indices, scale, mask=indices < M) + tl.store(out_ptr + offs, q, mask=indices[:, None] < M) + offs += N * W + + if CALIBRATE: + tl.store(max_ptr + pid * N + tl.arange(0, N), maxs) + + +# rms is used for moe routing, it is stored as 1/rms +def triton_rms_norm_and_smooth_quant_forward(x, weight, smooth_scale=None, + eps=1e-6, + out=None, scale=None, rms=None, + calibrate=False, + output_rms=False, + round_scale=False): + """ + + """ + M, N = x.shape + assert N <= 8192 and 8192 % N == 0 + device = x.device + + if out is None: + out = torch.empty((M, N), device=device, dtype=torch.float8_e4m3fn) + + if scale is None: + scale = torch.empty((M,), device=device, dtype=torch.float32) + W = 8192 // N + T = 8 if M // W >= 4096 else 4 + assert M % (T * W) == 0 + g = M // (T * W) + if calibrate: + maxs = torch.empty((g, N), dtype=torch.float32, device=device) + else: + maxs = None + if output_rms and rms is None: + rms = torch.empty((M,), dtype=torch.float32, device=device) + grid = (g,) + rms_norm_and_smooth_quant_forward_kernel[grid]( + x, + weight, + smooth_scale, + out, + scale, + maxs, + rms, + eps, + M, + T, + N, + W, + calibrate, + output_rms, + round_scale, + num_stages=3, + num_warps=2 if N == 2048 else 4 + ) + if calibrate: + maxs = maxs.amax(0) + + return out, scale, maxs, rms diff --git a/linghe/utils/silu.py b/linghe/utils/silu.py index 0d8c9c4..bfcf910 100644 --- a/linghe/utils/silu.py +++ b/linghe/utils/silu.py @@ -3,11 +3,159 @@ Copyright (c) Ant Financial Service Group and its affiliates. 
""" +from typing import Optional import torch import triton import triton.language as tl + +@triton.jit +def weighted_silu_forward_kernel(x_ptr, weight_ptr, out_ptr, M, T, + N: tl.constexpr, + n: tl.constexpr, + W: tl.constexpr, + WEIGHT: tl.constexpr): + pid = tl.program_id(axis=0) + + row_offs = pid * W * T * n + tl.arange(0, W)[:, None] * n + col_offs = tl.arange(0, n)[None, :] + + for i in range(T): + indices = pid * W * T + i * W + tl.arange(0, W) + mask = indices[:, None] < M + x1 = tl.load(x_ptr + row_offs * 2 + col_offs, mask=mask).to(tl.float32) + x2 = tl.load(x_ptr + n + row_offs * 2 + col_offs, mask=mask).to( + tl.float32) + if WEIGHT: + w = tl.load(weight_ptr + indices, mask=indices < M).to(tl.float32)[:, + None] + x = x1 / (1 + tl.exp(-x1)) * x2 * w + else: + x = x1 / (1 + tl.exp(-x1)) * x2 + tl.store(out_ptr + row_offs + col_offs, x, mask=mask) + row_offs += n * W + + +# used in bf16 moe +def triton_weighted_silu_forward(x, weight=None, out=None): + """ + compute silu(x)*weight, used in bf16/fp16 training with MoE + Args: + x: input tensor + weight: tokenwise weight + Returns: + out: output tensor + """ + # row-wise read, row-wise write + M, N = x.shape + assert N <= 8192 + device = x.device + if out is None: + out = torch.empty((M, N // 2), device=device, dtype=x.dtype) + WEIGHT = weight is not None + W = 8192 // N + T = 8 + grid = (triton.cdiv(M, T * W),) + weighted_silu_forward_kernel[grid]( + x, + weight, + out, + M, T, + N, + N // 2, + W, + WEIGHT, + num_stages=3, + num_warps=8 + ) + return out + + +@triton.jit +def weighted_silu_backward_kernel(g_ptr, x_ptr, weight_ptr, dx_ptr, dw_ptr, M, + T, + N: tl.constexpr, + n: tl.constexpr, + W: tl.constexpr, + WEIGHT: tl.constexpr): + pid = tl.program_id(axis=0) + + offs = pid * W * T * N + tl.arange(0, W)[:, None] * N + tl.arange(0, n)[ + None, :] + hoffs = pid * W * T * n + tl.arange(0, W)[:, None] * n + tl.arange(0, n)[ + None, :] + for i in range(T): + mask = pid * W * T + i * W + tl.arange(0, W) + x1 = tl.load(x_ptr + offs, mask=mask[:, None] < M).to(tl.float32) + x2 = tl.load(x_ptr + offs + n, mask=mask[:, None] < M).to(tl.float32) + g = tl.load(g_ptr + hoffs, mask=mask[:, None] < M).to(tl.float32) + if WEIGHT: + w = tl.load(weight_ptr + mask, mask=mask < M).to(tl.float32)[:, None] + sigmoid = 1 / (1 + tl.exp(-x1)) + dw = tl.sum(x1 * sigmoid * x2 * g, 1) + tl.store(dw_ptr + mask, dw, mask=mask < M) + dx1 = g * x2 * w * sigmoid * (1 + x1 * tl.exp(-x1) * sigmoid) + tl.store(dx_ptr + offs, dx1, mask=mask[:, None] < M) + + dx2 = g * x1 * sigmoid * w + tl.store(dx_ptr + offs + n, dx2, mask=mask[:, None] < M) + else: + sigmoid = 1 / (1 + tl.exp(-x1)) + dx1 = g * x2 * sigmoid * (1 + x1 * tl.exp(-x1) * sigmoid) + tl.store(dx_ptr + offs, dx1, mask=mask[:, None] < M) + + dx2 = g * x1 * sigmoid + tl.store(dx_ptr + offs + n, dx2, mask=mask[:, None] < M) + offs += N * W + hoffs += n * W + + +def triton_weighted_silu_backward(g: torch.Tensor, + x: torch.Tensor, + weight: Optional[torch.Tensor] = None): + """ + backward of triton_weighted_silu_forward + Args: + g: gradient tensor + x: input tensor + weight: weight tensor + + Returns: + dx: gradient of x + dw: gradient of weight + """ + # row-wise read, row-wise write + M, N = x.shape + assert N <= 8192 + device = x.device + if weight is not None: + dw = torch.empty(weight.shape, device=device, dtype=x.dtype) + WEIGHT = True + else: + dw = None + WEIGHT = False + dx = torch.empty((M, N), device=device, dtype=x.dtype) + W = 8192 // N + T = 8 + grid = (triton.cdiv(M, W*T),) + 
weighted_silu_backward_kernel[grid]( + g, + x, + weight, + dx, + dw, + M, T, + N, + N // 2, + W, + WEIGHT, + num_stages=3, + num_warps=8 + ) + return dx, dw + + @triton.jit def silu_and_block_quant_forward_kernel(x_ptr, out_ptr, scale_ptr, @@ -548,3 +696,744 @@ def triton_batch_weighted_silu_and_block_quant_backward(g, x, weight, ) dw = dws.sum(1, keepdim=True).to(weight.dtype) return dx, dx_scale, dw, transpose_dx, transpose_dx_scale + + + + + +# n is power of 2 +@triton.jit +def silu_and_smooth_quant_forward_kernel(x_ptr, smooth_scale_ptr, out_ptr, scale_ptr, + max_ptr, M, T, n: tl.constexpr, + W: tl.constexpr, ROUND: tl.constexpr, + CALIBRATE: tl.constexpr): + pid = tl.program_id(axis=0) + + row_offs = pid * T * W * n + tl.arange(0, W)[:, None] * n + col_offs = tl.arange(0, n)[None, :] + smooth_scale = tl.load(smooth_scale_ptr + tl.arange(0, n)) + smooth_scale = 1.0 / smooth_scale + if CALIBRATE: + maxs = tl.zeros((W, n), dtype=tl.float32) + + for i in range(T): + indices = pid * T * W + i * W + tl.arange(0, W) + mask = indices[:, None] < M + x1 = tl.load(x_ptr + row_offs * 2 + col_offs, mask=mask).to(tl.float32) + x2 = tl.load(x_ptr + n + row_offs * 2 + col_offs, mask=mask).to( + tl.float32) + x = x1 / (1 + tl.exp(-x1)) * x2 + if CALIBRATE: + maxs = tl.maximum(x.abs(), maxs) + x = x * smooth_scale + scale = tl.maximum(tl.max(x.abs(), 1) / 448, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + tl.store(scale_ptr + indices, scale, mask=indices < M) + x = (x / scale[:, None]).to(out_ptr.dtype.element_ty) + tl.store(out_ptr + row_offs + col_offs, x, mask=mask) + row_offs += n * W + + if CALIBRATE: + maxs = tl.max(maxs, 0) + tl.store(max_ptr + pid * n + tl.arange(0, n), maxs) + + +# n is NOT power of 2 +@triton.jit +def compatible_silu_and_smooth_quant_forward_kernel(x_ptr, smooth_scale_ptr, out_ptr, + scale_ptr, max_ptr, M, + T: tl.constexpr, n: tl.constexpr, + B: tl.constexpr, + ROUND: tl.constexpr, + CALIBRATE: tl.constexpr): + pid = tl.program_id(axis=0) + + # rowwise read with block size [T, B] + row_offs = pid * T * n + tl.arange(0, T)[:, None] * n + col_offs = tl.arange(0, B)[None, :] + + nb = n // B + maxs = tl.zeros((T,), dtype=tl.float32) + for i in range(nb): + + smooth_scale = tl.load(smooth_scale_ptr + i * B + tl.arange(0, B)) + x1 = tl.load(x_ptr + row_offs * 2 + col_offs).to(tl.float32) + x2 = tl.load(x_ptr + n + row_offs * 2 + col_offs).to(tl.float32) + x = x1 / (1 + tl.exp(-x1)) * x2 + if CALIBRATE: + x_maxs = tl.max(x.abs(), 0) + tl.store(max_ptr + pid * n + i * B + tl.arange(0, B), x_maxs) + x = x / smooth_scale + maxs = tl.maximum(tl.max(x.abs(), 1), maxs) + col_offs += B + + scale = tl.maximum(maxs / 448, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + tl.store(scale_ptr + pid * T + tl.arange(0, T), scale) + + col_offs = tl.arange(0, B)[None, :] + for i in range(nb): + smooth_scale = tl.load(smooth_scale_ptr + i * B + tl.arange(0, B)) + + x1 = tl.load(x_ptr + row_offs * 2 + col_offs).to(tl.float32) + x2 = tl.load(x_ptr + n + row_offs * 2 + col_offs).to(tl.float32) + x = x1 / (1 + tl.exp(-x1)) * x2 + x = x / smooth_scale + + x = (x / scale[:, None]).to(out_ptr.dtype.element_ty) + tl.store(out_ptr + row_offs + col_offs, x) + col_offs += B + + + + +# used in shared expert +def triton_silu_and_smooth_quant_forward(x, smooth_scale=None, out=None, scale=None, + maxs=None, round_scale=False, + calibrate=False): + """ + + """ + M, N = x.shape + n = N // 2 + device = x.device + if out is None: + out = torch.empty((M, N // 2), device=device, 
dtype=torch.float8_e4m3fn) + if scale is None: + scale = torch.empty((M,), device=device, dtype=torch.float32) + + if triton.next_power_of_2(N) == N and N <= 8192: + # sm = torch.cuda.get_device_properties(device).multi_processor_count + W = 8192 // N + T = 8 if M//W >= 1024 else 4 + assert M % (T*W) == 0 + g = M//(T*W) + # T = triton.cdiv(M, sm * W) + if maxs is None and calibrate: + maxs = torch.empty((g, n), device=device, dtype=torch.float32) + grid = (g,) + silu_and_smooth_quant_forward_kernel[grid]( + x, + smooth_scale, + out, + scale, + maxs, + M, + T, + n, + W, + round_scale, + calibrate, + num_stages=2, + num_warps=16 + ) + else: + B = 512 + T = 16 + assert n % B == 0 and M % T == 0 + grid = (M // T,) + if maxs is None and calibrate: + maxs = torch.empty((M // T, n), device=device, dtype=torch.float32) + compatible_silu_and_smooth_quant_forward_kernel[grid]( + x, + smooth_scale, + out, + scale, + maxs, + M, + T, + N // 2, + B, + round_scale, + calibrate, + num_stages=2, + num_warps=16 + ) + + if calibrate: + maxs = maxs.amax(0) + + + return out, scale, maxs + + + + + +@triton.jit +def silu_and_smooth_quant_backward_kernel(g_ptr, x_ptr, + smooth_scale_ptr, + transpose_smooth_scale_ptr, + dx_ptr, dx_scale_ptr, + transpose_dx_ptr, + transpose_dx_scale_ptr, + M, + n: tl.constexpr, + T: tl.constexpr, + B: tl.constexpr, + REVERSE: tl.constexpr, + ROUND: tl.constexpr): + pid = tl.program_id(axis=0) + + offs = pid * T * n * 2 + tl.arange(0, T)[:, None] * n * 2 + tl.arange(0, B)[ + None, :] + hoffs = pid * T * n + tl.arange(0, T)[:, None] * n + tl.arange(0, B)[None, + :] + toffs = pid * T + tl.arange(0, B)[:, None] * M + tl.arange(0, T)[None, :] + nb = n // B + maxs = tl.zeros((T, ), dtype=tl.float32) + transpose_smooth_scale = tl.load(transpose_smooth_scale_ptr + pid * T + tl.arange(0, T))[:, None] + for i in range(nb): + smooth_scale_1 = tl.load(smooth_scale_ptr + i * B + tl.arange(0, B)) + smooth_scale_2 = tl.load(smooth_scale_ptr + n + i * B + tl.arange(0, B)) + if not REVERSE: + smooth_scale_1 = 1 / smooth_scale_1 + smooth_scale_2 = 1 / smooth_scale_2 + + x1 = tl.load(x_ptr + offs).to(tl.float32) + x2 = tl.load(x_ptr + offs + n).to(tl.float32) + g = tl.load(g_ptr + hoffs).to(tl.float32) + sigmoid = 1 / (1 + tl.exp(-x1)) + + # x1 = tl.load(x_ptr + offs) + # x2 = tl.load(x_ptr + offs + n) + # g = tl.load(g_ptr + hoffs) + # sigmoid = 1 / (1 + tl.exp(-x1.to(tl.float32))) + + dx1 = g * x2 * sigmoid * ( + 1 + x1 * (1 - sigmoid)) + dx2 = g * x1 * sigmoid + + t_dx = dx1 * transpose_smooth_scale + t_s = tl.maximum(tl.max(tl.abs(t_dx), 0) / 448, 1e-30) + if ROUND: + t_s = tl.exp2(tl.ceil(tl.log2(t_s))) + t_dx = t_dx/t_s + tl.store(transpose_dx_ptr + toffs, tl.trans(t_dx.to(transpose_dx_ptr.dtype.element_ty))) + tl.store(transpose_dx_scale_ptr + pid * n * 2 + i * B + tl.arange(0, B), t_s) + + t_dx = dx2 * transpose_smooth_scale + t_s = tl.maximum(tl.max(tl.abs(t_dx), 0) / 448, 1e-30) + if ROUND: + t_s = tl.exp2(tl.ceil(tl.log2(t_s))) + t_dx = t_dx/t_s + tl.store(transpose_dx_ptr + M * n + toffs, tl.trans(t_dx.to(transpose_dx_ptr.dtype.element_ty))) + tl.store(transpose_dx_scale_ptr + pid * n * 2 + n + i * B + tl.arange(0, B), t_s) + + dx1 = dx1 * smooth_scale_1 + dx2 = dx2 * smooth_scale_2 + + # maxs = tl.maximum( + # tl.maximum(dx1.abs(), dx2.abs()), maxs) + maxs = tl.maximum( + tl.maximum(tl.max(dx1.abs(), 1), tl.max(dx2.abs(), 1)), maxs) + + offs += B + hoffs += B + toffs += B * M + + scale = tl.maximum(maxs / 448, 1e-30) + # scale = tl.maximum(tl.max(maxs, 1) / 448, 1e-30) + + if ROUND: + 
scale = tl.exp2(tl.ceil(tl.log2(scale))) + tl.store(dx_scale_ptr + pid * T + tl.arange(0, T), scale) + + s = 1 / scale[:, None] + offs = pid * T * n * 2 + tl.arange(0, T)[:, None] * n * 2 + tl.arange(0, B)[ + None, :] + hoffs = pid * T * n + tl.arange(0, T)[:, None] * n + tl.arange(0, B)[None, + :] + for i in range(nb): + smooth_scale_1 = tl.load(smooth_scale_ptr + i * B + tl.arange(0, B)) + smooth_scale_2 = tl.load(smooth_scale_ptr + n + i * B + tl.arange(0, B)) + if not REVERSE: + smooth_scale_1 = 1 / smooth_scale_1 + smooth_scale_2 = 1 / smooth_scale_2 + + x1 = tl.load(x_ptr + offs).to(tl.float32) + x2 = tl.load(x_ptr + offs + n).to(tl.float32) + g = tl.load(g_ptr + hoffs).to(tl.float32) + sigmoid = 1 / (1 + tl.exp(-x1)) + dx1 = g * x2 * sigmoid * ( + 1 + x1 * (1 - sigmoid)) * smooth_scale_1 + dx2 = g * x1 * sigmoid * smooth_scale_2 + + dx1 = (dx1 * s).to(dx_ptr.dtype.element_ty) + dx2 = (dx2 * s).to(dx_ptr.dtype.element_ty) + + tl.store(dx_ptr + offs, dx1) + tl.store(dx_ptr + n + offs, dx2) + offs += B + hoffs += B + +# requant multi-column quantized tensor +@triton.jit +def _requant_kernel(x_ptr, scale_ptr, scales_ptr, + M, + N, + H: tl.constexpr, + W: tl.constexpr + ): + rid = tl.program_id(axis=0) + cid = tl.program_id(axis=1) + offs = rid * H * N + cid * W + tl.arange(0, H)[:, None] * N + tl.arange(0, W)[None, :] + global_scale = tl.load(scale_ptr + rid * H + tl.arange(0, H)) + # scales is stored with column-major format + local_scale = tl.load(scales_ptr + cid * M + rid * H + tl.arange(0, H)) + x = tl.load(x_ptr+offs).to(tl.float32) + rescale = local_scale/global_scale + x = x * rescale[:,None] + tl.store(x_ptr+offs, x) + + +# used in shared expert +def triton_silu_and_smooth_quant_backward(g, x, + smooth_scale=None, + transpose_smooth_scale=None, + reverse=True, + round_scale=False): + """ + + """ + assert round_scale + M, N = x.shape + n = N // 2 + device = x.device + dx = torch.empty((M, N), device=device, dtype=torch.float8_e4m3fn) + dx_scale = torch.empty((M,), device=device, dtype=torch.float32) + scale_shape = (N, ) + transpose_dx = torch.empty((N, M), device=device, dtype=torch.float8_e4m3fn) + transpose_dx_scale = torch.empty(scale_shape, device=device, dtype=torch.float32) + + T = 32 + B = 32 + assert M % T == 0 and n % B == 0 + transpose_dx_scales = torch.empty((M // T, N), device=device, dtype=torch.float32) + grid = (M // T,) + silu_and_smooth_quant_backward_kernel[grid]( + g, + x, + smooth_scale, + transpose_smooth_scale, + dx, + dx_scale, + transpose_dx, + transpose_dx_scales, + M, + n, + T, + B, + reverse, + round_scale, + num_stages=3, + num_warps=2 + ) + transpose_dx_scale = transpose_dx_scales.amax(0) + grid = (N // B, M // T) + _requant_kernel[grid](transpose_dx, transpose_dx_scale, transpose_dx_scales, + N, + M, + B, + T) + + return dx, dx_scale, transpose_dx, transpose_dx_scale + + +@triton.jit +def batch_weighted_silu_and_smooth_quant_forward_kernel(x_ptr, weight_ptr, + smooth_scale_ptr, + out_ptr, + scale_ptr, max_ptr, + count_ptr, + accum_ptr, M, + n: tl.constexpr, + W: tl.constexpr, + ROUND: tl.constexpr, + REVERSE: tl.constexpr, + CALIBRATE: tl.constexpr): + eid = tl.program_id(axis=0) + tid = tl.program_id(axis=1) + sm = tl.num_programs(axis=1) + + count = tl.load(count_ptr + eid) + ei = tl.load(accum_ptr + eid) + si = ei - count + c = tl.cdiv(count, sm * W) + + row_offs = si * n + tid * c * W * n + tl.arange(0, W)[:, None] * n + col_offs = tl.arange(0, n)[None, :] + smooth_scale = tl.load(smooth_scale_ptr + n * eid + tl.arange(0, n)) + if not REVERSE: 
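+        # REVERSE selects how smooth_scale is stored: with REVERSE=False it is a per-channel
+        # divisor and is inverted here; with REVERSE=True the caller is assumed to have passed
+        # the reciprocal already, so it is multiplied in directly below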
+ smooth_scale = 1.0 / smooth_scale + + if CALIBRATE: + maxs = tl.zeros((W, n), dtype=tl.float32) + + for i in range(c): + indices = tid * c * W + i * W + tl.arange(0, W) + mask = indices[:, None] < count + x1 = tl.load(x_ptr + row_offs * 2 + col_offs, mask=mask).to(tl.float32) + x2 = tl.load(x_ptr + n + row_offs * 2 + col_offs, mask=mask).to( + tl.float32) + + w = tl.load(weight_ptr + si + indices, mask=indices < count).to( + tl.float32)[:, + None] + x = x1 / (1 + tl.exp(-x1)) * x2 + + if CALIBRATE: + maxs = tl.maximum(x.abs(), maxs) + + x *= w * smooth_scale + scale = tl.maximum(tl.max(x.abs(), 1) / 448, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + tl.store(scale_ptr + si + indices, scale, mask=indices < count) + x = (x / scale[:, None]).to(out_ptr.dtype.element_ty) + tl.store(out_ptr + row_offs + col_offs, x, mask=mask) + row_offs += n * W + + if CALIBRATE: + maxs = tl.max(maxs, 0) + tl.store(max_ptr + eid * sm * n + tid * n + tl.arange(0, n), maxs) + + +# used in routed experts +def triton_batch_weighted_silu_and_smooth_quant_forward(x, + weight, + counts, + smooth_scale=None, + splits=None, + out=None, + scale=None, + round_scale=False, + reverse=False, + calibrate=False): + """ + + """ + M, N = x.shape + n = N // 2 + n_experts = counts.shape[0] + assert N <= 8192 + device = x.device + if out is None: + out = torch.empty((M, n), device=device, dtype=torch.float8_e4m3fn) + + sm = torch.cuda.get_device_properties(device).multi_processor_count + tmp_maxs = None + if scale is None: + scale = torch.empty((M,), device=device, dtype=torch.float32) + if M == 0: + maxs = torch.zeros((n_experts, n), device=device, + dtype=torch.float32) + + elif calibrate: + tmp_maxs = torch.empty((n_experts, sm, n), device=device, + dtype=torch.float32) + maxs = torch.empty((n_experts, n), device=device, + dtype=torch.float32) + else: + maxs = None + + if M == 0: + return out, scale, maxs + + accums = torch.cumsum(counts, 0) + W = 8192 // N + grid = (n_experts, sm) + batch_weighted_silu_and_smooth_quant_forward_kernel[grid]( + x, + weight, + smooth_scale, + out, + scale, + tmp_maxs, + counts, + accums, + M, + n, + W, + round_scale, + reverse, + calibrate, + num_stages=3, + num_warps=16 + ) + if calibrate: + maxs = tmp_maxs.amax(1) + + return out, scale, maxs + + +@triton.jit +def batch_weighted_silu_and_smooth_quant_backward_kernel(g_ptr, x_ptr, + weight_ptr, + smooth_scale_ptr, + transpose_smooth_scale_ptr, + count_ptr, + accum_ptr, + dx_ptr, + dx_scale_ptr, + transpose_dx_ptr, + transpose_dx_scale_ptr, + dw_ptr, + n: tl.constexpr, + T: tl.constexpr, + B: tl.constexpr, + E: tl.constexpr, + REVERSE: tl.constexpr, + ROUND: tl.constexpr): + eid = tl.program_id(axis=0) + pid = tl.program_id(axis=1) + max_block = tl.num_programs(axis=1) + + count = tl.load(count_ptr + eid) + round_count = tl.cdiv(count, 32) * 32 + si = tl.load(accum_ptr + eid) - count + + if pid >= tl.cdiv(count, T): + return + + round_off = tl.sum(tl.where(tl.arange(0, E) < eid, + tl.cdiv(tl.load(count_ptr + tl.arange(0, E)), + 32), 0)) * 32 + + offs = si * n * 2 + pid * T * n * 2 + tl.arange(0, T)[:, + None] * n * 2 + tl.arange(0, B)[ + None, :] + hoffs = si * n + pid * T * n + tl.arange(0, T)[:, None] * n + tl.arange(0, + B)[ + None, + :] + toffs = round_off * n * 2 + pid * T + tl.arange(0, B)[:, + None] * round_count + tl.arange(0, T)[ + None, :] + nb = n // B + maxs = tl.zeros((T,), dtype=tl.float32) + indices = pid * T + tl.arange(0, T) + if REVERSE: + transpose_smooth_scale = tl.load( + transpose_smooth_scale_ptr + si + 
pid * T + tl.arange(0, T), + mask=indices < count)[:, None] + else: + transpose_smooth_scale = 1 / tl.load( + transpose_smooth_scale_ptr + si + pid * T + tl.arange(0, T), + mask=indices < count, other=1e-30)[:, None] + + w = tl.load(weight_ptr + si + pid * T + tl.arange(0, T), + mask=indices < count)[:, None] + dw = tl.zeros((T,), dtype=tl.float32) + qdtype = transpose_dx_ptr.dtype.element_ty + for i in range(nb): + smooth_scale_1 = tl.load( + smooth_scale_ptr + eid * n * 2 + i * B + tl.arange(0, B)) + smooth_scale_2 = tl.load( + smooth_scale_ptr + eid * n * 2 + n + i * B + tl.arange(0, B)) + if not REVERSE: + smooth_scale_1 = 1 / smooth_scale_1 + smooth_scale_2 = 1 / smooth_scale_2 + + x1 = tl.load(x_ptr + offs, mask=indices[:, None] < count).to(tl.float32) + x2 = tl.load(x_ptr + offs + n, mask=indices[:, None] < count).to( + tl.float32) + g = tl.load(g_ptr + hoffs, mask=indices[:, None] < count).to(tl.float32) + sigmoid = 1 / (1 + tl.exp(-x1)) + dx1 = g * x2 * sigmoid * ( + 1 + x1 * (1 - sigmoid)) * w + dx2 = g * x1 * sigmoid * w + + dw += tl.sum(x1 * sigmoid * x2 * g, 1) + + t_dx = dx1 * transpose_smooth_scale + t_s = tl.maximum(tl.max(tl.abs(t_dx), 0) / 448, 1e-30) + if ROUND: + t_s = tl.exp2(tl.ceil(tl.log2(t_s))) + t_dx = t_dx / t_s + tl.store(transpose_dx_ptr + toffs, tl.trans(t_dx.to(qdtype)), + mask=indices[None, :] < round_count) + tl.store( + transpose_dx_scale_ptr + eid * max_block * n * 2 + pid * n * 2 + i * B + tl.arange( + 0, B), t_s) + + t_dx = dx2 * transpose_smooth_scale + t_s = tl.maximum(tl.max(tl.abs(t_dx), 0) / 448, 1e-30) + if ROUND: + t_s = tl.exp2(tl.ceil(tl.log2(t_s))) + t_dx = t_dx / t_s + tl.store(transpose_dx_ptr + round_count * n + toffs, + tl.trans(t_dx.to(qdtype)), mask=indices[None, :] < round_count) + tl.store( + transpose_dx_scale_ptr + eid * max_block * n * 2 + pid * n * 2 + n + i * B + tl.arange( + 0, B), t_s) + + dx1 = dx1 * smooth_scale_1 + dx2 = dx2 * smooth_scale_2 + maxs = tl.maximum( + tl.maximum(tl.max(dx1.abs(), 1), tl.max(dx2.abs(), 1)), maxs) + + offs += B + hoffs += B + toffs += B * round_count + + tl.store(dw_ptr + si + pid * T + tl.arange(0, T), dw, mask=indices < count) + scale = tl.maximum(maxs / 448, 1e-30) + if ROUND: + scale = tl.exp2(tl.ceil(tl.log2(scale))) + tl.store(dx_scale_ptr + si + pid * T + tl.arange(0, T), scale, + mask=indices < count) + + s = 1 / scale[:, None] + offs = si * n * 2 + pid * T * n * 2 + tl.arange(0, T)[:, + None] * n * 2 + tl.arange(0, B)[ + None, :] + hoffs = si * n + pid * T * n + tl.arange(0, T)[:, None] * n + tl.arange(0, + B)[ + None, + :] + for i in range(nb): + smooth_scale_1 = tl.load( + smooth_scale_ptr + eid * n * 2 + i * B + tl.arange(0, B)) + smooth_scale_2 = tl.load( + smooth_scale_ptr + eid * n * 2 + n + i * B + tl.arange(0, B)) + if not REVERSE: + smooth_scale_1 = 1 / smooth_scale_1 + smooth_scale_2 = 1 / smooth_scale_2 + + x1 = tl.load(x_ptr + offs, mask=indices[:, None] < count).to(tl.float32) + x2 = tl.load(x_ptr + offs + n, mask=indices[:, None] < count).to( + tl.float32) + g = tl.load(g_ptr + hoffs, mask=indices[:, None] < count).to(tl.float32) + sigmoid = 1 / (1 + tl.exp(-x1)) + dx1 = g * x2 * sigmoid * ( + 1 + x1 * (1 - sigmoid)) * smooth_scale_1 * w + dx2 = g * x1 * sigmoid * smooth_scale_2 * w + + dx1 = (dx1 * s).to(dx_ptr.dtype.element_ty) + dx2 = (dx2 * s).to(dx_ptr.dtype.element_ty) + + tl.store(dx_ptr + offs, dx1, mask=indices[:, None] < count) + tl.store(dx_ptr + n + offs, dx2, mask=indices[:, None] < count) + offs += B + hoffs += B + + +# requant multi-column quantized tensor 
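+# Illustrative PyTorch reference (not called by the kernels): the backward kernel above
+# quantizes the transposed gradient T columns at a time, so within one expert each row of
+# the transposed output is left with one local scale per column block. A rough sketch of
+# the rescale that unifies those local scales into a single per-row scale, assuming that
+# layout (the helper name is hypothetical):
+def _requant_reference(x_q, local_scales, block_cols):
+    # x_q:          (R, C) FP8 tensor whose rows were quantized separately per column
+    #               block of width block_cols
+    # local_scales: (C // block_cols, R) per-row scale of each column block
+    R, C = x_q.shape
+    global_scale = local_scales.amax(0)                    # (R,) unified per-row scale
+    x = x_q.float()
+    for j in range(C // block_cols):
+        ratio = (local_scales[j] / global_scale)[:, None]  # <= 1, stays within FP8 range
+        x[:, j * block_cols:(j + 1) * block_cols] *= ratio
+    return x.to(x_q.dtype), global_scale
+
+
+# the kernel below applies the same local-to-global rescale per expert, tile by tile and in place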
+@triton.jit +def _batch_requant_kernel(x_ptr, scale_ptr, scales_ptr, + count_ptr, + N, + H: tl.constexpr, + W: tl.constexpr, + E: tl.constexpr + ): + eid = tl.program_id(axis=0) + rid = tl.program_id(axis=1) + cid = tl.program_id(axis=2) + max_block = tl.num_programs(axis=2) + + count = tl.load(count_ptr + eid) + round_count = tl.cdiv(count, 32) * 32 + if cid >= tl.cdiv(round_count, W): + return + + round_off = tl.sum(tl.where(tl.arange(0, E) < eid, + tl.cdiv(tl.load(count_ptr + tl.arange(0, E)), + 32) * 32, 0)) + + offs = round_off * N + rid * H * round_count + cid * W + tl.arange(0, H)[:, + None] * round_count + tl.arange( + 0, W)[None, :] + global_scale = tl.load(scale_ptr + eid * N + rid * H + tl.arange(0, H)) + # scales is stored with column-major format + local_scale = tl.load( + scales_ptr + max_block * N * eid + cid * N + rid * H + tl.arange(0, H)) + x = tl.load(x_ptr + offs).to(tl.float32) + rescale = local_scale / tl.maximum(global_scale, 1e-30) + x = x * rescale[:, None] + tl.store(x_ptr + offs, x) + + +# used in routed experts +def triton_batch_weighted_silu_and_smooth_quant_backward(g, x, weight, + counts, + smooth_scale=None, + transpose_smooth_scale=None, + splits=None, + reverse=True, + round_scale=False): + """ + + """ + assert round_scale + M, N = x.shape + n = N // 2 + n_expert = counts.shape[0] + assert N <= 8192 and 8192 % N == 0 + assert splits is not None, 'batch mode need splits to launch kernels' + + device = x.device + + accums = torch.cumsum(counts, 0) + + dx = torch.empty((M, N), device=device, dtype=torch.float8_e4m3fn) + + dx_scale = torch.empty((M,), device=device, dtype=torch.float32) + + dw = torch.empty_like(weight) + T = 32 + B = 32 + assert n % B == 0 and T == 32 + max_block = triton.cdiv(max(splits), T) + s = sum([(x + 31) // 32 for x in splits]) * 32 + transpose_dx = torch.empty((N * s,), device=device, + dtype=torch.float8_e4m3fn) + + if s == 0: + transpose_dx_scale = torch.zeros((n_expert, N), device=device, + dtype=torch.float32) + return dx, dx_scale, dw, transpose_dx, transpose_dx_scale + else: + transpose_dx_scales = torch.zeros((n_expert, max_block, N), + device=device, dtype=torch.bfloat16) + + grid = (n_expert, max_block) + batch_weighted_silu_and_smooth_quant_backward_kernel[grid]( + g, + x, + weight, + smooth_scale, + transpose_smooth_scale, + counts, + accums, + dx, + dx_scale, + transpose_dx, + transpose_dx_scales, + dw, + n, + T, + B, + n_expert, + reverse, + round_scale, + num_stages=5, + num_warps=4 + ) + transpose_dx_scale = transpose_dx_scales.amax(1).float() + grid = (n_expert, N // B, max_block) + _batch_requant_kernel[grid](transpose_dx, transpose_dx_scale, + transpose_dx_scales, + counts, + N, + B, + T, + n_expert, + num_stages=3, + num_warps=2) + + return dx, dx_scale, dw, transpose_dx, transpose_dx_scale + diff --git a/tests/test_channel_quant.py b/tests/test_channel_quant.py new file mode 100644 index 0000000..c59fe5e --- /dev/null +++ b/tests/test_channel_quant.py @@ -0,0 +1,47 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. 
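+Unit tests for the row-wise (per-token) FP8 quantization kernels in linghe.quant.channel.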
+""" + +import torch + +from linghe.quant.channel.channel import (triton_deprecated_tokenwise_row_quant, + triton_row_quant, + triton_tokenwise_row_quant) +from linghe.tools.benchmark import benchmark_func +from linghe.tools.util import (output_check, + torch_row_quant) + + +def test_row_quant(M=4096, N=4096, round_scale=True, bench=False): + device = 'cuda:0' + dtype = torch.bfloat16 + x = torch.randn((M, N), dtype=dtype, device=device) ** 3 + + x_q_ref, x_scale_ref = torch_row_quant(x, round_scale=round_scale) + + x_q, x_scale = triton_row_quant(x, round_scale=round_scale) + output_check(x_q_ref.float(), x_q.float(), mode='data') + output_check(x_scale_ref, x_scale, mode='scale') + + x_q, x_scale = triton_tokenwise_row_quant(x, round_scale=round_scale) + output_check(x_q_ref.float(), x_q.float(), mode='data') + output_check(x_scale_ref, x_scale, mode='scale') + + if bench: + ref_time = benchmark_func(torch_row_quant, x, n_repeat=100, + ref_bytes=M * N * 3) + benchmark_func(triton_row_quant, x, n_repeat=100, ref_bytes=M * N * 3, + ref_time=ref_time) + benchmark_func(triton_deprecated_tokenwise_row_quant, x, n_repeat=100, + ref_bytes=M * N * 3, ref_time=ref_time) + benchmark_func(triton_tokenwise_row_quant, x, n_repeat=100, + ref_bytes=M * N * 3, ref_time=ref_time) + + +if __name__ == '__main__': + test_row_quant(M=4096, N=4096, round_scale=False) + test_row_quant(M=4090, N=4096, round_scale=True) + test_row_quant(M=4096, N=8192, round_scale=True) + test_row_quant(M=3456, N=2048, round_scale=True) + test_row_quant(M=1, N=2048, round_scale=True) diff --git a/tests/test_gemm.py b/tests/test_fp32_gemm.py similarity index 100% rename from tests/test_gemm.py rename to tests/test_fp32_gemm.py diff --git a/tests/test_gather.py b/tests/test_gather.py index bbbb58b..9b5c012 100644 --- a/tests/test_gather.py +++ b/tests/test_gather.py @@ -5,46 +5,48 @@ import torch -from linghe.tools.benchmark import benchmark_func -from linghe.tools.util import (output_check, - torch_make_indices) from linghe.utils.gather import (triton_make_row_id_map, triton_make_row_id_map_and_indices, triton_index_select, - triton_permute_with_mask_map) + triton_permute_with_mask_map, + triton_smooth_permute_with_indices, + triton_smooth_permute_with_mask_map, + triton_smooth_weighted_permute_with_indices, + triton_batch_transpose_smooth_permute_with_indices) +from linghe.tools.util import (output_check, + torch_batch_smooth_quant, + torch_make_indices, + torch_smooth_quant) +from linghe.tools.benchmark import benchmark_func def torch_index_select(y, indices): output = y.index_select(0, indices) return output - def torch_select_with_padded_map_mask(y, mask_map, out_tokens): E = mask_map.shape[1] if y.ndim > 1: - output = torch.zeros((out_tokens, y.shape[1]), dtype=y.dtype, - device=y.device) + output = torch.zeros((out_tokens, y.shape[1]), dtype=y.dtype, device=y.device) else: - output = torch.zeros((out_tokens,), dtype=y.dtype, device=y.device) + output = torch.zeros((out_tokens, ), dtype=y.dtype, device=y.device) for i in range(E): - indices = mask_map[:, i] - src_idx = torch.nonzero(indices > -1) + indices = mask_map[:,i] + src_idx = torch.nonzero(indices>-1) dst_idx = indices[src_idx] output[dst_idx] = y[src_idx] return output - def torch_ravel_with_padded_map_mask(y, mask_map, out_tokens): E = mask_map.shape[1] - output = torch.zeros((out_tokens,), dtype=y.dtype, device=y.device) + output = torch.zeros((out_tokens, ), dtype=y.dtype, device=y.device) for i in range(E): - indices = mask_map[:, i] - src_idx = 
torch.nonzero(indices > -1) + indices = mask_map[:,i] + src_idx = torch.nonzero(indices>-1) dst_idx = indices[src_idx] - output[dst_idx] = y[src_idx, i] + output[dst_idx] = y[src_idx,i] return output - def torch_fp16_index_select(x, scales, indices): return x.index_select(0, indices), scales.index_select(0, indices) @@ -53,6 +55,78 @@ def torch_scatter(logits, routing_map, weights): logits[routing_map] = weights +# optional dequant and smooth and quant +def torch_smooth_permute_with_indices(grad_data, grad_scale, indices, + smooth_scales, + token_count_per_expert_list, + round_scale=True): + M, N = grad_data.shape + if grad_scale is not None: + B = grad_data.shape[1] // ( + 1 if grad_scale.ndim == 1 else grad_scale.shape[1]) + q_refs = [] + scale_refs = [] + s = 0 + for i, c in enumerate(token_count_per_expert_list): + c = token_count_per_expert_list[i] + data_slice = grad_data.view(torch.uint8)[indices[s:s + c]].view( + torch.float8_e4m3fn) + if grad_scale is not None: + scale_slice = grad_scale[indices[s:s + c]] + y_smooth = (data_slice.float().view(c, N // B, B) * scale_slice[:, :, + None]).view(c, N) / \ + smooth_scales[i] + else: + y_smooth = data_slice.float() / smooth_scales[i] + scale = y_smooth.abs().amax(1) / 448 + if round_scale: + scale = torch.exp2(torch.ceil(torch.log2(scale))) + scale_refs.append(scale) + q = (y_smooth / scale[:, None]).to(torch.float8_e4m3fn) + q_refs.append(q.view(torch.uint8)) + s += c + q_ref = torch.cat(q_refs, 0).view(torch.float8_e4m3fn) + scale_ref = torch.cat(scale_refs, 0) + + return q_ref, scale_ref + + + +# desmooth,dequant, gather, pad, transpose, smooth, quant +def torch_batch_transpose_smooth_permute_with_indices(x_q, x_scale, org_smooth_scale, smooth_scales, + indices, + token_count_per_expert_list, + round_scale=True): + M, DIM = x_q.shape + q_refs = [] + scale_refs = [] + s = 0 + for i, c in enumerate(token_count_per_expert_list): + c = token_count_per_expert_list[i] + if c == 0: + y_scale = torch.zeros((DIM,), dtype=torch.float32, device=x_q.device) + scale_refs.append(y_scale.view(-1)) + continue + N = (c + 31)//32 * 32 + data_slice = x_q[indices[s:s + c]] + if x_scale is not None: + scale_slice = x_scale[indices[s:s + c]] + y = data_slice.float() * scale_slice[:, None] * org_smooth_scale + else: + y = data_slice.float() + smooth_scale = smooth_scales[s:s+c] + if N > c: + y = torch.nn.functional.pad(y, (0,0,0, N-c)) + smooth_scale = torch.nn.functional.pad(smooth_scale, (0, N-c)) + y_q, y_scale, y_max= torch_smooth_quant(y.t().contiguous(), smooth_scale, reverse=True, round_scale=round_scale) + scale_refs.append(y_scale.view(-1)) + q_refs.append(y_q.view(-1)) + s += c + q_ref = torch.cat(q_refs, 0) + scale_ref = torch.stack(scale_refs, 0) + return q_ref, scale_ref + + def test_make_id_map(M=4098, n_experts=32, topk=2, bias=0.0, bench=False): dtype = torch.bfloat16 device = 'cuda:0' @@ -64,6 +138,7 @@ def test_make_id_map(M=4098, n_experts=32, topk=2, bias=0.0, bench=False): token_count_per_expert_list = token_count_per_expert.tolist() out_tokens = sum(token_count_per_expert_list) + row_id_map_output = triton_make_row_id_map(mask_map) assert (row_id_map - row_id_map_output).abs().sum().item() == 0 @@ -71,6 +146,77 @@ def test_make_id_map(M=4098, n_experts=32, topk=2, bias=0.0, bench=False): assert (row_id_indices - indices).abs().sum().item() == 0 + +def test_triton_smooth_permute_with_indices(M=4096, N=4096, n_experts=256, + topk=8, bench=False): + device = 'cuda:0' + y = torch.randn((M, N), dtype=torch.bfloat16, device=device) + logits = 
torch.randn((M, n_experts), dtype=torch.float32, device=device) + smooth_scales = 1 + 10 * torch.rand((n_experts, N), device=device, + dtype=torch.float32) + + probs, mask_map, token_count_per_expert, indices, row_id_map = torch_make_indices( + logits, topk=topk, bias=0.0) + + y_q, y_scale = triton_smooth_permute_with_indices(y, smooth_scales, + token_count_per_expert, + indices, reverse=False, + round_scale=False) + + y_q_ref, y_scale_ref = torch_batch_smooth_quant(y, smooth_scales, indices, + token_count_per_expert, + reverse=False, + round_scale=False) + + output_check(y_q_ref.float(), y_q.float(), 'data') + output_check(y_scale_ref.float(), y_scale.float(), 'scale') + + if bench: + n_repeat = 100 + benchmark_func(torch_index_select, y, indices, n_repeat=n_repeat) + benchmark_func(triton_smooth_permute_with_indices, y, smooth_scales, + token_count_per_expert, indices, reverse=False, + round_scale=False, n_repeat=n_repeat) + + +def test_triton_smooth_weighted_permute_with_indices(M=4096, N=4096, + n_experts=256, + topk=8, + round_scale=True, + bench=False): + device = 'cuda:0' + reverse = True + y = torch.randn((M, N), dtype=torch.bfloat16, device=device) + logits = torch.randn((M, n_experts), dtype=torch.float32, device=device) + smooth_scales = 1 + 10 * torch.rand((n_experts, N), device=device, + dtype=torch.float32) + probs, mask_map, token_count_per_expert, indices, row_id_map = torch_make_indices( + logits, topk=topk, bias=0.0) + + tokens = torch.randn((indices.shape[0], N), dtype=torch.bfloat16, + device=device) + y_q, y_scale, y_sum = triton_smooth_weighted_permute_with_indices( + y, tokens, smooth_scales, token_count_per_expert, indices, x_q=None, + x_scale=None, reverse=reverse, round_scale=round_scale) + + y_q_ref, y_scale_ref = torch_batch_smooth_quant(y, smooth_scales, indices, + token_count_per_expert, + reverse=reverse, + round_scale=round_scale) + sum_ref = (tokens * y[indices]).sum(1) + + output_check(y_q_ref.float(), y_q.float(), 'data') + output_check(y_scale_ref.float(), y_scale.float(), 'scale') + output_check(sum_ref.float(), y_sum.float(), 'sum') + + if bench: + n_repeat = 100 + benchmark_func(triton_smooth_weighted_permute_with_indices, + y, tokens, smooth_scales, token_count_per_expert, + indices, reverse=reverse, round_scale=round_scale, + n_repeat=n_repeat) + + def test_triton_permute_with_mask_map(M=4096, N=4096, n_experts=256, topk=8, bench=False): device = 'cuda:0' @@ -100,19 +246,15 @@ def test_triton_permute_with_mask_map(M=4096, N=4096, n_experts=256, topk=8, output_check(scale_out_ref, scale_out, 'scale_out') output_check(probs_out_ref, probs_out, 'prob_out') - nzs = torch.sum(row_id_map >= 0, 0) - bias = torch.cumsum((nzs + 15) // 16 * 16 - nzs, 0) + nzs = torch.sum(row_id_map>=0, 0) + bias = torch.cumsum((nzs + 15)//16*16 - nzs, 0) row_id_map_clone = row_id_map.clone().detach() row_id_map_clone[:, 1:] += bias[:-1] - round_row_id_map = torch.where(row_id_map >= 0, row_id_map_clone, -1) - padded_out_tokens = sum( - [(x + 15) // 16 * 16 for x in token_count_per_expert.tolist()]) - x_out_ref = torch_select_with_padded_map_mask(x, round_row_id_map, - padded_out_tokens) - scale_out_ref = torch_select_with_padded_map_mask(scales, round_row_id_map, - padded_out_tokens) - prob_out_ref = torch_ravel_with_padded_map_mask(probs, round_row_id_map, - padded_out_tokens) + round_row_id_map = torch.where(row_id_map>=0, row_id_map_clone, -1) + padded_out_tokens = sum([(x+15)//16*16 for x in token_count_per_expert.tolist()]) + x_out_ref = 
torch_select_with_padded_map_mask(x, round_row_id_map, padded_out_tokens) + scale_out_ref = torch_select_with_padded_map_mask(scales, round_row_id_map, padded_out_tokens) + prob_out_ref = torch_ravel_with_padded_map_mask(probs, round_row_id_map, padded_out_tokens) x_out, scale_out, probs_out = triton_permute_with_mask_map(x, scales, probs, round_row_id_map, padded_out_tokens, @@ -128,11 +270,9 @@ def test_triton_permute_with_mask_map(M=4096, N=4096, n_experts=256, topk=8, ref_time = benchmark_func(torch_fp16_index_select, x, scales, indices, n_repeat=n_repeat, ref_bytes=ref_bytes) benchmark_func(triton_index_select, x, indices, scale=scales, - n_repeat=n_repeat, ref_time=ref_time, - ref_bytes=ref_bytes) + n_repeat=n_repeat, ref_time=ref_time, ref_bytes=ref_bytes) benchmark_func(triton_permute_with_mask_map, x, scales, probs, - row_id_map, out_tokens, contiguous=True, - n_repeat=n_repeat, + row_id_map, out_tokens, contiguous=True, n_repeat=n_repeat, ref_time=ref_time, ref_bytes=ref_bytes) benchmark_func(triton_permute_with_mask_map, x, scales, probs, row_id_map, out_tokens, contiguous=False, @@ -141,11 +281,158 @@ def test_triton_permute_with_mask_map(M=4096, N=4096, n_experts=256, topk=8, ref_time=ref_time, ref_bytes=ref_bytes) +def test_triton_smooth_permute_with_mask_map(M=4096, N=4096, n_experts=32, + topk=8, round_scale=True, + bench=False): + device = 'cuda:0' + dtype = torch.bfloat16 + smooth_scales = 1 + 10 * torch.rand((n_experts, N), device=device, + dtype=torch.float32) + logits = torch.randn((M, n_experts), dtype=torch.float32, + device=device) ** 3 + probs, mask_map, token_count_per_expert, indices, row_id_map = torch_make_indices( + logits, topk=topk, bias=-0.01) + + token_count_per_expert_list = token_count_per_expert.tolist() + out_tokens = sum(token_count_per_expert_list) + + B = 128 + grad_data = torch.randn((M, N), dtype=torch.bfloat16, device=device).to( + torch.float8_e4m3fn) + grad_scale = 1 + torch.rand((M, N // B), dtype=torch.float32, device=device) + q_ref, scale_ref = torch_smooth_permute_with_indices(grad_data, grad_scale, + indices, smooth_scales, + token_count_per_expert_list, + round_scale=round_scale) + y_q, y_scale = triton_smooth_permute_with_indices(grad_data, + grad_scale, + smooth_scales, + token_count_per_expert, + indices, + x_q=None, + x_scale=None, + reverse=False, + round_scale=round_scale) + output_check(q_ref.float(), y_q.float(), 'data') + output_check(scale_ref.float(), y_scale.float(), 'scale') + + + + # smooth_scale_ptrs = torch.tensor([x.data_ptr() for x in torch.split(smooth_scales,1)], device=device) + permuted_data, permuted_scale = triton_smooth_permute_with_mask_map( + grad_data, row_id_map, grad_scale, M, n_experts, out_tokens, N, + smooth_scales, reverse=False, round_scale=round_scale) + output_check(q_ref.float(), permuted_data.float(), 'smoothed.data') + output_check(scale_ref.float(), permuted_scale.float(), 'smoothed.scale') + + q_ref, scale_ref = torch_smooth_permute_with_indices(grad_data, None, + indices, smooth_scales, + token_count_per_expert_list, + round_scale=round_scale) + permuted_data, permuted_scale = triton_smooth_permute_with_mask_map( + grad_data, row_id_map, None, M, n_experts, out_tokens, N, + smooth_scales, reverse=False, round_scale=round_scale) + output_check(q_ref.float(), permuted_data.float(), 'smoothed.data') + output_check(scale_ref.float(), permuted_scale.float(), 'smoothed.scale') + + + + if bench: + benchmark_func(triton_smooth_permute_with_indices, grad_data, + grad_scale, smooth_scales, 
token_count_per_expert, + indices, round_scale=round_scale, n_repeat=100, + ref_bytes=out_tokens * N * 2) + benchmark_func(triton_smooth_permute_with_mask_map, grad_data, + row_id_map, grad_scale, M, n_experts, out_tokens, N, + smooth_scales, reverse=False, round_scale=round_scale, + n_repeat=100, ref_bytes=out_tokens * N * 2) + + + + +def test_triton_batch_transpose_smooth_permute_with_indices(M=1024, N=2048, n_experts=32, topk=8, bench=False): + + device = 'cuda:0' + if True: + logits = torch.randn((M, n_experts), dtype=torch.float32, + device=device) ** 3 + logits[:,0] -= 1000 + logits[:,2] -= 100 + probs, mask_map, token_count_per_expert, indices, row_id_map = torch_make_indices( + logits, topk=topk, bias=-0.01) + + token_count_per_expert_list = token_count_per_expert.tolist() + out_tokens = sum(token_count_per_expert_list) + + x = torch.randn((M, N), dtype=torch.bfloat16, device=device).to( + torch.float8_e4m3fn) + scale = torch.rand((M,), dtype=torch.float32, device=device) + 0.1 + org_smooth_scale = torch.rand((N,), dtype=torch.float32, device=device) + 0.1 + smooth_scales = torch.rand((out_tokens, ), dtype=torch.float32, device=device) + 0.1 + else: + # torch.save({"x":x, "scale":scale, "org_smooth_scale":org_smooth_scale,"smooth_scales":smooth_scales, "indices":indices, "token_count_per_expert":token_count_per_expert,"splits":splits}, '/tmp/debug.bin') + state = torch.load('/tmp/debug.bin') + x = state['x'] + scale = state['scale'] + org_smooth_scale = state['org_smooth_scale'] + smooth_scales = state['smooth_scales'] + indices = state['indices'] + token_count_per_expert = state['token_count_per_expert'] + token_count_per_expert_list = state['splits'] + out_tokens = sum(token_count_per_expert_list) + + + x_q_ref, x_scale_ref = torch_batch_transpose_smooth_permute_with_indices(x, scale, org_smooth_scale, smooth_scales, + indices, + token_count_per_expert_list, + round_scale=True) + + x_q, x_scale = triton_batch_transpose_smooth_permute_with_indices(x, scale, org_smooth_scale, smooth_scales, + indices, + token_count_per_expert, token_count_per_expert_list, + round_scale=True) + output_check(x_q_ref.float(), x_q.float(), 'smoothed.data') + output_check(x_scale_ref.float(), x_scale.float(), 'smoothed.scale') + + + x_q_ref, x_scale_ref = torch_batch_transpose_smooth_permute_with_indices(x, None, None, smooth_scales, + indices, + token_count_per_expert_list, + round_scale=True) + + x_q, x_scale = triton_batch_transpose_smooth_permute_with_indices(x, None, None, smooth_scales, + indices, + token_count_per_expert, token_count_per_expert_list, + round_scale=True) + output_check(x_q_ref.float(), x_q.float(), 'bf16.data') + output_check(x_scale_ref.float(), x_scale.float(), 'bf16.scale') + + + if bench: + benchmark_func(torch_batch_transpose_smooth_permute_with_indices, x, scale, org_smooth_scale, smooth_scales, + indices, + token_count_per_expert_list, + round_scale=True, + ref_bytes=out_tokens * N * 2) + benchmark_func(triton_batch_transpose_smooth_permute_with_indices, x, scale, org_smooth_scale, smooth_scales, + indices, + token_count_per_expert, token_count_per_expert_list, + round_scale=True, + ref_bytes=out_tokens * N * 2) + + if __name__ == '__main__': test_make_id_map(M=4098, n_experts=32, topk=2, bias=0.0, bench=False) - test_triton_permute_with_mask_map(M=16384, N=2048, n_experts=32, topk=8, - bench=False) - test_triton_permute_with_mask_map(M=8192, N=4096, n_experts=32, topk=8, - bench=False) - test_triton_permute_with_mask_map(M=7628, N=2048, n_experts=32, topk=8, - bench=False) 
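+    # note: the permute-with-mask-map tests below pad each expert's token count to a
+    # multiple of 16 ((x + 15) // 16 * 16, e.g. 1000 -> 1008), while the batch transpose
+    # tests pad to a multiple of 32 ((c + 31) // 32 * 32, e.g. 1000 -> 1024)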
+ test_triton_smooth_permute_with_indices(M=4096, N=4096, n_experts=32, + topk=8) + test_triton_permute_with_mask_map(M=16384, N=2048, n_experts=32, topk=8, bench=False) + test_triton_permute_with_mask_map(M=8192, N=4096, n_experts=32, topk=8, bench=False) + test_triton_permute_with_mask_map(M=7628, N=2048, n_experts=32, topk=8, bench=False) + + test_triton_smooth_permute_with_mask_map(M=4096, N=4096, n_experts=32, + topk=8) + test_triton_smooth_permute_with_mask_map(M=7628, N=2048, n_experts=32, + topk=8) + + test_triton_batch_transpose_smooth_permute_with_indices(M=16384, N=2048, n_experts=32, topk=2, bench=False) + test_triton_batch_transpose_smooth_permute_with_indices(M=8192, N=4096, n_experts=32, topk=2, bench=False) diff --git a/linghe/quant/hadamard/__init__.py b/tests/test_hadamard_quant.py similarity index 100% rename from linghe/quant/hadamard/__init__.py rename to tests/test_hadamard_quant.py diff --git a/tests/test_norm.py b/tests/test_norm.py index 4ed2f8e..3c52baf 100644 --- a/tests/test_norm.py +++ b/tests/test_norm.py @@ -6,14 +6,16 @@ import torch import torch.nn.functional as F -from linghe.tools.benchmark import benchmark_func +from linghe.utils.norm import (triton_rms_norm_and_smooth_quant_forward, + triton_rms_norm_and_block_quant_forward, + triton_rms_norm_backward, + triton_rms_norm_forward, + triton_group_norm_gate_forward, + triton_group_norm_gate_backward) from linghe.tools.util import (output_check, - torch_group_quant) -from linghe.utils.norm import (triton_rms_norm_and_block_quant_forward, - triton_rms_norm_backward, - triton_rms_norm_forward, - triton_group_norm_gate_forward, - triton_group_norm_gate_backward) + torch_smooth_quant, + torch_group_quant) +from linghe.tools.benchmark import benchmark_func def torch_rms_forward(x, weight): @@ -50,12 +52,11 @@ def torch_rms_backward(x, weight, dy): return x.grad, rmsnorm.weight.grad -def torch_rms_and_quant_forward(x, weight, smooth_scale=None, - round_scale=False): +def torch_rms_and_smooth_quant_forward(x, weight, smooth_scale=None, + round_scale=False): x = x.float() weight = weight.float() - if smooth_scale is not None: - smooth_scale = smooth_scale.float() + smooth_scale = smooth_scale.float() N = x.shape[-1] rmsnorm = torch.nn.RMSNorm( normalized_shape=N, @@ -66,20 +67,15 @@ def torch_rms_and_quant_forward(x, weight, smooth_scale=None, with torch.no_grad(): rmsnorm.weight.copy_(weight) y = rmsnorm(x) - # blockwise - y_q, y_scale = torch_group_quant(y, round_scale=round_scale) - yt_q, yt_scale = torch_group_quant(y.t(), round_scale=round_scale) - return y_q, y_scale, yt_q, yt_scale + # smooth + y_q, y_scale, y_maxs = torch_smooth_quant(y, smooth_scale, reverse=False, + round_scale=round_scale) + return y_q, y_scale, y_maxs -# backward of rms is bf16, do not need quant -def torch_rms_and_quant_backward(x, weight, dy, smooth_scale=None, - round_scale=False): +def torch_rms_and_block_quant_forward(x, weight, round_scale=False): x = x.float() weight = weight.float() - dy = dy.float() - if smooth_scale is not None: - smooth_scale = smooth_scale.float() N = x.shape[-1] rmsnorm = torch.nn.RMSNorm( normalized_shape=N, @@ -89,15 +85,11 @@ def torch_rms_and_quant_backward(x, weight, dy, smooth_scale=None, ) with torch.no_grad(): rmsnorm.weight.copy_(weight) - x = x.clone().detach().requires_grad_() y = rmsnorm(x) - y.backward(gradient=dy) - dx = x.grad - dw = rmsnorm.weight.grad - dx_q, dx_scale = torch_group_quant(dx, round_scale=round_scale) - dxt_q, dxt_scale = torch_group_quant(dx.t(), round_scale=round_scale) - - 
return dx_q, dx_scale, dw, dxt_q, dxt_scale + # blockwise + y_q, y_scale = torch_group_quant(y, round_scale=round_scale) + yt_q, yt_scale = torch_group_quant(y.t(), round_scale=round_scale) + return y_q, y_scale, yt_q, yt_scale @torch.compile @@ -153,6 +145,40 @@ def test_rmsnorm(M=4096, N=4096, bench=False): ref_bytes=M * N * 3) +def test_rmsnorm_and_smooth_quant(M=4096, N=4096, bench=False): + dtype = torch.bfloat16 + device = 'cuda:0' + + x = torch.randn(M, N, dtype=dtype, requires_grad=True, device=device) + weight = torch.randn(N, dtype=dtype, requires_grad=True, device=device) + smooth_scale = torch.rand(N, dtype=torch.float32, requires_grad=False, + device=device) + 0.1 + calibrate = True + + # smooth + q_ref, scale_ref, maxs_ref = torch_rms_and_smooth_quant_forward(x, weight, + smooth_scale=smooth_scale, + round_scale=True) + + q, scale, maxs, rms = triton_rms_norm_and_smooth_quant_forward(x, weight, + smooth_scale=smooth_scale, + calibrate=calibrate, + output_rms=True, + round_scale=True) + output_check(q_ref, q, mode="smooth.data") + output_check(scale_ref, scale, mode='smooth.scale') + if calibrate: + output_check(maxs_ref, maxs, mode="smooth.maxs") + + if bench: + benchmark_func(triton_rms_norm_and_smooth_quant_forward, x, weight, + smooth_scale=smooth_scale, + calibrate=True, + round_scale=True, + output_rms=True, + ref_bytes=M * N * 3) + + def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): dtype = torch.bfloat16 device = 'cuda:0' @@ -161,13 +187,13 @@ def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): weight = torch.randn(N, dtype=dtype, requires_grad=True, device=device) # blockwise - q_ref, scale_ref, qt_ref, scale_t_ref = torch_rms_and_quant_forward(x, - weight, - smooth_scale=None, - round_scale=True) + q_ref, scale_ref, qt_ref, scale_t_ref = torch_rms_and_block_quant_forward(x, + weight, + round_scale=True) q, scale, rms, q_t, scale_t = triton_rms_norm_and_block_quant_forward(x, weight, round_scale=True, + output_rms=True, output_mode=2) output_check(q_ref, q, mode="2.block.data") output_check(scale_ref.t(), scale, mode='2.block.scale') @@ -176,6 +202,7 @@ def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): q, scale, _, _, _ = triton_rms_norm_and_block_quant_forward(x, weight, round_scale=True, + output_rms=True, output_mode=0) output_check(q_ref, q, mode="0.block.data") output_check(scale_ref.t(), scale, mode='0.block.scale') @@ -183,6 +210,7 @@ def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): _, _, _, q_t, scale_t = triton_rms_norm_and_block_quant_forward(x, weight, round_scale=True, rms=rms, + output_rms=True, output_mode=1) output_check(qt_ref, q_t, mode='0.block.t_data') output_check(scale_t_ref.t(), scale_t, mode="0.block.t_scale") @@ -190,16 +218,19 @@ def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): if bench: benchmark_func(triton_rms_norm_and_block_quant_forward, x, weight, round_scale=True, + output_rms=True, output_mode=0, ref_bytes=M * N * 3) benchmark_func(triton_rms_norm_and_block_quant_forward, x, weight, round_scale=True, + output_rms=True, output_mode=1, ref_bytes=M * N * 3) benchmark_func(triton_rms_norm_and_block_quant_forward, x, weight, round_scale=True, + output_rms=True, output_mode=2, ref_bytes=M * N * 4) @@ -249,9 +280,14 @@ def test_group_norm_gate_quant(bs=1, length=4096, dim=4096, group_size=4, test_rmsnorm(M=16384, N=2048, bench=False) test_rmsnorm(M=8192, N=4096, bench=False) test_rmsnorm(M=4096, N=8192, bench=False) + test_rmsnorm_and_smooth_quant(M=16384, N=2048, bench=False) + 
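+    # with calibrate=True the fused kernel also returns the per-channel absolute maxima of
+    # the normalized activations (taken before smoothing), presumably for refreshing smooth_scale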
test_rmsnorm_and_smooth_quant(M=8192, N=4096, bench=False) + test_rmsnorm_and_smooth_quant(M=4096, N=8192, bench=False) test_rmsnorm_and_block_quant(M=128, N=2048, bench=False) test_rmsnorm_and_block_quant(M=8192, N=4096, bench=False) test_group_norm_gate_quant(bs=2, length=4096, dim=2048, group_size=4, bench=True) test_group_norm_gate_quant(bs=1, length=4096, dim=4096, group_size=4, bench=True) + + diff --git a/tests/test_silu.py b/tests/test_silu.py index e0e4e24..2375b90 100644 --- a/tests/test_silu.py +++ b/tests/test_silu.py @@ -10,12 +10,18 @@ import torch from linghe.tools.benchmark import benchmark_func -from linghe.utils.silu import ( - triton_batch_weighted_silu_and_block_quant_backward, - triton_batch_weighted_silu_and_block_quant_forward, - triton_silu_and_block_quant_backward, - triton_silu_and_block_quant_forward) -from linghe.tools.util import output_check, torch_group_quant +from linghe.utils.silu import (triton_weighted_silu_forward, + triton_weighted_silu_backward, + triton_batch_weighted_silu_and_smooth_quant_backward, + triton_batch_weighted_silu_and_smooth_quant_forward, + triton_batch_weighted_silu_and_block_quant_backward, + triton_batch_weighted_silu_and_block_quant_forward, + triton_silu_and_smooth_quant_backward, + triton_silu_and_smooth_quant_forward, + triton_silu_and_block_quant_backward, + triton_silu_and_block_quant_forward) +from linghe.tools.util import output_check, torch_smooth_quant, \ + torch_group_quant def torch_silu(x): @@ -40,6 +46,24 @@ def torch_weighted_silu_backward(dy, x, weight): return x.grad, weight.grad +def torch_silu_and_smooth_quant_forward(x, smooth_scale=None, round_scale=True): + M, N = x.shape + x = x.float() + x1, x2 = torch.split(x, N // 2, dim=1) + y = torch.sigmoid(x1) * x1 * x2 + + # smooth + y_q, y_scale, x_maxs = torch_smooth_quant(y, smooth_scale, reverse=False, + round_scale=round_scale) + # y_smooth = y / smooth_scale + # x_maxs = y.abs().float().amax(0) + # y_scale = y_smooth.abs().amax(1) / 448 + # if round_scale: + # y_scale = torch.exp2(torch.ceil(torch.log2(y_scale))) + # y_q = (y_smooth / y_scale[:, None]).to(torch.float8_e4m3fn) + return y_q, y_scale, x_maxs + + def torch_silu_and_block_quant_forward(x, round_scale=True): M, N = x.shape x = x.float() @@ -48,9 +72,26 @@ def torch_silu_and_block_quant_forward(x, round_scale=True): # blockwise y_q, y_scale = torch_group_quant(y, round_scale=round_scale) yt_q, yt_scale = torch_group_quant(y.t(), round_scale=round_scale) - x_maxs = None - return y_q, y_scale, x_maxs, yt_q, yt_scale + return y_q, y_scale, yt_q, yt_scale + + +def torch_silu_and_smooth_quant_backward(grad, x, smooth_scale=None, + transpose_smooth_scale=None, + round_scale=True, reverse=True): + grad = grad.float() + x = x.float().detach().clone().requires_grad_() + y = torch_silu(x) + y.backward(gradient=grad) + dx = x.grad + + q, dx_scale, ms = torch_smooth_quant(dx, smooth_scale, reverse=reverse, + round_scale=round_scale) + yt_q, yt_scale, ms = torch_smooth_quant(dx.t().contiguous(), + transpose_smooth_scale, + reverse=reverse, + round_scale=round_scale) + return q, dx_scale, yt_q, yt_scale def torch_silu_and_block_quant_backward(grad, x, round_scale=True): @@ -66,11 +107,47 @@ def torch_silu_and_block_quant_backward(grad, x, round_scale=True): return q, dx_scale, yt_q, yt_scale +def torch_batch_weighted_silu_and_smooth_quant_forward(xs, weight, + counts, + smooth_scales=None, + round_scale=True, + reverse=False): + counts = counts.tolist() + N = xs.shape[1] + if sum(counts) == 0: + device = xs.device + 
qs = torch.empty((0, N // 2), device=device, dtype=torch.float8_e4m3fn) + scales = torch.empty((0,), device=device, dtype=torch.float32) + maxs = torch.zeros((len(counts), N), device=device, dtype=torch.float32) + return qs, scales, maxs + + xs = xs.float() + weight = weight.float() + smooth_scales = smooth_scales.float() + + qs = [] + scales = [] + maxs = [] + s = 0 + for i, c in enumerate(counts): + x = xs[s:s + c] + y = torch_weighted_silu(x, weight[s:s + c]) + q, scale, ms = torch_smooth_quant(y, smooth_scales[i], reverse=reverse, + round_scale=round_scale) + qs.append(q) + scales.append(scale) + maxs.append(ms) + + s += c + qs = torch.cat(qs, 0) + scales = torch.cat(scales, 0) + maxs = torch.cat(maxs, 0) + return qs, scales, maxs + + def torch_batch_weighted_silu_and_block_quant_forward(xs, weight, counts, - smooth_scales=None, - round_scale=True, - reverse=False): + round_scale=True): counts = counts.tolist() N = xs.shape[1] if sum(counts) == 0: @@ -98,6 +175,7 @@ def torch_batch_weighted_silu_and_block_quant_forward(xs, weight, scales.append(scale.t().contiguous().view(-1)) qts.append(qt.view(-1)) qtscales.append(qtscale.t().contiguous().view(-1)) + s += c qs = torch.cat(qs, 0) scales = torch.cat(scales, 0) @@ -106,12 +184,64 @@ def torch_batch_weighted_silu_and_block_quant_forward(xs, weight, return qs, scales, qts, qtscales +def torch_batch_weighted_silu_and_smooth_quant_backward(grad_output, x, weight, + counts, + smooth_scales=None, + transpose_smooth_scale=None, + round_scale=True, + reverse=False): + if sum(counts) == 0: + device = x.device + N = x.shape[1] + dx_q = torch.empty((0, N), device=device, dtype=torch.float8_e4m3fn) + dx_scale = torch.empty((0,), device=device, dtype=torch.float32) + dw = torch.empty_like(weight) + qts = torch.empty((0,), device=device, dtype=torch.float8_e4m3fn) + qtscales = torch.zeros((N * len(counts),), device=device, + dtype=torch.float32) + return dx_q, dx_scale, dw, qts, qtscales + + grad_output = grad_output.float() + x = x.float() + weight = weight.float() + smooth_scales = smooth_scales.float() + transpose_smooth_scale = transpose_smooth_scale.float() + + dx, dw = torch_weighted_silu_backward(grad_output, x, weight) + qs = [] + scales = [] + qts = [] + qtscales = [] + s = 0 + for i, c in enumerate(counts): + q, scale, dx_max = torch_smooth_quant(dx[s:s + c], smooth_scales[i], + reverse=reverse, + round_scale=round_scale) + dxt = dx[s:s + c].t().contiguous() + dxt_s = transpose_smooth_scale[s:s + c] + padding_size = (c + 31) // 32 * 32 - c + if padding_size > 0: + dxt = torch.nn.functional.pad(dxt, (0, padding_size, 0, 0)) + dxt_s = torch.nn.functional.pad(dxt_s, (0, padding_size)) + qt, t_scale, dx_max = torch_smooth_quant(dxt, dxt_s, + reverse=reverse, + round_scale=round_scale) + + qs.append(q) + scales.append(scale) + qts.append(qt.view(-1)) + qtscales.append(t_scale.view(-1)) + s += c + dx_q = torch.cat(qs, 0) + dx_scale = torch.cat(scales, 0) + qts = torch.cat(qts, 0) + qtscales = torch.cat(qtscales, 0) + return dx_q, dx_scale, dw, qts, qtscales + + def torch_batch_weighted_silu_and_block_quant_backward(grad_output, x, weight, counts, - smooth_scales=None, - transpose_smooth_scale=None, - round_scale=True, - reverse=False): + round_scale=True): if sum(counts) == 0: device = x.device N = x.shape[1] @@ -126,10 +256,6 @@ def torch_batch_weighted_silu_and_block_quant_backward(grad_output, x, weight, grad_output = grad_output.float() x = x.float() weight = weight.float() - if smooth_scales is not None: - smooth_scales = 
smooth_scales.float() - if transpose_smooth_scale is not None: - transpose_smooth_scale = transpose_smooth_scale.float() dx, dw = torch_weighted_silu_backward(grad_output, x, weight) qs = [] @@ -145,6 +271,7 @@ def torch_batch_weighted_silu_and_block_quant_backward(grad_output, x, weight, scales.append(scale.t().contiguous().view(-1)) qts.append(qt.view(-1)) qtscales.append(qtscale.t().contiguous().view(-1)) + s += c dx_q = torch.cat(qs, 0) dx_scale = torch.cat(scales, 0) @@ -153,14 +280,106 @@ def torch_batch_weighted_silu_and_block_quant_backward(grad_output, x, weight, return dx_q, dx_scale, dw, qts, qtscales -def test_silu_and_block_quant(M=4096, N=4096, bench=False): +def test_weighted_silu(M=4096, N=4096, bench=False): x = torch.randn((M, N), dtype=torch.bfloat16, device='cuda:0') - x = (x * 10).clone().detach().requires_grad_() + x = (x ** 3 // 10).clone().detach().requires_grad_() + weight = torch.randn((M, 1), dtype=torch.bfloat16, device='cuda:0') grad_output = torch.randn((M, N // 2), dtype=torch.bfloat16, device='cuda:0') + ref_y = torch_weighted_silu(x, weight) + y = triton_weighted_silu_forward(x, weight) + output_check(ref_y, y, 'y') + + dx_ref, dw_ref = torch_weighted_silu_backward(grad_output, x, weight) + dx, dw = triton_weighted_silu_backward(grad_output, x, weight) + output_check(dx_ref, dx, 'dx') + output_check(dw_ref, dw, 'dw') + + if bench: + benchmark_func(triton_weighted_silu_forward, x, weight, n_repeat=100, + ref_bytes=M * N * 3) + benchmark_func(triton_weighted_silu_backward, grad_output, x, weight, + n_repeat=100, ref_bytes=M * N * 5) + + +def test_silu_and_smooth_quant(M=4096, N=4096, bench=False): + if True: + x = torch.randn((M, N), dtype=torch.bfloat16, device='cuda:0') + x = (x * 10).clone().detach().requires_grad_() + grad_output = torch.randn((M, N // 2), dtype=torch.bfloat16, + device='cuda:0') + smooth_scale = 1 + torch.rand((N // 2,), dtype=torch.float32, + device='cuda:0') + grad_smooth_scale = 1 + torch.rand((N,), dtype=torch.float32, + device='cuda:0') + transpose_grad_smooth_scale = 1 + torch.rand((M,), dtype=torch.float32, + device='cuda:0') + else: + d = torch.load('/ossfs/workspace/tmp/vis/silu.bin') + x = d['x'].clone().detach().to('cuda:0').requires_grad_() + grad_output = d['g'].to('cuda:0') + grad_smooth_scale = d['smooth_scale'].to('cuda:0') + N = x.shape[-1] + M = x.shape[0] + smooth_scale = 1 + torch.rand((N // 2,), dtype=torch.float32, + device='cuda:0') + + y_q_ref, y_scale_ref, y_maxs_ref = torch_silu_and_smooth_quant_forward(x, + smooth_scale=smooth_scale) + y_q, y_scale, y_maxs = triton_silu_and_smooth_quant_forward(x, + smooth_scale=smooth_scale, + round_scale=True, + calibrate=True) + output_check(y_q_ref.float(), y_q.float(), 'smooth.y_q') + output_check(y_scale_ref, y_scale, 'smooth.y_scale') + output_check(y_maxs_ref, y_maxs, 'smooth.y_max') + + dx_q_ref, dx_scale_ref, dxt_q_ref, dxt_scale_ref = torch_silu_and_smooth_quant_backward( + grad_output, x, + smooth_scale=grad_smooth_scale, + transpose_smooth_scale=transpose_grad_smooth_scale, + reverse=True, + round_scale=True) + dx_q, dx_scale, dxt_q, dxt_scale = triton_silu_and_smooth_quant_backward( + grad_output, x, + smooth_scale=grad_smooth_scale, + transpose_smooth_scale=transpose_grad_smooth_scale, + reverse=True, + round_scale=True) + + output_check(dx_q_ref.float(), dx_q.float(), 'smooth.dx_data') + output_check(dx_scale_ref, dx_scale, 'smooth.dx_scale') + output_check(dxt_q_ref.float(), dxt_q.float(), 'smooth.dxt_data') + output_check(dxt_scale_ref, dxt_scale, 
'smooth.dxt_scale') + + if bench: + benchmark_func(torch_silu_and_smooth_quant_forward, x, + smooth_scale=smooth_scale, + n_repeat=100, ref_bytes=M * N * 2.5) + benchmark_func(triton_silu_and_smooth_quant_forward, x, + smooth_scale=smooth_scale, + n_repeat=100, ref_bytes=M * N * 2.5) + benchmark_func(triton_silu_and_smooth_quant_backward, grad_output, x, + smooth_scale=grad_smooth_scale, + transpose_smooth_scale=transpose_grad_smooth_scale, + n_repeat=100, ref_bytes=M * N * 5) + - y_q_ref, y_scale_ref, _, yt_q_ref, yt_scale_ref = torch_silu_and_block_quant_forward( - x) +def test_silu_and_block_quant(M=4096, N=4096, bench=False): + if True: + x = torch.randn((M, N), dtype=torch.bfloat16, device='cuda:0') + x = (x * 10).clone().detach().requires_grad_() + grad_output = torch.randn((M, N // 2), dtype=torch.bfloat16, + device='cuda:0') + else: + d = torch.load('/ossfs/workspace/tmp/vis/silu.bin') + x = d['x'].clone().detach().to('cuda:0').requires_grad_() + grad_output = d['g'].to('cuda:0') + N = x.shape[-1] + M = x.shape[0] + + y_q_ref, y_scale_ref, yt_q_ref, yt_scale_ref = torch_silu_and_block_quant_forward( + x, round_scale=True) y_q, y_scale, yt_q, yt_scale = triton_silu_and_block_quant_forward(x, round_scale=True, output_mode=2) @@ -199,16 +418,107 @@ def test_silu_and_block_quant(M=4096, N=4096, bench=False): n_repeat=100, ref_bytes=M * N * 5) +def test_triton_batch_weighted_silu_and_smooth_quant(M=1024, N=4096, + n_experts=32, + bench=False): + if True: + count_list = [random.randint(M // 2, M // 2 * 3) // 16 * 16 for _ in + range(n_experts)] + counts = torch.tensor(count_list, device='cuda:0', dtype=torch.int32) + bs = sum(count_list) + + x = torch.randn((bs, N), dtype=torch.bfloat16, device='cuda:0') ** 3 / 4 + weight = torch.randn((bs, 1), dtype=torch.float32, device='cuda:0') + smooth_scales = 1 + torch.rand((n_experts, N // 2), dtype=torch.float32, + device='cuda:0') * 10 + else: + d = torch.load('/ossfs/workspace/Megatron-LM/silu.bin') + counts = d['counts'].cuda() + x = d['x'].cuda() + weight = d['weight'].cuda() + smooth_scales = d['smooth_scale'].cuda() + bs = sum(counts.tolist()) + N = x.shape[-1] + n_experts = counts.shape[0] + + grad_output = torch.randn((bs, N // 2), dtype=torch.bfloat16, + device='cuda:0') ** 3 + grad_smooth_scales = 1 + torch.rand((n_experts, N), dtype=torch.float32, + device='cuda:0') * 10 + transpose_grad_smooth_scales = 1 + torch.rand((bs,), dtype=torch.float32, + device='cuda:0') * 10 + round_scale = True + + x_q_ref, x_scale_ref, x_max_ref = torch_batch_weighted_silu_and_smooth_quant_forward( + x, + weight, + counts, + smooth_scales=smooth_scales, + round_scale=round_scale, + reverse=False) + x_q, x_scale, maxs = triton_batch_weighted_silu_and_smooth_quant_forward(x, + weight, + counts, + smooth_scale=smooth_scales, + round_scale=round_scale, + reverse=False) + output_check(x_q_ref.float(), x_q.float(), 'smooth.data') + output_check(x_scale_ref, x_scale, 'smooth.scale') + + dx_ref, dx_scale_ref, dw_ref, dxt_ref, dxt_scale_ref = torch_batch_weighted_silu_and_smooth_quant_backward( + grad_output, x, weight, count_list, + smooth_scales=grad_smooth_scales, + transpose_smooth_scale=transpose_grad_smooth_scales, + round_scale=round_scale, reverse=False) + dx, dx_scale, dw, dxt, dxt_scale = triton_batch_weighted_silu_and_smooth_quant_backward( + grad_output, x, weight, counts, + smooth_scale=grad_smooth_scales, + transpose_smooth_scale=transpose_grad_smooth_scales, + splits=count_list, + round_scale=round_scale, + reverse=False) + 
output_check(dx_ref.float(), dx.float(), 'smooth.dx') + output_check(dx_scale_ref, dx_scale, 'smooth.dx_scale') + output_check(dw_ref, dw, 'smooth.dw') + output_check(dxt_ref.float(), dxt.float(), 'smooth.dxt') + output_check(dxt_scale_ref, dxt_scale.view(-1), 'smooth.dxt_scale') + + if bench: + ref_time = None + benchmark_func(triton_batch_weighted_silu_and_smooth_quant_forward, x, + weight, + counts, smooth_scale=smooth_scales, round_scale=True, + ref_bytes=n_experts * M * N * 2.5, ref_time=ref_time) + benchmark_func(triton_batch_weighted_silu_and_smooth_quant_backward, + grad_output, x, weight, counts, + smooth_scale=smooth_scales, + transpose_smooth_scale=transpose_grad_smooth_scales, + splits=count_list, + round_scale=True, + ref_bytes=n_experts * M * N * 4, ref_time=ref_time) + + def test_triton_batch_weighted_silu_and_block_quant(M=1024, N=4096, n_experts=32, bench=False): - count_list = [random.randint(M // 2, M // 2 * 3) // 16 * 16 for _ in - range(n_experts)] - counts = torch.tensor(count_list, device='cuda:0', dtype=torch.int32) - bs = sum(count_list) - - x = torch.randn((bs, N), dtype=torch.bfloat16, device='cuda:0') ** 3 / 10 - weight = torch.randn((bs, 1), dtype=torch.float32, device='cuda:0') + if True: + count_list = [random.randint(M // 2, M // 2 * 3) // 16 * 16 for _ in + range(n_experts)] + counts = torch.tensor(count_list, device='cuda:0', dtype=torch.int32) + bs = sum(count_list) + + x = torch.randn((bs, N), dtype=torch.bfloat16, + device='cuda:0') ** 3 / 10 + weight = torch.randn((bs, 1), dtype=torch.float32, device='cuda:0') + else: + d = torch.load('/ossfs/workspace/Megatron-LM/silu.bin') + counts = d['counts'].cuda() + x = d['x'].cuda() + weight = d['weight'].cuda() + smooth_scales = d['smooth_scale'].cuda() + bs = sum(counts.tolist()) + N = x.shape[-1] + n_experts = counts.shape[0] grad_output = torch.randn((bs, N // 2), dtype=torch.bfloat16, device='cuda:0') ** 3 @@ -220,10 +530,13 @@ def test_triton_batch_weighted_silu_and_block_quant(M=1024, N=4096, counts, round_scale=round_scale) x_q, x_scale, xt_q, xt_scale = triton_batch_weighted_silu_and_block_quant_forward( - x, weight, + x, + weight, counts, + count_list, round_scale=round_scale, - splits=count_list) + output_mode=2) + output_check(x_q_ref.float(), x_q.float(), 'block.q') output_check(x_scale_ref, x_scale, 'block.scale') output_check(xt_q_ref.float(), xt_q.float(), 'block.qt') @@ -248,11 +561,6 @@ def test_triton_batch_weighted_silu_and_block_quant(M=1024, N=4096, counts, round_scale=True, splits=count_list, output_mode=0, n_repeat=100, ref_bytes=n_experts * M * N * 2.5, ref_time=ref_time) - benchmark_func(triton_batch_weighted_silu_and_block_quant_forward, x, - weight, - counts, round_scale=True, splits=count_list, - output_mode=1, n_repeat=100, - ref_bytes=n_experts * M * N * 2.5, ref_time=ref_time) benchmark_func(triton_batch_weighted_silu_and_block_quant_forward, x, weight, counts, round_scale=True, splits=count_list, @@ -265,9 +573,20 @@ def test_triton_batch_weighted_silu_and_block_quant(M=1024, N=4096, if __name__ == '__main__': + test_weighted_silu(M=16384, N=1024, bench=True) + + test_silu_and_smooth_quant(M=16384, N=1024, bench=False) + test_silu_and_smooth_quant(M=8192, N=2048, bench=False) + test_silu_and_smooth_quant(M=4096, N=10240, bench=False) + test_silu_and_smooth_quant(M=4096, N=5120, bench=False) + test_silu_and_block_quant(M=16384, N=1024, bench=True) + test_triton_batch_weighted_silu_and_smooth_quant(M=2048, N=2048, + n_experts=32, bench=False) + 
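+    # the smooth-quant tests above follow the per-row recipe sketched in the commented-out
+    # reference inside torch_silu_and_smooth_quant_forward:
+    #   y_smooth = y / smooth_scale                               (reverse=False)
+    #   y_scale  = amax(|y_smooth|, dim=1) / 448                  (448 = max finite e4m3 value)
+    #   y_q      = (y_smooth / y_scale[:, None]).to(torch.float8_e4m3fn)
+    # with y_scale optionally rounded up to a power of two when round_scale=True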
test_triton_batch_weighted_silu_and_smooth_quant(M=800, N=2048, n_experts=32, bench=False) + test_triton_batch_weighted_silu_and_smooth_quant(M=0, N=2048, n_experts=32, bench=False) + test_triton_batch_weighted_silu_and_block_quant(M=4096, N=2048, n_experts=32, bench=True) - test_triton_batch_weighted_silu_and_block_quant(M=1008, N=2048, - n_experts=32, bench=False) + test_triton_batch_weighted_silu_and_block_quant(M=1008, N=2048, n_experts=32, bench=False) diff --git a/tests/test_smooth_quant.py b/tests/test_smooth_quant.py new file mode 100644 index 0000000..4ddc166 --- /dev/null +++ b/tests/test_smooth_quant.py @@ -0,0 +1,328 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. +""" + +import torch + +from linghe.quant.smooth import (triton_batch_smooth_quant, + triton_subrow_smooth_quant, + triton_transpose_rescale_smooth_quant, + triton_smooth_quant, + triton_transpose_smooth_quant) +from linghe.tools.benchmark import benchmark_func +from linghe.tools.util import (output_check, + torch_make_indices, + torch_smooth_quant, + round_up) + + +def torch_split_smooth_quant(x_split, smooth_scales, round_scale=False): + x_qs = [] + x_scales = [] + x_maxs = [] + for i, x_ in enumerate(x_split): + x_maxs.append(x_.abs().amax(0)) + x_smooth = x_ / smooth_scales[i] + x_scale_ = x_smooth.float().abs().amax(1) / 448 + if round_scale: + x_scale_ = torch.exp2(torch.ceil(torch.log2(x_scale_))) + x_q_ = (x_smooth / x_scale_[:, None]).to(torch.float8_e4m3fn) + x_qs.append(x_q_) + x_scales.append(x_scale_) + x_maxs = torch.stack(x_maxs, 0) + return x_qs, x_scales, x_maxs + + +def torch_subrow_smooth_quant(x, smooth_scale, x_q, x_scale, subrow_scales, + offset, size, + reverse=False, round_scale=False): + limit = 448 * torch.ones((1,), dtype=smooth_scale.dtype, + device=smooth_scale.device) + # subrow_scales is saved as 448/max + + M, N = x_q.shape + if offset % N > 0: + si = offset % N + k = N - si + x_slice = x.view(-1)[0:k] + smooth_scale_slice = smooth_scale[si: N] + if not reverse: + smooth_scale_slice = 1 / smooth_scale_slice + x_smooth = x_slice * smooth_scale_slice + + scale = subrow_scales[0:1] + if round_scale: + scale = torch.exp2(torch.floor(torch.log2(scale))) + + x_q_slice = torch.minimum(torch.maximum(x_smooth / scale, -limit), + limit).to(torch.float8_e4m3fn) + x_q.view(-1)[offset:offset + k] = x_q_slice + + if (offset + size) % N > 0: + k = (offset + size) % N + x_slice = x.view(-1)[-k:] + smooth_scale_slice = smooth_scale[0: k] + if not reverse: + smooth_scale_slice = 1 / smooth_scale_slice + x_smooth = x_slice * smooth_scale_slice + scale = subrow_scales[1:2] + if round_scale: + scale = torch.exp2(torch.floor(torch.log2(scale))) + x_q_slice = torch.minimum(torch.maximum(x_smooth / scale, -limit), + limit).to(torch.float8_e4m3fn) + x_q.view(-1)[(offset + size - k):(offset + size)] = x_q_slice + x_scale[(offset + size) // N] = scale + + +def torch_rescale_quant(y_q, org_smooth_scale, y_scale, transpose_smooth_scale, + reverse=True, round_scale=True): + assert reverse + y = y_q.float() / org_smooth_scale * y_scale[:, None] + y_q, y_scale, _ = torch_smooth_quant(y.t(), transpose_smooth_scale, + reverse=True, round_scale=round_scale) + return y_q, y_scale + + +def triton_split_smooth_quant(x_split, smooth_scales): + x_qs = [] + x_scales = [] + for i, x_ in enumerate(x_split): + x_q_, x_scale_, _ = triton_smooth_quant(x_, smooth_scales[i]) + x_qs.append(x_q_) + x_scales.append(x_scale_) + return x_qs, x_scales + + +def test_triton_smooth_quant(M=4096, 
N=4096, bench=False): + device = 'cuda:0' + x = torch.randn((M, N), dtype=torch.bfloat16, device=device) + smooth_scale = torch.randn((N,), device=device, dtype=torch.float32).abs() + x_q_ref, scales_ref, x_maxs_ref = torch_smooth_quant(x, smooth_scale, + reverse=False, + round_scale=True) + + x_q, x_scale, x_maxs = triton_smooth_quant(x, smooth_scale, + reverse=False, + round_scale=True, + calibrate=True) + output_check(x_q_ref.float(), x_q.float(), + 'triton_smooth_quant.data') + output_check(scales_ref, x_scale, 'triton_smooth_quant.scale') + output_check(x_maxs_ref, x_maxs, 'triton_smooth_quant.x_maxs') + + if bench: + benchmark_func(triton_smooth_quant, x, + smooth_scale, + reverse=False, + round_scale=True, + calibrate=False, + ref_bytes=M * N * 3) + + +def test_triton_subrow_smooth_quant(M=4096, N=5120, offset=4096, + size=16384): + device = 'cuda:0' + x = torch.randn((size,), dtype=torch.float32, device=device) + x_q = torch.zeros((M, N), dtype=torch.bfloat16, device=device).to( + torch.float8_e4m3fn) + x_scale = torch.zeros((M,), dtype=torch.float32, device=device).abs() + smooth_scale = torch.randn((N,), device=device, + dtype=torch.float32).abs() + 1 + subrow_scales = torch.randn((2,), device=device, + dtype=torch.float32).abs() + 1 + + x_ref = x.clone() + x_q_ref = x_q.clone() + x_scale_ref = x_scale.clone() + subrow_scales_ref = subrow_scales.clone() + torch_subrow_smooth_quant(x_ref, smooth_scale, x_q_ref, x_scale_ref, + subrow_scales_ref, offset, size, + reverse=False, round_scale=False) + + triton_subrow_smooth_quant(x, smooth_scale, x_q, x_scale, + subrow_scales, offset, size, + reverse=False, round_scale=False) + + output_check(x_q_ref.float(), x_q.float(), 'subrow.data') + output_check(x_scale_ref, x_scale, 'subrow.scale') + + if offset % N > 0: + k = N - offset % N + output_check(x_q_ref.float().view(-1)[offset:offset + k], + x_q.float().view(-1)[offset:offset + k], + 'subrow.data.tail') + + if (offset + size) % N > 0: + k = (offset + size) % N + output_check(x_q_ref.float().view(-1)[offset + size - k:offset + size], + x_q.float().view(-1)[offset + size - k:offset + size], + 'subrow.data.head') + row_id = (offset + size) // N + output_check(x_scale_ref[row_id], x_scale[row_id], 'subrow.scale.slice') + + +def test_triton_transpose_smooth_quant(M=4096, N=4096, bench=False): + device = 'cuda:0' + P = round_up(M, b=32) + y = torch.randn((M, N), dtype=torch.bfloat16, device=device) ** 3 * 1e-10 + transpose_smooth_scale = torch.randn((M,), device=device, + dtype=torch.float32).abs() * 10 + 1 + yt_q, yt_scale = triton_transpose_smooth_quant(y, + transpose_smooth_scale, + reverse=True, + pad=True, + round_scale=True) + q_ref, scale_ref, maxs_ref = torch_smooth_quant(y.T.contiguous(), + transpose_smooth_scale, + reverse=True, + round_scale=True) + + assert yt_q.shape[1] == P + if P > M: + assert yt_q.float()[:, M:].abs().sum().item() == 0 + output_check(q_ref, yt_q[:, :M], + 'triton_transpose_smooth_quant.data') + output_check(scale_ref, yt_scale, + 'triton_transpose_smooth_quant.scale') + + if bench: + benchmark_func(triton_transpose_smooth_quant, y, + transpose_smooth_scale, + reverse=True, + pad=True, + round_scale=True, + ref_bytes=M * N * 3) + + +def test_triton_transpose_rescale_smooth_quant(M=4096, N=4096, + round_scale=False): + device = 'cuda:0' + P = round_up(M, b=32) + y = torch.randn((M, N), dtype=torch.bfloat16, device=device) ** 3 + org_smooth_scale = torch.randn((N,), device=device, + dtype=torch.float32).abs() * 10 + 1 + if round_scale: + org_smooth_scale = 
torch.exp2(torch.ceil(torch.log2(org_smooth_scale))) + transpose_smooth_scale = torch.randn((M,), device=device, + dtype=torch.float32).abs() + 0.1 + if round_scale: + transpose_smooth_scale = torch.exp2( + torch.ceil(torch.log2(transpose_smooth_scale))) + + y_q, y_scale, y_maxs = triton_smooth_quant(y, org_smooth_scale, + reverse=True, + round_scale=round_scale) + + yt_gt, yt_scale_gt, yt_maxs_gt = torch_smooth_quant(y.t(), + transpose_smooth_scale, + reverse=True, + round_scale=round_scale) + + yt_q_ref, yt_scale_ref = torch_rescale_quant(y_q, org_smooth_scale, y_scale, + transpose_smooth_scale, + reverse=True, + round_scale=round_scale) + + yt_q, yt_scale = triton_transpose_rescale_smooth_quant(y_q, + org_smooth_scale, + y_scale, + transpose_smooth_scale, + reverse=True, + pad=True, + round_scale=round_scale) + + if P > M: + assert yt_q.shape[1] == P + yt_q.float()[:, M:].abs().sum().item() == 0 + + output_check(yt_q_ref, yt_q[:, :M], + 'triton_transpose_rescale_smooth_quant.data') + output_check(yt_scale_ref, yt_scale, + 'triton_transpose_rescale_smooth_quant.scale') + + # should dequant and compare with gt + # output_check(yt_gt, yt_q[:, :M], + # 'triton_transpose_rescale_smooth_quant.data.gt') + # output_check(yt_scale_gt, yt_scale, + # 'triton_transpose_rescale_smooth_quant.scale.gt') + + +def test_triton_batch_smooth_quant(M=4096, N=4096, n_experts=32, topk=8, + round_scale=False, bench=False): + device = 'cuda:0' + + smooth_scales = 1 + 10 * torch.rand((n_experts, N), device=device, + dtype=torch.float32) + + logits = torch.randn((M, n_experts), dtype=torch.float32, device=device) + probs, mask_map, token_count_per_expert, indices, row_id_map = torch_make_indices( + logits, topk=topk, bias=0.0) + token_count_per_expert_list = token_count_per_expert.tolist() + x = torch.randn((sum(token_count_per_expert_list), N), dtype=torch.bfloat16, + device=device) + + x_q, x_scale, x_maxs = triton_batch_smooth_quant(x, smooth_scales, + token_count_per_expert, + reverse=False, + round_scale=round_scale, + calibrate=True) + + x_split = torch.split(x, token_count_per_expert_list) + x_q_ref, x_scale_ref, x_maxs_ref = torch_split_smooth_quant(x_split, + smooth_scales) + x_q_ref = torch.cat([x.view(torch.uint8) for x in x_q_ref], 0).view( + torch.float8_e4m3fn) + x_scale_ref = torch.cat(x_scale_ref, 0) + output_check(x_q_ref.float(), x_q.float(), 'triton_batch_smooth_quant.data') + output_check(x_scale_ref.float(), x_scale.float(), + 'triton_batch_smooth_quant.scale') + output_check(x_maxs_ref.float(), x_maxs.float(), + 'triton_batch_smooth_quant.maxs') + + if bench: + n_repeat = 100 + ref_time = benchmark_func(triton_split_smooth_quant, x_split, + smooth_scales, n_repeat=n_repeat) + benchmark_func(triton_batch_smooth_quant, x, smooth_scales, + token_count_per_expert, reverse=False, + round_scale=round_scale, n_repeat=n_repeat, + ref_time=ref_time) + benchmark_func(triton_batch_smooth_quant, x, smooth_scales, + token_count_per_expert, reverse=False, + round_scale=round_scale, calibrate=True, + n_repeat=n_repeat, ref_time=ref_time) + + +if __name__ == '__main__': + test_triton_smooth_quant(M=16384, N=2048, bench=False) + test_triton_smooth_quant(M=8192, N=4096, bench=False) + test_triton_smooth_quant(M=4096, N=8192, bench=False) + test_triton_smooth_quant(M=8192, N=3072, bench=False) + test_triton_smooth_quant(M=8192, N=6144, bench=False) + test_triton_smooth_quant(M=16384, N=512, bench=False) + test_triton_smooth_quant(M=3457, N=512, bench=False) + + test_triton_subrow_smooth_quant(M=4096, N=5120, 
offset=5120, + size=2048) + test_triton_subrow_smooth_quant(M=4096, N=5120, offset=4096, + size=5120) + test_triton_subrow_smooth_quant(M=4096, N=5120, offset=5120, + size=5120 * 10 - 1024) + + test_triton_transpose_smooth_quant(M=16384, N=2048, bench=False) + test_triton_transpose_smooth_quant(M=8192, N=4096, bench=False) + test_triton_transpose_smooth_quant(M=4096, N=8192, bench=False) + test_triton_transpose_smooth_quant(M=4096, N=3072, bench=False) + + test_triton_transpose_rescale_smooth_quant(M=4096, N=4096, + round_scale=True) + test_triton_transpose_rescale_smooth_quant(M=3895, N=4096, + round_scale=True) + test_triton_transpose_rescale_smooth_quant(M=4096, N=3072, + round_scale=True) + test_triton_transpose_rescale_smooth_quant(M=395, N=2048, + round_scale=True) + + test_triton_batch_smooth_quant(M=4096, N=4096, n_experts=32, topk=8, + round_scale=False) From a51122b11de95b4c2eb8112a7c84856c5e71b74e Mon Sep 17 00:00:00 2001 From: "nanxiao.zy" Date: Thu, 16 Oct 2025 20:54:27 +0800 Subject: [PATCH 5/7] refine estcase --- build.sh | 2 +- tests/test_channel_quant.py | 2 +- tests/test_dot.py | 7 +--- tests/test_fp32_gemm.py | 31 +++++++------- tests/test_gather.py | 34 ---------------- tests/test_group_quant.py | 16 +------- tests/test_hadamard_quant.py | 78 ++++++++++++++++++++++++++++++++++++ tests/test_norm.py | 6 --- 8 files changed, 99 insertions(+), 77 deletions(-) diff --git a/build.sh b/build.sh index 7a6a6d5..b123d1c 100644 --- a/build.sh +++ b/build.sh @@ -2,6 +2,6 @@ rm -rf build && rm -rf dist && rm -rf linghe.egg-info && python setup.py develop && -python setup.py bdist_wheel && +python setup.py bdist_wheel # pdoc --output-dir docs -d google --no-include-undocumented --no-search --no-show-source linghe \ No newline at end of file diff --git a/tests/test_channel_quant.py b/tests/test_channel_quant.py index c59fe5e..5b4fd49 100644 --- a/tests/test_channel_quant.py +++ b/tests/test_channel_quant.py @@ -5,7 +5,7 @@ import torch -from linghe.quant.channel.channel import (triton_deprecated_tokenwise_row_quant, +from linghe.quant.channel import (triton_deprecated_tokenwise_row_quant, triton_row_quant, triton_tokenwise_row_quant) from linghe.tools.benchmark import benchmark_func diff --git a/tests/test_dot.py b/tests/test_dot.py index 699fe1a..fcedbd6 100644 --- a/tests/test_dot.py +++ b/tests/test_dot.py @@ -7,8 +7,7 @@ from linghe.tools.benchmark import benchmark_func from linghe.tools.util import output_check -from linghe.utils.dot import (triton_dot, - triton_mix_precise_dot) +from linghe.utils.dot import triton_dot def torch_fp16_dot(x, y): @@ -34,13 +33,9 @@ def test_dot(M=4096, N=4096, bench=False): sums_ref = (x.float() * ( q.to(torch.float32) * quant_scale[:, None] * smooth_scale[None, :])).sum(dim=1) - sums = triton_mix_precise_dot(x, q, smooth_scale, quant_scale, reverse=True) - output_check(sums_ref, sums.float(), 'sum') if bench: ref_time = benchmark_func(torch_fp16_dot, x, y, n_repeat=n_repeat) - benchmark_func(triton_mix_precise_dot, x, q, smooth_scale, quant_scale, - reverse=True, n_repeat=n_repeat, ref_time=ref_time) if __name__ == '__main__': diff --git a/tests/test_fp32_gemm.py b/tests/test_fp32_gemm.py index 68fdcf9..3f60d18 100644 --- a/tests/test_fp32_gemm.py +++ b/tests/test_fp32_gemm.py @@ -14,6 +14,17 @@ from linghe.tools.util import output_check + +def torch_fp32_matmul(x, w): + return torch.nn.functional.linear(x.float(), w.float()) + +def torch_fp32_matmul_backward(dy, w): + return (dy @ w).to(torch.bfloat16) + +def torch_fp32_matmul_update(y, x): + 
return (y.t() @ x).to(torch.bfloat16) + + def test_fp32_matmul(M=2048, N=256, K=8192, bench=False): # M, N, K = 4096, 256, 8192 dtype = torch.bfloat16 @@ -25,15 +36,6 @@ def test_fp32_matmul(M=2048, N=256, K=8192, bench=False): scale = torch.randn(M, dtype=torch.float32, device=device) dy = torch.randn(M, N, dtype=torch.float32, device=device) - def torch_fp32_matmul(x, w): - return torch.nn.functional.linear(x.float(), w.float()) - - def torch_fp32_matmul_backward(dy, w): - return (dy @ w).to(torch.bfloat16) - - def torch_fp32_matmul_update(y, x): - return (y.t() @ x).to(torch.bfloat16) - y_ref = torch_fp32_matmul(x, w) y = triton_fp32_gemm(x, w) output_check(y_ref, y.float(), mode='fp32_gemm') @@ -43,10 +45,9 @@ def torch_fp32_matmul_update(y, x): output_check(y_ref, y.float(), mode='scaled_fp32_gemm') dx = torch.zeros(M, K, dtype=dtype, device=device) - dx_clone = dx.clone() - triton_fp32_gemm_for_backward(dy, w, dx_clone, accum=True) - dx_ref = dy @ w.float() + dx.float() - output_check(dx_ref, dx_clone.float(), mode='backward') + dx = triton_fp32_gemm_for_backward(dy, w) + dx_ref = dy @ w.float() + output_check(dx_ref, dx.float(), mode='backward') main_grad = triton_fp32_gemm_for_update(y, x) main_grad_ref = y.t() @ (x.float()) @@ -72,8 +73,8 @@ def torch_fp32_matmul_update(y, x): n_repeat=n_repeat, ref_bytes=M * K * 10 + N * K * 4 + M * N * 4, ref_linghe=2 * M * N * K) - benchmark_func(triton_fp32_gemm_for_backward, dy, w, dx_clone, - accum=True, n_repeat=n_repeat, + benchmark_func(triton_fp32_gemm_for_backward, dy, w, + n_repeat=n_repeat, ref_bytes=M * K * 2 + N * K * 2 + M * N * 4, ref_linghe=2 * M * N * K, ref_time=ref_time) diff --git a/tests/test_gather.py b/tests/test_gather.py index 9b5c012..4606284 100644 --- a/tests/test_gather.py +++ b/tests/test_gather.py @@ -147,38 +147,6 @@ def test_make_id_map(M=4098, n_experts=32, topk=2, bias=0.0, bench=False): -def test_triton_smooth_permute_with_indices(M=4096, N=4096, n_experts=256, - topk=8, bench=False): - device = 'cuda:0' - y = torch.randn((M, N), dtype=torch.bfloat16, device=device) - logits = torch.randn((M, n_experts), dtype=torch.float32, device=device) - smooth_scales = 1 + 10 * torch.rand((n_experts, N), device=device, - dtype=torch.float32) - - probs, mask_map, token_count_per_expert, indices, row_id_map = torch_make_indices( - logits, topk=topk, bias=0.0) - - y_q, y_scale = triton_smooth_permute_with_indices(y, smooth_scales, - token_count_per_expert, - indices, reverse=False, - round_scale=False) - - y_q_ref, y_scale_ref = torch_batch_smooth_quant(y, smooth_scales, indices, - token_count_per_expert, - reverse=False, - round_scale=False) - - output_check(y_q_ref.float(), y_q.float(), 'data') - output_check(y_scale_ref.float(), y_scale.float(), 'scale') - - if bench: - n_repeat = 100 - benchmark_func(torch_index_select, y, indices, n_repeat=n_repeat) - benchmark_func(triton_smooth_permute_with_indices, y, smooth_scales, - token_count_per_expert, indices, reverse=False, - round_scale=False, n_repeat=n_repeat) - - def test_triton_smooth_weighted_permute_with_indices(M=4096, N=4096, n_experts=256, topk=8, @@ -423,8 +391,6 @@ def test_triton_batch_transpose_smooth_permute_with_indices(M=1024, N=2048, n_ex if __name__ == '__main__': test_make_id_map(M=4098, n_experts=32, topk=2, bias=0.0, bench=False) - test_triton_smooth_permute_with_indices(M=4096, N=4096, n_experts=32, - topk=8) test_triton_permute_with_mask_map(M=16384, N=2048, n_experts=32, topk=8, bench=False) test_triton_permute_with_mask_map(M=8192, N=4096, 
n_experts=32, topk=8, bench=False) test_triton_permute_with_mask_map(M=7628, N=2048, n_experts=32, topk=8, bench=False) diff --git a/tests/test_group_quant.py b/tests/test_group_quant.py index 132de67..d8c68ec 100644 --- a/tests/test_group_quant.py +++ b/tests/test_group_quant.py @@ -5,8 +5,7 @@ import torch -from linghe.quant.group import (triton_group_quant, - triton_persist_group_quant) +from linghe.quant.group import triton_group_quant from linghe.tools.benchmark import benchmark_func from linghe.tools.util import (output_check, torch_group_quant) @@ -19,21 +18,10 @@ def test_group_quant(M=4096, N=4096, B=128, round_scale=False, bench=False): output_check(xq_ref.float(), xq.float(), mode='data') output_check(x_scale_ref.float(), x_scale.float(), mode='scale') - xq, x_scale = triton_persist_group_quant(x, group_size=B, - round_scale=round_scale) - output_check(xq_ref.float(), xq.float(), mode='data') - output_check(x_scale_ref.float(), x_scale.float(), mode='scale') - - # torch.testing.assert_close(xq_ref.float(), xq.float(), rtol=0.02, atol=0.02) - if bench: n_repeat = 100 - ref_time = benchmark_func(triton_group_quant, x, group_size=B, + benchmark_func(triton_group_quant, x, group_size=B, n_repeat=n_repeat, ref_bytes=M * N * 3) - benchmark_func(triton_persist_group_quant, x, group_size=B, - n_repeat=n_repeat, ref_time=ref_time, - ref_bytes=M * N * 3) - if __name__ == '__main__': test_group_quant(M=4096, N=4096, B=128) diff --git a/tests/test_hadamard_quant.py b/tests/test_hadamard_quant.py index e69de29..c4c6a57 100644 --- a/tests/test_hadamard_quant.py +++ b/tests/test_hadamard_quant.py @@ -0,0 +1,78 @@ +# -*- coding: utf-8 -*- +""" +Copyright (c) Ant Financial Service Group and its affiliates. +""" + +import torch + +from linghe.quant.hadamard import triton_hadamard_quant +from linghe.tools.benchmark import benchmark_func +from linghe.tools.util import (output_check, + make_hadamard_matrix, + torch_hadamard_transform, + torch_row_quant, + ) + + + + +# apply hadamard transformation and quantization for x +def torch_hadamard_quant(x, hm, round_scale=False): + xh = torch_hadamard_transform(x, hm, side='right') + q, s = torch_row_quant(xh, round_scale=round_scale) + xht = torch_hadamard_transform(x.t().contiguous(), hm, side='right') + qt, st = torch_row_quant(xht, round_scale=round_scale) + + return xh,xht,q,s,qt,st + + +def test_hadamard_quant(M=8192, N=1024, K=2048, B=64, bench=False): + dtype = torch.bfloat16 + device = 'cuda:0' + x = torch.randn((M, K), dtype=dtype, device=device) + w = torch.randn((N, K), dtype=dtype, device=device) + dy = torch.randn((M, N), dtype=dtype, device=device) + + hm = make_hadamard_matrix(B, dtype=dtype, device=device, norm=True) + + + y_ref = x@w.t() + dx_ref = dy@w + dw_ref = dy.t()@x + + xh,xht,xq,xs,xqt,xst = torch_hadamard_quant(x, hm, round_scale=False) + wh,wht,wq,ws,wqt,wst = torch_hadamard_quant(w, hm, round_scale=False) + dyh,dyht,dyq,dys,dyqt,dyst = torch_hadamard_quant(dy, hm, round_scale=False) + + y = xh@wh.t() + dx = dyh@wht.t() + dw = dyht@xht.t() + + output_check(y_ref,y,'bf16.y') + output_check(dx_ref,dx,'bf16.dx') + output_check(dw_ref,dw,'bf16.dw') + + x_q, x_scale, xt_q, xt_scale = triton_hadamard_quant(x, hm) + output_check(xq, x_q, 'x.data') + output_check(xs, x_scale, 'x.scale') + output_check(xqt, xt_q, 'xt.data') + output_check(xst, xt_scale, 'xt.scale') + + + w_q, w_scale, wt_q, wt_scale = triton_hadamard_quant(w, hm) + output_check(wq, w_q, 'w.data') + output_check(ws, w_scale, 'w.scale') + output_check(wqt, wt_q, 'wt.data') 
+ output_check(wst, wt_scale, 'wt.scale') + + + dy_q, dy_scale, dyt_q, dyt_scale = triton_hadamard_quant(dy, hm) + output_check(dyq, dy_q, 'dy.data') + output_check(dys, dy_scale, 'dy.scale') + output_check(dyqt, dyt_q, 'dyt.data') + output_check(dyst, dyt_scale, 'dyt.scale') + + + +if __name__ == '__main__': + test_hadamard_quant(M=8192, N=1024, K=2048, B=64, bench=False) \ No newline at end of file diff --git a/tests/test_norm.py b/tests/test_norm.py index 3c52baf..fb45261 100644 --- a/tests/test_norm.py +++ b/tests/test_norm.py @@ -193,7 +193,6 @@ def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): q, scale, rms, q_t, scale_t = triton_rms_norm_and_block_quant_forward(x, weight, round_scale=True, - output_rms=True, output_mode=2) output_check(q_ref, q, mode="2.block.data") output_check(scale_ref.t(), scale, mode='2.block.scale') @@ -202,7 +201,6 @@ def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): q, scale, _, _, _ = triton_rms_norm_and_block_quant_forward(x, weight, round_scale=True, - output_rms=True, output_mode=0) output_check(q_ref, q, mode="0.block.data") output_check(scale_ref.t(), scale, mode='0.block.scale') @@ -210,7 +208,6 @@ def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): _, _, _, q_t, scale_t = triton_rms_norm_and_block_quant_forward(x, weight, round_scale=True, rms=rms, - output_rms=True, output_mode=1) output_check(qt_ref, q_t, mode='0.block.t_data') output_check(scale_t_ref.t(), scale_t, mode="0.block.t_scale") @@ -218,19 +215,16 @@ def test_rmsnorm_and_block_quant(M=4096, N=4096, bench=False): if bench: benchmark_func(triton_rms_norm_and_block_quant_forward, x, weight, round_scale=True, - output_rms=True, output_mode=0, ref_bytes=M * N * 3) benchmark_func(triton_rms_norm_and_block_quant_forward, x, weight, round_scale=True, - output_rms=True, output_mode=1, ref_bytes=M * N * 3) benchmark_func(triton_rms_norm_and_block_quant_forward, x, weight, round_scale=True, - output_rms=True, output_mode=2, ref_bytes=M * N * 4) From d0a738f1e8f3936290d182025be2c05abd892f45 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=8D=97=E9=9C=84?= Date: Thu, 16 Oct 2025 20:55:24 +0800 Subject: [PATCH 6/7] refine doc --- docs/linghe/facade.html | 2 + docs/linghe/facade/hadamard_quant_linear.html | 190 ++++++++++++++++++ docs/linghe/facade/smooth_quant_linear.html | 179 +++++++++++++++++ docs/linghe/quant/hadamard.html | 46 ++++- docs/linghe/quant/smooth.html | 13 +- docs/linghe/utils/silu.html | 64 ++++++ 6 files changed, 483 insertions(+), 11 deletions(-) create mode 100644 docs/linghe/facade/hadamard_quant_linear.html create mode 100644 docs/linghe/facade/smooth_quant_linear.html diff --git a/docs/linghe/facade.html b/docs/linghe/facade.html index a1dbc13..e50704a 100644 --- a/docs/linghe/facade.html +++ b/docs/linghe/facade.html @@ -29,9 +29,11 @@

    Submodules

    diff --git a/docs/linghe/facade/hadamard_quant_linear.html b/docs/linghe/facade/hadamard_quant_linear.html new file mode 100644 index 0000000..530195c --- /dev/null +++ b/docs/linghe/facade/hadamard_quant_linear.html @@ -0,0 +1,190 @@ + + + + + + + linghe.facade.hadamard_quant_linear API documentation + + + + + + + + + +
    +
    +

    +linghe.facade.hadamard_quant_linear

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    +
    + + class + HadamardQuantLinear(torch.nn.modules.module.Module): + + +
    + + +

    Base class for all neural network modules.

    + +

    Your models should also subclass this class.

    + +

    Modules can also contain other Modules, allowing them to be nested in +a tree structure. You can assign the submodules as regular attributes::

    + +
    import torch.nn as nn
    +import torch.nn.functional as F
    +
    +
    +class Model(nn.Module):
    +    def __init__(self) -> None:
    +        super().__init__()
    +        self.conv1 = nn.Conv2d(1, 20, 5)
    +        self.conv2 = nn.Conv2d(20, 20, 5)
    +
    +    def forward(self, x):
    +        x = F.relu(self.conv1(x))
    +        return F.relu(self.conv2(x))
    +
    + +

    Submodules assigned in this way will be registered, and will also have their +parameters converted when you call to(), etc.

    + +
    + +

    As per the example above, an __init__() call to the parent class +must be made before assignment on the child.

    + +
    + +

    :ivar training: Boolean represents whether this module is in training or + evaluation mode. +:vartype training: bool

    +
    + + +
    +
    + + HadamardQuantLinear( in_features: int, out_features: int, bias: bool = True, device=None, dtype=None) + + +
    + + +

A naive implementation of Hadamard transformation and quantization; a usage sketch follows the argument list below.

    + +
    Arguments:
    + +
      +
• in_features: number of input features
    • +
• out_features: number of output features
    • +
• bias: whether to use a bias term
    • +
    • device: weight device
    • +
    • dtype: weight dtype
    • +
• impl: implementation variant of the Hadamard quantization
    • +
    +
    + + +
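Not part of the generated page: a minimal usage sketch that assumes HadamardQuantLinear behaves as a drop-in replacement for torch.nn.Linear with the constructor shown above; all shapes below are illustrative.

```python
import torch

from linghe.facade.hadamard_quant_linear import HadamardQuantLinear

# Illustrative shapes; assumes the layer maps [..., in_features] -> [..., out_features]
# like torch.nn.Linear.
layer = HadamardQuantLinear(in_features=2048, out_features=4096,
                            bias=False, device='cuda:0', dtype=torch.bfloat16)
x = torch.randn(8192, 2048, dtype=torch.bfloat16, device='cuda:0')
y = layer(x)  # per the class description: Hadamard transform followed by quantized matmul
```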
    +
    +
    + + def + forward(self, input: torch.Tensor) -> torch.Tensor: + + +
    + + +

    Define the computation performed at every call.

    + +

    Should be overridden by all subclasses.

    + +
    + +

    Although the recipe for forward pass needs to be defined within +this function, one should call the Module instance afterwards +instead of this since the former takes care of running the +registered hooks while the latter silently ignores them.

    + +
    +
    + + +
    +
    +
    + + def + extra_repr(self) -> str: + + +
    + + +

    Return the extra representation of the module.

    + +

    To print customized extra information, you should re-implement +this method in your own modules. Both single-line and multi-line +strings are acceptable.

    +
    + + +
    +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/facade/smooth_quant_linear.html b/docs/linghe/facade/smooth_quant_linear.html new file mode 100644 index 0000000..2752e47 --- /dev/null +++ b/docs/linghe/facade/smooth_quant_linear.html @@ -0,0 +1,179 @@ + + + + + + + linghe.facade.smooth_quant_linear API documentation + + + + + + + + + +
    +
    +

    +linghe.facade.smooth_quant_linear

    + +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + + + + +
    +
    +
    + + class + QuantLinear(torch.nn.modules.module.Module): + + +
    + + +

    Base class for all neural network modules.

    + +

    Your models should also subclass this class.

    + +

    Modules can also contain other Modules, allowing them to be nested in +a tree structure. You can assign the submodules as regular attributes::

    + +
    import torch.nn as nn
    +import torch.nn.functional as F
    +
    +
    +class Model(nn.Module):
    +    def __init__(self) -> None:
    +        super().__init__()
    +        self.conv1 = nn.Conv2d(1, 20, 5)
    +        self.conv2 = nn.Conv2d(20, 20, 5)
    +
    +    def forward(self, x):
    +        x = F.relu(self.conv1(x))
    +        return F.relu(self.conv2(x))
    +
    + +

    Submodules assigned in this way will be registered, and will also have their +parameters converted when you call to(), etc.

    + +
    + +

    As per the example above, an __init__() call to the parent class +must be made before assignment on the child.

    + +
    + +

    :ivar training: Boolean represents whether this module is in training or + evaluation mode. +:vartype training: bool

    +
    + + +
    +
    + + QuantLinear( in_features: int, out_features: int, bias: bool = True, device=None, dtype=None) + + +
    + + +

    Initialize internal Module state, shared by both nn.Module and ScriptModule.

    +
    + + +
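The entry above carries only the inherited nn.Module docstring, so the following is a hedged sketch based purely on the constructor and forward signatures; shapes are illustrative and the quantization behaviour is not documented on this page.

```python
import torch

from linghe.facade.smooth_quant_linear import QuantLinear

# Sketch only: the constructor mirrors torch.nn.Linear (in_features, out_features,
# bias, device, dtype); the smooth-quant details are not described here.
layer = QuantLinear(in_features=4096, out_features=1024,
                    bias=True, device='cuda:0', dtype=torch.bfloat16)
x = torch.randn(16, 4096, dtype=torch.bfloat16, device='cuda:0')
y = layer(x)  # forward(input: torch.Tensor) -> torch.Tensor
```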
    +
    +
    + + def + forward(self, input: torch.Tensor) -> torch.Tensor: + + +
    + + +

    Define the computation performed at every call.

    + +

    Should be overridden by all subclasses.

    + +
    + +

    Although the recipe for forward pass needs to be defined within +this function, one should call the Module instance afterwards +instead of this since the former takes care of running the +registered hooks while the latter silently ignores them.

    + +
    +
    + + +
    +
    +
    + + def + extra_repr(self) -> str: + + +
    + + +

    Return the extra representation of the module.

    + +

    To print customized extra information, you should re-implement +this method in your own modules. Both single-line and multi-line +strings are acceptable.

    +
    + + +
    +
    +
    + + \ No newline at end of file diff --git a/docs/linghe/quant/hadamard.html b/docs/linghe/quant/hadamard.html index e9d1157..917cf30 100644 --- a/docs/linghe/quant/hadamard.html +++ b/docs/linghe/quant/hadamard.html @@ -25,10 +25,13 @@ -

    Submodules

    - + +

    API Documentation

    + @@ -44,11 +47,44 @@

    Submodules

    linghe.quant.hadamard

    - +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + +
    +
    + + def + triton_hadamard_quant(x, hm): + + +
    + + +

Apply a Hadamard transformation and then quantize the transformed tensor; a usage sketch follows the Returns section below.

    + +
    Arguments:
    + +
      +
    • x: input tensor
    • +
• hm: Hadamard matrix
    • +
    + +
    Returns:
    + +
    +

    x_q: rowwise quantized tensor of non-transposed x + x_scale: rowwise quantization scale of non-transposed x + xt_q: columnwise quantized tensor of transposed x + xt_scale: columnwise quantization scale of transposed x

    +
    +
    + + +
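A usage sketch mirroring tests/test_hadamard_quant.py from this patch series; the shapes, the normalized Hadamard block size B=64, and the make_hadamard_matrix helper all come from that test.

```python
import torch

from linghe.quant.hadamard import triton_hadamard_quant
from linghe.tools.util import make_hadamard_matrix

M, K, B = 8192, 2048, 64  # shapes and block size follow tests/test_hadamard_quant.py
x = torch.randn((M, K), dtype=torch.bfloat16, device='cuda:0')
hm = make_hadamard_matrix(B, dtype=torch.bfloat16, device='cuda:0', norm=True)

# One call returns the rowwise-quantized x and the columnwise-quantized x^T,
# each with its quantization scales.
x_q, x_scale, xt_q, xt_scale = triton_hadamard_quant(x, hm)
```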
    \ No newline at end of file diff --git a/docs/linghe/quant/smooth.html b/docs/linghe/quant/smooth.html index 1a83b50..c903c18 100644 --- a/docs/linghe/quant/smooth.html +++ b/docs/linghe/quant/smooth.html @@ -25,11 +25,10 @@ -

    Submodules

    - + +

    API Documentation

    +
      +
    @@ -45,7 +44,9 @@

    Submodules

    linghe.quant.smooth

    - +

    Copyright (c) Ant Financial Service Group and its affiliates.

    +
    + diff --git a/docs/linghe/utils/silu.html b/docs/linghe/utils/silu.html index 175a5c1..cbb5131 100644 --- a/docs/linghe/utils/silu.html +++ b/docs/linghe/utils/silu.html @@ -28,6 +28,12 @@

    API Documentation

      +
    • + triton_weighted_silu_forward +
    • +
    • + triton_weighted_silu_backward +
    • triton_silu_and_block_quant_forward
    • @@ -63,6 +69,64 @@

      +
      +
      + + def + triton_weighted_silu_forward(x, weight=None, out=None): + + +
      + + +

Compute silu(x) * weight, used in BF16/FP16 MoE training; see the sketch after the Returns section below.

      + +
      Arguments:
      + +
        +
      • x: input tensor
      • +
      • weight: tokenwise weight
      • +
      + +
      Returns:
      + +
      +

      out: output tensor

      +
      +
      + + +
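A hedged reference sketch of the documented semantics, silu(x) scaled by a per-token weight; the shape of weight and its broadcast over the hidden dimension are assumptions, as this page does not spell them out.

```python
import torch
import torch.nn.functional as F

from linghe.utils.silu import triton_weighted_silu_forward

x = torch.randn(4096, 2048, dtype=torch.bfloat16, device='cuda:0')
weight = torch.rand(4096, dtype=torch.float32, device='cuda:0')  # assumed: one weight per token (row)

out = triton_weighted_silu_forward(x, weight=weight)

# Assumed eager equivalent: silu(x) scaled per row by the token weight.
ref = F.silu(x.float()) * weight[:, None]
torch.testing.assert_close(out.float(), ref, rtol=2e-2, atol=2e-2)
```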
      +
      +
      + + def + triton_weighted_silu_backward( g: torch.Tensor, x: torch.Tensor, weight: Optional[torch.Tensor] = None): + + +
      + + +

Backward pass of triton_weighted_silu_forward; a gradient-check sketch follows below.

      + +
      Arguments:
      + +
        +
      • g: gradient tensor
      • +
      • x: input tensor
      • +
      • weight: weight tensor
      • +
      + +
      Returns:
      + +
      +

      dx: gradient of x + dw: gradient of weight

      +
      +
      + + +
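A hedged gradient-check sketch against autograd, under the same assumed per-token weight semantics as the forward sketch above; tolerances are arbitrary.

```python
import torch
import torch.nn.functional as F

from linghe.utils.silu import triton_weighted_silu_backward

x = torch.randn(1024, 2048, dtype=torch.bfloat16, device='cuda:0')
weight = torch.rand(1024, dtype=torch.float32, device='cuda:0')     # assumed per-token weight
g = torch.randn(1024, 2048, dtype=torch.bfloat16, device='cuda:0')  # upstream gradient

dx, dw = triton_weighted_silu_backward(g, x, weight)

# Autograd reference under the same assumed semantics as the forward sketch.
x_ref = x.detach().float().requires_grad_(True)
w_ref = weight.detach().clone().requires_grad_(True)
(F.silu(x_ref) * w_ref[:, None]).backward(g.float())
torch.testing.assert_close(dx.float(), x_ref.grad, rtol=2e-2, atol=2e-2)
torch.testing.assert_close(dw.float(), w_ref.grad, rtol=2e-2, atol=2e-2)
```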
      From 16982564130d1db1f6fa67219342defa06711605 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E5=8D=97=E9=9C=84?= Date: Thu, 16 Oct 2025 20:57:54 +0800 Subject: [PATCH 7/7] refine doc --- README.md | 4 +- asserts/api.md | 204 ------------------------------------------------- 2 files changed, 2 insertions(+), 206 deletions(-) delete mode 100644 asserts/api.md diff --git a/README.md b/README.md index 277206e..08e3b3b 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ ## Introduction --- -Our repo, Linghe, is designed for LLM training, especially for MoE training with FP8 quantizaiton. It provides 2 main categories of kernels: +Our repo, linghe, is designed for LLM training, especially for MoE training with FP8 quantizaiton. It provides 2 main categories of kernels: - **Fused quantization kernels**: fuse quantization with previous layer, e.g., RMS norm and Silu. - **Memory-efficiency kernels**: fuse multiple IO-itensive operations, e.g., ROPE with qk-norm. @@ -36,7 +36,7 @@ Our repo, Linghe, is designed for LLM training, especially for MoE training with --- We benchmark on H800 with batch size 8192, hidden size 2048, num experts 256, activation experts 8. -| kernel | baseline(us) | Linghe(us) | speedup | +| kernel | baseline(us) | linghe(us) | speedup | |--------|--------------|------------|---------| | RMSNorm+Quantization(forward) | 159.3 us | 72.4 us | 2.2 | | Split+qk-norm+rope+transpose(forward) | 472 us | 59.1 us | 7.99 | diff --git a/asserts/api.md b/asserts/api.md deleted file mode 100644 index a6ff272..0000000 --- a/asserts/api.md +++ /dev/null @@ -1,204 +0,0 @@ -# API Reference - - -``` -linghe.utils.norm.triton_rms_norm_and_block_quant_forward(x, weight, eps:Optional[float]=1e-6, out:Optional[torch.Tensor]=None, scale:Optional[torch.Tensor]=None, rms:Optional[torch.Tensor]=None, round_scale: Optional[bool]=False, output_mode:Optional[int]=2) -``` - -Computes the forward pass of RMSNorm and block quantization. - -**Parameters:** -- x(*torch.Tensor*) - Input tensor. [M, N] -- weight(*torch.Tensor*) - RMSNorm weight. [N] -- eps(*float*) - epsilon value for L2 normalization. -- round_scale(*bool*) - Set whether to force power of 2 scales. -- rms(*torch.Tensor*) - Reciprocal of the root mean square of the input calculated over the last dimension.[N] -- output_mode - (*int*, {0, 1, 2}, default = 2) 0 only output non-transpose tensor, 1 only output transposed tensor, 2 return both. - ---- - -**` -Class linghe.facade.rope.QkNormHalfRopeFunction -`** - -``` -forward(qkv:, q_norm_weight, k_norm_weight, freqs, H, h, eps:Optional[float]=1e-6) -``` -Split qkv, and apply L2 nrom and ROPE on q and k. - -**Parameters:** -- qkv(*torch.Tensor*) - QKV tensor with size of [S, B, dim] -- freqs(*torch.Tensor*) - Freqs matrix based on half dim. -- H(*int*) - Number of attention heads. -- h(*int*) - Number of query groups. -- eps(*float*) - epsilon value for L2 normalization. - -``` -backward(grad_q, grad_k, grad_v) -``` -**Parameters:** -- grad_q(*torch.Tensor*) Grad of q tensor. -- grad_k(*torch.Tensor*) Grad of k tensor. -- grad_v(*torch.Tensor*) Gard of v tensor. - ---- - -**` -Class linghe.facade.fp32_linear.FusedFp32GEMM -`** - -Optimized fp32 gemm in router gate function. Convert bf16 input and weight to float32 during the gemm operation. - -``` -forward(input, weight) -``` -**Parameters:** -- input(*torch.Tensor*) - Input tensor with [B, S, dim], dtype of bf16. -- weight(*torch.Tensor*) - Weight tensor of router. 
- -``` -backward(grad_output) -``` -**Parameters:** -- grad_output(*torch.Tensor*) - Gradient of the activation. - ---- - -``` -linghe.utils.gather.triton_permute_with_mask_map(inp, scale, probs, row_id_map, num_out_tokens, contiguous, tokens_per_expert) -``` -Permute the tokens and probs based on the routing map. Index indicates row index of the output tensor(-1 means not selected). Perform well even when inp.size(0) < expert padding number, do not need extra explict padding. - -**Parameters:** -- inp(*torch.Tensor*) - Input hidden.[num_tokens, hidden_size] -- scale(*torch.Tensor*) - [num_tokens, scale_size] -- prob(*torch.Tensor*) - [num_tokens] Router prob. -- row_id_map(*torch.Tensor*) - [n_experts, num_tokens] Index indicates row index of the output tensor. -- num_out_tokens(*int*) - Output token count, including padding tokens. -- contiguous(*bool*) - Whether indices in row_id_map is contiguous, should be False if padded. -- token_per_expert(bool) - [num_experts] Token count per expert, non-blocking cuda tensor. - ---- - -``` -linghe.utils.scatter.triton_unpermute_with_mask_map(grad, row_id_map, probs) -``` -Unpermute a tensor with permuted tokens with router mapping. - -**Parameters:** -- inp(*torch.Tensor*) - [num_tokens, hidden_size] Permuted tokens. -- row_id_map(*torch.Tensor*) - [n_experts, num_tokens] Routing map to unpermute the tokens. -- prob(*torch.Tensor*) - [num_out_tokens] Permuted probs. - ---- - -``` -linghe.util.silu.triton_silu_and_block_quant_forward(x, out:Optional[torch.Tensor]=None, scale:Optional[torch.Tensor]=None, round_scale:Optional[bool]=False, output_mode:Optional[int]=2) -``` - -Applies the forward pass of Sigmoid Linear Unit(SiLU) element-wise and block quant.(used in shared expert layers.) - -**Parameters:** -- x(*torch.Tensor*) - Input tensor to be quanted. -- round_scale(*bool*) - Set whether to force power of 2 scales. -- output_mode - (*int*, {0, 1, 2}, default = 2) 0 only output non-transpose tensor, 1 only output transposed tensor, 2 return both. - ---- - -``` -linghe.util.silu.triton_silu_and_block_quant_backward(g, x, round_scale:Optional[bool]=False) -``` -**Parameters:** -- g(*torch.Tensor*) - Gradient tensor to be quanted. -- x(*torch.Tensor*) - Input tensor. -- round_scale(*bool*) - Set whether to force power of 2 scales. Default to False. - ---- - -``` -linghe.util.silu.triton_batch_weighted_silu_and_block_quant_forward(x, weight, counts, splits:Optional[List]=None ,out:Optional[torch.Tensor]=None, scale:Optional[torch.Tensor]=None, round_scale:Optional[bool]=False, output_mode:Optional[int]=2) -``` - -Fused op for batched weighted SiLU and block quant. - -**Parameters:** -- x(*torch.Tensor*) - Input tensor. -- weight(*torch.Tensor*) - Permuted probs -- couts(*torch.Tensor*) - Tokens per expert cuda tensor. -- splits(*List[int]*) - List of tokens per expert. If compute in batch mode should not be None. -- output_mode - (*int*, {0, 1, 2}, default = 2) 0 only output non-transpose tensor, 1 only output transposed tensor, 2 return both. - ---- - -``` -linghe.util.silu.triton_batch_weighted_silu_and_block_quant_backward(g, x, weight, counts, splits:Optional[List]=None, round_scale:Optional[bool]=False) -``` -Return blockwise quantized gradient of silu backward. -Quantized tensor is A tuple of () -**Parameters:** -- g(*torch.Tensor*) - Input gradient tensor. -- x(*torch.Tensor*) - Input tensor. -- weight(*torch.Tensor*) - Permuted probs, -- counts(*torch.Tensor*) - Tokens per expert, it is a CUDA tensor. 
-- splits(*List[int]*) - Tokens per expert, it is a list of int. -- round_scale(bool) - round scale to integer pow of 2 ---- - -**` -Class linghe.facade.loss.SoftmaxCrossEntropyFunction -`** - -SoftmaxCrossEntropy. - -``` -forward(logits, labels, inplace: Optional[bool]=False) -``` - -Fast impl of softmax cross entropy. - -**Parameters:** -- logits(*torch.Tensor*) - Input logits. -- labels(*torch.Tensor*) - Input labels. -- inplace(*bool*) - reuse the `logits` tensor as gradient tensor if inplace=True, else allocate a new tensor. - -``` - - -``` -linghe.util.reduce.triton_batch_sum_with_ord(xs, ord:Optional[int]=2) -``` -return sum(abs(x)**ord). - -**Parameters:** -- xs(*List[torch.Tensor]*) - Tensor lists. -- ord(*int*) - the order of tensor. - ---- - -``` -linghe.util.reduce.triton_batch_count_zero(xs) -``` -Parallel count zeros in the given tensor lists, return the total zero number. - -**Parameters:** -- xs(*List[torch.Tensor]*) - Tensor lists. - ---- - -**` -Class linghe.facade.norm.GroupNormGateFunction -`** -Fused operation of group RMSNorm and sigmoid gate function. - -``` -forward(x, gate, weight, eps:Optional[float]=1e-6, group_size:Optional[int]=4) -``` -Note that the output shape is transposed [S, B, dim] - -**Parameters:** - -- x(*torch.Tensor*) - [B, S, dim], output tensor of attention kernel. -- gate(*torch.Tensor*) - [S, B, dim], gate tensor. -- weight(*torch.Tensor*) - [dim], RMSNorm weight tensor. -- group_size(int) - group size of RMSNorm -```