
Conversation

@noemotiovon
Contributor

  • Refactor the Ascend GEGLU kernels to use a flattened 1D grid-stride loop pattern instead of the row-based tiling approach, for better performance
  • Simplify the block size calculation using compute_default_tiling_strategy
  • Align the type conversion logic with the GPU version for consistency
  • Update test tolerances for NPU bfloat16 (atol=1e4) to handle precision differences

Hardware Type: Ascend 910B4

  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence
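
For illustration, the flattened 1D grid-stride loop pattern can be sketched in plain Python. This is a simplified model of the idea, not the actual Triton kernel: `gelu_tanh` and `geglu_forward_grid_stride` are hypothetical names, and the real kernel would use vectorized `tl.load`/`tl.store` over a block instead of the inner element loop.

```python
import math

def gelu_tanh(x):
    # gelu_pytorch_tanh approximation:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def geglu_forward_grid_stride(a, b, num_programs, block_size):
    """Compute GEGLU out[i] = gelu(a[i]) * b[i] over flattened inputs.

    Each of the num_programs "programs" starts at its own block offset and
    advances by num_programs * block_size, so the whole flattened range is
    covered regardless of its size (a grid-stride loop), with no per-row
    tiling logic.
    """
    n = len(a)
    out = [0.0] * n
    stride = num_programs * block_size
    for pid in range(num_programs):            # one iteration per "program"
        start = pid * block_size
        for base in range(start, n, stride):   # grid-stride over blocks
            for i in range(base, min(base + block_size, n)):
                out[i] = gelu_tanh(a[i]) * b[i]
    return out
```

Because the loop is over the flattened element count, the same launch configuration works for any (batch, sequence, hidden) shape once the tensors are viewed as 1D.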

@noemotiovon
Contributor Author

Benchmark:

**************************************
     BENCHMARKING SPEED for GEGLU
**************************************
[WARNING] Please DO NOT tune args ['num_warps']!
[WARNING] Please DO NOT tune args ['num_warps']!
********** Benchmark Data **********
[
  {
    "kernel_name": "geglu",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      41.29764175415039,
      77.97412109375,
      153.22802734375,
      305.4518127441406
    ],
    "y_values_20": [
      41.29764175415039,
      77.97412109375,
      153.22802734375,
      305.4518127441406
    ],
    "y_values_80": [
      41.29764175415039,
      77.97412109375,
      153.22802734375,
      305.4518127441406
    ],
    "timestamp": "2026-01-20 02:12:01",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      43.27560043334961,
      82.03520202636719,
      159.79339599609375,
      318.13525390625
    ],
    "y_values_20": [
      43.27560043334961,
      82.03520202636719,
      159.79339599609375,
      318.13525390625
    ],
    "y_values_80": [
      43.27560043334961,
      82.03520202636719,
      159.79339599609375,
      318.13525390625
    ],
    "timestamp": "2026-01-20 02:12:10",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      10.68362045288086,
      20.787700653076172,
      41.33747863769531,
      82.21530151367188
    ],
    "y_values_20": [
      10.68362045288086,
      20.787700653076172,
      41.33747863769531,
      82.21530151367188
    ],
    "y_values_80": [
      10.68362045288086,
      20.787700653076172,
      41.33747863769531,
      82.21530151367188
    ],
    "timestamp": "2026-01-20 02:12:15",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      11.14247989654541,
      21.72364044189453,
      43.50083923339844,
      86.65560150146484
    ],
    "y_values_20": [
      11.14247989654541,
      21.72364044189453,
      43.50083923339844,
      86.65560150146484
    ],
    "y_values_80": [
      11.14247989654541,
      21.72364044189453,
      43.50083923339844,
      86.65560150146484
    ],
    "timestamp": "2026-01-20 02:12:20",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      27.273399353027344,
      50.27893829345703,
      98.00166320800781,
      193.2967987060547
    ],
    "y_values_20": [
      27.273399353027344,
      50.27893829345703,
      98.00166320800781,
      193.2967987060547
    ],
    "y_values_80": [
      27.273399353027344,
      50.27893829345703,
      98.00166320800781,
      193.2967987060547
    ],
    "timestamp": "2026-01-20 02:12:27",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      28.78070068359375,
      53.18796157836914,
      102.28961944580078,
      204.07730102539062
    ],
    "y_values_20": [
      28.78070068359375,
      53.18796157836914,
      102.28961944580078,
      204.07730102539062
    ],
    "y_values_80": [
      28.78070068359375,
      53.18796157836914,
      102.28961944580078,
      204.07730102539062
    ],
    "timestamp": "2026-01-20 02:12:35",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  }
]
**************************************
     BENCHMARKING MEMORY for GEGLU
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "geglu",
    "kernel_provider": "liger",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      1910.0087890625,
      3092.0078125,
      5668.0078125,
      10820.0078125
    ],
    "y_values_20": [
      1910.0087890625,
      3092.0078125,
      5668.0078125,
      10820.0078125
    ],
    "y_values_80": [
      1910.0087890625,
      3092.0078125,
      5668.0078125,
      10820.0078125
    ],
    "timestamp": "2026-01-20 02:12:40",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "huggingface",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      2082.00927734375,
      3436.00830078125,
      6356.00830078125,
      12196.0078125
    ],
    "y_values_20": [
      2082.00927734375,
      3436.00830078125,
      6356.00830078125,
      12196.0078125
    ],
    "y_values_80": [
      2082.00927734375,
      3436.00830078125,
      6356.00830078125,
      12196.0078125
    ],
    "timestamp": "2026-01-20 02:12:49",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "liger",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      922.0048828125,
      1566.0048828125,
      2854.0048828125,
      5430.0048828125
    ],
    "y_values_20": [
      922.0048828125,
      1566.0048828125,
      2854.0048828125,
      5430.0048828125
    ],
    "y_values_80": [
      922.0048828125,
      1566.0048828125,
      2854.0048828125,
      5430.0048828125
    ],
    "timestamp": "2026-01-20 02:12:56",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "huggingface",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      1094.00537109375,
      1910.00537109375,
      3542.00537109375,
      6806.00537109375
    ],
    "y_values_20": [
      1094.00537109375,
      1910.00537109375,
      3542.00537109375,
      6806.00537109375
    ],
    "y_values_80": [
      1094.00537109375,
      1910.00537109375,
      3542.00537109375,
      6806.00537109375
    ],
    "timestamp": "2026-01-20 02:13:00",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "liger",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      1910.0087890625,
      3092.0078125,
      5668.0078125,
      10820.0078125
    ],
    "y_values_20": [
      1910.0087890625,
      3092.0078125,
      5668.0078125,
      10820.0078125
    ],
    "y_values_80": [
      1910.0087890625,
      3092.0078125,
      5668.0078125,
      10820.0078125
    ],
    "timestamp": "2026-01-20 02:13:04",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  },
  {
    "kernel_name": "geglu",
    "kernel_provider": "huggingface",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "T",
    "x_label": "sequence length",
    "x_values": [
      1024,
      2048,
      4096,
      8192
    ],
    "y_values_50": [
      2082.00927734375,
      3436.00830078125,
      6356.00830078125,
      12196.0078125
    ],
    "y_values_20": [
      2082.00927734375,
      3436.00830078125,
      6356.00830078125,
      12196.0078125
    ],
    "y_values_80": [
      2082.00927734375,
      3436.00830078125,
      6356.00830078125,
      12196.0078125
    ],
    "timestamp": "2026-01-20 02:13:11",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"bsz\": 8, \"hidden_size\": 4096, \"intermediate_size\": 11008, \"hidden_act\": \"gelu_pytorch_tanh\", \"dtype\": \"torch.bfloat16\"}",
    "liger_version": "0.6.4"
  }
]

Comment on lines +39 to +40
# TODO: we should find a better way to tune this. 1e4 is too large apparently
1e-2 if device != "npu" else 1e4,
Collaborator


Do you know which tensors fail at this tolerance: the gradients or the inputs?

Contributor Author


Thanks for the question. I double-checked which tensors require the large tolerance.

On NPU with bfloat16:

  • Forward outputs (y1 vs y2) differ at around O(1e2).
  • Weight gradients (gate_proj / up_proj / down_proj) also differ at O(1e2).
  • The largest discrepancy is in the input gradients: x1.grad vs x2.grad can reach O(1e4).

So the forward outputs and weight gradients are already numerically different at ~1e2, and the input gradients amplify this difference further.
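
For reference, torch.allclose-style checks pass per element when |actual - expected| <= atol + rtol * |expected|. The sketch below uses made-up values (not the actual test tensors) to show why a discrepancy on the order of 4e3 in the input gradients forces atol=1e4 at rtol=1e-2:

```python
def allclose_elementwise(actual, expected, rtol, atol):
    """Per-element pass/fail matching the torch.allclose criterion:
    |actual - expected| <= atol + rtol * |expected|."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))

# Made-up input-gradient element: the two implementations disagree by ~4e3
# on a reference value of magnitude ~1e2.
expected = [150.0]
actual = [150.0 + 4096.0]

assert not allclose_elementwise(actual, expected, rtol=1e-2, atol=1e-2)  # fails
assert allclose_elementwise(actual, expected, rtol=1e-2, atol=1e4)       # passes
```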

Contributor Author


================================================================================
SUMMARY - Minimum atol needed for each tensor (rtol=1e-2):
================================================================================
output                        : min_atol=1e2   , max_abs_diff=2.048000e+03
gate_proj.weight.grad         : min_atol=1e3   , max_abs_diff=2.048000e+03
up_proj.weight.grad           : min_atol=1e2   , max_abs_diff=2.048000e+03
down_proj.weight.grad         : min_atol=1e2   , max_abs_diff=2.048000e+03
input.grad                    : min_atol=1e4   , max_abs_diff=4.096000e+03
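
One plausible way such a minimum-atol summary could be computed (a hypothetical helper, not necessarily the script actually used for the table above):

```python
import math

def min_atol_power_of_ten(actual, expected, rtol):
    """Smallest power-of-ten atol making every element satisfy the
    allclose criterion |a - e| <= atol + rtol * |e|."""
    # Residual error left after the rtol term absorbs its share.
    residual = max(abs(a - e) - rtol * abs(e) for a, e in zip(actual, expected))
    if residual <= 0:
        return 0.0  # rtol alone already covers every element
    return 10.0 ** math.ceil(math.log10(residual))
```

Under this scheme a max absolute difference of ~4.1e3 against small reference values rounds up to atol=1e4, while the same max difference against large reference values can round to a smaller atol because the rtol term absorbs more of it.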

Contributor Author


Also worth noting: the tolerances used here are consistent with the previous NPU GEGLU kernel implementation, so this change does not introduce new numerical error compared to the existing behavior on NPU.
