Operator Description
Gated DeltaNet (GDN) [1] is an emerging linear attention [2] operator that has been adopted in state-of-the-art large language models such as Qwen3-Next and Kimi Linear. By introducing a learnable gating mechanism into the original DeltaNet framework [3], GDN enables dynamic, input-dependent control over memory updates, effectively balancing retention and forgetting in long-context sequences. This enhancement preserves the core advantage of linear attention (computational and memory cost that scales linearly with sequence length) while significantly improving numerical stability, retrieval accuracy, and adaptability to shifting contextual dependencies.
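To make the operator concrete, the sequential form of the gated delta rule from [1] can be sketched as a fast-weight recurrence: the state `S` is decayed by a gate `alpha_t`, then updated by a delta-rule correction scaled by `beta_t`. This is a minimal NumPy reference (the function name and shapes are illustrative, not the repository's API):

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Sequential reference sketch of the gated delta rule.

    q, k, v : (T, d) arrays of queries, keys, values.
    alpha   : (T,) decay gates in [0, 1] (forgetting).
    beta    : (T,) write strengths (delta-rule learning rates).

    Recurrence:
        S_t = alpha_t * S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
        o_t = S_t q_t
    """
    T, d = q.shape
    S = np.zeros((d, d))          # fast-weight state, O(d^2) regardless of T
    out = np.zeros((T, d))
    for t in range(T):
        S = alpha[t] * S                      # gated decay (forget)
        delta = v[t] - S @ k[t]               # prediction error for key k_t
        S = S + beta[t] * np.outer(delta, k[t])  # delta-rule write
        out[t] = S @ q[t]                     # read with the query
    return out
```

Note the linear-in-`T` cost: each step touches only the fixed-size `d x d` state, which is the property the TileLang kernel below is meant to exploit with chunked/parallel scans.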
Implementation Plan
1. Kernel Implementation (L1)
   - Kernel: Implement the TileLang kernel in `top/kernels/LinearAttn/`
   - Verification: Pass the functional correctness tests
2. Op Definition (L2)
   - Interface: Define the `torch.ops` interface in `top/ops/gatedDeltaNet.py`
   - Unit Tests: Implement `tests/test_<op_name>.py` (compare against a PyTorch reference)
     - FP16 (close: 1e-3)
     - BF16 (close: 1.6e-2)
   - Benchmarks: Implement `benchmarks/benchmark_gatedDeltaNet.py`
     - Latency
     - TFLOPS
     - DRAM bandwidth
Reference
[1] S. Yang, J. Kautz, and A. Hatamizadeh, “Gated Delta Networks: Improving Mamba2 with Delta Rule,” arXiv preprint arXiv:2412.06464, 2024. DOI: 10.48550/arXiv.2412.06464.
[2] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in Proc. 37th Int. Conf. Mach. Learn. (ICML), vol. 119, 2020, pp. 5156–5165.
[3] I. Schlag, K. Irie, and J. Schmidhuber, “Linear transformers are secretly fast weight programmers,” in Proc. 38th Int. Conf. Mach. Learn. (ICML), vol. 139, 2021, pp. 9355–9366.