Operator Description
Gated DeltaNet (GDN) [1] is an emerging linear attention [2] operator that has been adopted in state-of-the-art large language models such as Qwen3-Next and Kimi Linear. By introducing a learnable gating mechanism into the original DeltaNet framework [3], GDN enables dynamic, input-dependent control over memory updates, effectively balancing retention and forgetting in long-context sequences. This enhancement preserves the core advantage of linear attention (computational and memory cost that scales linearly with sequence length) while significantly improving numerical stability, retrieval accuracy, and adaptability to shifting contextual dependencies.
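To make the operator concrete, the sequential form of the gated delta rule from [1] can be sketched as a fast-weight recurrence: the state `S` is decayed by a gate `alpha_t`, then updated by a delta-rule correction scaled by `beta_t`. This is a minimal NumPy reference (the function name and shapes are illustrative, not the repository's API):

```python
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """Sequential reference sketch of the gated delta rule.

    q, k, v : (T, d) arrays of queries, keys, values.
    alpha   : (T,) decay gates in [0, 1] (forgetting).
    beta    : (T,) write strengths (delta-rule learning rates).

    Recurrence:
        S_t = alpha_t * S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
        o_t = S_t q_t
    """
    T, d = q.shape
    S = np.zeros((d, d))          # fast-weight state, O(d^2) regardless of T
    out = np.zeros((T, d))
    for t in range(T):
        S = alpha[t] * S                      # gated decay (forget)
        delta = v[t] - S @ k[t]               # prediction error for key k_t
        S = S + beta[t] * np.outer(delta, k[t])  # delta-rule write
        out[t] = S @ q[t]                     # read with the query
    return out
```

Note the linear-in-`T` cost: each step touches only the fixed-size `d x d` state, which is the property the TileLang kernel below is meant to exploit with chunked/parallel scans.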
Implementation Plan
1. Kernel Implementation (L1)
   - Kernel: Implement the TileLang kernel in `top/kernels/LinearAttn/`
   - Verification: Pass the functional correctness tests
2. Op Definition (L2)
   - Interface: Define the `torch.ops` interface in `top/ops/gatedDeltaNet.py`
   - Unit Tests: Implement `tests/test_<op_name>.py` (compare against a PyTorch reference)
     - FP16 (close: 1e-3)
     - BF16 (close: 1.6e-2)
   - Benchmarks: Implement `benchmarks/benchmark_gatedDeltaNet.py`
     - Latency
     - TFLOPS
     - DRAM bandwidth
Reference
[1] S. Yang, J. Kautz, and A. Hatamizadeh, “Gated Delta Networks: Improving Mamba2 with Delta Rule,” arXiv preprint arXiv:2412.06464, 2024. DOI: 10.48550/arXiv.2412.06464.
[2] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in Proc. 37th Int. Conf. Mach. Learn. (ICML), vol. 119, 2020, pp. 5156–5165.
[3] I. Schlag, K. Irie, and J. Schmidhuber, “Linear transformers are secretly fast weight programmers,” in Proc. 38th Int. Conf. Mach. Learn. (ICML), vol. 139, 2021, pp. 9355–9366.