[RWKV7] Remove in-place operations and add gradient checkpointing for `v_first` #145

Triang-jyed-driung · 2025-01-26T14:16:17Z

The attention layer needs to return `v_first`; see https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py#L885.
`v_first` must not be modified in place; see https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py#L870.
This part of the RWKV-7 design should be handled with care; see https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py#L1273.

Rationale 1: In-place operations can break gradient computation. A possible error looks like:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 192, 768]], which is output 0 of torch::autograd::CopyBackwards, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
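For reference, this class of error can be reproduced outside the model with a few lines of PyTorch. The sketch below is a generic illustration and not the code in `modeling_rwkv7.py`; the function names are hypothetical. It relies on the fact that `torch.sigmoid` saves its output for the backward pass, so editing that output in place invalidates the saved tensor.

```python
import torch


def backward_with_inplace() -> str:
    """Return the autograd error message triggered by an in-place edit."""
    x = torch.randn(4, 3, requires_grad=True)
    v = torch.sigmoid(x)  # sigmoid keeps its output for the backward pass
    v.mul_(2.0)           # in-place edit bumps the version of that saved output
    try:
        v.sum().backward()
    except RuntimeError as err:
        return str(err)
    return ""


def backward_out_of_place() -> torch.Tensor:
    """Same computation, but the scaling allocates a new tensor."""
    x = torch.randn(4, 3, requires_grad=True)
    v = torch.sigmoid(x)
    v = v * 2.0           # out-of-place: the saved sigmoid output is untouched
    v.sum().backward()
    return x.grad
```

The first function fails during `backward()`, while the second yields a gradient of the same shape as `x`.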

Rationale 2: This requires `RWKV7Attention` to return an additional value. However, given the nature of the value embeddings, `RWKV7Attention` is expected at the 0th layer, even if the model is a hybrid architecture. Rather than forcing a unified interface and return values, which may cause undefined behavior, an additional value to unpack should make builders of hybrid architectures aware of this requirement.
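A minimal sketch of the interface described above, assuming a toy layer rather than the actual `RWKV7Attention`: `ToyAttention` and `ToyStack` are hypothetical names, and the gated mix toward `v_first` only stands in for the real value-residual computation. Layer 0 produces `v_first`, later layers receive it as an argument and return it unchanged, and nothing is written in place, so the layers can also be wrapped in `torch.utils.checkpoint.checkpoint`.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class ToyAttention(nn.Module):
    """Hypothetical stand-in for an attention layer that returns v_first."""

    def __init__(self, dim: int, layer_idx: int):
        super().__init__()
        self.layer_idx = layer_idx
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.gate = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x, v_first=None):
        v = self.v_proj(x)
        if self.layer_idx == 0:
            v_first = v  # plain rebinding, no copy_ into a shared buffer
        else:
            if v_first is None:
                raise ValueError("v_first is missing: layer 0 must produce it")
            # Out-of-place mix of this layer's values toward v_first.
            v = v + (v_first - v) * torch.sigmoid(self.gate(x))
        return self.o_proj(v), v_first


class ToyStack(nn.Module):
    """Threads v_first through the layers, optionally with checkpointing."""

    def __init__(self, dim: int = 8, num_layers: int = 3, use_checkpoint: bool = True):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.layers = nn.ModuleList([ToyAttention(dim, i) for i in range(num_layers)])

    def forward(self, x):
        v_first = None
        for layer in self.layers:
            if self.use_checkpoint and self.training:
                # v_first is an explicit input and output of the checkpointed call.
                out, v_first = checkpoint(layer, x, v_first, use_reentrant=False)
            else:
                out, v_first = layer(x, v_first)
            x = x + out
        return x
```

Because every layer returns a pair, a hybrid stack that places a different module at layer 0 fails at the unpacking step or at the explicit `v_first is None` check instead of silently computing with an undefined value.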

Triang-jyed-driung added 3 commits January 26, 2025 21:55

Update modeling_rwkv7.py

107a894

Merge pull request #1 from fla-org/main

547d117

[Flame] Mark accelerate-based training framework as legacy

Update rwkv7.py

4c0d43a

No in-place operations for `v_first` for gradient computation

yzhangcs merged commit 1bc96a6 into fla-org:main Jan 27, 2025
1 check failed
