
[RWKV7] Remove in-place operations and add gradient checkpointing for v_first #145

Merged
merged 3 commits into fla-org:main on Jan 27, 2025

Conversation

Triang-jyed-driung
Contributor

  1. The attention layer needs to return v_first; see https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py#L885.
  2. No in-place operations on v_first; see https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py#L870.
  3. This design of RWKV-7 should be treated with care (a rough sketch of the pattern follows below); see https://github.com/BlinkDL/RWKV-LM/blob/main/RWKV-v5/src/model.py#L1273.
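For context, the linked RWKV-LM code follows roughly this pattern: layer 0 produces v_first and every later layer mixes its own values toward it. The sketch below is illustrative only; `value_residual`, `gate`, and the exact mixing formula are assumptions, not the fla-org implementation.

```python
import torch

def value_residual(v, v_first, layer_id, gate):
    """Minimal sketch of the RWKV-7 value-embedding ("v_first") pattern.

    Layer 0 defines the shared value; later layers interpolate toward it.
    All operations are out of place so autograd can track every version.
    """
    if layer_id == 0:
        v_first = v                                   # layer 0: just bind, no in-place copy
    else:
        v = v + (v_first - v) * torch.sigmoid(gate)   # later layers: out-of-place mix
    return v, v_first
```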

Rationale 1: In-place operations harm gradient computation. A possible error looks like:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 192, 768]], which is output 0 of torch::autograd::CopyBackwards, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
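This class of error can be reproduced in isolation (a standalone repro, not the actual fla code; the shapes simply mirror the error message above):

```python
import torch

x = torch.randn(4, 192, 768, requires_grad=True)
v_first = x * 2.0            # v_first is an intermediate tensor in the autograd graph
y = (v_first ** 2).sum()     # backward of ** needs v_first at version 0

v_first[:] = 0.0             # in-place write bumps v_first's version counter

y.backward()                 # RuntimeError: ... modified by an inplace operation
```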

Rationale 2: This requires RWKV7Attention to return an additional value. However, given the nature of the value embeddings, one expects RWKV7Attention to sit at the 0th layer even in a hybrid architecture. Rather than forcing a unified interface with return values that may cause undefined behavior, an extra value to unpack will make hybrid-architecture builders explicitly aware of v_first.
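In practice this means a hybrid model's forward loop has to thread v_first through explicitly. A hypothetical usage sketch (names, attributes, and signatures are illustrative, not the fla-org API):

```python
def forward(layers, x):
    # Only the RWKV-7 blocks produce/consume v_first, so every caller is
    # forced to unpack the extra return value instead of relying on an
    # in-place side effect on a shared buffer.
    v_first = None
    for layer in layers:
        if getattr(layer, "is_rwkv7", False):
            x, v_first = layer(x, v_first)   # extra value to unpack
        else:
            x = layer(x)                     # e.g. a plain softmax-attention block
    return x
```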

[Flame] Mark accelerate-based training framework as legacy
No in-place operations for `v_first` for gradient computation
@yzhangcs yzhangcs merged commit 1bc96a6 into fla-org:main Jan 27, 2025
1 check failed