
Conversation

@tohtana
Contributor

@tohtana tohtana commented Nov 3, 2025

Currently, DeepSpeed's backward API has more constraints than PyTorch's standard backward API.
Here is the usage as described in the documentation:

    loss = model_engine(batch)
    model_engine.backward(loss)

In this example:

  1. The backward API accepts only a (scalar) loss value.
  2. You need to call the engine's backward API rather than the tensor's.

In contrast, in standard PyTorch, you can do:

    output = model(batch)
    output.backward(out_grad)

There are several use cases that rely on this flexibility. For example, combining multiple models or using loss functions defined separately from the main model.
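As a minimal sketch of the second use case (criterion, labels, and batch are hypothetical placeholders), the loss is computed outside the engine:

    # Hypothetical sketch: the loss function is defined separately from the model.
    # model_engine comes from deepspeed.initialize(); criterion is a plain nn.Module.
    output = model_engine(batch)        # forward pass through the DeepSpeed engine
    loss = criterion(output, labels)    # loss computed outside the engine
    loss.backward()                     # plain tensor-level backward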

If you attempt the same pattern with a DeepSpeed engine, some preprocessing and postprocessing steps will be silently skipped, which can lead to incorrect results.

The documentation explains that we can call _backward_epilogue manually (and possibly backward_prologue as well). However, it is easy for users to miss these calls, and passing a non-scalar gradient is still not supported.
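A rough sketch of that documented workaround (method names as mentioned above; exact signatures may differ):

    # Sketch of the manual-call workaround described in the docs.
    output = model_engine(batch)
    loss = criterion(output, labels)
    model_engine.backward_prologue()    # easy to forget
    loss.backward()
    model_engine._backward_epilogue()   # easy to forget; non-scalar grads still unsupported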

This PR introduces the same .backward() behavior as PyTorch, allowing .backward() to be called directly on tensors and supporting non-scalar outputs.
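Based on the description above, a minimal usage sketch of the new behavior:

    # With this change, backward can be called directly on the engine's output.
    output = model_engine(batch)              # output may be non-scalar
    out_grad = torch.ones_like(output)        # gradient w.r.t. the non-scalar output
    output.backward(out_grad)                 # no explicit model_engine.backward() needed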

To implement post-backward hooks, we had to use some torch internal APIs; see the code comments for more details. When those internal APIs are not available, the DeepSpeed engine only accepts the traditional model_engine.backward(loss) call.
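For illustration, one internal mechanism of this kind (not necessarily the one used in this PR) is queuing a callback on PyTorch's autograd engine so that it runs after the backward pass completes:

    import torch
    from torch.autograd import Variable

    def _post_backward():
        # Runs once the whole backward pass has finished (e.g. an engine epilogue).
        print("backward finished")

    def _queue_epilogue(grad):
        # Internal API: queue_callback may only be called while backward is running,
        # so it is queued from inside a gradient hook.
        Variable._execution_engine.queue_callback(_post_backward)
        return grad

    x = torch.randn(4, requires_grad=True)
    x.register_hook(_queue_epilogue)
    (x * 2).sum().backward()    # prints "backward finished" after gradients are computed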

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@sfc-gh-truwase
Collaborator

@tohtana, this is a very exciting usability improvement. Please remember to update the documentation.

tohtana and others added 18 commits November 6, 2025 16:24
@tohtana tohtana marked this pull request as ready for review November 14, 2025 02:13
@tohtana
Contributor Author

tohtana commented Nov 14, 2025

@sfc-gh-truwase I think this PR is now ready for review, though the latest change in HF Transformers causes an error in test_zero_nesting_init.py::TestNestedParallelInit::test_nested_parallel_init.

@tohtana tohtana enabled auto-merge (squash) November 19, 2025 00:01
@tohtana tohtana merged commit 53e91a0 into deepspeedai:master Nov 19, 2025
12 checks passed