
Raise when in place operations occur on leafs requiring grad #1458

Open · wants to merge 19 commits into base: main
Conversation

beverlylytle (Collaborator)

Before submitting
  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #1284

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@beverlylytle beverlylytle marked this pull request as ready for review November 21, 2024 11:21
@kshitij12345 (Collaborator) left a comment


The fix looks good. We should add a small test to verify that this error is raised when expected. Thanks @beverlylytle

thunder/tests/test_inplace_functionalization.py (outdated, resolved)
```
@@ -2190,6 +2182,9 @@ def is_float_type(self, input):


def _copy__impl(copy_from, copy_to):
    cd = get_compile_data()
    if cd is not None and cd.is_grad_enabled and copy_to.is_leaf and copy_to.requires_grad:
        raise RuntimeError("a leaf Variable that requires grad is being used in an in-place operation.")
```
Collaborator

I am wondering if the Symbol copy_ in thunder/torch/__init__.py is a more appropriate location for the check.

```python
@torchsymbol(torch.Tensor.copy_, is_method=True)  # , tags=(prims.OpTags.IN_PLACE,))
def copy_(a, b, /):
    return prims.copy_(b, a)
```

Collaborator Author

a and b are proxies, and it is not clear to me whether a proxy knows that it is a leaf.

Collaborator

They do not. It's only a PyTorch concept that's available at runtime inside _copy__impl.

@kshitij12345 (Collaborator, Nov 22, 2024)

Right, previously I missed that the fix was in _copy__impl. And since it is happening at runtime, I am wondering whether compile_data is actually available.

A quick test (see below) shows that it wouldn't be. So we probably need a way to check whether this copy was called under no_grad in user code (PyTorch supports in-place updates of leaf tensors under no_grad, see comment).

Snippet to check whether compile_data is available:

```python
import torch
import thunder
from thunder.extend import OperatorExecutor
from thunder.core.compile_data import get_compile_data
from thunder.core.proxies import TensorProxy

ex = OperatorExecutor("ex")

def clone_impl(x):
    cd = get_compile_data()
    print(cd)  # None
    return x

clone = ex.register_operator("clone", meta=lambda x: TensorProxy(like=x), fn=clone_impl)

def fn(x):
    return clone(x)

x = torch.ones(3)

jfn = thunder.jit(fn)

jfn(x)
exec_trace = thunder.last_traces(jfn)[-1]
# print(exec_trace)
```

Collaborator Author

Indeed, compile_data was not available, but now it should be with the added context manager in thunder/__init__.py.

Collaborator

I think this is still incorrect: as discussed in #1486, the value of compile_data.is_grad_enabled here would be that of the last updated state, which can lead to incorrect behavior when used outside of the tracing context.

We can see the discrepancy here.

```python
import torch
import thunder

x = torch.randn(3, 3, requires_grad=True)

@torch.no_grad
def fn(x):
  return x.add_(1)

fn(x)  # This works

thunder.jit(fn)(x)  # This raises error
```

So whether the copy is in a no_grad region needs to be captured at tracing time.
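The trace-time vs. runtime discrepancy can be reproduced without thunder or torch at all. In the plain-Python sketch below (a toy no_grad context manager and a list of deferred calls standing in for a trace, all names illustrative), a runtime query of the global grad flag sees only the flag's final state, while a value captured during tracing stays correct:

```python
# Minimal sketch: all "tracing" happens before any "runtime" call executes,
# so querying a global flag at runtime observes the last-updated state.
_grad_enabled = True  # stand-in for PyTorch's global grad mode

class no_grad:
    def __enter__(self):
        global _grad_enabled
        self._prev, _grad_enabled = _grad_enabled, False
    def __exit__(self, *exc):
        global _grad_enabled
        _grad_enabled = self._prev

runtime_calls = []  # stand-in for the recorded trace

def trace_copy_runtime_query():
    # WRONG: defer the query to runtime; by then the no_grad block has exited
    runtime_calls.append(lambda: _grad_enabled)

def trace_copy_baked_in():
    # RIGHT: capture the flag while tracing and bake it into the call
    captured = _grad_enabled
    runtime_calls.append(lambda: captured)

# "Tracing" phase: one copy recorded under no_grad via each strategy
with no_grad():
    trace_copy_runtime_query()
    trace_copy_baked_in()

# "Runtime" phase: replay the trace after tracing has finished
results = [call() for call in runtime_calls]
print(results)  # → [True, False]: the runtime query is stale, the baked-in value is correct
```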

Collaborator Author

Right, this is why I created the other issue. This PR fixes the leaf/grad issue when there is no annotation. When there is an annotation, another approach is required, and that approach may or may not involve using compile data in _copy__impl.

As far as I understand, compile data is the medium for passing around data such as whether grad is enabled. But as the other issue points out, compile data reflects the end state of a function call and not the "live" state, at least by the time it reaches _copy__impl. So I'm left with these questions:
  • Are there other mechanisms for passing around whether grad is enabled?
  • Where else in the execution is it simultaneously knowable that a (1) leaf tensor (2) requiring grad is being (3) copied while (4) grad is enabled?
  • Is it feasible/desirable to make the compile data more dynamic?
  • Is there a way to context-manage the tensors so that their requires_grad flags are set to False when the interpreter sees torch._C._set_grad_enabled(False), and then later restored, thereby obviating the need for compile data in this check?

Do you have suggestions for a fix that addresses both issues? Or can we close out this issue and move the discussion to the more involved one?

Collaborator

So, to tackle a leaf tensor requiring grad being copied into while grad is enabled: similar to a previous commit, I think we can update prims.copy_ to take an argument is_grad_enabled. With this, ltorch.copy_ will query cd.is_grad_enabled and call prims.copy_, passing this argument along.

```python
@torchsymbol(torch.Tensor.copy_, is_method=True)  # , tags=(prims.OpTags.IN_PLACE,))
def copy_(a, b, /):
    return prims.copy_(b, a)
```

With these changes, _copy__impl's signature will also change to accept is_grad_enabled, and at runtime it will be called with a tensor that we can query for leaf-ness, together with whether grad was enabled for that particular copy. Wdyt @beverlylytle?

Though, I am curious if there is another approach to this - cc: @IvanYashchuk

Collaborator Author

Let's see what the CI thinks.

Collaborator

I agree with modifying thunder.torch.copy to query cd.is_grad_enabled and passing that to prims.copy.
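The design agreed on here (the torch-level symbol reads cd.is_grad_enabled once at trace time and passes it down to the prim, whose runtime impl performs the leaf check) can be sketched in plain Python. FakeTensor, CompileData, and prim_copy_impl below are illustrative stand-ins, not thunder APIs:

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    # stand-in for a runtime torch.Tensor
    is_leaf: bool = True
    requires_grad: bool = False

@dataclass
class CompileData:
    # stand-in for thunder's compile data object
    is_grad_enabled: bool

def prim_copy_impl(copy_from, copy_to, grad_enabled):
    # Runtime check: mirrors PyTorch's error for in-place ops on leaf
    # tensors that require grad, but only when grad was enabled at the
    # point the copy was traced.
    if grad_enabled and copy_to.is_leaf and copy_to.requires_grad:
        raise RuntimeError(
            "a leaf Variable that requires grad is being used in an in-place operation."
        )
    return copy_to

def copy_(a, b, compile_data):
    # Trace-time: bake the grad state into the prim call, defaulting to
    # False (inference mode) when no compile data is available.
    grad_enabled = compile_data.is_grad_enabled if compile_data is not None else False
    return prim_copy_impl(b, a, grad_enabled=grad_enabled)

leaf = FakeTensor(is_leaf=True, requires_grad=True)
src = FakeTensor()

try:
    copy_(leaf, src, CompileData(is_grad_enabled=True))   # raises RuntimeError
except RuntimeError as e:
    print("raised:", e)

copy_(leaf, src, CompileData(is_grad_enabled=False))      # allowed, as under no_grad
```

The key property is that grad_enabled is fixed per traced copy, so a copy recorded inside a no_grad region keeps grad_enabled=False regardless of the grad state when the compiled function later runs.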

@kshitij12345 (Collaborator) left a comment

Overall looks good to me, I just have a couple of questions. Thank you @beverlylytle

```
@@ -2085,6 +2087,7 @@ def copy_(
    *,
    fd: FusionDefinition,
    lc_to_nv_map: dict,
    grad_enabled: bool,
```
Collaborator

What is the behaviour for nvFuser? I think we ignore this argument. Should we raise a warning instead?

Collaborator Author

Yes, the argument is ignored, and nvfuser does not fail.

thunder/tests/test_inplace_functionalization.py (outdated, resolved)
```diff
@@ -1983,7 +1983,8 @@ def copysign_(a, b, /):

 @torchsymbol(torch.Tensor.copy_, is_method=True)  # , tags=(prims.OpTags.IN_PLACE,))
 def copy_(a, b, /):
-    return prims.copy_(b, a)
+    cd = get_compile_data()
+    return prims.copy_(b, a, grad_enabled=cd.is_grad_enabled if cd is not None else False)
```
@kshitij12345 (Collaborator, Dec 16, 2024)

If cd is None (which probably happens for thunder.trace with default arguments), should we assume that we are running with grad enabled, and emit a warning? I think that is the likely case. Wdyt?

cc: @IvanYashchuk

@beverlylytle (Collaborator Author, Dec 17, 2024)

I don't have the background to have an opinion on this. I defer to you @kshitij12345 and @IvanYashchuk.

Collaborator

cd is our controlled way of specifying and querying the state of PyTorch. If it's None, I don't think we should do anything special. It's the responsibility of the outside system to set up a correct cd object. grad_enabled=False is a sensible default because, if nothing else is specified, we should assume we are executing the program as given, in "inference" mode with no additional side transformations.

thunder/core/prims.py (outdated, resolved)
thunder/executors/nvfuserex_impl.py (outdated, resolved)

thunder/tests/test_inplace_functionalization.py (outdated, resolved)
thunder/executors/nvfuserex_impl.py (outdated, resolved)
Comment on lines +225 to +234
```python
def _copy_(a, b, /):
    cd = get_compile_data()
    return prims.copy_(b, a, grad_enabled=cd.is_grad_enabled if cd is not None else False)


@torchsymbol(torch.Tensor.copy_, is_method=True)  # , tags=(prims.OpTags.IN_PLACE,))
def copy_(a, b, /):
    return _copy_(a, b)
```


Collaborator Author

Consider the following snippet:

```python
import thunder
import torch


x = torch.rand((2, 3), dtype=torch.float32, device='cuda')

def f(x):
    return x.to(torch.float64).sin_()
```

Here a new tensor of type float64 is created and an in-place operation is performed on it. Explicitly:

```python
def f(x):
    y = x.to(torch.float64)
    return y.sin_()
```

One might expect that the following would be a less efficient, but still more or less equivalent version of the above:

```python
def g(x):
    y = x.to(torch.float64)
    z = y.sin()
    return y.copy_(z)
```

However, for

```python
jf = thunder.jit(f); jg = thunder.jit(g)
```

jf(x) executes successfully while jg(x) results in:

````
An error occurred while defining nvFuser FusionDefinition None.
If you believe this is a bug or need assistance, please file an issue at https://github.com/NVIDIA/Fuser/issues/new
Here's a script to reproduce the error:
```python
# CUDA devices:
#  0: NVIDIA RTX 6000 Ada Generation
# torch version: 2.5.1+cu124
# cuda version: 12.4
# nvfuser version: 0.2.23+gitd53be45
import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_incomplete_fusion(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[2, 3], contiguity=[True, True], dtype=DataType.Float, is_cpu=False, stride_order=[1, 0])
    T1 = fd.ops.cast(T0, dtype=DataType.Double)
    T2 = fd.ops.sin(T1)
    T3 = fd.ops.set(T2)
    fd.add_output(T3, T1)
    fd.add_output(T1)

with FusionDefinition() as fd:
    nvfuser_fusion_idNone(fd)
```
Traceback (most recent call last):
  File "/home/blytle/miniforge3/envs/thdrs/lib/python3.10/site-packages/nvfuser/__init__.py", line 105, in __exit__
    self._finalize_definition()
RuntimeError:  INTERNAL ASSERT FAILED at "/workspace/Fuser/csrc/python_frontend/fusion_state.cpp":141, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues.
Detected exception while building Fusion Ir. The failing RecordFunctor is: fd.add_output(T3, T1)
NvFuser error message is:  INTERNAL ASSERT FAILED at "/workspace/Fuser/csrc/fusion.cpp":784, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. alias source can only be a fusion input
Exception raised from aliasOutputToInput at /workspace/Fuser/csrc/fusion.cpp:784 (most recent call first):
…
````

nvFuser does not like in-place operations performed on tensors which are not inputs to the fusion definition. Happily, they are usually functionalized away during tracing. The functionalization pass makes assumptions about what an in-place op looks like. Using a version of copy_ annotated by torchsymbol within the other in-place ops, like sin_, breaks those assumptions and leads to many tests failing with errors like the above. Hence the split between _copy_ and copy_.
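The motivation for the split can be illustrated without thunder. In the plain-Python sketch below (trace, _copy_, copy_, and sin_ are all illustrative stand-ins, not thunder APIs), a pass that matches on the registered symbol name "copy_" only sees user-visible copies, because in-place ops record their write-back through the internal helper instead:

```python
# Sketch: a recorded trace of (symbol, args...) tuples, a registered
# "copy_" symbol, and an internal "_copy_" helper used for write-backs.
trace = []

def _copy_(dst, src):
    # internal helper: records only the low-level prim
    trace.append(("prim_copy", dst, src))
    return dst

def copy_(dst, src):
    # registered torch-level symbol: the name a functionalization-style
    # pass would pattern-match on
    trace.append(("copy_", dst, src))
    return _copy_(dst, src)

def sin_(x):
    # in-place sin: compute out-of-place, then write back via the helper,
    # so no "copy_" symbol appears inside another in-place op's trace
    trace.append(("sin", x))
    return _copy_(x, f"sin({x})")

sin_("y")        # records: sin, prim_copy
copy_("x", "z")  # records: copy_, prim_copy

user_visible_copies = [t for t in trace if t[0] == "copy_"]
print([t[0] for t in trace])      # → ['sin', 'prim_copy', 'copy_', 'prim_copy']
print(len(user_visible_copies))   # → 1: only the explicit user copy is matched
```

If sin_ instead called the registered copy_, the pass would see an extra "copy_" symbol mid-trace and its assumptions about the trace shape would no longer hold, which is the failure mode described above.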

As a side note, I was also surprised to discover that

```python
def f(x):
    x.sin_()
    return x

def g(x):
    z = torch.sin(x)
    x.copy_(z)
    return x


x = torch.rand((2, 2), device='cuda', dtype=torch.float64)

jf = thunder.jit(f)
jf(x)                  # this is fine
jg = thunder.jit(g)
jg(x)                  # fails with an AssertionError on "assert return_bsym.sym.id == prims.PrimIDs.RETURN"
```

```
@@ -2241,7 +2246,7 @@ def true_divide(a: NumberLike | TensorLike, b: NumberLike | TensorLike, /) -> Nu

@torchsymbol(torch.Tensor.true_divide_, is_method=True, tags=(prims.OpTags.IN_PLACE,))
def true_divide_(a: TensorLike, b: NumberLike | TensorLike, /) -> TensorLike:
    return prims.copy_(true_divide(a, b))
```
Collaborator Author

I found the lack of a second argument here odd.


Successfully merging this pull request may close these issues.

[inplace] Silently incorrect gradient when leaf variable is used in an inplace operation
5 participants