Patching tensor proxy shape in trace #1260

Merged: 21 commits into main on Oct 18, 2024
Conversation

@jjsjann123 (Collaborator) commented on Oct 4, 2024

Working on #1253

  1. Updating _numel's deferred computation so that we can handle calls without an argument. (This is existing technical debt; see "Make numel a method of TensorProxy not a property attribute" #925 for more context.)
    This change fixes the error reported in "symbolic values: error in infer_tensor_properties, through reshape or numel" #1257, but let's keep that issue open since there will likely be more errors coming from that repro script.

  2. Updating TensorProxy.shape so that shape queries within a trace context are recorded with prims.shape (see the sketch below).
    This doesn't contradict "Modeling of shape queries" #1133: the explicit trace ensures correctness of the trace, and I'll follow up with a transformation to optimize shape queries.
    There are occasions where we don't want to leave a prims.shape in the trace while querying the shape of a TensorProxy, e.g. when printing a trace. I'm working around those cases by explicitly accessing TensorProxy._shape.

  3. Adding prims.shape support to the Python executor, since prims.shape can show up in the prologue trace, where the torch executor isn't available.

  4. Updating the DCE pass to choose the first producer instead of the last one when multiple producers exist for a single proxy.

TODO:
Need to update the DCE pass to remove duplicated producers in a given scope. I intend to handle that in a separate PR. See the comment for details.
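
For illustration, a minimal, self-contained sketch of the idea in item 2. The names below (TensorProxySketch, FakeTrace, _active_trace) are hypothetical stand-ins, not Thunder's actual implementation:

# Sketch only: `shape` records a symbolic query when a trace is active,
# while `_shape` bypasses recording (e.g. when printing a trace).
_active_trace = None  # stand-in for Thunder's tracing context


class FakeTrace:
    def __init__(self):
        self.symbols = []

    def record(self, name, args):
        self.symbols.append((name, args))


class TensorProxySketch:
    def __init__(self, shape):
        self._shape = tuple(shape)  # raw metadata; reading it records nothing

    @property
    def shape(self):
        if _active_trace is not None:
            # Recording the query keeps the trace correct under symbolic shapes.
            _active_trace.record("prims.shape", (self,))
        return self._shape


# Usage: inside a "trace", shape queries leave a prims.shape record behind,
# while direct access to _shape does not.
_active_trace = FakeTrace()
t = TensorProxySketch((4, 4))
assert t.shape == (4, 4) and len(_active_trace.symbols) == 1
assert t._shape == (4, 4) and len(_active_trace.symbols) == 1  # _shape adds nothing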

@jjsjann123 (Collaborator, Author) commented:

Note for myself: the failing cases look real, since I don't see them in other PRs. But strangely, I'm not seeing the repro in my local non-CUDA environment... 😕 even though the failing test runs on CPU as well.
e.g. thunder/tests/test_jit_general.py::test_cache_symbolic_values_reshape[cpu]

Let me try to grab a node with an actual GPU on it.

@jjsjann123 (Collaborator, Author) commented:

Errr, I don't know what's going on with that failing test, but CI on main is green, so I must have done something here...

@jjsjann123 (Collaborator, Author) commented on Oct 7, 2024

I still don't see what's wrong with this failing test, and I can't repro it locally: https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=216738&view=logs&j=2840892e-91ab-5245-da62-77ec9923516a&t=444f4171-6797-5730-4229-41427ed3bdc9&l=15058

Checking my luck again with CI.

This one was really strange; I even pulled the CI container but still got no repro...
OK, I double-checked, and it looks like the PyTorch version in my pulled container has been bumped. Maybe I'll get lucky with CI this time.

@jjsjann123 (Collaborator, Author) commented:

🤯 Nope. CI failed.

@jjsjann123 (Collaborator, Author) commented:

OK, looks like there is indeed a real issue here:

t_0 = tensor([[ 8,  0,  8,  6],
        [-2,  4,  6, -4],
        [-5, -8,  8,  9],
        [-1, -5,  5, -8]], device='cuda:0', dtype=torch.int8)
t_1 = tensor(-1, dtype=torch.int8)

torch.pow(t_0, t_1)

Strangely this doesn't repro locally with that test. That's a bit concerning.

@jjsjann123 (Collaborator, Author) commented:

Now that I know where the failure is coming from, I'll try to get a repro and verify whether it's actually caused by this PR (likely, maybe?!). But is it really an issue introduced by this PR?!

I'm not sure why others didn't hit this one, though.

@jjsjann123 (Collaborator, Author) commented:

Note for myself:

FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_pow_torch_cuda_thunder.dtypes.int8 - RuntimeError: "reciprocal_cuda" not implemented for 'Char'
Something very strange is going on here; see #1260 (comment).

The OpInfo test for pow runs through some random inputs; specifically, I think it's this one:

# Tests the inputs are a CPU scalar tensor and a CUDA tensor
a = make_tensor((4, 4), device=device, dtype=dtype, requires_grad=requires_grad, **kwargs)
b = make_tensor((), device="cpu", dtype=dtype, requires_grad=requires_grad, **kwargs)

So for the int8 dtype, the PyTorch implementation cannot handle b == torch.tensor(-1): https://github.com/pytorch/pytorch/blob/fe44b6a67f32b562c88701b630e65b62ce1b63ba/aten/src/ATen/native/cuda/PowKernel.cu#L178

Somehow this is reliably failing on CI for me in my PR, but not on my local runs.

Anyway, I'm trying to avoid this by passing exclude_zero=True 🤞
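
For reference, a rough sketch of the flag mentioned above (the exact change to the OpInfo sample generator may look different): exclude_zero=True asks torch.testing.make_tensor to replace any sampled zeros with a small non-zero value.

import torch
from torch.testing import make_tensor

# Sketch: same kind of inputs as the OpInfo sample above, but with
# exclude_zero=True; devices are CPU here purely for illustration.
a = make_tensor((4, 4), device="cpu", dtype=torch.int8, exclude_zero=True)
b = make_tensor((), device="cpu", dtype=torch.int8, exclude_zero=True)

assert (a != 0).all()
assert b != 0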

@jjsjann123 marked this pull request as ready for review on October 8, 2024, 19:20
@jjsjann123 requested a review from mruberry as a code owner on October 8, 2024, 19:20
@jjsjann123 (Collaborator, Author) commented:

@t-vi This PR is ready for review now. 🙇

cc'ing @tfogal @kevinstephano

@jjsjann123 mentioned this pull request on Oct 8, 2024
@tfogal (Collaborator) left a comment:

I'd recommend that TomV (or Mike? both?) take a look at this before it goes in, but my naive view is that this seems good.

@@ -1116,7 +1116,7 @@ def forward(self, x):
     ("cpu", "cuda"),
 )
 def test_cache_symbolic_values_reshape(device):
-    if not torch.cuda.is_available():
+    if device == "cuda" and not torch.cuda.is_available():
A collaborator commented:

Do we get anything out of running this test on multiple devices?

I am wondering if it makes more sense to just not parameterize the test and run it once on the CPU.

@jjsjann123 (Collaborator, Author) replied:

That's a good call!
At this point, since nvFuser isn't taking shape operations at all, the GPU test doesn't do anything. Let me clean it up.
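
For context, a minimal, self-contained sketch contrasting the two options discussed here (not the real Thunder test; the bodies are placeholders):

import pytest
import torch


@pytest.mark.parametrize("device", ("cpu", "cuda"))
def test_reshape_parametrized(device):
    # Option 1 (the diff above): only skip the CUDA variant when CUDA is missing.
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("CUDA not available")
    x = torch.arange(12, device=device).reshape(3, 4)  # placeholder test body
    assert x.shape == (3, 4)


def test_reshape_cpu_only():
    # Option 2 (the suggestion): drop the parametrization and run once on CPU,
    # since nvFuser isn't consuming shape operations yet.
    x = torch.arange(12).reshape(3, 4)  # placeholder test body
    assert x.shape == (3, 4)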

A collaborator replied:

@jjsjann123 This is still pending?

@t-vi (Collaborator) left a comment:

Looks great, thank you @jjsjann123 @tfogal

@t-vi enabled auto-merge (squash) on October 11, 2024, 12:12
@t-vi merged commit ec50c73 into main on Oct 18, 2024
41 checks passed
@t-vi deleted the patching_TensorProxyShape_in_trace branch on October 18, 2024, 18:09