Patching tensor proxy shape in trace #1260

Merged: 21 commits into main on Oct 18, 2024
Conversation

@jjsjann123 (Collaborator) commented on Oct 4, 2024

Working on #1253

  1. Updating _numel's deferred computation so that we can handle calls without an argument. (This is existing technical debt; see "Make numel a method of TensorProxy not a property attribute" #925 for more context.)
    This change fixes the error reported in "symbolic values: error in infer_tensor_properties, through reshape or numel" #1257, but let's keep that issue open since there will likely be more errors coming from that repro script.

  2. Updating TensorProxy.shape so that shape queries within a trace context are recorded with prims.shape (see the sketch below).
    This doesn't contradict "Modeling of shape queries" #1133: the explicit trace ensures correctness of the trace, and I'll follow up with a transformation to optimize shape queries.
    There are occasions where we don't want to leave a prims.shape in the trace while querying the shape of a TensorProxy, e.g. when printing a trace. I'm working around those cases by explicitly accessing TensorProxy._shape.

  3. Adding prims.shape support to the Python executor, since prims.shape can show up in the prologue trace, where the torch executor isn't available.

  4. Updating the DCE pass to choose the first producer instead of the last one when multiple producers exist for a single proxy.

TODO:
Need to update the DCE pass to remove duplicated producers in a given scope. I intend to handle that in a separate PR. See the comment for details.
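
For illustration, a minimal, self-contained sketch of the idea in item 2. The names below (TensorProxySketch, FakeTrace, _active_trace) are hypothetical stand-ins, not Thunder's actual implementation:

# Sketch only: `shape` records a symbolic query when a trace is active,
# while `_shape` bypasses recording (e.g. when printing a trace).
_active_trace = None  # stand-in for Thunder's tracing context


class FakeTrace:
    def __init__(self):
        self.symbols = []

    def record(self, name, args):
        self.symbols.append((name, args))


class TensorProxySketch:
    def __init__(self, shape):
        self._shape = tuple(shape)  # raw metadata; reading it records nothing

    @property
    def shape(self):
        if _active_trace is not None:
            # Recording the query keeps the trace correct under symbolic shapes.
            _active_trace.record("prims.shape", (self,))
        return self._shape


# Usage: inside a "trace", shape queries leave a prims.shape record behind,
# while direct access to _shape does not.
_active_trace = FakeTrace()
t = TensorProxySketch((4, 4))
assert t.shape == (4, 4) and len(_active_trace.symbols) == 1
assert t._shape == (4, 4) and len(_active_trace.symbols) == 1  # _shape adds nothing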

@jjsjann123 (Collaborator, Author) commented:

Note for myself: the failing cases look real, since I don't see them in other PRs. But strangely, I'm not seeing the repro in my local non-CUDA environment... 😕 even though the failing test runs on CPU as well.
e.g. thunder/tests/test_jit_general.py::test_cache_symbolic_values_reshape[cpu]

Let me try to grab a node with an actual GPU on it.

@jjsjann123 (Collaborator, Author) commented:

Errr, I don't know what's going on with that failing test, but CI on main is green, so I must have done something here...

@jjsjann123 (Collaborator, Author) commented on Oct 7, 2024

I still don't see what's wrong with this failing test, and I can't repro it locally: https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=216738&view=logs&j=2840892e-91ab-5245-da62-77ec9923516a&t=444f4171-6797-5730-4229-41427ed3bdc9&l=15058

Checking my luck again with CI.

This one was really strange; I even pulled the CI container but still got no repro...
OK, I double-checked, and it looks like the PyTorch version in my pulled container has been bumped. Maybe I'll get lucky with CI this time.

@jjsjann123 (Collaborator, Author) commented:

🤯 Nope. CI failed.

@jjsjann123 (Collaborator, Author) commented:

OK, looks like there is indeed a real issue here:

t_0 = tensor([[ 8,  0,  8,  6],
        [-2,  4,  6, -4],
        [-5, -8,  8,  9],
        [-1, -5,  5, -8]], device='cuda:0', dtype=torch.int8)
t_1 = tensor(-1, dtype=torch.int8)

torch.pow(t_0, t_1)

Strangely this doesn't repro locally with that test. That's a bit concerning.

@jjsjann123 (Collaborator, Author) commented:

Now that I know where the failure is coming from, I'll try to get a repro and verify whether it's actually caused by this PR (likely, maybe?!). But is it really an issue introduced by this PR?!

I'm not sure why others didn't hit this one, though.

@jjsjann123 (Collaborator, Author) commented:

Note for myself:

FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_pow_torch_cuda_thunder.dtypes.int8 - RuntimeError: "reciprocal_cuda" not implemented for 'Char'
Something very strange is going on here; see #1260 (comment).

The OpInfo test for pow runs through some random inputs; specifically, I think it's this one:

# Tests the inputs are a CPU scalar tensor and a CUDA tensor
a = make_tensor((4, 4), device=device, dtype=dtype, requires_grad=requires_grad, **kwargs)
b = make_tensor((), device="cpu", dtype=dtype, requires_grad=requires_grad, **kwargs)

So for the int8 dtype, the PyTorch implementation cannot handle b == torch.tensor(-1): https://github.com/pytorch/pytorch/blob/fe44b6a67f32b562c88701b630e65b62ce1b63ba/aten/src/ATen/native/cuda/PowKernel.cu#L178

Somehow this is reliably failing on CI for me in my PR, but not on my local runs.

Anyway, I'm trying to avoid this by passing exclude_zero=True 🤞
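
For reference, a rough sketch of the flag mentioned above (the exact change to the OpInfo sample generator may look different): exclude_zero=True asks torch.testing.make_tensor to replace any sampled zeros with a small non-zero value.

import torch
from torch.testing import make_tensor

# Sketch: same kind of inputs as the OpInfo sample above, but with
# exclude_zero=True; devices are CPU here purely for illustration.
a = make_tensor((4, 4), device="cpu", dtype=torch.int8, exclude_zero=True)
b = make_tensor((), device="cpu", dtype=torch.int8, exclude_zero=True)

assert (a != 0).all()
assert b != 0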

@jjsjann123 marked this pull request as ready for review on October 8, 2024, 19:20
@jjsjann123 requested a review from mruberry as a code owner on October 8, 2024, 19:20
@jjsjann123 (Collaborator, Author) commented:

@t-vi This PR is ready for review now. 🙇

cc'ing @tfogal @kevinstephano

@jjsjann123 mentioned this pull request on Oct 8, 2024
@tfogal (Collaborator) left a comment:

I'd recommend that TomV (or Mike? both?) take a look at this before it goes in, but my naive view is that this seems good.

@@ -1116,7 +1116,7 @@ def forward(self, x):
     ("cpu", "cuda"),
 )
 def test_cache_symbolic_values_reshape(device):
-    if not torch.cuda.is_available():
+    if device == "cuda" and not torch.cuda.is_available():
A collaborator commented:

Do we get anything out of running this test on multiple devices?

I am wondering if it makes more sense to just not parameterize the test and run it once on the CPU.

@jjsjann123 (Collaborator, Author) replied:

That's a good call!
At this point, since nvFuser isn't taking shape operations at all, the GPU test doesn't do anything. Let me clean it up.
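
For context, a minimal, self-contained sketch contrasting the two options discussed here (not the real Thunder test; the bodies are placeholders):

import pytest
import torch


@pytest.mark.parametrize("device", ("cpu", "cuda"))
def test_reshape_parametrized(device):
    # Option 1 (the diff above): only skip the CUDA variant when CUDA is missing.
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("CUDA not available")
    x = torch.arange(12, device=device).reshape(3, 4)  # placeholder test body
    assert x.shape == (3, 4)


def test_reshape_cpu_only():
    # Option 2 (the suggestion): drop the parametrization and run once on CPU,
    # since nvFuser isn't consuming shape operations yet.
    x = torch.arange(12).reshape(3, 4)  # placeholder test body
    assert x.shape == (3, 4)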

A collaborator replied:

@jjsjann123 This is still pending?

@t-vi (Collaborator) left a comment:

Looks great, thank you @jjsjann123 @tfogal

@t-vi enabled auto-merge (squash) on October 11, 2024, 12:12
@t-vi merged commit ec50c73 into main on Oct 18, 2024
41 checks passed
@t-vi deleted the patching_TensorProxyShape_in_trace branch on October 18, 2024, 18:09