BFS traversal could not visit some nodes while fusing take_along_axis #3718

Closed
riccardofelluga opened this issue Jan 16, 2025 · 5 comments · Fixed by #3845

@riccardofelluga
Contributor

While investigating #1552 in Thunder, I encountered the following error:

RuntimeError:  INTERNAL ASSERT FAILED at "/workspace/Fuser/csrc/bfs.h":241, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. BFS traversal could not visit some nodes:  idg{60} (from:  idg{96 104 110 116 122 128 134 146 152 158 164 177} idg{95 97 103 109 115 121 127 133 145 151 157 163 176 178} idg{92 98 100 106 112 118 124 130 142 148 154 160 173 179} idg{94 102 108 114 120 126 132 144 150 156 162 175} idg{62 66 68 70 74}), visited: ( idg{17 59 61 65 67 69 71 73 75 77 90 171} idg{91 93 99 101 105 107 111 113 117 119 123 125 129 131 141 143 147 149 153 155 159 161 172 174} idg{62 66 68 70 74} idg{94 102 108 114 120 126 132 144 150 156 162 175} idg{92 98 100 106 112 118 124 130 142 148 154 160 173 179} idg{95 97 103 109 115 121 127 133 145 151 157 163 176 178} idg{96 104 110 116 122 128 134 146 152 158 164 177})
Exception raised from traverse at /workspace/Fuser/csrc/bfs.h:241 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x177 (0x7fb967b73d01 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x52 (0x7fb967f83542 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x6620e3 (0x7fb9680620e3 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x6643e5 (0x7fb9680643e5 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x64f126 (0x7fb96804f126 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x6515b2 (0x7fb9680515b2 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x6566f3 (0x7fb9680566f3 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x65747c (0x7fb96805747c in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #8: <unknown function> + 0x69d393 (0x7fb96809d393 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x4998e7 (0x7fb967e998e7 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x4ab6bb (0x7fb967eab6bb in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x4aad6f (0x7fb967eaad6f in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #12: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #13: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #14: <unknown function> + 0x4aad6f (0x7fb967eaad6f in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #15: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #16: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #17: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #18: <unknown function> + 0x498deb (0x7fb967e98deb in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #19: <unknown function> + 0x45dc2f (0x7fb967e5dc2f in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #20: nvfuser::GpuLower::run() + 0x23f (0x7fb967e56d8f in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #21: nvfuser::KernelExecutor::compile(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&, nvfuser::LaunchParams const&, nvfuser::CompileParams, nvfuser::SchedulerType) + 0x5da (0x7fb9682a8c0a in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #22: <unknown function> + 0x8b1640 (0x7fb9682b1640 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #23: <unknown function> + 0x8e8c7e (0x7fb9682e8c7e in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #24: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x716 (0x7fb9682eeba6 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #25: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0x1db (0x7fb9682de78b in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #26: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, std::optional<signed char>, bool, bool, bool, std::vector<std::string, std::allocator<std::string> >, std::vector<std::string, std::allocator<std::string> >) const + 0xb6c (0x7fb9684afb8c in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #27: <unknown function> + 0x217cfd (0x7fb967c17cfd in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #28: <unknown function> + 0x2c8933 (0x7fb967cc8933 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #29: <unknown function> + 0x1ff234 (0x7fb967bff234 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #30: <unknown function> + 0x238415 (0x5561ccaf2415 in /workspace/lightning-thunder/.thunder/bin/python)
frame #31: _PyObject_MakeTpCall + 0x142 (0x5561ccacd402 in /workspace/lightning-thunder/.thunder/bin/python)
frame #32: <unknown function> + 0x186180 (0x5561cca40180 in /workspace/lightning-thunder/.thunder/bin/python)
frame #33: PyEval_EvalCode + 0xaf (0x5561ccbaff8f in /workspace/lightning-thunder/.thunder/bin/python)
frame #34: <unknown function> + 0x2f4d50 (0x5561ccbaed50 in /workspace/lightning-thunder/.thunder/bin/python)
frame #35: <unknown function> + 0x18754b (0x5561cca4154b in /workspace/lightning-thunder/.thunder/bin/python)
frame #36: <unknown function> + 0x219eef (0x5561ccad3eef in /workspace/lightning-thunder/.thunder/bin/python)
frame #37: <unknown function> + 0x2b2ed4 (0x5561ccb6ced4 in /workspace/lightning-thunder/.thunder/bin/python)
frame #38: <unknown function> + 0x2186ea (0x5561ccad26ea in /workspace/lightning-thunder/.thunder/bin/python)
frame #39: PyObject_Vectorcall + 0x3b (0x5561ccacdb3b in /workspace/lightning-thunder/.thunder/bin/python)
frame #40: <unknown function> + 0x186180 (0x5561cca40180 in /workspace/lightning-thunder/.thunder/bin/python)
frame #41: <unknown function> + 0x2153a8 (0x5561ccacf3a8 in /workspace/lightning-thunder/.thunder/bin/python)
frame #42: PyObject_Call + 0x149 (0x5561ccaceb59 in /workspace/lightning-thunder/.thunder/bin/python)
frame #43: <unknown function> + 0x186a23 (0x5561cca40a23 in /workspace/lightning-thunder/.thunder/bin/python)
frame #44: PyEval_EvalCode + 0xaf (0x5561ccbaff8f in /workspace/lightning-thunder/.thunder/bin/python)
frame #45: <unknown function> + 0x318c69 (0x5561ccbd2c69 in /workspace/lightning-thunder/.thunder/bin/python)
frame #46: <unknown function> + 0x318be0 (0x5561ccbd2be0 in /workspace/lightning-thunder/.thunder/bin/python)
frame #47: <unknown function> + 0x319239 (0x5561ccbd3239 in /workspace/lightning-thunder/.thunder/bin/python)
frame #48: _PyRun_SimpleFileObject + 0x1c3 (0x5561ccbd2fa3 in /workspace/lightning-thunder/.thunder/bin/python)
frame #49: _PyRun_AnyFileObject + 0x55 (0x5561ccbd2dc5 in /workspace/lightning-thunder/.thunder/bin/python)
frame #50: Py_RunMain + 0x3aa (0x5561ccbdc4fa in /workspace/lightning-thunder/.thunder/bin/python)
frame #51: Py_BytesMain + 0x42 (0x5561ccbdbfb2 in /workspace/lightning-thunder/.thunder/bin/python)
frame #52: <unknown function> + 0x23a90 (0x7fba82023a90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #53: __libc_start_main + 0x89 (0x7fba82023b49 in /lib/x86_64-linux-gnu/libc.so.6)
frame #54: _start + 0x25 (0x5561ccb48725 in /workspace/lightning-thunder/.thunder/bin/python)

The error can be reproduced with the following script:

# CUDA devices:
#  0: NVIDIA RTX 6000 Ada Generation
#  1: NVIDIA RTX 6000 Ada Generation
# torch version: 2.5.1+cu124
# cuda version: 12.4
# nvfuser version: 0.2.24+gitac84633
import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[1, 4096, 152064], contiguity=[None, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[2, 1, 0])
    T1 = fd.define_tensor(shape=[1, 4096], contiguity=[None, True], dtype=DataType.Int, is_cpu=False, stride_order=[1, 0])
    T2 = fd.ops.cast(T0, dtype=DataType.Float)
    T15 = fd.ops.slice(T2, start_indices=[0, 0, 0], end_indices=[1, 4095, 152064], strides=[1, 1, 1], manual_normalization=0)
    T16 = fd.ops.stride_order(T15, stride_order=[2, 1, 0])
    T26 = fd.ops.slice(T1, start_indices=[0, 1], end_indices=[1, 4096], strides=[1, 1], manual_normalization=0)
    T27 = fd.ops.stride_order(T26, stride_order=[1, 0])
    T31 = fd.ops.reshape(T16, new_shape=[4095, 152064])
    T34 = fd.ops.reshape(T27, new_shape=[4095])
    T35 = fd.ops.max(T31, dims=[1], keepdim=False, dtype=DataType.Null)
    T39 = fd.ops.broadcast_in_dim(T35, shape=[4095, 1], broadcast_dims=[0])
    T40 = fd.ops.abs(T39)
    S41 = fd.define_scalar(float("inf"), dtype=DataType.Double)
    T42 = fd.ops.eq(T40, S41)
    S43 = fd.define_scalar(0.00000, dtype=DataType.Double)
    T44 = fd.ops.where(T42, S43, T39)
    T48 = fd.ops.broadcast_in_dim(T44, shape=[4095, 152064], broadcast_dims=[0, 1])
    T49 = fd.ops.sub(T31, T48)
    T50 = fd.ops.exp(T49)
    T51 = fd.ops.sum(T50, dims=[1], keepdim=False, dtype=DataType.Null)
    T55 = fd.ops.broadcast_in_dim(T51, shape=[4095, 1], broadcast_dims=[0])
    T56 = fd.ops.log(T55)
    T57 = fd.ops.add(T56, T44)
    T61 = fd.ops.broadcast_in_dim(T57, shape=[4095, 152064], broadcast_dims=[0, 1])
    T62 = fd.ops.sub(T31, T61)
    T63 = fd.ops.neg(T62)
    T67 = fd.ops.broadcast_in_dim(T34, shape=[4095, 1], broadcast_dims=[0])
    T68 = fd.ops.take_along_axis(T63, T67, dim=1)
    S69 = fd.define_scalar(-100, dtype=DataType.Int)
    T70 = fd.ops.ne(T67, S69)
    S71 = fd.define_scalar(0.00000, dtype=DataType.Double)
    T72 = fd.ops.where(T70, T68, S71)
    T73 = fd.ops.sum(T72, dims=[0, 1], keepdim=False, dtype=DataType.Null)
    T74 = fd.ops.cast(T70, dtype=DataType.Int)
    T75 = fd.ops.sum(T74, dims=[0, 1], keepdim=False, dtype=DataType.Null)
    T76 = fd.ops.cast(T75, dtype=DataType.Float)
    T77 = fd.ops.reciprocal(T76)
    T78 = fd.ops.mul(T73, T77)
    fd.add_output(T78)

with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)

inputs = [
    torch.testing.make_tensor((1, 4096, 152064), dtype=torch.bfloat16, device='cuda:0'),
    torch.testing.make_tensor((1, 4096), dtype=torch.int64, device='cuda:0'),
]
fd.execute(inputs)
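
Reading the op sequence, the fusion appears to compute a shifted cross-entropy style loss: a log-softmax over the last dimension, a gather of the negated log-probability at each label via take_along_axis, a mask for labels equal to -100, and a mean over the unmasked rows. The pure-Python sketch below is only an interpretation of the repro and is not part of nvFuser or Thunder. The function names are hypothetical. Unlike the fusion, it skips masked rows instead of gathering and then zeroing them, and it assumes at least one unmasked row.

```python
import math

IGNORE_INDEX = -100  # matches the S69 scalar in the repro


def masked_nll_mean(logits, labels):
    """Mean negative log-likelihood over rows whose label != IGNORE_INDEX.

    logits: list of rows (lists of floats); labels: one int per row.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # corresponds to where(T70, T68, 0) contributing zero
        row_max = max(row)
        if math.isinf(row_max):
            row_max = 0.0  # corresponds to where(T42, 0, T39)
        # log-sum-exp, stabilized by subtracting the row maximum (T49..T57)
        log_sum_exp = math.log(sum(math.exp(x - row_max) for x in row)) + row_max
        # gather the negated log-probability at the label (T62, T63, T68)
        total += -(row[label] - log_sum_exp)
        count += 1
    return total / count  # T73 * reciprocal(T75)


def shifted_loss(logits, labels):
    """Drop the last logits row and the first label, as the two slices do."""
    return masked_nll_mean(logits[:-1], labels[1:])
```

For example, `shifted_loss([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]], [0, 1, -100])` uses only the first row with label 1 and evaluates to `math.log(2)`.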

P.S. The repro seems to work on H100, but only with the environment variable CUDA_LAUNCH_BLOCKING=1 set; without it there is an IMA (illegal memory access).
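
For anyone trying to reproduce the H100 behavior: CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable that makes kernel launches synchronous, and it generally has to be set before CUDA is initialized. A minimal sketch of one way to do that, assuming the script above has been saved as `repro.py` (a hypothetical file name) and using hypothetical helper names:

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_repro(script="repro.py"):
    """Run the repro in a child process with synchronous kernel launches.

    `repro.py` is a hypothetical file name for the script in this issue.
    Using a child process ensures the variable is in place before CUDA
    is initialized.
    """
    completed = subprocess.run([sys.executable, script], env=blocking_env())
    return completed.returncode
```

Calling `run_repro()` returns the exit code of the child process. Running the same script without the variable set gives the other side of the comparison.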

@kevinstephano
Collaborator

@protonu could you check if this is the same error as found in #3702?

@naoyam
Collaborator

naoyam commented Feb 6, 2025

This one fails because takeAlongAxis and slice are used together in the same fusion. Here's the failing segment:

Inputs:
  T1_g_int64_t[bS3{1}, iS4{4096}]
  T27_g_float[iS59{4095}, iS60{152064}]
Outputs:
  T40_g_float[]

%kernel_math {
T5_l_int64_t[bS15{1}, iS17{4095}rf]
   = slice( T1_g_int64_t[bS3{1}, iS4{4096}], { {0, 1, 1} {1, 4096, 1} } )
T41_g_int64_t[iS77{4095}]
   = squeeze( T5_l_int64_t[bS15{1}, iS17{4095}rf], flags = {true, false} )
T28_g_int64_t[iS61{4095}, bS62{1}]
   = broadcast( T41_g_int64_t[iS77{4095}], flags = {false, true} )
T31_g_bool[iS67{4095}, bS68{1}]
   = T28_g_int64_t[iS61{4095}, bS62{1}]
   != -100;
T30_g_float[iS65{4095}, bS66{1}]
   = takeAlongAxis( T27_g_float[iS59{4095}, iS60{152064}], T28_g_int64_t[iS61{4095}, bS62{1}], dim = 1 )
T32_l_float[iS69{4095}, bS70{1}]
   = where(T31_g_bool[iS67{4095}, bS68{1}]
  , T30_g_float[iS65{4095}, bS66{1}]
  , double(0));
T33_l_float[iS71{4095}]
   = squeeze( T32_l_float[iS69{4095}, bS70{1}], flags = {false, true} )
T34_g_float[rS72{4095}]
   = reduction( T33_l_float[iS71{4095}], op = add, initial value = float(0), allreduce = false )
T35_l_int64_t[iS73{4095}, bS74{1}]
   = (int64_t)(T31_g_bool[iS67{4095}, bS68{1}]);
T36_l_int64_t[iS75{4095}]
   = squeeze( T35_l_int64_t[iS73{4095}, bS74{1}], flags = {false, true} )
T37_l_int64_t[rS76{4095}]
   = reduction( T36_l_int64_t[iS75{4095}], op = add, initial value = 0, allreduce = false )
T38_g_float[]
   = (float)(T37_l_int64_t[rS76{4095}]);
T39_g_float[]
   = reciprocal(T38_g_float[]);
T40_g_float[]
   = T34_g_float[rS72{4095}]
   * T39_g_float[];
} // %kernel_math

When a fusion has slice/pad, we automatically switch to the IdModel-based indexer, but that indexer doesn't yet support ops like takeAlongAxis. Supporting them is something we need to work on this quarter anyway, but for now a quick fix is to disallow fusing those ops together. I'll create a patch.
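
To illustrate the kind of check described here, a toy Python sketch follows. It is not nvFuser's actual code (the real check lives in the C++ code base and is not shown in this thread). The op-name sets and the `can_fuse` helper are hypothetical and limited to the ops named in this comment.

```python
# Hypothetical op-name sets, limited to the ops named in this thread.
RESIZE_OPS = {"slice", "pad"}                 # trigger the IdModel-based indexer
UNSUPPORTED_WITH_RESIZE = {"takeAlongAxis"}   # not yet supported by that indexer


def can_fuse(op_names):
    """Return False if a candidate segment mixes resize ops with takeAlongAxis."""
    ops = set(op_names)
    has_resize = bool(ops & RESIZE_OPS)
    has_unsupported = bool(ops & UNSUPPORTED_WITH_RESIZE)
    return not (has_resize and has_unsupported)


# Op sequence of the failing segment shown above.
failing_segment = [
    "slice", "squeeze", "broadcast", "ne", "takeAlongAxis", "where",
    "squeeze", "reduction", "cast", "squeeze", "reduction", "cast",
    "reciprocal", "mul",
]
print(can_fuse(failing_segment))             # False
print(can_fuse(["takeAlongAxis", "where"]))  # True
```

A set-based check like this rejects the combination regardless of op order, which matches the stated goal of simply not fusing the two kinds of ops together.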

@riccardofelluga
Contributor Author

@naoyam Out of curiosity, does this mean that with the patch, the region will be split into two fusions?

@naoyam
Collaborator

naoyam commented Feb 7, 2025

Unfortunately, yes, at the moment. This patch only makes it run. Getting good performance needs more work, which is part of our Q1 plans.
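
As a rough picture of what the split looks like, the toy sketch below greedily partitions a linear op list so that no segment contains both a resize op and takeAlongAxis. It is an illustration under simplifying assumptions, not nvFuser's segmenter: real segmentation works on a graph rather than a flat list, and the names `RESIZE_OPS`, `GATHER_OPS`, and `split_segments` are hypothetical.

```python
RESIZE_OPS = {"slice", "pad"}
GATHER_OPS = {"takeAlongAxis"}


def split_segments(op_names):
    """Greedily cut the op list whenever adding an op would mix the two kinds."""
    segments, current = [], []
    for op in op_names:
        candidate = current + [op]
        kinds = set(candidate)
        if kinds & RESIZE_OPS and kinds & GATHER_OPS:
            # Adding this op would create a forbidden mix: close the segment.
            segments.append(current)
            current = [op]
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


ops = ["slice", "squeeze", "broadcast", "ne", "takeAlongAxis", "where", "reduction"]
for segment in split_segments(ops):
    print(segment)
# ['slice', 'squeeze', 'broadcast', 'ne']
# ['takeAlongAxis', 'where', 'reduction']
```

The cost of the extra segment is that intermediate tensors crossing the cut have to go through global memory, which is why the split is described as a correctness fix rather than a performance one.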

@kevinstephano
Collaborator

take_along_axis is currently disabled in Thunder's nvFuser executor, so Thunder should not be exposed to this issue.

https://github.com/Lightning-AI/lightning-thunder/blob/8163863787a5e2b20834f4751ba00b968c7b18dd/thunder/executors/nvfuserex_impl.py#L1345-L1354

# TAKE_ALONG_AXIS is currently disabled
# There was an nvFuser bug that prevented this which is now fixed; we should
# investigate re-enabling take_along_axis.
# # TODO Check that the nvFuser version is >= 0.0.10 when this operator was added
# def take_along_axis(a: TensorProxy, /, index: TensorProxy, dim: int, *, fd: FusionDefinition, lc_to_nv_map: dict) -> Any:
#     nv_a = getnv(a, fd, lc_to_nv_map)
#     nv_index = getnv(index, fd, lc_to_nv_map)

#     return fd.ops.take_along_axis(nv_a, nv_index, dim)
# register_supported(PrimIDs.TAKE_ALONG_AXIS, take_along_axis, _take_check)

@naoyam closed this as completed in b6e1530 on Feb 7, 2025