BFS traversal could not visit some nodes while fusing `take_along_axis` #3718

riccardofelluga · 2025-01-16T12:03:03Z

While investigating #1552 in Thunder, I encountered the following error:

RuntimeError:  INTERNAL ASSERT FAILED at "/workspace/Fuser/csrc/bfs.h":241, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. BFS traversal could not visit some nodes:  idg{60} (from:  idg{96 104 110 116 122 128 134 146 152 158 164 177} idg{95 97 103 109 115 121 127 133 145 151 157 163 176 178} idg{92 98 100 106 112 118 124 130 142 148 154 160 173 179} idg{94 102 108 114 120 126 132 144 150 156 162 175} idg{62 66 68 70 74}), visited: ( idg{17 59 61 65 67 69 71 73 75 77 90 171} idg{91 93 99 101 105 107 111 113 117 119 123 125 129 131 141 143 147 149 153 155 159 161 172 174} idg{62 66 68 70 74} idg{94 102 108 114 120 126 132 144 150 156 162 175} idg{92 98 100 106 112 118 124 130 142 148 154 160 173 179} idg{95 97 103 109 115 121 127 133 145 151 157 163 176 178} idg{96 104 110 116 122 128 134 146 152 158 164 177})
Exception raised from traverse at /workspace/Fuser/csrc/bfs.h:241 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x177 (0x7fb967b73d01 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x52 (0x7fb967f83542 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #2: <unknown function> + 0x6620e3 (0x7fb9680620e3 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x6643e5 (0x7fb9680643e5 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x64f126 (0x7fb96804f126 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x6515b2 (0x7fb9680515b2 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x6566f3 (0x7fb9680566f3 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #7: <unknown function> + 0x65747c (0x7fb96805747c in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #8: <unknown function> + 0x69d393 (0x7fb96809d393 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x4998e7 (0x7fb967e998e7 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #10: <unknown function> + 0x4ab6bb (0x7fb967eab6bb in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #11: <unknown function> + 0x4aad6f (0x7fb967eaad6f in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #12: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #13: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #14: <unknown function> + 0x4aad6f (0x7fb967eaad6f in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #15: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #16: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #17: <unknown function> + 0x49ae77 (0x7fb967e9ae77 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #18: <unknown function> + 0x498deb (0x7fb967e98deb in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #19: <unknown function> + 0x45dc2f (0x7fb967e5dc2f in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #20: nvfuser::GpuLower::run() + 0x23f (0x7fb967e56d8f in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #21: nvfuser::KernelExecutor::compile(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&, nvfuser::LaunchParams const&, nvfuser::CompileParams, nvfuser::SchedulerType) + 0x5da (0x7fb9682a8c0a in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #22: <unknown function> + 0x8b1640 (0x7fb9682b1640 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #23: <unknown function> + 0x8e8c7e (0x7fb9682e8c7e in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #24: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x716 (0x7fb9682eeba6 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #25: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0x1db (0x7fb9682de78b in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #26: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, std::optional<signed char>, bool, bool, bool, std::vector<std::string, std::allocator<std::string> >, std::vector<std::string, std::allocator<std::string> >) const + 0xb6c (0x7fb9684afb8c in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #27: <unknown function> + 0x217cfd (0x7fb967c17cfd in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #28: <unknown function> + 0x2c8933 (0x7fb967cc8933 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #29: <unknown function> + 0x1ff234 (0x7fb967bff234 in /workspace/Fuser/nvfuser/_C.cpython-312-x86_64-linux-gnu.so)
frame #30: <unknown function> + 0x238415 (0x5561ccaf2415 in /workspace/lightning-thunder/.thunder/bin/python)
frame #31: _PyObject_MakeTpCall + 0x142 (0x5561ccacd402 in /workspace/lightning-thunder/.thunder/bin/python)
frame #32: <unknown function> + 0x186180 (0x5561cca40180 in /workspace/lightning-thunder/.thunder/bin/python)
frame #33: PyEval_EvalCode + 0xaf (0x5561ccbaff8f in /workspace/lightning-thunder/.thunder/bin/python)
frame #34: <unknown function> + 0x2f4d50 (0x5561ccbaed50 in /workspace/lightning-thunder/.thunder/bin/python)
frame #35: <unknown function> + 0x18754b (0x5561cca4154b in /workspace/lightning-thunder/.thunder/bin/python)
frame #36: <unknown function> + 0x219eef (0x5561ccad3eef in /workspace/lightning-thunder/.thunder/bin/python)
frame #37: <unknown function> + 0x2b2ed4 (0x5561ccb6ced4 in /workspace/lightning-thunder/.thunder/bin/python)
frame #38: <unknown function> + 0x2186ea (0x5561ccad26ea in /workspace/lightning-thunder/.thunder/bin/python)
frame #39: PyObject_Vectorcall + 0x3b (0x5561ccacdb3b in /workspace/lightning-thunder/.thunder/bin/python)
frame #40: <unknown function> + 0x186180 (0x5561cca40180 in /workspace/lightning-thunder/.thunder/bin/python)
frame #41: <unknown function> + 0x2153a8 (0x5561ccacf3a8 in /workspace/lightning-thunder/.thunder/bin/python)
frame #42: PyObject_Call + 0x149 (0x5561ccaceb59 in /workspace/lightning-thunder/.thunder/bin/python)
frame #43: <unknown function> + 0x186a23 (0x5561cca40a23 in /workspace/lightning-thunder/.thunder/bin/python)
frame #44: PyEval_EvalCode + 0xaf (0x5561ccbaff8f in /workspace/lightning-thunder/.thunder/bin/python)
frame #45: <unknown function> + 0x318c69 (0x5561ccbd2c69 in /workspace/lightning-thunder/.thunder/bin/python)
frame #46: <unknown function> + 0x318be0 (0x5561ccbd2be0 in /workspace/lightning-thunder/.thunder/bin/python)
frame #47: <unknown function> + 0x319239 (0x5561ccbd3239 in /workspace/lightning-thunder/.thunder/bin/python)
frame #48: _PyRun_SimpleFileObject + 0x1c3 (0x5561ccbd2fa3 in /workspace/lightning-thunder/.thunder/bin/python)
frame #49: _PyRun_AnyFileObject + 0x55 (0x5561ccbd2dc5 in /workspace/lightning-thunder/.thunder/bin/python)
frame #50: Py_RunMain + 0x3aa (0x5561ccbdc4fa in /workspace/lightning-thunder/.thunder/bin/python)
frame #51: Py_BytesMain + 0x42 (0x5561ccbdbfb2 in /workspace/lightning-thunder/.thunder/bin/python)
frame #52: <unknown function> + 0x23a90 (0x7fba82023a90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #53: __libc_start_main + 0x89 (0x7fba82023b49 in /lib/x86_64-linux-gnu/libc.so.6)
frame #54: _start + 0x25 (0x5561ccb48725 in /workspace/lightning-thunder/.thunder/bin/python)

with repro:

# CUDA devices:
#  0: NVIDIA RTX 6000 Ada Generation
#  1: NVIDIA RTX 6000 Ada Generation
# torch version: 2.5.1+cu124
# cuda version: 12.4
# nvfuser version: 0.2.24+gitac84633
import torch
from nvfuser import FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[1, 4096, 152064], contiguity=[None, True, True], dtype=DataType.BFloat16, is_cpu=False, stride_order=[2, 1, 0])
    T1 = fd.define_tensor(shape=[1, 4096], contiguity=[None, True], dtype=DataType.Int, is_cpu=False, stride_order=[1, 0])
    T2 = fd.ops.cast(T0, dtype=DataType.Float)
    T15 = fd.ops.slice(T2, start_indices=[0, 0, 0], end_indices=[1, 4095, 152064], strides=[1, 1, 1], manual_normalization=0)
    T16 = fd.ops.stride_order(T15, stride_order=[2, 1, 0])
    T26 = fd.ops.slice(T1, start_indices=[0, 1], end_indices=[1, 4096], strides=[1, 1], manual_normalization=0)
    T27 = fd.ops.stride_order(T26, stride_order=[1, 0])
    T31 = fd.ops.reshape(T16, new_shape=[4095, 152064])
    T34 = fd.ops.reshape(T27, new_shape=[4095])
    T35 = fd.ops.max(T31, dims=[1], keepdim=False, dtype=DataType.Null)
    T39 = fd.ops.broadcast_in_dim(T35, shape=[4095, 1], broadcast_dims=[0])
    T40 = fd.ops.abs(T39)
    S41 = fd.define_scalar(float("inf"), dtype=DataType.Double)
    T42 = fd.ops.eq(T40, S41)
    S43 = fd.define_scalar(0.00000, dtype=DataType.Double)
    T44 = fd.ops.where(T42, S43, T39)
    T48 = fd.ops.broadcast_in_dim(T44, shape=[4095, 152064], broadcast_dims=[0, 1])
    T49 = fd.ops.sub(T31, T48)
    T50 = fd.ops.exp(T49)
    T51 = fd.ops.sum(T50, dims=[1], keepdim=False, dtype=DataType.Null)
    T55 = fd.ops.broadcast_in_dim(T51, shape=[4095, 1], broadcast_dims=[0])
    T56 = fd.ops.log(T55)
    T57 = fd.ops.add(T56, T44)
    T61 = fd.ops.broadcast_in_dim(T57, shape=[4095, 152064], broadcast_dims=[0, 1])
    T62 = fd.ops.sub(T31, T61)
    T63 = fd.ops.neg(T62)
    T67 = fd.ops.broadcast_in_dim(T34, shape=[4095, 1], broadcast_dims=[0])
    T68 = fd.ops.take_along_axis(T63, T67, dim=1)
    S69 = fd.define_scalar(-100, dtype=DataType.Int)
    T70 = fd.ops.ne(T67, S69)
    S71 = fd.define_scalar(0.00000, dtype=DataType.Double)
    T72 = fd.ops.where(T70, T68, S71)
    T73 = fd.ops.sum(T72, dims=[0, 1], keepdim=False, dtype=DataType.Null)
    T74 = fd.ops.cast(T70, dtype=DataType.Int)
    T75 = fd.ops.sum(T74, dims=[0, 1], keepdim=False, dtype=DataType.Null)
    T76 = fd.ops.cast(T75, dtype=DataType.Float)
    T77 = fd.ops.reciprocal(T76)
    T78 = fd.ops.mul(T73, T77)
    fd.add_output(T78)

with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)

inputs = [
    torch.testing.make_tensor((1, 4096, 152064), dtype=torch.bfloat16, device='cuda:0'),
    torch.testing.make_tensor((1, 4096), dtype=torch.int64, device='cuda:0'),
]
fd.execute(inputs)

P.S. The repro seems to work on H100, but only with the flag CUDA_LAUNCH_BLOCKING=1; without it there is an illegal memory access (IMA).
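For readers who want to follow what the fusion in the repro computes, here is a minimal pure-Python sketch. It is a reading of the repro's ops, not code from Thunder or nvFuser: the logits and labels are shifted by one position (the two `slice` ops), a log-softmax is taken per row, the label entry is gathered (`take_along_axis`), and the result is averaged over labels that are not -100. The function name `shifted_cross_entropy` and the tiny inputs are made up for illustration. Unlike the fusion, which gathers for every row and masks afterwards, the sketch skips ignored rows before indexing.

```python
import math

IGNORE_INDEX = -100  # same constant as S69 in the repro


def shifted_cross_entropy(logits, labels):
    """Hypothetical reference: logits is a list of rows, labels a list of ints."""
    rows = logits[:-1]    # like T15: drop the last position
    targets = labels[1:]  # like T26: drop the first label
    total, count = 0.0, 0
    for row, target in zip(rows, targets):
        if target == IGNORE_INDEX:
            continue  # the fusion masks these rows with ne/where (T70, T72)
        row_max = max(row)
        if math.isinf(row_max):
            row_max = 0.0  # mirrors T40..T44: replace an infinite max with 0
        # log-sum-exp of the row, as in T49..T57
        lse = math.log(sum(math.exp(x - row_max) for x in row)) + row_max
        total += -(row[target] - lse)  # T62, T63, then the gather in T68
        count += 1
    # Raises ZeroDivisionError if every label is ignored; the fusion would
    # instead multiply 0 by reciprocal(0).
    return total / count


logits = [[0.0, 0.0], [1.0, 2.0], [5.0, 5.0]]
labels = [9, 0, IGNORE_INDEX]
loss = shifted_cross_entropy(logits, labels)
print(loss)
```

Only the first row contributes here (its shifted label is 0), and the second row is ignored because its shifted label is -100.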

kevinstephano · 2025-02-05T17:45:30Z

@protonu could you check if this is the same error as found in #3702?

naoyam · 2025-02-06T21:58:30Z

This one is because takeAlongAxis and slice are used together. Here's the failing segment:

Inputs:
  T1_g_int64_t[bS3{1}, iS4{4096}]
  T27_g_float[iS59{4095}, iS60{152064}]
Outputs:
  T40_g_float[]

%kernel_math {
T5_l_int64_t[bS15{1}, iS17{4095}rf]
   = slice( T1_g_int64_t[bS3{1}, iS4{4096}], { {0, 1, 1} {1, 4096, 1} } )
T41_g_int64_t[iS77{4095}]
   = squeeze( T5_l_int64_t[bS15{1}, iS17{4095}rf], flags = {true, false} )
T28_g_int64_t[iS61{4095}, bS62{1}]
   = broadcast( T41_g_int64_t[iS77{4095}], flags = {false, true} )
T31_g_bool[iS67{4095}, bS68{1}]
   = T28_g_int64_t[iS61{4095}, bS62{1}]
   != -100;
T30_g_float[iS65{4095}, bS66{1}]
   = takeAlongAxis( T27_g_float[iS59{4095}, iS60{152064}], T28_g_int64_t[iS61{4095}, bS62{1}], dim = 1 )
T32_l_float[iS69{4095}, bS70{1}]
   = where(T31_g_bool[iS67{4095}, bS68{1}]
  , T30_g_float[iS65{4095}, bS66{1}]
  , double(0));
T33_l_float[iS71{4095}]
   = squeeze( T32_l_float[iS69{4095}, bS70{1}], flags = {false, true} )
T34_g_float[rS72{4095}]
   = reduction( T33_l_float[iS71{4095}], op = add, initial value = float(0), allreduce = false )
T35_l_int64_t[iS73{4095}, bS74{1}]
   = (int64_t)(T31_g_bool[iS67{4095}, bS68{1}]);
T36_l_int64_t[iS75{4095}]
   = squeeze( T35_l_int64_t[iS73{4095}, bS74{1}], flags = {false, true} )
T37_l_int64_t[rS76{4095}]
   = reduction( T36_l_int64_t[iS75{4095}], op = add, initial value = 0, allreduce = false )
T38_g_float[]
   = (float)(T37_l_int64_t[rS76{4095}]);
T39_g_float[]
   = reciprocal(T38_g_float[]);
T40_g_float[]
   = T34_g_float[rS72{4095}]
   * T39_g_float[];
} // %kernel_math

When a fusion has slice/pad, we automatically switch to the IdModel-based indexer, but that indexer doesn't yet support ops like takeAlongAxis. We would need to work on that this quarter anyway, but for now a quick fix is to disallow fusing those ops together. I'll create a patch.
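As a rough illustration of the quick fix described here, the rule can be pictured as a predicate over the ops in a candidate segment. This is a toy sketch, not nvFuser's implementation: the names `RESIZE_OPS`, `INDEX_OPS`, and `can_fuse` are hypothetical, and the op sets contain only the ops named in this thread.

```python
# Toy model of "do not fuse resize-based ops with index ops".
# The op names are taken from this thread; the sets are illustrative only.
RESIZE_OPS = {"slice", "pad"}
INDEX_OPS = {"takeAlongAxis"}


def can_fuse(op_names):
    """Return False when a segment mixes a resize-based op with an index op."""
    ops = set(op_names)
    has_resize = bool(ops & RESIZE_OPS)
    has_index = bool(ops & INDEX_OPS)
    return not (has_resize and has_index)


# Op sequence of the failing segment shown above, by name only.
failing_segment = [
    "slice", "squeeze", "broadcast", "ne", "takeAlongAxis", "where",
    "squeeze", "reduction", "cast", "squeeze", "reduction", "cast",
    "reciprocal", "mul",
]
print(can_fuse(failing_segment))       # False
print(can_fuse(["slice", "squeeze"]))  # True
```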

riccardofelluga · 2025-02-07T08:39:04Z

@naoyam Out of curiosity, does this mean that with the patch, the region will be split in two fusions?

naoyam · 2025-02-07T16:11:08Z

Unfortunately, yes, at the moment. This patch only makes it run. More work is needed for performance, which is part of our Q1 plans.
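To picture the split, here is a toy greedy pass over a linear list of op names that starts a new segment whenever adding the next op would put a resize-based op and an index op in the same segment. It is only a sketch under simplifying assumptions: the real segmenter works on the fusion graph rather than a flat list, and `split_segments` and the two op sets are hypothetical names.

```python
RESIZE_OPS = {"slice", "pad"}     # illustrative, from this thread
INDEX_OPS = {"takeAlongAxis"}     # illustrative, from this thread


def split_segments(op_names):
    """Greedily cut the op list so no segment mixes resize and index ops."""
    segments, current = [], []
    has_resize = has_index = False
    for op in op_names:
        is_resize = op in RESIZE_OPS
        is_index = op in INDEX_OPS
        # Cut before an op that would conflict with what the segment holds.
        if (is_resize and has_index) or (is_index and has_resize):
            segments.append(current)
            current, has_resize, has_index = [], False, False
        current.append(op)
        has_resize = has_resize or is_resize
        has_index = has_index or is_index
    if current:
        segments.append(current)
    return segments


ops = ["slice", "squeeze", "broadcast", "ne", "takeAlongAxis", "where", "reduction"]
segments = split_segments(ops)
for segment in segments:
    print(segment)
```

With this op list the pass yields two segments, cut just before takeAlongAxis.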

kevinstephano · 2025-02-07T17:50:32Z

take_along_axis is currently disabled in Thunder's nvFuser executor, so Thunder should not be sending it to nvFuser.

https://github.com/Lightning-AI/lightning-thunder/blob/8163863787a5e2b20834f4751ba00b968c7b18dd/thunder/executors/nvfuserex_impl.py#L1345-L1354

# TAKE_ALONG_AXIS is currently disabled
# There was an nvFuser bug that prevented this which is now fixed; we should
# investigate re-enabling take_along_axis.
# # TODO Check that the nvFuser version is >= 0.0.10 when this operator was added
# def take_along_axis(a: TensorProxy, /, index: TensorProxy, dim: int, *, fd: FusionDefinition, lc_to_nv_map: dict) -> Any:
#     nv_a = getnv(a, fd, lc_to_nv_map)
#     nv_index = getnv(index, fd, lc_to_nv_map)

#     return fd.ops.take_along_axis(nv_a, nv_index, dim)
# register_supported(PrimIDs.TAKE_ALONG_AXIS, take_along_axis, _take_check)

riccardofelluga mentioned this issue Jan 17, 2025

nvFuser using more memory than inductor for HF CausalLMLoss Lightning-AI/lightning-thunder#1654

Closed

kevinstephano assigned kevinstephano and protonu and unassigned kevinstephano Feb 5, 2025

kevinstephano added the Thunder label Feb 5, 2025

naoyam added a commit that referenced this issue Feb 6, 2025

Fixes #3718

28b04e2

naoyam mentioned this issue Feb 6, 2025

Do not fuse resize-based ops and index ops (yet) #3845

Merged

naoyam closed this as completed in b6e1530 Feb 7, 2025
