
Investigate Memory and Performance difference using nvfuser vs torch.compile executor on Qwen2 #1552

Open · kshitij12345 opened this issue Dec 13, 2024 · 2 comments

Labels: high priority · memory use · nemo (Issues needed to support NVIDIA NeMo models) · performance · thunderfx (for things that could be applicable to the dynamo+thunder frontend)

Comments

@kshitij12345 (Collaborator) commented Dec 13, 2024

On internal image pjnl-20241213 and on H100 -

With ("sdpa", "torchcompile_cat", "nvfuser") -

# Memory - 49690.704896
# <torch.utils.benchmark.utils.common.Measurement object at 0x7ef10052ffe0>
# run_forward_backward()
#   187.69 ms
#   1 measurement, 10 runs , 1 thread

With ("sdpa", "torchcompile") -

# Memory - 40340.889088
# <torch.utils.benchmark.utils.common.Measurement object at 0x7fc319acd6d0>
# run_forward_backward()
#   153.62 ms
#   1 measurement, 10 runs , 1 thread

We should investigate what is causing the difference in memory usage and performance. Repro script:

import torch
import torch.utils.benchmark
from thunder.dynamo import ThunderCompiler
from transformers import AutoConfig, AutoModelForCausalLM
import thunder

model_id = "Qwen/Qwen2.5-7B-Instruct"

configuration = AutoConfig.from_pretrained(
    model_id,
    # num_hidden_layers=2,
)
# Shrink the hidden size to num_attention_heads (28), presumably to reduce model size;
# other config values (vocab size, intermediate size, etc.) are left unchanged.
configuration.hidden_size = configuration.num_attention_heads
with torch.device("cuda"):
    model = AutoModelForCausalLM.from_config(configuration).to(torch.bfloat16)

# Toggle between the two executor configurations being compared.
# executors = ("sdpa", "torchcompile_cat", "nvfuser")
executors = ("sdpa", "torchcompile")
backend = ThunderCompiler(executors=executors)
compiled_model = torch.compile(model, backend=backend)

input_ids = torch.randint(0, configuration.vocab_size, (1, 4096), device="cuda")

def run_forward_backward():
    compiled_output = compiled_model(input_ids=input_ids, labels=input_ids)
    compiled_output.loss.backward()


# Warm up (compilation happens on the first iteration).
for _ in range(5):
    run_forward_backward()

# Peak memory in MB.
print(torch.cuda.max_memory_allocated() / 1e6)

timer = torch.utils.benchmark.Timer(
    "run_forward_backward()",
    globals={"run_forward_backward": run_forward_backward},
)
measurement = timer.timeit(number=10)
print(measurement)

# With Nvfuser executor
# Memory - 49690.704896
# <torch.utils.benchmark.utils.common.Measurement object at 0x7ef10052ffe0>
# run_forward_backward()
#   187.69 ms
#   1 measurement, 10 runs , 1 thread

# With torch.compile executor
# Memory - 40340.889088
# <torch.utils.benchmark.utils.common.Measurement object at 0x7fc319acd6d0>
# run_forward_backward()
#   153.62 ms
#   1 measurement, 10 runs , 1 thread

cc @apaz-cli @tfogal

@tfogal added the thunderfx, high priority, and nemo labels on Dec 13, 2024
@IvanYashchuk (Collaborator) commented:
There could be a conflict between the "torchcompile_cat" and "nvfuser" executors, creating more fusions and materializing more intermediates than necessary. Inspecting the execution traces could reveal whether that is happening. If so, a potential solution could be to expand the scope of the "torchcompile_cat" executor so that it pulls more operations into its fusion regions.
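
For reference, a minimal sketch of how the execution traces could be inspected for the repro above. It assumes the ThunderCompiler backend exposes its compiled submodules via backend.subgraph_infos and thunder_compiled_fns; the attribute names may differ between lightning-thunder versions, so treat this as a starting point rather than a verified recipe.

# Sketch: dump Thunder's final forward/backward execution traces for the repro above.
# Assumption: `backend` is the ThunderCompiler instance from the repro script and it
# keeps its compiled submodules in `backend.subgraph_infos[*].thunder_compiled_fns`
# (names may vary by lightning-thunder version).
import thunder

run_forward_backward()  # ensure the model has been compiled and executed at least once

for subgraph_info in backend.subgraph_infos:
    for thunder_fn in subgraph_info.thunder_compiled_fns:
        print(thunder.last_traces(thunder_fn)[-1])           # final forward execution trace
        print(thunder.last_backward_traces(thunder_fn)[-1])  # final backward execution trace

Comparing the backward traces produced with ("sdpa", "torchcompile_cat", "nvfuser") against those produced with ("sdpa", "torchcompile") should show which extra intermediates are being materialized.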

@riccardofelluga (Collaborator) commented Dec 18, 2024

From initial triage, it looks like the nvFuser fusion pass in Thunder is picking up a number of transpose and reshape ops from around the trace and fusing them at the start of the backward pass, together with the actual computation needed for torch_nll_loss_backward_impl (as can be seen in nvFusion0 below):

[... omitted ...]
value_states_2, value_states_5, = C0
clear_mutable_collection(C0)
del C0
[t1148, t1190, t1202, t1275, t1284, t1286, t1307, t1309, t1501, t1578, t1590, t1663, t1672, t1674, t1695, t1697, t1897] = nvFusion0(hidden_states_24, mul_19, hidden_states_19, attn_output_6, value_states_5, t651, t627, t620, hidden_states_13, mul_10, hidden_states_8, attn_output_2, value_states_2, t451, t429, t425, hidden_states_2)
  # t1148 = prims.reshape(hidden_states_24, (4096, 28))  # t1148: "cuda:0 bf16[4096, 28]"
  # t1190 = prims.reshape(mul_19, (4096, 18944))  # t1190: "cuda:0 bf16[4096, 18944]"
  # t1202 = prims.reshape(hidden_states_19, (4096, 28))  # t1202: "cuda:0 bf16[4096, 28]"
  # t1275 = prims.reshape(attn_output_6, (4096, 28))  # t1275: "cuda:0 bf16[4096, 28]"
  # t1284 = prims.transpose(value_states_5, (0, 1, 3, 2))  # t1284: "cuda:0 bf16[1, 28, 1, 4096]"
  # t1286 = prims.transpose(t651, (0, 1, 3, 2))  # t1286: "cuda:0 bf16[1, 28, 4096, 4096]"
  # t1307 = prims.transpose(t627, (0, 1, 3, 2))  # t1307: "cuda:0 bf16[1, 28, 4096, 2]"
  # t1309 = prims.transpose(t620, (0, 1, 3, 2))  # t1309: "cuda:0 bf16[1, 28, 2, 4096]"
  # t1501 = prims.reshape(hidden_states_13, (4096, 28))  # t1501: "cuda:0 bf16[4096, 28]"
  # t1578 = prims.reshape(mul_10, (4096, 18944))  # t1578: "cuda:0 bf16[4096, 18944]"
  # t1590 = prims.reshape(hidden_states_8, (4096, 28))  # t1590: "cuda:0 bf16[4096, 28]"
  # t1663 = prims.reshape(attn_output_2, (4096, 28))  # t1663: "cuda:0 bf16[4096, 28]"
  # t1672 = prims.transpose(value_states_2, (0, 1, 3, 2))  # t1672: "cuda:0 bf16[1, 28, 1, 4096]"
  # t1674 = prims.transpose(t451, (0, 1, 3, 2))  # t1674: "cuda:0 bf16[1, 28, 4096, 4096]"
  # t1695 = prims.transpose(t429, (0, 1, 3, 2))  # t1695: "cuda:0 bf16[1, 28, 4096, 2]"
  # t1697 = prims.transpose(t425, (0, 1, 3, 2))  # t1697: "cuda:0 bf16[1, 28, 2, 4096]"
  # t1897 = prims.reshape(hidden_states_2, (4096, 28))  # t1897: "cuda:0 bf16[4096, 28]"
del hidden_states_24, mul_19, hidden_states_19, attn_output_6, value_states_5, t627, t620, hidden_states_13, mul_10, hidden_states_8, attn_output_2, value_states_2, t429, t425, hidden_states_2
t1129 = torch_nll_loss_backward_impl(t365, t744, shift_labels_1, None, 'mean', -100, t752)  # t1129: "cuda:0 f32[4095, 152064]"
del t365, shift_labels_1, t752
[t1143, t1147] = nvFusion1(t744, t1129, t366)
[... omitted ...]

This is problematic because the start of the backward pass is already where peak memory usage occurs, even without nvFusion0.
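
To confirm where the peak occurs, one can reset the peak-memory counter between the forward and backward passes of the repro; this only uses standard torch.cuda APIs:

# Sketch: attribute peak memory to the forward vs. backward pass of the repro above.
# Note that the backward peak still includes the activations saved during the forward
# pass (they remain alive), so this only shows when the peak happens, not its cause.
torch.cuda.reset_peak_memory_stats()
compiled_output = compiled_model(input_ids=input_ids, labels=input_ids)
fwd_peak = torch.cuda.max_memory_allocated()

torch.cuda.reset_peak_memory_stats()
compiled_output.loss.backward()
bwd_peak = torch.cuda.max_memory_allocated()

print(f"forward peak:  {fwd_peak / 1e6:.1f} MB")
print(f"backward peak: {bwd_peak / 1e6:.1f} MB")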

From what I can see, the work here is twofold: we can configure nvFuser not to pick up transpose/reshape ops, and then work on the scheduling of computation so that tensors are produced closer to their consumers (this idea is the main thread in at least #1337, #1560 and #1562). A sketch of the first part is below.
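
A possible first experiment, assuming the nvFuser executor still exposes the nv_enable_bookend compile option (which, as I understand it, pushes shape-only ops such as transpose/reshape out to the fusion boundaries), and assuming ThunderCompiler forwards extra keyword arguments to thunder.jit. The option name, default, and exact semantics may have changed, so this is a sketch rather than a verified fix:

# Sketch (assumption-heavy): ask the nvFuser executor to keep shape-only ops out of
# its fusion regions via the `nv_enable_bookend` compile option. If this option is no
# longer available or has different semantics in the current lightning-thunder, the
# relevant knob lives in the nvFuser executor and may be named differently.
from thunder.dynamo import ThunderCompiler

backend = ThunderCompiler(
    executors=("sdpa", "torchcompile_cat", "nvfuser"),
    nv_enable_bookend=True,  # assumption: True excludes transpose/reshape at fusion edges
)
compiled_model = torch.compile(model, backend=backend)

Comparing peak memory and the shape of nvFusion0 with and without this option should tell us how much of the regression comes from the picked-up transpose/reshape ops.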
