We need to determine whether Thunder has real accuracy problems when computing HF's Qwen 2 model.
The test added in #1406 might fail because the loss computed by the Thunder-generated function differs slightly from the loss computed by HF's implementation. Here's a snippet to reproduce the problem:
```python
import torch
from thunder.dynamo import ThunderCompiler
from transformers import Qwen2Config, Qwen2ForCausalLM

torch.manual_seed(0)

# https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json
configuration = Qwen2Config(
    # Qwen2.5-7B-Instruct uses Grouped-Query Attention, while the default
    # config uses Multi-Head Attention
    num_attention_heads=28,
    num_key_value_heads=4,
    # Scaled down for testing
    hidden_size=56,
    vocab_size=2,
    max_position_embeddings=32,
)
configuration.num_hidden_layers = 1

with torch.device("cuda"):
    model = Qwen2ForCausalLM(configuration).to(torch.bfloat16)

# thunder.jit doesn't work with Qwen2, so we use torch.compile
# https://github.com/Lightning-AI/lightning-thunder/issues/1405
backend = ThunderCompiler()
compiled_model = torch.compile(model, backend=backend, fullgraph=True)

input_ids = torch.randint(0, configuration.vocab_size, (1, configuration.max_position_embeddings), device="cuda")

# input_ids = torch.ones_like(input_ids) * 0
ref_output = model(input_ids=input_ids, labels=input_ids)
ref_loss = ref_output.loss

compiled_output = compiled_model(input_ids=input_ids, labels=input_ids)
compiled_loss = compiled_output.loss

torch.testing.assert_close(compiled_loss, ref_loss)
```
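For context on how tight the comparison is: `torch.testing.assert_close` picks its default tolerances from the input dtype, and per the PyTorch docs the defaults for `bfloat16` are `rtol=1.6e-2` and `atol=1e-5`, so the last line above is equivalent to:

```python
# Default tolerances assert_close applies to bfloat16 inputs
torch.testing.assert_close(compiled_loss, ref_loss, rtol=1.6e-2, atol=1e-5)
```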
Thunder may return a different result because where values are upcast and downcast around bf16 differs between the two implementations. However, we need to verify that Thunder's result is indeed the more accurate one by comparing each loss's distance to an fp64 reference; the tolerances in the test may then need to be tweaked.
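A minimal sketch of that comparison, reusing `model`, `input_ids`, `ref_loss`, and `compiled_loss` from the snippet above (the fp64 copy and the printed distances are illustrative, not part of the actual test):

```python
import copy

# Run the same forward pass in fp64 to get a high-precision reference loss.
# deepcopy keeps the bf16 weights, which .to(torch.float64) then upcasts,
# so only the computation precision changes, not the parameters.
fp64_model = copy.deepcopy(model).to(torch.float64)
fp64_loss = fp64_model(input_ids=input_ids, labels=input_ids).loss

# Whichever bf16 loss lands closer to the fp64 reference is the more accurate one.
print("|HF bf16 - fp64|:     ", (ref_loss.double() - fp64_loss).abs().item())
print("|Thunder bf16 - fp64|:", (compiled_loss.double() - fp64_loss).abs().item())
```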
Quick update: Dynamo no longer seems able to capture a full graph, so the `fullgraph` flag needs to be set to `False`. With that change, the same error appears when using Inductor as the backend, with numerical differences on a similar scale to `ThunderCompiler`'s.
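Concretely, that means changing one line of the repro (everything else is unchanged):

```python
# Allow graph breaks, since Dynamo can no longer capture Qwen2 as a single graph
compiled_model = torch.compile(model, backend=backend, fullgraph=False)
```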
The graph break was introduced by the new loss-function selection logic in HF Transformers, and it will (hopefully) be fixed by huggingface/transformers#34616.
cc @apaz-cli