Add a test for Mistral-NeMo. #1340
Conversation
We already have https://github.com/Lightning-AI/lightning-thunder/blob/79e59d0c5c5f8aa8ef80eb31f3fe918466d64c1c/thunder/tests/test_networks.py; I think it would be nicer to add this to that file instead of creating a new one.
Thanks, good idea, will do.
The latest version moves the code where it belongs, in test_networks.py, as Masaki pointed out. I also rebuilt my container, and that made it clear that there are a couple of difficulties with this. As such I made this a skipped test for now.
@t-vi this is ready for review. The CI failure was a node timing out running the tests, even though all this does is add a single, skipped test; it seems like there's more going on there. Do you want me to add empty commits until this happens to pass, or can you override that?
Thank you @tfogal. I don't think we need tiny_shakespeare to be downloaded; we can get away with something simpler or random (just like with other examples), and this way we don't even have to get the tokenizer. Do we actually need credentials for the configs, or is it the tokenizer checkpoint that requires them?
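As an illustration of that suggestion, here is a minimal sketch of feeding the model random token ids instead of tokenizer output; the vocab size and sequence length below are placeholders, not values taken from this PR:

import torch

# Hypothetical stand-in for tokenizer output: random token ids in range for the
# model's vocabulary, plus an all-ones attention mask.
vocab_size, batch_size, seq_len = 32000, 1, 32  # placeholder sizes
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
attention_mask = torch.ones_like(input_ids)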
There should be a way to avoid using any credentials since the tokenizer is unnecessary and we don't load any weights for the config.
But is downloading anything required here? The configuration is defined directly with MistralConfig. Usually, we check Thunder's ability to run a network with a sample random input and then invoke backward. A mock training loop is unnecessary here and can be avoided. Here's a patch that would verify that Thunder successfully runs the model without any dataset, optimizers, etc.:
diff --git a/thunder/tests/test_networks.py b/thunder/tests/test_networks.py
index 57a9a759..9bf0cb57 100644
--- a/thunder/tests/test_networks.py
+++ b/thunder/tests/test_networks.py
@@ -362,7 +362,7 @@ def test_quantization():
@thunder.tests.framework.requiresCUDA
-@pytest.mark.skip(reason="Dependencies, trust issues")
+# @pytest.mark.skip(reason="Dependencies, trust issues")
def test_thunderfx_mistral_nemo_small():
"""
Runs a small version of Mistral-NeMo
@@ -370,17 +370,9 @@ def test_thunderfx_mistral_nemo_small():
This is largely based on code from Alexandros Koumparoulis.
"""
import transformers
- import datasets
model_id = "mistralai/Mistral-Nemo-Base-2407"
- tokenizer = transformers.AutoTokenizer.from_pretrained(
- model_id,
- torch_dtype=torch.bfloat16,
- ignore_mismatched_sizes=True,
- trust_remote_code=False,
- )
-
# Setup a "small" version of NeMo-Mistral that does not require downloading
# weights. This is not a configuration that is worth benchmarking.
# This was created by using
@@ -389,7 +381,7 @@ def test_thunderfx_mistral_nemo_small():
# transformers.AutoConfig.from_pretrained(model_id)
# until they lined up.
config = transformers.models.mistral.configuration_mistral.MistralConfig(
- num_hidden_layers=2,
+ num_hidden_layers=1,
torch_dtype=torch.bfloat16,
max_position_embeddings=1024,
architectures=["MistralForCausalLM"],
@@ -404,53 +396,18 @@ def test_thunderfx_mistral_nemo_small():
model = transformers.AutoModelForCausalLM.from_config(config)
device = torch.device("cuda")
model.to(device)
- mdl = torch.compile(model, backend=thunder.dynamo.ThunderCompiler())
+ backend = thunder.dynamo.ThunderCompiler()
+ mdl = torch.compile(model, backend=backend)
del model
- # Add a padding token to the tokenizer
- if tokenizer.pad_token is None:
- tokenizer.add_special_tokens({"pad_token": "[PAD]"})
- mdl.resize_token_embeddings(len(tokenizer))
-
- dataset = datasets.load_dataset("tiny_shakespeare", split="train", trust_remote_code=True)
-
- def tokenize_function(examples):
- return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=2)
+ batch_size = 1
+ input_ids = torch.randint(0, config.vocab_size, (batch_size, config.max_position_embeddings), device=device)
+ attention_mask = torch.ones_like(input_ids)
- tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
-
- # Convert the dataset to PyTorch format and specify columns to return as tensors
- tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
-
- dataloader = torch.utils.data.DataLoader(tokenized_dataset, batch_size=1, shuffle=True)
-
- # Define optimizer and learning rate scheduler
- optimizer = torch.optim.AdamW(mdl.parameters(), lr=5e-5)
- num_epochs = 3
- lr_scheduler = transformers.get_scheduler(
- "linear",
- optimizer=optimizer,
- num_warmup_steps=0,
- num_training_steps=num_epochs * len(dataloader),
- )
+ output = mdl(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
+ logits = output.logits
+ grad_logits = torch.randn_like(logits)
+ logits.backward(grad_logits)
- mdl.train()
- for epoch in range(num_epochs):
- total_loss = 0
- for batch in dataloader:
- # Move input tensors to device
- input_ids = batch["input_ids"].to(device)
- attention_mask = batch["attention_mask"].to(device)
-
- # Forward pass
- outputs = mdl(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
- loss = outputs.loss
- total_loss += loss.item()
-
- # Backward pass
- optimizer.zero_grad()
- loss.backward()
- optimizer.step()
-
- # Update learning rate
- lr_scheduler.step()
+ # Check that Thunder has actually compiled the model
+    assert backend.subgraph_infos, "No subgraphs found"
Thanks Luca, Ivan. I've applied Ivan's patch plus some minor other changes, and indeed it appears not to download anything new. Unfortunately, in the interim something seems to have tickled things so that #1240 is now a blocking issue, so I am leaving the skip in place for now.
#1240 has been fixed just now, so we can probably remove the skip.
Hrm, my merge of main seems to have not gone well. Meetings now, but I will fix after...
See issue #1285. Thanks: Alexandros Koumparoulis, Ivan Yashchuk, and Masaki Kozuki for various fixes/guidance, and Kshiteej Kalambarkar for fixing #1240.
Force-pushed from 0d438dd to d77bc05.
Hi, sorry for the weirdness. I couldn't figure out why the GitHub diff was wild even though the change is pretty tiny (thanks to Ivan's patch), so hopefully it's not too painful to review from scratch again.
The CI failure is real. However, it works fine with nvFuser.
The slimmed-down version looks good.
Thank you @tfogal @IvanYashchuk @lantiga @crcrpar @kshitij12345
# until they lined up sans the hidden and embeddings changes, above.
config = transformers.models.mistral.configuration_mistral.MistralConfig(
    num_hidden_layers=1,
    torch_dtype=torch.bfloat16,
This setting is ignored by the model instantiation. It can be checked by inspecting, for example, model.model.layers[0].mlp.gate_proj.weight.dtype.
@riccardofelluga, when you look into what Thunder executes for this and other HF models and what is missing for performance, please update this test to use bfloat16 weights.
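A small sketch of the check described above, plus one way to actually end up with bfloat16 weights; the cast-after-construction approach and the tiny config sizes are assumptions for illustration, not necessarily how the test will be updated:

import torch
import transformers

# Tiny illustrative config; the sizes are placeholders, not the test's values.
config = transformers.models.mistral.configuration_mistral.MistralConfig(
    num_hidden_layers=1,
    hidden_size=256,
    intermediate_size=512,
    num_attention_heads=8,
    num_key_value_heads=8,
    torch_dtype=torch.bfloat16,
)
model = transformers.AutoModelForCausalLM.from_config(config)
# The torch_dtype field in the config does not change the instantiated weights:
print(model.model.layers[0].mlp.gate_proj.weight.dtype)  # expected: torch.float32

# One way to actually get bfloat16 weights is an explicit cast after construction.
model = model.to(torch.bfloat16)
assert model.model.layers[0].mlp.gate_proj.weight.dtype == torch.bfloat16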
See issue #1285.
What does this PR do?
Adds a test case for #1285.
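For completeness, a sketch of how just this test could be run once it lands; the invocation is an assumption (it also needs a CUDA device because of the requiresCUDA marker):

import pytest

# Equivalent to: pytest thunder/tests/test_networks.py -k test_thunderfx_mistral_nemo_small
pytest.main(["thunder/tests/test_networks.py", "-k", "test_thunderfx_mistral_nemo_small", "-v"])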