A potential **WRONG design and bug** in the preprocessing! #15

Open

Linnore opened this issue Apr 10, 2024 · 1 comment

Linnore commented Apr 10, 2024

The current workflow of eval_fewshot.py is:

  1. Generate "source", which contains the example QAs and the question we ask.
  2. Concatenate "source" and "target", where "target" is the option we want to compare the LLM's output against.
  3. Label ONLY the input_ids of the "target" part.
  4. Compute the loss from the "target" input_ids (the labels) and outputs.logits (the LLM's output).

The above steps are done by preprocess().
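
For context, _tokenize_fn is defined elsewhere in eval_fewshot.py; preprocess() only relies on it returning each string's token ids and their lengths. A minimal sketch of that assumed interface (not necessarily the repository's actual implementation):

from typing import Dict, Sequence
import transformers

def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
    """Tokenize each string separately and record its token count."""
    input_ids = [tokenizer(s, return_tensors="pt").input_ids[0] for s in strings]
    input_ids_lens = [ids.numel() for ids in input_ids]  # no padding, so length == token count
    return dict(input_ids=input_ids, input_ids_lens=input_ids_lens)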

It seems that a key step is missing before the encodings are fed to the model's forward pass: setting the attention_mask entries corresponding to "target" to 0.

The current design leaves attention_mask as None. For a CausalLM, passing attention_mask as None is equivalent to setting attention_mask to 1 at all positions. This means the input_ids of "target" can be seen! Therefore, when doing inference, the LLM's output will always be similar or identical to the "target", especially when the candidate answers are provided, which explains why the performance of prompting v1.0 for multiple-choice selection is even worse than the free-answering prompting v2.0.
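
A quick way to confirm the None-equals-all-ones behavior in transformers (a sketch; gpt2 is just a stand-in checkpoint, not the model used in the repository):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tok("Gold answer:", return_tensors="pt").input_ids
with torch.no_grad():
    logits_none = model(input_ids=ids).logits
    logits_ones = model(input_ids=ids, attention_mask=torch.ones_like(ids)).logits
print(torch.allclose(logits_none, logits_ones))  # expected: True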

The original preprocess function is:

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX  # loss is only computed on the target tokens
    # NOTE: no attention_mask is returned, so the forward pass defaults to all-ones
    return dict(input_ids=torch.stack(input_ids).to(device), labels=torch.stack(labels).to(device))

After adding an attention_mask that masks out the target part:

def preprocess(
    sources: Sequence[str],
    targets: Sequence[str],
    tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
    """Preprocess the data by tokenizing."""
    examples = [s + t for s, t in zip(sources, targets)]
    examples_tokenized, sources_tokenized = [_tokenize_fn(strings, tokenizer) for strings in (examples, sources)]
    input_ids = examples_tokenized["input_ids"]
    labels = copy.deepcopy(input_ids)
    masks = []
    for label, source_len in zip(labels, sources_tokenized["input_ids_lens"]):
        label[:source_len] = IGNORE_INDEX  # loss is only computed on the target tokens
        mask = torch.ones_like(label)
        mask[source_len:] = 0              # hide the target tokens from attention
        masks.append(mask)
    return dict(input_ids=torch.stack(input_ids).to(device), labels=torch.stack(labels).to(device), attention_mask=torch.stack(masks).to(device))
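
For a concrete picture of what the loop builds, here is a toy illustration with made-up values (a sketch only):

import torch

label = torch.tensor([5, 6, 7, 8, 9])  # token ids of source + target
source_len = 3                          # the first 3 tokens are the source
mask = torch.ones_like(label)
mask[source_len:] = 0                   # zero out the target positions
print(mask)  # tensor([1, 1, 1, 0, 0]) -- the target can no longer be attended to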

I use the following code to print the LLM's prediction, inside the for loop of eval_fewshot.py -> main():

with torch.no_grad():
    # task 6
    outputs = model(**encoding)
    log_likelihood = "Write your codes here"

    # output the prediction
    # ("answer" and "source_len" are assumed to be defined earlier in the loop)
    label = problems[i]["label"]
    answerKey = problems[i]["answerKey"]

    print("-------------------------------------")
    print("True Answer")
    print(answerKey)
    if answerKey == label:
        print(answer)
    print("-------------------------------------")
    print("Target")
    shift_logits = outputs.logits[..., :-1, :].contiguous()
    shift_labels = encoding["labels"][..., 1:].contiguous()
    shift_logits = shift_logits.view(-1, model.config.vocab_size)
    shift_labels = shift_labels.view(-1)
    print(tokenizer.decode(shift_labels[source_len - 1:]))
    print(shift_labels[source_len - 1:])
    print("-------------------------------------")
    print("LLM's Prediction")
    prediction = shift_logits.argmax(dim=1)[source_len - 1:]
    print(prediction)
    print(tokenizer.decode(prediction))
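
For reference, one way the task-6 placeholder could be filled in (a sketch that plugs into the loop above, assuming IGNORE_INDEX == -100 so that cross_entropy skips the source positions):

import torch.nn.functional as F

# shift so that logits at position t-1 score the token at position t
shift_logits = outputs.logits[..., :-1, :].contiguous()
shift_labels = encoding["labels"][..., 1:].contiguous()
nll = F.cross_entropy(
    shift_logits.view(-1, model.config.vocab_size),
    shift_labels.view(-1),
    ignore_index=-100,   # source positions were set to IGNORE_INDEX
    reduction="sum",
)
log_likelihood = -nll    # higher = the model prefers this candidate target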

Before

Before I mask out the attention for the target, the LLM's output always aligns with the given target. The following are the first four problems (the same question but with 4 different targets) of ARC_challenge_validation.jsonl.

Note: for the same prompt with different targets, the expected output should be the same! But due to the mentioned bug in the provided code base, the output aligns with the given targets!

prompt #0: Question: A student compared the speeds at which a large and a small marble rolled down an inclined plane. In order to make the findings more reliable, the student should
Candidate answers: (A) release the marbles at different heights. (B) repeat the experiment several times. (C) tilt the plane at different angles. (D) use two marbles that are the same size.
Gold answer: repeat the experiment several times.

Question: Which is most likely needed when describing the change in position of an object?
Candidate answers: (A) initial speed (B) direction change (C) reference point (D) constant rate
Gold answer: reference point

Question: Juan and LaKeisha roll a few objects down a ramp. They want to see which object rolls the farthest. What should they do so they can repeat their investigation?
Candidate answers: (A) Put the objects in groups. (B) Change the height of the ramp. (C) Choose different objects to roll. (D) Record the details of the investigation.
Gold answer:
-------------------------------------
True Answer
D
-------------------------------------
Target
 Put the objects in groups.
-------------------------------------
LLM's Prediction
 ( the objects in groups.
-------------------------------------
True Answer
D
-------------------------------------
Target
 Change the height of the ramp.
-------------------------------------
LLM's Prediction
 ( the height of the ramp.
-------------------------------------
True Answer
D
-------------------------------------
Target
 Choose different objects to roll.
-------------------------------------
LLM's Prediction
 ( different objects to roll.
-------------------------------------
True Answer
D
Record the details of the investigation.
-------------------------------------
Target
 Record the details of the investigation.
-------------------------------------
LLM's Prediction
 ( the details of the investigation.

After

After masking out the target attention, the LLM is now truly answering the question, even though it says nonsense (since the max length of the prediction is also fixed by the target's length!):

prompt #0: Question: A student compared the speeds at which a large and a small marble rolled down an inclined plane. In order to make the findings more reliable, the student should
Candidate answers: (A) release the marbles at different heights. (B) repeat the experiment several times. (C) tilt the plane at different angles. (D) use two marbles that are the same size.
Gold answer: repeat the experiment several times.

Question: Which is most likely needed when describing the change in position of an object?
Candidate answers: (A) initial speed (B) direction change (C) reference point (D) constant rate
Gold answer: reference point

Question: Juan and LaKeisha roll a few objects down a ramp. They want to see which object rolls the farthest. What should they do so they can repeat their investigation?
Candidate answers: (A) Put the objects in groups. (B) Change the height of the ramp. (C) Choose different objects to roll. (D) Record the details of the investigation.
Gold answer:
-------------------------------------
True Answer
D
-------------------------------------
Target
 Put the objects in groups.
-------------------------------------
LLM's Prediction
 ( the details in order

-------------------------------------
True Answer
D
-------------------------------------
Target
 Change the height of the ramp.
-------------------------------------
LLM's Prediction
 ( the details of the details

-------------------------------------
True Answer
D
-------------------------------------
Target
 Choose different objects to roll.
-------------------------------------
LLM's Prediction
 ( different objects in repeat the
-------------------------------------
True Answer
D
Record the details of the investigation.
-------------------------------------
Target
 Record the details of the investigation.
-------------------------------------
LLM's Prediction
 ( the details of the details
@Linnore Linnore closed this as completed Apr 10, 2024
@Linnore Linnore reopened this Apr 10, 2024

Linnore commented Apr 10, 2024

So perhaps this design is intentional: let the LLM overcome the leaked target and give the answer it really believes in... If this is true, then issue closed.
