Observing eval accuracy considerably lower than reported? #3

Open
knagrecha opened this issue Dec 14, 2023 · 12 comments

Comments

@knagrecha

Hi, thanks for open-sourcing this code. I'm noticing that my tests with GPT-2 variants show considerably lower eval accuracies than what's reported in the paper and charts. I'm using the command provided in the README. I do not think the eval code itself is incorrect: testing it with LLaMA shows much higher eval accuracies (as I would expect). But I cannot replicate the GPT-2 results. Any pointers on what the issue might be?

@knagrecha
Author

knagrecha commented Dec 14, 2023

As an example:

On sciq, GPT-2-medium reports an accuracy of 0.43 (0.5 after I lowered the learning rate). LLaMA-7B ground truth got 0.84. LLaMA-7B transferred got 0.43 (0.81 after I lowered the learning rate).

@WuTheFWasThat
Contributor

0.43 is worse than random, so either something is wrong with the ML there or your eval set isn't big enough
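To put a rough number on the eval-set-size point: if eval accuracy is treated as a binomial proportion, its standard error is sqrt(p(1-p)/n). The sketch below is illustrative only. The eval sizes and the 0.5 chance level (balanced binary labels) are assumptions, not values taken from the README or the repo, and the helper names are hypothetical.

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy measured on n_examples i.i.d. examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def is_distinguishable_from_chance(
    accuracy: float, n_examples: int, chance: float = 0.5, z: float = 1.96
) -> bool:
    """True if the gap to chance exceeds z standard errors under the chance hypothesis.

    chance=0.5 assumes a balanced binary task (an assumption, not read from the repo).
    """
    se_under_chance = math.sqrt(chance * (1.0 - chance) / n_examples)
    return abs(accuracy - chance) > z * se_under_chance


if __name__ == "__main__":
    # Hypothetical eval-set sizes, chosen only to show how the error bar shrinks.
    for n in (100, 1000, 10000):
        se = accuracy_standard_error(0.43, n)
        flag = is_distinguishable_from_chance(0.43, n)
        print(f"n={n:>5}  acc=0.43 +/- {se:.3f}  distinguishable from 0.5: {flag}")
```

With a small eval set, an accuracy of 0.43 can sit within sampling noise of 0.5, so a larger eval set is the cheapest first check before suspecting the training code.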

@knagrecha
Author

knagrecha commented Dec 15, 2023

Yeah, I figured the eval set size seemed small, but I assumed that the line in the README would work directly. I might test it out again later with a larger eval size.

@knagrecha
Author

knagrecha commented Dec 15, 2023

10X'd the train/test sizes. New results on sciq with gpt2-med and llama-7b after a quick run:

GPT-2-Med ending acc: 0.661 +/- 0.006694460396477075

LLaMA-7B ending acc (gt): 0.866 +/- 0.015234434679370284

LLaMA-7B ending acc (transfer): 0.704 +/- 0.020414896521902825

Looks nice! Pretty closely aligned with the Qwen results, with slightly lower transfer efficacy. Hope others will add their OSS model eval results soon too.

I'd suggest increasing the n_docs/n_test_docs values in the README command; the current values seem pretty low.

@WuTheFWasThat
Contributor

haha yeah, they are low! can update that

things to generally keep in mind:

  • things are somewhat noisy in general, even with a large dataset. results are cleaner when averaging across many seeds. i'm not totally sure why but i think they're noisier than our internal setup was
  • truncating the dataset to be smaller makes things even noisier
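A minimal sketch of the seed-averaging point above: report the mean and the standard error of the mean over per-seed accuracies rather than a single run. The helper name and the accuracy values are hypothetical placeholders, not results from this repo.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_across_seeds(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) over per-seed accuracies.

    Needs at least two seeds, since the sample standard deviation is undefined for one.
    """
    mean = statistics.mean(accuracies)
    stderr = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, stderr


if __name__ == "__main__":
    # Placeholder numbers for illustration only; substitute one accuracy per seed.
    per_seed_accuracies = [0.64, 0.68, 0.66, 0.70, 0.62]
    mean, stderr = summarize_across_seeds(per_seed_accuracies)
    print(f"mean acc over {len(per_seed_accuracies)} seeds: {mean:.3f} +/- {stderr:.3f}")
```

The standard error shrinks roughly with the square root of the number of seeds, so halving the error bar takes about four times as many runs.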

@knagrecha
Author

Off-topic, but I am curious about how you guys are thinking of labeling by a weak supervisor vs criticism/scoring by a weak supervisor. I guess there can be an argument in both directions, whether labeling is easier for a weak model or criticism.

@knagrecha
Author

I guess criticism may introduce even more noise due to hallucinations, but if alignment is from the perspective of a “weaker human” to strong model, it may intuitively be easier than labeling.

@agokrani

I am having the same issue of noise on my side. Could this be because of the way the classification head was initialized? The paper claims to initialize the head with the embedding weights of tokens "0" and "1", whereas the code seems to initialize it differently.

@WuTheFWasThat
Contributor

I actually tried initializing using unembeddings, and it didn't seem to help. but I didn't test very extensively. my hunch is it's not the issue.

by the way, there is some substantial literature on noisiness of fine-tuning, e.g. https://arxiv.org/pdf/2002.06305.pdf

@agokrani

Will look at this, thanks a lot. It would be nice to know how you initialized with the unembedding weights.

@WuTheFWasThat
Contributor

WuTheFWasThat commented Dec 19, 2023

here's the code i had used, had done it sort of hackily

        # NOTE: this has to happen after the rest of the model is initialized
        unemb = self.transformer.wte.weight.data
        assert self.num_labels == 2
        inds = [
            11491, # incorrect
            3376, # correct
        ]
        new_data = unemb[inds, :]
        self.score.weight.data.copy_(new_data)

@agokrani

Thank you so much @WuTheFWasThat, will test it


3 participants