Observing eval accuracy considerably lower than reported? #3
Comments
As an example, on sciq: GPT-2-medium reports an accuracy of 0.43 (0.5 after I lowered the learning rate). LLaMA-7B ground truth got 0.84. LLaMA-7B transferred got 0.43 (0.81 after I lowered the learning rate). |
0.43 is worse than random so something is either wrong with the ML there or your eval set isn't big enough |
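The eval-set-size point can be made concrete with a rough binomial estimate. The sketch below is not from the repository: the function names and the eval sizes are hypothetical, and it assumes a balanced binary task where chance accuracy is 0.5. It shows how the standard error of a measured accuracy shrinks with the number of eval examples, and how far 0.43 sits from chance at each size.

```python
import math


def accuracy_standard_error(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


def z_score_vs_chance(acc: float, n: int, chance: float = 0.5) -> float:
    """Number of standard errors between the measured accuracy and chance.

    Uses the standard error under the null hypothesis (accuracy == chance).
    """
    return (acc - chance) / math.sqrt(chance * (1.0 - chance) / n)


if __name__ == "__main__":
    # Hypothetical eval sizes, chosen only for illustration.
    for n in (100, 1000, 10000):
        se = accuracy_standard_error(0.43, n)
        z = z_score_vs_chance(0.43, n)
        print(f"n={n:>6}  acc=0.43  std_err={se:.4f}  z_vs_chance={z:.2f}")
```

With a small eval set, 0.43 can be within noise of 0.5; with a large one, the same number would point to a real problem in training or labeling.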
Yeah I figured the eval set size seemed small but assumed that the line in the README would work directly. Might test it out again later with a larger eval size. |
10X'd the train/test sizes. New results on sciq with gpt2-med and llama-7b after a quick run:

- GPT-2-Med ending acc: 0.661 +/- 0.006694460396477075
- LLaMA-7B ending acc (gt): 0.866 +/- 0.015234434679370284
- LLaMA-7B ending acc (transfer): 0.704 +/- 0.020414896521902825

Looks nice! Pretty closely aligned with the Qwen results, with slightly lower transfer efficacy. Hope others will add their OSS model eval results soon too. Would suggest increasing the n_docs/n_test_docs values in the README command? Current values seem pretty low. |
haha yeah, they are low! can update that. things to generally keep in mind:
|
Off-topic, but I am curious about how you guys are thinking of labeling by a weak supervisor vs criticism/scoring by a weak supervisor. I guess there can be an argument in both directions, whether labeling is easier for a weak model or criticism. |
I guess criticism may introduce even more noise due to hallucinations, but if alignment is from the perspective of a “weaker human” to strong model, it may intuitively be easier than labeling. |
I am seeing the same noise on my side. Could this be because of the way the classification head is initialized? The paper claims to initialize the head with the embedding weights of the tokens "0" and "1", whereas the code seems to initialize it differently. |
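For readers unfamiliar with the idea, here is a minimal PyTorch sketch of initializing a two-class head from the rows of an unembedding matrix. It is an illustration only, not the code from the repository or the paper: `init_head_from_unembedding`, the toy matrix, and the token ids are all hypothetical, and in practice the ids would have to come from the model's own tokenizer.

```python
import torch
from torch import nn


def init_head_from_unembedding(
    head: nn.Linear, unembed: torch.Tensor, token_ids: list[int]
) -> None:
    """Copy the unembedding rows for the given tokens into a classification head.

    head:      nn.Linear(hidden_size, num_classes)
    unembed:   tensor of shape (vocab_size, hidden_size)
    token_ids: one token id per class, in class order
    """
    rows = unembed[token_ids]  # shape: (num_classes, hidden_size)
    if rows.shape != head.weight.shape:
        raise ValueError(f"shape mismatch: {rows.shape} vs {head.weight.shape}")
    with torch.no_grad():
        head.weight.copy_(rows)
        if head.bias is not None:
            head.bias.zero_()


if __name__ == "__main__":
    vocab_size, hidden_size = 10, 4
    # Toy stand-in for a language model's unembedding matrix.
    unembed = torch.randn(vocab_size, hidden_size)
    head = nn.Linear(hidden_size, 2)
    # Hypothetical ids for the tokens "0" and "1".
    init_head_from_unembedding(head, unembed, [3, 7])
    print(torch.equal(head.weight, unembed[[3, 7]]))
```

With this initialization the head's logits start out equal to the language model's logits for the two chosen tokens, rather than coming from a random projection.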
I actually tried initializing using unembeddings, and it didn't seem to help. but I didn't test very extensively. my hunch is it's not the issue. by the way, there is some substantial literature on noisiness of fine-tuning, e.g. https://arxiv.org/pdf/2002.06305.pdf |
Will look at this, thanks a lot. It would be nice to know how you initialized the head with the unembedding weights. |
here's the code i had used, had done it sort of hackily
|
Thank you so much @WuTheFWasThat, will test it |
Hi, thanks for open-sourcing this code. I'm noticing that my tests with GPT-2 variants show considerably lower eval accuracies than what's reported in the paper & charts. I'm using the command provided in the README. I do not think the eval code itself is incorrect --- testing it with LLaMA shows much higher eval accuracies (as I would expect). But I cannot replicate the GPT-2 results; any pointers on what the issue might be?