-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SEAT score of bert-large-cased seems wrong #1
Comments
Hi Jihyung,
Thanks for pointing these out to us.
The first error is odd, and from what I understand, a bug of the
pytorch-pretrained-bert library. I think you can either patch the library
or update to a more recent version to resolve it (though that might break
other aspects, I'm not sure) .
The second error is our bad. We should definitely be appending the [CLS]
token. I believe we tried multiple ways to aggregate the BERT sentence
vectors and found that they performed similar, so I don't think this will
have a major impact on results, but we can try running it with the
corrected code (and if you do as well, please let us know!).
Best,
Alex
…On Sat, May 1, 2021 at 6:12 AM Jihyung Moon ***@***.***> wrote:
Dear authors,
It seems the reported SEAT score of bert-large-cased is wrong.
<https://user-images.githubusercontent.com/24843996/116778744-c7ea8900-aaae-11eb-9c10-00236335e747.png>
I was able to reproduce the results based on the current code base,
however, I found two errors in the code.
1. Even though I called bert-large-*cased*, tokenized tokens are all
*lowercased.*
import pytorch_pretrained_bert as bert
version = 'bert-large-cased'
tokenizer = bert.BertTokenizer.from_pretrained(version)text = 'SEAT score of bert-large-CASED'tokenized = tokenizer.tokenize(text) # ['seat', 'score', 'of', 'be', '##rt', '-', 'large', '-', 'case', '##d']
2. The score is calculated from the *first token* embedding vector, *not
a [CLS] token*.
Below is the sentbias/encoders/bert.py, and you can find text is not
prepended with a [CLS] token.
''' Convenience functions for handling BERT '''import torchimport pytorch_pretrained_bert as bert
def load_model(version='bert-large-uncased'):
''' Load BERT model and corresponding tokenizer '''
tokenizer = bert.BertTokenizer.from_pretrained(version)
model = bert.BertModel.from_pretrained(version)
model.eval()
return model, tokenizer
def encode(model, tokenizer, texts):
''' Use tokenizer and model to encode texts '''
encs = {}
for text in texts:
tokenized = tokenizer.tokenize(text) # <<< BUG: a [CLS] token should be prepended
indexed = tokenizer.convert_tokens_to_ids(tokenized)
segment_idxs = [0] * len(tokenized)
tokens_tensor = torch.tensor([indexed])
segments_tensor = torch.tensor([segment_idxs])
enc, _ = model(tokens_tensor, segments_tensor, output_all_encoded_layers=False)
enc = enc[:, 0, :] # extract the last rep of the first input
encs[text] = enc.detach().view(-1).numpy()
return encs
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#1>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABKDWG6UFBBWBJAYALN4NULTLPHYPANCNFSM436FSYTQ>
.
|
Thanks for the reply! To mitigate error propagation, I fixed the codes to resolve the two abovementioned issues. |
Thank you for the informative conversation @inmoonlight @W4ngatang I have some quick questions:
|
Hi Yifei,
Sorry for the late reply.
1. I believe it's extracting the top-level representation of the first
token (which would have been the [CLS] token).
2. That's something you could do, but we followed the original BERT paper
and (tried to) use the [CLS] token representation as the overall
representation. You can easily try both!
3. I'm not sure, but the data is quite small, so it shouldn't take too long
to run, even on CPU.
Best,
Alex
…On Tue, Feb 22, 2022 at 6:50 AM Li, Yifei 李逸飞 ***@***.***> wrote:
Thank you for the informative conversation @inmoonlight
<https://github.com/inmoonlight> @W4ngatang <https://github.com/W4ngatang>
I have some quick questions:
1. So the purpose of enc[:, 0, :] here is to extract the last
representation of token [CLS]? (though as mentioned above, it won’t be much
different even instead pick the representation of the first subword’s
token).
2. Is there any reason not to average the tokens along 0-axis (i.e.
the average of each token’s last representation), which I thought might
make more sense in evaluation given it catches each token’s information?
3. Is there any reason the code doesn’t put the model, token etc. on
GPU here?
—
Reply to this email directly, view it on GitHub
<#1 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABKDWG5ZKO4DULHON3JS7XDU4N2BNANCNFSM436FSYTQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thank you for your reply. That’s very helpful, Alex! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Dear authors,
It seems the reported SEAT score of
bert-large-cased
is wrong.I was able to reproduce the results based on the current code base, however, I found two errors in the code.
1. Even though I called bert-large-cased, tokenized tokens are all lowercased.
2. The score is calculated from the first token embedding vector, not a [CLS] token.
Below is the
sentbias/encoders/bert.py
, and you can find text is not prepended with a[CLS]
token.The text was updated successfully, but these errors were encountered: