The documentation section on the blog has the wrong length for segment_ids
Issue replication:
In the given code, printing the shapes of the inputs to the BERT model exposes the issue.
import torch
from transformers import BertTokenizer

# Assumption: the bert-base-uncased tokenizer used elsewhere in the quickstart.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)

# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

print(segments_tensors.shape)
print(tokens_tensor.shape)
The output is:
torch.Size([1, 16])
torch.Size([1, 14])
The shape mismatch is clear: 16 hardcoded segment IDs against 14 input tokens.
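As a sanity check, decoding the IDs back to tokens shows where the 14 comes from (a minimal sketch, assuming the same bert-base-uncased tokenizer as above):

# Decode the encoded IDs back to tokens to count them directly.
tokens = tokenizer.convert_ids_to_tokens(indexed_tokens)
print(len(tokens))  # 14
print(tokens)
# Expected for bert-base-uncased:
# ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
#  'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']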
Proposed Fix:
Since the segment_ids are hardcoded, we could compute them programmatically instead; one possible approach is sketched below.
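A minimal sketch of that fix, again assuming the bert-base-uncased tokenizer: encode_plus returns token_type_ids (the segment IDs) alongside input_ids, so the two are always the same length.

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Let the tokenizer build both the token IDs and the segment IDs so the
# lengths always agree, instead of hardcoding segments_ids.
encoded = tokenizer.encode_plus(text_1, text_2, add_special_tokens=True)
tokens_tensor = torch.tensor([encoded["input_ids"]])
segments_tensors = torch.tensor([encoded["token_type_ids"]])

print(tokens_tensor.shape)     # torch.Size([1, 14])
print(segments_tensors.shape)  # torch.Size([1, 14]); the shapes now match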
(I can raise a pull request, but I was not able to find the source for this specific documentation page.)