The documentation section on the blog has the wrong length for segment_ids
Issue replication:
In the given code, printing the shapes of the inputs to the BERT model exposes the issue.
import torch
from transformers import BertTokenizer

# Assumption: the bert-base-uncased tokenizer used elsewhere in the quickstart.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Tokenized input with special tokens around it (for BERT: [CLS] at the beginning and [SEP] at the end)
indexed_tokens = tokenizer.encode(text_1, text_2, add_special_tokens=True)

# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
segments_tensors = torch.tensor([segments_ids])
tokens_tensor = torch.tensor([indexed_tokens])

print(segments_tensors.shape)
print(tokens_tensor.shape)
The output is:
torch.Size([1, 16])
torch.Size([1, 14])
The shape mismatch is clear: 16 hardcoded segment IDs against 14 input tokens.
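As a sanity check, decoding the IDs back to tokens shows where the 14 comes from (a minimal sketch, assuming the same bert-base-uncased tokenizer as above):

# Decode the encoded IDs back to tokens to count them directly.
tokens = tokenizer.convert_ids_to_tokens(indexed_tokens)
print(len(tokens))  # 14
print(tokens)
# Expected for bert-base-uncased:
# ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
#  'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']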
Proposed Fix:
Since the segment_ids are hardcoded, we could compute them programmatically instead; one possible approach is sketched below.
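A minimal sketch of that fix, again assuming the bert-base-uncased tokenizer: encode_plus returns token_type_ids (the segment IDs) alongside input_ids, so the two are always the same length.

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_1 = "Who was Jim Henson ?"
text_2 = "Jim Henson was a puppeteer"

# Let the tokenizer build both the token IDs and the segment IDs so the
# lengths always agree, instead of hardcoding segments_ids.
encoded = tokenizer.encode_plus(text_1, text_2, add_special_tokens=True)
tokens_tensor = torch.tensor([encoded["input_ids"]])
segments_tensors = torch.tensor([encoded["token_type_ids"]])

print(tokens_tensor.shape)     # torch.Size([1, 14])
print(segments_tensors.shape)  # torch.Size([1, 14]); the shapes now match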
(I can raise a pull request, but I was not able to find the source for this specific documentation page.)