fixed sentence-length bug when using the tokenizer with sentence pairs. (#60)

The `sep_offset_value` in the `build_scatter_offsets` method should be computed on only the first sentence of the input pair; otherwise, if the second sentence is longer, the resulting sentence offsets are wrong.
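
To illustrate, here is a minimal, self-contained sketch of the bug and the fix. The offsets below are hypothetical, not taken from the repository: None marks the [SEP] positions, and word offsets restart at 0 for the second sentence of the pair, as with the word_ids() of HuggingFace fast tokenizers.

    # Hypothetical word offsets for a sentence pair: the first sentence has
    # 2 words, the second has 4; None marks the [SEP] positions.
    word_offsets = [0, 1, None, 0, 1, 2, 3, None]

    # position of the first [SEP], i.e. the first None occurrence
    sep_index = word_offsets.index(None)  # -> 2

    # before the fix: the max is taken over BOTH sentences, so the longer
    # second sentence inflates the SEP offset of the first one
    buggy_sep_offset = max([w for w in word_offsets if w is not None]) + 1  # -> 4 (wrong)

    # after the fix: only offsets before the first [SEP] are considered
    fixed_sep_offset = max([w for w in word_offsets[:sep_index] if w is not None]) + 1  # -> 2 (correct)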
andreim14 authored May 31, 2023
1 parent 71f6699 commit e535425
Showing 1 changed file with 6 additions and 3 deletions.
9 changes: 6 additions & 3 deletions transformers_embedder/tokenizer.py
@@ -227,11 +227,14 @@ def build_scatter_offsets(
         # otherwise, we can just use word_ids as is
         else:
             word_offsets = word_ids
-        # here we retrieve the max offset for the sample, which will be used as SEP offset
-        # and also as padding value for the offsets
-        sep_offset_value = max([w for w in word_offsets if w is not None]) + 1
+
         # replace first None occurrence with sep_offset
         sep_index = word_offsets.index(None)
+
+        # here we retrieve the max offset for the sample, which will be used as SEP offset
+        # and also as padding value for the offsets
+        sep_offset_value = max([w for w in word_offsets[:sep_index] if w is not None]) + 1
+
         word_offsets[sep_index] = sep_offset_value
         # if there is a text pair, we need to adjust the offsets for the second text
         if there_is_text_pair:
