fixed sentence-length bug when using the tokenizer with sentence pairs. (#60)

The `sep_offset_value` in the `build_scatter_offsets` method should be computed on only the first sentence of the input pair; otherwise, if the second sentence is longer, the resulting sentence offsets are wrong.
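
To illustrate, here is a minimal, self-contained sketch of the bug and the fix. The offsets below are hypothetical, not taken from the repository: None marks the [SEP] positions, and word offsets restart at 0 for the second sentence of the pair, as with the word_ids() of HuggingFace fast tokenizers.

    # Hypothetical word offsets for a sentence pair: the first sentence has
    # 2 words, the second has 4; None marks the [SEP] positions.
    word_offsets = [0, 1, None, 0, 1, 2, 3, None]

    # position of the first [SEP], i.e. the first None occurrence
    sep_index = word_offsets.index(None)  # -> 2

    # before the fix: the max is taken over BOTH sentences, so the longer
    # second sentence inflates the SEP offset of the first one
    buggy_sep_offset = max([w for w in word_offsets if w is not None]) + 1  # -> 4 (wrong)

    # after the fix: only offsets before the first [SEP] are considered
    fixed_sep_offset = max([w for w in word_offsets[:sep_index] if w is not None]) + 1  # -> 2 (correct)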
andreim14 authored May 31, 2023
1 parent 71f6699 commit e535425
Showing 1 changed file with 6 additions and 3 deletions.
9 changes: 6 additions & 3 deletions transformers_embedder/tokenizer.py
@@ -227,11 +227,14 @@ def build_scatter_offsets(
         # otherwise, we can just use word_ids as is
         else:
             word_offsets = word_ids
-        # here we retrieve the max offset for the sample, which will be used as SEP offset
-        # and also as padding value for the offsets
-        sep_offset_value = max([w for w in word_offsets if w is not None]) + 1
+
         # replace first None occurrence with sep_offset
         sep_index = word_offsets.index(None)
+
+        # here we retrieve the max offset for the sample, which will be used as SEP offset
+        # and also as padding value for the offsets
+        sep_offset_value = max([w for w in word_offsets[:sep_index] if w is not None]) + 1
+
         word_offsets[sep_index] = sep_offset_value
         # if there is a text pair, we need to adjust the offsets for the second text
         if there_is_text_pair:
