Fix: token embeddings inconsistency #3275
Open
Hello!
Currently, token embeddings returned are different based on whether you pass `None` or `"token_embeddings"` to `output_value` in `encode`. Current master behavior:
This is because in the case of `None`, the tokens are not truncated by removing padding tokens. While this discrepancy is not necessarily harmful, it is a bit surprising, and can lead to mean bugs. For example, it is fine to take the mean when `output_value == "token_embeddings"`, but not when it is `None`, because otherwise you'd include padding in the mean. I think in both cases it should be truncated.
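To make the mean issue concrete, here is a small synthetic sketch (not code from the library): averaging over padded positions pulls the result toward the padding embeddings.

```python
import torch

# Synthetic tensors, purely to show the mean issue: 6 token positions,
# of which only the first 3 are real tokens, the rest padding.
token_embeddings = torch.randn(6, 4)
attention_mask = torch.tensor([1, 1, 1, 0, 0, 0])

# Mean over everything, padding included: this is what you would silently get
# from the untruncated output_value=None result.
naive_mean = token_embeddings.mean(dim=0)

# Mean over real tokens only: this is what the truncated output gives you.
masked_mean = token_embeddings[attention_mask.bool()].mean(dim=0)

print(torch.allclose(naive_mean, masked_mean))  # False in general
```

To tackle this I: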
- Moved the padding-removal logic to a function in `util.py`.
- Only apply it when `output_value` is `None` or `"token_embeddings"`. This is to keep it off the hot path when people just want sentence embeddings.

The new function for removing padding is also 50x faster than the old one, and should be equivalent. Note that the functions are not equivalent in the case of non-contiguous attention masks. If the attention mask can look like this:

`[1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0]`

(so, a group of zeros, followed by ones, and then zeros again), the new method would grab the first 0 (index 3), while the old method would grab the third-to-last 1 (index 9). But I am not aware of cases where attention masks can look like this.
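Here is a hedged sketch of the two behaviours, not the code in this PR: `old_truncate` mirrors the backwards-scan loop that `encode` used for `"token_embeddings"` on master, `new_truncate` is an assumed first-zero cut consistent with the description above, and both names are made up.

```python
import torch

def old_truncate(token_emb: torch.Tensor, attention: torch.Tensor) -> torch.Tensor:
    # Python loop: walk backwards from the end until a non-zero mask value is
    # found, then keep everything up to and including that position.
    last_mask_id = len(attention) - 1
    while last_mask_id > 0 and attention[last_mask_id].item() == 0:
        last_mask_id -= 1
    return token_emb[: last_mask_id + 1]

def new_truncate(token_emb: torch.Tensor, attention: torch.Tensor) -> torch.Tensor:
    # Vectorized: cut at the first zero in the mask; keep everything if the
    # mask contains no zeros at all.
    if attention.bool().all():
        return token_emb
    first_zero = int(attention.argmin())
    return token_emb[:first_zero]

attention = torch.tensor([1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0])
token_emb = torch.randn(len(attention), 8)

print(old_truncate(token_emb, attention).shape[0])  # 10 -> keeps up to the last 1 (index 9)
print(new_truncate(token_emb, attention).shape[0])  # 3  -> stops at the first 0 (index 3)
```

For contiguous masks (ones followed only by trailing zeros), both functions return the same slice.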
Let me know what you think.