
Conversation

@stephantul
Contributor

Hello!

Currently, the token embeddings returned by `encode` differ depending on whether you pass `None` or `"token_embeddings"` as `output_value`.

Current master behavior (`s` is a SentenceTransformer producing 768-dim embeddings; the exact model doesn't matter):

```python
>>> [x.shape for x in s.encode(["dog", "dog and cat"], output_value="token_embeddings")]
[torch.Size([3, 768]), torch.Size([5, 768])]

>>> [x["token_embeddings"].shape for x in s.encode(["dog", "dog and cat"], output_value=None)]
[torch.Size([5, 768]), torch.Size([5, 768])]
```

This is because, in the `None` case, the token embeddings are not truncated by removing padding tokens. While this discrepancy is not necessarily harmful, it is a bit surprising, and can lead to nasty bugs: for example, it is fine to take the mean over the token embeddings when `output_value == "token_embeddings"`, but not when it is `None`, because you would then include padding vectors in the mean. I think the embeddings should be truncated in both cases.
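To make the bug concrete, here is a minimal sketch, reusing `s` from above and assuming the dict returned for `output_value=None` also exposes the tokenizer's `attention_mask` (it does on current versions):

```python
import torch

out = s.encode(["dog", "dog and cat"], output_value=None)
dog = out[0]["token_embeddings"]        # shape [5, 768]: still padded
mask = out[0]["attention_mask"].bool()  # 0 at the padding positions

naive_mean = dog.mean(dim=0)         # padding vectors pulled into the mean
masked_mean = dog[mask].mean(dim=0)  # what you actually wanted
# naive_mean != masked_mean whenever batching forced padding
```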

To tackle this I:

  1. Introduced a new function that truncates based on an attention mask, and put it in `util.py` (see the sketch after this list).
  2. Applied this function only when `output_value` is `None` or `"token_embeddings"`, to keep it off the hot path when people just want sentence embeddings.
  3. Introduced a test that shows the discrepancy above no longer occurs, and that also tests equivalence between the new function and the old method for removing padding.
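For reference, here is a minimal sketch of the idea behind the new function; the name and signature are illustrative, not the actual `util.py` code:

```python
import torch

def truncate_by_attention_mask(
    token_embeddings: torch.Tensor,  # [batch, seq_len, dim]
    attention_mask: torch.Tensor,    # [batch, seq_len], 1 = real token
) -> list[torch.Tensor]:
    # Position of the first 0 in each row; argmax returns 0 for an
    # all-ones row, so fall back to the full sequence length there.
    first_pad = (attention_mask == 0).int().argmax(dim=1)
    lengths = torch.where(
        attention_mask.bool().all(dim=1), attention_mask.shape[1], first_pad
    )
    return [emb[:length] for emb, length in zip(token_embeddings, lengths)]
```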

The new function for removing padding is also about 50x faster than the old one, and should be equivalent, with one caveat: the two are not equivalent for non-contiguous attention masks. That is, if an attention mask looks like this:

[1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

(so a group of ones, followed by zeros, then ones again, and trailing zeros), the new method would cut at the first 0 (index 3), while the old method would only strip the trailing zeros and cut at the third-to-last 0 (index 9). But I am not aware of cases where attention masks can look like this.
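Concretely, assuming the same first-zero logic as in the sketch above, versus the old strip-trailing-padding loop:

```python
import torch

mask = torch.tensor([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])

# New method: cut at the first 0.
new_cut = (mask == 0).int().argmax().item()  # 3

# Old method: walk back over the trailing 0s.
old_cut = len(mask)
while old_cut > 0 and mask[old_cut - 1] == 0:
    old_cut -= 1  # ends at 9

print(new_cut, old_cut)  # 3 9
```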

Let me know what you think.

@stephantul
Contributor Author

Oh, and one issue: this change might not be backwards compatible, because people might rely on the current behavior.

@tomaarsen
Member

The backwards incompatibility here makes me a bit hesitant. I don't think I'll include this in v4; it's a bit too short notice for me to consider all of the consequences.

  • Tom Aarsen

@stephantul
Contributor Author

@tomaarsen Got it! Let me know if you'd like to revisit it at some point. Feel free to close it in the meantime.
