Fix: token embeddings inconsistency #3275
Open
Hello!
Currently, token embeddings returned are different based on whether you pass `None` or `"token_embeddings"` to `output_value` in `encode`. Current master behavior:
This is because in the case of `None`, the tokens are not truncated by removing padding tokens. While this discrepancy is not necessarily harmful, it is a bit surprising, and can lead to mean bugs. For example, it is fine to take the mean when `output_value == "token_embeddings"`, but not when it is `None`, because otherwise you'd include padding in the mean. I think in both cases it should be truncated.
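To make the mean issue concrete, here is a small synthetic sketch (not code from the library): averaging over padded positions pulls the result toward the padding embeddings.

```python
import torch

# Synthetic tensors, purely to show the mean issue: 6 token positions,
# of which only the first 3 are real tokens, the rest padding.
token_embeddings = torch.randn(6, 4)
attention_mask = torch.tensor([1, 1, 1, 0, 0, 0])

# Mean over everything, padding included: this is what you would silently get
# from the untruncated output_value=None result.
naive_mean = token_embeddings.mean(dim=0)

# Mean over real tokens only: this is what the truncated output gives you.
masked_mean = token_embeddings[attention_mask.bool()].mean(dim=0)

print(torch.allclose(naive_mean, masked_mean))  # False in general
```

To tackle this I: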
- Moved the padding-removal logic to a function in `util.py`.
- Only apply it when `output_value` is `None` or `"token_embeddings"`. This is to keep it off the hot path when people just want sentence embeddings.

The new function for removing padding is also 50x faster than the old one, and should be equivalent. Note that the functions are not equivalent in the case of non-contiguous attention masks. If the attention mask can look like this:

`[1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0]`

(so, a group of zeros, followed by ones, and then zeros again), the new method would grab the first 0 (index 3), while the old method would grab the third-to-last 1 (index 9). But I am not aware of cases where attention masks can look like this.
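Here is a hedged sketch of the two behaviours, not the code in this PR: `old_truncate` mirrors the backwards-scan loop that `encode` used for `"token_embeddings"` on master, `new_truncate` is an assumed first-zero cut consistent with the description above, and both names are made up.

```python
import torch

def old_truncate(token_emb: torch.Tensor, attention: torch.Tensor) -> torch.Tensor:
    # Python loop: walk backwards from the end until a non-zero mask value is
    # found, then keep everything up to and including that position.
    last_mask_id = len(attention) - 1
    while last_mask_id > 0 and attention[last_mask_id].item() == 0:
        last_mask_id -= 1
    return token_emb[: last_mask_id + 1]

def new_truncate(token_emb: torch.Tensor, attention: torch.Tensor) -> torch.Tensor:
    # Vectorized: cut at the first zero in the mask; keep everything if the
    # mask contains no zeros at all.
    if attention.bool().all():
        return token_emb
    first_zero = int(attention.argmin())
    return token_emb[:first_zero]

attention = torch.tensor([1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0])
token_emb = torch.randn(len(attention), 8)

print(old_truncate(token_emb, attention).shape[0])  # 10 -> keeps up to the last 1 (index 9)
print(new_truncate(token_emb, attention).shape[0])  # 3  -> stops at the first 0 (index 3)
```

For contiguous masks (ones followed only by trailing zeros), both functions return the same slice.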
Let me know what you think.