-
My guess is that they don't match in the time dimension? That should already be supported. You do need the same number of features and heads; otherwise you cannot take an inner product. If I remember the paper correctly, the main difference is that the queries are learned rather than formed by a linear projection of the inputs. We could consider adding an option to learn the queries instead of projecting them, though that seems hard to do with the current implementation. Note that we are hesitant to add new features to the attention layers because there are so many variants and new ones are introduced frequently. That said, if there are small and easy-to-grasp changes that make the layers more widely applicable, we will of course consider them.
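For reference, a minimal sketch of what learned queries could look like on top of Flax's existing attention layer; the module name `LearnedQueryAttention`, the `num_queries` field, and the way `nn.MultiHeadDotProductAttention` is reused here are illustrative assumptions, not an existing library option:

```python
# Hypothetical sketch: cross-attention with a learned query array instead of
# queries projected from the inputs (as in Perceiver-style decoders).
import jax
import jax.numpy as jnp
import flax.linen as nn


class LearnedQueryAttention(nn.Module):
    num_queries: int   # length of the learned query array (illustrative name)
    num_heads: int
    qkv_features: int

    @nn.compact
    def __call__(self, inputs_kv):
        # The queries are a trainable parameter, broadcast over the batch,
        # rather than a linear projection of `inputs_kv`.
        queries = self.param(
            'query', nn.initializers.normal(0.02),
            (self.num_queries, self.qkv_features))
        queries = jnp.broadcast_to(
            queries, (inputs_kv.shape[0],) + queries.shape)
        # Reuse the existing attention layer for the rest of the computation.
        return nn.MultiHeadDotProductAttention(
            num_heads=self.num_heads,
            qkv_features=self.qkv_features)(queries, inputs_kv)


# Example: 16 learned queries attending over a length-128 input sequence.
x = jnp.ones((2, 128, 64))
model = LearnedQueryAttention(num_queries=16, num_heads=4, qkv_features=64)
params = model.init(jax.random.PRNGKey(0), x)
y = model.apply(params, x)   # shape (2, 16, 64)
```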
-
Perceiver IO uses asymmetric attention: the dimensionality of its keys and queries doesn't match. That differs from the Flax implementation. Another difference is that, in its cross-attention blocks, Perceiver IO uses an attention variant where the query's dimension defaults to the dimension of the key's inputs and the output dimension must match the dimension of the query's inputs.
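For concreteness, a small sketch of what such an asymmetric cross-attention could look like with Flax's existing layer; the module name `PerceiverCrossAttention` and the dimension choices are assumptions for illustration, mirroring the defaults described above (projection width taken from the key/value inputs, output width matching the query inputs):

```python
# Hypothetical sketch of Perceiver-IO-style asymmetric cross-attention:
# queries and keys/values come from arrays with different feature widths.
import jax
import jax.numpy as jnp
import flax.linen as nn


class PerceiverCrossAttention(nn.Module):
    num_heads: int

    @nn.compact
    def __call__(self, inputs_q, inputs_kv):
        return nn.MultiHeadDotProductAttention(
            num_heads=self.num_heads,
            # Project q, k, v to the width of the key/value inputs ...
            qkv_features=inputs_kv.shape[-1],
            # ... and project the output back to the width of the query inputs.
            out_features=inputs_q.shape[-1])(inputs_q, inputs_kv)


# Latents (queries) have 256 features; the input array (keys/values) has 64.
latents = jnp.ones((2, 32, 256))
inputs = jnp.ones((2, 1024, 64))
model = PerceiverCrossAttention(num_heads=8)
params = model.init(jax.random.PRNGKey(0), latents, inputs)
y = model.apply(params, latents, inputs)   # shape (2, 32, 256)
```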
I've written a general implementation of multi-head attention, along with specializations that can be used with Perceiver IO:
The parameter layout is compatible with the Flax implementation.
Does it make sense to contribute these modules to the Flax codebase?