-
My guess is that they don't match in the time dimension? That should already be supported. You do need the same number of features and heads; otherwise you cannot take an inner product. If I remember the paper correctly, the main difference is that the queries are learned rather than formed by a linear projection of the inputs. We could consider adding an option to learn the queries instead of projecting them, though that seems hard to do with the current implementation. Note that we are hesitant to add new features to the attention layers because there are so many variants and new ones are introduced frequently. That said, if there are small and easy-to-grasp changes that make the layers more widely applicable, we will of course consider them.
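For reference, a minimal sketch of what learned queries could look like on top of Flax's existing attention layer; the module name `LearnedQueryAttention`, the `num_queries` field, and the way `nn.MultiHeadDotProductAttention` is reused here are illustrative assumptions, not an existing library option:

```python
# Hypothetical sketch: cross-attention with a learned query array instead of
# queries projected from the inputs (as in Perceiver-style decoders).
import jax
import jax.numpy as jnp
import flax.linen as nn


class LearnedQueryAttention(nn.Module):
    num_queries: int   # length of the learned query array (illustrative name)
    num_heads: int
    qkv_features: int

    @nn.compact
    def __call__(self, inputs_kv):
        # The queries are a trainable parameter, broadcast over the batch,
        # rather than a linear projection of `inputs_kv`.
        queries = self.param(
            'query', nn.initializers.normal(0.02),
            (self.num_queries, self.qkv_features))
        queries = jnp.broadcast_to(
            queries, (inputs_kv.shape[0],) + queries.shape)
        # Reuse the existing attention layer for the rest of the computation.
        return nn.MultiHeadDotProductAttention(
            num_heads=self.num_heads,
            qkv_features=self.qkv_features)(queries, inputs_kv)


# Example: 16 learned queries attending over a length-128 input sequence.
x = jnp.ones((2, 128, 64))
model = LearnedQueryAttention(num_queries=16, num_heads=4, qkv_features=64)
params = model.init(jax.random.PRNGKey(0), x)
y = model.apply(params, x)   # shape (2, 16, 64)
```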
-
Perceiver IO uses asymmetric attention: the dimensionality of its keys and queries doesn't match. That differs from the Flax implementation. Another difference is that, in its cross-attention blocks, Perceiver IO uses an attention variant where the query's dimension defaults to the dimension of the key's inputs and the output dimension must match the dimension of the query's inputs.
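For concreteness, a small sketch of what such an asymmetric cross-attention could look like with Flax's existing layer; the module name `PerceiverCrossAttention` and the dimension choices are assumptions for illustration, mirroring the defaults described above (projection width taken from the key/value inputs, output width matching the query inputs):

```python
# Hypothetical sketch of Perceiver-IO-style asymmetric cross-attention:
# queries and keys/values come from arrays with different feature widths.
import jax
import jax.numpy as jnp
import flax.linen as nn


class PerceiverCrossAttention(nn.Module):
    num_heads: int

    @nn.compact
    def __call__(self, inputs_q, inputs_kv):
        return nn.MultiHeadDotProductAttention(
            num_heads=self.num_heads,
            # Project q, k, v to the width of the key/value inputs ...
            qkv_features=inputs_kv.shape[-1],
            # ... and project the output back to the width of the query inputs.
            out_features=inputs_q.shape[-1])(inputs_q, inputs_kv)


# Latents (queries) have 256 features; the input array (keys/values) has 64.
latents = jnp.ones((2, 32, 256))
inputs = jnp.ones((2, 1024, 64))
model = PerceiverCrossAttention(num_heads=8)
params = model.init(jax.random.PRNGKey(0), latents, inputs)
y = model.apply(params, latents, inputs)   # shape (2, 32, 256)
```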
I've written a general implementation of multi-head attention, along with specializations that can be used with Perceiver IO:
The parameter layout is compatible with the Flax implementation.
Does it make sense to contribute these modules to the Flax codebase?