Empirical Study of Transformer’s Attention Mechanism via the Lens of Kernel.
Correspondence to:
- Yao-Hung Hubert Tsai (yaohungt@cs.cmu.edu)
A Unified Understanding of Transformer's Attention via the Lens of Kernel
Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov
Empirical Methods in Natural Language Processing (EMNLP), 2019.
Please cite our paper if you find our work useful for your research:
@inproceedings{tsai2019TransformerDissection,
title={Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel},
author={Tsai, Yao-Hung Hubert and Bai, Shaojie and Yamada, Makoto and Morency, Louis-Philippe and Salakhutdinov, Ruslan},
booktitle={EMNLP},
year={2019},
}
Slides are here.
Transformer's attention and kernel learning both concurrently and order-agnostically process all inputs by calculating the similarity between inputs.
We present a new formulation of attention via the lens of kernel. This formulation naturally highlights the main components of Transformer's attention, enabling a better understanding of this mechanism. Recent variants of Transformers can be expressed through these individual components. The approach also paves the way to a larger space of composing Transformer's attention.
Concretely, attention can be written as

$$\mathrm{Attention}\big(x_q;\, M(x_q, S_{x_k})\big) \;=\; \sum_{x_k \in M(x_q, S_{x_k})} \frac{k(x_q, x_k)}{\sum_{x_{k'} \in M(x_q, S_{x_k})} k(x_q, x_{k'})}\, v(x_k)$$

- $x_q$: query
- $S_{x_k}$: the set of keys
- $M(x_q, S_{x_k})$: set filtering function, which returns the subset of $S_{x_k}$ whose elements are visible to $x_q$
- $k(\cdot,\cdot)$: non-negative kernel function
- $v(\cdot)$: value function
- $\mathrm{Attention}\big(x_q;\, M(x_q, S_{x_k})\big)$: a linear smoother with kernel smoothing
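To make this view concrete, here is a minimal PyTorch sketch (ours, for illustration; the function and variable names are not taken from the released code) of attention written as a linear smoother: the output for each query is a kernel-weighted average of values over the keys returned by the set filtering function, encoded here as a boolean visibility mask.

```python
import torch

def kernel_smoother_attention(x_q, S_xk, kernel_fn, value_fn, mask=None):
    """Attention as a linear smoother with kernel smoothing.

    x_q:  (Lq, d)  queries
    S_xk: (Lk, d)  set of keys
    kernel_fn(x_q, S_xk) -> (Lq, Lk) non-negative similarities k(x_q, x_k)
    value_fn(S_xk)       -> (Lk, dv) values v(x_k)
    mask: optional (Lq, Lk) boolean, True where x_k is visible to x_q;
          this plays the role of the set filtering function M(x_q, S_{x_k}).
    """
    K = kernel_fn(x_q, S_xk)                   # k(x_q, x_k) >= 0
    if mask is not None:
        K = K * mask                           # drop keys filtered out by M
    weights = K / K.sum(dim=-1, keepdim=True)  # normalize over the visible set
    return weights @ value_fn(S_xk)            # weighted average of values


# With the asymmetric exponential kernel below, this reduces to standard
# scaled dot-product attention (a softmax over the visible keys).
d = 64
W_q, W_k, W_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
kernel_fn = lambda q, k: torch.exp((q @ W_q) @ (k @ W_k).T / d ** 0.5)
value_fn = lambda k: k @ W_v
out = kernel_smoother_attention(torch.randn(5, d), torch.randn(7, d), kernel_fn, value_fn)
```

With the exponential kernel, the normalization above is exactly a softmax over the visible keys; in practice one computes it in log space (i.e., via a softmax) for numerical stability.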
- Sequence Transformer (Vaswani et al., 2017; Dai et al., 2019): $\mathcal{X} = \mathcal{F} \times \mathcal{T}$, with $\mathcal{F}$ being the non-positional feature space and $\mathcal{T}$ being the positional embedding space.
- Image Transformer (Parmar et al., 2018): $\mathcal{X} = \mathcal{F} \times \mathcal{H} \times \mathcal{W}$, with $\mathcal{F}$ being the non-positional feature space, $\mathcal{H}$ being the positional space of the height in an image, and $\mathcal{W}$ being the positional space of the width in an image.
- Set Transformer (Lee et al., 2018): $\mathcal{X} = \mathcal{F}$, with no positional information present.
- Multimodal Transformer (Tsai et al., 2019): $\mathcal{X} = \mathcal{F}^{\ell} \times \mathcal{F}^{v} \times \mathcal{F}^{a} \times \mathcal{T}$, with $\mathcal{F}^{\ell}$ representing the language feature space, $\mathcal{F}^{v}$ representing the vision feature space, $\mathcal{F}^{a}$ representing the audio feature space, and $\mathcal{T}$ representing the temporal indicator space.
Most of the works utilize the asymmetric exponential kernel with learned mappings $W_q$, $W_k$ and scaling factor $\sqrt{d_k}$:

$$k_{\exp}(x_q, x_k) = \exp\!\left(\frac{\langle W_q x_q,\, W_k x_k\rangle}{\sqrt{d_k}}\right)$$
Note that in the paper we also try the linear, polynomial, and RBF kernels. We observe that kernels with an infinite feature dimension (i.e., the exponential and RBF kernels) lead to the best performance.
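For illustration, here is a sketch of two of these kernel choices under the kernel-smoother view above (our own code; the RBF bandwidth is a placeholder, not necessarily the value used in the paper's experiments):

```python
import torch

def k_exp(x_q, x_k, W_q, W_k, d_k):
    # Asymmetric exponential kernel: exp(<W_q x_q, W_k x_k> / sqrt(d_k)),
    # the choice used by most Transformer variants.
    return torch.exp((x_q @ W_q) @ (x_k @ W_k).T / d_k ** 0.5)

def k_rbf(x_q, x_k, W_q, W_k, d_k):
    # RBF kernel on the learned mappings; like the exponential kernel, its
    # feature space is infinite-dimensional.
    diff = (x_q @ W_q).unsqueeze(1) - (x_k @ W_k).unsqueeze(0)  # (Lq, Lk, d_k)
    return torch.exp(-diff.pow(2).sum(dim=-1) / (2.0 * d_k ** 0.5))
```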
- Absolute Positional Embedding (Vaswani et al., 2017): $k(x_q, x_k) = k_{\exp}(f_q + t_q,\, f_k + t_k)$, assuming a direct sum of the feature space (i.e., $x = f + t$, the positional embedding is added to the non-positional feature).
- Relative Positional Embedding (Dai et al., 2019): $k(x_q, x_k) = k_{\exp}(f_q, f_k) \cdot k_{f_q}(t_q, t_k)$, with $k_{f_q}$ being an asymmetric kernel with coefficients inferred by $f_q$.
- Relative Positional Embedding (Shaw et al., 2018; Huang et al., 2018): $k(x_q, x_k) = \exp\!\left(\frac{\langle W_q f_q,\, W_k f_k + W_{t_q - t_k}\rangle}{\sqrt{d_k}}\right)$, with $W_{t_q - t_k}$ being a learnable matrix indexed by the relative position $t_q - t_k$.
For the above variants of positional embedding integration, we find that the product kernel (the relative positional embedding of Dai et al., 2019) works the best.
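Below is a rough sketch (ours, and simplified: Transformer-XL's global bias terms are omitted, and the relative-embedding lookup is only one possible parameterization of the positional factor) contrasting the direct-sum integration with the product-kernel integration:

```python
import torch

def k_absolute(f_q, t_q, f_k, t_k, W_q, W_k, d_k):
    # Absolute positional embedding (Vaswani et al., 2017): the positional
    # embedding is added to the feature and the exponential kernel acts on
    # the sum, i.e. k_exp(f_q + t_q, f_k + t_k).
    return torch.exp(((f_q + t_q) @ W_q) @ ((f_k + t_k) @ W_k).T / d_k ** 0.5)

def k_product(f_q, pos_q, f_k, pos_k, W_q, W_k, W_r, rel_emb, d_k):
    # Product-kernel integration in the spirit of Transformer-XL
    # (Dai et al., 2019): k_exp(f_q, f_k) * k_{f_q}(t_q, t_k), where the
    # positional factor is an asymmetric kernel whose coefficients come
    # from f_q.
    # pos_q, pos_k: integer positions (LongTensor); rel_emb: (max_rel, d_t)
    # table of relative-position embeddings.
    k_feat = torch.exp((f_q @ W_q) @ (f_k @ W_k).T / d_k ** 0.5)
    rel = (pos_q[:, None] - pos_k[None, :]).clamp(min=0)  # offsets (causal setting)
    R = rel_emb[rel]                                      # (Lq, Lk, d_t) embeddings
    k_pos = torch.exp(torch.einsum('qd,qkd->qk', f_q @ W_q, R @ W_r) / d_k ** 0.5)
    return k_feat * k_pos
```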
- With Positional Embedding (Vaswani et al., 2017; Child et al., 2019): $v(x_k) = v(f_k + t_k) = W_v (f_k + t_k)$.
- Without Positional Embedding (Dai et al., 2019; Shaw et al., 2018; Huang et al., 2018): $v(x_k) = v(f_k) = W_v f_k$.
We empirically observe that the value function without positional embedding works better.
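Written out as code, the two value-function variants differ only in whether the positional embedding enters the value projection (a sketch under the same notation as above; `W_v` denotes the usual value projection):

```python
import torch

def value_with_pos(f_k, t_k, W_v):
    # Value function with positional embedding: v(x_k) = W_v (f_k + t_k).
    return (f_k + t_k) @ W_v

def value_without_pos(f_k, W_v):
    # Value function without positional embedding: v(x_k) = W_v f_k
    # (the variant that works better in our experiments).
    return f_k @ W_v
```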
- Encoder Self-Attention / Encoder-Decoder Attention in the original Transformer (Vaswani et al., 2017): $M(x_q, S_{x_k}) = S_{x_k}$ (all keys are visible to the query).
- Decoder Self-Attention in the original Transformer (Vaswani et al., 2017): $M(x_q, S_{x_k}) = \{x_k \in S_{x_k} : t_k \le t_q\}$ (only keys at positions up to the query's position are visible). Define $S_1$ as the set returned here.
- Decoder Self-Attention in Transformer-XL (Dai et al., 2019): $M(x_q, S_{x_k}) = S_{\mathrm{mem}} \cup S_1$. $S_{\mathrm{mem}}$ refers to additional memories.
- Decoder Self-Attention in Sparse Transformer (Child et al., 2019): $M(x_q, S_{x_k}) \subset S_1$ (a subset of $S_1$).
We empirically observe that the set filtering function that contains additional memories works the best.
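In an implementation, the set filtering function typically becomes a boolean visibility mask over keys. A sketch of the variants above (ours, for illustration):

```python
import torch

def encoder_mask(len_q, len_k):
    # Encoder self-attention / encoder-decoder attention: every key is visible,
    # i.e. M(x_q, S_{x_k}) = S_{x_k}.
    return torch.ones(len_q, len_k, dtype=torch.bool)

def decoder_mask(len_q):
    # Decoder self-attention (original Transformer): only keys at positions up
    # to the query's position are visible; this is the set S_1 above.
    return torch.tril(torch.ones(len_q, len_q)).bool()

def decoder_mask_with_memory(len_q, len_mem):
    # Decoder self-attention in Transformer-XL: S_mem ∪ S_1 -- the cached
    # memories are always visible, plus the causal part of the current segment.
    mem = torch.ones(len_q, len_mem, dtype=torch.bool)
    return torch.cat([mem, decoder_mask(len_q)], dim=-1)
```

These masks can be passed directly as the `mask` argument of the kernel-smoother sketch given earlier.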
We see that by changing the kernel construction, we can define a larger space of composing Attention. As an example, in the paper, we present a new form of Attention with a kernel that is
- valid (i.e., a kernel that is symmetric and positive semi-definite)
- delicate in the sense of constructing a kernel directly on the joint space (i.e., $\mathcal{F} \times \mathcal{T}$)
We empirically show that this form of kernel construction in Transformer's attention works as well as the kernel construction in Transformer-XL.
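For intuition, one way to obtain a valid kernel on the joint space is to multiply two symmetric exponential kernels, one on the non-positional features and one on the positional embeddings, each using a shared mapping for query and key. The sketch below is our own illustration of this idea, not necessarily the exact parameterization used in the paper's experiments.

```python
import torch

def k_valid_joint(f_q, t_q, f_k, t_k, W_f, W_t, d_k):
    # Product of two symmetric exponential kernels: one on the non-positional
    # features and one on the positional embeddings.  Each factor applies the
    # same mapping to both arguments, so each factor is symmetric and positive
    # semi-definite, and so is their product -- a valid kernel on F x T.
    k_feat = torch.exp((f_q @ W_f) @ (f_k @ W_f).T / d_k ** 0.5)
    k_pos = torch.exp((t_q @ W_t) @ (t_k @ W_t).T / d_k ** 0.5)
    return k_feat * k_pos
```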
We have a large paragraph in the paper discussing the order-invariance problem in Transformer's attention. The main take-away is that the operation in decoder self-attention is not order-agnostic, since the causal set filtering function already encodes positional order. For example, removing the positional embedding from decoder self-attention (in the original Transformer) does NOT decrease performance at all. But a better way to integrate positional embeddings, our kernel construction for example, still improves over not considering positional embeddings.
We apologize that the paper isn't perfectly organized. Please open a GitHub issue to ask questions.