This repo contains the demo code for Transformer-XL with the Self-Dependency Unit. This work is closely related to gating-enhanced Transformer variants, such as Google's Switch Transformers.
Yekun Chai et al., Highway Transformer: Self-Gating Enhanced Self-Attentive Networks (ACL 2020)
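The Self-Dependency Unit (SDU) is a self-gating mechanism in the spirit of highway networks: each token gates its own representation. As a rough, illustrative sketch only (the exact formulation in this repo may differ; see the code under `pytorch/` and the paper), a highway-style self-gating unit in PyTorch could look like this:

```python
import torch
import torch.nn as nn


class SelfGatingSketch(nn.Module):
    """Illustrative self-gating (highway-style) unit.

    NOTE: a simplified sketch for intuition only; the actual Self-Dependency
    Unit used in this repo may differ (see the code under pytorch/ and the paper).
    """

    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)       # transform gate T(x)
        self.candidate = nn.Linear(d_model, d_model)  # candidate branch H(x)

    def forward(self, x, sublayer_out):
        # Both the gate and the candidate are computed from the token itself
        # ("self-gating"), rather than from an external context.
        t = torch.sigmoid(self.gate(x))
        h = torch.tanh(self.candidate(x))
        # Highway-style blend of the gated self branch with the sublayer output
        return t * h + (1.0 - t) * sublayer_out
```

In the Highway Transformer, such gating is combined with the Transformer-XL sublayers; the snippet above is only meant to convey the self-gating intuition, not the paper's exact placement or parameterization.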
- PyTorch >= 1.1.0
- TensorboardX >= 1.8
- Tensorboard >= 1.14
- 4 GPUs, each with 8 GB of memory, for training the 12-layer Transformer-XL (a quick GPU sanity check is sketched below)
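Before launching training, you may want to confirm that PyTorch actually sees enough GPUs (an illustrative check, not part of the repo's scripts):

```python
import torch

# Sanity check: the 12-layer Transformer-XL setting above expects 4 visible GPUs
print(f"CUDA available: {torch.cuda.is_available()}; GPUs visible: {torch.cuda.device_count()}")
```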
bash getdata.sh
cd pytorch/xl_L6_scripts && bash <script-name>.sh train --work_dir "PATH_TO_WORK_DIR"
cd XL-L6-results && tensorboard --logdir=.
- Line plots of different model settings, where the topmost line (in red) is the baseline model (i.e., original Transformer-XL).
- After adding the Self-Dependency Unit (see the bottom two curves), Highway Transformer clearly speeds up convergence during both training and evaluation.
(Plots: training bpc, training loss, eval bpc, and eval loss curves for each model setting.)
For attribution in academic contexts, please cite this work as:
@inproceedings{chai-etal-2020-highway,
title = "Highway Transformer: Self-Gating Enhanced Self-Attentive Networks",
author = "Chai, Yekun and
Jin, Shuo and
Hou, Xinwen",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.616",
pages = "6887--6900"
}