2024/Q4 Bachelor Research Project: Architectural Decisions for Language Modelling with (Small) Transformers, where we look at sample-efficient pre-training techniques at a small scale.
For context on the five transformer components studied, see the Project Description. One-sentence synopsis: we train small BERT/GPT models (~10M parameters) on a collection of short children's stories and compare which techniques yield better performance on the GLUE/BLiMP benchmarks.
- (Adaptive) Activation Functions: Filip Ignijić
- Attention Heads & Dimensions: Khalit Gulamov
- Infini-Attention at Its Limits: Lauri Kesküll
- Sample-Efficient Tokenisation: Rafael Borges
- Sparsity at a Fixed Model Size: Eugene Wu
- Slide Deck for weekly meetings.
- Zotero Group for managing our (shared) references.
- Mattermost Channel for communication.
See guides for tutorials on remote development and implementation details.
- Ronaldo Guide on setting up a remote Docker Dev container.
- DelftBlue Guide for setting up a remote Jupyter server & submitting `sbatch` scripts.
- Kaggle & Google Cloud Guide for running jobs in the cloud.
- Evaluation Guide on the BabyLM pipeline.
- Tokenizer Notes on how I created the `10k-tok` tokenizer for RoBERTa & GPT-Neo (a rough training sketch follows after this list).
- Fabio's Explainability Notes that he presented in week 3.
- LaTeX Guide on setting up a LaTeX document locally & syncing with Overleaf.
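The Tokenizer Notes above describe how the `10k-tok` tokenizer was built. Purely as a sketch of the general approach (the notes' exact corpus slice, special tokens, and settings may differ), a 10k-vocabulary byte-level BPE tokenizer can be trained with the `tokenizers` library roughly as follows:

```python
# Hedged sketch: train a 10k-vocabulary byte-level BPE tokenizer on TinyStories.
# The real 10k-tok tokenizer may use different special tokens or training settings.
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

stories = load_dataset("roneneldan/TinyStories", split="train")  # assumed HF dataset id

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    (example["text"] for example in stories),
    vocab_size=10_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],  # RoBERTa-style; an assumption
)
tokenizer.save_model("10k-tok")  # writes vocab.json and merges.txt for use with both models
```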
See `code/common` for shared resources on tooling and training.
- 10k-tokenizer to use with RoBERTa and GPT-Neo.
- grid search library for generating hyperparameter combinations (see the sketch after this list).
- BabyLM evaluation pipeline pinned to a specific commit for replicability.
- sample pre-training notebook with explanations.
- sample pre-training script used for the baselines.
- sample evaluation script used for the baselines.
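The grid search library above generates hyperparameter combinations for the baselines and student experiments. As an illustration of the idea only (the shared library's actual interface may differ), the core expansion can be written with `itertools.product`:

```python
# Minimal grid-search sketch: expand a dict of candidate values into every
# combination of hyperparameters. The interface of the shared library in
# code/common may differ; this only illustrates the idea.
from itertools import product

def grid(search_space: dict) -> list[dict]:
    """Expand {param: [values, ...]} into a list of {param: value} settings."""
    keys = list(search_space)
    return [dict(zip(keys, values)) for values in product(*search_space.values())]

# Example: 3 hidden sizes x 2 head counts = 6 training configurations.
for cfg in grid({"hidden_size": [256, 512, 768], "num_attention_heads": [4, 8]}):
    print(cfg)
```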
Description from ProjectForum
- Motivation to learn (i.e. read papers) about state-of-the-art natural-language modelling techniques.
- Strong Python programming skills.
- Comfortable with shell environments (for submitting jobs to the supercomputer).
- You will need to remember some of your Linear Algebra and Calculus courses (matrices and gradient descent).
- Knowledge of deep learning libraries like `transformers`, `pytorch`, or `tensorflow` is a big plus!
Language models (LMs) based on the transformer architecture [1] have, well, transformed the language processing domain. State-of-the-art large LMs are exposed to several orders of magnitude more data than a human, yet certainly do not leverage all of this information. Given the current trend of exponentially increasing model sizes, the high-quality textual data that LMs perform best on is predicted to run out by 2026 [2].
This project studies the effect of architectural decisions on small LMs, with the overarching goal of increasing their sample-efficiency and minimising their parameter count. Recent studies [3, 4] have shown that smaller models (≤33M parameters) can exhibit language understanding equivalent to that of their larger counterparts, and similarly display the desired emergent properties (e.g. reasoning and creativity) that drive their prevalence. This makes them ideal for exploring architectures in a compute-limited setting, and their small scale makes individual components more interpretable [3]. Additionally, small LMs allow local deployment, leading to better support for privacy, personalisation, and democratisation.
Current research lacks a precise understanding of how architectural decisions influence natural language understanding and fine-tuned task performance in small LMs. While Eldan & Li [3] show that small LMs can exhibit linguistic understanding greater than that of GPT-2 (125M parameters) by reducing the breadth of their training data, they lack quantitative evaluations in a downstream, applied setting. Warstadt et al. [4] find that architectural optimisations are the most promising research direction for small LMs, but these decisions are rarely surveyed across different models. Moreover, research that applies LMs to downstream tasks rarely considers hyperparameters beyond the defaults, let alone architectural decisions [5].
This project studies small LMs, particularly the impact of architectural decisions in the transformer blocks that perform the bulk of information integration. A block takes as input a sequence of embedded tokens, each represented by a vector whose length is the model's hidden size. These inputs are transformed by the self-attention module, followed by a feed-forward network (FFN). The attention mechanism [6] itself consists of a number of heads, each of which matches tokens' queries against other tokens' keys to integrate information from the latter into the former. The FFNs (multi-layer perceptrons) allow the model to learn more complex, non-linear patterns. While this is the motivation for their introduction, it is worth noting that the exact way these two modules combine to produce complex generated text is still not fully understood.
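To make these terms concrete, here is a minimal PyTorch sketch of a single pre-norm transformer block, written only to show where the hidden size, head count, and FFN width appear; it is not the exact GPT-Neo/BERT implementation used in the project:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative pre-norm block: self-attention followed by a feed-forward network."""

    def __init__(self, hidden_size: int = 512, num_heads: int = 8, ffn_size: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)
        # Each head works on a query/key/value slice of width hidden_size // num_heads.
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size)
        # FFN: expand to the intermediate width, apply a non-linearity, project back.
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_size),
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, hidden_size)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # queries, keys, and values all come from x
        x = x + attn_out                  # residual connection around attention
        x = x + self.ffn(self.norm2(x))   # residual connection around the FFN
        return x

# A 4-block model simply stacks these:
blocks = nn.Sequential(*[TransformerBlock() for _ in range(4)])
```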
We explore small transformer architectures to provide context on design decisions and their impact on language understanding in downstream tasks.
Following are some preliminary research questions based on the small LM exploration by Eldan & Li [3], though students are encouraged to construct/extend their own within the same domain. The models considered for each RQ are GPT-Neo and BERT.
- How is model performance affected by the width of hidden layers? [3]
- How is model performance affected by the depth of its FFNs? [3]
- How is model performance affected by the number of attention heads? [3, 7]
- How is model performance affected by the number of transformer blocks, and the ordering of modules within them? [3, 8]
- How is model performance affected by the width of Query-Key vectors? [6]
Additionally, each student is expected to perform an interpretability study akin to [3] on the specific component of their RQ.
- Base Model: GPT-Neo & BERT at 9M parameters (4 blocks, 512 hidden size, 1024 intermediate FFN size, and a 10k-token vocabulary). Students augment this base model along the dimension of their RQ (see the configuration sketch after this list).
- Dataset: TinyStories, consisting of ~2M short stories written by GPT-3.5 and GPT-4.
- Evaluation: BabyLM evaluation pipeline, consisting of the BLiMP grammar test suite and fine-tuning on SuperGLUE tasks to evaluate downstream performance. We further track total training time and the number of training samples each model sees, to draw conclusions about sample-efficiency. Lastly, we study the resulting models' size and inference speed to evaluate them in an edge-device context.
- Hardware: Due to their small size, the models can be pre-trained locally on students' laptops. Students are also able to train on the DelftBlue cluster.
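As a hedged sketch of how base models at roughly these dimensions can be instantiated with the `transformers` library (the actual baseline configs in `code/common` may differ in head count, context length, and other fields not listed above):

```python
# Sketch of small base configs (4 blocks, 512 hidden, 1024 FFN, 10k vocabulary).
# Head count and max_position_embeddings are assumptions; the shared baseline
# configs may use other values.
from transformers import BertConfig, BertForMaskedLM, GPTNeoConfig, GPTNeoForCausalLM

bert_config = BertConfig(
    vocab_size=10_000,
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,        # assumed
    intermediate_size=1024,
    max_position_embeddings=512,  # assumed context length
)

gpt_neo_config = GPTNeoConfig(
    vocab_size=10_000,
    hidden_size=512,
    num_layers=4,
    num_heads=8,                                 # assumed
    intermediate_size=1024,
    attention_types=[[["global", "local"], 2]],  # must expand to exactly num_layers layers
    max_position_embeddings=512,                 # assumed context length
)

bert = BertForMaskedLM(bert_config)          # randomly initialised, ready for pre-training
gpt_neo = GPTNeoForCausalLM(gpt_neo_config)
# Embedding tables dominate at this scale, so compare sizes with and without them.
for name, model in [("BERT", bert), ("GPT-Neo", gpt_neo)]:
    print(name, model.num_parameters(), model.num_parameters(exclude_embeddings=True))
```

Students varying a single dimension (e.g. `num_heads` or `intermediate_size`) can keep the remaining fields fixed so that results stay comparable to the baseline.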
- Wednesday 13 March, 16:00 (Teams link)
- Friday 15 March, 15:00 (Teams link)
The increasingly large scale of LLMs stimulates ever more research into optimisation. One strand takes the models as they are and optimises them in a top-down fashion. Another strand, perhaps the most interesting, takes a bottom-up approach: experimenting with design decisions before and during pre-training. We further motivate the need for architectural optimisation by noting that such bottom-up advances are relatively unexplored in applied settings, compared to top-down approaches like prompt engineering. The studies by Eldan & Li [3] and Warstadt et al. [4] stimulate research in this area by highlighting the competitive performance of small LMs in resource-constrained settings.
TinyStories [3] is a synthetic dataset of ~2M short children's stories generated by GPT-3.5 and GPT-4. By limiting the breadth of the dataset in this manner, it can be used to train and evaluate LMs with ≤33M parameters; these small LMs generate more coherent and grammatically correct text than a 125M-parameter GPT-2 model. Additionally, the smaller models are more interpretable, and reveal individual functions for specific neurons (e.g. attending to the protagonist in a story).
BabyLM [4] is a communal challenge in which participants competed to optimise training on a fixed data budget. The authors provided two budgets: datasets of 10M and 100M words, similar in both content and amount to the language a 12-year-old is exposed to. This has led to fruitful developments in curriculum learning (gradually increasing the complexity of training tasks), knowledge distillation (training a small student on the outputs of a larger teacher model), and architecture optimisation.
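Of those directions, knowledge distillation is the most compact to state. Below is a minimal sketch of the usual soft-target loss (KL divergence between temperature-softened teacher and student predictions), not tied to any particular BabyLM submission; the temperature and the way the loss is mixed with the ordinary language-modelling objective are illustrative choices:

```python
# Sketch of a standard knowledge-distillation loss: the small student is trained to
# match the larger teacher's softened output distribution (cf. DistilBERT).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions, then measure how far the student is from the teacher.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Usage: typically combined with the ordinary language-modelling loss, e.g.
# loss = lm_loss + distillation_loss(student_logits, teacher_logits.detach())
```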
- Modeling GPT from scratch, a video series by Andrej Karpathy (OpenAI co-founder).
- The Illustrated Transformer by Jay Alammar.
- A. Vaswani et al., ‘Attention Is All You Need’. arXiv, Dec. 06, 2017. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/1706.03762v5
- P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, M. Hobbhahn, and A. Ho, ‘Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning’. arXiv, Oct. 25, 2022. Accessed: Jan. 23, 2024. [Online]. Available: http://arxiv.org/abs/2211.04325
- R. Eldan and Y. Li, ‘TinyStories: How Small Can Language Models Be and Still Speak Coherent English?’ arXiv, May 24, 2023. Accessed: Nov. 01, 2023. [Online]. Available: http://arxiv.org/abs/2305.07759
- A. Warstadt et al., ‘Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora’, in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 1–6. doi: 10.18653/v1/2023.conll-babylm.1.
- A. Wettig, T. Gao, Z. Zhong, and D. Chen, ‘Should You Mask 15% in Masked Language Modeling?’ arXiv, Feb. 10, 2023. Accessed: Jan. 24, 2024. [Online]. Available: http://arxiv.org/abs/2202.08005
- D. Bahdanau, K. Cho, and Y. Bengio, ‘Neural Machine Translation by Jointly Learning to Align and Translate’. arXiv, May 19, 2016. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/1409.0473
- P. Michel, O. Levy, and G. Neubig, ‘Are Sixteen Heads Really Better than One?’ arXiv, Nov. 04, 2019. Accessed: Jan. 21, 2024. [Online]. Available: http://arxiv.org/abs/1905.10650
- S. Shleifer, J. Weston, and M. Ott, ‘NormFormer: Improved Transformer Pretraining with Extra Normalization’. arXiv, Nov. 01, 2021. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/2110.09456
- J. Hoffmann et al., ‘Training Compute-Optimal Large Language Models’. arXiv, Mar. 29, 2022. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/2203.15556
- S. Gunasekar et al., ‘Textbooks Are All You Need’. arXiv, Oct. 02, 2023. Accessed: Jan. 23, 2024. [Online]. Available: http://arxiv.org/abs/2306.11644
- L. G. G. Charpentier and D. Samuel, ‘Not all layers are equally as important: Every Layer Counts BERT’. arXiv, Nov. 07, 2023. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/2311.02265
- R. D. Martinez et al., ‘CLIMB – Curriculum Learning for Infant-inspired Model Building’, in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 84–99. doi: 10.18653/v1/2023.conll-babylm.10.
- V. Sanh, L. Debut, J. Chaumond, and T. Wolf, ‘DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter’, Oct. 2019. doi: 10.48550/arXiv.1910.01108.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ‘BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding’. arXiv, May 24, 2019. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/1810.04805
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, ‘GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding’. arXiv, Feb. 22, 2019. Accessed: Jan. 25, 2024. [Online]. Available: http://arxiv.org/abs/1804.07461
- P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, ‘SQuAD: 100,000+ Questions for Machine Comprehension of Text’, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas: Association for Computational Linguistics, 2016, pp. 2383–2392. doi: 10.18653/v1/D16-1264.