This is the official implementation of *Rethinking Optimization and Architecture for Tiny Language Models*, an empirical investigation into how to construct powerful tiny language models.
Four strategies are proposed to improve performance:
- 🎯 Compact Tokenizer: efficient coverage of corpus;
- 🔍 Architecture Tweak: better depth and width tradeoffs;
- 🎁 Parameter Inheritance: powerful knowledge from larger LLMs;
- 🔥 Multiple-Round Training: memory reinforcement of tiny models.
This repository is modified from the InternEvo training framework.
Here are the steps to set up the code:
- Clone the InternEvo repository and configure the runtime environment.
- Copy the configuration file `configs/LLM1B.py` to the `InternEvo/configs/` directory.
- Copy the start script `src/start_finetune.py` to the `InternEvo` root directory.
You can follow the InternEvo usage guide (https://github.com/InternLM/InternEvo/blob/develop/doc/en/usage.md) to prepare the pretraining data and train models.
The model's depth, width, and expansion rate can be easily adjusted in the config.
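For example, the relevant fields in `configs/LLM1B.py` might look like the following hypothetical excerpt (the field names follow common InternEvo conventions and may differ in the actual file):

```python
# Hypothetical excerpt from configs/LLM1B.py; field names and values are illustrative.
model = dict(
    num_layers=20,           # depth: number of transformer blocks
    hidden_size=2048,        # width: hidden dimension
    num_attention_heads=16,  # attention heads per block
    mlp_ratio=8 / 3,         # expansion rate of the feed-forward layer
    vocab_size=48000,        # should match the compact tokenizer size
)
```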
The compact tokenizer is constructed by removing low-frequency tokens from the original vocabulary. To prune the tokenizer, follow these steps (a sketch of the selection logic is given after the list):
- Count the frequency of the tokens cached by the original large tokenizer:

  ```bash
  python src/step1_token_frequency_stat.py --src cached_data_dir --dst tmp_stat_files_dir
  ```

  The script counts the frequency of all tokens in the `cached_data_dir` folder and generates a corresponding JSON file in the `tmp_stat_files_dir` folder.
- Combine all JSON files in the `tmp_stat_files_dir` folder and write the total token frequencies to `total_token_freq.json`:

  ```bash
  python src/step2_token_frequency_stat_combie.py --src tmp_stat_files_dir --dst total_token_freq.json
  ```

- Build the new tokenizer by first adding the special tokens and then adding the tokens with the highest frequency:

  ```bash
  python src/step3_generate_new_tokenizer.py --origin_tokenizer_dir origin_tokenizer --vocab_num compact_tokenizer_size --output new_tokenizer_dir --token_freq_file total_token_freq.json
  ```

  This script generates a new tokenizer with `compact_tokenizer_size` tokens in the `new_tokenizer_dir` folder.
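As a rough illustration, the selection logic of the last step can be sketched as follows (the real implementation is `src/step3_generate_new_tokenizer.py`; the JSON layout and helper names here are assumptions):

```python
# Illustrative sketch of the vocabulary-selection step; not the actual script.
import json

def select_vocab(token_freq_file, special_tokens, vocab_num):
    """Return the tokens kept in the compact tokenizer."""
    with open(token_freq_file) as f:
        token_freq = json.load(f)  # assumed layout: {token: count, ...}

    # Special tokens are always kept, regardless of frequency.
    kept = list(special_tokens)

    # The remaining slots are filled with the most frequent ordinary tokens.
    ordinary = sorted((t for t in token_freq if t not in special_tokens),
                      key=token_freq.get, reverse=True)
    kept.extend(ordinary[: vocab_num - len(kept)])
    return kept
```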
To pretrain by inheriting parameters from a large model, you can use the following command:

```bash
python start_finetune.py --config ./configs/LLM1B.py
```

Note that `MODEL_ONLY_FOLDER` is the checkpoint pruned from a large model. If you want to train from scratch instead, set `load_given_ckpt=False` in the config.
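As a rough illustration only, one simple way to construct such a pruned checkpoint is to keep a subset of the large model's transformer layers; the layer-selection rule and the state-dict key pattern below are assumptions, not the paper's actual method:

```python
# Illustration only: build a smaller checkpoint by keeping a subset of transformer
# layers from a larger model's state_dict. The ".layers.<i>." key pattern and the
# selection rule are assumptions.
import torch

def inherit_layers(large_ckpt_path, keep_layers, out_path):
    large_sd = torch.load(large_ckpt_path, map_location="cpu")
    small_sd = {}
    for name, tensor in large_sd.items():
        if ".layers." in name:
            layer_id = int(name.split(".layers.")[1].split(".")[0])
            if layer_id not in keep_layers:
                continue  # drop parameters of layers that are not inherited
            # Re-index kept layers so they are consecutive in the tiny model.
            name = name.replace(f".layers.{layer_id}.",
                                f".layers.{keep_layers.index(layer_id)}.")
        small_sd[name] = tensor
    torch.save(small_sd, out_path)

# Example: inherit every other layer of a 40-layer model.
# inherit_layers("large_model.pt", keep_layers=list(range(0, 40, 2)), out_path="pruned_model.pt")
```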
To extract a certain proportion of challenging examples from the last epoch, you can use the following steps (a PyTorch sketch is given after the list):
- Compute the batch-wise losses $L=\{l_1, l_2, \cdots, l_N\}$ using the pre-trained frozen model from the previous epoch, where $N$ represents the total number of batches. For instance, a dataset containing 150B tokens yields approximately 75,000 batches with a batch size of 2M tokens.
- Calculate the sampling probability $p_i = \exp(l_i) \big/ \sum_{j=1}^{N} \exp(l_j)$.
- Sample $N_0$ batches out of $N$ according to the sampling probability $\boldsymbol{p}$, i.e., `filtered = torch.multinomial(p, N_0, replacement=False)`.
- Concatenate all the sampled batches to create the training dataset for the subsequent epoch.
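Putting the steps together, a minimal sketch might look like the following (`batches` and `compute_batch_loss` are placeholders for your own data pipeline and a forward pass through the frozen model):

```python
# Illustrative sketch of hard-example sampling between training rounds.
import torch

@torch.no_grad()
def sample_hard_batches(batches, compute_batch_loss, N_0):
    # Batch-wise losses l_1, ..., l_N from the frozen model of the previous epoch.
    losses = torch.tensor([compute_batch_loss(b) for b in batches])

    # Sampling probabilities p_i = exp(l_i) / sum_j exp(l_j).
    p = torch.softmax(losses, dim=0)

    # Draw N_0 batch indices without replacement according to p.
    filtered = torch.multinomial(p, N_0, replacement=False)

    # The selected batches form the training set for the next epoch.
    return [batches[i] for i in filtered.tolist()]
```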
Convert the model weights to the Hugging Face format using the script `tools/transformers/convert2hf.py`:

```bash
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizer_path/
```

Then the model can be run with Hugging Face Transformers.
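A minimal inference sketch (the converted checkpoint directory `hf_ckpt/` is assumed to ship its own modeling code, hence `trust_remote_code=True`; adjust paths and generation settings as needed):

```python
# Minimal inference sketch; paths and generation settings are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("hf_ckpt/", trust_remote_code=True)

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```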
- InternLM/InternEvo
- huggingface/transformers
- google/sentencepiece
- open-compass/opencompass
- EleutherAI/lm-evaluation-harness
@article{tang2024rethinking,
title={Rethinking Optimization and Architecture for Tiny Language Models},
author={Tang, Yehui and Liu, Fangcheng and Ni, Yunsheng and Tian, Yuchuan and Bai, Zheyuan and Hu, Yi-Qi and Liu, Sichao and Jui, Shangling and Han, Kai and Wang, Yunhe},
journal={arXiv preprint arXiv:2402.02791},
year={2024}
}
@article{wang2023pangu,
title={PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation},
author={Wang, Yunhe and Chen, Hanting and Tang, Yehui and Guo, Tianyu and Han, Kai and Nie, Ying and Wang, Xutao and Hu, Hailin and Bai, Zheyuan and Wang, Yun and others},
journal={arXiv preprint arXiv:2312.17276},
year={2023}
}