BabyLlama

Very basic training code for Baby-Llama, our submission to the strict-small track of the BabyLM challenge. See our paper for more details.

We perform some basic regex-based cleaning of the dataset and then train a tokenizer on the cleaned data; both steps are carried out in cleaning_and_tokenization.ipynb. The notebook assumes that the BabyLM dataset (/babylm_10M and /babylm_dev) is placed or symlinked in the /data folder. The resulting tokenizer is saved in the /models folder, and the same tokenizer is used for both the teacher and student models.
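
For illustration, the tokenizer-training step can look roughly like the minimal sketch below, assuming a byte-level BPE tokenizer built with the Hugging Face tokenizers library; the vocabulary size, special tokens, and file paths are placeholders rather than the exact values used in cleaning_and_tokenization.ipynb.

# Minimal sketch of the tokenizer-training step (assumed: BPE via the
# Hugging Face `tokenizers` library; paths and vocab size are placeholders).
from pathlib import Path
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Cleaned training files produced by the regex-based cleaning step.
paths = [str(p) for p in Path("./data/babylm_10M_clean").glob("*.train")]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=16000,  # assumed value, not specified in this README
    special_tokens=["<pad>", "<s>", "</s>", "<unk>"],
)
tokenizer.train(paths, trainer)

# The same tokenizer file is later loaded by both teacher and student models.
tokenizer.save("./models/tokenizer.json")  # placeholder file name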

To train the teacher models:

python train.py --config ./config/gpt-705M.yaml

And analogously for llama-360M.yaml. One can also override the learning rate and the model name defined in the config by passing the --lr and --model_name arguments, respectively. The trained model is saved in the /models folder.

Once the two teacher models are trained, run distill-ensemble-pretraining-baby-llama.py to train the student model using the distillation loss. We modified the Trainer from this repository; note that it is not optimized to run on multiple GPUs (the teachers are placed on a single GPU). With the current settings (model sizes and batch sizes), everything fits on a single 20GB GPU.
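
As a rough illustration of ensemble distillation, the sketch below mixes the hard-label cross-entropy with a KL-divergence term against the averaged, temperature-scaled teacher distribution; the averaging scheme, temperature, mixing weight, and function name are assumptions for illustration, not the exact implementation in distill-ensemble-pretraining-baby-llama.py.

# Illustrative sketch of an ensemble distillation loss (assumptions:
# averaged teacher logits, standard soft-target KL term scaled by T^2).
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, labels,
                               temperature=2.0, alpha=0.5):
    # Hard-label cross-entropy on the student's own predictions.
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )

    # Average the teachers' logits to form the ensemble target.
    avg_teacher_logits = torch.stack(teacher_logits_list).mean(dim=0)

    # Soft-target KL term between temperature-scaled distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(avg_teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Mix the two terms; alpha is an illustrative weighting factor.
    return alpha * ce_loss + (1.0 - alpha) * kd_loss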
