Cobweb in Language Modeling

This repo includes:

  • How to train a Cobweb tree for language modeling in general
  • How a trained Cobweb model can be evaluated with the LM evaluation pipeline and its bundled benchmarks (also employed by the BabyLM Challenge). Related link: BabyLM Challenge

To Train a Cobweb Tree

Download Training Data

You first need to download the datasets and place them in their corresponding locations under ./data-tr (the folders under ./data-tr are just placeholders; see the example layout after the list below). The following datasets should be included in the repo:

  • BabyLM Challenge data (https://osf.io/ad7qg/). As of 2024, the BabyLM Challenge provides text training data, multimodal training data, and partial evaluation data. Here we include the text training data only; please download the text_data folder with the sharing link provided. The text data includes the following:
    • 10M training datasets (train_10M). Six corpora (bnc_spoken, childes, gutenberg, open_subtitles, simple_wiki, and switchboard) are provided, comprising about 10 million words in total.
    • 100M training datasets (train_100M). The same six corpora as above but with more content, comprising about 100 million words in total.
    • Validation dataset (dev). The same six corpora, but for validation.
    • Test dataset (test). The same six corpora, but for test.
  • Sherlock Holmes stories text data, used in the MSR Sentence Completion Challenge. Can be downloaded here.
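
For orientation, the layout under ./data-tr might look roughly like the following once everything is downloaded (the exact folder names here are assumptions; match them to the placeholder folders already in the repo):

./data-tr/
    train_10M/      # bnc_spoken, childes, gutenberg, open_subtitles, simple_wiki, switchboard
    train_100M/     # the same six corpora, ~100M words in total
    dev/            # validation split
    test/           # test split
    holmes/         # Sherlock Holmes stories text data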

Train the Tree

Train the Cobweb tree with

python3 cobweb-tr.py [--arguments]

with a couple of argument options. Here I list some of the most crucial arguments (an example invocation follows the list):

  • --type: The type of training data used. Available options: 10M, 100M, dev, test, holmes.
  • --corpus: The corpus used in the training data (ignore this argument if --type=holmes). Available options: all (all six corpora under the training data type), bnc_spoken, childes, gutenberg, open_subtitles, simple_wiki, switchboard.
  • --tokenizer: The tokenizer used. Can be either gpt2 or spacy.
  • --scheme: The distance/dissimilarity function used in calculation. Available options: inverse, linear, exp, gaussian.
  • --window: The context window size (int) of every transformed training instance.
  • --n-tr-splits: The number of training splits/checkpoints in the training process (int).
  • --load-model: If included, the model will first load the model file specified by --load-model-file before training.
  • --token-form: The data type of the attributes stored in each transformed training instance. Available options: str and encoding. If --tokenizer=gpt2, I STRONGLY recommend using encoding.
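
For example, a training run on the 10M data with the GPT-2 tokenizer might look like the following (the flag values here are illustrative, not prescribed defaults):

python3 cobweb-tr.py --type=10M --corpus=all --tokenizer=gpt2 --token-form=encoding --scheme=inverse --window=5 --n-tr-splits=10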

For more details, see the original Python script cobweb-tr.py or run python3 cobweb-tr.py --h.
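
To make the --window and --scheme options concrete, here is a rough conceptual sketch, not the code in cobweb-tr.py: each token becomes an instance whose context attributes are the surrounding tokens within the window, weighted by a function of their distance from the anchor. The function name, instance format, and exact weighting below are assumptions for illustration only.

# Conceptual sketch only -- NOT the implementation in cobweb-tr.py.
# The names ("make_instances", the instance dict format) and weights are assumptions.

def make_instances(tokens, window=3, scheme="inverse"):
    def weight(dist):
        # Mirrors the spirit of the --scheme options (distance -> context weight).
        if scheme == "inverse":
            return 1.0 / dist
        if scheme == "linear":
            return max(0.0, 1.0 - dist / (window + 1))
        raise ValueError(f"unsupported scheme: {scheme}")

    instances = []
    for i, anchor in enumerate(tokens):
        context = {}
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                context[tokens[j]] = context.get(tokens[j], 0.0) + weight(abs(i - j))
        instances.append({"anchor": anchor, "context": context})
    return instances

print(make_instances("the cat sat on the mat".split(), window=2))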

Evaluation

To evaluate the model, we use the evaluation pipeline lm-eval used by the BabyLM Challenge. Details of the pipeline are available here. The BabyLM evaluation repo (forked from the original lm-eval) is available here. The evaluation pipeline used here is at ./lm-evaluation-harness; please check ./lm-evaluation-harness/README.md for more details.
