minbert-assignment

This is an exercise in developing a minimalist version of BERT. It has been forked from Carnegie Mellon University’s CS11-711 Advanced NLP.

In this assignment, you will implement some important components of the BERT model to better understand its architecture. You will then perform sentence classification on the SST and CFIMDB datasets with the BERT model.
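To give a flavor of the kind of component involved, here is a minimal, purely illustrative sketch of single-head scaled dot-product self-attention in PyTorch. It is not the assignment's implementation (which, among other things, is multi-head and handles masking), and the class name is hypothetical:

import math
import torch
import torch.nn as nn

class TinySelfAttention(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, hidden_size: int):
        super().__init__()
        # learned projections for queries, keys, and values
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, hidden_size]
        q, k, v = self.query(x), self.key(x), self.value(x)
        # scale by sqrt(hidden_size) to keep the softmax well-behaved
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        return torch.softmax(scores, dim=-1) @ v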

Assignment Details

Important Notes

  • Follow setup.sh to properly set up the environment and install dependencies. You need a Unix system with conda installed. Windows might also work if you adapt the commands yourself, but this is not recommended.
  • There is a detailed description of the code structure in structure.md, including a description of which parts you will need to implement.
  • You are only allowed to use libraries installed by setup.sh; no other external libraries (e.g., transformers) are allowed.
  • We will run your code with the following commands, so make sure your best results are reproducible with them:
    • Do not change any of the existing command options (including their defaults) or add any new required parameters.
    • You can add --use_gpu if you have access to a CUDA GPU.
    • You should choose NUM_EPOCHS (and LR, where it appears as a placeholder) yourself so that your performance is maximized.
mkdir -p my_output

python3 classifier.py --option pretrain --seed 43 --epochs 4 --lr 0.00000000001 --train data/sst-train.txt --dev data/sst-dev.txt --test data/sst-test.txt --dev_out my_output/pretrain-sst-dev-out.txt --test_out my_output/pretrain-sst-test-out.txt

python3 classifier.py --option finetune --seed 43 --epochs NUM_EPOCHS --lr LR --train data/sst-train.txt --dev data/sst-dev.txt --test data/sst-test.txt --dev_out my_output/finetune-sst-dev-out.txt --test_out my_output/finetune-sst-test-out.txt

python3 classifier.py --option finetune --seed 43 --epochs NUM_EPOCHS --lr LR --train data/cfimdb-train.txt --dev data/cfimdb-dev.txt --test data/cfimdb-dev.txt --dev_out my_output/finetune-cfimdb-dev-out.txt --test_out my_output/finetune-cfimdb-test-out.txt

Note: data/cfimdb-test.txt appears to be corrupt, with all labels being 0. As such, we reuse the dev split for testing as well.
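You can verify this with a quick check along the following lines. This is a sketch that assumes the label is the first whitespace-separated token on each line; adjust it to the actual file format:

from collections import Counter

counts = Counter()
with open("data/cfimdb-test.txt") as f:
    for line in f:
        if line.strip():
            counts[line.split()[0]] += 1  # assumed label position
print(counts)  # a corrupt split shows a single label value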

Reference accuracies (means over 10 random seeds, with standard deviations in parentheses):

Pretraining for SST:

Dev Accuracy: 0.391 (0.007)

Test Accuracy: 0.403 (0.008)

Finetuning for SST:

Dev Accuracy: 0.515 (0.004)

Test Accuracy: 0.526 (0.008)

Finetuning for CFIMDB:

Dev Accuracy: 0.966 (0.007)
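For clarity, the mean (std) pairs above are plain statistics over the per-seed accuracies, computed along these lines (the accuracy values below are made up for illustration):

import statistics

accs = [0.511, 0.518, 0.513, 0.517, 0.516]  # hypothetical per-seed dev accuracies
print(f"{statistics.mean(accs):.3f} ({statistics.stdev(accs):.3f})")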

Submission

All your changes should be in legible git commits on top of this repo. You should upload the repo to GitHub as a private repo, and add the GitHub user https://github.com/NightMachinery as a collaborator.

You should include the my_output directory, which the commands above populate. You should git-ignore any unnecessary files, such as those generated by your IDE.

You should also upload a zipped version of the repo. This should include the .git directory.
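One way to produce such an archive, run from the repo's parent directory (the archive name is just an example; zip -r includes hidden directories such as .git):

zip -r minbert-assignment.zip minbert-assignment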

Grading

To get credit for BERT and the optimizer, you should implement their missing pieces and pass the tests in sanity_check.py (BERT implementation) and optimizer_test.py (optimizer implementation).
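Presumably the test scripts are run directly, along the following lines; check the scripts themselves for any required arguments:

python3 sanity_check.py

python3 optimizer_test.py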

To get credit for the classifier, you should have done the above and also implemented the missing pieces in classifier.py. You should have run the commands above and achieved accuracy comparable to the reference numbers cited above.

Cheating

If we suspect that you have cheated, you might fail the course. The burden of evidence is on you. Do maintain a legible git history.

You may NOT use AI tools like Copilot or ChatGPT. You may NOT read code snippets from your classmates. You may NOT copy code from any source, except Q&A sites such as StackOverflow. All such copied code must be clearly delineated with comments and include a link back to its source.

Please note that obfuscation of any form will count against you. Your code must use best practices, and be concise, legible, and easily understandable. You should delete unused code from the final version.

Acknowledgement

Parts of the code are from the transformers library (Apache License 2.0).

This exercise has been forked from Carnegie Mellon University’s CS11-711 Advanced NLP (neubig/minbert-assignment: Minimalist BERT implementation assignment for CS11-711).