The JSON data in this repository was generated by running inference 500 times on each sentence of the CoNLL2003 train, validation, and test data, the CoNLL++ (2023) test data, and the CoNLL++ (CrossWeigh) test data, using subword regularization (BPE-Dropout) with hyperparameter p = 0.1.
- For the training data, predictions were generated in five rounds: in each round, a model trained on 80% of the train data ran inference on the remaining 20%. (train_cross_var.py)
- For the other data, predictions were generated by a model trained on all of the train data. (train_all_train_data.py)
- We used a RoBERTa-base model fine-tuned on the CoNLL2003 train dataset. Our model and the detailed training config are available on Hugging Face.
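The five-round scheme for the training data can be sketched as follows. This is a minimal illustration, not the code in train_cross_var.py: the function name `cross_inference_splits` is hypothetical, and the interleaved assignment of sentences to folds is an assumption, since the actual split used by the repository may differ.

```python
def cross_inference_splits(n_sentences, n_folds=5):
    """Yield (train_idx, heldout_idx) pairs, one per round.

    Each round holds out roughly 1/n_folds of the sentences for inference
    and leaves the rest for training. Interleaved folds are an assumption.
    """
    indices = list(range(n_sentences))
    for k in range(n_folds):
        heldout = indices[k::n_folds]
        heldout_set = set(heldout)
        train = [i for i in indices if i not in heldout_set]
        yield train, heldout


# Every sentence is predicted exactly once, by a model that never saw it.
covered = []
for train_idx, heldout_idx in cross_inference_splits(10):
    # A model would be fine-tuned on train_idx and run on heldout_idx here.
    covered.extend(heldout_idx)
print(sorted(covered) == list(range(10)))
```

The point of the scheme is that no training sentence is labeled by a model that was trained on that same sentence.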
All datasets can be found in choice_data/.
We removed the original tokens and labels. Therefore, this repository contains only the predicted labels, the number of times each prediction occurred in the 500 inferences, and its F1 score against the gold labels. If you have the original CoNLL2003 data, you can add the original tokens and labels to our datasets.
- The F1 score is calculated with seqeval, except that when both the predicted labels and the gold labels consist only of O labels, we set the score to 1.0.
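A rough sketch of how repeated predictions could be collapsed into counts and scored with the all-O rule is shown below. The helper names `count_predictions` and `f1_with_all_o_rule` are hypothetical and are not taken from this repository. The default scorer is `seqeval.metrics.f1_score`, imported only when needed; passing your own `scorer` avoids that dependency.

```python
from collections import Counter


def count_predictions(predictions):
    """Collapse repeated label sequences into (labels, count) pairs,
    most frequent first."""
    counts = Counter(tuple(labels) for labels in predictions)
    return counts.most_common()


def f1_with_all_o_rule(gold, pred, scorer=None):
    """Return 1.0 when gold and pred both contain only O labels,
    otherwise defer to the scorer (seqeval's f1_score by default)."""
    if all(label == "O" for label in gold) and all(label == "O" for label in pred):
        return 1.0
    if scorer is None:
        # Third-party dependency, imported lazily.
        from seqeval.metrics import f1_score
        scorer = f1_score
    # seqeval expects a list of sentences, each a list of labels.
    return scorer([list(gold)], [list(pred)])


# Three sampled predictions for one two-token sentence.
sampled = [["B-PER", "O"], ["B-PER", "O"], ["O", "O"]]
for labels, times in count_predictions(sampled):
    print(labels, times)
```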
git clone git@github.com:4ldk/CoNLL2003_Choices.git
cd CoNLL2003_Choices
mkdir row_data
- Copy each dataset to `row_data/`:
  - Copy the original data (`eng.train`, `eng.testa`, `eng.testb`) directly.
    - The following two repositories also contain the original data, but some data unsuitable for training has been removed from them. The data and models published in this repository are based on the data without those removals.
  - Copy `conllpp.txt` of CoNLL++ (2023) directly.
  - Copy `conllpp_test.txt` of CoNLL++ (CrossWeigh) as `conllcw.txt`.
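Before running the next step, it may help to confirm that every expected file is in place. The check below is a small sketch and is not part of this repository: `missing_row_data` is a hypothetical helper, and the file list simply restates the names given in the copy instructions above.

```python
from pathlib import Path

# File names as listed in the copy instructions for row_data/.
EXPECTED_FILES = [
    "eng.train",
    "eng.testa",
    "eng.testb",
    "conllpp.txt",
    "conllcw.txt",
]


def missing_row_data(directory="row_data"):
    """Return the expected file names that are not present in the directory."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means all five files were found.
print(missing_row_data())
```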
pip install -r requirement.txt
python3 ./src/add_row_data.py
mkdir model
- Edit `config/config2003.yaml` to set `test` (the test data), `loop` (the number of test loops), `pred_p` (the subword regularization hyperparameter p), `load_local_model: False`, and `test_model_name: "4ldk/Roberta-Base-CoNLL2003"`.
python3 ./src/predictor.py
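To give an intuition for what the `pred_p` setting controls, here is a toy character-level BPE-Dropout sketch. It is not the tokenizer used by `predictor.py`: the merge table is made up for illustration, `bpe_dropout` is a hypothetical function, and RoBERTa's real tokenizer works on bytes with a learned merge table. At each step, every applicable merge is dropped with probability `p`, and the highest-priority surviving merge is applied, so the same word can be segmented differently on each run.

```python
import random


def bpe_dropout(word, merges, p, rng=None):
    """Segment `word` with BPE, dropping each candidate merge with probability p.

    `merges` is an ordered list of (left, right) pairs; earlier pairs have
    higher priority. p = 0 gives ordinary BPE, p = 1 gives single characters.
    """
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while True:
        # Applicable merges that survive dropout, as (priority, position).
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# Toy merge table (an assumption for illustration only).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_dropout("lower", MERGES, p=0.0))
# With p = 0.1, repeated calls occasionally yield a finer segmentation.
print(bpe_dropout("lower", MERGES, p=0.1, rng=random.Random(0)))
```

Running inference many times under such sampled segmentations is what produces the varying label predictions that this repository counts.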
- If you want to change the number of divisions of the train data, change line 12 of `make_cross_data.py` and lines 106 and 108 of `train_cross_var.py`.
python3 ./src/make_cross_data.py
python3 ./src/train_cross_var.py
- Edit `config/config2003.yaml`
  - Set `load_local_model: True` and any other values you want to change.
python3 ./src/train_all_train_data.py
python3 ./src/predictor.py