A baseline for CCKS-2021 Task 2 (address parsing).
Prepare the pretrained BERT model and set '--model_name_or_path' to its location.
Prepare the bigram and char embeddings and set pretrain_unigram_path and pretrain_bigram_path to their locations.
Prepare the dataset and set '--data_dir' to its location.
Run on Colab (main.ipynb). Set debug to False before running main.
Run on a local device. Set debug to False before running main.
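The preparation steps above amount to pointing a handful of path options at local files. A minimal sketch of how such options might be declared with argparse follows. Only '--model_name_or_path' and '--data_dir' appear as flags in this README; the flag spellings for the two embedding paths, the `--debug` switch, and all default values are assumptions for illustration, not the repository's actual interface.

```python
import argparse


def build_parser():
    """Declare the path options mentioned in the setup steps (sketch only)."""
    parser = argparse.ArgumentParser(description="Address-parsing baseline (sketch)")
    # Flags named in the README; the defaults are placeholders.
    parser.add_argument("--model_name_or_path", default="path/to/bert")
    parser.add_argument("--data_dir", default="data")
    # Hypothetical flag spellings for the two embedding paths.
    parser.add_argument("--pretrain_unigram_path", default="path/to/unigram.vec")
    parser.add_argument("--pretrain_bigram_path", default="path/to/bigram.vec")
    # Hypothetical switch standing in for the debug setting; off unless passed.
    parser.add_argument("--debug", action="store_true")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(
    ["--data_dir", "data", "--model_name_or_path", "chinese-bert-wwm"]
)
print(args.model_name_or_path, args.data_dir, args.debug)
```

Leaving the debug switch off by default matches the instruction to have debug set to False for a full run.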
The main module contains the following files:
- Text process -> reads a file and converts it to the format the model expects (fastNLP package).
- Build Model -> char, bigram, and BERT embeddings + Bi-LSTM + CRF; other models can be added.
- Contains two classes. Trainer handles the training process. Tester handles the testing process, which covers model prediction and evaluation.
- Main file for running on a local device.
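The model consumes a bigram feature alongside each character. A minimal sketch of one common way to derive those bigrams is shown below: each character is paired with its right neighbour, and the last character is padded with an end marker. The helper name `char_bigrams` and the `<eos>` marker are assumptions for illustration; the repository's text-processing code may build bigrams differently.

```python
def char_bigrams(chars, eos="<eos>"):
    """Return one bigram per character: the character plus its right neighbour."""
    # Shift the sequence left by one and pad the end so lengths stay aligned.
    shifted = chars[1:] + [eos]
    return [left + right for left, right in zip(chars, shifted)]


chars = list("浙江省杭州市")
bigrams = char_bigrams(chars)
for char, bigram in zip(chars, bigrams):
    print(char, bigram)
```

Keeping the bigram sequence the same length as the character sequence lets both embeddings be concatenated position by position before the Bi-LSTM.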
The data folder contains the files train.conll, dev.conll, and test.conll.
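For orientation, here is a minimal sketch of reading a CoNLL-style file. It assumes one character per line followed by a whitespace-separated tag, with a blank line between sentences; the tag names in the sample are illustrative, and the actual column layout of the competition files may differ. The helper `read_conll` is hypothetical and is not the loader used by this repository.

```python
def read_conll(lines):
    """Group CoNLL-style lines into (chars, tags) pairs, one pair per sentence."""
    sentences = []
    chars, tags = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if chars:
                sentences.append((chars, tags))
                chars, tags = [], []
            continue
        parts = line.split()
        chars.append(parts[0])
        # Unlabelled lines (for example in a test split) get None as the tag.
        tags.append(parts[1] if len(parts) > 1 else None)
    if chars:
        # Flush the final sentence when the input has no trailing blank line.
        sentences.append((chars, tags))
    return sentences


sample = """浙 B-prov
江 E-prov

杭 B-city
州 E-city
""".splitlines()
sentences = read_conll(sample)
print(sentences)
```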
- Some tricks can be added to improve prediction performance, for example model averaging, pseudo-labelling, and model stacking. Details can be seen in the BDCI top-1 scheme.
- 2. model_name_or_path contains the pretrained BERT model files (.bin, .json, .txt), which can be downloaded from Chinese-BERT-wwm for Chinese text (other languages are also supported).
- 3. Char and bigram embeddings can be downloaded from Flat.
- 4. Flat model achieves 87.93, cross-validation: 88.73, pseudo-labelling: 89.73
- 5. char-bigram-bilstm: 86.88
- 6. biaffine-ner: 81.63
- 7. Our provided model (char-bigram-bert-bilstm-crf): 90+
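Model averaging, one of the tricks mentioned in the notes above, can be sketched as follows. This version averages per-token label probabilities from several models and picks the highest-scoring label. It is a simplification: it ignores CRF transition scores, and "model average" could equally refer to averaging model weights. The function names, labels, and probabilities are made up for illustration.

```python
def average_predictions(model_probs):
    """Average probabilities, where model_probs[m][t][k] is model m, token t, label k."""
    n_models = len(model_probs)
    n_tokens = len(model_probs[0])
    n_labels = len(model_probs[0][0])
    return [
        [
            sum(model_probs[m][t][k] for m in range(n_models)) / n_models
            for k in range(n_labels)
        ]
        for t in range(n_tokens)
    ]


def argmax_labels(probs, labels):
    """Pick the label with the highest averaged probability for each token."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in probs]


labels = ["O", "B-city", "E-city"]
# Two hypothetical models scoring the same two-token sentence.
model_a = [[0.2, 0.7, 0.1], [0.5, 0.1, 0.4]]
model_b = [[0.3, 0.6, 0.1], [0.2, 0.1, 0.7]]

averaged = average_predictions([model_a, model_b])
predicted = argmax_labels(averaged, labels)
print(predicted)
```

In this toy input the two models disagree on the second token, and the average resolves it in favour of the more confident model.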