First, download and process the Wikipedia and BookCorpus data using wikiextractor and BookCorpus. For BookCorpus, we further convert the data into the output format of wikiextractor; the conversion script is in ./bookcorpus/preprocess.ipynb.
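The notebook itself is not reproduced here, but the conversion amounts to writing each book as one JSON document per line, in the same shape as wikiextractor's --json output (keys such as id, url, title, text). Below is a minimal sketch of that idea; the directory layout, file names, and the assumption of one plain-text file per book are placeholders, not the notebook's actual code.

```python
import json
import os

# Hypothetical locations; adjust to where the raw BookCorpus files live
# and where the formatted corpus should be written.
BOOKCORPUS_DIR = "./bookcorpus/raw"
OUTPUT_FILE = "./bookcorpus/bookcorpus_formatted.json"

with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    for doc_id, fname in enumerate(sorted(os.listdir(BOOKCORPUS_DIR))):
        if not fname.endswith(".txt"):
            continue
        with open(os.path.join(BOOKCORPUS_DIR, fname), encoding="utf-8") as f:
            text = f.read().strip()
        # One JSON object per line, mirroring wikiextractor's --json output.
        record = {
            "id": str(doc_id),
            "url": "",  # BookCorpus has no source URL; left empty
            "title": os.path.splitext(fname)[0],
            "text": text,
        }
        out.write(json.dumps(record) + "\n")
```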
Then, run the following script to generate sentence-split data for pretraining.
./scripts/presplit_sentences_json.py
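Conceptually, this step splits each document's text field into one sentence per line. The sketch below shows the kind of transformation involved, assuming NLTK's punkt sentence tokenizer; the input/output paths are placeholders, and the actual script defines its own arguments and output format, so consult it before running.

```python
import json
import nltk

nltk.download("punkt")  # sentence tokenizer model used by sent_tokenize

# Hypothetical paths; the real script handles its own input/output.
INPUT_FILE = "./data/corpus.json"
OUTPUT_FILE = "./data/corpus_sentence_split.json"

with open(INPUT_FILE, encoding="utf-8") as ifile, \
     open(OUTPUT_FILE, "w", encoding="utf-8") as ofile:
    for line in ifile:
        doc = json.loads(line)
        # Split each paragraph into sentences, then store one sentence per line.
        sentences = []
        for paragraph in doc["text"].split("\n"):
            if paragraph.strip():
                sentences.extend(nltk.tokenize.sent_tokenize(paragraph))
        doc["text"] = "\n".join(sentences)
        ofile.write(json.dumps(doc) + "\n")
```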
Note that the path of the processed output file must be the same as the path specified in data_utils/corpora.py.
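In Megatron-style codebases, data_utils/corpora.py usually exposes each corpus as a class with a PATH attribute pointing at its JSON file. The excerpt below is only illustrative; the class names and paths in this repository may differ, so edit the actual file accordingly.

```python
# data_utils/corpora.py (illustrative excerpt; actual class names may differ)
class wikipedia(json_dataset):
    # Point PATH at the sentence-split JSON produced in the previous step.
    PATH = "./data/wikipedia_presplit.json"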
Then, run the following script to start SegaBERT pretraining.
./scripts/pretrain_segabert_distributed.sh
The default parameters in this bash script are set for the large model.