In this repo, we apply XLM-RoBERTa to cross-domain sequence-labeling term extraction. We compare cross-lingual and multi-lingual learning against the monolingual setting, and examine the cross-lingual effect of training on a rich-resourced language and testing on a less-resourced one. The results show that multi-lingual and cross-lingual cross-domain learning outperform related work on both datasets, which indicates their potential for transfer from rich- to less-resourced languages.
The work has been accepted at the 25th International Conference on Discovery Science, 2022.
Please install all the necessary libraries noted in requirements.txt using this command:
pip install -r requirements.txt
The experiments were conducted on 2 datasets:
| | ACTER dataset | RSDO5 dataset |
|---|---|---|
| Languages | English, French, and Dutch | Slovenian |
| Domains | Corruption, Wind energy, Equitation, Heart failure | Biomechanics, Chemistry, Veterinary, Linguistics |
As the original dataset does not follow the IOB format, we preprocess the data to map each token, in sequence, to its corresponding label. In the IOB format, B marks the first token of a term, I marks a token inside a term, and O marks a token outside any term.
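The token-to-label mapping can be illustrated with a minimal sketch. It is not the code in preprocess.py: the helper name `iob_labels`, the bare `B`/`I`/`O` tag names, the case-insensitive matching, and the longest-term-first rule are all assumptions made for illustration.

```python
from typing import List, Sequence


def iob_labels(tokens: Sequence[str], terms: Sequence[Sequence[str]]) -> List[str]:
    """Tag each token B, I, or O given tokenized terms (hypothetical helper)."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Assumption: longer terms are matched first, so a shorter nested term
    # cannot overwrite part of a longer one.
    for term in sorted(terms, key=len, reverse=True):
        pattern = [t.lower() for t in term]
        n = len(pattern)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window_free = all(label == "O" for label in labels[start:start + n])
            if window_free and lowered[start:start + n] == pattern:
                labels[start] = "B"
                for i in range(start + 1, start + n):
                    labels[i] = "I"
    return labels


# Illustrative sentence and term list (not taken from the datasets).
tokens = ["Heart", "failure", "is", "treated", "with", "beta", "blockers", "."]
terms = [["heart", "failure"], ["beta", "blockers"], ["failure"]]
for token, label in zip(tokens, iob_labels(tokens, terms)):
    print(f"{token}\t{label}")
```

Here "Heart failure" and "beta blockers" each receive a B followed by an I, and every other token receives O. The nested term "failure" is not tagged separately because its token is already part of the longer match.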
For the ACTER dataset, run the following command to preprocess the data:
preprocess.py [-corpus_path] [-term_path] [-output_csv_path] [-language]
where -corpus_path is the path to the directory containing the corpus files, -term_path is the path to the directory containing the term files, -output_csv_path is the path to the output csv file, and -language is the language of the corpus.
The RSDO5 dataset is already in CoNLL format. Please use the read_conll() function in sl_preprocess.py to get the mapping.
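For readers unfamiliar with CoNLL-style files, the sketch below shows the general idea of reading one token per line with blank lines between sentences. It is not the repository's read_conll(), whose signature and column layout are not shown here. The function name `parse_conll_sketch`, the column positions, and the sample text are assumptions.

```python
from typing import List, Tuple


def parse_conll_sketch(
    text: str, token_col: int = 0, label_col: int = -1
) -> List[List[Tuple[str, str]]]:
    """Split CoNLL-style text into sentences of (token, label) pairs.

    Assumes whitespace-separated columns and blank lines between sentences.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[token_col], cols[label_col]))
    if current:
        sentences.append(current)
    return sentences


# Illustrative two-sentence sample (not taken from RSDO5).
sample = "Veterinarska\tB\nmedicina\tI\nje\tO\n\nKemija\tB\n"
for sentence in parse_conll_sketch(sample):
    print(sentence)
```

The column indices are parameters because the position of the token and label columns depends on the file at hand.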
Run the following command to train the model on all the scenarios in the ACTER and RSDO5 datasets:
run.sh
where run.sh covers the following scenarios:
- ACTER dataset with XLM-RoBERTa in mono-lingual, cross-lingual, and multi-lingual settings, with both the ANN and NES versions; the multi-lingual settings cover the three ACTER languages only and the three languages with additional Slovenian add-ons (10 scenarios).
- RSDO5 dataset with XLM-RoBERTa in mono-lingual, cross-lingual, and multi-lingual settings, where the cross-lingual and multi-lingual settings take into account the ANN and NES versions (48 scenarios).
Feel free to tune the hyper-parameters of the model. The current settings are:
num_train_epochs=20, # total # of training epochs
per_device_train_batch_size=32, # batch size per device during training
per_device_eval_batch_size=32, # batch size for evaluation
learning_rate=2e-5, # learning rate
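eval_steps=500, # number of update steps between evaluations
load_best_model_at_end=True, # load the best model at the end of training
metric_for_best_model="f1", # metric used to select the best checkpoint
greater_is_better=True, # a higher F1 means a better model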
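Because the best checkpoint is selected by "f1", the evaluation step has to produce a metric with that name. The sketch below shows one generic way to compute a span-level F1 from IOB label sequences. The metric the authors actually use is not shown in this README, so the helper names `spans` and `span_scores`, the bare `B`/`I`/`O` tags, and the exact-match span criterion are assumptions.

```python
from typing import Dict, List, Set, Tuple


def spans(labels: List[str]) -> Set[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for each B followed by I's."""
    found: Set[Tuple[int, int]] = set()
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                found.add((start, i))  # close the previous span
            start = i
        elif label == "I" and start is not None:
            continue  # the current span keeps growing
        else:
            # "O" (or a stray "I" with no open span) ends any open span.
            if start is not None:
                found.add((start, i))
            start = None
    if start is not None:
        found.add((start, len(labels)))
    return found


def span_scores(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Precision, recall, and F1 over exactly matching spans."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative label sequences (not model output).
gold = ["B", "I", "O", "O", "B", "O"]
pred = ["B", "I", "O", "B", "O", "O"]
print(span_scores(gold, pred))
```

In this example one of the two predicted spans matches one of the two gold spans, so precision, recall, and F1 are all 0.5. A dictionary of this shape, returned from a metrics function, would give the trainer an "f1" value to compare across checkpoints.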
We report the following results from all scenarios on the ACTER dataset, grouped by test language.
[Results table: Test = en, Test = fr, Test = nl]
We report the following results from all scenarios on the RSDO5 dataset, grouped by test domain.
[Two sets of results tables, each covering Test = ling, Test = vet, Test = kem, Test = bim]
The latest results, with further details and discussion, are given in the paper cited below.
Tran, Hanh Thi Hong, et al. "Can Cross-Domain Term Extraction Benefit from Cross-lingual Transfer?." Discovery Science: 25th International Conference, DS 2022, Montpellier, France, October 10–12, 2022, Proceedings. Cham: Springer Nature Switzerland, 2022.
- 🐮 TRAN Thi Hong Hanh 🐮
- Prof. Senja POLLAK
- Prof. Antoine DOUCET
- Prof. Matej MARTINC