This repository contains the acronym identification training and development set along with the evaluation scripts for the acronym identification task at SDU@AAAI-21.
The `dataset` folder contains three files:
- `train.json`: The training samples for the acronym identification task (see the loading sketch after this list). Each sample has three attributes:
  - `tokens`: The list of words (tokens) of the sample
  - `labels`: The short-form and long-form labels of the words in BIO format. The labels `B-short` and `B-long` identify the beginning of a short-form and long-form phrase, respectively. The labels `I-short` and `I-long` indicate the words inside the short-form or long-form phrases. Finally, the label `O` shows that the word is not part of any short-form or long-form phrase.
  - `id`: The unique ID of the sample
- `dev.json`: The development set for the acronym identification task. The samples in `dev.json` have the same attributes as the samples in `train.json`.
- `predictions.json`: A sample prediction file created from `dev.json` to test the scoring script. The participants should submit the final test predictions of their model in the same format as the `predictions.json` file. Each prediction should have two attributes:
  - `id`: The ID of the sample (i.e., the same IDs used in the train/dev/test samples provided in `train.json`, `dev.json` and `test.json`)
  - `predictions`: The labels of the words of the sample in BIO format. The labels `B-short` and `B-long` identify the beginning of a short-form and long-form phrase, respectively. The labels `I-short` and `I-long` indicate the words inside the short-form or long-form phrases. Finally, the label `O` shows that the word is not part of any short-form or long-form phrase.
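For convenience, here is a minimal loading sketch (not part of the repository) that iterates over the samples using only the attributes described above; it assumes each file is a JSON array of such samples.

```python
import json

# Load the training set; each entry is expected to have "tokens", "labels" and "id".
with open("dataset/train.json") as f:
    train = json.load(f)

sample = train[0]
print(sample["id"])

# Tokens and their BIO labels are aligned one-to-one.
for token, label in zip(sample["tokens"], sample["labels"]):
    print(f"{token}\t{label}")
```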
In order to familiarize the participants with this task, we provide a rule-based model in the `code` directory. This baseline implements the method proposed by Schwartz and Hearst [1]. To identify acronyms, if more than 60% of the characters of a word are uppercase, this model recognizes it as an acronym (i.e., short-form). To identify the long-form, it compares the characters of the acronym with the characters of the words before or after the acronym, up to a certain window size. If the characters of these words could form the acronym, they are labeled as the long-form. To run this model, use the following command:
```bash
python code/character_match.py -input <path/to/input.json> -output <path/to/output.json>
```
Please replace `<path/to/input.json>` and `<path/to/output.json>` with the real paths to the input file (e.g., `dataset/dev.json`) and the output file. The output file contains the predictions and can be evaluated by the scorer using the command described in the next section. The official scores for this baseline are: Precision: 93.22%, Recall: 78.90%, F1: 85.46%.
To evaluate the predictions (in the format provided in the `dataset/predictions.json` file), run the following command:
```bash
python scorer.py -g path/to/gold.json -p path/to/predictions.json
```
The `path/to/gold.json` and `path/to/predictions.json` should be replaced with the real paths to the gold file (e.g., `dataset/dev.json` for evaluation on the development set) and the predictions file (i.e., the predictions generated by your system in the same format as the `dataset/predictions.json` file). The official evaluation metrics are the macro-averaged precision, recall and F1 for short-form and long-form predictions. For verbose evaluation (including the micro-averaged precision, recall and F1, as well as the short-form and long-form scores reported separately), use the following command:
```bash
python scorer.py -g path/to/gold.json -p path/to/predictions.json -v
```
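To make the macro-averaged metrics concrete, the sketch below shows one common way to combine per-class precision, recall and F1 for the short-form and long-form classes; the numbers are made up for illustration, this is not the scorer's actual code, and the official scorer's exact aggregation may differ in details.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall (0 when both are 0).
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical per-class scores, for illustration only.
short = {"precision": 0.95, "recall": 0.80}
long_ = {"precision": 0.90, "recall": 0.78}

# Macro-averaging treats the two classes equally, regardless of how many
# short-form or long-form phrases appear in the data.
macro_precision = (short["precision"] + long_["precision"]) / 2
macro_recall = (short["recall"] + long_["recall"]) / 2
macro_f1 = (f1(**short) + f1(**long_)) / 2
print(macro_precision, macro_recall, macro_f1)
```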
In order to participate, please first fill out this form to register for the shared tasks: https://forms.gle/NvnT549mSbyeJQAPA. The team name that is provided in this form will be used in the subsequent submissions and communications. The shared task is organized in two separate phases:
- Development Phase: In this phase, the participants will use the training/development sets provided in this repository to design and develop their models.
- Evaluation Phase: Two weeks before the system runs are due, i.e., on 20th November 2020, the test set is released here. The test set has the same distribution and format as the development set. Run your model on the provided test set and save the prediction results in a JSON file with the same format as the `predictions.json` file (see the sketch after this list for the expected structure). Name the prediction file `output.json` and send it to the email address sdu-aaai21@googlegroups.com with the title "Results of AI-[TEAM-name]-[RUN-ID]", where "[TEAM-name]" should be replaced with the name of your team provided in the registration form and "[RUN-ID]" with a number between 1 and 10 to identify the model run. Each participating team is allowed to submit up to 10 different model runs. Note that your official score is reported for the model run with ID 1. In addition to the `output.json` file, please include the following information in your email:
- Model Description: A brief summary of the model architecture. If your model uses word embeddings, please specify which type of word embeddings it uses.
- Extra Data: Whether or not the model employs other resources/data, e.g., acronym glossaries, in the development or evaluation phases.
- Training/Evaluation Time: How long the model takes to be trained/evaluated on the provided dataset.
- Run Description: A brief description of how this model run differs from your other runs (if applicable).
- Plan for System Report: If you plan to submit a system report or release your model publicly, please specify that. Participants are strongly encouraged to submit a system report, regardless of the results.
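The snippet below is a minimal sketch of writing predictions in the required format; the sample ID and label sequence are hypothetical placeholders to be replaced with your model's output on the test set.

```python
import json

# Hypothetical predictions: map each test-sample "id" to its predicted BIO labels.
model_outputs = {
    "TS-0": ["O", "B-long", "I-long", "B-short", "O"],  # placeholder example
}

# Each prediction entry needs exactly the "id" and "predictions" attributes.
results = [{"id": sample_id, "predictions": labels}
           for sample_id, labels in model_outputs.items()]

with open("output.json", "w") as f:
    json.dump(results, f)
```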
For more information, see SDU@AAAI-21.
Update: The CodaLab competition for the shared task is open. Participants can also submit their results to the Acronym Identification competition. For more information, please check the CodaLab competition for Acronym Identification.
If you use the dataset, baseline or evaluation script released in this repo, please cite our paper:
```
@inproceedings{veyseh-et-al-2020-what,
  title={{What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation}},
  author={Amir Pouran Ben Veyseh and Franck Dernoncourt and Quan Hung Tran and Thien Huu Nguyen},
  year={2020},
  booktitle={Proceedings of COLING},
  url={https://arxiv.org/pdf/2010.14678v1.pdf}
}
```
The dataset provided for this shared task is licensed under the CC BY-NC-SA 4.0 International license, and the evaluation script and the baseline are licensed under the MIT license.
[1] Schwartz AS, Hearst MA. A simple algorithm for identifying abbreviation definitions in biomedical text. Pac Symp Biocomput. 2003:451-62. PMID: 12603049.