GitHub - therosko/annotated_datasets_en_comparisson: Comparing Annotated Datasets for Named Entity Recognition in English Literature

Comparing Annotated Datasets for Named Entity Recognition in English Literature

Should you stumble upon an issue or have questions, please contact rositsa.ivanova@unisg.ch.

Abstract:

The growing interest in named entity recognition (NER) in various domains has led to the creation of different benchmark datasets, often with slightly different annotation guidelines. To better understand the different NER benchmark datasets for the domain of English literature and their impact on the evaluation of NER tools, we analyse two existing annotated datasets and create two additional gold standard datasets. Following on from this, we evaluate the performance of two NER tools, one domain-specific and one general-purpose NER tool, using the four gold standards, and analyse the sources for the differences in the measured performance. Our results show that the performance of the two tools varies significantly depending on the gold standard used for the individual evaluations.

Please use the following citation while citing this work:

@InProceedings{ivanova-vanerp-kirrane:2022:LREC,
  author    = {Ivanova, Rositsa  and  van Erp, Marieke  and  Kirrane, Sabrina},
  title     = {Comparing Annotated Datasets for Named Entity Recognition in English Literature},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {3788--3797},
  url       = {https://aclanthology.org/2022.lrec-1.404}
}

How to replicate the experiment

Setup and requirements

The standard setup uses Docker. Once installed follow the steps:

create a folder (mkdir v_mnt), which will be mounted to the container
inside of this folder create a directory called Git mkdir Git
clone this repository inside of the Git directory git clone <repo_path_here> . (note: don't forget the dot in the end ;)
navigate to the cloned repository cd Git. The Dockerfile and requirements.txt are located there.
build the docker image sudo docker build -t ner .
run container with the mounted volume (any of the following works):

directly from the command line docker container run -v <path to folder>/v_mnt:/mnt -it ner /bin/bash
mount a volume to the container when running (docker container run -v <path to folder>/v_mnt:/mnt -it ner /bin/bash) and develop over VSCode (if working on a remote server, connect to remote server over SSH ) (push and pull from the same local folder)

Next steps:

The following steps are available in the get_setup.sh. Those should be executed within the container. You can run all in one by calling the script itself.

We use annotated data from
- Litbank Last commit: a371cd678701fc98371355b328a1a6c4b58508a3
- Dekker et al. Last commit ad31ce1fa515dceabb8febbaa7aa235f3de47ebd
- Two additional annotated datasets created for this work, which can be found in the folder annotation
Tools:
- BookNLP Last commit: f58fbdbb018ba8bf2d836b764d0426afa0f7bc8c. This version of BookNLP uses Stanford CoreNLP v.4.1.0 .
- Flair version 0.8

Further info:

The evaluation of the tools was done using the CoNLL evaluation script. Due to the various types of token labels within the tools and the gold standards, the published evaluation are calculated without considering prefixes. The folder results offers additional results for the tool Flair, which also consider prefixes. Such evaluation is not possible for BookNLP, as the tool does not use prefixes.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
annotation		annotation
data_overlap_manual_collection		data_overlap_manual_collection
docs		docs
results/overlap_conll		results/overlap_conll
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
all_steps.sh		all_steps.sh
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Comparing Annotated Datasets for Named Entity Recognition in English Literature

How to replicate the experiment

Setup and requirements

Next steps:

Further info:

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Comparing Annotated Datasets for Named Entity Recognition in English Literature

How to replicate the experiment

Setup and requirements

Next steps:

Further info:

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages