Evaluation
The evaluation tool provides a range of linking and clustering evaluation measures. These are described briefly below and listed by the `nel list-measures` command. For more detail on how the linking measures here correspond to those in the literature, see Hachey et al. (2014). For clustering, see Pradhan et al. (2014). For a quick reference, see our cheatsheet. (As described there, evaluation can be performed across the whole corpus, or with separate scores for each document/type as well as micro- and macro-averages across all types/docs.)
TAC 2014 will report two official measures, one for linking/wikification and one for nil clustering. For more detail, see the TAC 2014 scoring page.
strong_typed_all_match
is a micro-averaged evaluation of all mentions. A mention is counted as correct if it is a correct link or a correct nil. A correct link must have the same span, entity type, and KB identifier as a gold link. A correct nil must have the same span as a gold nil. This is the official linking evaluation measure for TAC 2014.
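A minimal sketch of the comparison, assuming each annotation is a (document, start, end, entity type, KB id) tuple with a null KB id marking nils; the names and tuple layout here are illustrative, not neleval's own code:

```python
# Hypothetical sketch of strong_typed_all_match (not neleval's implementation).
# Annotations are assumed to be (doc_id, start, end, entity_type, kb_id) tuples,
# with kb_id None for nil mentions.

def key(ann):
    doc, start, end, etype, kb_id = ann
    if kb_id is None:
        return (doc, start, end, 'NIL')        # a correct nil only needs the span
    return (doc, start, end, etype, kb_id)     # a correct link needs span, type and KB id

def prf(gold, system):
    """Micro-averaged precision, recall and F1 over annotation keys."""
    gold, system = {key(a) for a in gold}, {key(a) for a in system}
    tp = len(gold & system)
    p = tp / len(system) if system else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

gold = [('doc1', 0, 5, 'PER', 'E101'), ('doc1', 10, 14, 'ORG', None)]
system = [('doc1', 0, 5, 'PER', 'E101'), ('doc1', 10, 14, 'ORG', 'E999')]
print(prf(gold, system))  # (0.5, 0.5, 0.5): the spurious link to E999 misses the gold nil
```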
mention_ceaf
is based on a one-to-one alignment between system and gold clusters — both KB and nil. It computes an optimal mapping based on overlap between system-gold cluster pairs. System and gold mentions must have the same span to affect the alignment. Unmatched mentions also affect precision and recall.
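The alignment can be illustrated roughly as follows, assuming clusters are given as sets of mention spans and using the Hungarian algorithm to maximise total overlap; this is a sketch, not neleval's implementation:

```python
# Rough sketch of mention_ceaf (illustrative, not neleval's code).
# Clusters are sets of mention spans, e.g. (doc_id, start, end) tuples.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mention_ceaf(gold_clusters, sys_clusters):
    # Overlap (number of shared spans) for every gold/system cluster pair.
    overlap = np.array([[len(g & s) for s in sys_clusters] for g in gold_clusters])
    rows, cols = linear_sum_assignment(-overlap)   # optimal one-to-one alignment
    aligned = overlap[rows, cols].sum()
    # Unmatched mentions still count in the denominators.
    recall = aligned / sum(len(g) for g in gold_clusters)
    precision = aligned / sum(len(s) for s in sys_clusters)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{('d1', 0, 5), ('d1', 20, 25)}, {('d1', 10, 14)}]
system = [{('d1', 0, 5)}, {('d1', 10, 14), ('d1', 30, 35)}]
print(mention_ceaf(gold, system))  # roughly (0.67, 0.67, 0.67)
```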
The evaluation tool also provides a number of diagnostic measures to isolate the performance of individual system components and to compare against numbers reported elsewhere in the literature.
strong_mention_match
is a micro-averaged evaluation of entity mentions. A system span must match a gold span exactly to be counted as correct.
strong_typed_mention_match
additionally requires the correct entity type. This is equivalent to the CoNLL NER evaluation (Tjong Kim Sang & De Meulder, 2003).
strong_linked_mention_match
is the same as strong_mention_match, but only considers non-nil mentions that are linked to a KB identifier.
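These three diagnostics differ only in which mentions are kept and which fields are compared. A toy sketch, assuming the same (doc_id, start, end, entity_type, kb_id) tuple layout as above and hypothetical helper names:

```python
# Illustrative only: the span-level diagnostics as filters/projections of annotations.

def f1(gold, system):
    gold, system = set(gold), set(system)
    tp = len(gold & system)
    p = tp / len(system) if system else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def spans(anns, typed=False, linked_only=False):
    return {(a[0], a[1], a[2]) + ((a[3],) if typed else ())
            for a in anns
            if not linked_only or a[4] is not None}

# strong_mention_match:        f1(spans(gold), spans(system))
# strong_typed_mention_match:  f1(spans(gold, typed=True), spans(system, typed=True))
# strong_linked_mention_match: f1(spans(gold, linked_only=True), spans(system, linked_only=True))
```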
Measures sensitive to partial overlap between system and gold mentions, as in the LoReHLT metric, can be constructed with aggregates such as `overlap-sumsum`. See the cheatsheet.
strong_link_match
is a micro-averaged evaluation of links. A system link must have the same span and KB identifier as a gold link to be counted as correct. This is equivalent to Cornolti et al.'s (2013) strong annotation match. Recall here is equivalent to KB accuracy from TAC tasks before 2014.
strong_nil_match
is a micro-averaged evaluation of nil mentions. A system nil must have the same span as a gold nil to be counted as correct. Recall here is equivalent to nil accuracy from TAC tasks before 2014.
strong_all_match
is a micro-averaged link evaluation of all mentions. A mention is counted as correct if it is either a link match or a nil match as defined above. This is equivalent to overall accuracy from TAC tasks before 2014.
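Under the same assumptions as the sketches above (annotations as span/KB-id tuples, with a null id for nils), these three measures might be approximated as follows; the helper names are hypothetical:

```python
# Illustrative sketch of strong_link_match, strong_nil_match and strong_all_match.
# Annotations are assumed to be (doc_id, start, end, kb_id) tuples, kb_id None for nils.

def prf(gold, system):
    gold, system = set(gold), set(system)
    tp = len(gold & system)
    p = tp / len(system) if system else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

def links(anns):
    return {a for a in anns if a[3] is not None}   # span + KB id must match

def nils(anns):
    return {a[:3] for a in anns if a[3] is None}   # only the span matters for nils

# strong_link_match: prf(links(gold), links(system))   (recall ~ pre-2014 KB accuracy)
# strong_nil_match:  prf(nils(gold), nils(system))     (recall ~ pre-2014 nil accuracy)
# strong_all_match:  prf(links(gold) | nils(gold), links(system) | nils(system))
```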
entity_match
is a micro-averaged document-level set-of-titles measure. It is the same as entity match reported by Cornolti et al. (2013).
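A rough sketch of the idea, assuming the measure micro-averages over document-level sets of KB identifiers with nils excluded; see Cornolti et al. (2013) for the exact definition:

```python
# Illustrative sketch of entity_match over (doc_id, kb_id) pairs (kb_id None for nils).
from collections import defaultdict

def doc_title_sets(annotations):
    docs = defaultdict(set)
    for doc_id, kb_id in annotations:
        if kb_id is not None:          # nils do not contribute titles
            docs[doc_id].add(kb_id)
    return docs

def entity_match(gold, system):
    g, s = doc_title_sets(gold), doc_title_sets(system)
    tp = sum(len(g[d] & s[d]) for d in set(g) | set(s))
    n_sys = sum(len(titles) for titles in s.values())
    n_gold = sum(len(titles) for titles in g.values())
    p = tp / n_sys if n_sys else 0.0
    r = tp / n_gold if n_gold else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)
```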
entity_ceaf
is based, like mention_ceaf, on a one-to-one alignment between system and gold entity clusters. Here system-gold cluster pairs are scored by their Dice coefficient.
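Following the description above, the cluster-pair score changes from raw mention overlap to a Dice coefficient; a minimal helper for reference (illustrative only):

```python
def dice(gold_cluster, sys_cluster):
    """Dice coefficient between two clusters, viewed as sets of mentions."""
    if not gold_cluster and not sys_cluster:
        return 0.0
    return 2 * len(gold_cluster & sys_cluster) / (len(gold_cluster) + len(sys_cluster))
```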
b_cubed
assesses the proportion of each mention's cluster that is shared between gold and predicted clusterings.
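A minimal sketch, assuming gold and system cluster the same set of mentions (the handling of unaligned mentions discussed by Pradhan et al. (2014) is omitted) and representing each clustering as a mention-to-cluster-id map:

```python
# Illustrative B-cubed sketch (not neleval's implementation).
from collections import defaultdict

def b_cubed(gold, system):
    gold_clusters, sys_clusters = defaultdict(set), defaultdict(set)
    for m, c in gold.items():
        gold_clusters[c].add(m)
    for m, c in system.items():
        sys_clusters[c].add(m)
    precision = recall = 0.0
    for m in gold:
        shared = gold_clusters[gold[m]] & sys_clusters[system[m]]
        precision += len(shared) / len(sys_clusters[system[m]])
        recall += len(shared) / len(gold_clusters[gold[m]])
    n = len(gold)
    return precision / n, recall / n

gold = {'m1': 'a', 'm2': 'a', 'm3': 'a', 'm4': 'b'}
system = {'m1': 'x', 'm2': 'x', 'm3': 'y', 'm4': 'y'}
print(b_cubed(gold, system))  # (0.75, 0.667)
```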
b_cubed_plus
is identical to b_cubed, but additionally requires a correct KB identifier for non-nil mentions.
muc
counts the number of edits required to translate the gold clustering into the prediction.
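A rough sketch of the link-based counting, again assuming gold and system cluster the same mentions; p(c) below is the number of system clusters that gold cluster c is split across, each split costing one link:

```python
# Illustrative MUC sketch (not neleval's implementation).
from collections import defaultdict

def _clusters(assignment):
    clusters = defaultdict(set)
    for mention, cid in assignment.items():
        clusters[cid].add(mention)
    return list(clusters.values())

def _muc_recall(gold, system):
    num = den = 0
    for c in _clusters(gold):
        partitions = {system[m] for m in c}    # p(c): system clusters touching c
        num += len(c) - len(partitions)
        den += len(c) - 1
    return num / den if den else 0.0

def muc(gold, system):
    recall = _muc_recall(gold, system)
    precision = _muc_recall(system, gold)      # same computation with roles swapped
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {'m1': 'a', 'm2': 'a', 'm3': 'a'}
system = {'m1': 'x', 'm2': 'x', 'm3': 'y'}
print(muc(gold, system))  # (1.0, 0.5, 0.667)
```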
pairwise
measures the proportion of mention pairs occurring in the same cluster in both gold and predicted clusterings. It is similar to the Rand Index.
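A minimal sketch, counting mention pairs that share a cluster in gold and in the system output, under the same mention-to-cluster-id representation as above:

```python
# Illustrative pairwise-coreference sketch (not neleval's implementation).
from itertools import combinations

def coreferent_pairs(assignment):
    return {frozenset(pair) for pair in combinations(assignment, 2)
            if assignment[pair[0]] == assignment[pair[1]]}

def pairwise(gold, system):
    g, s = coreferent_pairs(gold), coreferent_pairs(system)
    tp = len(g & s)
    p = tp / len(s) if s else 0.0
    r = tp / len(g) if g else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

gold = {'m1': 'a', 'm2': 'a', 'm3': 'a'}
system = {'m1': 'x', 'm2': 'x', 'm3': 'y'}
print(pairwise(gold, system))  # (1.0, 0.333, 0.5)
```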
For more detail, see Pradhan et al.'s (2014) excellent overview of clustering measures for coreference evaluation, and our [implementation notes](Coreference_Evaluation).
Our scorer supports specification of some custom evaluation measures. See `./nel list-measures`.
References
Cornolti et al. (2013). A framework for benchmarking entity-annotation systems. In WWW.
Hachey et al. (2014). Cheap and easy entity evaluation. In ACL.
Ji & Grishman (2011). Knowledge base population: successful approaches and challenges. In ACL.
Pradhan et al. (2014). Scoring coreference partitions of predicted mentions: A reference implementation. In ACL.
Tjong Kim Sang & De Meulder (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.