
Overview

We use randomised permutation/bootstrap methods to test whether the difference between a pair of systems is statistically significant; with many systems or a large number of trials this can take a long time to run. We also provide a tool that calculates confidence intervals at given percentiles. Overlapping confidence intervals may indicate that system performances do not differ significantly.
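The sketch below illustrates the general idea at the document level: documents are resampled with replacement, a score is recomputed for each resample, and percentile confidence intervals are read off the resulting score distribution. It assumes per-document (true positive, system mention, gold mention) counts for two systems over the same documents; the function names, the count format and the reversal-rate statistic are illustrative assumptions, not the tool's implementation.

```python
import random


def f1(counts):
    """Micro-averaged F1 from per-document (tp, n_system, n_gold) counts."""
    tp = sum(c[0] for c in counts)
    n_sys = sum(c[1] for c in counts)
    n_gold = sum(c[2] for c in counts)
    p = tp / n_sys if n_sys else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def bootstrap(sys_a, sys_b, n_trials=1000, percentiles=(2.5, 97.5), seed=1):
    """sys_a and sys_b are parallel lists of per-document (tp, n_system, n_gold)
    counts for the same documents; documents are resampled with replacement."""
    rng = random.Random(seed)
    n_docs = len(sys_a)
    observed_diff = f1(sys_a) - f1(sys_b)
    scores_a, scores_b, reversals = [], [], 0
    for _ in range(n_trials):
        # One bootstrap sample: draw n_docs document indices with replacement.
        idx = [rng.randrange(n_docs) for _ in range(n_docs)]
        fa = f1([sys_a[i] for i in idx])
        fb = f1([sys_b[i] for i in idx])
        scores_a.append(fa)
        scores_b.append(fb)
        # Count trials where the resampled difference loses the observed sign;
        # a small fraction suggests the observed difference is robust.
        if (fa - fb) * observed_diff <= 0:
            reversals += 1

    def ci(scores):
        ordered = sorted(scores)
        return [ordered[min(int(p / 100.0 * n_trials), n_trials - 1)]
                for p in percentiles]

    return {'reversal_rate': reversals / n_trials,
            'ci_a': ci(scores_a), 'ci_b': ci(scores_b)}
```

For example, `bootstrap(counts_a, counts_b)` returns 95% percentile intervals for each system along with the fraction of resamples in which the F1 difference changes sign; as noted above, heavily overlapping intervals suggest the systems may not differ significantly.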

Caveat

Note that bootstrap resampling is performed over documents from a single system run. This assumes that predictions on each document are made independently, which is certainly untrue for nil clustering, and may be untrue for linking approaches that exploit cross-document clustering.

References

Davison & Hinkley (1997). Bootstrap methods and their application. Cambridge University Press.

Lin (2004). Looking for a few good metrics: ROUGE and its evaluation. In NTCIR.

Noreen (1989). Computer-intensive methods for testing hypotheses. Wiley-Interscience.

Tjong Kim Sang & De Meulder (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
