Usage
The NEL evaluation tools are invoked using `./nel` inside the repository. Usage:

```shell
./nel <command> [<args>]
```
To list available commands:
```shell
./nel
```
To get help for a specific command:
```shell
./nel <command> -h
```
The commands that are relevant to TAC KBP entity linking evaluation and analysis are described below.
The following describes a typical workflow. See also run_tac14_evaluation.sh and run_tac13_evaluation.sh.
For data in [TAC14 format](data format):
```shell
./nel prepare-tac \
  -q /path/to/gold.xml \ # gold queries/mentions file
  /path/to/gold.tab \    # gold KB/NIL annotations file
  > gold.combined.tsv
```
For data in TAC12 and TAC13 format, remove extra columns first, e.g.:
```shell
cat /path/to/gold.tab \
  | cut -f1,2,3 \
  > gold.tab

./nel prepare-tac \
  -q /path/to/gold.xml \
  gold.tab \
  > gold.combined.tsv
```
For data in [TAC14 format](data format):
```shell
./nel prepare-tac \
  -q /path/to/system.xml \ # system mentions file
  /path/to/system.tab \    # system KB/NIL annotations
  > system.combined.tsv
```
For data in TAC12 and TAC13 format, add a dummy NE type column first, e.g.:

```shell
cat /path/to/system.tab \
  | awk 'BEGIN{OFS="\t"} {print $1,$2,"NA",$3}' \
  > system.tab

./nel prepare-tac \
  -q /path/to/gold.xml \ # gold queries/mentions file
  system.tab \           # system KB/NIL annotations
  > system.combined.tsv
```
To calculate micro-averaged scores for all evaluation measures:
```shell
./nel evaluate \
  -m all \               # report all evaluation measures
  -f tab \               # print results in tab-separated format
  -g gold.combined.tsv \ # prepared gold standard annotation
  system.combined.tsv \  # prepared system output
  > system.evaluation
```
To list available evaluation measures:
```shell
./nel list-measures
```
The following describes additional commands for analysis. See also run_tac14_all.sh (TODO) and run_tac13_all.sh.
To calculate confidence intervals using bootstrap resampling:
```shell
./nel confidence \
  -m strong_typed_link_match \ # report CI for TAC14 wikification measure
  -f tab \                     # print results in tab-separated format
  -g gold.combined.tsv \       # prepared gold standard annotation
  system.combined.tsv \        # prepared system output
  > system.confidence
```
We recommend that you `pip install joblib` and use `-j NUM_JOBS` to run this in parallel. This is also faster if an individual evaluation measure is specified (e.g., `strong_typed_link_match`) rather than a group of measures (e.g., `tac`).
The run_report_confidence.sh script is available to create reports comparing multiple systems.
Note that bootstrap resampling is not appropriate for nil clustering measures. For more detail, see the Significance wiki page.
It is also possible to calculate pairwise differences:
```shell
./nel significance \
  --permute \            # use permutation method
  -f tab \               # print results in tab-separated format
  -g gold.combined.tsv \ # prepared gold standard annotation
  system1.combined.tsv \ # prepared system1 output
  system2.combined.tsv \ # prepared system2 output
  > system1-system2.significance
```
We recommend calculating significance only for selected system pairs, as it can take a while over all N choose 2 combinations of systems. You can also use `-j NUM_JOBS` to run this in parallel.
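If you do need all pairs, a simple loop can enumerate the N choose 2 combinations. The sketch below only prints the commands it would run (remove `echo` to execute them), and the `sysA`/`sysB`/`sysC` file names are hypothetical placeholders:

```shell
# Enumerate each unordered pair of system outputs and print the
# corresponding significance command.
systems=(sysA.combined.tsv sysB.combined.tsv sysC.combined.tsv)
for ((i = 0; i < ${#systems[@]}; i++)); do
  for ((j = i + 1; j < ${#systems[@]}; j++)); do
    echo ./nel significance --permute -f tab -g gold.combined.tsv \
      "${systems[i]}" "${systems[j]}"
  done
done
```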
Note that bootstrap resampling is not appropriate for nil clustering measures. For more detail, see the Significance wiki page.
To create a table of classification errors:
```shell
./nel analyze \
  -s \                   # print summary table
  -g gold.combined.tsv \ # prepared gold standard annotation
  system.combined.tsv \  # prepared system output
  > system.analysis
```
Without the `-s` flag, the `analyze` command lists and categorizes differences between the gold standard and system output.
The following describes a workflow for evaluation over subsets of mentions. See also run_tac14_filtered.sh (TODO) and run_tac13_filtered.sh.
Prepared data is in a simple tab-separated format with one mention per line and six columns: `document_id`, `start_offset`, `end_offset`, `kb_or_nil_id`, `score`, `entity_type`. It is possible to use command-line tools (e.g., `grep`, `awk`) to select mentions for evaluation, e.g.:
```shell
cat gold.combined.tsv \      # prepared gold standard annotation
  | egrep "^eng-(NG|WL)-" \  # select newsgroup and blog (WB) mentions
  > gold.WB.tsv              # filtered gold standard annotation

cat system.combined.tsv \    # prepared system output
  | egrep "^eng-(NG|WL)-" \  # select newsgroup and blog (WB) mentions
  > system.WB.tsv            # filtered system output
```
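Filtering works the same way on any of the six columns. For instance, a subset by entity type can be selected on the sixth (`entity_type`) column with `awk`; the two sample lines below are made up for illustration, and in practice you would filter the prepared `gold.combined.tsv` and `system.combined.tsv`:

```shell
# Hypothetical two-mention file: one PER mention, one ORG mention.
printf 'doc01\t0\t5\tE0000001\t1.0\tPER\ndoc01\t10\t15\tNIL0001\t1.0\tORG\n' > sample.tsv

# Keep only mentions whose entity_type column is PER.
awk -F'\t' '$6 == "PER"' sample.tsv > sample.PER.tsv
```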
After filtering, evaluation is run as before:
```shell
./nel evaluate \
  -m all \         # report all evaluation measures
  -f tab \         # print results in tab-separated format
  -g gold.WB.tsv \ # filtered gold standard annotation
  system.WB.tsv \  # filtered system output
  > system.WB.evaluation
```