
Overview

The NEL evaluation tools are invoked using ./nel from the repository root. Usage:

./nel <command> [<args>]

To list available commands:

./nel

To get help for a specific command:

./nel <command> -h
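
For example, to get help for the evaluate command:

./nel evaluate -h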

The commands that are relevant to TAC KBP entity linking evaluation and analysis are described below.

Basic usage

The following describes a typical workflow. See also run_tac14_evaluation.sh and run_tac13_evaluation.sh.

Convert gold standard to evaluation format

For data in TAC14 format (see the Data format wiki page):

./nel prepare-tac \
    -q /path/to/gold.xml \    # gold queries/mentions file
    /path/to/gold.tab \       # gold KB/NIL annotations file
    > gold.combined.tsv

For data in TAC12 and TAC13 format, remove extra columns first, e.g.:

cat /path/to/gold.tab \       # gold annotations in TAC12/13 format
    | cut -f1,2,3 \           # keep only the first three columns
    > gold.tab
./nel prepare-tac \
    -q /path/to/gold.xml \    # gold queries/mentions file
    gold.tab \                # gold KB/NIL annotations file
    > gold.combined.tsv

Convert system output to evaluation format

For data in TAC14 format (see the Data format wiki page):

./nel prepare-tac \
    -q /path/to/system.xml \  # system mentions file
    /path/to/system.tab \     # system KB/NIL annotations
    > system.combined.tsv

For data in TAC12 and TAC13 format, add dummy NE type column first, e.g.:

cat /path/to/system.tab \     # system annotations in TAC12/13 format
    | awk 'BEGIN{OFS="\t"} {print $1,$2,"NA",$3}' \ # insert dummy NE type ("NA") as the third column
    > system.tab
./nel prepare-tac \
    -q /path/to/gold.xml \    # gold queries/mentions file
    system.tab \              # system KB/NIL annotations
    > system.combined.tsv

Evaluate system output

To calculate micro-averaged scores for all evaluation measures:

./nel evaluate \
    -m all \                  # report all evaluation measures
    -f tab \                  # print results in tab-separated format
    -g gold.combined.tsv \    # prepared gold standard annotation
    system.combined.tsv \     # prepared system output
    > system.evaluation

To list available evaluation measures:

./nel list-measures
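
Individual measures can also be requested by name. For example, to report only strong_typed_link_match (the TAC14 wikification measure used in the confidence example below):

./nel evaluate \
    -m strong_typed_link_match \ # report a single named measure
    -f tab \                     # print results in tab-separated format
    -g gold.combined.tsv \       # prepared gold standard annotation
    system.combined.tsv \        # prepared system output
    > system.stlm.evaluation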

Advanced usage

The following describes additional commands for analysis. See also run_tac14_all.sh (TODO) and run_tac13_all.sh.

Calculate confidence intervals

To calculate confidence intervals using bootstrap resampling:

./nel confidence \
    -m strong_typed_link_match \ # report CI for TAC14 wikification measure
    -f tab \                  # print results in tab-separated format
    -g gold.combined.tsv \    # prepared gold standard annotation
    system.combined.tsv \     # prepared system output
    > system.confidence

We recommend installing joblib (pip install joblib) and passing -j NUM_JOBS to run this in parallel. The calculation is also faster if an individual evaluation measure is specified (e.g., strong_typed_link_match) rather than a group of measures (e.g., tac).
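
For example (the number of jobs here is illustrative; choose a value to suit your machine):

./nel confidence \
    -m strong_typed_link_match \ # report CI for TAC14 wikification measure
    -j 8 \                       # number of parallel jobs (requires joblib)
    -f tab \                     # print results in tab-separated format
    -g gold.combined.tsv \       # prepared gold standard annotation
    system.combined.tsv \        # prepared system output
    > system.confidence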

The run_report_confidence.sh script is available to create reports comparing multiple systems.

Note that bootstrap resampling is not appropriate for nil clustering measures. For more detail, see the Significance wiki page.

Calculate significant differences

It is also possible to calculate pairwise differences:

./nel significance \
    --permute \               # use permutation method
    -f tab \                  # print results in tab-separated format
    -g gold.combined.tsv \    # prepared gold standard annotation
    system1.combined.tsv \    # prepared system1 output
    system2.combined.tsv \    # prepared system2 output
    > system1-system2.significance

We recommend calculating significance for selected system pairs, as it can take a while over all N choose 2 combinations of systems. As with confidence intervals, you can use -j NUM_JOBS to run this in parallel.
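
For example, to run the comparison above in parallel:

./nel significance \
    --permute \               # use permutation method
    -j 8 \                    # number of parallel jobs (illustrative value)
    -f tab \                  # print results in tab-separated format
    -g gold.combined.tsv \    # prepared gold standard annotation
    system1.combined.tsv \    # prepared system1 output
    system2.combined.tsv \    # prepared system2 output
    > system1-system2.significance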

As noted above, bootstrap resampling is not appropriate for nil clustering measures. For more detail, see the Significance wiki page.

Analyze error types

To create a table of classification errors:

./nel analyze \
    -s \                      # print summary table
    -g gold.combined.tsv \    # prepared gold standard annotation
    system.combined.tsv \     # prepared system output
    > system.analysis

Without the -s flag, the analyze command will list and categorize differences between the gold standard and system output.
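
For example:

./nel analyze \
    -g gold.combined.tsv \    # prepared gold standard annotation
    system.combined.tsv \     # prepared system output
    > system.differences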

Filter data for evaluation on subsets

The following describes a workflow for evaluation over subsets of mentions. See also run_tac14_filtered.sh (TODO) and run_tac13_filtered.sh.

Filter prepared data

Prepared data is in a simple tab-separated format with one mention per line and six columns: document_id, start_offset, end_offset, kb_or_nil_id, score, entity_type. It is possible to use command line tools (e.g., grep, awk) to select mentions for evaluation, e.g.:

cat gold.combined.tsv \       # prepared gold standard annotation
    | egrep "^eng-(NG|WL)-" \ # select newsgroup (NG) and web log (WL) mentions
    > gold.WB.tsv             # filtered gold standard annotation
cat system.combined.tsv \     # prepared system output
    | egrep "^eng-(NG|WL)-" \ # select newsgroup (NG) and web log (WL) mentions
    > system.WB.tsv           # filtered system output
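
Mentions can likewise be filtered on other columns. For instance, to select mentions by entity type (the sixth column), assuming PER is one of the type labels used in your data:

cat gold.combined.tsv \           # prepared gold standard annotation
    | awk -F'\t' '$6 == "PER"' \  # keep mentions with (assumed) type label PER
    > gold.PER.tsv                # filtered gold standard annotation
cat system.combined.tsv \         # prepared system output
    | awk -F'\t' '$6 == "PER"' \  # keep mentions with (assumed) type label PER
    > system.PER.tsv              # filtered system output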

Evaluate on filtered data

After filtering, evaluation is run as before:

./nel evaluate \
    -m all \                  # report all evaluation measures
    -f tab \                  # print results in tab-separated format
    -g gold.WB.tsv \          # filtered gold standard annotation
    system.WB.tsv \           # filtered system output
    > system.WB.evaluation