Intent classification summary metrics

Story

Given an existing set of intent testing results, the user wants to determine the performance of each intent, particularly the 'precision' (positive predictive value), 'recall' (true positive rate), and 'f-score'. From this performance the user can decide which intents need further training or revision.
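
For reference, the three metrics can be written out using their standard definitions. This is a sketch, not taken from the tool's source: TP, FP, and FN are the per-intent counts of true positives, false positives, and false negatives. It assumes the f-score is the balanced F1 score, which is consistent with the values in the sample output in this document.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```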

Workflow

There are two starting possibilities:

1. The user has run an existing k-folds or blind test, which creates a summary .csv file.
2. The user has a separate analysis that creates a .csv file with "golden" and "predicted" columns.

The user runs intentmetrics.py, providing an input filename and an output filename. If using workflow 2, the user also provides the names of the golden and predicted column headers with -g and -t respectively. The summary is written to the output filename.

Prerequisite

The user must have an input .csv file containing at least two columns whose headers identify the "golden" and "predicted" intents.
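
Before running the tool, it may help to confirm that the input file has both columns. The following is a minimal sketch, not part of intentmetrics.py: the helper name `missing_columns` is hypothetical, and its defaults mirror the default header names "golden intent" and "predicted intent" mentioned in this document.

```python
import csv
import io


def missing_columns(csv_file, golden="golden intent", predicted="predicted intent"):
    """Return the required header names that are absent from an open CSV file."""
    reader = csv.DictReader(csv_file)
    headers = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in (golden, predicted) if name not in headers]


# Example with an in-memory file; for a real file use open(path, newline="").
sample = io.StringIO('"predicted","golden"\n"intent1","intent1"\n')
print(missing_columns(sample, golden="golden", predicted="predicted"))  # prints []
```

An empty list means both headers were found; any names returned are the ones to pass with -g or -t, or to fix in the input file.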

Invocation

Workflow #1: This assumes the input file is at data/test-out.csv, created by other tools in this repository, and writes the summary to data/test-out-metrics.csv.

python3 utils/intentmetrics.py -i data/test-out.csv -o data/test-out-metrics.csv

Workflow #2: This assumes the input file is at data/golden_vs_predicted.csv and writes the summary to data/golden_vs_predicted-metrics.csv. Because a different tool created the input file, we need to specify the names of the golden and predicted column headers, which otherwise default to "golden intent" and "predicted intent".

Example contents of data/golden_vs_predicted.csv:

"predicted","golden"
"intent1","intent1"
"intent1","intent2"
"intent1","intent2"
"intent2","intent2"
"intent2","intent2"

Invoke via:

python3 utils/intentmetrics.py -i data/golden_vs_predicted.csv -o data/golden_vs_predicted-metrics.csv -t "predicted" -g "golden"

Sample output

Given the small example file data/golden_vs_predicted.csv above, the following CSV is generated:

"intent","number of samples","number of predictions","recall","precision","f-score"
"intent1","1","3","1.0","0.3333333333333333","0.5"
"intent2","4","2","0.5","1.0","0.6666666666666666"

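To make the numbers in the sample output easier to check by hand, here is a minimal sketch that recomputes them from the example golden/predicted pairs. It is not the implementation used by intentmetrics.py: the function name `intent_metrics` is hypothetical, and the f-score is assumed to be the balanced F1 score.

```python
import csv
import io
from collections import Counter


def intent_metrics(pairs):
    """Compute per-intent recall, precision, and F1 from (golden, predicted) pairs."""
    pairs = list(pairs)
    samples = Counter(golden for golden, _ in pairs)            # rows per golden intent
    predictions = Counter(predicted for _, predicted in pairs)  # rows per predicted intent
    correct = Counter(golden for golden, predicted in pairs if golden == predicted)
    rows = []
    for intent in sorted(set(samples) | set(predictions)):
        recall = correct[intent] / samples[intent] if samples[intent] else 0.0
        precision = correct[intent] / predictions[intent] if predictions[intent] else 0.0
        denominator = precision + recall
        f_score = 2 * precision * recall / denominator if denominator else 0.0
        rows.append({
            "intent": intent,
            "number of samples": samples[intent],
            "number of predictions": predictions[intent],
            "recall": recall,
            "precision": precision,
            "f-score": f_score,
        })
    return rows


# The example file contents from this document, held in memory.
EXAMPLE_CSV = '''"predicted","golden"
"intent1","intent1"
"intent1","intent2"
"intent1","intent2"
"intent2","intent2"
"intent2","intent2"
'''

reader = csv.DictReader(io.StringIO(EXAMPLE_CSV))
example_rows = intent_metrics((row["golden"], row["predicted"]) for row in reader)
for row in example_rows:
    print(row)
```

For intent1, one of the three predictions is correct and the single golden sample is found, giving a precision of 1/3 and a recall of 1. For intent2, both predictions are correct but only two of the four golden samples are found, giving a precision of 1 and a recall of 0.5.
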
This mode also generates a treemap where:

SIZE of box relates to number of samples for that intent
COLOR of box relates to the accuracy for that intent

The treemap is organized such that the worst performing intents (by f-score) are located towards the top-right and the best performing are towards the bottom-left.
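
The same worst-first ordering can be read directly from the metrics CSV without the treemap. This is a hedged sketch that only sorts the summary rows by the "f-score" column. It does not reproduce the treemap layout, and the helper name `rank_by_f_score` is hypothetical.

```python
import csv
import io


def rank_by_f_score(metrics_file):
    """Return the summary rows sorted from lowest to highest f-score."""
    rows = list(csv.DictReader(metrics_file))
    return sorted(rows, key=lambda row: float(row["f-score"]))


# The sample output from this document, held in memory.
SAMPLE_METRICS = '''"intent","number of samples","number of predictions","recall","precision","f-score"
"intent1","1","3","1.0","0.3333333333333333","0.5"
"intent2","4","2","0.5","1.0","0.6666666666666666"
'''

ranked = rank_by_f_score(io.StringIO(SAMPLE_METRICS))
for row in ranked:
    print(row["intent"], row["f-score"])
```

With the sample output, intent1 is listed first because its f-score of 0.5 is the lowest, making it the first candidate for further training or revision.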

[Figure: treemap generated from a larger example file]
