These scripts are distributed as part of the Clinical Cases Coding in Spanish language Track (CodiEsp). They are intended to be run via command line:
$> python codiespD_P_evaluate.py -g /path/to/gold_standard.tsv -p /path/to/predictions.tsv -c /path/to/codes.tsv
$> python comp_f1_diag_proc.py -g /path/to/gold_standard.tsv -p /path/to/predictions.tsv -c /path/to/codes.tsv
$> python codiespX_evaluate.py -g /path/to/gold_standard.tsv -p /path/to/predictions.tsv -cD /path/to/codes-D.tsv -cP path/to/codes-P.tsv
They compute the evaluation metrics for the corresponding tasks (Mean Average Precision for CodiEsp-D and CodiEsp-P and F-score for CodiEsp-X). In addition, comp_f1_diag_proc.py computes Precision, Recall and F1 for CodiEsp-D and CodiEsp-P.
Mean Average Precision (MAP) is computed using the Python implementation of TREC evaluation tool, trectools, by Palotti et al. (2019) [1].
-
gold_standard.tsv must be the gold standard files distributed in the CodiEsp Track webpage.
-
predictions.tsv must be the predictions file.
-
For CodiEsp-D and CodiEsp-P, it is a tab-separated file with two columns: clinical case and code. Codes must be ordered by rank. For example:
S1889-836X2016000100006-1 DIAGNOSTICO n20.0 litiasis renal
-
For CodiEsp-X, the file predictions.tsv is also a tab-separated file. In this case, with four columns: clinical case, reference position, code label, code. For example:
S1889-836X2016000100006-1 100 200 DIAGNOSTICO n20.0
-
This software requires to have Python 3 installed on your system with the libraries Pandas, NumPy, SciPy, Matplotlib and trectools. For a detailed description, see requirements.txt.
The directory structure of this repo is not required to run the Python scripts.
This directory contains the gold standard files for each of the sub-tasks, in separated directories. Each sub-directory may contain different sub-directories for each data set: sample, train, development, test, etc. Sample gold standard files and toy data are in this GitHub repository. For more gold standard files, please, visit the CodiEsp Track webpabe. Gold standard files must be in the appropriate format (such as the files distributed in the CodiEsp Track webpage).
This directory contains the submission files for each of the sub-tasks, in separated directories. Each sub-directory may contain different sub-directories for each data set: sample, train, development, test, etc. A toy data directory is provided. Files in the latter directories must be in the appropriate format (explained in the Introduction section).
This directory contains the TSV files with the lists of valid codes for the subtasks (with their descriptions in Spanish and English).
All scripts require the same two parameters:
- The --gs_path (-g) option specifies the path to the Gold Standard file.
- The --pred_path (-p) option specifies the path to the predictions file.
In addition, codiespD_P_evaluate.py and comp_f1_diag_proc.py requires an extra parameter:
- The --valid_codes_path (-c) option specifies the path to the list of valid codes for the CodiEsp subtask we are evaluating.
Finally, codiespX_evaluate.py requires two extra parameters:
- The --valid_codes_D_path (-cD) option specifies the path to the list of valid codes for the CodiEsp-D subtask.
- The --valid_codes_P_path (-cP) option specifies the path to the list of valid codes for the CodiEsp-P subtask.
$> python codiespD_P_evaluate.py -g /path/to/gold_standard.tsv -p /path/to/predictions.tsv -c /path/to/codes.tsv
$> python comp_f1_diag_proc.py -g /path/to/gold_standard.tsv -p /path/to/predictions.tsv -c /path/to/codes.tsv
$> python codiespX_evaluate.py -g /path/to/gold_standard.tsv -p /path/to/predictions.tsv -cD /path/to/codes-D.tsv -cP path/to/codes-P.tsv
Evaluate the system output pred_D.tsv against the gold standard gs_D.tsv (both inside toy_data subfolders).
$> python3 codiespD_P_evaluation.py -g gold/toy_data/gs_D.tsv -p system/toy_data/pred_D.tsv -c codiesp_codes/codiesp-D_codes.tsv
MAP estimate: 0.444
Evaluate the system output pred_X.tsv against the gold standard gs_X.tsv (both inside toy_data subfolders). Evaluation measures are Precision, Recall and F-score. A True Positive is considered when the correct code is predicted and the right reference position is also given (with 10 characters of error tolerance).
$> python3 codiespX_evaluation.py -g gold/toy_data/gs_X.tsv -p system/toy_data/pred_X.tsv -cD codiesp_codes/codiesp-D_codes.tsv -cP codiesp_codes/codiesp-P_codes.tsv
-----------------------------------------------------
Clinical case name Precision
-----------------------------------------------------
S0000-000S0000000000000-00 nan
-----------------------------------------------------
S1889-836X2016000100006-1 0.7
-----------------------------------------------------
codiespX_evaluation.py:248: UserWarning: Some documents do not have predicted codes, document-wise Precision not computed for them.
Micro-average precision = 0.636
-----------------------------------------------------
Clinical case name Recall
-----------------------------------------------------
S0000-000S0000000000000-00 nan
-----------------------------------------------------
S1889-836X2016000100006-1 0.636
-----------------------------------------------------
codiespX_evaluation.py:260: UserWarning: Some documents do not have Gold Standard codes, document-wise Recall not computed for them.
Micro-average recall = 0.538
-----------------------------------------------------
Clinical case name F-score
-----------------------------------------------------
S0000-000S0000000000000-00 nan
-----------------------------------------------------
S1889-836X2016000100006-1 0.667
-----------------------------------------------------
codiespX_evaluation.py:271: UserWarning: Some documents do not have predicted codes, document-wise F-score not computed for them.
codiespX_evaluation.py:274: UserWarning: Some documents do not have Gold Standard codes, document-wise F-score not computed for them.
Micro-average F-score = 0.583
__________________________________________________________
MICRO-AVERAGE STATISTICS:
Micro-average precision = 0.636
Micro-average recall = 0.538
Micro-average F-score = 0.583
Evaluate the system output pred_D.tsv against the gold standard gs_D.tsv (both inside toy_data subfolders). In this case, compute Precision, Recall and F1 score.
$> python3 comp_f1_diag_proc.py -g gold/toy_data/gs_D.tsv -p system/toy_data/pred_D.tsv -c codiesp_codes/codiesp-D_codes.tsv
-----------------------------------------------------
Clinical case name Precision
-----------------------------------------------------
S0004-06142006000900013-1 0.4
-----------------------------------------------------
S0210-48062005000700013-1 nan
-----------------------------------------------------
S1130-05582003000600004-2 0.5
-----------------------------------------------------
comp_f1_diag_proc.py:110: UserWarning: Some documents do not have predicted codes, document-wise Precision not computed for them.
warnings.warn('Some documents do not have predicted codes, ' +
Micro-average precision = 0.444
-----------------------------------------------------
Clinical case name Recall
-----------------------------------------------------
S0004-06142006000900013-1 0.667
-----------------------------------------------------
S0210-48062005000700013-1 0.0
-----------------------------------------------------
S1130-05582003000600004-2 0.4
-----------------------------------------------------
Micro-average recall = 0.4
-----------------------------------------------------
Clinical case name F-score
-----------------------------------------------------
S0004-06142006000900013-1 0.5
-----------------------------------------------------
S0210-48062005000700013-1 nan
-----------------------------------------------------
S1130-05582003000600004-2 0.444
-----------------------------------------------------
comp_f1_diag_proc.py:133: UserWarning: Some documents do not have predicted codes, document-wise F-score not computed for them.
warnings.warn('Some documents do not have predicted codes, ' +
Micro-average F-score = 0.421
__________________________________________________________
MICRO-AVERAGE STATISTICS:
Micro-average precision = 0.444
Micro-average recall = 0.4
Micro-average F-score = 0.421
0.444|0.4|0.421
Antonio Miranda-Escalada (antonio.miranda@bsc.es)
[1] Palotti, Joao and Scells, Harrisen and Zuccon, Guido: TrecTools: an open-source Python library for Information Retrieval practitioners involved in TREC-like campaigns, SIGIR'19, 2019, ACM.
Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estapé, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of eHealth CLEF 2020. In: CLEF (Working Notes) (2020)
@inproceedings{miranda2020overview, title={Overview of automatic clinical coding: annotations, guidelines, and solutions for non-english clinical cases at codiesp track of CLEF eHealth 2020}, author={Miranda-Escalada, Antonio and Gonzalez-Agirre, Aitor and Armengol-Estap{'e}, Jordi and Krallinger, Martin}, booktitle={Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings}, year={2020} }