From 3a464dde912d0aca1f573a03644924767ff61d3f Mon Sep 17 00:00:00 2001
From: Kirill Tsukanov
Date: Tue, 16 Aug 2022 14:09:14 +0100
Subject: [PATCH] Update manual curation README to explain normalisation procedure

---
 mappings/disease/README.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mappings/disease/README.md b/mappings/disease/README.md
index cb7cd5c..d38bf44 100644
--- a/mappings/disease/README.md
+++ b/mappings/disease/README.md
@@ -17,8 +17,26 @@ When amending the file manually, make sure to follow the format:
 
 For introducing the changes, the file could be imported into Google Sheets and exported back as TSV.
 
+### Normalisation script
+
 The maintenance script, `normalise.py`, reads the current manual mappings file (`manual_string.tsv`), performs certain normalisations (such as sorting and duplicate removal), and outputs the updated mappings as `efo/manual_string_NORM.tsv`. This file can then be inspected and moved to replace the original input file. To use the script, install dependencies: `pip install --upgrade pandas ontoma`.
 
+Note that if several records are present for a pair of (PROPERTY_TYPE, SEMANTIC_TAG), only one is kept during the deduplication (the most recent one by ANNOTATION_DATE). Case normalisation is also done during this process. For example, out of these three lines:
+
+| STUDY    | BIOENTITY | PROPERTY_TYPE | PROPERTY_VALUE      | SEMANTIC_TAG                         | ANNOTATOR   | ANNOTATION_DATE |
+|----------|-----------|---------------|---------------------|--------------------------------------|-------------|-----------------|
+| Genebass |           | disease       | atrial fibrillation | http://www.ebi.ac.uk/efo/EFO_0000275 | Annotator 1 | 2020-02-30      |
+| Genebass |           | disease       | Atrial fibrillation | http://www.ebi.ac.uk/efo/EFO_0000275 | Annotator 2 | 2022-08-16      |
+| ClinVar  |           | disease       | atrial fibrillation | http://www.ebi.ac.uk/efo/EFO_0000275 | Annotator 3 | 2021-06-02      |
+
+Only this one will be kept:
+
+| STUDY    | BIOENTITY | PROPERTY_TYPE | PROPERTY_VALUE      | SEMANTIC_TAG                         | ANNOTATOR   | ANNOTATION_DATE |
+|----------|-----------|---------------|---------------------|--------------------------------------|-------------|-----------------|
+| Genebass |           | disease       | Atrial fibrillation | http://www.ebi.ac.uk/efo/EFO_0000275 | Annotator 2 | 2022-08-16      |
+
+It is assumed that any code which uses the `manual_string.tsv` file will also apply case normalisation when comparing values. This is already done in ZOOMA and OnToma.
+
 ## Ontology to ontology
 
 The second file, `manual_xref.tsv`, is currently not used and only exists as a placeholder.
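
For readers of this patch, the deduplication rule described under "Normalisation script" can be sketched roughly as follows. This is an illustrative sketch, not the actual `normalise.py`. The helper names `dedup_key` and `deduplicate` are hypothetical, and it assumes that case normalisation means the case-folded PROPERTY_VALUE takes part in the comparison key alongside PROPERTY_TYPE and SEMANTIC_TAG. Dates are compared as plain ISO-style strings, so no date parsing is involved.

```python
# Illustrative sketch only; this is NOT the real normalise.py.
COLUMNS = ["STUDY", "BIOENTITY", "PROPERTY_TYPE", "PROPERTY_VALUE",
           "SEMANTIC_TAG", "ANNOTATOR", "ANNOTATION_DATE"]

EFO_AF = "http://www.ebi.ac.uk/efo/EFO_0000275"

# The three example records from the README table.
RAW_ROWS = [
    ("Genebass", "", "disease", "atrial fibrillation", EFO_AF, "Annotator 1", "2020-02-30"),
    ("Genebass", "", "disease", "Atrial fibrillation", EFO_AF, "Annotator 2", "2022-08-16"),
    ("ClinVar", "", "disease", "atrial fibrillation", EFO_AF, "Annotator 3", "2021-06-02"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in RAW_ROWS]


def dedup_key(row):
    """Comparison key (assumption: the property value is compared case-insensitively)."""
    return (row["PROPERTY_TYPE"].lower(),
            row["PROPERTY_VALUE"].lower(),
            row["SEMANTIC_TAG"])


def deduplicate(rows):
    """Keep, for each key, only the record with the latest ANNOTATION_DATE."""
    latest = {}
    for row in rows:
        key = dedup_key(row)
        kept = latest.get(key)
        # YYYY-MM-DD strings sort chronologically, so string comparison suffices.
        if kept is None or row["ANNOTATION_DATE"] > kept["ANNOTATION_DATE"]:
            latest[key] = row
    return list(latest.values())


result = deduplicate(ROWS)
for row in result:
    print("\t".join(row[column] for column in COLUMNS))
```

The surviving record keeps its original capitalisation; only the comparison key is lower-cased, which matches the example table where "Atrial fibrillation" is retained as written.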