Update manual curation README to explain normalisation procedure
tskir committed Aug 16, 2022
1 parent 623cdfe commit 3a464dd
Showing 1 changed file with 18 additions and 0 deletions.
18 changes: 18 additions & 0 deletions mappings/disease/README.md
@@ -17,8 +17,26 @@ When amending the file manually, make sure to follow the format:

To make changes, the file can be imported into Google Sheets and exported back as TSV.

### Normalisation script

The maintenance script, `normalise.py`, reads the current manual mappings file (`manual_string.tsv`), applies a set of normalisations (such as sorting and duplicate removal), and writes the updated mappings to `efo/manual_string_NORM.tsv`. This output file can then be inspected and moved to replace the original input file. Before running the script, install its dependencies: `pip install --upgrade pandas ontoma`.

Note that if several records share the same (PROPERTY_TYPE, SEMANTIC_TAG) pair, only one of them is kept during deduplication: the one with the most recent ANNOTATION_DATE. Case normalisation is also applied during this process. For example, out of these three lines:

| STUDY | BIOENTITY | PROPERTY_TYPE | PROPERTY_VALUE | SEMANTIC_TAG | ANNOTATOR | ANNOTATION_DATE |
|----------|-----------|---------------|---------------------|--------------------------------------|-------------|-----------------|
| Genebass | | disease | atrial fibrillation | http://www.ebi.ac.uk/efo/EFO_0000275 | Annotator 1 | 2020-02-28 |
| Genebass | | disease | Atrial fibrillation | http://www.ebi.ac.uk/efo/EFO_0000275 | Annotator 2 | 2022-08-16 |
| ClinVar | | disease | atrial fibrillation | http://www.ebi.ac.uk/efo/EFO_0000275 | Annotator 3 | 2021-06-02 |

Only this one will be kept:

| STUDY | BIOENTITY | PROPERTY_TYPE | PROPERTY_VALUE | SEMANTIC_TAG | ANNOTATOR | ANNOTATION_DATE |
|----------|-----------|---------------|---------------------|--------------------------------------|-------------|-----------------|
| Genebass | | disease | Atrial fibrillation | http://www.ebi.ac.uk/efo/EFO_0000275 | Annotator 2 | 2022-08-16 |
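
The deduplication rule can be illustrated with a minimal pandas sketch. This is not the code of `normalise.py`; the function name `deduplicate`, the constant names, and the inline sample rows (modelled on the example above) are assumptions made for illustration. It relies on ANNOTATION_DATE being an ISO `YYYY-MM-DD` string, so that plain string sorting matches chronological order.

```python
import io

import pandas as pd

# Columns that identify a duplicate, as described in this README.
KEY_COLUMNS = ["PROPERTY_TYPE", "SEMANTIC_TAG"]


def deduplicate(mappings: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent record for each (PROPERTY_TYPE, SEMANTIC_TAG) pair.

    Hypothetical helper: a sketch of the rule, not the actual script logic.
    """
    return (
        # ISO dates sort chronologically as strings, so the newest row comes last.
        mappings.sort_values("ANNOTATION_DATE", kind="stable")
        .drop_duplicates(subset=KEY_COLUMNS, keep="last")
        # Sort the surviving rows so the output order is deterministic.
        .sort_values(KEY_COLUMNS)
        .reset_index(drop=True)
    )


# Sample rows modelled on the table above (tab-separated, empty BIOENTITY).
EXAMPLE_TSV = (
    "STUDY\tBIOENTITY\tPROPERTY_TYPE\tPROPERTY_VALUE\tSEMANTIC_TAG\tANNOTATOR\tANNOTATION_DATE\n"
    "Genebass\t\tdisease\tatrial fibrillation\thttp://www.ebi.ac.uk/efo/EFO_0000275\tAnnotator 1\t2020-02-28\n"
    "Genebass\t\tdisease\tAtrial fibrillation\thttp://www.ebi.ac.uk/efo/EFO_0000275\tAnnotator 2\t2022-08-16\n"
    "ClinVar\t\tdisease\tatrial fibrillation\thttp://www.ebi.ac.uk/efo/EFO_0000275\tAnnotator 3\t2021-06-02\n"
)

# Read everything as strings and keep empty cells as "" rather than NaN.
mappings = pd.read_csv(
    io.StringIO(EXAMPLE_TSV), sep="\t", dtype=str, keep_default_na=False
)
result = deduplicate(mappings)
print(result.to_csv(sep="\t", index=False))
```

Sorting by date first and then keeping the last row per key is one simple way to express "most recent wins"; the real script may implement the same rule differently.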

It is assumed that any code that uses the `manual_string.tsv` file also performs case normalisation when comparing values. ZOOMA and OnToma already do this.
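
For code that consumes the file directly, a case-insensitive comparison could look like the following sketch. It does not reflect how ZOOMA or OnToma are implemented; the helpers `build_lookup` and `find_terms` and the sample data are hypothetical and use only the Python standard library.

```python
import csv
import io


def build_lookup(tsv_text: str) -> dict:
    """Index mappings by case-folded (PROPERTY_TYPE, PROPERTY_VALUE).

    Hypothetical helper for illustration only.
    """
    lookup = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = (row["PROPERTY_TYPE"].casefold(), row["PROPERTY_VALUE"].casefold())
        lookup.setdefault(key, set()).add(row["SEMANTIC_TAG"])
    return lookup


def find_terms(lookup: dict, property_type: str, property_value: str) -> set:
    """Return the ontology terms for a value, ignoring letter case."""
    return lookup.get((property_type.casefold(), property_value.casefold()), set())


# Sample content in the format of manual_string.tsv (one record).
SAMPLE_TSV = (
    "STUDY\tBIOENTITY\tPROPERTY_TYPE\tPROPERTY_VALUE\tSEMANTIC_TAG\tANNOTATOR\tANNOTATION_DATE\n"
    "Genebass\t\tdisease\tAtrial fibrillation\thttp://www.ebi.ac.uk/efo/EFO_0000275\tAnnotator 2\t2022-08-16\n"
)

lookup = build_lookup(SAMPLE_TSV)
terms = find_terms(lookup, "disease", "ATRIAL FIBRILLATION")
print(terms)
```

Folding case on both the stored value and the query means the capitalisation retained by the normalisation step does not affect matching.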

## Ontology to ontology

The second file, `manual_xref.tsv`, is currently not used and only exists as a placeholder.
