Add description of mappers (closes #43)
rsgoncalves committed Jun 5, 2024
1 parent 275181d commit 133be13
Showing 1 changed file with 55 additions and 14 deletions.
69 changes: 55 additions & 14 deletions README.md
@@ -10,44 +10,64 @@ pip install text2term
## Basic Examples

<details>
<summary><u>Examples of Programmatic Use</u></summary>
<summary><b>Examples of Programmatic Mapping</b></summary>

### Examples of Programmatic Use
text2term supports mapping strings specified in different input formats:
### Examples of Programmatic Mapping
text2term supports mapping strings specified in multiple input formats. In the first example, we map strings in a list to an ontology specified by its URL:

```python
import text2term

# map strings in a list to an ontology specified by its URL
import text2term
dfl = text2term.map_terms(source_terms=["asthma", "acute bronchitis"],
                          target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
```

# map strings listed in a file 'test/unstruct_terms.txt' to an ontology specified by its URL
There is also support for file-based input, for example a file containing a list of strings:
```python
dff = text2term.map_terms(source_terms="test/unstruct_terms.txt",
                          target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
```

or a table where we can specify the column of terms to map and the table value separator:
```python
dff = text2term.map_terms(source_terms="test/some_table.tsv",
                          csv_columns=('diseases','optional_ids'), separator="\t",
                          target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
```

# map strings in a dictionary with associated tags to an ontology specified by its URL
Finally, it is possible to map strings in a dictionary with associated tags that are preserved in the output:
```python
dfd = text2term.map_terms(source_terms={"asthma":"disease", "acute bronchitis":["disease", "lung"]},
                          target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
```

text2term supports caching an ontology for repeated use:
</details>

<details>
<summary><b>Examples of Programmatic Caching</b></summary>

### Examples of Programmatic Caching
text2term supports caching an ontology for repeated use. The next example caches an ontology and gives it a name for later use:
```python
# cache ontology and give it a name for use later on
mondo = text2term.cache_ontology(ontology_url="http://purl.obolibrary.org/obo/mondo.owl",
                                 ontology_acronym="MONDO")
```

# now map strings to the cached ontology by specifying as `target_ontology` the name chosen above and the flag `use_cache=True`
dfc = text2term.map_terms(source_terms=["asthma", "acute bronchitis"], target_ontology="MONDO", use_cache=True)
Now we can map strings to the cached ontology by specifying as `target_ontology` the name chosen above and setting the flag `use_cache=True`:

# or more succinctly, use the OntologyCache object `mondo`
```python
dfc = text2term.map_terms(source_terms=["asthma", "acute bronchitis"],
                          target_ontology="MONDO", use_cache=True)
```

More succinctly, we can call `map_terms` directly on the returned `OntologyCache` object `mondo`:
```python
dfo = mondo.map_terms(source_terms=["asthma", "acute bronchitis"])
```
</details>


<details>
<summary><u><b>Examples of Command Line Interface Use</b></u></summary>
<summary><b>Examples of Command Line Interface Use</b></summary>

### Examples of Command Line Interface Use
To show a help message describing all arguments type into a terminal:
@@ -281,3 +301,24 @@ To display a help message with descriptions of tool arguments do:
`-u` Include all unmapped terms in the output

</details>


## Supported Mappers

The mapping score associated with each mapping is indicative of how similar an input term is to an ontology term (via its labels or synonyms). The mapping/similarity scores generated by text2term are the result of applying one of the following "mappers":

TF-IDF-based mapper
: [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf), a statistical measure often used in information retrieval, measures how important a word is to a document in a corpus of documents. We first generate TF-IDF-based vectors of the source terms and of labels and synonyms of ontology terms. Then we compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between vectors to determine how similar a source term is to a target term (label or synonym).

BioPortal Web API-based mapper
: uses an interface to the [BioPortal Annotator](https://bioportal.bioontology.org/annotator) that we built to allow mapping terms to ontologies in the [BioPortal](https://bioportal.bioontology.org) repository. To use it, make sure to specify the target ontology name(s) as they appear in BioPortal.

: _Note_: there are no confidence scores associated with BioPortal annotations, so we decided to set the mapping score of all mappings to 1.

Zooma Web API-based mapper
: uses a [Zooma](https://www.ebi.ac.uk/spot/zooma/) interface that we built to allow mapping terms to ontologies in the [Ontology Lookup Service (OLS)](https://www.ebi.ac.uk/ols4) repository. To use it, make sure to specify the target ontology name(s) as they appear in OLS.

Syntactic distance-based mappers
: text2term supports commonly used syntactic (edit) distance metrics: Levenshtein, Jaro, Jaro-Winkler, Jaccard, and Indel. We use the [nltk](https://pypi.org/project/nltk/) package to compute Jaccard distances, and [rapidfuzz](https://pypi.org/project/rapidfuzz/) for all others.
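
To give a feel for what a syntactic distance-based mapper computes, here is a minimal standard-library sketch of two of the metrics named above. It is an illustration only, not text2term's code: text2term delegates to rapidfuzz and nltk, and the function names, the whitespace tokenization used for Jaccard, and the length-based normalization of Levenshtein are all assumptions made for this example.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Map the edit distance to [0, 1] by dividing by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def jaccard_similarity(a, b):
    """Jaccard similarity over sets of lowercase, whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pairwise_scores(source_terms, ontology_labels, similarity):
    """Score every (source term, label) pair with the given similarity function."""
    return [(term, label, similarity(term, label))
            for term in source_terms
            for label in ontology_labels]


if __name__ == "__main__":
    labels = ["asthma", "bronchitis", "acute bronchitis"]
    for row in pairwise_scores(["asthma", "acute bronchits"], labels, levenshtein_similarity):
        print(row)
```

Because `pairwise_scores` compares each source term with each label, the number of comparisons grows with the product of the two list sizes.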

_Note_: syntactic distance-based mappers and Web API-based mappers are much slower than the TF-IDF mapper. The former perform a pairwise comparison between each input string and each ontology term label or synonym; the latter incur networking and API load overheads.
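
The idea behind the TF-IDF-based mapper (vectorize source terms and ontology labels, then compare vectors by cosine similarity) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not text2term's implementation: the word-level tokenizer, the smoothed IDF formula, and every function name below are hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse {token: weight} vector per document, using smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each token.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            token: (count / len(tokens))
                   * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matches(source_terms, ontology_labels):
    """For each source term, return the (label, score) pair with the highest score."""
    # Fit the weights on source terms and labels together so they share one vocabulary.
    vectors = tfidf_vectors(list(source_terms) + list(ontology_labels))
    source_vecs = vectors[:len(source_terms)]
    label_vecs = vectors[len(source_terms):]
    results = {}
    for term, vec in zip(source_terms, source_vecs):
        score, label = max(
            (cosine_similarity(vec, label_vec), label)
            for label_vec, label in zip(label_vecs, ontology_labels)
        )
        results[term] = (label, score)
    return results


if __name__ == "__main__":
    labels = ["asthma", "bronchitis", "acute bronchitis", "diabetes mellitus"]
    print(best_matches(["asthma", "acute bronchitis"], labels))
```

The score in each result plays the role of the mapping score described at the start of this section: values near 1 indicate a source term and label with nearly identical weighted token content.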
