Add description of mappers (closes #43)

ccb-hms · Jun 5, 2024 · 133be13
1 parent 275181d
commit 133be13
Showing 1 changed file with 55 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -10,44 +10,64 @@ pip install text2term
 ## Basic Examples
 
 <details>
-  <summary><u>Examples of Programmatic Use</u></summary>
+  <summary><b>Examples of Programmatic Mapping</b></summary>
 
-### Examples of Programmatic Use
-text2term supports mapping strings specified in different input formats:
+### Examples of Programmatic Mapping
+text2term supports mapping strings specified in multiple input formats. In the first example, we map strings in a list to an ontology specified by its URL:
 
 ```python
-import text2term
-
-# map strings in a list to an ontology specified by its URL 
+import text2term 
 dfl = text2term.map_terms(source_terms=["asthma", "acute bronchitis"], 
                           target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
+```
 
-# map strings listed in a file 'test/unstruct_terms.txt' to an ontology specified by its URL
+There is also support for file-based input, for example a file containing a list of strings:
+```python
 dff = text2term.map_terms(source_terms="test/unstruct_terms.txt", 
                           target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
+```
+
+or a table, where we can specify the column of terms to map (optionally followed by a column of term identifiers) and the table value separator:
+```python
+dff = text2term.map_terms(source_terms="test/some_table.tsv", 
+                          csv_columns=('diseases','optional_ids'), separator="\t",
+                          target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
+```
 
-# map strings in a dictionary with associated tags to an ontology specified by its URL
+Finally, it is possible to map strings in a dictionary with associated tags that are preserved in the output:
+```python
 dfd = text2term.map_terms(source_terms={"asthma":"disease", "acute bronchitis":["disease", "lung"]}, 
                           target_ontology="http://purl.obolibrary.org/obo/mondo.owl")
 ```
 
-text2term supports caching an ontology for repeated use:
+</details>
+
+<details>
+  <summary><b>Examples of Programmatic Caching</b></summary>
+
+### Examples of Programmatic Caching
+text2term supports caching an ontology for repeated use. The next example caches an ontology and gives it a name for later use:
 ```python
-# cache ontology and give it a name for use later on
 mondo = text2term.cache_ontology(ontology_url="http://purl.obolibrary.org/obo/mondo.owl", 
                                  ontology_acronym="MONDO")
+```
 
-# now map strings to the cached ontology by specifying as `target_ontology` the name chosen above and the flag `use_cache=True`
-dfc = text2term.map_terms(source_terms=["asthma", "acute bronchitis"], target_ontology="MONDO", use_cache=True)
+Now we can map strings to the cached ontology by specifying as `target_ontology` the name chosen above and setting the flag `use_cache=True`:
 
-# or more succinctly, use the OntologyCache object `mondo`
+```python
+dfc = text2term.map_terms(source_terms=["asthma", "acute bronchitis"], 
+                          target_ontology="MONDO", use_cache=True)
+```
+
+More succinctly, we can use the returned `OntologyCache` object `mondo` as such:
+```python
 dfo = mondo.map_terms(source_terms=["asthma", "acute bronchitis"])
 ```
 </details>
 
 
 <details>
-  <summary><u><b>Examples of Command Line Interface Use</b></u></summary>
+  <summary><b>Examples of Command Line Interface Use</b></summary>
 
 ### Examples of Command Line Interface Use
 To show a help message describing all arguments type into a terminal:
@@ -281,3 +301,24 @@ To display a help message with descriptions of tool arguments do:
 `-u` Include all unmapped terms in the output
 
 </details>
+
+
+## Supported Mappers 
+
+The mapping score associated with each mapping is indicative of how similar an input term is to an ontology term (via its labels or synonyms). The mapping/similarity scores generated by text2term are the result of applying one of the following "mappers":
+
+TF-IDF-based mapper
+: [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf), a statistical measure often used in information retrieval, measures how important a word is to a document in a corpus of documents. We first generate TF-IDF-based vectors of the source terms and of labels and synonyms of ontology terms. Then we compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between vectors to determine how similar a source term is to a target term (label or synonym).
+
+BioPortal Web API-based mapper 
+: uses an interface to the [BioPortal Annotator](https://bioportal.bioontology.org/annotator) that we built to allow mapping terms to ontologies in the [BioPortal](https://bioportal.bioontology.org) repository. To use it, make sure to specify the target ontology name(s) as they appear in BioPortal. 
+
+: _Note_: there are no confidence scores associated with BioPortal annotations, so we decided to set the mapping score of all mappings to 1.
+
+Zooma Web API-based mapper
+: uses a [Zooma](https://www.ebi.ac.uk/spot/zooma/) interface that we built to allow mapping terms to ontologies in the [Ontology Lookup Service (OLS)](https://www.ebi.ac.uk/ols4) repository. To use it, make sure to specify the target ontology name(s) as they appear in OLS. 
+
+Syntactic distance-based mappers
+: text2term supports commonly used syntactic (edit) distance metrics: Levenshtein, Jaro, Jaro-Winkler, Jaccard, and Indel. We use the [nltk](https://pypi.org/project/nltk/) package to compute Jaccard distances, and [rapidfuzz](https://pypi.org/project/rapidfuzz/) for all others.
+
+_Note_: syntactic distance-based mappers and Web API-based mappers are much slower than the TF-IDF mapper. The former perform pairwise comparisons between each input string and each ontology term label/synonym, while the latter incur networking and API load overheads.
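
To make the TF-IDF mapper description above more concrete, here is a minimal, standard-library-only sketch of the general idea: build TF-IDF vectors for a source term and for candidate labels, then rank the labels by cosine similarity. It is not text2term's implementation. The character trigram tokenizer, the smoothed IDF formula, and the helper names (`char_ngrams`, `tfidf_vectors`, `cosine_similarity`, `best_match`) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams (assumed tokenizer)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]


def tfidf_vectors(documents, n=3):
    """Return one sparse TF-IDF vector (a dict of n-gram -> weight) per document."""
    tokenized = [Counter(char_ngrams(doc, n)) for doc in documents]
    # Document frequency: number of documents in which each n-gram occurs.
    doc_freq = Counter(gram for counts in tokenized for gram in counts)
    total = len(documents)
    vectors = []
    for counts in tokenized:
        length = sum(counts.values())
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            gram: (count / length) * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(source_term, target_labels):
    """Return the (label, score) pair with the highest cosine similarity."""
    vectors = tfidf_vectors([source_term] + list(target_labels))
    source_vec, label_vecs = vectors[0], vectors[1:]
    scored = [(label, cosine_similarity(source_vec, vec))
              for label, vec in zip(target_labels, label_vecs)]
    return max(scored, key=lambda pair: pair[1])


label, score = best_match("asthma", ["acute bronchitis", "asthma", "childhood asthma"])
print(label, round(score, 3))
```

In this sketch each string is vectorized once, and scoring a source term against a label is a cheap sparse dot product. An edit-distance mapper instead has to run a full string comparison for every source term and label/synonym pair, which is the cost difference the note above refers to.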