Commit

update examples as suggested by @zygoballus
Jorrit Poelen committed Oct 17, 2024
1 parent e4fc3af commit 9424c2b
Showing 1 changed file with 218 additions and 33 deletions.
251 changes: 218 additions & 33 deletions README.md
@@ -200,15 +200,13 @@ $ nomer version
```

### Show supported matchers
``` console
$ nomer matchers -v
```bash
nomer matchers -v
```
Result as of v0.3.2 (Oct 2022) is formatted as a table below:

Result as of v0.5.13 (July 2024) is formatted as a table below:

| name | description |
| --- | --- |
| ala | Lookup taxon in Atlas of Living Australia by name or by id using ALATaxon:* prefix. |
| --- | --- |
| batnames | Lookup BatNames taxa by name, synonym using offline-enabled database dump |
| bold-web | Use BOLD webservice to lookup taxa by bin/taxon id using BOLD:* and BOLDTaxon:* prefixes. |
| col | Lookup Catalogue of Life taxon by name or COL:* prefixed ids using offline-enabled database dump |
@@ -224,55 +222,68 @@ Result as of v0.3.2 (Oct 2022) is formatted as a table below:
| globi-correct | Scrubs names using GloBI's (taxonomic) name scrubber. Scrubbing includes removing of stopwords (e.g., undefined), correcting common typos using a "crappy" names list, parse to canonical name using gnparser (see https://github.com/GlobalNamesArchitecture/gnparser), and more. |
| globi-enrich | Uses GloBI's taxon enricher to find first term match by id or name. Uses various web apis like Encyclopedia of Life, World Registry of Marine Species (WoRMS), Integrated Taxonomic Information System (ITIS), National Biodiversity Network (NBN) and more. |
| globi-rank | Finds taxonomic rank identifiers by rank commons name (e.g., species, order, soort). Uses Wikidata taxon rank items. Caches a copy locally on first usage to allow for subsequent offline usage. |
| globi-suggest | Scrubs names using GloBI's (taxonomic) name scrubber. Scrubbing includes removing of stopwords (e.g., undefined), correcting common typos using a "crappy" names list, parse to canonical name using gnparser (see https://github.com/GlobalNamesArchitecture/gnparser), and more. |
| gn-parse | Attempts extract canonical taxonomic name from name string using https://github.com/GlobalNamesArchitecture/gnparser . |
| gulfbase | Look up taxa of https://gulfbase.org by name or id with BioGoMx:* prefix. |
| inaturalist-id | Lookup taxon in iNaturalist by id with INAT_TAXON:* prefix. |
| indexfungorum | Lookup Index Fungorum taxon by name or id using offline-enabled database dump |
| itis | Lookup ITIS taxon by name or id using offline-enabled database dump |
| itis-web | Use itis webservice to lookup taxa by id using ITIS:* prefix. |
| mdd | Lookup Mammal Diversity Database (MDD) taxon by name or id using offline-enabled database dump |
| nbn | Lookup taxon of National Biodiversity Network by id with NBN:* prefix. |
| ncbi | Lookup NCBI taxa by name, synonym or id using offline-enabled database dump |
| ncbi-web | Lookup NCBI taxon by id with NCBI:* prefix using web apis. |
| nodc | Lookup taxon in the Taxonomic Code of the National Oceanographic Data Center (NODC) by id with prefix NODC: . Maps to ITIS terms if possible. |
| openbiodiv | uses openbiodiv sparql endpoint to resolve openbiodiv terms |
| orcid-web | Lookup ORCID by id with ORCID:* prefix. |
| ott | Lookup Open Tree of Life taxon by name or (OTT\|GBIF\|WORMS\|IF\|NCBI\|IRMNG)* prefixed ids using offline-enabled database dump |
| pbdb | Lookup Paleobio Database taxon by name or id using offline-enabled database dump |
| plazi | Lookup Plazi taxon treatment by name or id using offline-enabled database dump |
| pmid-doi | resolves pubmed id to doi using https://www.ncbi.nlm.nih.gov/pmc/pmctopmid/ |
| remove-stop-words | Removes stop words (e.g., undefined) using a stop word list specified by property [nomer.taxon.name.stopword.url] . |
| tpt | Lookup TPT taxon by name or id using offline-enabled database dump |
| translate-names | Translates incoming names using a two column csv file specified by property [nomer.taxon.name.correction.url] . |
| uksi-current-name | Use UK Species Inventory to find current taxonomic name. |
| wfo | Lookup World of Flora Online taxon by name or WFO:* prefixed ids using offline-enabled database dump |
| wikidata | Lookup Wikidata taxon by name or id using offline-enabled database dump |
| wikidata-web | uses wikidata to cross-walk taxon id across taxonomies |
| worms | Lookup taxon in WoRMS by name or by id with WORMS:* prefix. |
| worms | Lookup World Register of Marine Species by name or WORMS:* prefixed ids using offline-enabled database dump |
| worms-web | Lookup taxon in WoRMS by name or by id with WORMS:* prefix. |



If you'd like to add new matchers, please open [a new issue](https://github.com/globalbioticinteractions/nomer/issues/new) and describe your desires.

### Match term by id with default matcher
### Match term by id

``` console
$ echo -e "NCBI:9606\t" | nomer append > matches.tsv
```bash
echo -e "NCBI:9606\t"\
| nomer append ncbi-web\
> matches.tsv
```

### Match term by name with default matcher
### Match term by name

``` console
$ echo -e "\tHomo sapiens" | nomer append > matches.tsv
```bash
echo -e "\tHomo sapiens"\
| nomer append ncbi-web\
> matches.tsv
```

matches.tsv should now include entries like

``` console
```bash
$ cat matches.tsv
Homo sapiens SAME_AS EOL:327955 Homo sapiens Species إنسان @ar | Insan @az | човешки @bg | মানবীয় @bn | Ljudsko biće @bs | Humà @ca | Muž @cs | Menneske @da | Mensch @de | ανθρώπινο ον @el | Humans @en | Humano @es | Gizakiaren @eu | Ihminen @fi | Homme @fr | Mutum @ha | אנושי @he | մարդու @hy | Umano @it | ადამიანის @ka | Homo @la | žmogaus @lt | Om @mo | Mens @nl | Òme @oc | Om @ro | Человек разумный современный @ru | Qenie Njerëzore @sq | மனிதன் @ta | మానవుడు @te | Aadmi @ur | umuntu @zu | Animalia | Bilateria | Deuterostomia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Theria | Eutheria | Primates | Haplorrhini | Simiiformes | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens EOL:1 | EOL:3014411 | EOL:8814528 | EOL:694 | EOL:2774383 | EOL:12094272 | EOL:4712200 | EOL:1642 | EOL:57446 | EOL:2844801 | EOL:1645 | EOL:10487985 | EOL:10509493 | EOL:4529848 | EOL:1653 | EOL:10551052 | EOL:42268 | EOL:327955 kingdom | subkingdom | infrakingdom | division | subdivision | infraphylum | superclass | class | subclass | infraclass | order | suborder | infraorder | superfamily | family | subfamily | genus | species http://eol.org/pages/327955 http://media.eol.org/content/2014/08/07/23/02836_98_68.jpg
NCBI:9606 SAME_AS NCBI:9606 Homo sapiens species human @en cellular organisms | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314146 | NCBI:9443 | NCBI:376913 | NCBI:314293 | NCBI:9526 | NCBI:314295 | NCBI:9604 | NCBI:207598 | NCBI:9605 | NCBI:9606 | superkingdom | clade | kingdom | clade | clade | clade | phylum | subphylum | clade | clade | clade | clade | superclass | clade | clade | clade | class | clade | clade | clade | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
```
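Because the rows in matches.tsv are tab-separated, individual fields can be pulled out with standard tools. The following is a minimal sketch, assuming the column order suggested by the sample output above (provided id, provided name, relation, matched id, matched name, matched rank). The file name `sample-matches.tsv` and the truncated six-field row are illustrative only, not actual nomer output.

```bash
# Sample row modeled on the NCBI:9606 line above, truncated to six
# tab-separated fields: provided id, provided name (empty), relation,
# matched id, matched name, matched rank.
printf 'NCBI:9606\t\tSAME_AS\tNCBI:9606\tHomo sapiens\tspecies\n' > sample-matches.tsv

# Keep only SAME_AS rows and print the matched id and matched name.
# Using a literal tab as field separator preserves the empty name field.
matched=$(awk -F'\t' '$3 == "SAME_AS" { print $4 "\t" $5 }' sample-matches.tsv)
printf '%s\n' "$matched"
```

The same filter should work on a real matches.tsv if the column order matches the assumption above.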

### Match term by id with JSON output
Similarly, you can match terms by id and produce JSON output, instead of tab-separated values using:

```
$ echo -e "NCBI:9606\tHomo sapiens" | nomer append ncbi-taxon-id -o json > matches.json
```bash
echo -e "NCBI:9606\tHomo sapiens"\
| nomer append ncbi-web -o json\
> matches.json
```

Now matches.json looks something like:
@@ -295,18 +306,176 @@ Now matches.json looks something like:
"@id": "NCBITaxon:2759",
"name": "Eukaryota"
},
"clade": {
"@id": "NCBITaxon:33154",
"name": "Opisthokonta"
},
"kingdom": {
"@id": "NCBITaxon:33208",
"name": "Metazoa"
},
...
"phylum": {
"@id": "NCBITaxon:7711",
"name": "Chordata"
},
"subphylum": {
"@id": "NCBITaxon:89593",
"name": "Craniata"
},
"superclass": {
"@id": "NCBITaxon:8287",
"name": "Sarcopterygii"
},
"class": {
"@id": "NCBITaxon:40674",
"name": "Mammalia"
},
"superorder": {
"@id": "NCBITaxon:314146",
"name": "Euarchontoglires"
},
"order": {
"@id": "NCBITaxon:9443",
"name": "Primates"
},
"suborder": {
"@id": "NCBITaxon:376913",
"name": "Haplorrhini"
},
"infraorder": {
"@id": "NCBITaxon:314293",
"name": "Simiiformes"
},
"parvorder": {
"@id": "NCBITaxon:9526",
"name": "Catarrhini"
},
"superfamily": {
"@id": "NCBITaxon:314295",
"name": "Hominoidea"
},
"family": {
"@id": "NCBITaxon:9604",
"name": "Hominidae"
},
"subfamily": {
"@id": "NCBITaxon:207598",
"name": "Homininae"
},
"genus": {
"@id": "NCBITaxon:9605",
"name": "Homo"
},
"path": {
"names": [
"cellular organisms",
"Eukaryota",
"Opisthokonta",
"Metazoa",
"Eumetazoa",
"Bilateria",
"Deuterostomia",
"Chordata",
"Craniata",
"Vertebrata",
"Gnathostomata",
"Teleostomi",
"Euteleostomi",
"Sarcopterygii",
"Dipnotetrapodomorpha",
"Tetrapoda",
"Amniota",
"Mammalia",
"Theria",
"Eutheria",
"Boreoeutheria",
"Euarchontoglires",
"Primates",
"Haplorrhini",
"Simiiformes",
"Catarrhini",
"Hominoidea",
"Hominidae",
"Homininae",
"Homo",
"Homo sapiens"
],
"ids": [
"NCBI:131567",
"NCBI:2759",
"NCBI:33154",
"NCBI:33208",
"NCBI:6072",
"NCBI:33213",
"NCBI:33511",
"NCBI:7711",
"NCBI:89593",
"NCBI:7742",
"NCBI:7776",
"NCBI:117570",
"NCBI:117571",
"NCBI:8287",
"NCBI:1338369",
"NCBI:32523",
"NCBI:32524",
"NCBI:40674",
"NCBI:32525",
"NCBI:9347",
"NCBI:1437010",
"NCBI:314146",
"NCBI:9443",
"NCBI:376913",
"NCBI:314293",
"NCBI:9526",
"NCBI:314295",
"NCBI:9604",
"NCBI:207598",
"NCBI:9605",
"NCBI:9606"
],
"ranks": [
"",
"superkingdom",
"clade",
"kingdom",
"clade",
"clade",
"clade",
"phylum",
"subphylum",
"clade",
"clade",
"clade",
"clade",
"superclass",
"clade",
"clade",
"clade",
"class",
"clade",
"clade",
"clade",
"superorder",
"order",
"suborder",
"infraorder",
"parvorder",
"superfamily",
"family",
"subfamily",
"genus",
"species"
]
}
}
```

Using tools like [jq](https://stedolan.github.io/jq/), you can now do things like:

```console
$ echo -e "NCBI:9606\tHomo sapiens" | nomer append -o json | jq .family
```
echo -e "NCBI:9606\tHomo sapiens"\
| nomer append -o json ncbi-web\
| jq .family
```
to list all the family taxa associated with the term.

@@ -316,20 +485,20 @@ to list all the family taxa associated with the term.
### ITIS

``` console
$ echo -e "ITIS:180547" | nomer append globi-enrich
$ echo -e "ITIS:180547" | nomer append itis
ITIS:180547 SAME_AS ITIS:180547 Enhydra lutris Species Animalia | Bilateria | Deuterostomia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Theria | Eutheria | Carnivora | Caniformia | Mustelidae | Lutrinae | Enhydra | Enhydra lutris ITIS:202423 | ITIS:914154 | ITIS:914156 | ITIS:158852 | ITIS:331030 | ITIS:914179 | ITIS:914181 | ITIS:179913 | ITIS:179916 | ITIS:179925 | ITIS:180539 | ITIS:552303 | ITIS:180545 | ITIS:552326 | ITIS:180546 | ITIS:180547 Kingdom | Subkingdom | Infrakingdom | Phylum | Subphylum | Infraphylum | Superclass | Class | Subclass | Infraclass | Order | Suborder | Family | Subfamily | Genus | Species http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180547
```

### NCBI

``` console
$ echo -e "NCBI:9606" | nomer append globi-enrich```
$ echo -e "NCBI:9606" | nomer append ncbi```
NCBI:9606 SAME_AS NCBI:9606 Homo sapiens species man @en | human @en cellular organisms | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314146 | NCBI:9443 | NCBI:376913 | NCBI:314293 | NCBI:9526 | NCBI:314295 | NCBI:9604 | NCBI:207598 | NCBI:9605 | NCBI:9606 | superkingdom | | kingdom | | | | phylum | subphylum | | | | | | | | | class | | | | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | specieshttps://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
```
### Match term by name with selected matcher

``` console
$ echo -e "\tCanis lupus" | nomer append globi-globalnames
$ echo -e "\tCanis lupus" | nomer append globalnames
Canis lupus SAME_AS NCBI:9612 Canis lupus species | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Laurasiatheria | Carnivora | Caniformia | Canidae | Canis | Canis lupus NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314145 | NCBI:33554 | NCBI:379584 | NCBI:9608 | NCBI:9611 | NCBI:9612 | superkingdom | | kingdom | | | | phylum | subphylum | | | | | | | | | class | | | | superorder | order | suborder | family | genus | species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9612
Canis lupus SAME_AS OTT:247341 Canis lupus species | | Eukaryota | Opisthokonta | Holozoa | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Laurasiatheria | Carnivora | Caniformia | Canidae | Canis | Canis lupus OTT:805080 | OTT:93302 | OTT:304358 | OTT:332573 | OTT:5246131 | OTT:691846 | OTT:641038 | OTT:117569 | OTT:147604 | OTT:125642 | OTT:947318 | OTT:801601 | OTT:278114 | OTT:114656 | OTT:114654 | OTT:458402 | OTT:4940726 | OTT:229562 | OTT:229560 | OTT:244265 | OTT:229558 | OTT:683263 | OTT:5334778 | OTT:392223 | OTT:44565 | OTT:827263 | OTT:770319 | OTT:372706 | OTT:247341 no rank | no rank | domain | no rank | no rank | kingdom | no rank | no rank | no rank | phylum | subphylum | subphylum | superclass | no rank | no rank | class | no rank | superclass | no rank | class | subclass | no rank | no rank | superorder | order | suborder | family | genus | species https://tree.opentreeoflife.org/opentree/ottol@247341
Canis lupus SAME_AS INAT_TAXON:42048 Canis lupus speciesAnimalia | Chordata | Mammalia | Carnivora | Canidae | Canis | Canis lupus kingdom | phylum | class | order | family | genus | species http://inaturalist.org/taxa/42048
@@ -347,8 +516,14 @@ In addition to appending the found matches to a provided input row, Nomer also s

Looking up _Canis lupus_ using globalnames with the replace command would look like:

``` console
$ echo -e "\tCanis lupus" | nomer replace globi-globalnames
```bash
echo -e "\tCanis lupus"\
| nomer replace globi-globalnames
```

which produces:

```
NCBI:9612 Canis lupus
```

@@ -362,25 +537,35 @@ ITIS:202423 | NCBI:40674 | NCBI:9612 Animalia | Mammalia | Canis lupus
```

Or when using a matcher that supports lookup by id:
``` console
$ echo -e "ITIS:202423 | NCBI:40674 | NCBI:9612\t" | nomer replace globi-enrich
```bash
echo -e "ITIS:202423 | NCBI:40674 | NCBI:9612\t"\
| nomer replace globi-enrich
```

would produce:

```
ITIS:202423 | NCBI:40674 | NCBI:9612 Animalia | Mammalia | Canis lupus
```
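The id and name paths in this output are delimited by ` | `. If you need one id per line for further processing, a sketch along the following lines may help. It uses only `awk` and `tail`, and the variable names are illustrative.

```bash
# Pipe-delimited id path, copied from the example output above.
ids='ITIS:202423 | NCBI:40674 | NCBI:9612'

# Split on " | " and print one id per line.
split_ids=$(printf '%s\n' "$ids" | awk -F' [|] ' '{ for (i = 1; i <= NF; i++) print $i }')
printf '%s\n' "$split_ids"

# The last element of the path is the most specific term.
last_id=$(printf '%s\n' "$split_ids" | tail -n 1)
```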

If you have an existing tabular file where the id and name columns are not the first and second respectively,
then, you can change the input/output schema. For instance, if you'd like to match on ids in the third (=2) column
and write the matching id and name in the first (=0) and second (=1) column (= default), you can do something like:
If you have an existing tabular file where the id and name columns are not the first and second respectively, then, you can change the input/output schema. For instance, if you'd like to match on ids in the third (=2) column and write the matching id and name in the first (=0) and second (=1) column (= default), you can do something like:

``` console
$ echo -e "\t\tNCBI:9606" | java -Dnomer.schema.input="[{\"column\":2,\"type\":\"externalId\"}]" -jar nomer.jar replace ncbi-taxon-id
```bash
echo -e "\t\tNCBI:9606"\
| nomer replace --properties <(echo 'nomer.schema.input=[{\"column\":2,\"type\":\"externalId\"}]') ncbi-web
```

which would produce:

```
NCBI:9606 Homo sapiens NCBI:9606
```
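The schema counts columns from zero, while tools like `awk` count fields from one, which is easy to mix up. The following sketch, which does not call nomer, checks which field of the example input corresponds to column 2. The variable names are illustrative.

```bash
# The example input row: two empty columns followed by the id.
row=$(printf '\t\tNCBI:9606')

# awk fields are 1-based, so zero-based column 2 is awk field 3.
id_column=$(printf '%s\n' "$row" | awk -F'\t' '{ print $(2 + 1) }')
field_count=$(printf '%s\n' "$row" | awk -F'\t' '{ print NF }')

echo "column 2 holds: $id_column (row has $field_count columns)"
```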

To avoid escaping of double quotes (i.e. ```"``` -> ```\"```), and to keep your commands relatively short, perhaps an easier way to change the input / output schema is to save the default properties to a file using ```nomer properties > my.properties```.
Now, edit the properties ```nomer.schema.input``` and ```nomer.schema.output``` to suit your needs. After you are done, you can use the properties by running something like:

``` console
$ echo -e "\t\tNCBI:9606" | nomer --properties=my.properties replace ncbi-taxon-id
$ echo -e "\t\tNCBI:9606" | nomer replace --properties=my.properties ncbi-web
NCBI:9606 Homo sapiens NCBI:9606
```
... to reproduce the results from the previous example.
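As a rough sketch of what the edited property could look like, the snippet below writes the input schema from the example above to a file with a quoted here-document, so the double quotes need no escaping. It assumes that a properties file containing only the overridden key is acceptable. The documented route remains saving the full defaults with `nomer properties` and editing them.

```bash
# Write a properties file with the input schema from the example above.
# The quoted 'EOF' delimiter stops the shell from interpreting the
# double quotes, so no backslash escaping is needed.
cat > my.properties <<'EOF'
nomer.schema.input=[{"column":2,"type":"externalId"}]
EOF

# Confirm the property was written verbatim.
grep '^nomer.schema.input=' my.properties
```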