Merge pull request #134 from charvolant/master
Release 4.0
charvolant authored Oct 13, 2021
2 parents 9bb3d30 + c4830ab commit 7c56803
Showing 200 changed files with 5,001 additions and 2,133 deletions.
8 changes: 4 additions & 4 deletions .travis.yml
@@ -10,15 +10,15 @@ branches:

before_install:
- mkdir -p ~/.m2; wget -q -O ~/.m2/settings.xml https://raw.githubusercontent.com/AtlasOfLivingAustralia/travis-build-configuration/master/travis_maven_settings_simple.xml
- sudo mkdir -p /data/lucene; sudo wget -O /data/lucene/namematching-20200214.tgz https://archives.ala.org.au/archives/nameindexes/20200214/namematching-20200214.tgz
- sudo mkdir -p /data/lucene; sudo wget -O /data/lucene/namematching-20210811.tgz https://archives.ala.org.au/archives/nameindexes/20210811/namematching-20210811.tgz
- cd /data/lucene
- sudo tar zxvf namematching-20200214.tgz
- sudo ln -s namematching-20200214 namematching
- sudo tar zxvf namematching-20210811.tgz
- sudo ln -s namematching-20210811 namematching
- ls -laF
- cd $TRAVIS_BUILD_DIR

script:
- "[ \"${TRAVIS_PULL_REQUEST}\" = \"false\" ] && mvn -P travis clean install deploy || mvn -P travis clean install"
- 'if [ "${TRAVIS_PULL_REQUEST}" = "false" ]; then mvn -P travis clean install deploy; else mvn -P travis clean install; fi'

env:
global:
469 changes: 469 additions & 0 deletions LICENSE

Large diffs are not rendered by default.

68 changes: 49 additions & 19 deletions README.md
@@ -6,11 +6,25 @@ This API borrows heavily from the name parsing great work done by [GBIF](https:/
in their [scientific name parser library](https://github.com/gbif/name-parser)
This code contains additions for handling some Australian specific issues.

## Modules

* **ala-name-matching-model** The data model used by the name matching index.
This module contains a number of useful vocabularies that you may want to
include in your application, even if you don't want to name match.
* **ala-name-matching-search** Local name index searching.
Include this in your application if you want to match names against a local name index.
* **ala-name-matching-builder** Merge taxonomies and build name indexes.
This is a separate module to the searcher so that you can build the name
index that the searcher uses, without importing a shedload of dependencies
if you just want to search for things.
* **ala-name-matching-tools** Some useful utilities that can be used to
do bulk matching for testing and the like.
* **ala-name-matching-distributions** A full distribution zip file, including
some shell scripts to get various commands going.

## Versions

Currently there are 2 versions of this library, 2.x and 3.x.
* 2.x is using lucene 4.
* 3.x is using lucene 6 or above.
Version 4.x of the library uses Lucene 8.

## Generating a name match index

@@ -41,17 +55,19 @@ You can download the IRMNG DwCA for homonyms from the following URL:

An assembly zip file for this can be downloaded from our maven repository :

[ala-name-matching-3.5-distribution.zip](http://nexus.ala.org.au/service/local/repositories/releases/content/au/org/ala/ala-name-matching/3.5/ala-name-matching-3.5-distribution.zip)
[ala-name-matching-4.0-SNAPSHOT-distribution.zip](http://nexus.ala.org.au/service/local/repositories/releases/content/au/org/ala/ala-name-matching/3.5/ala-name-matching-3.5-distribution.zip)

To generate the name index using the data described above, follow these steps. Alternatively use the [ALA Ansible scripts](https://github.com/AtlasOfLivingAustralia/ala-install)
here using the playbook [nameindexer.yml](https://github.com/AtlasOfLivingAustralia/ala-install/blob/master/ansible/nameindexer-standalone.yml) which does it all for you.

* Download the zip files linked above to a directory e.g. /data/names/ and extract them
* Download the distribution zip [ala-name-matching-3.5-distribution.zip](http://nexus.ala.org.au/service/local/repositories/releases/content/au/org/ala/ala-name-matching/3.5/ala-name-matching-3.5-distribution.zip)
* Download the distribution zip [ala-name-matching-distribution-4.0-SNAPSHOT-distribution.zip](http://nexus.ala.org.au/service/local/repositories/releases/content/au/org/ala/ala-name-matching/3.5/ala-name-matching-distribution-4.0-SNAPSHOT-distribution.zip)
and unzip it.
You will find a number of shell scripts in the base directory.
* Generate the names index with command:

```
java -jar ala-name-matching-3.5.jar --all --dwca /data/names/dwca-col --target /data/lucene/testdwc-namematching --irmng /data/names/irmng/IRMNG_DWC_HOMONYMS --common /data/names/col_vernacular.txt
./index.sh --all --dwca /data/names/dwca-col --target /data/lucene/testdwc-namematching --irmng /data/names/irmng/IRMNG_DWC_HOMONYMS --common /data/names/col_vernacular.txt
```

Please be aware that the names indexing could take over an hour to complete.
@@ -66,7 +82,7 @@ into a single, combined taxonomy.
An example command for the taxonomy builder is:

```
java --classpath <classpath> au.org.ala.names.index.TaxonomyBuilder -c /data/names/ala-taxon-config.json -w tmp -o /data/names/combined /data/names/APNI/DwC /data/names/AFD/DwC /data/names/CAAB/DwC
./merge.sh -c /data/names/ala-taxon-config.json -w tmp -o /data/names/combined /data/names/APNI/DwC /data/names/AFD/DwC /data/names/CAAB/DwC
```

More information about the merge configuration can be found [here](doc/merge-config.md).
@@ -76,14 +92,18 @@ More information about the merge configuration can be found [here](doc/merge-con
This library is built with maven. By default a `mvn install` will try to run a test suite which will fail without a local installation of a name index.
To skip this step, run a build with ```mvn install -DskipTests=true```.

The build creates 3 artefacts in the ala-name-matching/target directory:
The build creates one artefact in the `ala-name-matching-distribution/target` directory:

* ala-name-matching-distribution-4.0-SNAPSHOT-distribution.zip - zip containing the project jar and dependencies

* ala-name-matching-3.5.jar - built jar for the project code only
* ala-name-matching-3.5-distribution.zip - zip containing the project jar and dependencies
* ala-name-matching-3.5-sources.jar - source jar for the project code only
Each module contains two artefacts in the
`ala-name-matching/ala-name-matching-<module>/target` directory:

The name index for Australian names lists used in unit tests can be downloaded [from here](https://biocache.ala.org.au/archives/nameindexes/20200214) and needs to be extracted to the
directory `/data/lucene/namematching-20200214`
* ala-name-matching-<module>-4.0-SNAPSHOT.jar - built jar for the project code only
* ala-name-matching-<module>-4.0-SNAPSHOT-sources.jar - source jar for the project code only

The name index for Australian names lists used in unit tests can be downloaded [from here](https://biocache.ala.org.au/archives/nameindexes/20220629) and needs to be extracted to the
directory `/data/lucene/namematching-20210811`

## ALA Names List

@@ -116,19 +136,29 @@ The ALA Name Matching is available as a library that can be used in other projec

To use ala-name-matching, include it as a dependency in your pom file:
```
<dependency>
<groupId>au.org.ala</groupId>
<artifactId>ala-name-matching</artifactId>
<version>3.5</version>
</dependency>
<dependency>
<groupId>au.org.ala</groupId>
<artifactId>ala-name-matching-search</artifactId>
<version>4.0-SNAPSHOT</version>
</dependency>
```

If you just want the handy enums and such-like, use
```
<dependency>
<groupId>au.org.ala</groupId>
<artifactId>ala-name-matching-model</artifactId>
<version>4.0-SNAPSHOT</version>
</dependency>
```


If you are using grails 3, you may encounter problems with the newer GBIF
libraries having validation code that conflicts with spring validation.
You can correct this by using

```
compile("au.org.ala:ala-name-matching:3.5") {
compile("au.org.ala:ala-name-matching-search:4.0-SNAPSHOT") {
exclude group: 'org.slf4j', module: 'slf4j-log4j12'
exclude group: 'org.apache.bval', module: 'org.apache.bval.bundle'
}
79 changes: 79 additions & 0 deletions ala-name-matching-builder/pom.xml
@@ -0,0 +1,79 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

<modelVersion>4.0.0</modelVersion>

<parent>
<groupId>au.org.ala</groupId>
<artifactId>ala-name-matching</artifactId>
<version>4.0</version>
</parent>

<artifactId>ala-name-matching-builder</artifactId>
<packaging>jar</packaging>
<name>ALA Name Matching Taxonomy Merging and Index Building</name>
<description>Tools to first merge multiple taxonomies together and then build a searchable index out of the resulting taxonomy</description>
<dependencies>
<dependency>
<groupId>au.org.ala</groupId>
<artifactId>ala-name-matching-model</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>au.org.ala</groupId>
<artifactId>ala-name-matching-search</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.gbif</groupId>
<artifactId>dwca-io</artifactId>
<version>${dwca-io.version}</version>
<exclusions>
<exclusion>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.gbif.checklistbank</groupId>
<artifactId>checklistbank-common</artifactId>
<version>${checklist-bank.version}</version>
<exclusions>
<exclusion>
<groupId>org.gbif.registry</groupId>
<artifactId>registry-ws-client</artifactId>
</exclusion>
<exclusion>
<groupId>com.beust</groupId>
<artifactId>jcommander</artifactId>
</exclusion>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>jcl-over-slf4j</artifactId>
</exclusion>
<exclusion>
<groupId>io.dropwizard.metrics</groupId>
<artifactId>metrics-core</artifactId>
</exclusion>
<exclusion>
<groupId>io.dropwizard.metrics</groupId>
<artifactId>metrics-ganglia</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
<version>${commons-cli.version}</version>
</dependency>
</dependencies>
</project>
@@ -1,7 +1,25 @@
/*
* Copyright (c) 2021 Atlas of Living Australia
* All Rights Reserved.
*
* The contents of this file are subject to the Mozilla Public
* License Version 1.1 (the "License"); you may not use this file
* except in compliance with the License. You may obtain a copy of
* the License at http://www.mozilla.org/MPL/
*
* Software distributed under the License is distributed on an "AS
* IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or
* implied. See the License for the specific language governing
* rights and limitations under the License.
*
*/

package au.org.ala.names.index;

import au.org.ala.names.model.*;
import au.org.ala.names.util.CleanedScientificName;
import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import org.gbif.api.exception.UnparsableException;
@@ -20,6 +38,7 @@
import org.slf4j.LoggerFactory;

import javax.annotation.Nullable;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.util.*;
import java.util.function.Predicate;
@@ -79,6 +98,11 @@ public class ALANameAnalyser extends NameAnalyser {
* Pattern for bare (no proper period) rank markers
*/
protected static final Pattern LOOSE_MARKERS = Pattern.compile("\\s+(?:" + RANK_MARKERS + "|" + RANK_PLACEHOLDER_MARKERS + ")\\.?\\s+");
/**
* Pattern for unsure markers (cf, aff etc)
*/
protected static final Pattern UNSURE_MARKER = Pattern.compile("\\s+(?:cf|cfr|conf|aff)\\.?\\s+" );

/**
* Pattern for non-name characters
*/
@@ -202,22 +226,27 @@ public NameKey analyse(@Nullable NomenclaturalCode code, String scientificName,
scientificName = (left + " " + right).trim();
}
}
try {
name = this.nameParser.parse(scientificName, (rankType == null || rankType == RankType.UNRANKED) ? null : rankType.getCbRank());
if (name != null) {
nameType = name.getType();
if (rankType == null && name.getRank() != null)
rankType = RankType.getForCBRank(name.getRank());
if (UNSURE_MARKER.matcher(scientificName).find()) {
// Leave this well alone but indicate that it is doubtful
nameType = NameType.DOUBTFUL;
} else {
try {
name = this.nameParser.parse(scientificName, (rankType == null || rankType == RankType.UNRANKED) ? null : rankType.getCbRank());
if (name != null) {
nameType = name.getType();
if (rankType == null && name.getRank() != null)
rankType = RankType.getForCBRank(name.getRank());
}
} catch (UnparsableException ex) {
// Oh well, worth a try
}
} catch (UnparsableException ex) {
// Oh well, worth a try
}
if (loose) {
if (scientificNameAuthorship == null && name != null) {
String ac = this.normalise(name.authorshipComplete());
if (ac != null && !ac.isEmpty() && !(name instanceof ALAParsedName)) { // ALAParsedName indicates a phrase name; leave as-is
scientificName = name.buildName(true, true, false, true, true, false, true, false, true, false, false, false, true, true);
scientificNameAuthorship = ac;
if (loose) {
if (scientificNameAuthorship == null && name != null) {
String ac = this.normalise(name.authorshipComplete());
if (ac != null && !ac.isEmpty() && !(name instanceof ALAParsedName)) { // ALAParsedName indicates a phrase name; leave as-is
scientificName = name.buildName(true, true, false, true, true, false, true, false, true, false, false, false, true, true);
scientificNameAuthorship = ac;
}
}
}
}
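
The `UNSURE_MARKER` check added in the hunk above can be tried in isolation. The following is a minimal sketch rather than code from the commit: the class name `UnsureMarkerDemo` and the helper `isDoubtful` are hypothetical, and only the regular expression is copied from the diff. It shows that the pattern requires whitespace on both sides of the marker, so a trailing `cf.` or a word that merely starts with `aff` does not trigger the doubtful path.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the regular expression is taken from the diff.
class UnsureMarkerDemo {
    // Same expression as UNSURE_MARKER in ALANameAnalyser
    static final Pattern UNSURE_MARKER =
            Pattern.compile("\\s+(?:cf|cfr|conf|aff)\\.?\\s+");

    // Hypothetical helper: true when the name contains an unsure marker
    // surrounded by whitespace (the optional period belongs to the marker).
    static boolean isDoubtful(String scientificName) {
        return UNSURE_MARKER.matcher(scientificName).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Acacia cf. dealbata",      // marker with period
            "Acacia aff dealbata",      // marker without period
            "Acacia cfr. dealbata",     // longer alternative
            "Acacia affinis dealbata",  // ordinary word, not a marker
            "Acacia cf."                // no whitespace after the marker
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isDoubtful(s));
        }
    }
}
```

The first three samples match and the last two do not. In the commit, a match sets the name type to `NameType.DOUBTFUL` and skips the name parser entirely.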
@@ -333,7 +362,15 @@ protected <T extends Enum<T>> void loadCsv(String resource, Map<String, T> map,
*/
protected void loadPatternCsv(String resource, List<Pattern> list) {
try {
CSVReader reader = new CSVReader(new InputStreamReader(this.getClass().getResourceAsStream(resource), "UTF-8"), ',', '"', 1);
CSVParser csvParser = new CSVParserBuilder()
.withSeparator(',')
.withQuoteChar('"')
.withEscapeChar('\\')
.build();
CSVReader reader = new CSVReaderBuilder(new InputStreamReader(this.getClass().getResourceAsStream(resource), "UTF-8"))
.withCSVParser(csvParser)
.withSkipLines(1)
.build();
String[] next;
while ((next = reader.readNext()) != null) {
String label = next[0];
@@ -1,3 +1,19 @@
/*
* Copyright (c) 2021 Atlas of Living Australia
* All Rights Reserved.
*
* The contents of this file are subject to the Mozilla Public
* License Version 1.1 (the "License"); you may not use this file
* except in compliance with the License. You may obtain a copy of
* the License at http://www.mozilla.org/MPL/
*
* Software distributed under the License is distributed on an "AS
* IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or
* implied. See the License for the specific language governing
* rights and limitations under the License.
*
*/

package au.org.ala.names.index;

import au.org.ala.names.index.provider.ConceptResolutionPriority;
@@ -1,9 +1,24 @@
package au.org.ala.names.index;
/*
* Copyright (c) 2021 Atlas of Living Australia
* All Rights Reserved.
*
* The contents of this file are subject to the Mozilla Public
* License Version 1.1 (the "License"); you may not use this file
* except in compliance with the License. You may obtain a copy of
* the License at http://www.mozilla.org/MPL/
*
* Software distributed under the License is distributed on an "AS
* IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or
* implied. See the License for the specific language governing
* rights and limitations under the License.
*
*/

import au.org.ala.names.model.RankType;
import org.gbif.api.vocabulary.NomenclaturalCode;
package au.org.ala.names.index;

import java.util.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**