
# Living Atlas Pipelines extensions


This module adds the functionality required by the Living Atlases to facilitate the replacement of biocache-store for data ingress.

## Architecture

For details on the GBIF implementation, see the pipelines github repository. This project is focussed on extensions to that architecture to support use by the Living Atlases.

*(Pipelines architecture diagram)*

Above is a representation of the data flow from source data in Darwin Core archives supplied by data providers, through to API access to these data via the biocache-service component.

Within the "Interpreted AVRO" box is a list of "transforms", each of which takes the source data and produces an isolated output in an AVRO-formatted file.

GBIF's pipelines already supports a number of core transforms for handling biodiversity occurrence data. The Living Atlas pipelines extensions make use of these transforms "as-is" where possible and extend existing transforms where required.
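To make the "one output per transform" idea concrete, the interpreted data for a dataset ends up as a set of sibling AVRO outputs, one per transform. The sketch below is purely illustrative; the directory and file names are assumptions rather than the definitive on-disk structure:

```
/data/pipelines-data/<dataset-id>/1/interpreted/   # assumed layout, for illustration only
  basic/       # core occurrence interpretation
  location/    # coordinates, country, state/province
  taxonomy/    # scientific name matching
  temporal/    # event dates
  ...          # each directory holding that transform's *.avro output
```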

For information on how the architecture of the legacy biocache-store system differs from that of pipelines, see this page.

## Dependent projects

The pipelines work has necessitated some minor API additions and changes to the following components:

### biocache-service

A version 3.x of biocache-service is in development (see the pipelines branch). This version will not use Cassandra for the storage of occurrence records, but Cassandra is still required for the storage of user assertions and query identifiers (used to store large query parameters such as WKT strings).

### ala-namematching-service

A simple Dropwizard wrapper around the ala-name-matching library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.
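Once the container is up, a quick smoke test is to query the service over HTTP. The port and endpoint below are assumptions for illustration only; check the docker-compose file for the values actually used:

```bash
# Assumed port (9179) and endpoint (/api/search) - verify against the docker-compose file
curl "http://localhost:9179/api/search?q=Acacia%20dealbata"
```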

### ala-sensitive-data-service

A simple Dropwizard wrapper around the ala-sensitive-data-service library has been developed to support integration with pipelines. This service is packaged in a Docker container and is deployed as a service using Ansible. For testing locally, use the docker-compose files.

## Getting started

Ansible scripts have been developed and are available here. Below are some instructions for setting up a local development environment for pipelines. These steps will load a dataset into a SOLR index.

Software requirements:

### Setting up la-pipelines

  1. Download the shape files from here and expand them into the `/data/pipelines-shp` directory
  2. Download a test Darwin Core archive (e.g. https://archives.ala.org.au/archives/gbif/dr893/dr893.zip)
  3. Create the directory `/data/pipelines-data`
  4. Build with Maven: `mvn clean package` (the commands are collected in the sketch below)
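As a rough sketch, the setup steps above might look like this on the command line (the shape file archive name and download locations are assumptions for illustration):

```bash
# Steps 1 & 3: create the working directories used by la-pipelines
sudo mkdir -p /data/pipelines-shp /data/pipelines-data

# Step 1: expand the downloaded shape files (archive name assumed)
# unzip ~/Downloads/pipelines-shapefiles.zip -d /data/pipelines-shp

# Step 2: download a test Darwin Core archive
wget https://archives.ala.org.au/archives/gbif/dr893/dr893.zip -P ~/Downloads

# Step 4: build with Maven
mvn clean package
```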

### Running la-pipelines

  1. Start the required docker containers using:
    `docker-compose -f pipelines/src/main/docker/ala-name-service.yml up -d`
    `docker-compose -f pipelines/src/main/docker/solr8.yml up -d`
    `docker-compose -f pipelines/src/main/docker/ala-sensitive-data-service.yml up -d`
    Note: ala-sensitive-data-service.yml can be omitted if you don't need to run the SDS pipeline, but you'll then need to add
    index:
      includeSensitiveDataChecks: false
    to the file `configs/la-pipelines-local.yaml`.
  2. `cd scripts`
  3. To convert the DwCA to AVRO, run `./la-pipelines dwca-avro dr893`
  4. To interpret, run `./la-pipelines interpret dr893 --embedded`
  5. To mint UUIDs, run `./la-pipelines uuid dr893 --embedded`
  6. (Optional) To sample, run:
    1. `./la-pipelines sample dr893 --embedded`
  7. To set up SOLR:
    1. Run `cd ../solr/scripts` and then run `./update-solr-config.sh`
    2. Run `cd ../../scripts`
  8. To create the index AVRO files, run `./la-pipelines index dr893 --embedded`
  9. To generate the SOLR index, run `./la-pipelines solr dr893 --embedded`
  10. Check that the SOLR index has records by visiting http://localhost:8983/solr/#/biocache/query and clicking the "Execute query" button. It should show a non-zero value for numFound in the JSON response.
  11. Run `./la-pipelines -h` for help and more steps (a consolidated run sequence is sketched after this help output):
LA-Pipelines data ingress utility.

The la-pipelines can be executed to run all the ingress steps or only a few of them:

Pipeline ingress steps:

    ┌───── do-all ───────────────────────────────────────────────┐
    │                                                            │
dwca-avro --> interpret --> validate --> uuid --> image-sync ... │
  --> image-load --> sds --> index --> sample --> jackknife --> solr
(...)
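Putting the numbered steps above together, a full local run for the test dataset dr893 looks roughly like this, assuming the docker containers from step 1 are already running:

```bash
cd scripts
./la-pipelines dwca-avro dr893                    # convert the Darwin Core archive to AVRO
./la-pipelines interpret dr893 --embedded
./la-pipelines uuid dr893 --embedded
./la-pipelines sample dr893 --embedded            # optional
(cd ../solr/scripts && ./update-solr-config.sh)   # one-off SOLR configuration
./la-pipelines index dr893 --embedded
./la-pipelines solr dr893 --embedded
```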

## Integration Tests

Tests follow the GBIF/failsafe/surefire convention. All integration tests have a suffix of "IT". All JUnit tests are run with `mvn package` and integration tests are run with `mvn verify`.
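In practice that means:

```bash
# Unit tests only
mvn package

# Unit tests plus the *IT integration tests (these start the docker containers listed below)
mvn verify
```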

The integration tests will automatically start docker containers for the following:

- SOLR
- Elastic
- Name matching service
- SDS

## Code style and tools

For code style and tools, see the recommendations on the GBIF pipelines project. In particular, note that the project uses Project Lombok; please install the Lombok plugin for IntelliJ IDEA.

avro-tools is recommended to aid development by providing quick views of AVRO outputs. It can be installed on macOS with Homebrew like so:

`brew install avro-tools`
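For example, to dump the first few records of an interpreted AVRO file as JSON (the file path is an assumed example, not a guaranteed output location):

```bash
# tojson prints AVRO records as JSON; adjust the path to an actual output file
avro-tools tojson /data/pipelines-data/dr893/1/interpreted/taxonomy/interpret-00000-of-00001.avro | head
```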