feat(ena-submission): Add code to submit sequences to ENA and return results to Loculus (#2417)

* Add code to create project object for submission, submit to ENA and keep state in database.

* Create sample.xml objects to send to ENA using PHA4GE's metadata field mapping. Keep state in sample_table. Send Slack notifications if submission fails.

* Create assembly, add webin-cli to docker container, add call to API to check status of assembly submission.

* Map submission results to external metadata fields and upload results to Loculus.

* Fix external metadata upload issue in backend, add additional flyway version.

* add gcaAccession field to values.yaml

* use connection pooling for db requests and explicitly rollback if execute fails

* Prevent SQL injection with table_name and column_name validation

* Update docs, add details of ena-submission to README (taking from all previous PRs).

* Do not store all data from get-released-data call

* Send slack notification when step fails. 

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
anna-parker and corneliusroemer committed Sep 18, 2024
1 parent 048f7d9 commit d400725
Showing 32 changed files with 3,665 additions and 232 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/e2e-k3d.yml
@@ -34,7 +34,7 @@ jobs:
    env:
      ALL_BROWSERS: ${{ github.ref == 'refs/heads/main' || github.event.inputs.all_browsers && 'true' || 'false' }}
      sha: ${{ github.event.pull_request.head.sha || github.sha }}
      wait_timeout: ${{ github.ref == 'refs/heads/main' && 900 || 180 }}
      wait_timeout: ${{ github.ref == 'refs/heads/main' && 900 || 240 }}
    steps:
      - name: Shorten sha
        run: echo "sha=${sha::7}" >> $GITHUB_ENV
37 changes: 37 additions & 0 deletions .github/workflows/ena-submission-tests.yaml
@@ -0,0 +1,37 @@
name: ena-submission-tests
on:
  # test
  pull_request:
    paths:
      - "ena-submission/**"
      - ".github/workflows/ena-submission-tests.yaml"
  push:
    branches:
      - main
  workflow_dispatch:
concurrency:
  group: ci-${{ github.ref == 'refs/heads/main' && github.run_id || github.ref }}-ena-submission-tests
  cancel-in-progress: true
jobs:
  unitTests:
    name: Unit Tests
    runs-on: codebuild-loculus-ci-${{ github.run_id }}-${{ github.run_attempt }}
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - name: Set up micromamba
        uses: mamba-org/setup-micromamba@v1
        with:
          environment-file: ena-submission/environment.yml
          micromamba-version: 'latest'
          init-shell: >-
            bash
            powershell
          cache-environment: true
          post-cleanup: 'all'
      - name: Run tests
        run: |
          micromamba activate loculus-ena-submission
          python3 scripts/test_ena_submission.py
        shell: micromamba-shell {0}
        working-directory: ena-submission
2 changes: 2 additions & 0 deletions backend/README.md
Expand Up @@ -71,6 +71,8 @@ The service listens, by default, to **port 8079**: <http://localhost:8079/swagge
Note: When using a PostgreSQL development platform (e.g. pgAdmin), the hostname is 127.0.0.1 and not localhost; this is defined in the `deploy.py` file.
Note that we also use flyway in the ena-submission pod to create an additional schema, `ena-submission`, in the database. This schema is not added here.
### Operating the backend behind a proxy
When running the backend behind a proxy, the proxy needs to set X-Forwarded headers:
46 changes: 46 additions & 0 deletions backend/src/main/resources/db/migration/V1.3__update_view.sql
@@ -0,0 +1,46 @@
drop view if exists external_metadata_view cascade;

create view external_metadata_view as
select
sequence_entries_preprocessed_data.accession,
sequence_entries_preprocessed_data.version,
all_external_metadata.updated_metadata_at,
-- || concatenates two JSON objects by generating an object containing the union of their keys
-- taking the second object's value when there are duplicate keys.
case
when all_external_metadata.external_metadata is null then jsonb_build_object('metadata', (sequence_entries_preprocessed_data.processed_data->'metadata'))
else jsonb_build_object('metadata', (sequence_entries_preprocessed_data.processed_data->'metadata') || all_external_metadata.external_metadata)
end as joint_metadata
from
sequence_entries_preprocessed_data
left join all_external_metadata on
all_external_metadata.accession = sequence_entries_preprocessed_data.accession
and all_external_metadata.version = sequence_entries_preprocessed_data.version
and sequence_entries_preprocessed_data.pipeline_version = (select version from current_processing_pipeline);

create view sequence_entries_view as
select
se.*,
sepd.started_processing_at,
sepd.finished_processing_at,
sepd.processed_data as processed_data,
sepd.processed_data || em.joint_metadata as joint_metadata,
sepd.errors,
sepd.warnings,
case
when se.released_at is not null then 'APPROVED_FOR_RELEASE'
when se.is_revocation then 'AWAITING_APPROVAL'
when sepd.processing_status = 'IN_PROCESSING' then 'IN_PROCESSING'
when sepd.processing_status = 'HAS_ERRORS' then 'HAS_ERRORS'
when sepd.processing_status = 'FINISHED' then 'AWAITING_APPROVAL'
else 'RECEIVED'
end as status
from
sequence_entries se
left join sequence_entries_preprocessed_data sepd on
se.accession = sepd.accession
and se.version = sepd.version
and sepd.pipeline_version = (select version from current_processing_pipeline)
left join external_metadata_view em on
se.accession = em.accession
and se.version = em.version;
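
The `||` operator used in `external_metadata_view` has right-hand-wins semantics. A small Python sketch of the same behaviour (the field values below are made up for illustration):

```python
# jsonb `||` concatenation: union of keys, the second object's value wins on conflict.
processed_metadata = {"geoLocCountry": "France", "insdcAccessionFull": None}
external_metadata = {"insdcAccessionFull": "OZ076380.1", "gcaAccession": "GCA_964187725.1"}

# Equivalent of: processed_data->'metadata' || all_external_metadata.external_metadata
joint_metadata = {**processed_metadata, **external_metadata}
```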
16 changes: 15 additions & 1 deletion ena-submission/Dockerfile
@@ -1,5 +1,14 @@
FROM mambaorg/micromamba:1.5.8

# Install dependencies needed for webin-cli
USER root
RUN apt-get update && apt-get install -y \
default-jre \
wget \
&& rm -rf /var/lib/apt/lists/*
RUN mkdir -p /package && chown -R $MAMBA_USER:$MAMBA_USER /package
USER $MAMBA_USER

COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yml /tmp/env.yaml
COPY --chown=$MAMBA_USER:$MAMBA_USER .mambarc /tmp/.mambarc

@@ -10,6 +19,11 @@ RUN micromamba config set extract_threads 1 \
# Set the environment variable to activate the conda environment
ARG MAMBA_DOCKERFILE_ACTIVATE=1

COPY --chown=$MAMBA_USER:$MAMBA_USER . /package

ENV WEBIN_CLI_VERSION 7.3.1
USER root
RUN wget -q "https://github.com/enasequence/webin-cli/releases/download/${WEBIN_CLI_VERSION}/webin-cli-${WEBIN_CLI_VERSION}.jar" -O /package/webin-cli.jar
USER $MAMBA_USER

COPY --chown=$MAMBA_USER:$MAMBA_USER . /package
WORKDIR /package
31 changes: 29 additions & 2 deletions ena-submission/ENA_submission.md
Expand Up @@ -35,7 +35,7 @@ We require the following components:

- Analysis: An analysis contains secondary analysis results derived from sequence reads (e.g. a genome assembly).

At the time of writing (October 2023), in contrast to ENA, Pathoplexus has no hierarchy of study/sample/sequence: every sequence is its own study and sample. Therefore we need to figure out how to map sequences to projects: each submitter could have exactly _one_ study per organism (this is the approach we are currently taking), or each sequence could be associated with its own study.
In contrast to ENA, Pathoplexus has no hierarchy of study/sample/sequence. Therefore we have decided to create _one_ study per Loculus submission group and organism. For each Loculus sample we will create one sample (with metadata) and one sequence (with the sequence).

### Mapping sequences and studies

@@ -277,7 +277,34 @@ The following could be implemented as post-MVP features:
-password YYYYYY
```

5. Save accession numbers (these will be returned by the webin-cli)
5. Save ERZ accession numbers (these will be returned by the webin-cli)
6. Wait to receive GCA accession numbers (returned later after assignment by NCBI). This can be retrieved via https://wwwdev.ebi.ac.uk/ena/submit/report/swagger-ui/index.html

```
curl -X 'GET' \
'https://www.ebi.ac.uk/ena/submit/report/analysis-process/{erz_accession}?format=json&max-results=100' \
-H 'accept: */*' \
-H 'Authorization: Basic KEY'
```
When processing is finished the response should look like:
```
[
{
"report": {
"id": "{erz_accession}",
"analysisType": "SEQUENCE_ASSEMBLY",
"acc": "chromosomes:OZ076380-OZ076381,genome:GCA_964187725.1",
"processingStatus": "COMPLETED",
"processingStart": "14-06-2024 05:07:40",
"processingEnd": "14-06-2024 05:08:19",
"processingError": null
},
"links": []
}
]
```
## Promises made to ENA
155 changes: 143 additions & 12 deletions ena-submission/README.md
@@ -1,11 +1,94 @@
## ENA Submission
# ENA Submission

### Developing Locally
## Snakemake Rules

The ENA submission pod creates a new schema in the loculus DB, this is managed by flyway. This means to develop locally you will have to start the postgres DB locally e.g. by using the ../deploy.py script or using
### get_ena_submission_list

This rule runs daily in a cron job. It calls the Loculus backend (`get-released-data`), obtains a list of sequences that are ready for submission to ENA, and sends this list as a compressed JSON file to our Slack channel. Sequences are ready for submission if:

- data is in state APPROVED_FOR_RELEASE:
  - data must be in state "OPEN" for use
  - data must not already exist in ENA and must not be in the submission process; this means:
    - data was not submitted by the `config.ingest_pipeline_submitter`
    - data is not in the `ena-submission.submission_table`
    - as an extra check, we discard all sequences with `ena-specific-metadata` fields
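
A rough sketch of this filter in Python (field names such as `dataUseTerms` and the `ena` prefix check are illustrative stand-ins, not the actual backend schema):

```python
def is_ready_for_ena(entry: dict, already_in_submission_table: set[str],
                     ingest_pipeline_submitter: str) -> bool:
    """Decide whether a released entry belongs on the ENA submission list."""
    metadata = entry["metadata"]
    return (
        entry["status"] == "APPROVED_FOR_RELEASE"
        and metadata.get("dataUseTerms") == "OPEN"
        # not already in ENA and not mid-submission:
        and metadata.get("submitter") != ingest_pipeline_submitter
        and entry["accession"] not in already_in_submission_table
        # extra check: drop anything already carrying ENA-specific metadata
        and not any(key.startswith("ena") for key in metadata)
    )
```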

### all

This rule runs in the ena-submission pod; it runs the following rules in parallel:

#### trigger_submission_to_ena

Downloads the file at `github_url` every 30s. If the data is not already in the submission table (and is not a revision), uploads it to `ena-submission.submission_table`.

#### create_project

In a loop:

- Get sequences in `submission_table` in state READY_TO_SUBMIT
  - if (there exists an entry in the `project_table` for the corresponding (group_id, organism)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_PROJECT.
    - else: update `submission_table` to SUBMITTING_PROJECT.
  - else: create project entry in `project_table` for (group_id, organism).
- Get sequences in `submission_table` in state SUBMITTING_PROJECT
  - if (corresponding `project_table` entry is in state SUBMITTED): update entries to state SUBMITTED_PROJECT.
- Get sequences in `project_table` in state READY, prepare submission object, set status to SUBMITTING
  - if (submission succeeds): set status to SUBMITTED and fill in results: the result of a successful submission is a `bioproject_accession` and an ena-internal `ena_submission_accession`.
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `project_table` in state HAS_ERRORS for over 15 min and sequences in status SUBMITTING for over 15 min: send a slack notification
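
One pass of this loop can be sketched as follows; the table objects are plain dicts/lists standing in for the real database tables, and `submit_project` (and its returned accession) stands in for the XML submission to ENA:

```python
def create_project_pass(submission_table: list, project_table: dict, submit_project) -> None:
    """One iteration of the create_project loop (sketch, not the real code)."""
    # READY_TO_SUBMIT: attach each sequence to its (group_id, organism) project
    for sub in submission_table:
        if sub["status"] != "READY_TO_SUBMIT":
            continue
        project = project_table.get(sub["project_key"])
        if project is None:
            project_table[sub["project_key"]] = {"status": "READY"}
        elif project["status"] == "SUBMITTED":
            sub["status"] = "SUBMITTED_PROJECT"
        else:
            sub["status"] = "SUBMITTING_PROJECT"
    # SUBMITTING_PROJECT: promote once the project has landed
    for sub in submission_table:
        if (sub["status"] == "SUBMITTING_PROJECT"
                and project_table[sub["project_key"]]["status"] == "SUBMITTED"):
            sub["status"] = "SUBMITTED_PROJECT"
    # READY projects: submit, then record results or errors
    for key, project in project_table.items():
        if project["status"] != "READY":
            continue
        project["status"] = "SUBMITTING"
        try:
            project["results"] = submit_project(key)  # e.g. {"bioproject_accession": ...}
            project["status"] = "SUBMITTED"
        except Exception as err:
            project["status"] = "HAS_ERRORS"
            project["errors"] = str(err)
```

The same shape (check the child table, promote or create, then submit READY rows) repeats in create_sample and create_assembly.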

#### create_sample

Maps Loculus metadata to ENA metadata using the template: https://www.ebi.ac.uk/ena/browser/view/ERC000033

In a loop:

- Get sequences in `submission_table` in state SUBMITTED_PROJECT
  - if (there exists an entry in the `sample_table` for the corresponding (accession, version)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_SAMPLE.
    - else: update `submission_table` to SUBMITTING_SAMPLE.
  - else: create sample entry in `sample_table` for (accession, version).
- Get sequences in `submission_table` in state SUBMITTING_SAMPLE
  - if (corresponding `sample_table` entry is in state SUBMITTED): update entries to state SUBMITTED_SAMPLE.
- Get sequences in `sample_table` in state READY, prepare submission object, set status to SUBMITTING
  - if (submission succeeds): set status to SUBMITTED and fill in results; the results of a successful submission are an `sra_run_accession` (starting with ERS), a `biosample_accession` (starting with SAM) and an ena-internal `ena_submission_accession`.
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `sample_table` in state HAS_ERRORS for over 15 min and sequences in status SUBMITTING for over 15 min: send a slack notification

#### create_assembly

In a loop:

- Get sequences in `submission_table` in state SUBMITTED_SAMPLE
  - if (there exists an entry in the `assembly_table` for the corresponding (accession, version)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_ASSEMBLY.
    - else: update `submission_table` to SUBMITTING_ASSEMBLY.
  - else: create assembly entry in `assembly_table` for (accession, version).
- Get sequences in `submission_table` in state SUBMITTING_ASSEMBLY
  - if (corresponding `assembly_table` entry is in state SUBMITTED): update entries to state SUBMITTED_ASSEMBLY.
- Get sequences in `assembly_table` in state READY, prepare files (we need a chromosome_list file, fasta files and a manifest file), set status to SUBMITTING
  - if (submission succeeds): set status to WAITING and fill in results: the ena-internal `erz_accession`
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `assembly_table` in state WAITING; every 5 minutes (to not overload ENA) check if ENA has processed the assemblies and assigned them a `gca_accession`. If so, update the table to status SUBMITTED and fill in results.
- Get sequences in `assembly_table` in state HAS_ERRORS for over 15 min and sequences in status SUBMITTING for over 15 min, or in state WAITING for over 48 hours: send a slack notification
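
The throttled WAITING check might look like this in outline; `fetch_gca_accession` stands in for the report-API call described in ENA_submission.md, and the five-minute interval comes from the text above:

```python
import time

POLL_INTERVAL_SECONDS = 5 * 60  # check at most every 5 minutes to not overload ENA

def check_waiting_assemblies(assembly_table: list, fetch_gca_accession, state: dict) -> None:
    """Poll ENA for WAITING assemblies, at most once per POLL_INTERVAL_SECONDS."""
    now = time.monotonic()
    if now - state.get("last_poll", float("-inf")) < POLL_INTERVAL_SECONDS:
        return  # polled too recently; skip this round
    state["last_poll"] = now
    for entry in assembly_table:
        if entry["status"] != "WAITING":
            continue
        gca_accession = fetch_gca_accession(entry["erz_accession"])  # None until assigned
        if gca_accession is not None:
            entry["status"] = "SUBMITTED"
            entry["results"] = {"gca_accession": gca_accession}
```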

#### upload_to_loculus

- Get sequences in `submission_table` in state SUBMITTED_ALL.
- Get the results of all the submissions (from all other tables).
- Create a POST request to `submit-external-metadata` with the results in the expected format.
  - if (successful): set sequences to state SENT_TO_LOCULUS
  - else: set sequences to state HAS_ERRORS_EXT_METADATA_UPLOAD
- Get sequences in `submission_table` in state HAS_ERRORS_EXT_METADATA_UPLOAD for over 15 min and sequences in status SUBMITTED_ALL for over 15 min: send a slack notification
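
A sketch of that POST using only the standard library; the URL path and the NDJSON payload shape are assumptions for illustration, so check the backend's `submit-external-metadata` endpoint for the real contract:

```python
import json
import urllib.request

def build_ndjson(entries: list[dict]) -> str:
    """Serialise one external-metadata record per line (NDJSON)."""
    return "\n".join(json.dumps(entry) for entry in entries)

def upload_external_metadata(backend_url: str, organism: str,
                             entries: list[dict], token: str) -> None:
    # Path and headers are assumptions, not the verified backend API.
    request = urllib.request.Request(
        f"{backend_url}/{organism}/submit-external-metadata",
        data=build_ndjson(entries).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:  # raises on HTTP errors
        response.read()  # caller then flips state to SENT_TO_LOCULUS
```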

## Developing Locally

### Database

The ENA submission service creates a new schema in the Loculus Postgres DB, managed by flyway. To develop locally you will have to start the Postgres DB locally, e.g. by using the `../deploy.py` script or by running:

```sh
docker run -d \
docker run -d \
--name loculus_postgres \
-e POSTGRES_DB=loculus \
-e POSTGRES_USER=postgres \
Expand All @@ -14,34 +97,82 @@ The ENA submission pod creates a new schema in the loculus DB, this is managed b
postgres:latest
```

### Install and run flyway

In our kubernetes pod we run flyway in a docker container; however, when running locally it is best to [download the flyway CLI](https://documentation.red-gate.com/fd/command-line-184127404.html) (or `brew install flyway` on macOS).
You can then create the schema using the following command:

```sh
flyway -user=postgres -password=unsecure -url=jdbc:postgresql://127.0.0.1:5432/loculus -schemas=ena-submission -locations=filesystem:./flyway/sql migrate
```

If you want to test the docker image locally, it can be built and run using the following commands:

```sh
docker build -t ena-submission-flyway .
docker run -it -e FLYWAY_URL=jdbc:postgresql://127.0.0.1:5432/loculus -e FLYWAY_USER=postgres -e FLYWAY_PASSWORD=unsecure ena-submission-flyway flyway migrate
```

### Setting up micromamba environment

<details>

<summary> Setting up micromamba </summary>

The rest of the ena-submission pod uses micromamba:

```sh
brew install micromamba
micromamba shell init --shell zsh --root-prefix=~/micromamba
source ~/.zshrc
```

</details>

Then activate the loculus-ena-submission environment

```sh
micromamba create -f environment.yml --rc-file .mambarc
micromamba activate loculus-ena-submission
```

### Using ENA's webin-cli

In order to submit assemblies you will also need to install ENA's `webin-cli.jar`. Their [webpage](https://ena-docs.readthedocs.io/en/latest/submit/general-guide/webin-cli.html) offers more instructions. This pipeline has been tested with `WEBIN_CLI_VERSION=7.3.1`.

```sh
wget -q "https://github.com/enasequence/webin-cli/releases/download/${WEBIN_CLI_VERSION}/webin-cli-${WEBIN_CLI_VERSION}.jar" -O /package/webin-cli.jar
```

### Running snakemake

Then run snakemake using `snakemake` or `snakemake {rule}`.

## Testing

### Run tests

```sh
micromamba activate loculus-ena-submission
python3 scripts/test_ena_submission.py
```

### Testing submission locally

ENA submission is currently only triggered after manual approval.

The `get_ena_submission_list` rule runs as a cron job. It queries Loculus for new sequences to submit to ENA (sequences that are in state OPEN, were not submitted by the INSDC_INGEST_USER, do not include ENA external-metadata fields, and are not yet in the `submission_table` of the ena-submission schema). If it finds new sequences, it sends a notification to Slack listing all of them.

It is then the reviewer's turn to review these sequences. [TODO: define review criteria] If the sequences meet our criteria, they should be uploaded to [pathoplexus/ena-submission](https://github.com/pathoplexus/ena-submission/blob/main/approved/approved_ena_submission_list.json) (currently we read data from the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json), but this will be changed to the `approved` folder in production). The `trigger_submission_to_ena` rule constantly checks this folder for new sequences and adds them to the `submission_table` if they are not already there. Note that we cannot yet handle revisions, so these should not be added to the approved list [TODO: do not allow submission of revised sequences in `trigger_submission_to_ena`]; revisions will still have to be performed manually.

If you would like to test `trigger_submission_to_ena` while running locally, you can also use the `trigger_submission_to_ena_from_file` rule; this reads data from `results/approved_ena_submission_list.json` (see the test folder for an example). You can also upload data to the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json). Note that if you add fake data with a non-existent group ID, project creation will fail; additionally, the `upload_to_loculus` rule will fail if the sequences do not actually exist in your Loculus instance.

All other rules query the `submission_table` for projects/samples and assemblies to submit. Once successful, they add accessions to the `results` column in dictionary format. Finally, once the entire process has succeeded, the new external metadata is uploaded to Loculus.

Note that ENA's dev server does not always finish processing, and you might not receive a gcaAccession for your dev submissions. If you would like to test the full submission cycle on the ENA dev instance, it makes sense to manually alter the gcaAccession in the database using `ERZ24784470`. You can connect to a preview instance via port-forwarding and make these changes with a local database tool such as pgAdmin:

1. Apply the preview `~/.kube/config`
2. Find the database POD using `kubectl get pods -A | grep database`
3. Connect via port-forwarding `kubectl port-forward $POD -n $NAMESPACE 5432:5432`
4. If necessary, find the password using `kubectl get secret`
