feat(ena-submission): Add code to submit sequences to ENA and return results to Loculus (#2417)

* Add code to create project object for submission, submit to ENA and keep state in database.

* Create sample.xml objects to send to ENA using PHA4GE's metadata field mapping. Keep state in sample_table. Send Slack notifications if submission fails.

* Create assembly, add webin-cli to docker container, add call to API to check status of assembly submission.

* Map submission results to external metadata fields and upload results to Loculus.

* Fix external metadata upload issue in backend, add additional flyway version.

* add gcaAccession field to values.yaml

* use connection pooling for db requests and explicitly rollback if execute fails

* Prevent SQL injection with table_name and column_name validation

* Update docs, add details of ena-submission to README (taking from all previous PRs).

* Do not store all data from get-released-data call

* Send slack notification when step fails. 

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
anna-parker and corneliusroemer committed Sep 18, 2024
1 parent 048f7d9 commit d400725
Showing 32 changed files with 3,665 additions and 232 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/e2e-k3d.yml
@@ -34,7 +34,7 @@ jobs:
    env:
      ALL_BROWSERS: ${{ github.ref == 'refs/heads/main' || github.event.inputs.all_browsers && 'true' || 'false' }}
      sha: ${{ github.event.pull_request.head.sha || github.sha }}
      wait_timeout: ${{ github.ref == 'refs/heads/main' && 900 || 180 }}
      wait_timeout: ${{ github.ref == 'refs/heads/main' && 900 || 240 }}
    steps:
      - name: Shorten sha
        run: echo "sha=${sha::7}" >> $GITHUB_ENV
37 changes: 37 additions & 0 deletions .github/workflows/ena-submission-tests.yaml
@@ -0,0 +1,37 @@
name: ena-submission-tests
on:
  # test
  pull_request:
    paths:
      - "ena-submission/**"
      - ".github/workflows/ena-submission-tests.yaml"
  push:
    branches:
      - main
  workflow_dispatch:
concurrency:
  group: ci-${{ github.ref == 'refs/heads/main' && github.run_id || github.ref }}-ena-submission-tests
  cancel-in-progress: true
jobs:
  unitTests:
    name: Unit Tests
    runs-on: codebuild-loculus-ci-${{ github.run_id }}-${{ github.run_attempt }}
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - name: Set up micromamba
        uses: mamba-org/setup-micromamba@v1
        with:
          environment-file: ena-submission/environment.yml
          micromamba-version: 'latest'
          init-shell: >-
            bash
            powershell
          cache-environment: true
          post-cleanup: 'all'
      - name: Run tests
        run: |
          micromamba activate loculus-ena-submission
          python3 scripts/test_ena_submission.py
        shell: micromamba-shell {0}
        working-directory: ena-submission
2 changes: 2 additions & 0 deletions backend/README.md
Expand Up @@ -71,6 +71,8 @@ The service listens, by default, to **port 8079**: <http://localhost:8079/swagge
Note: When using a PostgreSQL development platform (e.g. pgAdmin), the hostname is 127.0.0.1 and not localhost; this is defined in the `deploy.py` file.
Note that we also use flyway in the ena-submission pod to create an additional schema, `ena-submission`, in the database. This schema is not added here.
### Operating the backend behind a proxy
When running the backend behind a proxy, the proxy needs to set X-Forwarded headers:
46 changes: 46 additions & 0 deletions backend/src/main/resources/db/migration/V1.3__update_view.sql
@@ -0,0 +1,46 @@
drop view if exists external_metadata_view cascade;

create view external_metadata_view as
select
sequence_entries_preprocessed_data.accession,
sequence_entries_preprocessed_data.version,
all_external_metadata.updated_metadata_at,
-- || concatenates two JSON objects by generating an object containing the union of their keys
-- taking the second object's value when there are duplicate keys.
case
when all_external_metadata.external_metadata is null then jsonb_build_object('metadata', (sequence_entries_preprocessed_data.processed_data->'metadata'))
else jsonb_build_object('metadata', (sequence_entries_preprocessed_data.processed_data->'metadata') || all_external_metadata.external_metadata)
end as joint_metadata
from
sequence_entries_preprocessed_data
left join all_external_metadata on
all_external_metadata.accession = sequence_entries_preprocessed_data.accession
and all_external_metadata.version = sequence_entries_preprocessed_data.version
and sequence_entries_preprocessed_data.pipeline_version = (select version from current_processing_pipeline);

create view sequence_entries_view as
select
se.*,
sepd.started_processing_at,
sepd.finished_processing_at,
sepd.processed_data as processed_data,
sepd.processed_data || em.joint_metadata as joint_metadata,
sepd.errors,
sepd.warnings,
case
when se.released_at is not null then 'APPROVED_FOR_RELEASE'
when se.is_revocation then 'AWAITING_APPROVAL'
when sepd.processing_status = 'IN_PROCESSING' then 'IN_PROCESSING'
when sepd.processing_status = 'HAS_ERRORS' then 'HAS_ERRORS'
when sepd.processing_status = 'FINISHED' then 'AWAITING_APPROVAL'
else 'RECEIVED'
end as status
from
sequence_entries se
left join sequence_entries_preprocessed_data sepd on
se.accession = sepd.accession
and se.version = sepd.version
and sepd.pipeline_version = (select version from current_processing_pipeline)
left join external_metadata_view em on
se.accession = em.accession
and se.version = em.version;
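
The `||` operator used in `external_metadata_view` has right-hand-wins semantics. A small Python sketch of the same behaviour (the field values below are made up for illustration):

```python
# jsonb `||` concatenation: union of keys, the second object's value wins on conflict.
processed_metadata = {"geoLocCountry": "France", "insdcAccessionFull": None}
external_metadata = {"insdcAccessionFull": "OZ076380.1", "gcaAccession": "GCA_964187725.1"}

# Equivalent of: processed_data->'metadata' || all_external_metadata.external_metadata
joint_metadata = {**processed_metadata, **external_metadata}
```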
16 changes: 15 additions & 1 deletion ena-submission/Dockerfile
@@ -1,5 +1,14 @@
FROM mambaorg/micromamba:1.5.8

# Install dependencies needed for webin-cli
USER root
RUN apt-get update && apt-get install -y \
default-jre \
wget \
&& rm -rf /var/lib/apt/lists/*
RUN mkdir -p /package && chown -R $MAMBA_USER:$MAMBA_USER /package
USER $MAMBA_USER

COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yml /tmp/env.yaml
COPY --chown=$MAMBA_USER:$MAMBA_USER .mambarc /tmp/.mambarc

@@ -10,6 +19,11 @@ RUN micromamba config set extract_threads 1 \
# Set the environment variable to activate the conda environment
ARG MAMBA_DOCKERFILE_ACTIVATE=1

COPY --chown=$MAMBA_USER:$MAMBA_USER . /package

ENV WEBIN_CLI_VERSION 7.3.1
USER root
RUN wget -q "https://github.com/enasequence/webin-cli/releases/download/${WEBIN_CLI_VERSION}/webin-cli-${WEBIN_CLI_VERSION}.jar" -O /package/webin-cli.jar
USER $MAMBA_USER

COPY --chown=$MAMBA_USER:$MAMBA_USER . /package
WORKDIR /package
31 changes: 29 additions & 2 deletions ena-submission/ENA_submission.md
Expand Up @@ -35,7 +35,7 @@ We require the following components:

- Analysis: An analysis contains secondary analysis results derived from sequence reads (e.g. a genome assembly).

At the time of writing (October 2023), in contrast to ENA, Pathoplexus has no hierarchy of study/sample/sequence: every sequence is its own study and sample. Therefore we need to figure out how to map sequences to projects: each submitter could have exactly _one_ study per organism (this is the approach we are currently taking), or each sequence could be associated with its own study.
In contrast to ENA, Pathoplexus has no hierarchy of study/sample/sequence. Therefore we have decided to create _one_ study per Loculus submission group and organism. For each Loculus sample we will create one sample (with metadata) and one sequence (with the sequence).

### Mapping sequences and studies

@@ -277,7 +277,34 @@ The following could be implemented as post-MVP features:
-password YYYYYY
```

5. Save accession numbers (these will be returned by the webin-cli)
5. Save ERZ accession numbers (these will be returned by the webin-cli)
6. Wait to receive GCA accession numbers (returned later after assignment by NCBI). This can be retrieved via https://wwwdev.ebi.ac.uk/ena/submit/report/swagger-ui/index.html

```
curl -X 'GET' \
'https://www.ebi.ac.uk/ena/submit/report/analysis-process/{erz_accession}?format=json&max-results=100' \
-H 'accept: */*' \
-H 'Authorization: Basic KEY'
```
When processing is finished the response should look like:
```
[
{
"report": {
"id": "{erz_accession}",
"analysisType": "SEQUENCE_ASSEMBLY",
"acc": "chromosomes:OZ076380-OZ076381,genome:GCA_964187725.1",
"processingStatus": "COMPLETED",
"processingStart": "14-06-2024 05:07:40",
"processingEnd": "14-06-2024 05:08:19",
"processingError": null
},
"links": []
}
]
```
## Promises made to ENA
155 changes: 143 additions & 12 deletions ena-submission/README.md
@@ -1,11 +1,94 @@
## ENA Submission
# ENA Submission

### Developing Locally
## Snakemake Rules

The ENA submission pod creates a new schema in the loculus DB, this is managed by flyway. This means to develop locally you will have to start the postgres DB locally e.g. by using the ../deploy.py script or using
### get_ena_submission_list

This rule runs daily in a cron job. It calls the Loculus backend (`get-released-data`), obtains a list of sequences that are ready for submission to ENA, and sends this list as a compressed JSON file to our Slack channel. Sequences are ready for submission if:

- data is in state APPROVED_FOR_RELEASE:
  - data must be in state "OPEN" for use
  - data must not already exist in ENA and must not be in the submission process; this means:
    - data was not submitted by the `config.ingest_pipeline_submitter`
    - data is not in the `ena-submission.submission_table`
    - as an extra check, we discard all sequences with `ena-specific-metadata` fields
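
A rough sketch of this filter in Python (field names such as `dataUseTerms` and the `ena` prefix check are illustrative stand-ins, not the actual backend schema):

```python
def is_ready_for_ena(entry: dict, already_in_submission_table: set[str],
                     ingest_pipeline_submitter: str) -> bool:
    """Decide whether a released entry belongs on the ENA submission list."""
    metadata = entry["metadata"]
    return (
        entry["status"] == "APPROVED_FOR_RELEASE"
        and metadata.get("dataUseTerms") == "OPEN"
        # not already in ENA and not mid-submission:
        and metadata.get("submitter") != ingest_pipeline_submitter
        and entry["accession"] not in already_in_submission_table
        # extra check: drop anything already carrying ENA-specific metadata
        and not any(key.startswith("ena") for key in metadata)
    )
```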

### all

This rule runs in the ena-submission pod; it runs the following rules in parallel:

#### trigger_submission_to_ena

Downloads the file at `github_url` every 30s. If the data is not already in the submission table (and is not a revision), uploads it to `ena-submission.submission_table`.

#### create_project

In a loop:

- Get sequences in `submission_table` in state READY_TO_SUBMIT
  - if (there exists an entry in the `project_table` for the corresponding (group_id, organism)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_PROJECT.
    - else: update `submission_table` to SUBMITTING_PROJECT.
  - else: create project entry in `project_table` for (group_id, organism).
- Get sequences in `submission_table` in state SUBMITTING_PROJECT
  - if (corresponding `project_table` entry is in state SUBMITTED): update entries to state SUBMITTED_PROJECT.
- Get sequences in `project_table` in state READY, prepare submission object, set status to SUBMITTING
  - if (submission succeeds): set status to SUBMITTED and fill in results: the result of a successful submission is a `bioproject_accession` and an ena-internal `ena_submission_accession`.
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `project_table` in state HAS_ERRORS for over 15 min and sequences in status SUBMITTING for over 15 min: send a slack notification
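
One pass of this loop can be sketched as follows; the table objects are plain dicts/lists standing in for the real database tables, and `submit_project` (and its returned accession) stands in for the XML submission to ENA:

```python
def create_project_pass(submission_table: list, project_table: dict, submit_project) -> None:
    """One iteration of the create_project loop (sketch, not the real code)."""
    # READY_TO_SUBMIT: attach each sequence to its (group_id, organism) project
    for sub in submission_table:
        if sub["status"] != "READY_TO_SUBMIT":
            continue
        project = project_table.get(sub["project_key"])
        if project is None:
            project_table[sub["project_key"]] = {"status": "READY"}
        elif project["status"] == "SUBMITTED":
            sub["status"] = "SUBMITTED_PROJECT"
        else:
            sub["status"] = "SUBMITTING_PROJECT"
    # SUBMITTING_PROJECT: promote once the project has landed
    for sub in submission_table:
        if (sub["status"] == "SUBMITTING_PROJECT"
                and project_table[sub["project_key"]]["status"] == "SUBMITTED"):
            sub["status"] = "SUBMITTED_PROJECT"
    # READY projects: submit, then record results or errors
    for key, project in project_table.items():
        if project["status"] != "READY":
            continue
        project["status"] = "SUBMITTING"
        try:
            project["results"] = submit_project(key)  # e.g. {"bioproject_accession": ...}
            project["status"] = "SUBMITTED"
        except Exception as err:
            project["status"] = "HAS_ERRORS"
            project["errors"] = str(err)
```

The same shape (check the child table, promote or create, then submit READY rows) repeats in create_sample and create_assembly.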

#### create_sample

Maps Loculus metadata to ENA metadata using the template: https://www.ebi.ac.uk/ena/browser/view/ERC000033

In a loop:

- Get sequences in `submission_table` in state SUBMITTED_PROJECT
  - if (there exists an entry in the `sample_table` for the corresponding (accession, version)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_SAMPLE.
    - else: update `submission_table` to SUBMITTING_SAMPLE.
  - else: create sample entry in `sample_table` for (accession, version).
- Get sequences in `submission_table` in state SUBMITTING_SAMPLE
  - if (corresponding `sample_table` entry is in state SUBMITTED): update entries to state SUBMITTED_SAMPLE.
- Get sequences in `sample_table` in state READY, prepare submission object, set status to SUBMITTING
  - if (submission succeeds): set status to SUBMITTED and fill in results; the results of a successful submission are an `sra_run_accession` (starting with ERS), a `biosample_accession` (starting with SAM) and an ena-internal `ena_submission_accession`.
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `sample_table` in state HAS_ERRORS for over 15 min and sequences in status SUBMITTING for over 15 min: send a slack notification

#### create_assembly

In a loop:

- Get sequences in `submission_table` in state SUBMITTED_SAMPLE
  - if (there exists an entry in the `assembly_table` for the corresponding (accession, version)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_ASSEMBLY.
    - else: update `submission_table` to SUBMITTING_ASSEMBLY.
  - else: create assembly entry in `assembly_table` for (accession, version).
- Get sequences in `submission_table` in state SUBMITTING_ASSEMBLY
  - if (corresponding `assembly_table` entry is in state SUBMITTED): update entries to state SUBMITTED_ASSEMBLY.
- Get sequences in `assembly_table` in state READY, prepare files (we need a chromosome_list file, fasta files and a manifest file), set status to SUBMITTING
  - if (submission succeeds): set status to WAITING and fill in results: the ena-internal `erz_accession`
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `assembly_table` in state WAITING; every 5 minutes (to not overload ENA) check if ENA has processed the assemblies and assigned them a `gca_accession`. If so, update the table to status SUBMITTED and fill in results.
- Get sequences in `assembly_table` in state HAS_ERRORS for over 15 min and sequences in status SUBMITTING for over 15 min, or in state WAITING for over 48 hours: send a slack notification
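
The throttled WAITING check might look like this in outline; `fetch_gca_accession` stands in for the report-API call described in ENA_submission.md, and the five-minute interval comes from the text above:

```python
import time

POLL_INTERVAL_SECONDS = 5 * 60  # check at most every 5 minutes to not overload ENA

def check_waiting_assemblies(assembly_table: list, fetch_gca_accession, state: dict) -> None:
    """Poll ENA for WAITING assemblies, at most once per POLL_INTERVAL_SECONDS."""
    now = time.monotonic()
    if now - state.get("last_poll", float("-inf")) < POLL_INTERVAL_SECONDS:
        return  # polled too recently; skip this round
    state["last_poll"] = now
    for entry in assembly_table:
        if entry["status"] != "WAITING":
            continue
        gca_accession = fetch_gca_accession(entry["erz_accession"])  # None until assigned
        if gca_accession is not None:
            entry["status"] = "SUBMITTED"
            entry["results"] = {"gca_accession": gca_accession}
```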

#### upload_to_loculus

- Get sequences in `submission_table` in state SUBMITTED_ALL.
- Get the results of all the submissions (from all other tables).
- Create a POST request to `submit-external-metadata` with the results in the expected format.
  - if (successful): set sequences to state SENT_TO_LOCULUS
  - else: set sequences to state HAS_ERRORS_EXT_METADATA_UPLOAD
- Get sequences in `submission_table` in state HAS_ERRORS_EXT_METADATA_UPLOAD for over 15 min and sequences in status SUBMITTED_ALL for over 15 min: send a slack notification
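
A sketch of that POST using only the standard library; the URL path and the NDJSON payload shape are assumptions for illustration, so check the backend's `submit-external-metadata` endpoint for the real contract:

```python
import json
import urllib.request

def build_ndjson(entries: list[dict]) -> str:
    """Serialise one external-metadata record per line (NDJSON)."""
    return "\n".join(json.dumps(entry) for entry in entries)

def upload_external_metadata(backend_url: str, organism: str,
                             entries: list[dict], token: str) -> None:
    # Path and headers are assumptions, not the verified backend API.
    request = urllib.request.Request(
        f"{backend_url}/{organism}/submit-external-metadata",
        data=build_ndjson(entries).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:  # raises on HTTP errors
        response.read()  # caller then flips state to SENT_TO_LOCULUS
```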

## Developing Locally

### Database

The ENA submission service creates a new schema in the Loculus Postgres DB, managed by flyway. To develop locally you will have to start the Postgres DB locally, e.g. by using the `../deploy.py` script or by running:

```sh
docker run -d \
docker run -d \
--name loculus_postgres \
-e POSTGRES_DB=loculus \
-e POSTGRES_USER=postgres \
Expand All @@ -14,34 +97,82 @@ The ENA submission pod creates a new schema in the loculus DB, this is managed b
postgres:latest
```

### Install and run flyway

In our kubernetes pod we run flyway in a docker container; however, when running locally it is best to [download the flyway CLI](https://documentation.red-gate.com/fd/command-line-184127404.html) (or `brew install flyway` on macOS).
You can then create the schema using the following command:

```sh
flyway -user=postgres -password=unsecure -url=jdbc:postgresql://127.0.0.1:5432/loculus -schemas=ena-submission -locations=filesystem:./flyway/sql migrate
```

If you want to test the docker image locally, it can be built and run using the following commands:

```sh
docker build -t ena-submission-flyway .
docker run -it -e FLYWAY_URL=jdbc:postgresql://127.0.0.1:5432/loculus -e FLYWAY_USER=postgres -e FLYWAY_PASSWORD=unsecure ena-submission-flyway flyway migrate
```

### Setting up micromamba environment

<details>

<summary> Setting up micromamba </summary>

The rest of the ena-submission pod uses micromamba:

```sh
brew install micromamba
micromamba shell init --shell zsh --root-prefix=~/micromamba
source ~/.zshrc
```

</details>

Then activate the loculus-ena-submission environment

```sh
micromamba create -f environment.yml --rc-file .mambarc
micromamba activate loculus-ena-submission
```

### Using ENA's webin-cli

In order to submit assemblies you will also need to install ENA's `webin-cli.jar`. Their [webpage](https://ena-docs.readthedocs.io/en/latest/submit/general-guide/webin-cli.html) offers more instructions. This pipeline has been tested with `WEBIN_CLI_VERSION=7.3.1`.

```sh
wget -q "https://github.com/enasequence/webin-cli/releases/download/${WEBIN_CLI_VERSION}/webin-cli-${WEBIN_CLI_VERSION}.jar" -O /package/webin-cli.jar
```

### Running snakemake

Then run snakemake using `snakemake` or `snakemake {rule}`.

## Testing

### Run tests

```sh
micromamba activate loculus-ena-submission
python3 scripts/test_ena_submission.py
```

### Testing submission locally

ENA submission is currently only triggered after manual approval.

The `get_ena_submission_list` rule runs as a cron job. It queries Loculus for new sequences to submit to ENA (sequences that are in state OPEN, were not submitted by the INSDC_INGEST_USER, do not include ENA external-metadata fields, and are not yet in the `submission_table` of the ena-submission schema). If it finds new sequences, it sends a notification to Slack listing all of them.

It is then the reviewer's turn to review these sequences. [TODO: define review criteria] If the sequences meet our criteria, they should be uploaded to [pathoplexus/ena-submission](https://github.com/pathoplexus/ena-submission/blob/main/approved/approved_ena_submission_list.json) (currently we read data from the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json), but this will be changed to the `approved` folder in production). The `trigger_submission_to_ena` rule constantly checks this folder for new sequences and adds them to the `submission_table` if they are not already there. Note that we cannot yet handle revisions, so these should not be added to the approved list [TODO: do not allow submission of revised sequences in `trigger_submission_to_ena`]; revisions will still have to be performed manually.

If you would like to test `trigger_submission_to_ena` while running locally, you can also use the `trigger_submission_to_ena_from_file` rule; this reads data from `results/approved_ena_submission_list.json` (see the test folder for an example). You can also upload data to the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json). Note that if you add fake data with a non-existent group ID, project creation will fail; additionally, the `upload_to_loculus` rule will fail if the sequences do not actually exist in your Loculus instance.

All other rules query the `submission_table` for projects/samples and assemblies to submit. Once successful, they add accessions to the `results` column in dictionary format. Finally, once the entire process has succeeded, the new external metadata is uploaded to Loculus.

Note that ENA's dev server does not always finish processing, and you might not receive a gcaAccession for your dev submissions. If you would like to test the full submission cycle on the ENA dev instance, it makes sense to manually alter the gcaAccession in the database using `ERZ24784470`. You can connect to a preview instance via port-forwarding and make these changes with a local database tool such as pgAdmin:

1. Apply the preview `~/.kube/config`
2. Find the database POD using `kubectl get pods -A | grep database`
3. Connect via port-forwarding `kubectl port-forward $POD -n $NAMESPACE 5432:5432`
4. If necessary, find the password using `kubectl get secret`
