feat(ena-submission): Upload ena submission results to Loculus #2417

Merged 45 commits on Sep 18, 2024
621a522
Add code to create project object for submission, submit to ena and k…
anna-parker Aug 13, 2024
9d09a0f
Make functions cleaner
anna-parker Aug 26, 2024
c4d7a21
Use encryptedData not data for sealedSecrets
anna-parker Aug 26, 2024
c8f5cfa
Do not disable ena submission
anna-parker Aug 28, 2024
325829f
feat(ena-submission): Edits while PR reviewing create ena projects (#…
corneliusroemer Aug 29, 2024
5b0b378
Create sample.xml objects to send to ENA using PHAG4E's metadata fiel…
anna-parker Aug 29, 2024
c71e7d1
Clean up construct_sample_set_object function and add new columns in …
anna-parker Aug 29, 2024
6f8b318
Clean up tests to use test files
anna-parker Aug 29, 2024
d87016a
Test needs to be set to true while testing
anna-parker Aug 29, 2024
dfce120
Create assembly, add webin-cli to docker container, add call to API t…
anna-parker Aug 29, 2024
9685180
Test get-ena-submission-list cronjob is still working
anna-parker Aug 29, 2024
3112ed1
Modify assembly_name for testing
anna-parker Aug 29, 2024
d44428b
Correct logs, revert cronjob to normal frequency as working
anna-parker Aug 29, 2024
27e7278
Map submission results to external metadata fields and upload results…
anna-parker Aug 29, 2024
23824d0
Refactor slack notifications
anna-parker Aug 29, 2024
3977432
Add local debug instructions
anna-parker Aug 29, 2024
2b41057
Fix slack notifications in get_ena_submission_list
anna-parker Aug 29, 2024
dcc457a
Update to webin-cli to v8.0.0, clean up assembly creation functions
anna-parker Aug 30, 2024
c8a31ea
Clean up tests some more
anna-parker Aug 30, 2024
0ee84de
Downgrade webin-cli to 7.3.1 until I figure out what is happening loc…
anna-parker Aug 30, 2024
e24ddd1
Chromosomes must be named and unaligned sequences cannot include '-' …
anna-parker Aug 30, 2024
97f29ba
Add a state if external metadata upload fails.
anna-parker Aug 30, 2024
bde80a4
Make sql schema changes in another flyway version.
anna-parker Aug 30, 2024
86e8de4
Prevent SQL injection with table_name validation.
anna-parker Sep 6, 2024
6dde7c7
Update docs with suggestions
anna-parker Sep 9, 2024
0a744a3
More updates with review suggestions
anna-parker Sep 9, 2024
e06a87d
Remove '-' conversion test
anna-parker Sep 9, 2024
13d4af1
consistent response handling
anna-parker Sep 9, 2024
ac46751
Validate column names, update in_submission_table function, prevent p…
anna-parker Sep 9, 2024
2451261
Add types to function input
anna-parker Sep 9, 2024
645029e
Fix weird slack response import
anna-parker Sep 9, 2024
c16d991
Do not store all data from get-released-data call
anna-parker Sep 10, 2024
a19b56e
Now that we are using SimpleConnectionPool we must explicitly rollbac…
anna-parker Sep 13, 2024
8727532
Add details of ena-submission to README (taking from all previous PRs).
anna-parker Sep 13, 2024
a4266f9
Add long and lat mapping
anna-parker Sep 16, 2024
e0e8644
Send slack notification is ENA metadata update fails
anna-parker Sep 16, 2024
bb43aa0
Add the option to use the ENA checklist - disable for now
anna-parker Sep 16, 2024
c75fa88
Make config values clearer
anna-parker Sep 16, 2024
069961c
update view to V1.3
anna-parker Sep 18, 2024
616f18b
increase e2e timeout to 4min.
anna-parker Sep 18, 2024
ad6b79a
Add comment to readme and add insdc_accession
anna-parker Sep 18, 2024
22439f7
Sleep in between iterations to keep cpu down
anna-parker Sep 18, 2024
af9849c
Only send ena submission lists at midnight, do not raise error if git…
anna-parker Sep 18, 2024
59aad25
make ena-submission github repo public, remove need for PAT and USERN…
anna-parker Sep 18, 2024
c3a6b97
No need to decode json now.
anna-parker Sep 18, 2024
2 changes: 1 addition & 1 deletion .github/workflows/e2e-k3d.yml
@@ -34,7 +34,7 @@ jobs:
env:
ALL_BROWSERS: ${{ github.ref == 'refs/heads/main' || github.event.inputs.all_browsers && 'true' || 'false' }}
sha: ${{ github.event.pull_request.head.sha || github.sha }}
wait_timeout: ${{ github.ref == 'refs/heads/main' && 900 || 180 }}
wait_timeout: ${{ github.ref == 'refs/heads/main' && 900 || 240 }}
steps:
- name: Shorten sha
run: echo "sha=${sha::7}" >> $GITHUB_ENV
37 changes: 37 additions & 0 deletions .github/workflows/ena-submission-tests.yaml
@@ -0,0 +1,37 @@
name: ena-submission-tests
on:
# test
pull_request:
paths:
- "ena-submission/**"
- ".github/workflows/ena-submission-tests.yaml"
push:
branches:
- main
workflow_dispatch:
concurrency:
group: ci-${{ github.ref == 'refs/heads/main' && github.run_id || github.ref }}-ena-submission-tests
cancel-in-progress: true
jobs:
unitTests:
name: Unit Tests
runs-on: codebuild-loculus-ci-${{ github.run_id }}-${{ github.run_attempt }}
timeout-minutes: 15
steps:
- uses: actions/checkout@v4
- name: Set up micromamba
uses: mamba-org/setup-micromamba@v1
with:
environment-file: ena-submission/environment.yml
micromamba-version: 'latest'
init-shell: >-
bash
powershell
cache-environment: true
post-cleanup: 'all'
- name: Run tests
run: |
micromamba activate loculus-ena-submission
python3 scripts/test_ena_submission.py
shell: micromamba-shell {0}
working-directory: ena-submission
2 changes: 2 additions & 0 deletions backend/README.md
@@ -71,6 +71,8 @@ The service listens, by default, to **port 8079**: <http://localhost:8079/swagge

Note: When using a PostgreSQL development platform (e.g. pgAdmin) the hostname is 127.0.0.1 and not localhost. This is defined in the `deploy.py` file.

Note that we also use flyway in the ena-submission pod to create an additional schema in the database, ena-submission. This schema is not added here.

### Operating the backend behind a proxy

When running the backend behind a proxy, the proxy needs to set X-Forwarded headers:
46 changes: 46 additions & 0 deletions backend/src/main/resources/db/migration/V1.3__update_view.sql
@@ -0,0 +1,46 @@
drop view if exists external_metadata_view cascade;

create view external_metadata_view as
select
sequence_entries_preprocessed_data.accession,
sequence_entries_preprocessed_data.version,
all_external_metadata.updated_metadata_at,
-- || concatenates two JSON objects by generating an object containing the union of their keys
-- taking the second object's value when there are duplicate keys.
case
when all_external_metadata.external_metadata is null then jsonb_build_object('metadata', (sequence_entries_preprocessed_data.processed_data->'metadata'))
else jsonb_build_object('metadata', (sequence_entries_preprocessed_data.processed_data->'metadata') || all_external_metadata.external_metadata)
end as joint_metadata
from
sequence_entries_preprocessed_data
left join all_external_metadata on
all_external_metadata.accession = sequence_entries_preprocessed_data.accession
and all_external_metadata.version = sequence_entries_preprocessed_data.version
and sequence_entries_preprocessed_data.pipeline_version = (select version from current_processing_pipeline);

create view sequence_entries_view as
select
se.*,
sepd.started_processing_at,
sepd.finished_processing_at,
sepd.processed_data as processed_data,
sepd.processed_data || em.joint_metadata as joint_metadata,
sepd.errors,
sepd.warnings,
case
when se.released_at is not null then 'APPROVED_FOR_RELEASE'
when se.is_revocation then 'AWAITING_APPROVAL'
when sepd.processing_status = 'IN_PROCESSING' then 'IN_PROCESSING'
when sepd.processing_status = 'HAS_ERRORS' then 'HAS_ERRORS'
when sepd.processing_status = 'FINISHED' then 'AWAITING_APPROVAL'
else 'RECEIVED'
end as status
from
sequence_entries se
left join sequence_entries_preprocessed_data sepd on
se.accession = sepd.accession
and se.version = sepd.version
and sepd.pipeline_version = (select version from current_processing_pipeline)
left join external_metadata_view em on
se.accession = em.accession
and se.version = em.version;
16 changes: 15 additions & 1 deletion ena-submission/Dockerfile
@@ -1,5 +1,14 @@
FROM mambaorg/micromamba:1.5.8

# Install dependencies needed for webin-cli
USER root
RUN apt-get update && apt-get install -y \
default-jre \
wget \
&& rm -rf /var/lib/apt/lists/*
RUN mkdir -p /package && chown -R $MAMBA_USER:$MAMBA_USER /package
USER $MAMBA_USER

COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yml /tmp/env.yaml
COPY --chown=$MAMBA_USER:$MAMBA_USER .mambarc /tmp/.mambarc

@@ -10,6 +19,11 @@ RUN micromamba config set extract_threads 1 \
# Set the environment variable to activate the conda environment
ARG MAMBA_DOCKERFILE_ACTIVATE=1

COPY --chown=$MAMBA_USER:$MAMBA_USER . /package

ENV WEBIN_CLI_VERSION=7.3.1
USER root
RUN wget -q "https://github.com/enasequence/webin-cli/releases/download/${WEBIN_CLI_VERSION}/webin-cli-${WEBIN_CLI_VERSION}.jar" -O /package/webin-cli.jar
USER $MAMBA_USER

COPY --chown=$MAMBA_USER:$MAMBA_USER . /package
WORKDIR /package
31 changes: 29 additions & 2 deletions ena-submission/ENA_submission.md
@@ -35,7 +35,7 @@ We require the following components:

- Analysis: An analysis contains secondary analysis results derived from sequence reads (e.g. a genome assembly).

At the time of writing (October 2023), in contrast to ENA, Pathoplexus has no hierarchy of study/sample/sequence: every sequence is its own study and sample. Therefore we need to figure out how to map sequences to projects, each submitter could have exactly _one_ study pre organism (this is the approach we are currently taking), or each sequence could be associated with its own study.
In contrast to ENA, Pathoplexus has no hierarchy of study/sample/sequence. Therefore we have decided to create _one_ study per Loculus submission group and organism. For each Loculus sample we will create one sample (with metadata) and one sequence (with the sequence).

### Mapping sequences and studies

@@ -277,7 +277,34 @@ The following could be implemented as post-MVP features:
-password YYYYYY
```

5. Save accession numbers (these will be returned by the webin-cli)
5. Save ERZ accession numbers (these will be returned by the webin-cli)
6. Wait to receive GCA accession numbers (these are returned later, after assignment by NCBI). They can be retrieved via the reporting API, documented at https://wwwdev.ebi.ac.uk/ena/submit/report/swagger-ui/index.html

```
curl -X 'GET' \
'https://www.ebi.ac.uk/ena/submit/report/analysis-process/{erz_accession}?format=json&max-results=100' \
-H 'accept: */*' \
-H 'Authorization: Basic KEY'
```

When processing is finished the response should look like:

```
[
{
"report": {
"id": "{erz_accession}",
"analysisType": "SEQUENCE_ASSEMBLY",
"acc": "chromosomes:OZ076380-OZ076381,genome:GCA_964187725.1",
"processingStatus": "COMPLETED",
"processingStart": "14-06-2024 05:07:40",
"processingEnd": "14-06-2024 05:08:19",
"processingError": null
},
"links": []
}
]
```
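
The `acc` field in the response above packs several accessions into one comma-separated string. The following is a minimal Python sketch of splitting it, assuming the `key:value` layout shown in the example holds in general. The helper name `parse_report_accessions` is hypothetical and not part of the pipeline.

```python
import json
from typing import Dict, Optional


def parse_report_accessions(report_json: str) -> Optional[Dict[str, str]]:
    """Return accessions keyed by type once processing has completed, else None.

    Assumes the `acc` layout from the example response: comma-separated
    `key:value` pairs such as `genome:GCA_...`.
    """
    entries = json.loads(report_json)
    if not entries:
        return None
    report = entries[0]["report"]
    if report.get("processingStatus") != "COMPLETED" or not report.get("acc"):
        return None
    accessions = {}
    for part in report["acc"].split(","):
        # Split on the first colon only, so values may contain further colons.
        key, _, value = part.partition(":")
        accessions[key.strip()] = value.strip()
    return accessions
```

With the example response, the `genome` key would hold the GCA accession and the `chromosomes` key the chromosome accession range.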

## Promises made to ENA

155 changes: 143 additions & 12 deletions ena-submission/README.md
@@ -1,11 +1,94 @@
## ENA Submission
# ENA Submission

### Developing Locally
## Snakemake Rules

The ENA submission pod creates a new schema in the loculus DB, this is managed by flyway. This means to develop locally you will have to start the postgres DB locally e.g. by using the ../deploy.py script or using
### get_ena_submission_list

This rule runs daily as a cron job. It calls the Loculus backend (`get-released-data`), obtains a new list of sequences that are ready for submission to ENA, and sends this list as a compressed JSON file to our Slack channel. Sequences are ready for submission if:

- data is in state APPROVED_FOR_RELEASE
- data must be in state "OPEN" for use
- data must not already exist in ENA or be in the submission process, which means:
  - data was not submitted by the `config.ingest_pipeline_submitter`
  - data is not in the `ena-submission.submission_table`
  - as an extra check, we discard all sequences with `ena-specific-metadata` fields
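
The criteria above can be read as a single predicate over each released entry. The sketch below is an illustration only: the `ReleasedEntry` layout, its field names and the function name are assumptions, and the real `get-released-data` output may look different. Entries are assumed to come from `get-released-data`, so they are already in state APPROVED_FOR_RELEASE.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ReleasedEntry:
    # Hypothetical record layout, not the actual backend schema.
    accession: str
    version: int
    use_state: str
    submitter: str
    metadata: Dict[str, str] = field(default_factory=dict)


def is_ready_for_ena(
    entry: ReleasedEntry,
    ingest_submitter: str,
    in_submission_table: Set[Tuple[str, int]],
    ena_specific_fields: Set[str],
) -> bool:
    """Apply the readiness checks from the list above, in order."""
    if entry.use_state != "OPEN":
        return False
    if entry.submitter == ingest_submitter:
        return False  # submitted by the ingest pipeline, so already in ENA
    if (entry.accession, entry.version) in in_submission_table:
        return False  # already in the submission process
    # Extra check: a populated ENA-specific field means ENA already knows it.
    return not any(entry.metadata.get(name) for name in ena_specific_fields)
```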

### all

This rule runs in the ena-submission pod, it runs the following rules in parallel:

#### trigger_submission_to_ena

Download file in `github_url` every 30s. If data is not in submission table already (and not a revision) upload data to `ena-submission.submission_table`.

#### create_project

In a loop:

- Get sequences in `submission_table` in state READY_TO_SUBMIT
  - if (there exists an entry in the `project_table` for the corresponding (group_id, organism)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_PROJECT.
    - else: update `submission_table` to SUBMITTING_PROJECT.
  - else: create project entry in `project_table` for (group_id, organism).
- Get sequences in `submission_table` in state SUBMITTING_PROJECT
  - if (corresponding `project_table` entry is in state SUBMITTED): update entries to state SUBMITTED_PROJECT.
- Get sequences in `project_table` in state READY, prepare submission object, set status to SUBMITTING
  - if (submission succeeds): set status to SUBMITTED and fill in results: the result of a successful submission is `bioproject_accession` and an ena-internal `ena_submission_accession`.
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `project_table` in state HAS_ERRORS for over 15 min and sequences in status SUBMITTING for over 15 min: send slack notification
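
The first step of this loop is a small state transition on `submission_table`, driven by the status of the matching `project_table` entry. The sketch below works on in-memory values instead of database rows. The function name and the convention that `None` means "no entry yet" are assumptions for illustration, and it follows the text literally in leaving the sequence in READY_TO_SUBMIT when a project entry first has to be created.

```python
from enum import Enum
from typing import Optional, Tuple


class ProjectStatus(Enum):
    READY = "READY"
    SUBMITTING = "SUBMITTING"
    SUBMITTED = "SUBMITTED"
    HAS_ERRORS = "HAS_ERRORS"


def next_submission_state(project_status: Optional[ProjectStatus]) -> Tuple[str, bool]:
    """Return (new submission_table state, whether a project entry must be created).

    `project_status` is None when no project_table entry exists yet for the
    (group_id, organism) pair of the sequence.
    """
    if project_status is None:
        # No project yet: create one and pick the sequence up on a later pass.
        return "READY_TO_SUBMIT", True
    if project_status is ProjectStatus.SUBMITTED:
        return "SUBMITTED_PROJECT", False
    # Project exists but is not submitted yet: wait for it.
    return "SUBMITTING_PROJECT", False
```

Keeping the transition a pure function makes it easy to test without a database; the sample and assembly loops described in the following sections follow the same pattern with their own tables.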

#### create_sample

Maps loculus metadata to ena metadata using template: https://www.ebi.ac.uk/ena/browser/view/ERC000033

In a loop:

- Get sequences in `submission_table` in state SUBMITTED_PROJECT
  - if (there exists an entry in the `sample_table` for the corresponding (accession, version)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_SAMPLE.
    - else: update `submission_table` to SUBMITTING_SAMPLE.
  - else: create sample entry in `sample_table` for (accession, version).
- Get sequences in `submission_table` in state SUBMITTING_SAMPLE
  - if (corresponding `sample_table` entry is in state SUBMITTED): update entries to state SUBMITTED_SAMPLE.
- Get sequences in `sample_table` in state READY, prepare submission object, set status to SUBMITTING
  - if (submission succeeds): set status to SUBMITTED and fill in results. The results of a successful submission are an `sra_run_accession` (starting with ERS), a `biosample_accession` (starting with SAM) and an ena-internal `ena_submission_accession`.
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `sample_table` in state HAS_ERRORS for over 15 min and sequences in status SUBMITTING for over 15 min: send a slack notification

#### create_assembly

In a loop:

- Get sequences in `submission_table` in state SUBMITTED_SAMPLE
  - if (there exists an entry in the `assembly_table` for the corresponding (accession, version)):
    - if (entry is in status SUBMITTED): update `submission_table` to SUBMITTED_ASSEMBLY.
    - else: update `submission_table` to SUBMITTING_ASSEMBLY.
  - else: create assembly entry in `assembly_table` for (accession, version).
- Get sequences in `submission_table` in state SUBMITTING_ASSEMBLY
  - if (corresponding `assembly_table` entry is in state SUBMITTED): update entries to state SUBMITTED_ASSEMBLY.
- Get sequences in `assembly_table` in state READY, prepare files (we need a chromosome_list, fasta files and a manifest file), set status to SUBMITTING
  - if (submission succeeds): set status to WAITING and fill in results: ena-internal `erz_accession`
  - else: set status to HAS_ERRORS and fill in errors
- Get sequences in `assembly_table` in state WAITING. Every 5 minutes (to not overload ENA), check if ENA has processed the assemblies and assigned them a `gca_accession`. If so, update the table to status SUBMITTED and fill in results
- Get sequences in `assembly_table` in state HAS_ERRORS for over 15 min, in status SUBMITTING for over 15 min, or in state WAITING for over 48 hours: send slack notification
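
The notification step that closes each loop is a time-threshold check per status. A minimal sketch follows, assuming each row records the time it entered its current status. The `entered_at` parameter and the function name are hypothetical; the thresholds are the ones given in the descriptions above.

```python
from datetime import datetime, timedelta

# Thresholds taken from the rule descriptions above.
THRESHOLDS = {
    "HAS_ERRORS": timedelta(minutes=15),
    "SUBMITTING": timedelta(minutes=15),
    "WAITING": timedelta(hours=48),
}


def needs_notification(status: str, entered_at: datetime, now: datetime) -> bool:
    """True if a row has been in `status` for longer than its threshold.

    Statuses without a threshold (e.g. SUBMITTED) never trigger a notification.
    """
    threshold = THRESHOLDS.get(status)
    return threshold is not None and now - entered_at > threshold
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check deterministic in tests.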

#### upload_to_loculus

- Get sequences in `submission_table` in state SUBMITTED_ALL.
- Get the results of all the submissions (from all other tables)
- Create a POST request to the submit-external-metadata endpoint with the results in the expected format.
  - if (successful): set sequences to state SENT_TO_LOCULUS
  - else: set sequences to state HAS_ERRORS_EXT_METADATA_UPLOAD
- Get sequences in `submission_table` in state HAS_ERRORS_EXT_METADATA_UPLOAD for over 15 min and sequences in status SUBMITTED_ALL for over 15 min: send slack notification

## Developing Locally

### Database

The ENA submission service creates a new schema in the Loculus Postgres DB, managed by flyway. To develop locally you will have to start the postgres DB locally e.g. by using the `../deploy.py` script or using

```sh
docker run -d \
docker run -d \
--name loculus_postgres \
-e POSTGRES_DB=loculus \
-e POSTGRES_USER=postgres \
@@ -14,34 +97,82 @@ The ENA submission pod creates a new schema in the loculus DB, this is managed b
postgres:latest
```

In our kubernetes pod we run flyway in a docker container, however when running locally it is best to [download the flyway CLI](https://documentation.red-gate.com/fd/command-line-184127404.html).
### Install and run flyway

You can then run flyway using the
In our Kubernetes pod we run flyway in a docker container. When running locally, however, it is best to [download the flyway CLI](https://documentation.red-gate.com/fd/command-line-184127404.html) (or `brew install flyway` on macOS).

```
flyway -user=postgres -password=unsecure -url=jdbc:postgresql://127.0.0.1:5432/loculus -schemas=ena-submission -locations=filesystem:./sql migrate
You can then create the schema using the following command:

```sh
flyway -user=postgres -password=unsecure -url=jdbc:postgresql://127.0.0.1:5432/loculus -schemas=ena-submission -locations=filesystem:./flyway/sql migrate
```

If you want to test the docker image locally, it can be built and run using the commands:

```
```sh
docker build -t ena-submission-flyway .
docker run -it -e FLYWAY_URL=jdbc:postgresql://127.0.0.1:5432/loculus -e FLYWAY_USER=postgres -e FLYWAY_PASSWORD=unsecure ena-submission-flyway flyway migrate
```

### Setting up micromamba environment

<details>

<summary> Setting up micromamba </summary>

The rest of the ena-submission pod uses micromamba:

```bash
```sh
brew install micromamba
micromamba shell init --shell zsh --root-prefix=~/micromamba
source ~/.zshrc
```

</details>

Then create and activate the `loculus-ena-submission` environment:

```bash
micromamba create -f environment.yml --platform osx-64 --rc-file .mambarc
```sh
micromamba create -f environment.yml --rc-file .mambarc
micromamba activate loculus-ena-submission
```

### Using ENA's webin-cli

In order to submit assemblies you will also need to install ENA's `webin-cli.jar`. Their [webpage](https://ena-docs.readthedocs.io/en/latest/submit/general-guide/webin-cli.html) offers more instructions. This pipeline has been tested with `WEBIN_CLI_VERSION=7.3.1`.

```sh
WEBIN_CLI_VERSION=7.3.1
wget -q "https://github.com/enasequence/webin-cli/releases/download/${WEBIN_CLI_VERSION}/webin-cli-${WEBIN_CLI_VERSION}.jar" -O /package/webin-cli.jar
```

### Running snakemake

Then run snakemake using `snakemake` or `snakemake {rule}`.

## Testing

### Run tests

```sh
micromamba activate loculus-ena-submission
python3 scripts/test_ena_submission.py
```

### Testing submission locally

ENA submission is currently only triggered after manual approval.

The `get_ena_submission_list` rule runs as a cron job. It queries Loculus for new sequences to submit to ENA (these are sequences that are in state OPEN, were not submitted by the INSDC_INGEST_USER, do not include ena external_metadata fields and are not yet in the `submission_table` of the ena-submission schema). If it finds new sequences, it sends a notification to Slack listing all of them.

It is then the reviewer's turn to review these sequences. [TODO: define review criteria] If these sequences meet our criteria they should be uploaded to [pathoplexus/ena-submission](https://github.com/pathoplexus/ena-submission/blob/main/approved/approved_ena_submission_list.json) (currently we read data from the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json), but this will be changed to the `approved` folder in production). The `trigger_submission_to_ena` rule constantly checks this folder for new sequences and adds them to the `submission_table` if they are not already there. Note that we cannot yet handle revisions, so these should not be added to the approved list [TODO: do not allow submission of revised sequences in `trigger_submission_to_ena`]. Revisions will still have to be performed manually.

If you would like to test `trigger_submission_to_ena` while running locally, you can also use the `trigger_submission_to_ena_from_file` rule, which reads data from `results/approved_ena_submission_list.json` (see the test folder for an example). You can also upload data to the [test folder](https://github.com/pathoplexus/ena-submission/blob/main/test/approved_ena_submission_list.json). Note that if you add fake data with a non-existent group-id, the project creation will fail. Additionally, the `upload_to_loculus` rule will fail if these sequences do not actually exist in your Loculus instance.

All other rules query the `submission_table` for projects/samples and assemblies to submit. Once successful they add accessions to the `results` column in dictionary format. Finally, once the entire process has succeeded the new external metadata will be uploaded to Loculus.

Note that ENA's dev server does not always finish processing, so you might not receive a gcaAccession for your dev submissions. If you would like to test the full submission cycle on the ENA dev instance, it makes sense to manually alter the gcaAccession in the database using `ERZ24784470`. You can connect to a preview instance via port forwarding and make these changes with a local database tool such as pgAdmin:

1. Apply the preview `~/.kube/config`
2. Find the database POD using `kubectl get pods -A | grep database`
3. Connect via port-forwarding `kubectl port-forward $POD -n $NAMESPACE 5432:5432`
4. If necessary find password using `kubectl get secret`