Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ena-submission): Upload ena submission results to Loculus #2417

Merged
merged 45 commits into from
Sep 18, 2024

Conversation

anna-parker
Copy link
Contributor

@anna-parker anna-parker commented Aug 12, 2024

resolves #2424

preview URL: https://upload-ena-submission-res.loculus.org/

Summary

  1. This PR encompasses the entire ena-submission process, for details see: ena-submission/README.md.

  2. It adds database connection pooling, otherwise I get this error:
    image
    caused by Exhaustion of Local Ports: "When your Python application makes connections to the database, it uses a local (ephemeral) port on your machine. If your application opens many connections in a short period and doesn’t properly close them, you might run out of available local ports." - it might be possible to fix this just with better closing of port connections... I checked and the postgresDB can support SHOW max_connections; = 100. After the ena-submission-pod jobs were killed there were still SELECT count(*) FROM pg_stat_activity; = 20 connections to the db. Probably from the backend. I wasn't closing db updates (this would explain ~20-30 additional connections).

  3. It also fixes an issue in the backend when merging external metadata with other metadata fields, an alternative solution could be:

create or replace function jsonb_concat(a jsonb, b jsonb) 
returns jsonb
language plpgsql
immutable
parallel safe
as $$
declare
    result jsonb;
begin
    -- Merge the two JSON objects
    result := a || jsonb_object_agg(key, 
                case 
                    when a->key is null or a->key = 'null' then b->key 
                    else a->key 
                end)
                from jsonb_each_text(b) as elem(key, value);
    return result;
end;
$$;

and use this like:

case 
       when all_external_metadata.external_metadata is null then jsonb_build_object('metadata', (sequence_entries_preprocessed_data.processed_data->'metadata'))
        else jsonb_build_object('metadata', (sequence_entries_preprocessed_data.processed_data->'metadata') || all_external_metadata.external_metadata)
    end as joint_metadata

Screenshot

PR Checklist

  • Confirm this works locally from uploading new sequences, getting the sequences from the slack notification and uploading to the github repo to appearing on the page of the uploaded sequence:
    image
    Note Biosamples was down so I had to artificially add the values in.
  • Confirm this works on a preview instance: test full cycle from uploading sequences to the preview instance. Similar to in feat(ena-submission): Create ena assembly #2332 I had to log into the database via port forwarding to set the sample results to existing entries:
{
        "ena_sample_accession":  "ERS20170050",
        "biosample_accession": "SAMEA115670243",
        "ena_submission_accession": "ena_submission_accession"
    }

(as biosamples is currently down), and then did the same for the erz_accession (this was successful but to not have to wait I set this to an existing erz accession). This finishes the submission cycle and then the results are sent to loculus:
image
and are seen on the website (with links that work except for sra... it seems like this isn't actually an sra accession??):

image
now without sra_run_accession: image

  • Check this works for a single segmented organism

@anna-parker anna-parker force-pushed the upload_ena_submission_results branch 2 times, most recently from bc7da3e to 6353a76 Compare August 14, 2024 11:46
@anna-parker anna-parker added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Aug 14, 2024
@anna-parker anna-parker removed the preview Triggers a deployment to argocd label Aug 21, 2024
@corneliusroemer corneliusroemer added the insdc tasks related to ENA/INSDC submission label Aug 29, 2024
@anna-parker anna-parker added the preview Triggers a deployment to argocd label Aug 29, 2024
@anna-parker anna-parker marked this pull request as ready for review August 29, 2024 18:14
@anna-parker anna-parker force-pushed the upload_ena_submission_results branch 2 times, most recently from 4d0d5d9 to ebd05d6 Compare August 30, 2024 16:14
Copy link
Member

@chaoran-chen chaoran-chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that there is at least one possible way for an SQL injection.

There are multiple places where a function constructs an SQL query via string concatenation using values passed into the function without escaping and validation. This is very dangerous and even for places where an attack is currently not possible because the input values are all under our control (e.g., the organism field), this could potentially lead to future vulnerabilities.

ena-submission/scripts/create_project.py Show resolved Hide resolved
ena-submission/ENA_submission.md Outdated Show resolved Hide resolved
ena-submission/scripts/call_loculus.py Outdated Show resolved Hide resolved
ena-submission/scripts/call_loculus.py Outdated Show resolved Hide resolved
ena-submission/scripts/call_loculus.py Outdated Show resolved Hide resolved
ena-submission/scripts/create_project.py Show resolved Hide resolved
ena-submission/scripts/create_project.py Show resolved Hide resolved
ena-submission/scripts/call_loculus.py Outdated Show resolved Hide resolved
ena-submission/scripts/call_loculus.py Outdated Show resolved Hide resolved
ena-submission/scripts/create_assembly.py Show resolved Hide resolved
ena-submission/scripts/ena_submission_helper.py Outdated Show resolved Hide resolved
ena-submission/scripts/create_assembly.py Outdated Show resolved Hide resolved
@chaoran-chen
Copy link
Member

Thanks @anna-parker for this huge piece of work!! I read (rather briefly) through the code and commented whenever I noticed something. I found a few vulnerabilities that should be fixed. Otherwise, it looks great for now!

@anna-parker anna-parker removed preview Triggers a deployment to argocd labels Sep 11, 2024
@corneliusroemer corneliusroemer added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Sep 16, 2024
Copy link
Contributor

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review as we go - let's try this out

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Sep 18, 2024
@anna-parker anna-parker added preview Triggers a deployment to argocd and removed preview Triggers a deployment to argocd labels Sep 18, 2024
@anna-parker anna-parker dismissed chaoran-chen’s stale review September 18, 2024 16:29

SQL injection has been prevented with subsequent code changes - I am not sure why I need to dismiss this review it seems I should be able to do it automatically.

@anna-parker
Copy link
Contributor Author

@chaoran-chen sorry I had to dismiss your review to merge this (but I did address and fix the issues you found), from discussion with @rneher and @corneliusroemer we will merge this now and work on #2832 in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
insdc tasks related to ENA/INSDC submission preview Triggers a deployment to argocd
Projects
None yet
4 participants