Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[don't merge without prior discussion] fix: don't sort author lists, except for segment grouping purposes #2653

Merged
merged 4 commits into from
Sep 19, 2024

Conversation

corneliusroemer
Copy link
Contributor

resolves #2650

preview URL: https://author-order.loculus.org

Summary

We started sorting author lists when it turned out that different segments of the same isolate could have different author orders. We should have done the sorting only for grouping purposes and not when submitting to Loculus.

This PR fixes that by only sorting for grouping purposes.

Note: this will cause ingest to submit revisions for existing deployments as the ingest hash changes.

Screenshot

@corneliusroemer corneliusroemer added preview Triggers a deployment to argocd ingest Ingest pipeline bug Something isn't working high_priority Work on this as soon as possible (potentially post-MVP) labels Aug 28, 2024
Copy link
Member

@theosanderson theosanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good work - looks good to me. I don't think we should merge this until tomorrow at the point that we're ready to start looking into bringing this to prod, because once we do we block on bringing in other new things (if something even more urgent were to come up in the intervening time).

@corneliusroemer
Copy link
Contributor Author

Thanks for the review @theosanderson, I've addressed your points - I won't merge then, let's coordinate tomorrow on how to phase this into production!

@corneliusroemer corneliusroemer changed the title fix: don't sort author lists, except for segment grouping purposes [don't merge without prior discussion] fix: don't sort author lists, except for segment grouping purposes Aug 28, 2024
Copy link
Contributor

@anna-parker anna-parker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for doing this @corneliusroemer!

@anna-parker
Copy link
Contributor

One suggestion

BEGIN TRANSACTION;

SELECT accession
INTO TEMPORARY TABLE temp_accessions
FROM sequence_entries
WHERE submitter = 'insdc_ingest_user' AND version = 2;

DELETE FROM sequence_entries
WHERE accession IN (SELECT accession FROM temp_accessions) AND version = 1;

UPDATE sequence_entries
SET version = 1
WHERE submitter = 'insdc_ingest_user' AND version = 2;

DROP TABLE temp_accessions;


SELECT accession
INTO TEMPORARY TABLE temp_accessions_preprocessed
FROM sequence_entries_preprocessed_data
WHERE submitter = 'insdc_ingest_user' AND version = 2;

DELETE FROM sequence_entries_preprocessed_data
WHERE accession IN (SELECT accession FROM temp_accessions_preprocessed) AND version = 1;

UPDATE sequence_entries_preprocessed_data
SET version = 1
WHERE submitter = 'insdc_ingest_user' AND version = 2;

DROP TABLE temp_accessions_preprocessed;

COMMIT;

@anna-parker
Copy link
Contributor

anna-parker commented Aug 30, 2024

ROLLBACK;
DO $$ 
DECLARE
    acc_exists BOOLEAN;
BEGIN
    -- Check if any accession exists with the given conditions
    SELECT EXISTS (
        SELECT 1 
        FROM sequence_entries 
        WHERE submitter = 'insdc_ingest_user' AND version > 2
    ) INTO acc_exists;

    -- If such an accession exists, rollback the transaction
    IF acc_exists THEN
        RAISE NOTICE 'Condition met: Rolling back transaction.';
        -- Rollback the transaction; this will end the transaction block
        RAISE EXCEPTION 'Transaction rolled back due to condition.';
    ELSE
        -- Create a temporary table for use in the following operations
        CREATE TEMPORARY TABLE accessions_to_downgrade AS
        SELECT accession
        FROM sequence_entries
        WHERE submitter = 'insdc_ingest_user' AND version = 2;
        
        -- Delete entries with version 1 that have accessions to downgrade
        DELETE FROM sequence_entries
        WHERE accession IN (SELECT accession FROM accessions_to_downgrade) AND version = 1;
        
        -- Update version 2 to version 1 for the relevant accessions
        UPDATE sequence_entries
        SET version = 1
        WHERE submitter = 'insdc_ingest_user' AND version = 2;
        
        -- Delete preprocessed data entries with version 1 for the relevant accessions
        DELETE FROM sequence_entries_preprocessed_data
        WHERE accession IN (SELECT accession FROM accessions_to_downgrade) AND version = 1;
        
        -- Update version 2 to version 1 in preprocessed data
        UPDATE sequence_entries_preprocessed_data
        SET version = 1
        WHERE accession IN (SELECT accession FROM accessions_to_downgrade) AND version = 2;
        
        -- Drop the temporary table
        DROP TABLE accessions_to_downgrade;
        
        RAISE NOTICE 'There are only v1 and v2 - can continue.';
        -- No explicit COMMIT needed as the DO block will commit if no exceptions are raised
    END IF;
END $$;

@anna-parker
Copy link
Contributor

Weird I didn't think we could have duplicate ingest anymore:
image

@theosanderson
Copy link
Member

Yes, that is mysterious

@corneliusroemer
Copy link
Contributor Author

Yeah it's odd, but I can't see why author order would have this effect?

@anna-parker
Copy link
Contributor

anna-parker commented Sep 18, 2024

uggh... #2837 (which has the same code changes as this PR) has the exact same duplication in west nile (and only in west nile) ... so I do think it is this PR...

@theosanderson
Copy link
Member

theosanderson commented Sep 18, 2024

It doesn't seem to have the duplication right now? https://ing-upd.loculus.org/ did we change sthng?

@anna-parker
Copy link
Contributor

anna-parker commented Sep 18, 2024

I rebased to see if that had an impact - now https://ing-upd.loculus.org/ has a duplication in cchf - but yes not in west nile anymore

@anna-parker
Copy link
Contributor

But maybe this is more linked to our ingest deployment? Is there any chance ingest pods can run concurrently?

I'm uncertain why this would only show up in this PR... maybe now ingest takes longer?

@anna-parker
Copy link
Contributor

Lets maybe not sort on keys? e.g. 1d833f9 - adding the sort makes ingest slower and we want to minimize time between the get-original-metadata call and the submit calls.

@corneliusroemer
Copy link
Contributor Author

@anna-parker have you measured how much this changes? I'd be very surprised if the sort takes more than a second. Sorting is fast, this is a small dataset.

@anna-parker
Copy link
Contributor

Yeah its not a big difference, e.g. for west nile

10:37:12    INFO ( prepare_metadata.py: 151) - Unsorted write took: 0.32383300000000004 
10:37:13    INFO ( prepare_metadata.py: 154) - Sorted write took: 0.356398 

But found it suspicious we saw this for the organisms with the most data

@anna-parker anna-parker merged commit 767221a into main Sep 19, 2024
29 checks passed
@anna-parker anna-parker deleted the author-order branch September 19, 2024 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working high_priority Work on this as soon as possible (potentially post-MVP) ingest Ingest pipeline preview Triggers a deployment to argocd
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Author list order is not preserved when ingested
3 participants