Feature/ebi search dump migration #332

SandyRogers · 2023-10-20T17:17:29Z

This PR:

- adds management commands that write XML dumps of Studies ("projects") and Analyses ("runs") for EBI Search indexing.
- updates the schema of the dumps. In particular, we no longer index samples separately; instead, the relevant sample metadata is included in each analysis entry. A sample may therefore appear more than once if it has multiple analyses, but this is acceptable.
- updates a few Mongo queries by introducing lazy references. By default, these do not fetch the other side of reference fields (e.g. the description of an InterPro identifier found in an analysis); they just return the PK. Since the PKs for these fields are usually all we need to cross-reference annotations to the foreign database (e.g. InterPro), and are adequate for a user to search on (e.g. by an IPRxxxx accession), this is a reasonable optimisation that lets the search dump finish in reasonable time.
- makes (at least some) last_updated fields auto-now, so that an object that changes is included in the next incremental EBI Search indexing.

This is a WIP, but the analysis part should be nearly done. I've moved over the logic for "Runs" from the EBI Search Dump:

- Code that reads the analysis data from the DB and flat files -> https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/input/metagenomicsDB.py#L191
  - I've replaced the file-reading bits with calls to Mongo / MySQL, which makes the code independent of the filesystem.
- Code that generates the XML output file for **one** analysis: https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/output/run.py

We still need to port the Samples and Studies, and the analysis aggregation. The aggregation is required to merge the analyses into the XML bundles that EBI Search ingests. AnalysesAggregation script -> https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/RunEntryAggregator.py

Another task is to add an indexed_at field on Analyses, Samples and Studies. This will be useful for incremental dump generation, instead of having to regenerate all the files. Moving to this will also require modifying the AnalysesAggregation script OR moving to EBI Search incremental dumps (https://www.ebi.ac.uk/seqdb/confluence/display/EXT/Preparation+for+incremental+indexing).

Last one: EBI Search also supports JSON (https://www.ebi.ac.uk/seqdb/confluence/display/EXT/JSON+data+format).
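
As a rough sketch of how an indexed_at field could drive incremental dumps: an object needs (re)indexing if it has never been indexed or has changed since it was last indexed. The `Record` class, the helper names, and the accessions below are hypothetical illustrations with plain dataclasses, not the project's Django models.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for an Analysis, Sample or Study row."""
    accession: str
    last_update: datetime
    indexed_at: Optional[datetime] = None


def needs_indexing(record: Record) -> bool:
    # Never indexed, or modified after the last indexing run.
    return record.indexed_at is None or record.last_update > record.indexed_at


def select_for_dump(records: Iterable[Record]) -> List[Record]:
    return [r for r in records if needs_indexing(r)]


def mark_indexed(records: Iterable[Record], when: datetime) -> None:
    # Record the dump time so unchanged objects are skipped next run.
    for r in records:
        r.indexed_at = when


records = [
    Record("EXAMPLE0001", datetime(2024, 1, 10)),                        # never indexed
    Record("EXAMPLE0002", datetime(2024, 1, 10), datetime(2024, 1, 15)),  # up to date
    Record("EXAMPLE0003", datetime(2024, 1, 20), datetime(2024, 1, 15)),  # changed since
]

to_dump = select_for_dump(records)
selected = [r.accession for r in to_dump]
```

After a successful dump, `mark_indexed(to_dump, dump_time)` would exclude those records from the next run until they change again.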


mberacochea

Great stuff @SandyRogers

mberacochea · 2024-01-25T10:57:32Z

emgapi/management/commands/ebi_search_analysis_dump.py

+
+        sample_metadata = {}
+        for sample_metadata_entry in analysis.sample.metadata.all():
+            if (vn := sample_metadata_entry.var.var_name) in sample_annotations_to_index:


Quality code review
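
For readers unfamiliar with the pattern in the excerpt above, here is a standalone sketch of filtering metadata against an allow-list using an assignment expression. Plain dicts stand in for the Django objects, and the entry shape, the allow-list contents, and the function name are assumptions; what the PR does with each matching entry is not shown in the excerpt, so this sketch simply stores the value under the variable name.

```python
from typing import Dict, Iterable, Set


def collect_indexable_metadata(
    entries: Iterable[Dict[str, str]], allowed: Set[str]
) -> Dict[str, str]:
    """Keep only the metadata entries whose variable name is allow-listed."""
    sample_metadata: Dict[str, str] = {}
    for entry in entries:
        # The walrus operator binds the name once and tests it in the same step.
        if (vn := entry["var_name"]) in allowed:
            sample_metadata[vn] = entry["value"]
    return sample_metadata


# Hypothetical allow-list and entries, for illustration only.
sample_annotations_to_index = {"temperature", "depth"}
entries = [
    {"var_name": "temperature", "value": "12"},
    {"var_name": "collector", "value": "someone"},
    {"var_name": "depth", "value": "30"},
]

result = collect_indexable_metadata(entries, sample_annotations_to_index)
```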

mberacochea · 2024-01-25T11:00:44Z

emgapi/management/commands/ebi_search_analysis_dump.py

+                if page.number > mp:
+                    logger.warning("Skipping remaining pages")
+                    break
+            logger.info(f"Dumping {page.number = }/{paginated_analyses.num_pages}")


Typo?

Suggested change

-            logger.info(f"Dumping {page.number = }/{paginated_analyses.num_pages}")
+            logger.info(f"Dumping {page.number}/{paginated_analyses.num_pages}")

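SandyRogers · 2024-01-25

This one wasn't actually a typo. It's the f-string "=" shorthand (Python 3.8+), which is handy in logs because it prints the expression text along with its value, keeping any spaces written around the "=". So this line outputs

Dumping page.number = 1/999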
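
A small self-contained check of this f-string behaviour (Python 3.8+). The `page` object and `num_pages` value are stand-ins for the paginator in the PR. Note that spaces written around the `=` inside the braces are carried through to the output, and that the value is formatted with `repr()` by default.

```python
from types import SimpleNamespace

# Stand-ins for the Django paginator objects used in the command.
page = SimpleNamespace(number=1)
num_pages = 999

# "=" specifier with spaces: expression text, " = ", then the value.
spaced = f"Dumping {page.number = }/{num_pages}"

# "=" specifier without spaces.
compact = f"Dumping {page.number=}/{num_pages}"

# Plain interpolation, as in the suggested change.
plain = f"Dumping {page.number}/{num_pages}"

print(spaced)   # Dumping page.number = 1/999
print(compact)  # Dumping page.number=1/999
print(plain)    # Dumping 1/999
```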

mberacochea and others added 29 commits July 12, 2023 12:25

Merge branch 'develop' into feature/ebi-search-dump-migration

6f6c070

fixes podman compatibility of docker setup

2b02f66

adds incremental ebi search dumping for analysisjobs

974371f

integrates indexable sample data into analyses ebi search dump

14f0842

adds ebi search dump for studies/projects (plus some tweaks to analyses)

1a3acfd

Merge branch 'develop' into feature/ebi-search-dump-migration

afa4cb5

resolve migration merge conflicts

d914ec1

v2.4.36

ca360e0

allows analysis features to be null in ebi search dump

2827c2b

adds support for chunking of analysis-runs ebi search dump

d18573c

escapes html chars in ebi search dump

28908ca

restore ability to recreate test DB fixtures; recreate them

5852201

ensure biome hierarchical_field is not empty for root-tagged studies

6b40098

Merge branch 'develop' into feature/ebi-search-dump-migration

480d951

# Conflicts: # emgcli/__init__.py # pyproject.toml

allow ERA database to be non existent

d185d4e

bugfixes on analysis-run dumper

f36021e

Merge branch 'develop' into feature/ebi-search-dump-migration

5eb0fc7

# Conflicts: # emgapi/models.py # emgcli/__init__.py # pyproject.toml

Merge branch 'develop' into feature/ebi-search-dump-migration

57a391b

# Conflicts: # emgcli/__init__.py # pyproject.toml

optimising speed of dump (particularly mongo)

67edbcd

avoids serializing last_indexed timestamps

0e21458

loosens test since last-updates are auto fields

0a8f872

materialise (fetch) lazyily referenced mongo embeds when accessing related attributes

e10fd33

test updates since last_update field became auto

0c8c61c

more materialised lazy reference fields

a036402

insert test studies in reverse to preserve previous ordering

4fd2f3d

typo

de3b415

restore quote string in csv values

d13789e

2.4.44

d23c962

SandyRogers marked this pull request as ready for review January 22, 2024 16:09

SandyRogers requested a review from mberacochea January 22, 2024 16:14

mberacochea approved these changes Jan 25, 2024

Merge branch 'develop' into feature/ebi-search-dump-migration

8140ef5

SandyRogers merged commit 0109374 into develop Jan 29, 2024
4 checks passed

mberacochea deleted the feature/ebi-search-dump-migration branch February 29, 2024 12:31
