Feature/ebi search dump migration #332

Merged
merged 30 commits into from
Jan 29, 2024

@SandyRogers SandyRogers commented Oct 20, 2023

This PR:

  • adds management commands to generate XML dumps of Studies ("projects") and Analyses ("runs") for EBI Search indexing.
  • updates the schema of the dumps. In particular, we no longer index samples separately; instead the relevant sample metadata is dumped onto the analyses listing. A sample may appear more than once if there are multiple analyses of it, but this is acceptable.
  • updates a few mongo queries by introducing lazy references. By default these do not fetch the other side of reference fields (e.g. the description of an InterPro identifier found in an analysis). Instead, they just return the PK. Since the PKs for these fields are usually all we need to xref annotations to the foreign db (e.g. InterPro), and are adequate for a user to search for (e.g. by searching an IPRxxxx accession), this is a fine optimisation that lets the search dump run in reasonable time.
  • makes (at least some) last_updated fields auto-now, so that if an object changes it will be included in the next incremental EBI Search indexing.
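
The lazy-reference point can be illustrated with a small plain-Python sketch. This is not the ODM's actual API and not the code in this PR: `CountingStore`, `LazyRef`, `xrefs_for_dump` and the accessions and descriptions are all hypothetical, chosen only to show that building xrefs from PKs needs no extra lookups.

```python
from dataclasses import dataclass
from typing import Dict, List


class CountingStore:
    """Hypothetical stand-in for the referenced collection; counts lookups."""

    def __init__(self, descriptions: Dict[str, str]) -> None:
        self._descriptions = descriptions
        self.lookups = 0  # number of simulated round trips to the database

    def get(self, pk: str) -> str:
        self.lookups += 1
        return self._descriptions[pk]


@dataclass
class LazyRef:
    """Holds only the primary key; the referenced document is fetched on demand."""

    pk: str
    store: CountingStore

    def fetch(self) -> str:
        return self.store.get(self.pk)


def xrefs_for_dump(refs: List[LazyRef]) -> List[dict]:
    # Only the PK is needed to cross-reference the foreign database,
    # so nothing is dereferenced here.
    return [{"dbname": "interpro", "dbkey": ref.pk} for ref in refs]


# Illustrative data only (made-up descriptions).
store = CountingStore({
    "IPR000001": "hypothetical description A",
    "IPR000002": "hypothetical description B",
})
refs = [LazyRef("IPR000001", store), LazyRef("IPR000002", store)]

xrefs = xrefs_for_dump(refs)
print(xrefs)
print("lookups while building xrefs:", store.lookups)
```

Calling `refs[0].fetch()` afterwards would trigger exactly one lookup, which is the cost an eager reference pays for every annotation whether or not the description is used.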

mberacochea and others added 29 commits July 12, 2023 12:25
This is a WIP, but the analysis should be nearly done.

I've moved the logic for "Runs" of EBI Search Dump:
- Code that reads the analysis data from the DB and flat files -> https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/input/metagenomicsDB.py#L191
  - I've replaced the file-reading bits with calls to Mongo / MySQL, which
    makes the code independent of the filesystem.
- Code that generates the xml output file for **one** analysis: https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/output/run.py

We still need to port the Samples and Studies, and the analysis aggregation.

The aggregation is required to merge the analyses into XML bundles that EBI Search
ingests.

AnalysesAggregation script -> https://github.com/EBI-Metagenomics/MetagenomicsSearchDump/blob/master/src/RunEntryAggregator.py
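
A minimal sketch of what such an aggregation step could look like, using only the standard library. It is not the RunEntryAggregator logic: the function names, the accessions, the database name and the element names (`database`, `entry_count`, `entries`, `entry`, `additional_fields`, `field`) are assumptions for illustration and should be checked against the EBI Search format documentation.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def analysis_entry(accession: str, sample_fields: Dict[str, str]) -> ET.Element:
    """Build one <entry> for an analysis, with its sample metadata inlined."""
    entry = ET.Element("entry", id=accession)
    fields = ET.SubElement(entry, "additional_fields")
    for name, value in sample_fields.items():
        field = ET.SubElement(fields, "field", name=name)
        field.text = str(value)
    return entry


def bundle(entries: List[ET.Element], database_name: str) -> ET.Element:
    """Merge per-analysis entries into a single XML document."""
    root = ET.Element("database")
    ET.SubElement(root, "name").text = database_name
    ET.SubElement(root, "entry_count").text = str(len(entries))
    container = ET.SubElement(root, "entries")
    container.extend(entries)
    return root


# Two made-up analyses of the same made-up sample: the sample metadata is
# simply repeated on each analysis entry.
sample = {"sample_accession": "SAMPLE_0001", "biome": "example biome"}
entries = [
    analysis_entry("ANALYSIS_0001", sample),
    analysis_entry("ANALYSIS_0002", sample),
]
root = bundle(entries, "hypothetical_analyses")
print(ET.tostring(root, encoding="unicode"))
```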

Another task is to add an indexed_at field on Analyses, Samples and Studies.
This will be useful for incremental dump generation, instead of
having to regenerate all the files. Moving to this will also require
modifying the AnalysesAggregation script OR moving
to EBI Search incremental dumps (https://www.ebi.ac.uk/seqdb/confluence/display/EXT/Preparation+for+incremental+indexing)
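
One way the indexed_at idea could work, sketched in plain Python rather than as model code: an object needs (re)indexing if it has never been indexed or has changed since it was last indexed. `Record`, `needs_indexing` and `incremental_batch` are hypothetical names, and the timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Record:
    accession: str
    last_updated: datetime
    indexed_at: Optional[datetime] = None  # None means never indexed


def needs_indexing(record: Record) -> bool:
    """True if the record was never indexed, or changed after its last indexing."""
    return record.indexed_at is None or record.last_updated > record.indexed_at


def incremental_batch(records: List[Record], now: datetime) -> List[Record]:
    """Select the records for this dump and stamp them as indexed.

    In a real pipeline the stamp would more likely be written only after
    the dump has been produced successfully.
    """
    batch = [r for r in records if needs_indexing(r)]
    for r in batch:
        r.indexed_at = now
    return batch


t0 = datetime(2024, 1, 1)
day = timedelta(days=1)
records = [
    Record("A", last_updated=t0),                             # never indexed
    Record("B", last_updated=t0, indexed_at=t0 + day),        # up to date
    Record("C", last_updated=t0 + 2 * day, indexed_at=t0 + day),  # changed since
]

first = incremental_batch(records, now=t0 + 3 * day)
second = incremental_batch(records, now=t0 + 4 * day)
print([r.accession for r in first])
print([r.accession for r in second])
```

This comparison only works if last_updated is reliably bumped whenever an object changes.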

Last one: EBI Search also supports JSON (https://www.ebi.ac.uk/seqdb/confluence/display/EXT/JSON+data+format)
# Conflicts:
#	emgcli/__init__.py
#	pyproject.toml
# Conflicts:
#	emgapi/models.py
#	emgcli/__init__.py
#	pyproject.toml
# Conflicts:
#	emgcli/__init__.py
#	pyproject.toml
@SandyRogers SandyRogers marked this pull request as ready for review January 22, 2024 16:09
@mberacochea mberacochea left a comment

Great stuff @SandyRogers


sample_metadata = {}
for sample_metadata_entry in analysis.sample.metadata.all():
    if (vn := sample_metadata_entry.var.var_name) in sample_annotations_to_index:

Quality code review

if page.number > mp:
    logger.warning("Skipping remaining pages")
    break
logger.info(f"Dumping {page.number = }/{paginated_analyses.num_pages}")

Typo?

Suggested change
- logger.info(f"Dumping {page.number = }/{paginated_analyses.num_pages}")
+ logger.info(f"Dumping {page.number}/{paginated_analyses.num_pages}")


This one wasn't actually a typo. It's the shorthand f-string syntax (Python 3.8+), handy in logs for printing variable names along with their values, i.e. it outputs

Dumping page.number = 1/999
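
For reference, a small standalone check of the `=` specifier, with `page` and `num_pages` as stand-ins for the paginator objects in the PR. The specifier echoes the expression text, including any whitespace around the `=`, followed by the `repr` of the value.

```python
from types import SimpleNamespace

# Stand-ins for the paginator page and page count used in the PR.
page = SimpleNamespace(number=1)
num_pages = 999

with_spaces = f"Dumping {page.number = }/{num_pages}"
without_spaces = f"Dumping {page.number=}/{num_pages}"
plain = f"Dumping {page.number}/{num_pages}"

print(with_spaces)     # Dumping page.number = 1/999
print(without_spaces)  # Dumping page.number=1/999
print(plain)           # Dumping 1/999
```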

@SandyRogers SandyRogers merged commit 0109374 into develop Jan 29, 2024
4 checks passed
@mberacochea mberacochea deleted the feature/ebi-search-dump-migration branch February 29, 2024 12:31