Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/metagenomics exchange #325

Merged
merged 36 commits into from
Feb 8, 2024
Merged

Conversation

KateSakharova
Copy link
Contributor

@KateSakharova KateSakharova commented Sep 12, 2023

Scripts to populate and track Metagenomics Exchange https://www.ebi.ac.uk/ena/registry/metagenome/api/
ME is based on EBI Search model.
System for ME works a bit differently. For update process we do not need to remove and then add a record to "XML" again. That is why I do not use to_delete() function.
I've created get_suppressed() to filter records that need to be removed from ME API.

@KateSakharova KateSakharova requested review from SandyRogers and removed request for SandyRogers October 23, 2023 10:36
@KateSakharova KateSakharova marked this pull request as draft October 23, 2023 12:54
@KateSakharova KateSakharova marked this pull request as ready for review January 31, 2024 16:29
Copy link
Member

@mberacochea mberacochea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff Kate, I left some comments :)

# metagenomics exchange
me_api: 'https://wwwdev.ebi.ac.uk/ena/registry/metagenome/api'
me_api_token: 'mgx 871cd915-2826-46bb-94ed-8e6a3d9b6014'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to remove this token and ask Suran to grant us a new one, this has write access and this repo is public (even if it's dev env). This one should be in the secrets config file

logging.info(f"Analysis {ajob} updated successfully")
# Just to be safe, update the MGX accession
ajob.mgx_accession = registry_id
ajob.last_mgx_indexed = timezone.now()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one should also use the "nowish" timestamp: "timezone.now() + timedelta(minutes=1)"

return response

def add_analysis(self, mgya: str, run_accession: str, public: bool):
data = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "data" object is repeated in the command:

    def generate_metadata(self, mgya, run_accession, status):
        return {
            "confidence": "full",
            "endPoint": f"https://www.ebi.ac.uk/metagenomics/analyses/{mgya}",
            "method": ["other_metadata"],
            "sourceID": mgya,
            "sequenceID": run_accession,
            "status": status,
            "brokerID": settings.METAGENOMICS_EXCHANGE_MGNIFY_BROKER,
        }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep one copy of this, probably here

emgapi/models.py Outdated
@@ -321,20 +331,65 @@ def to_add(self):

return self.filter(never_indexed | updated_after_indexing, not_suppressed, not_private)

def get_suppressed(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding this method, it's better to override "to_delete" with your implementation. This way we stick to the "interface" that IndexableModelQueryset provides

class MetagenomicsExchangeQueryset(IndexableModelQueryset):
    
    index_field = "last_mgx_indexed"

    def to_delete(self):
        try:
            self.model._meta.get_field("suppressed_at")
        except FieldDoesNotExist:
            return Q()
        else:
            return self.filter(
                Q(suppressed_at__gte=F(self.index_field))
            )

emgapi/models.py Outdated
@@ -1634,6 +1687,70 @@ def __str__(self):
return 'Assembly:{} - Sample:{}'.format(self.assembly, self.sample)


class MetagenomicsExchangeQueryset(models.QuerySet):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class should be removed

Copy link
Member

@SandyRogers SandyRogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @KateSakharova , great stuff! Couple of small suggestions and queries inline.

emgapi/models.py Outdated
Comment on lines 1690 to 1708
class MetagenomicsExchangeModel(models.Model):
"""Model to track Metagenomics Exchange population
https://www.ebi.ac.uk/ena/registry/metagenome/api/
"""
last_updated_me = models.DateTimeField(
db_column='LAST_UPDATED_ME',
auto_now=True
)
last_populated_me = models.DateTimeField(
db_column='LAST_POPULATED_ME',
null=True,
blank=True,
help_text="Date at which this model was last appeared in Metagenomics Exchange"
)

objects_for_population = MetagenomicsExchangeQueryset.as_manager()

class Meta:
abstract = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this abstract model for? Isn't last_populated_me now done by last_mgx_indexed in the MetagenomicsExchangeIndexedModel? Maybe I misunderstood.

Copy link
Contributor Author

@KateSakharova KateSakharova Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch 🤦, it was left after ebi_search merge by mistake, removed

Comment on lines 690 to 697
try:
METAGENOMICS_EXCHANGE_API = EMG_CONF['emg']['me_api']
if EMG_CONF['emg'].get('me_api_token'):
METAGENOMICS_EXCHANGE_API_TOKEN = EMG_CONF['emg']['me_api_token']
else:
METAGENOMICS_EXCHANGE_API_TOKEN = os.getenv('METAGENOMICS_EXCHANGE_API_TOKEN')
except KeyError:
warnings.warn("The metagenomics exchange API and Token are not configured properly")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
METAGENOMICS_EXCHANGE_API = EMG_CONF['emg']['me_api']
if EMG_CONF['emg'].get('me_api_token'):
METAGENOMICS_EXCHANGE_API_TOKEN = EMG_CONF['emg']['me_api_token']
else:
METAGENOMICS_EXCHANGE_API_TOKEN = os.getenv('METAGENOMICS_EXCHANGE_API_TOKEN')
except KeyError:
warnings.warn("The metagenomics exchange API and Token are not configured properly")
try:
METAGENOMICS_EXCHANGE_API = EMG_CONF['emg']['me_api']
METAGENOMICS_EXCHANGE_API_TOKEN = os.getenv('METAGENOMICS_EXCHANGE_API_TOKEN')
if not METAGENOMICS_EXCHANGE_API_TOKEN:
METAGENOMICS_EXCHANGE_API_TOKEN = EMG_CONF['emg']['me_api_token']
except KeyError:
warnings.warn("The metagenomics exchange API and Token are not configured properly")

Just to make sure that a missing token (not in env or in config) triggers the warning.

emgapi/migrations/0017_auto_20240129_1401.py Outdated Show resolved Hide resolved
"""Metagenomics Exchange API Client"""

def __init__(self, base_url=None):
self.base_url = base_url if base_url else settings.METAGENOMICS_EXCHANGE_API
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.base_url = base_url if base_url else settings.METAGENOMICS_EXCHANGE_API
self.base_url = base_url or settings.METAGENOMICS_EXCHANGE_API

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Copyright 2017-2022 EMBL - European Bioinformatics Institute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Copyright 2017-2022 EMBL - European Bioinformatics Institute
# Copyright 2017-2024 EMBL - European Bioinformatics Institute

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Copyright 2018-2023 EMBL - European Bioinformatics Institute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Copyright 2018-2023 EMBL - European Bioinformatics Institute
# Copyright 2018-2024 EMBL - European Bioinformatics Institute

def process_to_index_and_update_records(self, analyses_to_index_and_update):
logging.info(f"Indexing {len(analyses_to_index_and_update)} new analyses")

for page in Paginator(analyses_to_index_and_update, 100):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 100 is probably worth making a parameter – we had some trouble with EBI dump where reasonable sounding page sizes still caused query problem on the DB. If parametrized we can at least easily try different values.

logging.info(f"Analysis {ajob} updated successfully")
# Just to be safe, update the MGX accession
ajob.mgx_accession = registry_id
ajob.last_mgx_indexed = timezone.now() + timedelta(minutes=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to consider here, is that if we offset individual jobs indexation by only 1 minute individually, by the time this loop exists and the bulk_update on L154 happens – that +minute may be in the past. This means that the auto last_update will now be after the last indexation (if Django decided to update the last_update for bulk updates... which I think it currently doesn't, but might in future). Perhaps worth a quick test / check the Django docs to be sure this won't be a problem.

KateSakharova and others added 3 commits February 6, 2024 16:08
…ndexed

This is needed because RenameField doesn't allow you to change
the db_column name.

With this change it seems to work, and the data should be kept:

```sql
--
-- Alter field last_indexed on analysisjob
--
ALTER TABLE "ANALYSIS_JOB" RENAME COLUMN "LAST_INDEXED" TO "LAST_EBI_SEARCH_INDEXED";
--
-- Alter field last_indexed on study
--
ALTER TABLE "STUDY" RENAME COLUMN "LAST_INDEXED" TO "LAST_EBI_SEARCH_INDEXED";
--
-- Rename field last_indexed on analysisjob to last_ebi_search_indexed
--
--
-- Rename field last_indexed on study to last_ebi_search_indexed
--
```
Copy link
Member

@SandyRogers SandyRogers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @KateSakharova !

Copy link
Member

@mberacochea mberacochea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Kate!

@mberacochea mberacochea merged commit 633b3f7 into develop Feb 8, 2024
4 checks passed
@mberacochea mberacochea deleted the feature/metagenomics_exchange branch February 29, 2024 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants