clean up loader and downloader (#149)
* change BGC attributes type from list to tuple

The following BGC attributes are updated:
- product_prediction
- mibig_bgc_class
- smiles

* use positional-only parameter in BGC and GCF

Parameters before "/" are positional-only parameters, see https://docs.python.org/3/glossary.html#term-parameter.
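
For illustration, a minimal sketch of the pattern (hypothetical signature — the real ones live in bgc.py and gcf.py), which also shows a tuple-typed `product_prediction` as in the first bullet:

```python
class BGC:
    # `bgc_id` sits before "/", so it can only be passed positionally
    def __init__(self, bgc_id: str, /, *product_prediction: str):
        self.bgc_id = bgc_id
        self.product_prediction = product_prediction  # stored as a tuple

bgc = BGC("BGC0000001", "NRP", "Polyketide")
print(bgc.product_prediction)  # ('NRP', 'Polyketide')
# BGC(bgc_id="BGC0000001") would raise TypeError: bgc_id is positional-only
```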

* update BGC's `__eq__` and `__hash__`

* update GCF's `__eq__` and `__hash__`
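
A minimal sketch of the id-based pattern these updates follow (assuming equality is keyed on a single id attribute; the actual fields compared may differ per class):

```python
class GCF:
    def __init__(self, gcf_id: str, /):
        self.gcf_id = gcf_id

    def __eq__(self, other) -> bool:
        if isinstance(other, GCF):
            return self.gcf_id == other.gcf_id
        return NotImplemented

    def __hash__(self) -> int:
        # objects that compare equal must hash equal
        return hash(self.gcf_id)

assert GCF("g1") == GCF("g1")
assert len({GCF("g1"), GCF("g1")}) == 1  # safe to use as dict/set keys
```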

* Update gcf.py

* update Spectrum's `__eq__` and `__hash__`

* update MolecularFamily `__eq__` and `__hash__`

* Update molecular_family.py

* update Strain `__eq__` and `__hash__`

* update StrainCollection `__eq__`

* add TODO comments to ObjectLink

* add parameter type check for `add_alias`

* add `__contains__` to Strain class
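
A sketch of how these Strain-side changes could fit together (attribute names are assumptions, not the real implementation):

```python
class Strain:
    def __init__(self, primary_name: str):
        self.name = primary_name
        self._aliases: set[str] = set()

    def add_alias(self, alias: str) -> None:
        if not isinstance(alias, str):  # parameter type check
            raise TypeError(f"Expected str, got {type(alias)}")
        self._aliases.add(alias)

    def __contains__(self, name: str) -> bool:
        # a name matches the strain if it is the primary name or an alias
        return name == self.name or name in self._aliases

strain = Strain("NRRL 30748")
strain.add_alias("Streptomyces sp. CNB091")
print("Streptomyces sp. CNB091" in strain)  # True
```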

* update `lookup` method of StrainCollection

* update `__contains__` in StrainCollection

* remove from `__eq__`

* update `__eq__` logic for Strain

* rename `_strain_dict_id` to `_strain_dict_name` in StrainCollection

* add comments to `get_common_strains`

* add comments and rename variables for DataLinks

* add comment about `met_only` parameter

* add todo comments to LinkFinder

* add comments to GNPSSpectrumLoader

to figure out how `spectrum_id` is set

* change Spectrum.spectrum_id from type int to str

* update spec_dict

* Update tests

* update `__eq__` in MolecularFamily

* change `MolecularFamily.family_id` from type int to str

* add method `has_strain` to MolecularFamily

* Update metcalf_scoring.py

* change array to dataframe in DataLinks

1. Change arrays to dataframes:
- self.M_gcf_strain -> self.gcf_strain_occurrence
- self.M_spec_strain -> self.spec_strain_occurrence
- self.M_fam_strain -> self.mf_strain_occurrence
2. update relevant methods to produce the new dataframes
3. update logic of method `common_strains` to use the new dataframes (see the occurrence sketch below)
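
A hedged sketch of such an occurrence dataframe (illustrative ids and values only):

```python
import pandas as pd

# rows: GCF ids; columns: strain ids; 1 = the GCF occurs in that strain
gcf_strain_occurrence = pd.DataFrame(
    [[1, 0, 1],
     [0, 1, 0]],
    index=["gcf_1", "gcf_2"],
    columns=["strain_a", "strain_b", "strain_c"],
    dtype=int,
)

# label-based lookups replace the old positional array indexing
row = gcf_strain_occurrence.loc["gcf_1"]
print(row[row > 0].index.tolist())  # ['strain_a', 'strain_c']
```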

* update references of the new dataframes from DataLinks

* update logic of `get_links` in NPLinker class

* Update test_nplinker.py

- add code to remove cached results

* move SCORING_METHODS to LinkFinder

* update method name to `get_common_strains`

* refactor mapping dataframes in DataLinks

* add TODOs and deprecation to LinkFinder

* refactor cooccurrence in DataLinks

* merge `load_data` and `find_correlations` into `__init__` in DataLinks

* refactor DataLinks attributes

- Move assignment of attributes to `__init__`
- Rename attributes
- Replace `fam` or `molfam` with `mf` to refer to molecular family
- Add docstrings

* Delete test_data_links.py

* update get_common_strains methods

- update parameters to be clearer and more specific
- change strain ids in the returned dict to strain objects
- update docstrings

* remove lookup_index method from StrainCollection (#90)

- remove method `lookup_index`
- remove attribute `_strain_dict_index`

* Remove integer id from GCF

* update lookup methods and attributes in NPLinker class

* change cooccurrence from array to DataFrame in DataLinks

* format link_finder.py

* temp replace array with dataframe in LinkFinder for metcalf scoring

* refactor `LinkFinder.get_scores` method

* refactor `LinkFinder.metcalf_scoring` method

- rename parameter name
- wrap parameters for weights to one parameter
- extract private method `_cal_mean_std` (see the sketch below)
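
A rough illustration of what such a helper enables — z-scoring raw link scores (a sketch only; the real method's inputs and outputs may differ):

```python
import numpy as np

def _calc_mean_std(scores: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # column-wise mean and standard deviation of a raw score matrix
    return scores.mean(axis=0), scores.std(axis=0)

raw = np.array([[2.0, 4.0],
                [6.0, 8.0]])
mean, std = _calc_mean_std(raw)
print((raw - mean) / std)  # standardised scores: [[-1. -1.] [ 1.  1.]]
```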

* refactor get_links

* remove unused methods and scorings from LinkFinder

- remove unused `likescore` and `hg` scoring types
- remove all unused methods

* refactor returned type of `LinkFinder.get_links` method

* add `lookup_mf` method in NPLinker class

* refactor MetcalfScoring class

* add deprecation to LinkLikelihood class

* add `__init__.py` to linking module

* rename `data_linking.py` to `data_links.py`

* rename `data_linking_functions.py` to `utils.py`

* rename `test_data_linking_functions.py` to `test_linking_utils.py`

* Delete test_scoring.py

* add dtype to DataLinks dataframes

* remove mapping dataframes and relevant method from DataLinks

Removed:
- self.mapping_spec
- self.mapping_gcf
- self.mapping_fam
- self.mapping_strain
- _get_mappings_from_occurrence() method

* Create test_data_links.py

* add `conftest.py` for scoring tests

* update LinkFinder's attribute and private method

- refactor method `_cal_mean_std`
- rename attribute `raw_score_fam_gcf` to `raw_score_mf_gcf`

* Create test_link_finder.py

* Update vscode plugin autodocstring template

- fix indentation bug in autodocstring
- remove `Examples:` section

* add scope for fixtures

* Create test_metcalf_scoring.py

* add docstrings and type hints to `MetcalfScoring` class

* add util func `isinstance_all`

* replace `_isinstance` with util func `isinstance_all`
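
A plausible one-liner for such a utility (a sketch; the real signature may differ):

```python
def isinstance_all(*objects, objtype: type) -> bool:
    """True only if every given object is an instance of `objtype`."""
    return all(isinstance(obj, objtype) for obj in objects)

print(isinstance_all("a", "b", objtype=str))  # True
print(isinstance_all("a", 1, objtype=str))    # False
```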

* update validation of args for `DataLinks`

* Update test_data_links.py

- add docstrings
- add more tests

* add type hints for returned values to unit tests

* update exception types for invalid input

* add docstrings and type hints to `LinkFinder` class

* add more unit tests for `LinkFinder`

* fix input type bug for `DataLinks.get_common_strains`

* Create test_nplinker_scoring.py

* add todo comments to `NPLinker` class

* remove local integration tests for scoring part of `NPLinker`

- rename `test_nplinker.py` to `test_nplinker_local.py`

* remove unused imports

* Fix mypy warnings as much as possible

* check strain existence using strain dict

* change the abbreviation for "calculate" from "cal" to "calc"

* remove resolved TODO comment

* move shared fixtures to conftest.py

* remove unnecessary type hints

* update docstrings for cooccurrences

* use uuid for singleton molecular families #144
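
For example (a sketch — the "singleton-" prefix is an assumption, not necessarily the real format):

```python
from uuid import uuid4

# a spectrum that belongs to no real family gets its own singleton family;
# a UUID guarantees the id cannot collide with GNPS family ids
family_id = f"singleton-{uuid4()}"
print(family_id)
```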

* add TODO comment for GNPSLoader

* fix typos

* remove useless parameter `met_only`

The `met_only` parameter is useless: NPLinker stops working if `met_only=True`.

* update exception type

* refactor the usage of PODPDownloader

1. create instance in the private method, only when it's needed
2. change the scope of the instance from global to local
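
Roughly, the change looks like this (a sketch; the stand-in `PODPDownloader` API below is hypothetical):

```python
class PODPDownloader:  # stand-in for nplinker's downloader, hypothetical API
    def __init__(self, platform_id: str):
        self.platform_id = platform_id

    def download(self) -> None:
        print(f"downloading dataset {self.platform_id}")


class DatasetLoader:
    def __init__(self, platform_id: str):
        # before: a downloader was created here and kept for the object's lifetime
        self._platform_id = platform_id

    def _download_dataset(self) -> None:
        # after: created only when actually needed, scoped to this method
        PODPDownloader(self._platform_id).download()


DatasetLoader("platform-project-id")._download_dataset()
```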

* rename private config attributes in class DatasetLoader

- add prefix `_config` to all config attributes
- add comments to restructure `__init__` code

* change the app data dir variable to be global

This variable is independent of DatasetLoader and other classes, so it should be a global variable.

* change two public methods to variables

* change one public method to attribute for DatasetLoader

* add value validation to Config

- move the validation of antismash format config in DatasetLoader to Config class
- refactor the config data validations into a private method

* add TODO comments about init and validate paths

* remove unused attribute `growth_media`

* remove commented code

* add TODO comments

* remove unused imports

* format the code

* reorder methods in loader.py
CunliangGeng authored Jul 4, 2023
1 parent bd35f65 commit 067aeb9
Showing 8 changed files with 526 additions and 571 deletions.
17 changes: 10 additions & 7 deletions src/nplinker/annotations.py
@@ -33,6 +33,7 @@ def _headers_match_gnps(headers: list[str]) -> bool:
         return False
     return True
 
+
 @deprecated(version="1.3.3", reason="Use GNPSAnnotationLoader class instead.")
 def create_gnps_annotation(spec: Spectrum, gnps_anno: dict):
     """Function to insert png, json, svg and spectrum information in GNPS annotation data for given spectrum.
@@ -46,8 +47,8 @@ def create_gnps_annotation(spec: Spectrum, gnps_anno: dict):
     """
     # also insert useful URLs
     for t in ['png', 'json', 'svg', 'spectrum']:
-        gnps_anno[f'{t}_url'] = GNPS_URL_FORMAT.format(
-            t, gnps_anno['SpectrumID'])
+        gnps_anno[f'{t}_url'] = GNPS_URL_FORMAT.format(t,
+                                                       gnps_anno['SpectrumID'])
 
     if GNPS_KEY in spec.annotations:
         # TODO is this actually an error or can it happen normally?
@@ -60,7 +61,9 @@ def create_gnps_annotation(spec: Spectrum, gnps_anno: dict):
 
 
 @deprecated(version="1.3.3", reason="Use GNPSAnnotationLoader class instead.")
-def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra: list[Spectrum], spec_dict: dict[str, Spectrum]) -> list[Spectrum]:
+def load_annotations(root: str | os.PathLike, config: str | os.PathLike,
+                     spectra: list[Spectrum],
+                     spec_dict: dict[str, Spectrum]) -> list[Spectrum]:
     """Load the annotations from the GNPS annotation file present in root to the spectra.
 
     Args:
@@ -120,8 +123,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra
             data = {}
             for dc in data_cols:
                 if dc not in headers:
-                    logger.warning(
-                        f'Column lookup failed for "{dc}"')
+                    logger.warning(f'Column lookup failed for "{dc}"')
                     continue
                 data[dc] = line[headers.index(dc)]
 
@@ -160,8 +162,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra
             data = {}
             for dc in data_cols:
                 if dc not in headers:
-                    logger.warning(
-                        f'Column lookup failed for "{dc}"')
+                    logger.warning(f'Column lookup failed for "{dc}"')
                     continue
                 data[dc] = line[headers.index(dc)]
             spec_annotations[spec].append(data)
@@ -173,6 +174,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra
 
     return spectra
 
+
 def _find_annotation_files(root: str, config: str) -> list[str]:
     """Detect all annotation files in the root folder or specified in the config file.
@@ -190,6 +192,7 @@ def _find_annotation_files(root: str, config: str) -> list[str]:
             annotation_files.append(os.path.join(root, f))
     return annotation_files
 
+
 def _read_config(config):
     ac = {}
     if os.path.exists(config):
59 changes: 30 additions & 29 deletions src/nplinker/config.py
@@ -1,20 +1,6 @@
-# Copyright 2021 The NPLinker Authors
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 import argparse
-import os
 from collections.abc import Mapping
+import os
 from shutil import copyfile
 import toml
 from xdg import XDG_CONFIG_HOME
@@ -143,24 +129,39 @@ def update(d, u):
             return d
 
         config = update(config, config_dict)
+        self._validate(config)
+        self.config = config
 
+    def _validate(self, config: dict) -> None:
+        """Validates the configuration dictionary to ensure that all required
+        fields are present and have valid values.
+
+        Args:
+            config (dict): The configuration dictionary to validate.
+
+        Raises:
+            ValueError: If the configuration dictionary is missing required
+                fields or contains invalid values.
+        """
         if 'dataset' not in config:
-            raise Exception('No dataset defined in configuration!')
+            raise ValueError('Not found config for "dataset".')
 
-        root = config['dataset']['root']
-        config['dataset']['platform_id'] = ''  # placeholder
+        root = config['dataset'].get('root')
+        if root is None:
+            raise ValueError('Not found config for "root".')
 
         # check if the root has the special 'platform:' prefix and extract the
         # ID if so. otherwise treat as a path
         if root.startswith('platform:'):
             config['dataset']['platform_id'] = root.replace('platform:', '')
-            logger.info('Selected platform project ID {}'.format(
-                config['dataset']['platform_id']))
+            logger.info('Loading from platform project ID %s',
+                        config['dataset']['platform_id'])
         else:
-            if root is None or not os.path.exists(root):
-                raise Exception(
-                    'Dataset path "{}" not found or not accessible'.format(
-                        root))
-            logger.info(f'Loading from local data in directory {root}')
-
-        self.config = config
+            config['dataset']['platform_id'] = ''
+            logger.info('Loading from local data in directory %s', root)
+
+        antismash = config['dataset'].get('antismash')
+        allowed_antismash_formats = ["default", "flat"]
+        if antismash is not None:
+            if 'format' in antismash and antismash[
+                    'format'] not in allowed_antismash_formats:
+                raise ValueError(
+                    f'Unknown antismash format: {antismash["format"]}')
2 changes: 1 addition & 1 deletion src/nplinker/genomics/mibig/mibig_loader.py
@@ -31,7 +31,7 @@ def get_bgc_genome_mapping(self) -> dict[str, str]:
 
         Note that for MIBiG BGC, same value is used for BGC id and genome id.
         Users don't have to provide genome id for MIBiG BGCs in the
-        `strain_mapping.csv` file.
+        `strain_mappings.csv` file.
 
         Returns:
             dict[str, str]: key is BGC id/accession, value is