clean up loader and downloader (#149)
* change BGC attributes type from list to tuple

The following BGC attributes are updated:
- product_prediction
- mibig_bgc_class
- smiles

* use positional-only parameter in BGC and GCF

Parameters before "/" are positional-only parameters, see https://docs.python.org/3/glossary.html#term-parameter.
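
For illustration, a minimal sketch of the pattern (hypothetical signature — the real ones live in bgc.py and gcf.py), which also shows a tuple-typed `product_prediction` as in the first bullet:

```python
class BGC:
    # `bgc_id` sits before "/", so it can only be passed positionally
    def __init__(self, bgc_id: str, /, *product_prediction: str):
        self.bgc_id = bgc_id
        self.product_prediction = product_prediction  # stored as a tuple

bgc = BGC("BGC0000001", "NRP", "Polyketide")
print(bgc.product_prediction)  # ('NRP', 'Polyketide')
# BGC(bgc_id="BGC0000001") would raise TypeError: bgc_id is positional-only
```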

* update BGC's `__eq__` and `__hash__`

* update GCF's `__eq__` and `__hash__`
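
A minimal sketch of the id-based pattern these updates follow (assuming equality is keyed on a single id attribute; the actual fields compared may differ per class):

```python
class GCF:
    def __init__(self, gcf_id: str, /):
        self.gcf_id = gcf_id

    def __eq__(self, other) -> bool:
        if isinstance(other, GCF):
            return self.gcf_id == other.gcf_id
        return NotImplemented

    def __hash__(self) -> int:
        # objects that compare equal must hash equal
        return hash(self.gcf_id)

assert GCF("g1") == GCF("g1")
assert len({GCF("g1"), GCF("g1")}) == 1  # safe to use as dict/set keys
```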

* Update gcf.py

* update Spectrum's `__eq__` and `__hash__`

* update MolecularFamily `__eq__` and `__hash__`

* Update molecular_family.py

* update Strain `__eq__` and `__hash__`

* update StrainCollection `__eq__`

* add TODO comments to ObjectLink

* add parameter type check for `add_alias`

* add `__contains__` to Strain class
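
A sketch of how these Strain-side changes could fit together (attribute names are assumptions, not the real implementation):

```python
class Strain:
    def __init__(self, primary_name: str):
        self.name = primary_name
        self._aliases: set[str] = set()

    def add_alias(self, alias: str) -> None:
        if not isinstance(alias, str):  # parameter type check
            raise TypeError(f"Expected str, got {type(alias)}")
        self._aliases.add(alias)

    def __contains__(self, name: str) -> bool:
        # a name matches the strain if it is the primary name or an alias
        return name == self.name or name in self._aliases

strain = Strain("NRRL 30748")
strain.add_alias("Streptomyces sp. CNB091")
print("Streptomyces sp. CNB091" in strain)  # True
```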

* update `lookup` method of StrainCollection

* update `__contains__` in StrainCollection

* remove from `__eq__`

* update `__eq__` logic for Strain

* rename `_strain_dict_id` to `_strain_dict_name` in StrainCollection

* add comments to `get_common_strains`

* add comments and rename variables for DataLinks

* add comment about `met_only` parameter

* add todo comments to LinkFinder

* add comments to GNPSSpectrumLoader

to figure out how `spectrum_id` is set

* change Spectrum.spectrum_id from type int to str

* update spec_dict

* Update tests

* update `__eq__` in MolecularFamily

* change `MolecularFamily.family_id` from type int to str

* add method `has_strain` to MolecularFamily

* Update metcalf_scoring.py

* change array to dataframe in DataLinks

1. Change arrays to dataframes:
- self.M_gcf_strain -> self.gcf_strain_occurrence
- self.M_spec_strain -> self.spec_strain_occurrence
- self.M_fam_strain -> self.mf_strain_occurrence
2. update relevant methods to produce the new dataframes
3. update logic of method `common_strains` to use the new dataframes (see the occurrence sketch below)
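
A hedged sketch of such an occurrence dataframe (illustrative ids and values only):

```python
import pandas as pd

# rows: GCF ids; columns: strain ids; 1 = the GCF occurs in that strain
gcf_strain_occurrence = pd.DataFrame(
    [[1, 0, 1],
     [0, 1, 0]],
    index=["gcf_1", "gcf_2"],
    columns=["strain_a", "strain_b", "strain_c"],
    dtype=int,
)

# label-based lookups replace the old positional array indexing
row = gcf_strain_occurrence.loc["gcf_1"]
print(row[row > 0].index.tolist())  # ['strain_a', 'strain_c']
```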

* update references of the new dataframes from DataLinks

* update logic of `get_links` in NPLinker class

* Update test_nplinker.py

- add code to remove cached results

* move SCORING_METHODS to LinkFinder

* update method name to `get_common_strains`

* refactor mapping dataframes in DataLinks

* add TODOs and deprecation to LinkFinder

* refactor cooccurrence in DataLinks

* merge `load_data` and `find_correlations` into `__init__` in DataLinks

* refactor DataLinks attributes

- Move assignment of attributes to `__init__`
- Rename attributes
- Replace `fam` or `molfam` with `mf` to refer to molecular family
- Add docstrings

* Delete test_data_links.py

* update get_common_strains methods

- update parameters to be clearer and more specific
- change strain ids in the returned dict to strain objects
- update docstrings

* remove lookup_index method from StrainCollection (#90)

- remove method `lookup_index`
- remove attribute `_strain_dict_index`

* Remove integer id from GCF

* update lookup methods and attributes in NPLinker class

* change cooccurrence from array to DataFrame in DataLinks

* format link_finder.py

* temp replace array with dataframe in LinkFinder for metcalf scoring

* refactor `LinkFinder.get_scores` method

* refactor `LinkFinder.metcalf_scoring` method

- rename parameter name
- wrap parameters for weights to one parameter
- extract private method `_cal_mean_std` (see the sketch below)
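
A rough illustration of what such a helper enables — z-scoring raw link scores (a sketch only; the real method's inputs and outputs may differ):

```python
import numpy as np

def _calc_mean_std(scores: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # column-wise mean and standard deviation of a raw score matrix
    return scores.mean(axis=0), scores.std(axis=0)

raw = np.array([[2.0, 4.0],
                [6.0, 8.0]])
mean, std = _calc_mean_std(raw)
print((raw - mean) / std)  # standardised scores: [[-1. -1.] [ 1.  1.]]
```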

* refactor get_links

* remove unused methods and scorings from LinkFinder

- remove unused `likescore` and `hg` scoring types
- remove all unused methods

* refactor returned type of `LinkFinder.get_links` method

* add `lookup_mf` method in NPLinker class

* refactor MetcalfScoring class

* add deprecation to LinkLikelihood class

* add `__init__.py` to linking module

* rename `data_linking.py` to `data_links.py`

* rename `data_linking_functions.py` to `utils.py`

* rename `test_data_linking_functions.py` to `test_linking_utils.py`

* Delete test_scoring.py

* add dtype to DataLinks dataframes

* remove mapping dataframes and relevant method from DataLinks

Removed:
- self.mapping_spec
- self.mapping_gcf
- self.mapping_fam
- self.mapping_strain
- _get_mappings_from_occurrence() method

* Create test_data_links.py

* add `conftest.py` for scoring tests

* update LinkFinder's attribute and private method

- refactor method `_cal_mean_std`
- rename attribute `raw_score_fam_gcf` to `raw_score_mf_gcf`

* Create test_link_finder.py

* Update vscode plugin autodocstring template

- fix indentation bug in autodocstring
- remove `Examples:` section

* add scope for fixtures

* Create test_metcalf_scoring.py

* add docstrings and type hints to `MetcalfScoring` class

* add util func `isinstance_all`

* replace `_isinstance` with util func `isinstance_all`
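
A plausible one-liner for such a utility (a sketch; the real signature may differ):

```python
def isinstance_all(*objects, objtype: type) -> bool:
    """True only if every given object is an instance of `objtype`."""
    return all(isinstance(obj, objtype) for obj in objects)

print(isinstance_all("a", "b", objtype=str))  # True
print(isinstance_all("a", 1, objtype=str))    # False
```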

* update validation of args for `DataLinks`

* Update test_data_links.py

- add docstrings
- add more tests

* add type hints for returned values to unit tests

* update exception types for invalid input

* add docstrings and type hints to `LinkFinder` class

* add more unit tests for `LinkFinder`

* fix input type bug for `DataLinks.get_common_strains`

* Create test_nplinker_scoring.py

* add todo comments to `NPLinker` class

* remove local integration tests for scoring part of `NPLinker`

- rename `test_nplinker.py` to `test_nplinker_local.py`

* remove unused imports

* Fix mypy warnings as much as possible

* check strain existence using strain dict

* change the abbreviation for "calculate" from "cal" to "calc"

* remove resolved TODO comment

* move shared fixtures to conftest.py

* remove unnecessary type hints

* update docstrings for cooccurrences

* use uuid for singleton molecular families #144
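
For example (a sketch — the "singleton-" prefix is an assumption, not necessarily the real format):

```python
from uuid import uuid4

# a spectrum that belongs to no real family gets its own singleton family;
# a UUID guarantees the id cannot collide with GNPS family ids
family_id = f"singleton-{uuid4()}"
print(family_id)
```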

* add TODO comment for GNPSLoader

* fix typos

* remove useless parameter `met_only`

The `met_only` parameter is useless: NPLinker stops working if `met_only=True`.

* update exception type

* refactor the usage of PODPDownloader

1. create instance in the private method, only when it's needed
2. change the scope of the instance from global to local
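
Roughly, the change looks like this (a sketch; the stand-in `PODPDownloader` API below is hypothetical):

```python
class PODPDownloader:  # stand-in for nplinker's downloader, hypothetical API
    def __init__(self, platform_id: str):
        self.platform_id = platform_id

    def download(self) -> None:
        print(f"downloading dataset {self.platform_id}")


class DatasetLoader:
    def __init__(self, platform_id: str):
        # before: a downloader was created here and kept for the object's lifetime
        self._platform_id = platform_id

    def _download_dataset(self) -> None:
        # after: created only when actually needed, scoped to this method
        PODPDownloader(self._platform_id).download()


DatasetLoader("platform-project-id")._download_dataset()
```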

* rename private config attributes in class DatasetLoader

- add prefix `_config` to all config attributes
- add comments to restructure `__init__` code

* change the app data dir variable to be global

This variable is independent of DatasetLoader and other classes, so it should be a global variable.

* change two public methods to variables

* change one public method to attribute for DatasetLoader

* add value validation to Config

- move the validation of antismash format config in DatasetLoader to Config class
- refactor the config data validations into a private method

* add TODO comments about init and validate paths

* remove unused attribute `growth_media`

* remove commented code

* add TODO comments

* remove unused imports

* format the code

* reorder methods in loader.py
CunliangGeng authored Jul 4, 2023
1 parent bd35f65 commit 067aeb9
Showing 8 changed files with 526 additions and 571 deletions.
17 changes: 10 additions & 7 deletions src/nplinker/annotations.py
@@ -33,6 +33,7 @@ def _headers_match_gnps(headers: list[str]) -> bool:
         return False
     return True
 
+
 @deprecated(version="1.3.3", reason="Use GNPSAnnotationLoader class instead.")
 def create_gnps_annotation(spec: Spectrum, gnps_anno: dict):
     """Function to insert png, json, svg and spectrum information in GNPS annotation data for given spectrum.
@@ -46,8 +47,8 @@ def create_gnps_annotation(spec: Spectrum, gnps_anno: dict):
     """
     # also insert useful URLs
     for t in ['png', 'json', 'svg', 'spectrum']:
-        gnps_anno[f'{t}_url'] = GNPS_URL_FORMAT.format(
-            t, gnps_anno['SpectrumID'])
+        gnps_anno[f'{t}_url'] = GNPS_URL_FORMAT.format(t,
+                                                       gnps_anno['SpectrumID'])
 
     if GNPS_KEY in spec.annotations:
         # TODO is this actually an error or can it happen normally?
@@ -60,7 +61,9 @@ def create_gnps_annotation(spec: Spectrum, gnps_anno: dict):
 
 
 @deprecated(version="1.3.3", reason="Use GNPSAnnotationLoader class instead.")
-def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra: list[Spectrum], spec_dict: dict[str, Spectrum]) -> list[Spectrum]:
+def load_annotations(root: str | os.PathLike, config: str | os.PathLike,
+                     spectra: list[Spectrum],
+                     spec_dict: dict[str, Spectrum]) -> list[Spectrum]:
     """Load the annotations from the GNPS annotation file present in root to the spectra.
 
     Args:
@@ -120,8 +123,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra
             data = {}
             for dc in data_cols:
                 if dc not in headers:
-                    logger.warning(
-                        f'Column lookup failed for "{dc}"')
+                    logger.warning(f'Column lookup failed for "{dc}"')
                     continue
                 data[dc] = line[headers.index(dc)]
 
@@ -160,8 +162,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra
             data = {}
             for dc in data_cols:
                 if dc not in headers:
-                    logger.warning(
-                        f'Column lookup failed for "{dc}"')
+                    logger.warning(f'Column lookup failed for "{dc}"')
                     continue
                 data[dc] = line[headers.index(dc)]
             spec_annotations[spec].append(data)
@@ -173,6 +174,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra
 
     return spectra
 
+
 def _find_annotation_files(root: str, config: str) -> list[str]:
     """Detect all annotation files in the root folder or specified in the config file.
@@ -190,6 +192,7 @@ def _find_annotation_files(root: str, config: str) -> list[str]:
             annotation_files.append(os.path.join(root, f))
     return annotation_files
 
+
 def _read_config(config):
     ac = {}
     if os.path.exists(config):
59 changes: 30 additions & 29 deletions src/nplinker/config.py
@@ -1,20 +1,6 @@
-# Copyright 2021 The NPLinker Authors
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 import argparse
-import os
 from collections.abc import Mapping
+import os
 from shutil import copyfile
 import toml
 from xdg import XDG_CONFIG_HOME
@@ -143,24 +129,39 @@ def update(d, u):
             return d
 
         config = update(config, config_dict)
+        self._validate(config)
+        self.config = config
 
+    def _validate(self, config: dict) -> None:
+        """Validates the configuration dictionary to ensure that all required
+        fields are present and have valid values.
+
+        Args:
+            config (dict): The configuration dictionary to validate.
+
+        Raises:
+            ValueError: If the configuration dictionary is missing required
+                fields or contains invalid values.
+        """
         if 'dataset' not in config:
-            raise Exception('No dataset defined in configuration!')
+            raise ValueError('Not found config for "dataset".')
 
-        root = config['dataset']['root']
-        config['dataset']['platform_id'] = ''  # placeholder
+        root = config['dataset'].get('root')
+        if root is None:
+            raise ValueError('Not found config for "root".')
 
         # check if the root has the special 'platform:' prefix and extract the
         # ID if so. otherwise treat as a path
         if root.startswith('platform:'):
             config['dataset']['platform_id'] = root.replace('platform:', '')
-            logger.info('Selected platform project ID {}'.format(
-                config['dataset']['platform_id']))
+            logger.info('Loading from platform project ID %s',
+                        config['dataset']['platform_id'])
         else:
-            if root is None or not os.path.exists(root):
-                raise Exception(
-                    'Dataset path "{}" not found or not accessible'.format(
-                        root))
-            logger.info(f'Loading from local data in directory {root}')
-
-        self.config = config
+            config['dataset']['platform_id'] = ''
+            logger.info('Loading from local data in directory %s', root)
+
+        antismash = config['dataset'].get('antismash')
+        allowed_antismash_formats = ["default", "flat"]
+        if antismash is not None:
+            if 'format' in antismash and antismash[
+                    'format'] not in allowed_antismash_formats:
+                raise ValueError(
+                    f'Unknown antismash format: {antismash["format"]}')
2 changes: 1 addition & 1 deletion src/nplinker/genomics/mibig/mibig_loader.py
@@ -31,7 +31,7 @@ def get_bgc_genome_mapping(self) -> dict[str, str]:
 
         Note that for MIBiG BGC, same value is used for BGC id and genome id.
         Users don't have to provide genome id for MIBiG BGCs in the
-        `strain_mapping.csv` file.
+        `strain_mappings.csv` file.
 
         Returns:
             dict[str, str]: key is BGC id/accession, value is