Use UUID for singleton molecular family? #144

CunliangGeng · 2023-06-20T11:38:06Z

This issue was created from a comment on #142:

The ID -1 represents an unknown or singleton family, so all such families share the same ID. That conflicts with the idea of having a unique string ID for every object.
Given that the ID in the GNPS output is -1, do we want to assign a UUID as the ID so that the singletons can be distinguished?

- Check the impact of using UUIDs, e.g. how to verify that a family is a singleton.
- If feasible, use UUIDs for singleton families.
- Does it make sense to use UUIDs for all types of MFs? Not for now.
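
A minimal sketch of the idea above, not the code from the linked commit. It assumes the GNPS output marks singletons with the family ID -1, as described in this issue. The function names `assign_family_ids` and `is_singleton` and the `singleton-` prefix are hypothetical and only illustrate one way to keep singletons identifiable after their IDs are replaced.

```python
import uuid

# Assumption from this issue: GNPS gives every singleton the family ID -1.
GNPS_SINGLETON_ID = "-1"
# Hypothetical prefix so singletons can still be recognised after renaming.
SINGLETON_PREFIX = "singleton-"


def assign_family_ids(spectrum_to_family):
    """Return a spectrum ID -> family ID mapping with unique string IDs.

    Families with the GNPS singleton ID get a fresh UUID each;
    all other family IDs are kept and converted to str.
    """
    result = {}
    for spectrum_id, family_id in spectrum_to_family.items():
        if str(family_id) == GNPS_SINGLETON_ID:
            result[spectrum_id] = f"{SINGLETON_PREFIX}{uuid.uuid4()}"
        else:
            result[spectrum_id] = str(family_id)
    return result


def is_singleton(family_id):
    """Check whether a family ID was generated for a singleton."""
    return family_id.startswith(SINGLETON_PREFIX)


ids = assign_family_ids({"spec1": "-1", "spec2": "-1", "spec3": "7"})
print(ids["spec1"] != ids["spec2"])  # the two singletons no longer share an ID
print(is_singleton(ids["spec1"]), is_singleton(ids["spec3"]))
```

Encoding the singleton status in the ID is only one option; storing it as a separate attribute on the family object would avoid parsing IDs.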

hechth · 2023-07-01T19:44:48Z

I'd be in favour of UUIDs for singleton families - or maybe even all families?

* change BGC attributes type from list to tuple The following BGC attributes are updated: - product_prediction - mibig_bgc_class - smiles * use positional-only parameter in BGC and GCF Parameters before "/" are positional-only parameters, see https://docs.python.org/3/glossary.html#term-parameter. * update BGC's `__eq__` and `__hash__` * update GCF's `__eq__` and `__hash__` * Update gcf.py * update Spectrum's `__eq__` and `__hash__` * update MolecularFamily `__eq__` and `__hash__` * Update molecular_family.py * update Strain `__eq__` and `__hash__` * update StrainCollection `__eq__` * add TODO comments to ObjectLink * add parameter type check for `add_alias` * add `__contains__` to Strain class * update `lookup` method of StrainCollection * update `__contains__` in StrainCollection * remove from __eq__ * update `__eq__` logic for Strain * rename `_strain_dict_id` to `_strain_dict_name` in StrainCollection * add comments to `get_common_strains` * add comments and rename variables for DataLinks * add comment about `met_only` parameter * add todo comments to LinkFinder * add comments to GNPSSpectrumLoader to figure out how `spectrum_id` is set * change Spectrum.spectrum_id from type int to str * update spec_dict * Update tests * update `__eq__` in MolecularFamily * change `MolecularFamily.family_id` from type int to str * add method `has_strain` to MolecularFamily * Update metcalf_scoring.py * change array to dataframe in DataLinks 1. Change array to dataframe: - self.M_gcf_strain -> self.gcf_strain_occurrence - self. M_spec_strain -> self.spec_strain_occurrence - self. M_fam_strain -> mf_strain_occurrence 2. update relevant methods to get the new dataframes 3. 
update logics of method `common_strains` using the new dataframes * update references of the new dataframes from DataLinks * update logics of `get_links` in NPLinker class * Update test_nplinker.py - add code to remove cached results * move SCORING_METHODS to LinkFinder * update method name to `get_common_strains` * refactor mapping dataframes in DataLinks * add TODOs and deprecation to LinkFinder * refactor cooccurrence in DataLinks * merge `load_data` and `find_correlations` to init in DataLinks * refactor DataLinks attributes - Move assignment of attributes to `__init__` - Rename attributes - Replace `fam` or `molfam` with `mf` to refer to molecular family - Add docstrings * Delete test_data_links.py * update get_common_strains methods - update parameters to be more clear and specific - change strain id in returned dict to strain objects - update docstrings * remove lookup_index method from StrainCollection (#90) - remove method `lookup_index` - remove attribute `_strain_dict_index` * Remove integer id from GCF * update lookup methods and attributes in NPLikner class * change cooccurrence from array to DataFrame in DataLinks * format link_finder.py * temp replace array with dataframe in LinkFinder for metcalf scoring * refactor `LinkFinder.get_scores` method * refactor `LinkFinder.metcalf_scoring` method - rename parameter name - wrap parameters for weights to one parameter - extract private method `_cal_mean_std` * refactor get_links * remove unused methods and scorings from LinkFinder - remove unused `likescore` and `hg` scoring types - remove all unused methods * refactor returned type of `LinkFinder.get_links` method * add `lookup_mf` method in NPLinker class * refactor MetcalfScoring class * add deprecation to LinkLikelihood class * add `__init__.py` to linking module * rename `data_linking.py` to `data_links.py` * rename `data_linking_functions.py` to `utils.py` * rename `test_data_linking_functions.py` to `test_linking_utils.py`.py * Delete 
test_scoring.py * add dtype to DataLinks dataframes * remove mapping dataframes and relevant method from DataLinks Removed: - self.mapping_spec - self.mapping_gcf - self.mapping_fam -self.mapping_strain - _get_mappings_from_occurrence() method * Create test_data_links.py * add `conftest.py` for scoring tests * update LinkFinder's attribute and private method - refactor method `_cal_mean_std` - rename attribute `raw_score_fam_gcf` to `raw_score_mf_gcf` * Create test_link_finder.py * Update vscode plugin autodocstring template - fix indentation bug in autodocsting - remove `Examples:` section * add scope for fixtures * Create test_metcalf_scoring.py * add docstrings and type hints to `MetcalfScoring` class * add util func `isinstance_all` * replace `_isinstance` with util func `isinstance_all` * update validation of args for `DataLinks` * Update test_data_links.py - add docstrings - add more tests * add type hints for returned values to unit tests * update exception types for invalid input * add docstrings and type hints to `LinkFinder` class * add more unit tests for `LinkFinder` * fix input type bug for `DataLinks.get_common_strains` * Create test_nplinker_scoring.py * add todo comments to `NPLinker` class * remove local integration tests for scoring part of `NPLinker` - rename `test_nplinker.py` to `test_nplinker_local.py` * remove unused imports * Fix mypy warnings as much as possible * check strain existence using strain dict * change calculate abbrevation from "cal" to "calc" * remove resolved TODO comment * move shared fixtures to conftest.py * remove unnecessary type hints * update docstrings for cooccurrences * use uuid for singleton molecular families #144 * add TODO comment for GNPSLoader * update type hints for `*args` parameter

CunliangGeng · 2023-07-04T08:00:57Z

The current family id looks good enough to support the features of nplinker right now. I'd like to keep it as it is until we see requirements for uuid.

* change BGC attributes type from list to tuple The following BGC attributes are updated: - product_prediction - mibig_bgc_class - smiles * use positional-only parameter in BGC and GCF Parameters before "/" are positional-only parameters, see https://docs.python.org/3/glossary.html#term-parameter. * update BGC's `__eq__` and `__hash__` * update GCF's `__eq__` and `__hash__` * Update gcf.py * update Spectrum's `__eq__` and `__hash__` * update MolecularFamily `__eq__` and `__hash__` * Update molecular_family.py * update Strain `__eq__` and `__hash__` * update StrainCollection `__eq__` * add TODO comments to ObjectLink * add parameter type check for `add_alias` * add `__contains__` to Strain class * update `lookup` method of StrainCollection * update `__contains__` in StrainCollection * remove from __eq__ * update `__eq__` logic for Strain * rename `_strain_dict_id` to `_strain_dict_name` in StrainCollection * add comments to `get_common_strains` * add comments and rename variables for DataLinks * add comment about `met_only` parameter * add todo comments to LinkFinder * add comments to GNPSSpectrumLoader to figure out how `spectrum_id` is set * change Spectrum.spectrum_id from type int to str * update spec_dict * Update tests * update `__eq__` in MolecularFamily * change `MolecularFamily.family_id` from type int to str * add method `has_strain` to MolecularFamily * Update metcalf_scoring.py * change array to dataframe in DataLinks 1. Change array to dataframe: - self.M_gcf_strain -> self.gcf_strain_occurrence - self. M_spec_strain -> self.spec_strain_occurrence - self. M_fam_strain -> mf_strain_occurrence 2. update relevant methods to get the new dataframes 3. 
update logics of method `common_strains` using the new dataframes * update references of the new dataframes from DataLinks * update logics of `get_links` in NPLinker class * Update test_nplinker.py - add code to remove cached results * move SCORING_METHODS to LinkFinder * update method name to `get_common_strains` * refactor mapping dataframes in DataLinks * add TODOs and deprecation to LinkFinder * refactor cooccurrence in DataLinks * merge `load_data` and `find_correlations` to init in DataLinks * refactor DataLinks attributes - Move assignment of attributes to `__init__` - Rename attributes - Replace `fam` or `molfam` with `mf` to refer to molecular family - Add docstrings * Delete test_data_links.py * update get_common_strains methods - update parameters to be more clear and specific - change strain id in returned dict to strain objects - update docstrings * remove lookup_index method from StrainCollection (#90) - remove method `lookup_index` - remove attribute `_strain_dict_index` * Remove integer id from GCF * update lookup methods and attributes in NPLikner class * change cooccurrence from array to DataFrame in DataLinks * format link_finder.py * temp replace array with dataframe in LinkFinder for metcalf scoring * refactor `LinkFinder.get_scores` method * refactor `LinkFinder.metcalf_scoring` method - rename parameter name - wrap parameters for weights to one parameter - extract private method `_cal_mean_std` * refactor get_links * remove unused methods and scorings from LinkFinder - remove unused `likescore` and `hg` scoring types - remove all unused methods * refactor returned type of `LinkFinder.get_links` method * add `lookup_mf` method in NPLinker class * refactor MetcalfScoring class * add deprecation to LinkLikelihood class * add `__init__.py` to linking module * rename `data_linking.py` to `data_links.py` * rename `data_linking_functions.py` to `utils.py` * rename `test_data_linking_functions.py` to `test_linking_utils.py`.py * Delete 
test_scoring.py * add dtype to DataLinks dataframes * remove mapping dataframes and relevant method from DataLinks Removed: - self.mapping_spec - self.mapping_gcf - self.mapping_fam -self.mapping_strain - _get_mappings_from_occurrence() method * Create test_data_links.py * add `conftest.py` for scoring tests * update LinkFinder's attribute and private method - refactor method `_cal_mean_std` - rename attribute `raw_score_fam_gcf` to `raw_score_mf_gcf` * Create test_link_finder.py * Update vscode plugin autodocstring template - fix indentation bug in autodocsting - remove `Examples:` section * add scope for fixtures * Create test_metcalf_scoring.py * add docstrings and type hints to `MetcalfScoring` class * add util func `isinstance_all` * replace `_isinstance` with util func `isinstance_all` * update validation of args for `DataLinks` * Update test_data_links.py - add docstrings - add more tests * add type hints for returned values to unit tests * update exception types for invalid input * add docstrings and type hints to `LinkFinder` class * add more unit tests for `LinkFinder` * fix input type bug for `DataLinks.get_common_strains` * Create test_nplinker_scoring.py * add todo comments to `NPLinker` class * remove local integration tests for scoring part of `NPLinker` - rename `test_nplinker.py` to `test_nplinker_local.py` * remove unused imports * Fix mypy warnings as much as possible * check strain existence using strain dict * change calculate abbrevation from "cal" to "calc" * remove resolved TODO comment * move shared fixtures to conftest.py * remove unnecessary type hints * update docstrings for cooccurrences * use uuid for singleton molecular families #144 * add TODO comment for GNPSLoader * fix typos * remove useless parameter `met_only` The `met_only` is useless. NPlinker will stop working if met_only=True. * update exception type * refactor the usage of PODPDownloader 1. create instance in the private method, only when it's needed 2. 
change the scope of the instance from global to local * rename private config attributes in class DatasetLoader - add prefix `_config` for all config attributes - add comments to restructure `__init__` code * change the variable of app data dir to be global this variable is independent of DatasetLoader and other classes, so it should be a global variable * change two public methods to variables * change one public method to attribute for DatasetLoader * add value validation to Config - move the validation of antismash format config in DatasetLoader to Config class - refactor the config data validations into a private method * add TODO comments about init and validate paths * remove unused attribute `growth_media` * remove commented code * add TODO comments * remove unused imports * format the code * reorder methods in loader.py * add function `generate_genome_bgc_mappings_file` * Update __init__.py * add tests for `generate_genome_bgc_mappings_file` * update strain test when 113 issue is closed

CunliangGeng mentioned this issue Jun 20, 2023

Use unique string IDs #142

Merged

CunliangGeng self-assigned this Jun 20, 2023

CunliangGeng added this to the refactor codebase milestone Jun 20, 2023

CunliangGeng added a commit that referenced this issue Jun 21, 2023

use uuid for singleton molecular families #144

09c049f

CunliangGeng linked a pull request Jun 21, 2023 that will close this issue

Use unique string IDs #142

Merged

CunliangGeng closed this as completed Jul 5, 2023
