Use UUID for singleton molecular family? #144

Closed
3 tasks done
CunliangGeng opened this issue Jun 20, 2023 · 2 comments · Fixed by #142
@CunliangGeng
Member

CunliangGeng commented Jun 20, 2023

This issue was created from #142 (comment).

The ID -1 stands for an unknown or singleton family, so all of these families share the same ID, which conflicts with the idea of giving every object a unique string ID.
Given that the ID in the GNPS output is -1, do we want to assign a UUID as the ID so that the singletons can be distinguished?

  • Check the impact of using UUIDs, e.g. how to verify the singletons.

  • If feasible, use UUIDs for singletons.

  • Does it make sense to use UUIDs for all types of MFs? Not for now.
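
To make the proposal concrete, here is a minimal sketch of how singleton families could each receive a UUID while regular families keep their GNPS ID. It is not nplinker's actual code: the `MolecularFamily` class below is a simplified, hypothetical stand-in, and `build_families` and its `(spectrum_id, gnps_family_id)` input rows are assumptions made for illustration. It also shows one possible answer to the "how to verify the singletons" task: check the number of spectra instead of relying on the ID.

```python
import uuid
from dataclasses import dataclass, field

# ID shared by all singleton families in the GNPS output, per this issue.
GNPS_SINGLETON_ID = "-1"


@dataclass
class MolecularFamily:
    """Simplified, hypothetical stand-in for nplinker's MolecularFamily."""

    family_id: str
    spectra_ids: set = field(default_factory=set)

    def is_singleton(self) -> bool:
        # A singleton is recognised by its content, not by its ID.
        return len(self.spectra_ids) == 1


def build_families(rows):
    """Group (spectrum_id, gnps_family_id) pairs into families.

    Every row carrying the singleton ID gets its own family with a
    freshly generated UUID; all other rows are grouped by GNPS ID.
    """
    families = {}
    for spectrum_id, gnps_family_id in rows:
        if gnps_family_id == GNPS_SINGLETON_ID:
            family_id = str(uuid.uuid4())
        else:
            family_id = gnps_family_id
        mf = families.setdefault(family_id, MolecularFamily(family_id))
        mf.spectra_ids.add(spectrum_id)
    return list(families.values())
```

With this approach the original -1 marker is no longer stored, so any code that needs to find singletons would have to call something like `is_singleton()` rather than compare IDs.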

CunliangGeng self-assigned this Jun 20, 2023
CunliangGeng added this to the refactor codebase milestone Jun 20, 2023
CunliangGeng linked a pull request Jun 21, 2023 that will close this issue
@hechth
Collaborator

hechth commented Jul 1, 2023

I'd be in favour of UUIDs for singleton families - or maybe even all families?

CunliangGeng added a commit that referenced this issue Jul 3, 2023
* change BGC attributes type from list to tuple

The following BGC attributes are updated:
- product_prediction
- mibig_bgc_class
- smiles

* use positional-only parameter in BGC and GCF

Parameters before "/" are positional-only parameters, see https://docs.python.org/3/glossary.html#term-parameter.

* update BGC's `__eq__` and `__hash__`

* update GCF's `__eq__` and `__hash__`

* Update gcf.py

* update Spectrum's `__eq__` and `__hash__`

* update MolecularFamily `__eq__` and `__hash__`

* Update molecular_family.py

* update Strain `__eq__` and `__hash__`

* update StrainCollection `__eq__`

* add TODO comments to ObjectLink

* add parameter type check for `add_alias`

* add `__contains__` to Strain class

* update `lookup` method of StrainCollection

* update `__contains__` in StrainCollection

* remove from __eq__

* update `__eq__` logic for Strain

* rename `_strain_dict_id` to `_strain_dict_name` in StrainCollection

* add comments to `get_common_strains`

* add comments and rename variables for DataLinks

* add comment about `met_only` parameter

* add todo comments to LinkFinder

* add comments to GNPSSpectrumLoader

to figure out how `spectrum_id` is set

* change Spectrum.spectrum_id from type int to str

* update spec_dict

* Update tests

* update `__eq__` in MolecularFamily

* change `MolecularFamily.family_id` from type int to str

* add method `has_strain` to MolecularFamily

* Update metcalf_scoring.py

* change array to dataframe in DataLinks

1. Change array to dataframe:
- self.M_gcf_strain -> self.gcf_strain_occurrence
- self.M_spec_strain -> self.spec_strain_occurrence
- self.M_fam_strain -> self.mf_strain_occurrence
2. update relevant methods to get the new dataframes
3. update logic of method `common_strains` using the new dataframes

* update references of the new dataframes from DataLinks

* update logics of `get_links` in NPLinker class

* Update test_nplinker.py

- add code to remove cached results

* move SCORING_METHODS to LinkFinder

* update method name to `get_common_strains`

* refactor mapping dataframes in DataLinks

* add TODOs and deprecation to LinkFinder

* refactor cooccurrence in DataLinks

* merge `load_data` and `find_correlations` to init in DataLinks

* refactor DataLinks attributes

- Move assignment of attributes to `__init__`
- Rename attributes
- Replace `fam` or `molfam` with `mf` to refer to molecular family
- Add docstrings

* Delete test_data_links.py

* update get_common_strains methods

- update parameters to be more clear and specific
- change strain id in returned dict to strain objects
-  update docstrings

* remove lookup_index method from StrainCollection (#90)

- remove method `lookup_index`
- remove attribute `_strain_dict_index`

* Remove integer id from GCF

* update lookup methods and attributes in NPLinker class

* change cooccurrence from array to DataFrame in DataLinks

* format link_finder.py

* temp replace array with dataframe in LinkFinder for metcalf scoring

* refactor `LinkFinder.get_scores` method

* refactor `LinkFinder.metcalf_scoring` method

- rename parameter name
- wrap parameters for weights to one parameter
- extract private method `_cal_mean_std`

* refactor get_links

* remove unused methods and scorings from LinkFinder

-  remove unused `likescore` and `hg` scoring types
- remove all unused methods

* refactor returned type of `LinkFinder.get_links` method

* add `lookup_mf` method in NPLinker class

* refactor MetcalfScoring class

* add deprecation to LinkLikelihood class

* add `__init__.py` to linking module

* rename `data_linking.py` to `data_links.py`

* rename `data_linking_functions.py` to `utils.py`

* rename `test_data_linking_functions.py` to `test_linking_utils.py`

* Delete test_scoring.py

* add dtype to DataLinks dataframes

* remove mapping dataframes and relevant method from DataLinks

Removed:
- self.mapping_spec
- self.mapping_gcf
- self.mapping_fam
- self.mapping_strain
- _get_mappings_from_occurrence() method

* Create test_data_links.py

* add `conftest.py` for scoring tests

* update LinkFinder's attribute and private method

- refactor method `_cal_mean_std`
- rename attribute `raw_score_fam_gcf` to `raw_score_mf_gcf`

* Create test_link_finder.py

* Update vscode plugin autodocstring template

- fix indentation bug in autodocstring
- remove `Examples:` section

* add scope for fixtures

* Create test_metcalf_scoring.py

* add docstrings and type hints to `MetcalfScoring` class

* add util func `isinstance_all`

* replace `_isinstance` with util func `isinstance_all`

* update validation of args for `DataLinks`

* Update test_data_links.py

- add docstrings
- add more tests

* add type hints for returned values to unit tests

* update exception types for invalid input

* add docstrings and type hints to `LinkFinder` class

* add more unit tests for `LinkFinder`

* fix input type bug for `DataLinks.get_common_strains`

* Create test_nplinker_scoring.py

* add todo comments to `NPLinker` class

* remove local integration tests for scoring part of `NPLinker`

- rename `test_nplinker.py` to `test_nplinker_local.py`

* remove unused imports

* Fix mypy warnings as much as possible

* check strain existence using strain dict

* change calculate abbreviation from "cal" to "calc"

* remove resolved TODO comment

* move shared fixtures to conftest.py

* remove unnecessary type hints

* update docstrings for cooccurrences

* use uuid for singleton molecular families #144

* add TODO comment for GNPSLoader

* update type hints for `*args` parameter
@CunliangGeng
Member Author

The current family ID looks good enough to support nplinker's features right now. I'd like to keep it as it is until we see a requirement for UUIDs.

CunliangGeng added a commit that referenced this issue Jul 4, 2023
* change BGC attributes type from list to tuple

The following BGC attributes are updated:
- product_prediction
- mibig_bgc_class
- smiles

* use positional-only parameter in BGC and GCF

Parameters before "/" are positional-only parameters, see https://docs.python.org/3/glossary.html#term-parameter.

* update BGC's `__eq__` and `__hash__`

* update GCF's `__eq__` and `__hash__`

* Update gcf.py

* update Spectrum's `__eq__` and `__hash__`

* update MolecularFamily `__eq__` and `__hash__`

* Update molecular_family.py

* update Strain `__eq__` and `__hash__`

* update StrainCollection `__eq__`

* add TODO comments to ObjectLink

* add parameter type check for `add_alias`

* add `__contains__` to Strain class

* update `lookup` method of StrainCollection

* update `__contains__` in StrainCollection

* remove from __eq__

* update `__eq__` logic for Strain

* rename `_strain_dict_id` to `_strain_dict_name` in StrainCollection

* add comments to `get_common_strains`

* add comments and rename variables for DataLinks

* add comment about `met_only` parameter

* add todo comments to LinkFinder

* add comments to GNPSSpectrumLoader

to figure out how `spectrum_id` is set

* change Spectrum.spectrum_id from type int to str

* update spec_dict

* Update tests

* update `__eq__` in MolecularFamily

* change `MolecularFamily.family_id` from type int to str

* add method `has_strain` to MolecularFamily

* Update metcalf_scoring.py

* change array to dataframe in DataLinks

1. Change array to dataframe:
- self.M_gcf_strain -> self.gcf_strain_occurrence
- self.M_spec_strain -> self.spec_strain_occurrence
- self.M_fam_strain -> self.mf_strain_occurrence
2. update relevant methods to get the new dataframes
3. update logic of method `common_strains` using the new dataframes

* update references of the new dataframes from DataLinks

* update logics of `get_links` in NPLinker class

* Update test_nplinker.py

- add code to remove cached results

* move SCORING_METHODS to LinkFinder

* update method name to `get_common_strains`

* refactor mapping dataframes in DataLinks

* add TODOs and deprecation to LinkFinder

* refactor cooccurrence in DataLinks

* merge `load_data` and `find_correlations` to init in DataLinks

* refactor DataLinks attributes

- Move assignment of attributes to `__init__`
- Rename attributes
- Replace `fam` or `molfam` with `mf` to refer to molecular family
- Add docstrings

* Delete test_data_links.py

* update get_common_strains methods

- update parameters to be more clear and specific
- change strain id in returned dict to strain objects
-  update docstrings

* remove lookup_index method from StrainCollection (#90)

- remove method `lookup_index`
- remove attribute `_strain_dict_index`

* Remove integer id from GCF

* update lookup methods and attributes in NPLinker class

* change cooccurrence from array to DataFrame in DataLinks

* format link_finder.py

* temp replace array with dataframe in LinkFinder for metcalf scoring

* refactor `LinkFinder.get_scores` method

* refactor `LinkFinder.metcalf_scoring` method

- rename parameter name
- wrap parameters for weights to one parameter
- extract private method `_cal_mean_std`

* refactor get_links

* remove unused methods and scorings from LinkFinder

-  remove unused `likescore` and `hg` scoring types
- remove all unused methods

* refactor returned type of `LinkFinder.get_links` method

* add `lookup_mf` method in NPLinker class

* refactor MetcalfScoring class

* add deprecation to LinkLikelihood class

* add `__init__.py` to linking module

* rename `data_linking.py` to `data_links.py`

* rename `data_linking_functions.py` to `utils.py`

* rename `test_data_linking_functions.py` to `test_linking_utils.py`

* Delete test_scoring.py

* add dtype to DataLinks dataframes

* remove mapping dataframes and relevant method from DataLinks

Removed:
- self.mapping_spec
- self.mapping_gcf
- self.mapping_fam
- self.mapping_strain
- _get_mappings_from_occurrence() method

* Create test_data_links.py

* add `conftest.py` for scoring tests

* update LinkFinder's attribute and private method

- refactor method `_cal_mean_std`
- rename attribute `raw_score_fam_gcf` to `raw_score_mf_gcf`

* Create test_link_finder.py

* Update vscode plugin autodocstring template

- fix indentation bug in autodocstring
- remove `Examples:` section

* add scope for fixtures

* Create test_metcalf_scoring.py

* add docstrings and type hints to `MetcalfScoring` class

* add util func `isinstance_all`

* replace `_isinstance` with util func `isinstance_all`

* update validation of args for `DataLinks`

* Update test_data_links.py

- add docstrings
- add more tests

* add type hints for returned values to unit tests

* update exception types for invalid input

* add docstrings and type hints to `LinkFinder` class

* add more unit tests for `LinkFinder`

* fix input type bug for `DataLinks.get_common_strains`

* Create test_nplinker_scoring.py

* add todo comments to `NPLinker` class

* remove local integration tests for scoring part of `NPLinker`

- rename `test_nplinker.py` to `test_nplinker_local.py`

* remove unused imports

* Fix mypy warnings as much as possible

* check strain existence using strain dict

* change calculate abbreviation from "cal" to "calc"

* remove resolved TODO comment

* move shared fixtures to conftest.py

* remove unnecessary type hints

* update docstrings for cooccurrences

* use uuid for singleton molecular families #144

* add TODO comment for GNPSLoader

* fix typos

* remove useless parameter `met_only`

The `met_only` parameter is useless: NPLinker stops working if `met_only=True`.

* update exception type

* refactor the usage of PODPDownloader

1. create instance in the private method, only when it's needed
2. change the scope of the instance from global to local

* rename private config attributes in class DatasetLoader

-  add prefix `_config` for all config attributes
- add comments to restructure `__init__` code

* change the variable of app data dir to be global

this variable is independent of DatasetLoader and other classes, so it should be a global variable

* change two public methods to variables

* change one public method to attribute for DatasetLoader

* add value validation to Config

- move the validation of antismash format config in DatasetLoader to Config class
- refactor the config data validations into a private method

* add TODO comments about init and validate paths

* remove unused attribute `growth_media`

* remove commented code

* add TODO comments

* remove unused imports

* format the code

* reorder methods in loader.py
CunliangGeng added a commit that referenced this issue Jul 4, 2023
* change BGC attributes type from list to tuple

The following BGC attributes are updated:
- product_prediction
- mibig_bgc_class
- smiles

* use positional-only parameter in BGC and GCF

Parameters before "/" are positional-only parameters, see https://docs.python.org/3/glossary.html#term-parameter.

* update BGC's `__eq__` and `__hash__`

* update GCF's `__eq__` and `__hash__`

* Update gcf.py

* update Spectrum's `__eq__` and `__hash__`

* update MolecularFamily `__eq__` and `__hash__`

* Update molecular_family.py

* update Strain `__eq__` and `__hash__`

* update StrainCollection `__eq__`

* add TODO comments to ObjectLink

* add parameter type check for `add_alias`

* add `__contains__` to Strain class

* update `lookup` method of StrainCollection

* update `__contains__` in StrainCollection

* remove from __eq__

* update `__eq__` logic for Strain

* rename `_strain_dict_id` to `_strain_dict_name` in StrainCollection

* add comments to `get_common_strains`

* add comments and rename variables for DataLinks

* add comment about `met_only` parameter

* add todo comments to LinkFinder

* add comments to GNPSSpectrumLoader

to figure out how `spectrum_id` is set

* change Spectrum.spectrum_id from type int to str

* update spec_dict

* Update tests

* update `__eq__` in MolecularFamily

* change `MolecularFamily.family_id` from type int to str

* add method `has_strain` to MolecularFamily

* Update metcalf_scoring.py

* change array to dataframe in DataLinks

1. Change array to dataframe:
- self.M_gcf_strain -> self.gcf_strain_occurrence
- self.M_spec_strain -> self.spec_strain_occurrence
- self.M_fam_strain -> self.mf_strain_occurrence
2. update relevant methods to get the new dataframes
3. update logic of method `common_strains` using the new dataframes

* update references of the new dataframes from DataLinks

* update logics of `get_links` in NPLinker class

* Update test_nplinker.py

- add code to remove cached results

* move SCORING_METHODS to LinkFinder

* update method name to `get_common_strains`

* refactor mapping dataframes in DataLinks

* add TODOs and deprecation to LinkFinder

* refactor cooccurrence in DataLinks

* merge `load_data` and `find_correlations` to init in DataLinks

* refactor DataLinks attributes

- Move assignment of attributes to `__init__`
- Rename attributes
- Replace `fam` or `molfam` with `mf` to refer to molecular family
- Add docstrings

* Delete test_data_links.py

* update get_common_strains methods

- update parameters to be more clear and specific
- change strain id in returned dict to strain objects
-  update docstrings

* remove lookup_index method from StrainCollection (#90)

- remove method `lookup_index`
- remove attribute `_strain_dict_index`

* Remove integer id from GCF

* update lookup methods and attributes in NPLinker class

* change cooccurrence from array to DataFrame in DataLinks

* format link_finder.py

* temp replace array with dataframe in LinkFinder for metcalf scoring

* refactor `LinkFinder.get_scores` method

* refactor `LinkFinder.metcalf_scoring` method

- rename parameter name
- wrap parameters for weights to one parameter
- extract private method `_cal_mean_std`

* refactor get_links

* remove unused methods and scorings from LinkFinder

-  remove unused `likescore` and `hg` scoring types
- remove all unused methods

* refactor returned type of `LinkFinder.get_links` method

* add `lookup_mf` method in NPLinker class

* refactor MetcalfScoring class

* add deprecation to LinkLikelihood class

* add `__init__.py` to linking module

* rename `data_linking.py` to `data_links.py`

* rename `data_linking_functions.py` to `utils.py`

* rename `test_data_linking_functions.py` to `test_linking_utils.py`

* Delete test_scoring.py

* add dtype to DataLinks dataframes

* remove mapping dataframes and relevant method from DataLinks

Removed:
- self.mapping_spec
- self.mapping_gcf
- self.mapping_fam
- self.mapping_strain
- _get_mappings_from_occurrence() method

* Create test_data_links.py

* add `conftest.py` for scoring tests

* update LinkFinder's attribute and private method

- refactor method `_cal_mean_std`
- rename attribute `raw_score_fam_gcf` to `raw_score_mf_gcf`

* Create test_link_finder.py

* Update vscode plugin autodocstring template

- fix indentation bug in autodocstring
- remove `Examples:` section

* add scope for fixtures

* Create test_metcalf_scoring.py

* add docstrings and type hints to `MetcalfScoring` class

* add util func `isinstance_all`

* replace `_isinstance` with util func `isinstance_all`

* update validation of args for `DataLinks`

* Update test_data_links.py

- add docstrings
- add more tests

* add type hints for returned values to unit tests

* update exception types for invalid input

* add docstrings and type hints to `LinkFinder` class

* add more unit tests for `LinkFinder`

* fix input type bug for `DataLinks.get_common_strains`

* Create test_nplinker_scoring.py

* add todo comments to `NPLinker` class

* remove local integration tests for scoring part of `NPLinker`

- rename `test_nplinker.py` to `test_nplinker_local.py`

* remove unused imports

* Fix mypy warnings as much as possible

* check strain existence using strain dict

* change calculate abbreviation from "cal" to "calc"

* remove resolved TODO comment

* move shared fixtures to conftest.py

* remove unnecessary type hints

* update docstrings for cooccurrences

* use uuid for singleton molecular families #144

* add TODO comment for GNPSLoader

* fix typos

* remove useless parameter `met_only`

The `met_only` parameter is useless: NPLinker stops working if `met_only=True`.

* update exception type

* refactor the usage of PODPDownloader

1. create instance in the private method, only when it's needed
2. change the scope of the instance from global to local

* rename private config attributes in class DatasetLoader

-  add prefix `_config` for all config attributes
- add comments to restructure `__init__` code

* change the variable of app data dir to be global

this variable is independent of DatasetLoader and other classes, so it should be a global variable

* change two public methods to variables

* change one public method to attribute for DatasetLoader

* add value validation to Config

- move the validation of antismash format config in DatasetLoader to Config class
- refactor the config data validations into a private method

* add TODO comments about init and validate paths

* remove unused attribute `growth_media`

* remove commented code

* add TODO comments

* remove unused imports

* format the code

* reorder methods in loader.py

* add function `generate_genome_bgc_mappings_file`

* Update __init__.py

* add tests for `generate_genome_bgc_mappings_file`

* update strain test when 113 issue is closed