Use unique string IDs #142

Merged
merged 96 commits into from
Jul 3, 2023
Changes from 95 commits
742d4b5
change BGC attributes type from list to tuple
CunliangGeng Apr 5, 2023
f1cac59
use positional-only parameter in BGC and GCF
CunliangGeng Apr 5, 2023
18e77a6
update BGC's `__eq__` and `__hash__`
CunliangGeng Apr 5, 2023
da2cf41
update GCF's `__eq__` and `__hash__`
CunliangGeng Apr 5, 2023
2111e5b
Update gcf.py
CunliangGeng Apr 11, 2023
0736476
update Spectrum's `__eq__` and `__hash__`
CunliangGeng Apr 11, 2023
fda1208
update MolecularFamily `__eq__` and `__hash__`
CunliangGeng Apr 11, 2023
0a7ccee
Update molecular_family.py
CunliangGeng Apr 11, 2023
6a373e1
update Strain `__eq__` and `__hash__`
CunliangGeng Apr 11, 2023
33ed3c6
update StrainCollection `__eq__`
CunliangGeng Apr 11, 2023
f6bae56
add TODO comments to ObjectLink
CunliangGeng Apr 11, 2023
f81accf
add parameter type check for `add_alias`
CunliangGeng Apr 11, 2023
53dad82
add `__contains__` to Strain class
CunliangGeng Apr 11, 2023
45d8765
update `lookup` method of StrainCollection
CunliangGeng Apr 11, 2023
7cb75e0
update `__contains__` in StrainCollection
CunliangGeng Apr 11, 2023
f797ee5
remove from __eq__
CunliangGeng Apr 11, 2023
8a3fa1b
update `__eq__` logic for Strain
CunliangGeng Apr 13, 2023
8875400
rename `_strain_dict_id` to `_strain_dict_name` in StrainCollection
CunliangGeng Apr 13, 2023
2f77539
add comments to `get_common_strains`
CunliangGeng Apr 13, 2023
dae421d
add comments and rename variables for DataLinks
CunliangGeng Apr 13, 2023
504862e
add comment about `met_only` parameter
CunliangGeng Apr 14, 2023
8b160e6
add todo comments to LinkFinder
CunliangGeng Apr 18, 2023
8e9557c
add comments to GNPSSpectrumLoader
CunliangGeng Apr 18, 2023
9257e0b
change Spectrum.spectrum_id from type int to str
CunliangGeng Apr 18, 2023
43f6d80
update spec_dict
CunliangGeng Apr 19, 2023
7536e42
Update tests
CunliangGeng Apr 19, 2023
a0cf97a
update `__eq__` in MolecularFamily
CunliangGeng Apr 28, 2023
b53af96
change `MolecularFamily.family_id` from type int to str
CunliangGeng Apr 28, 2023
ca848fd
add method `has_strain` to MolecularFamily
CunliangGeng May 1, 2023
583fa33
Update metcalf_scoring.py
CunliangGeng May 1, 2023
6ebaf0d
change array to dataframe in DataLinks
CunliangGeng May 1, 2023
2cd8e01
update references of the new dataframes from DataLinks
CunliangGeng May 1, 2023
2834345
update logics of `get_links` in NPLinker class
CunliangGeng May 1, 2023
ee5ff03
Update test_nplinker.py
CunliangGeng May 1, 2023
ef94418
move SCORING_METHODS to LinkFinder
CunliangGeng May 1, 2023
1edabc5
update method name to `get_common_strains`
CunliangGeng May 1, 2023
d7ad2c0
refactor mapping dataframes in DataLinks
CunliangGeng May 2, 2023
4f8d811
add TODOs and deprecation to LinkFinder
CunliangGeng May 2, 2023
ab7268a
refactor cooccurrence in DataLinks
CunliangGeng May 2, 2023
995d6d2
merge `load_data` and `find_correlations` to init in DataLinks
CunliangGeng May 2, 2023
742e6e5
refactor DataLinks attributes
CunliangGeng May 3, 2023
66cda2c
Delete test_data_links.py
CunliangGeng May 3, 2023
7094d0d
update get_common_strains methods
CunliangGeng May 3, 2023
0b13746
remove lookup_index method from StrainCollection (#90)
CunliangGeng May 3, 2023
96d2210
Remove integer id from GCF
CunliangGeng May 3, 2023
44d2c67
update lookup methods and attributes in NPLikner class
CunliangGeng May 3, 2023
92aee38
change cooccurrence from array to DataFrame in DataLinks
CunliangGeng May 3, 2023
d239940
format link_finder.py
CunliangGeng May 3, 2023
8ed8ce9
temp replace array with dataframe in LinkFinder for metcalf scoring
CunliangGeng May 3, 2023
e4bdd95
refactor `LinkFinder.get_scores` method
CunliangGeng May 8, 2023
14a1aac
refactor `LinkFinder.metcalf_scoring` method
CunliangGeng May 8, 2023
d2d6a10
refactor get_links
CunliangGeng May 8, 2023
2078488
remove unused methods and scorings from LinkFinder
CunliangGeng May 9, 2023
5d9a916
refactor returned type of `LinkFinder.get_links` method
CunliangGeng May 9, 2023
47770c5
add `lookup_mf` method in NPLinker class
CunliangGeng May 9, 2023
25606e1
refactor MetcalfScoring class
CunliangGeng May 9, 2023
e22a8ce
add deprecation to LinkLikelihood class
CunliangGeng May 10, 2023
79ecb4d
add `__init__.py` to linking module
CunliangGeng May 10, 2023
ac94cd8
rename `data_linking.py` to `data_links.py`
CunliangGeng May 10, 2023
8c98e96
rename `data_linking_functions.py` to `utils.py`
CunliangGeng May 10, 2023
96dcb66
rename `test_data_linking_functions.py` to `test_linking_utils.py`.py
CunliangGeng May 10, 2023
18eb841
Delete test_scoring.py
CunliangGeng May 10, 2023
4f12f35
add dtype to DataLinks dataframes
CunliangGeng May 10, 2023
e3fdd86
remove mapping dataframes and relevant method from DataLinks
CunliangGeng May 10, 2023
7ef4da3
Create test_data_links.py
CunliangGeng May 10, 2023
fd01596
add `conftest.py` for scoring tests
CunliangGeng May 10, 2023
9f0b289
update LinkFinder's attribute and private method
CunliangGeng May 10, 2023
3e3bdfa
Create test_link_finder.py
CunliangGeng May 10, 2023
46880a8
Update vscode plugin autodocstring template
CunliangGeng May 11, 2023
31586dd
add scope for fixtures
CunliangGeng May 11, 2023
22ffd90
Create test_metcalf_scoring.py
CunliangGeng May 10, 2023
79082a6
add docstrings and type hints to `MetcalfScoring` class
CunliangGeng May 11, 2023
15a0bd1
add util func `isinstance_all`
CunliangGeng May 11, 2023
f0f570a
replace `_isinstance` with util func `isinstance_all`
CunliangGeng May 11, 2023
9f4bd2b
update validation of args for `DataLinks`
CunliangGeng May 11, 2023
1bd05a8
Update test_data_links.py
CunliangGeng May 11, 2023
a19f43f
add type hints for returned values to unit tests
CunliangGeng May 11, 2023
6fc0ef5
update exception types for invalid input
CunliangGeng May 12, 2023
91bcff7
add docstrings and type hints to `LinkFinder` class
CunliangGeng May 12, 2023
989471a
add more unit tests for `LinkFinder`
CunliangGeng May 12, 2023
04a5b27
fix input type bug for `DataLinks.get_common_strains`
CunliangGeng May 12, 2023
9c7dd49
Create test_nplinker_scoring.py
CunliangGeng May 12, 2023
60ccbdb
add todo comments to `NPLinker` class
CunliangGeng May 12, 2023
9a1af05
remove local integration tests for scoring part of `NPLinker`
CunliangGeng May 12, 2023
b0e44e1
remove unused imports
CunliangGeng May 12, 2023
a2efd69
Merge branch 'dev' into use_unique_string_id
CunliangGeng May 12, 2023
6ee04e9
Fix mypy warnings as much as possible
CunliangGeng May 12, 2023
88f48b3
check strain existence using strain dict
CunliangGeng May 15, 2023
27bcc2d
change calculate abbrevation from "cal" to "calc"
CunliangGeng Jun 20, 2023
ab5d025
remove resolved TODO comment
CunliangGeng Jun 20, 2023
e4a0b3e
move shared fixtures to conftest.py
CunliangGeng Jun 20, 2023
ade54a3
remove unnecessary type hints
CunliangGeng Jun 21, 2023
78aa794
update docstrings for cooccurrences
CunliangGeng Jun 21, 2023
09c049f
use uuid for singleton molecular families #144
CunliangGeng Jun 20, 2023
1ba7106
add TODO comment for GNPSLoader
CunliangGeng Jun 21, 2023
0f1352a
update type hints for `*args` parameter
CunliangGeng Jul 3, 2023
17 changes: 8 additions & 9 deletions src/nplinker/annotations.py
@@ -14,10 +14,9 @@

import csv
import os

from deprecated import deprecated

from nplinker.metabolomics.spectrum import Spectrum, GNPS_KEY
from nplinker.metabolomics.spectrum import GNPS_KEY
from nplinker.metabolomics.spectrum import Spectrum
from .logconfig import LogConfig


@@ -61,22 +60,22 @@ def create_gnps_annotation(spec: Spectrum, gnps_anno: dict):


@deprecated(version="1.3.3", reason="Use GNPSAnnotationLoader class instead.")
def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra: list[Spectrum], spec_dict: dict[int, Spectrum]) -> list[Spectrum]:
def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra: list[Spectrum], spec_dict: dict[str, Spectrum]) -> list[Spectrum]:
"""Load the annotations from the GNPS annotation file present in root to the spectra.

Args:
root(str | os.PathLike): Path to the downloaded and extracted GNPS results.
config(str | os.PathLike): Path to config file for custom file locations.
spectra(list[Spectrum]): List of spectra to annotate.
spec_dict(dict[int, Spectrum]): Dictionary mapping to spectra passed in `spectra` variable.
spec_dict(dict[str, Spectrum]): Dictionary mapping to spectra passed in `spectra` variable.

Raises:
Exception: Raises exception if custom annotation config file has invalid content.

Returns:
list[Spectrum]: List of annotated spectra.
"""

if not os.path.exists(root):
logger.debug(f'Annotation directory not found ({root})')
return spectra
@@ -89,7 +88,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra

logger.debug('Found {} annotations .tsv files in {}'.format(
len(annotation_files), root))

for af in annotation_files:
with open(af) as f:
rdr = csv.reader(f, delimiter='\t')
@@ -105,7 +104,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra
# each line should be a different spec ID here
for line in rdr:
# read the scan ID column and get the corresponding Spectrum object
scan_id = int(line[scans_index])
scan_id = line[scans_index]
if scan_id not in spec_dict:
logger.warning(
'Unknown spectrum ID found in GNPS annotation file (ID={})'
@@ -147,7 +146,7 @@ def load_annotations(root: str | os.PathLike, config: str | os.PathLike, spectra
# note that might have multiple lines for the same spec ID!
spec_annotations = {}
for line in rdr:
scan_id = int(line[spec_id_index])
scan_id = line[spec_id_index]
if scan_id not in spec_dict:
logger.warning(
'Unknown spectrum ID found in annotation file "{}", ID is "{}"'
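Since `spec_dict` is now keyed by strings, the scan ID read from the TSV file can be used for the lookup directly. A minimal sketch of that behavior, assuming a hypothetical two-row in-memory table and placeholder strings in place of real `Spectrum` objects:

```python
import csv
import io

# Hypothetical annotation table; real GNPS files contain many more columns.
tsv = io.StringIO("#Scan#\tCompound_Name\n100\tfoo\n205\tbar\n")

# Stand-in for dict[str, Spectrum]; the value is a placeholder, not a Spectrum.
spec_dict = {"100": "spectrum-100"}

rdr = csv.reader(tsv, delimiter="\t")
headers = next(rdr)
scans_index = headers.index("#Scan#")

known, unknown = [], []
for line in rdr:
    # csv yields str values, so the ID matches the str keys without int().
    scan_id = line[scans_index]
    (known if scan_id in spec_dict else unknown).append(scan_id)

print(known, unknown)
```

Note that an integer such as `100` would no longer match any key, so callers that still pass integer IDs need to convert them with `str()` first.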
11 changes: 6 additions & 5 deletions src/nplinker/class_info/chem_classes.py
@@ -12,9 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from collections import Counter
import glob
import os
from collections import Counter
from canopus import Canopus
from canopus.classifications_to_gnps import analyse_canopus
from ..logconfig import LogConfig
@@ -399,17 +399,17 @@ class prediction for a level. When no class is present, instead of Tuple it will
molfam_classes = {}

for molfam in molfams:
fid = str(molfam.family_id) # the key
fid = molfam.family_id # the key
spectra = molfam.spectra
# if singleton family, format like '-1_spectrum-id'
if fid == '-1':
# if singleton family, format like 'fid_spectrum-id'
if fid.startswith('singleton-'):
spec_id = spectra[0].spectrum_id
fid += f'_{spec_id}'
len_molfam = len(spectra)

classes_per_spectra = []
for spec in spectra:
spec_classes = self.spectra_classes.get(str(spec.spectrum_id))
spec_classes = self.spectra_classes.get(spec.spectrum_id)
if spec_classes: # account for spectra without prediction
classes_per_spectra.append(spec_classes)

@@ -555,6 +555,7 @@ def _read_cf_classes(self, mne_dir):
nr_nodes = line.pop(0)
# todo: make it easier to query classes of singleton families
# if singleton family, format like '-1_spectrum-id' like canopus results
# CG: Note that the singleton families id is "singleton-" + uuid.
if nr_nodes == '1':
component = f'-1_{cluster}'
class_info = []
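The key formatting for singleton families in the hunk above can be sketched in isolation. This is only an illustration: `molfam_key` is a hypothetical helper, and the `"singleton-" + uuid` ID format is taken from the CG comment rather than from the `SingletonFamily` implementation.

```python
from __future__ import annotations

import uuid


def molfam_key(family_id: str, spectrum_ids: list[str]) -> str:
    """Return the key under which a family's classes are stored (hypothetical helper)."""
    if family_id.startswith("singleton-"):
        # A singleton family holds one spectrum; append its ID to the key.
        return f"{family_id}_{spectrum_ids[0]}"
    return family_id


# Assumed ID format for singleton families, per the CG comment in the diff.
singleton_id = f"singleton-{uuid.uuid4()}"

print(molfam_key("42", ["7", "8"]))
print(molfam_key(singleton_id, ["7"]))
```

Keys built this way do not match the `-1_{cluster}` components that `_read_cf_classes` still produces, which is the mismatch the CG note points at.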
3 changes: 2 additions & 1 deletion src/nplinker/genomics/bgc.py
@@ -4,6 +4,7 @@
from nplinker.logconfig import LogConfig
from .aa_pred import predict_aa


if TYPE_CHECKING:
from ..strains import Strain
from .gcf import GCF
@@ -121,7 +122,7 @@ def strain(self, strain: Strain) -> None:
self._strain = strain

@property
def bigscape_classes(self) -> set[str]:
def bigscape_classes(self) -> set[str | None]:
"""Get BiG-SCAPE's BGC classes.

BiG-SCAPE's BGC classes are similar to those defined in MiBIG but have
1 change: 0 additions & 1 deletion src/nplinker/genomics/gcf.py
@@ -35,7 +35,6 @@ def __init__(self, gcf_id: str, /) -> None:
self.bigscape_class: str | None = None
# CG TODO: remove attribute id, see issue 103
# https://github.com/NPLinker/nplinker/issues/103
self.id: int | None = None
self.bgc_ids: set[str] = set()
self.strains: StrainCollection = StrainCollection()
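With the integer `id` attribute gone, a GCF is identified by its string `gcf_id` alone. The sketch below shows one way equality, hashing and lookup can rest on that string. `GCFSketch` is a hypothetical stand-in, and its `__eq__` and `__hash__` are assumptions rather than the code merged in this PR.

```python
class GCFSketch:
    """Hypothetical stand-in for GCF, identified only by its string ID."""

    def __init__(self, gcf_id: str, /) -> None:
        self.gcf_id = gcf_id

    def __eq__(self, other: object) -> bool:
        if isinstance(other, GCFSketch):
            return self.gcf_id == other.gcf_id
        return NotImplemented

    def __hash__(self) -> int:
        # Hash on the same field used for equality so sets and dicts stay consistent.
        return hash(self.gcf_id)


gcfs = [GCFSketch("1"), GCFSketch("2")]

# Look objects up by their string ID instead of by a position in a list.
gcf_by_id = {g.gcf_id: g for g in gcfs}
print(gcf_by_id["2"].gcf_id)
```

A dictionary keyed by the string ID stays valid when the list of GCFs is filtered or reordered, which an `enumerate`-based index does not.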

3 changes: 2 additions & 1 deletion src/nplinker/genomics/genomics.py
@@ -245,7 +245,8 @@ def _filter_gcfs(

for bgc in bgcs_to_remove:
bgcs.remove(bgc)
strains.remove(bgc.strain)
if bgc.strain is not None:
strains.remove(bgc.strain)
Comment on lines -248 to +249
Collaborator

What is the scenario where the strain object of the bgc is None?

Contributor

The strain is None by default, so I guess this happens whenever the strain is not set through the strain setter method.

Member Author

Giulia is right. The strain might be None when BGC instances are not created using BGC Loaders (in the loaders, the strain will be set).
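The case discussed in this thread can be reproduced with small stand-in classes. This is a sketch under assumptions: `StrainSketch`, `BGCSketch` and `remove_bgcs` are hypothetical, and a plain `set` replaces `StrainCollection`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StrainSketch:
    id: str


@dataclass
class BGCSketch:
    bgc_id: str
    # None by default, as for a BGC that was not created by a loader.
    strain: StrainSketch | None = None


def remove_bgcs(bgcs: list[BGCSketch], strains: set[StrainSketch],
                to_remove: list[BGCSketch]) -> None:
    """Remove BGCs and their strains, skipping BGCs that have no strain set."""
    for bgc in to_remove:
        bgcs.remove(bgc)
        if bgc.strain is not None:
            # discard() tolerates a strain shared by several removed BGCs.
            strains.discard(bgc.strain)


s1 = StrainSketch("s1")
with_strain = BGCSketch("a", s1)
without_strain = BGCSketch("b")  # constructed directly, so strain stays None
bgcs = [with_strain, without_strain]
strains = {s1}
remove_bgcs(bgcs, strains, [with_strain, without_strain])
print(bgcs, strains)
```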


logger.info(
'Remove GCFs that has only MIBiG BGCs: removing {} GCFs and {} BGCs'.
7 changes: 1 addition & 6 deletions src/nplinker/loader.py
@@ -659,12 +659,6 @@ def _load_genomics(self):
antismash_bgc_loader.get_files(),
self._bigscape_cutoff)

# CG TODO: remove the gcf.id, see issue 103
# https://github.com/NPLinker/nplinker/issues/103
# This is only place to set gcf.id value.
for i, gcf in enumerate(self.gcfs):
gcf.id = i

#----------------------------------------------------------------------
# CG: write unknown strains in genomics to file
#----------------------------------------------------------------------
@@ -680,6 +674,7 @@ def _load_genomics(self):

return True

# TODO CG: replace deprecated load_dataset with GPNSLoader
def _load_metabolomics(self):
spec_dict, self.spectra, self.molfams, unknown_strains = load_dataset(
self.strains,
16 changes: 8 additions & 8 deletions src/nplinker/metabolomics/gnps/gnps_annotation_loader.py
@@ -2,9 +2,9 @@
from os import PathLike
from pathlib import Path
from typing import Any

from nplinker.metabolomics.abc import AnnotationLoaderBase


GNPS_URL_FORMAT = 'https://metabolomics-usi.ucsd.edu/{}/?usi=mzspec:GNPSLIBRARY:{}'

class GNPSAnnotationLoader(AnnotationLoaderBase):
@@ -15,28 +15,28 @@ def __init__(self, file: str | PathLike):
file(str | PathLike): The GNPS annotation file.
"""
self._file = Path(file)
self._annotations : dict[int, dict] = dict()
self._annotations : dict[str, dict] = {}

with open(self._file, mode='rt', encoding='UTF-8') as f:
header = f.readline().split('\t')
dict_reader = csv.DictReader(f, header, delimiter='\t')
for row in dict_reader:
scan_id = int(row.pop('#Scan#'))
scan_id = row.pop('#Scan#')
self._annotations[scan_id] = row

# also insert useful URLs
for t in ['png', 'json', 'svg', 'spectrum']:
self._annotations[scan_id][f'{t}_url'] = GNPS_URL_FORMAT.format(t, row['SpectrumID'])



def get_annotations(self) -> dict[int, dict]:

def get_annotations(self) -> dict[str, dict]:
"""Get annotations.

Returns:
dict[int, dict]: Spectra indices are keys and values are the annotations for this spectrum.
dict[str, dict]: Spectra indices are keys and values are the annotations for this spectrum.

Examples:
>>> print(loader.annotations()[100])
"""
return self._annotations
return self._annotations
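The loading loop above can be tried without a GNPS download. The sketch below assumes a hypothetical one-row annotation table with a made-up `SpectrumID`. It also strips the trailing newline from the header, which the loader in the diff does not do.

```python
import csv
import io

GNPS_URL_FORMAT = "https://metabolomics-usi.ucsd.edu/{}/?usi=mzspec:GNPSLIBRARY:{}"

# Hypothetical file content; the SpectrumID value is invented for illustration.
f = io.StringIO("#Scan#\tSpectrumID\tCompound_Name\n100\tCCMSLIB00000001\tfoo\n")

annotations: dict[str, dict] = {}
header = f.readline().rstrip("\n").split("\t")
for row in csv.DictReader(f, header, delimiter="\t"):
    scan_id = row.pop("#Scan#")  # kept as str, no int() conversion
    annotations[scan_id] = row
    # Also insert the URLs, one per output type, as the loader does.
    for t in ["png", "json", "svg", "spectrum"]:
        row[f"{t}_url"] = GNPS_URL_FORMAT.format(t, row["SpectrumID"])

print(annotations["100"]["png_url"])
```

Annotations are now retrieved with a string key such as `"100"`; the integer `100` used in the docstring example would raise a `KeyError`.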
22 changes: 11 additions & 11 deletions src/nplinker/metabolomics/gnps/gnps_molecular_family_loader.py
@@ -17,44 +17,44 @@ def __init__(self, file: str | PathLike):
file(str | PathLike): str or PathLike object pointing towards the GNPS molecular families file to load.
"""
self._families: list[MolecularFamily | SingletonFamily] = []

for family_id, spectra_ids in _load_molecular_families(file).items():
if family_id == -1:
if family_id == '-1': # the "-1" is from GNPS result
for spectrum_id in spectra_ids:
family = SingletonFamily()
family = SingletonFamily() ## uuid as family id
family.spectra_ids = set([spectrum_id])
self._families.append(family)
else:
family = MolecularFamily(family_id)
family.spectra_ids = spectra_ids
self._families.append(family)

def families(self) -> list[MolecularFamily]:
return self._families


def _load_molecular_families(file: str | PathLike) -> dict[int, set[int]]:
def _load_molecular_families(file: str | PathLike) -> dict[str, set[str]]:
"""Load ids of molecular families and corresponding spectra from GNPS output file.

Args:
file(str | PathLike): path to the GNPS file to load molecular families.

Returns:
dict[int, set[int]]: Mapping from molecular family/cluster id to the spectra ids.
dict[str, set[str]]: Mapping from molecular family/cluster id to the spectra ids.
"""
logger.debug('loading edges file: %s', file)

families: dict = {}

with open(file, mode='rt', encoding='utf-8') as f:
reader = csv.reader(f, delimiter='\t')
headers = next(reader)
cid1_index, cid2_index, fam_index = _sniff_column_indices(file, headers)

for line in reader:
spec1_id = int(line[cid1_index])
spec2_id = int(line[cid2_index])
family_id = int(line[fam_index])
spec1_id = line[cid1_index]
spec2_id = line[cid2_index]
family_id = line[fam_index]

if families.get(family_id) is None:
families[family_id] = set([spec1_id, spec2_id])
@@ -84,5 +84,5 @@ def _sniff_column_indices(file: str | PathLike, headers: list[str]) -> tuple[int
except ValueError as ve:
message = f'Unknown or missing column(s) in edges file: {file}'
raise Exception(message) from ve

return cid1_index,cid2_index,fam_index
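The two steps in this file, grouping edges by family ID and splitting the GNPS `"-1"` component into singleton families, can be sketched without the NPLinker classes. Both functions are hypothetical, the edge tuples are invented, and the `singleton-<uuid>` ID format is assumed from the related commit message and code comment.

```python
from __future__ import annotations

import uuid


def group_families(edges: list[tuple[str, str, str]]) -> dict[str, set[str]]:
    """Map each family ID to its spectrum IDs; all IDs stay strings."""
    families: dict[str, set[str]] = {}
    for spec1_id, spec2_id, family_id in edges:
        families.setdefault(family_id, set()).update((spec1_id, spec2_id))
    return families


def expand_singletons(families: dict[str, set[str]]) -> list[tuple[str, set[str]]]:
    """Give every spectrum in the "-1" component its own family ID."""
    result = []
    for family_id, spectra_ids in families.items():
        if family_id == "-1":  # the "-1" comes from the GNPS result
            for spectrum_id in sorted(spectra_ids):
                # Assumed ID format: "singleton-" followed by a UUID.
                result.append((f"singleton-{uuid.uuid4()}", {spectrum_id}))
        else:
            result.append((family_id, spectra_ids))
    return result


# Hypothetical edges as (spectrum 1, spectrum 2, family); self-loops mark singletons.
edges = [("1", "2", "10"), ("2", "3", "10"), ("4", "4", "-1"), ("5", "5", "-1")]
families = group_families(edges)
expanded = expand_singletons(families)
print(len(expanded))
```

Unique IDs let singleton families be told apart by ID alone, which the shared `-1` value did not allow.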
12 changes: 7 additions & 5 deletions src/nplinker/metabolomics/gnps/gnps_spectrum_loader.py
@@ -18,15 +18,15 @@ def __init__(self, file: str | PathLike):
ms1, ms2, metadata = LoadMGF(name_field='scans').load_spectra([str(file)])
logger.info('%d molecules parsed from MGF file', len(ms1))
self._spectra = _mols_to_spectra(ms2, metadata)

def spectra(self) -> list[Spectrum]:
"""Get the spectra loaded from the file.

Returns:
list[Spectrum]: the loaded spectra as a list of `Spectrum` objects.
"""
return self._spectra


def _mols_to_spectra(ms2: list, metadata: dict[str, dict[str, str]]) -> list[Spectrum]:
"""Function to convert ms2 object and metadata to `Spectrum` objects.
@@ -39,14 +39,16 @@ def _mols_to_spectra(ms2: list, metadata: dict[str, dict[str, str]]) -> list[Spe
list[Spectrum]: List of mass spectra obtained from ms2 and metadata.
"""
ms2_dict = {}
# an example of m:
# (118.487999, 0.0, 18.753, <nplinker.parsers.mg...105f2c970>, 'spectra.mgf', 0.0)
for m in ms2:
if not m[3] in ms2_dict:
if not m[3] in ms2_dict: # m[3] is `nplinker.parsers.mgf.MS1` object
ms2_dict[m[3]] = []
ms2_dict[m[3]].append((m[0], m[2]))

spectra = []
for i, m in enumerate(ms2_dict.keys()):
new_spectrum = Spectrum(i, ms2_dict[m], int(m.name),
for i, m in enumerate(ms2_dict.keys()): # m is `nplinker.parsers.mgf.MS1` object
new_spectrum = Spectrum(i, ms2_dict[m], m.name,
metadata[m.name]['precursormass'],
metadata[m.name]['parentmass'])
new_spectrum.metadata = metadata[m.name]
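The grouping step in `_mols_to_spectra` collects the first and third fields of each `ms2` tuple under its parent object. A minimal sketch, assuming invented tuples and plain strings in place of the `nplinker.parsers.mgf.MS1` parents; the meaning of the individual fields is not stated in the diff.

```python
# Hypothetical ms2 records shaped like the example tuple in the diff;
# the fourth field would be an MS1 object, replaced here by a string.
ms2 = [
    (118.487999, 0.0, 18.753, "scan-1", "spectra.mgf", 0.0),
    (120.1, 0.0, 5.0, "scan-1", "spectra.mgf", 0.0),
    (90.0, 0.0, 7.5, "scan-2", "spectra.mgf", 0.0),
]

ms2_dict: dict = {}
for m in ms2:
    # Group the (m[0], m[2]) pairs by parent, creating the list on first use.
    ms2_dict.setdefault(m[3], []).append((m[0], m[2]))

print(ms2_dict)
```

`setdefault` does the same job as the explicit `if not m[3] in ms2_dict` check in the loader.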