# NPLinker

NPLinker is a python framework for data mining microbial natural products by integrating genomics and metabolomics data. For a deep understanding of NPLinker, please refer to the original paper.

Under Development: NPLinker v2 is under active development (see its pre-releases). The documentation is not complete yet. If you have any questions, please contact us via GitHub Issues.

## Installation

### Requirements

- Python ≥3.9
- ~4.5GB of disk space to install all the dependencies

NPLinker is a python package that has both PyPI packages and non-PyPI packages as dependencies.

### Install

Install the `nplinker` package as follows:

```shell
# Check python version (>=3.9)
python --version

# Create a new virtual environment
python -m venv env          # (1)
source env/bin/activate     # (2)

# Install the nplinker package (requires ~300MB of disk space)
pip install --pre nplinker  # (3)

# Install nplinker non-pypi dependencies and databases (~4GB)
install-nplinker-deps
```
"},{"location":"install/#install-from-source-code","title":"Install from source code","text":"conda
to create a new environment. But NPLinker is not available on conda yet.pip
command and make sure it is provided by the activated virtual environment. --pre
option. pip install git+https://github.com/nplinker/nplinker@dev # (1)!\ninstall-nplinker-deps\n
"},{"location":"logging/","title":"How to setup logging","text":"@dev
is the branch name. You can replace it with the branch name, commit or tag.nplinker.toml
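For illustration, such a configuration could look like the sketch below. This is a minimal sketch, not a verified excerpt of the config schema: the `[log]` table and its key names are assumptions mirroring the arguments of `setup_logging` shown below, so check the Config File page for the authoritative setting names.

```toml
[log]                  # hypothetical section name
level = "DEBUG"        # logging level: DEBUG, INFO, WARNING, ...
file = "nplinker.log"  # write log messages to this file
use_console = true     # also print log messages to the console
```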
If you're using NPLinker as a library, you're using only some functions and classes of NPLinker in your script. By default, NPLinker will not log any messages. However, you can set up logging in your script to log messages:

```python
# Set up logging configuration first
from nplinker import setup_logging

setup_logging(level="DEBUG", file="nplinker.log", use_console=True)  # (1)

# Your business code here
# e.g. download and extract nplinker example data
from nplinker.utils import download_and_extract_archive

download_and_extract_archive(
    url="https://zenodo.org/records/10822604/files/nplinker_local_mode_example.zip",
    download_root=".",
)
```

1. The `setup_logging` function sets up the logging configuration. The `level` argument sets the logging level, the `file` argument sets the log file, and the `use_console` argument sets whether to log messages to the console.

The log messages will be written to the log file `nplinker.log` and displayed in the console with a format like `[Date Time] Level Log-message Module:Line`:

```shell
# Run your script
$ python your_script.py
Downloading nplinker_local_mode_example.zip ━━━━━━━━━━━━━━━ 100.0% • 195.3/195.3 MB • 2.6 MB/s • 0:00:00 • 0:01:02
[2024-05-10 15:14:48] INFO     Extracting nplinker_local_mode_example.zip to .    utils.py:401

# Check the log file
$ cat nplinker.log
[2024-05-10 15:14:48] INFO     Extracting nplinker_local_mode_example.zip to .    utils.py:401
```
"},{"location":"quickstart/","title":"Quickstart","text":"local
modepodp
mode local
mode assumes that the data required by NPLinker is available on your local machine.
METABOLOMICS-SNETS
,METABOLOMICS-SNETS-V2
FEATURE-BASED-MOLECULAR-NETWORKING
podp
mode assumes that you use an identifier of Paired Omics Data Platform (PODP) as the input for NPLinker. Then NPLinker will download and prepare all data necessary based on the PODP id which refers to the metadata of the dataset.nplinker_quickstart
:mkdir nplinker_quickstart\n
### 3. Prepare input data (`local` mode only)

Skip this step if you choose to use the `podp` mode.

If you choose to use the `local` mode, meaning you have the input data of NPLinker stored on your local machine, you need to move the input data to the working directory created in the previous step.

#### GNPS data

NPLinker accepts data from the output of the following GNPS workflows:

- `METABOLOMICS-SNETS`
- `METABOLOMICS-SNETS-V2`
- `FEATURE-BASED-MOLECULAR-NETWORKING`

NPLinker provides the tools `GNPSDownloader` and `GNPSExtractor` to download and extract the GNPS data with ease. What you need to give is a valid GNPS task ID, referring to a task of the GNPS workflows supported by NPLinker.

Given the example GNPS task at https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c22f44b14a3d450eb836d607cb9521bb, the task ID is the last part of this URL, i.e. `c22f44b14a3d450eb836d607cb9521bb`. If you open this link, you can find the workflow info in the row "Workflow" of the table "Job Status"; for this case, it is `METABOLOMICS-SNETS`.

```python
from nplinker.metabolomics.gnps import GNPSDownloader, GNPSExtractor

# Run this script inside the working directory, e.g. nplinker_quickstart

# Download GNPS data & get the path to the downloaded archive
downloader = GNPSDownloader("gnps_task_id", "downloads")  # (1)
downloaded_archive = downloader.download().get_download_file()

# Extract GNPS data to the `gnps` directory
extractor = GNPSExtractor(downloaded_archive, "gnps")  # (2)
```

1. Replace `gnps_task_id` with your GNPS task ID. The archive is downloaded to the `downloads` subdirectory of the working directory.
2. Replace `downloaded_archive` with the actual path to your GNPS data archive if you skipped the download step. The required data for NPLinker will be extracted to the `gnps` subdirectory of the working directory.

Info: Not all GNPS data are required by NPLinker, and only the necessary data will be extracted. During the extraction, these data will be renamed to the standard names used by NPLinker. See the page GNPS Data for more information.

If you have GNPS data but it is not in the archive format as downloaded from GNPS, it's recommended to re-download the data from GNPS. If (re-)downloading is not possible, you can manually prepare data for the `gnps` directory. In this case, you must make sure that the data is organized as expected by NPLinker. See the page GNPS Data for examples of how to prepare the data.
#### antiSMASH data

NPLinker requires antiSMASH BGC data as input, organized in the `antismash` subdirectory of the working directory. For each output of an antiSMASH run, the BGC data must be stored in a subdirectory named after the NCBI accession number (e.g. `GCF_000514975.1`), and only the `*.region*.gbk` files are required by NPLinker.

When manually preparing antiSMASH data for NPLinker, you must make sure that the data is organized as expected by NPLinker. See the page Working Directory Structure for more information.

#### BigScape data

It is optional to provide the output of BigScape to NPLinker. If the output of BigScape is not provided, NPLinker will run BigScape automatically to generate the data from the antiSMASH BGC data. If you have the output of BigScape, you can put its `mix_clustering_c{cutoff}.tsv` file in the `bigscape` subdirectory of the NPLinker working directory, where `{cutoff}` is the cutoff value used in the BigScape run.

#### Strain mappings file
The strain mappings file `strain_mappings.json` is required by NPLinker to map strains to genomics and metabolomics data:

```json
{
    "strain_mappings": [
        {
            "strain_id": "strain_id_1",  # (1)
            "strain_alias": ["bgc_id_1", "spectrum_id_1", ...]  # (2)
        },
        {
            "strain_id": "strain_id_2",
            "strain_alias": ["bgc_id_2", "spectrum_id_2", ...]
        },
        ...
    ],
    "version": "1.0"  # (3)
}
```

1. `strain_id` is the unique identifier of the strain.
2. `strain_alias` is a list of aliases of the strain, which are the identifiers of the BGCs and spectra of the strain.
3. `version` is the schema version of this file. It is recommended to use the latest version of the schema; the current latest version is `1.0`.

The BGC id is the same as the name of the BGC file in the `antismash` directory; for example, given a BGC file `xxxx.region001.gbk`, the BGC id is `xxxx.region001`.

The spectrum id is the same as the scan number in the `spectra.mgf` file in the `gnps` directory; for example, given a spectrum in the mgf file with a scan `SCANS=1`, the spectrum id is `1`.

If you labelled the mzXML files (input for GNPS) with the strain id, you may need the function `extract_mappings_ms_filename_spectrum_id` to extract the mappings from the mzXML file names to the spectrum ids.

For the `local` mode, you need to create this file manually and put it in the working directory. It takes some effort to prepare this file manually, especially when you have a large number of strains.
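To reduce that effort you can script the file's creation. The sketch below is a minimal illustration using only Python's standard `json` module; the strain ids and aliases are made-up placeholders that you would fill in from your own records (BGC file names and spectrum scan numbers):

```python
import json

# Hypothetical example data: map each strain id to the ids of its BGCs
# (BGC file names without the .gbk suffix) and of its spectra (scan numbers).
aliases = {
    "strain_id_1": ["xxxx.region001", "1"],
    "strain_id_2": ["yyyy.region001", "2"],
}

strain_mappings = {
    "strain_mappings": [
        {"strain_id": sid, "strain_alias": alias} for sid, alias in aliases.items()
    ],
    "version": "1.0",  # current latest schema version
}

# Write the file into the working directory
with open("strain_mappings.json", "w") as f:
    json.dump(strain_mappings, f, indent=4)
```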
### Create the configuration file

The configuration file `nplinker.toml` is required by NPLinker to specify the working directory, mode, and other settings for the run of NPLinker. You can put the `nplinker.toml` file in any place, but it is recommended to put it in the working directory created in step 2.

The details of all settings can be found on the page Config File. To keep it simple, default settings will be used automatically by NPLinker if you don't set them in your `nplinker.toml` config file. What you need to do is to set the `root_dir` and `mode` in the `nplinker.toml` file.

For the `local` mode:

```toml
# nplinker.toml
root_dir = "absolute/path/to/working/directory"  # (1)
mode = "local"
# and other settings you want to override the default settings
```

1. Replace `absolute/path/to/working/directory` with the absolute path to the working directory created in step 2.

For the `podp` mode:

```toml
# nplinker.toml
root_dir = "absolute/path/to/working/directory"  # (1)
mode = "podp"
podp_id = "podp_id"  # (2)
# and other settings you want to override the default settings
```
"},{"location":"quickstart/#4-run-nplinker","title":"4. Run NPLinker","text":"absolute/path/to/working/directory
with the absolute path to the working directory created in step 2.podp_id
with the identifier of the dataset in the Paired Omics Data Platform (PODP).from nplinker import NPLinker\n\n# create an instance of NPLinker\nnpl = NPLinker(\"nplinker.toml\") # (1)!\n\n# load data\nnpl.load_data()\n\n# check loaded data\nprint(npl.bgcs)\nprint(npl.gcfs)\nprint(npl.spectra)\nprint(npl.mfs)\nprint(npl.strains)\n\n# compute the links for the first 3 GCFs using metcalf scoring method\nlink_graph = npl.get_links(npl.gcfs[:3], \"metcalf\") # (2)!\n\n# get links as a list of tuples\nlink_graph.links \n\n# get the link data between two objects or entities\nlink_graph.get_link_data(npl.gcfs[0], npl.spectra[0]) \n\n# Save data to a pickle file\nnpl.save_data(\"npl.pkl\", link_graph)\n
nplinker.toml
with the actual path to your configuration file.get_links
returns a LinkGraph object that represents the calculated links between the GCFs and other entities as a graph.AntismashBGCLoader(data_dir: str | PathLike)\n
Bases: `BGCLoaderBase`

Data loader for AntiSMASH BGC genbank (.gbk) files.

Parameters:

- `data_dir` (`str | PathLike`) – Path to AntiSMASH directory that contains a collection of AntiSMASH outputs.

Notes: The input `data_dir` must follow the structure defined in the Working Directory Structure for AntiSMASH data, e.g.:

```shell
antismash
├── genome_id_1       # one AntiSMASH output, e.g. GCF_000514775.1
│   ├── NZ_AZWO01000004.region001.gbk
│   └── ...
├── genome_id_2
│   ├── ...
└── ...
```

Source code in `src/nplinker/genomics/antismash/antismash_loader.py`:

````python
def __init__(self, data_dir: str | PathLike) -> None:
    """Initialize the AntiSMASH BGC loader.

    Args:
        data_dir: Path to AntiSMASH directory that contains a collection of AntiSMASH outputs.

    Notes:
        The input `data_dir` must follow the structure defined in the
        [Working Directory Structure][working-directory-structure] for AntiSMASH data, e.g.:
        ```shell
        antismash
        ├── genome_id_1       # one AntiSMASH output, e.g. GCF_000514775.1
        │   ├── NZ_AZWO01000004.region001.gbk
        │   └── ...
        ├── genome_id_2
        │   ├── ...
        └── ...
        ```
    """
    self.data_dir = str(data_dir)
    self._file_dict = self._parse_data_dir(self.data_dir)
    self._bgcs = self._parse_bgcs(self._file_dict)
````
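A minimal usage sketch of the loader (the directory path is an example following the structure above, not a fixed value):

```python
from nplinker.genomics.antismash import AntismashBGCLoader

# Point the loader at the `antismash` directory of a working directory
loader = AntismashBGCLoader("nplinker_quickstart/antismash")

bgcs = loader.get_bgcs()                   # list of BGC objects
mapping = loader.get_bgc_genome_mapping()  # BGC name -> genome id
files = loader.get_files()                 # BGC name -> path to .gbk file
```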
"},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader.data_dir","title":"data_dir instance-attribute
","text":"data_dir = str(data_dir)\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader.get_bgc_genome_mapping","title":"get_bgc_genome_mapping","text":"get_bgc_genome_mapping() -> dict[str, str]\n
Get the mapping from BGC to genome.
Info: The directory name of the gbk files is treated as genome id.
Returns:
`dict[str, str]` – The key is BGC name (gbk file name) and value is genome id (the directory name of the gbk file).
Source code in `src/nplinker/genomics/antismash/antismash_loader.py`:
def get_bgc_genome_mapping(self) -> dict[str, str]:\n \"\"\"Get the mapping from BGC to genome.\n\n !!! info\n The directory name of the gbk files is treated as genome id.\n\n Returns:\n The key is BGC name (gbk file name) and value is genome id (the directory name of the\n gbk file).\n \"\"\"\n return {\n bid: os.path.basename(os.path.dirname(bpath)) for bid, bpath in self._file_dict.items()\n }\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader.get_files","title":"get_files","text":"get_files() -> dict[str, str]\n
Get BGC gbk files.
Returns:
`dict[str, str]` – The key is BGC name (gbk file name) and value is path to the gbk file.
Source code in `src/nplinker/genomics/antismash/antismash_loader.py`:
def get_files(self) -> dict[str, str]:\n \"\"\"Get BGC gbk files.\n\n Returns:\n The key is BGC name (gbk file name) and value is path to the gbk file.\n \"\"\"\n return self._file_dict\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader.get_bgcs","title":"get_bgcs","text":"get_bgcs() -> list[BGC]\n
Get all BGC objects.
Returns:
`list[BGC]` – A list of BGC objects.
Source code in `src/nplinker/genomics/antismash/antismash_loader.py`:
def get_bgcs(self) -> list[BGC]:\n \"\"\"Get all BGC objects.\n\n Returns:\n A list of BGC objects\n \"\"\"\n return self._bgcs\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus","title":"GenomeStatus","text":"GenomeStatus(\n original_id: str,\n resolved_refseq_id: str = \"\",\n resolve_attempted: bool = False,\n bgc_path: str = \"\",\n)\n
Class to represent the status of a single genome.
The status of genomes is tracked in the file GENOME_STATUS_FILENAME.
Parameters:
- `original_id` (`str`) – The original ID of the genome.
- `resolved_refseq_id` (`str`, default: `''`) – The resolved RefSeq ID of the genome. Defaults to "".
- `resolve_attempted` (`bool`, default: `False`) – A flag indicating whether an attempt to resolve the RefSeq ID has been made. Defaults to False.
- `bgc_path` (`str`, default: `''`) – The path to the downloaded BGC file for the genome. Defaults to "".
Source code in `src/nplinker/genomics/antismash/podp_antismash_downloader.py`:
def __init__(\n self,\n original_id: str,\n resolved_refseq_id: str = \"\",\n resolve_attempted: bool = False,\n bgc_path: str = \"\",\n):\n \"\"\"Initialize a GenomeStatus object for the given genome.\n\n Args:\n original_id: The original ID of the genome.\n resolved_refseq_id: The resolved RefSeq ID of the\n genome. Defaults to \"\".\n resolve_attempted: A flag indicating whether an\n attempt to resolve the RefSeq ID has been made. Defaults to False.\n bgc_path: The path to the downloaded BGC file for\n the genome. Defaults to \"\".\n \"\"\"\n self.original_id = original_id\n self.resolved_refseq_id = \"\" if resolved_refseq_id == \"None\" else resolved_refseq_id\n self.resolve_attempted = resolve_attempted\n self.bgc_path = bgc_path\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.original_id","title":"original_id instance-attribute
","text":"original_id = original_id\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.resolved_refseq_id","title":"resolved_refseq_id instance-attribute
","text":"resolved_refseq_id = (\n \"\"\n if resolved_refseq_id == \"None\"\n else resolved_refseq_id\n)\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.resolve_attempted","title":"resolve_attempted instance-attribute
","text":"resolve_attempted = resolve_attempted\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.bgc_path","title":"bgc_path instance-attribute
","text":"bgc_path = bgc_path\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.read_json","title":"read_json staticmethod
","text":"read_json(\n file: str | PathLike,\n) -> dict[str, \"GenomeStatus\"]\n
Get a dict of GenomeStatus objects by loading given genome status file.
Note that an empty dict is returned if the given file doesn't exist.
Parameters:
- `file` (`str | PathLike`) – Path to genome status file.
Returns:
`dict[str, 'GenomeStatus']` – Dict keys are genome original id and values are GenomeStatus objects. An empty dict is returned if the given file doesn't exist.
Source code in `src/nplinker/genomics/antismash/podp_antismash_downloader.py`:
@staticmethod\ndef read_json(file: str | PathLike) -> dict[str, \"GenomeStatus\"]:\n \"\"\"Get a dict of GenomeStatus objects by loading given genome status file.\n\n Note that an empty dict is returned if the given file doesn't exist.\n\n Args:\n file: Path to genome status file.\n\n Returns:\n Dict keys are genome original id and values are GenomeStatus\n objects. An empty dict is returned if the given file doesn't exist.\n \"\"\"\n genome_status_dict = {}\n if Path(file).exists():\n with open(file, \"r\") as f:\n data = json.load(f)\n\n # validate json data before using it\n validate(data, schema=GENOME_STATUS_SCHEMA)\n\n genome_status_dict = {\n gs[\"original_id\"]: GenomeStatus(**gs) for gs in data[\"genome_status\"]\n }\n return genome_status_dict\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.to_json","title":"to_json staticmethod
","text":"to_json(\n genome_status_dict: Mapping[str, \"GenomeStatus\"],\n file: str | PathLike | None = None,\n) -> str | None\n
Convert the genome status dictionary to a JSON string.
If a file path is provided, the JSON string is written to the file. If the file already exists, it is overwritten.
Parameters:
- `genome_status_dict` (`Mapping[str, 'GenomeStatus']`) – A dictionary of genome status objects. The keys are the original genome IDs and the values are GenomeStatus objects.
- `file` (`str | PathLike | None`, default: `None`) – The path to the output JSON file. If None, the JSON string is returned but not written to a file.
Returns:
`str | None` – The JSON string if `file` is None, otherwise None.
Source code in `src/nplinker/genomics/antismash/podp_antismash_downloader.py`:
@staticmethod\ndef to_json(\n genome_status_dict: Mapping[str, \"GenomeStatus\"], file: str | PathLike | None = None\n) -> str | None:\n \"\"\"Convert the genome status dictionary to a JSON string.\n\n If a file path is provided, the JSON string is written to the file. If\n the file already exists, it is overwritten.\n\n Args:\n genome_status_dict: A dictionary of genome\n status objects. The keys are the original genome IDs and the values\n are GenomeStatus objects.\n file: The path to the output JSON file.\n If None, the JSON string is returned but not written to a file.\n\n Returns:\n The JSON string if `file` is None, otherwise None.\n \"\"\"\n gs_list = [gs._to_dict() for gs in genome_status_dict.values()]\n json_data = {\"genome_status\": gs_list, \"version\": \"1.0\"}\n\n # validate json object before dumping\n validate(json_data, schema=GENOME_STATUS_SCHEMA)\n\n if file is not None:\n with open(file, \"w\") as f:\n json.dump(json_data, f)\n return None\n return json.dumps(json_data)\n
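A small round-trip sketch for the two static methods (the genome ids and file name are made-up examples; the import path follows this page's module, `nplinker.genomics.antismash`):

```python
from nplinker.genomics.antismash import GenomeStatus

# Two status records keyed by their original genome ids (example ids)
gs_dict = {
    "GCF_000514775.1": GenomeStatus("GCF_000514775.1"),
    "GCF_000514975.1": GenomeStatus("GCF_000514975.1", resolve_attempted=True),
}

# Write the statuses to a JSON file (an existing file is overwritten)
GenomeStatus.to_json(gs_dict, "genome_status.json")

# Load them back; an empty dict would be returned for a missing file
loaded = GenomeStatus.read_json("genome_status.json")
```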
"},{"location":"api/antismash/#nplinker.genomics.antismash.download_and_extract_antismash_data","title":"download_and_extract_antismash_data","text":"download_and_extract_antismash_data(\n antismash_id: str,\n download_root: str | PathLike,\n extract_root: str | PathLike,\n) -> None\n
Download and extract antiSMASH BGC archive for a specified genome.
The antiSMASH database (https://antismash-db.secondarymetabolites.org/) is used to download the BGC archive, and antiSMASH uses the RefSeq assembly id of a genome as the id of the archive.
Parameters:
- `antismash_id` (`str`) – The id used to download the BGC archive from the antiSMASH database. If the id is versioned (e.g., "GCF_004339725.1") please be sure to specify the version as well.
- `download_root` (`str | PathLike`) – Path to the directory to place the downloaded archive in.
- `extract_root` (`str | PathLike`) – Path to the directory data files will be extracted to. Note that an `antismash` directory will be created in the specified `extract_root` if it doesn't exist. The files will be extracted to the `<extract_root>/antismash/<antismash_id>` directory.
Raises:
- `ValueError` – if the `<extract_root>/antismash/<refseq_assembly_id>` dir is not empty.
Examples:
>>> download_and_extract_antismash_metadata(\"GCF_004339725.1\", \"/data/download\", \"/data/extracted\")\n
Source code in src/nplinker/genomics/antismash/antismash_downloader.py
def download_and_extract_antismash_data(\n antismash_id: str, download_root: str | PathLike, extract_root: str | PathLike\n) -> None:\n \"\"\"Download and extract antiSMASH BGC archive for a specified genome.\n\n The antiSMASH database (https://antismash-db.secondarymetabolites.org/)\n is used to download the BGC archive. And antiSMASH use RefSeq assembly id\n of a genome as the id of the archive.\n\n Args:\n antismash_id: The id used to download BGC archive from antiSMASH database.\n If the id is versioned (e.g., \"GCF_004339725.1\") please be sure to\n specify the version as well.\n download_root: Path to the directory to place downloaded archive in.\n extract_root: Path to the directory data files will be extracted to.\n Note that an `antismash` directory will be created in the specified `extract_root` if\n it doesn't exist. The files will be extracted to `<extract_root>/antismash/<antismash_id>` directory.\n\n Raises:\n ValueError: if `<extract_root>/antismash/<refseq_assembly_id>` dir is not empty.\n\n Examples:\n >>> download_and_extract_antismash_metadata(\"GCF_004339725.1\", \"/data/download\", \"/data/extracted\")\n \"\"\"\n download_root = Path(download_root)\n extract_root = Path(extract_root)\n extract_path = extract_root / \"antismash\" / antismash_id\n\n try:\n if extract_path.exists():\n _check_extract_path(extract_path)\n else:\n extract_path.mkdir(parents=True, exist_ok=True)\n\n for base_url in [ANTISMASH_DB_DOWNLOAD_URL, ANTISMASH_DBV2_DOWNLOAD_URL]:\n url = base_url.format(antismash_id, antismash_id + \".zip\")\n download_and_extract_archive(url, download_root, extract_path, antismash_id + \".zip\")\n break\n\n # delete subdirs\n for subdir_path in list_dirs(extract_path):\n shutil.rmtree(subdir_path)\n\n # delete unnecessary files\n files_to_keep = list_files(extract_path, suffix=(\".json\", \".gbk\"))\n for file in list_files(extract_path):\n if file not in files_to_keep:\n os.remove(file)\n\n logger.info(\"antiSMASH BGC data of %s is downloaded and extracted.\", antismash_id)\n\n except Exception as e:\n shutil.rmtree(extract_path)\n logger.warning(e)\n raise e\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.parse_bgc_genbank","title":"parse_bgc_genbank","text":"parse_bgc_genbank(file: str | PathLike) -> BGC\n
Parse a single BGC gbk file to BGC object.
Parameters:
- `file` (`str | PathLike`) – Path to BGC gbk file.
Returns:
`BGC` – BGC object.
Examples:
>>> bgc = AntismashBGCLoader.parse_bgc(\n... \"/data/antismash/GCF_000016425.1/NC_009380.1.region001.gbk\")\n
Source code in src/nplinker/genomics/antismash/antismash_loader.py
def parse_bgc_genbank(file: str | PathLike) -> BGC:\n \"\"\"Parse a single BGC gbk file to BGC object.\n\n Args:\n file: Path to BGC gbk file\n\n Returns:\n BGC object\n\n Examples:\n >>> bgc = AntismashBGCLoader.parse_bgc(\n ... \"/data/antismash/GCF_000016425.1/NC_009380.1.region001.gbk\")\n \"\"\"\n file = Path(file)\n fname = file.stem\n\n record = SeqIO.read(file, format=\"genbank\")\n description = record.description # \"DEFINITION\" in gbk file\n antismash_id = record.id # \"VERSION\" in gbk file\n features = _parse_antismash_genbank(record)\n product_prediction = features.get(\"product\")\n if product_prediction is None:\n raise ValueError(f\"Not found product prediction in antiSMASH Genbank file {file}\")\n\n # init BGC\n bgc = BGC(fname, *product_prediction)\n bgc.description = description\n bgc.antismash_id = antismash_id\n bgc.antismash_file = str(file)\n bgc.antismash_region = features.get(\"region_number\")\n bgc.smiles = features.get(\"smiles\")\n bgc.strain = Strain(fname)\n return bgc\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.get_best_available_genome_id","title":"get_best_available_genome_id","text":"get_best_available_genome_id(\n genome_id_data: Mapping[str, str]\n) -> str | None\n
Get the best available ID from genome_id_data dict.
Parameters:
- `genome_id_data` (`Mapping[str, str]`) – dictionary containing information for each genome record present.
Returns:
`str | None` – ID for the genome, if present, otherwise None.
Source code in `src/nplinker/genomics/antismash/podp_antismash_downloader.py`:
def get_best_available_genome_id(genome_id_data: Mapping[str, str]) -> str | None:\n \"\"\"Get the best available ID from genome_id_data dict.\n\n Args:\n genome_id_data: dictionary containing information for each genome record present.\n\n Returns:\n ID for the genome, if present, otherwise None.\n \"\"\"\n if \"RefSeq_accession\" in genome_id_data:\n best_id = genome_id_data[\"RefSeq_accession\"]\n elif \"GenBank_accession\" in genome_id_data:\n best_id = genome_id_data[\"GenBank_accession\"]\n elif \"JGI_Genome_ID\" in genome_id_data:\n best_id = genome_id_data[\"JGI_Genome_ID\"]\n else:\n best_id = None\n\n if best_id is None or len(best_id) == 0:\n logger.warning(f\"Failed to get valid genome ID in genome data: {genome_id_data}\")\n return None\n return best_id\n
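For instance (a hypothetical record; the key names are the accession types named above, and RefSeq is preferred over GenBank and JGI ids):

```python
from nplinker.genomics.antismash import get_best_available_genome_id

genome_id_data = {
    "RefSeq_accession": "GCF_004339725.1",
    "GenBank_accession": "GCA_004339725.1",
}

# Returns "GCF_004339725.1" because the RefSeq accession takes priority
best_id = get_best_available_genome_id(genome_id_data)
```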
"},{"location":"api/antismash/#nplinker.genomics.antismash.podp_download_and_extract_antismash_data","title":"podp_download_and_extract_antismash_data","text":"podp_download_and_extract_antismash_data(\n genome_records: Sequence[\n Mapping[str, Mapping[str, str]]\n ],\n project_download_root: str | PathLike,\n project_extract_root: str | PathLike,\n)\n
Download and extract antiSMASH BGC archive for the given genome records.
Parameters:
- `genome_records` (`Sequence[Mapping[str, Mapping[str, str]]]`) – list of dicts representing genome records. The dict of each genome record contains a key of genome ID with a value of another dict containing information about genome type, label and accession ids (RefSeq, GenBank, and/or JGI).
- `project_download_root` (`str | PathLike`) – Path to the directory to place the downloaded archive in.
- `project_extract_root` (`str | PathLike`) – Path to the directory the downloaded archive will be extracted to. Note that an `antismash` directory will be created in the specified `extract_root` if it doesn't exist. The files will be extracted to the `<extract_root>/antismash/<antismash_id>` directory.
Warns:
- `UserWarning` – when no antiSMASH data is found for some genomes.
Source code in `src/nplinker/genomics/antismash/podp_antismash_downloader.py`:
def podp_download_and_extract_antismash_data(\n genome_records: Sequence[Mapping[str, Mapping[str, str]]],\n project_download_root: str | PathLike,\n project_extract_root: str | PathLike,\n):\n \"\"\"Download and extract antiSMASH BGC archive for the given genome records.\n\n Args:\n genome_records: list of dicts representing genome records.\n\n The dict of each genome record contains a key of genome ID with a value\n of another dict containing information about genome type, label and\n accession ids (RefSeq, GenBank, and/or JGI).\n project_download_root: Path to the directory to place\n downloaded archive in.\n project_extract_root: Path to the directory downloaded archive will be extracted to.\n\n Note that an `antismash` directory will be created in the specified\n `extract_root` if it doesn't exist. The files will be extracted to\n `<extract_root>/antismash/<antismash_id>` directory.\n\n Warnings:\n UserWarning: when no antiSMASH data is found for some genomes.\n \"\"\"\n if not Path(project_download_root).exists():\n # otherwise in case of failed first download, the folder doesn't exist and\n # genome_status_file can't be written\n Path(project_download_root).mkdir(parents=True, exist_ok=True)\n\n gs_file = Path(project_download_root, GENOME_STATUS_FILENAME)\n gs_dict = GenomeStatus.read_json(gs_file)\n\n for i, genome_record in enumerate(genome_records):\n # get the best available ID from the dict\n genome_id_data = genome_record[\"genome_ID\"]\n raw_genome_id = get_best_available_genome_id(genome_id_data)\n if raw_genome_id is None or len(raw_genome_id) == 0:\n logger.warning(f'Invalid input genome record \"{genome_record}\"')\n continue\n\n # check if genome ID exist in the genome status file\n if raw_genome_id not in gs_dict:\n gs_dict[raw_genome_id] = GenomeStatus(raw_genome_id)\n\n gs_obj = gs_dict[raw_genome_id]\n\n logger.info(\n f\"Checking for antismash data {i + 1}/{len(genome_records)}, \"\n f\"current genome ID={raw_genome_id}\"\n )\n # first, check if BGC data is downloaded\n if gs_obj.bgc_path and Path(gs_obj.bgc_path).exists():\n logger.info(f\"Genome ID {raw_genome_id} already downloaded to {gs_obj.bgc_path}\")\n continue\n # second, check if lookup attempted previously\n if gs_obj.resolve_attempted:\n logger.info(f\"Genome ID {raw_genome_id} skipped due to previous failed attempt\")\n continue\n\n # if not downloaded or lookup attempted, then try to resolve the ID\n # and download\n logger.info(f\"Start lookup process for genome ID {raw_genome_id}\")\n gs_obj.resolved_refseq_id = _resolve_refseq_id(genome_id_data)\n gs_obj.resolve_attempted = True\n\n if gs_obj.resolved_refseq_id == \"\":\n # give up on this one\n logger.warning(f\"Failed lookup for genome ID {raw_genome_id}\")\n continue\n\n # if resolved id is valid, try to download and extract antismash data\n try:\n download_and_extract_antismash_data(\n gs_obj.resolved_refseq_id, project_download_root, project_extract_root\n )\n\n gs_obj.bgc_path = str(\n Path(project_download_root, gs_obj.resolved_refseq_id + \".zip\").absolute()\n )\n\n output_path = Path(project_extract_root, \"antismash\", gs_obj.resolved_refseq_id)\n if output_path.exists():\n Path.touch(output_path / \"completed\", exist_ok=True)\n\n except Exception:\n gs_obj.bgc_path = \"\"\n\n # raise and log warning for failed downloads\n failed_ids = [gs.original_id for gs in gs_dict.values() if not gs.bgc_path]\n if failed_ids:\n warning_message = (\n f\"Failed to download antiSMASH data for the following genome IDs: {failed_ids}\"\n )\n 
logger.warning(warning_message)\n warnings.warn(warning_message, UserWarning)\n\n # save updated genome status to json file\n GenomeStatus.to_json(gs_dict, gs_file)\n\n if len(failed_ids) == len(genome_records):\n raise ValueError(\"No antiSMASH data found for any genome\")\n
"},{"location":"api/arranger/","title":"Dataset Arranger","text":""},{"location":"api/arranger/#nplinker.arranger","title":"nplinker.arranger","text":""},{"location":"api/arranger/#nplinker.arranger.PODP_PROJECT_URL","title":"PODP_PROJECT_URL module-attribute
","text":"PODP_PROJECT_URL = \"https://pairedomicsdata.bioinformatics.nl/api/projects/{}\"\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger","title":"DatasetArranger","text":"DatasetArranger(config: Dynaconf)\n
Arrange datasets based on the fixed working directory structure with the given configuration.
Concept and Diagram: see the Working Directory Structure and Dataset Arranging Pipeline pages.
\"Arrange datasets\" means:
- For `local` mode (`config.mode` is `local`), the datasets provided by users are validated.
- For `podp` mode (`config.mode` is `podp`), the datasets are automatically downloaded or generated, then validated.

The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE data.
Attributes:
- `config` – A Dynaconf object that contains the configuration settings.
- `root_dir` – The root directory of the datasets.
- `downloads_dir` – The directory to store downloaded files.
- `mibig_dir` – The directory to store MIBiG metadata.
- `gnps_dir` – The directory to store GNPS data.
- `antismash_dir` – The directory to store antiSMASH data.
- `bigscape_dir` – The directory to store BiG-SCAPE data.
- `bigscape_running_output_dir` – The directory to store the running output of BiG-SCAPE.
Parameters:
- `config` (`Dynaconf`) – A Dynaconf object that contains the configuration settings.
Examples:
>>> from nplinker.config import load_config\n>>> from nplinker.arranger import DatasetArranger\n>>> config = load_config(\"nplinker.toml\")\n>>> arranger = DatasetArranger(config)\n>>> arranger.arrange()\n
See Also: DatasetLoader – Load all data from files to memory.

Source code in `src/nplinker/arranger.py`:
def __init__(self, config: Dynaconf) -> None:\n \"\"\"Initialize the DatasetArranger.\n\n Args:\n config: A Dynaconf object that contains the configuration settings.\n\n\n Examples:\n >>> from nplinker.config import load_config\n >>> from nplinker.arranger import DatasetArranger\n >>> config = load_config(\"nplinker.toml\")\n >>> arranger = DatasetArranger(config)\n >>> arranger.arrange()\n\n See Also:\n [DatasetLoader][nplinker.loader.DatasetLoader]: Load all data from files to memory.\n \"\"\"\n self.config = config\n self.root_dir = config.root_dir\n self.downloads_dir = self.root_dir / defaults.DOWNLOADS_DIRNAME\n self.downloads_dir.mkdir(exist_ok=True)\n\n self.mibig_dir = self.root_dir / defaults.MIBIG_DIRNAME\n self.gnps_dir = self.root_dir / defaults.GNPS_DIRNAME\n self.antismash_dir = self.root_dir / defaults.ANTISMASH_DIRNAME\n self.bigscape_dir = self.root_dir / defaults.BIGSCAPE_DIRNAME\n self.bigscape_running_output_dir = (\n self.bigscape_dir / defaults.BIGSCAPE_RUNNING_OUTPUT_DIRNAME\n )\n\n self.arrange_podp_project_json()\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.config","title":"config instance-attribute
","text":"config = config\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.root_dir","title":"root_dir instance-attribute
","text":"root_dir = root_dir\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.downloads_dir","title":"downloads_dir instance-attribute
","text":"downloads_dir = root_dir / DOWNLOADS_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.mibig_dir","title":"mibig_dir instance-attribute
","text":"mibig_dir = root_dir / MIBIG_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.gnps_dir","title":"gnps_dir instance-attribute
","text":"gnps_dir = root_dir / GNPS_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.antismash_dir","title":"antismash_dir instance-attribute
","text":"antismash_dir = root_dir / ANTISMASH_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.bigscape_dir","title":"bigscape_dir instance-attribute
","text":"bigscape_dir = root_dir / BIGSCAPE_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.bigscape_running_output_dir","title":"bigscape_running_output_dir instance-attribute
","text":"bigscape_running_output_dir = (\n bigscape_dir / BIGSCAPE_RUNNING_OUTPUT_DIRNAME\n)\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange","title":"arrange","text":"arrange() -> None\n
Arrange all datasets according to the configuration.
The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE.
Source code in `src/nplinker/arranger.py`:
def arrange(self) -> None:\n \"\"\"Arrange all datasets according to the configuration.\n\n The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE.\n \"\"\"\n # The order of arranging the datasets matters, as some datasets depend on others\n self.arrange_mibig()\n self.arrange_gnps()\n self.arrange_antismash()\n self.arrange_bigscape()\n self.arrange_strain_mappings()\n self.arrange_strains_selected()\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_podp_project_json","title":"arrange_podp_project_json","text":"arrange_podp_project_json() -> None\n
Arrange the PODP project JSON file.
This method only works for the `podp` mode. If the JSON file does not exist, download it first; then the downloaded or existing JSON file will be validated according to the PODP_ADAPTED_SCHEMA.

Source code in `src/nplinker/arranger.py`:
def arrange_podp_project_json(self) -> None:\n \"\"\"Arrange the PODP project JSON file.\n\n This method only works for the `podp` mode. If the JSON file does not exist, download it\n first; then the downloaded or existing JSON file will be validated according to the\n [PODP_ADAPTED_SCHEMA][nplinker.schemas.PODP_ADAPTED_SCHEMA].\n \"\"\"\n if self.config.mode == \"podp\":\n file_name = f\"paired_datarecord_{self.config.podp_id}.json\"\n podp_file = self.downloads_dir / file_name\n if not podp_file.exists():\n download_url(\n PODP_PROJECT_URL.format(self.config.podp_id),\n self.downloads_dir,\n file_name,\n )\n\n with open(podp_file, \"r\") as f:\n json_data = json.load(f)\n validate_podp_json(json_data)\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_mibig","title":"arrange_mibig","text":"arrange_mibig() -> None\n
Arrange the MIBiG metadata.
If `config.mibig.to_use` is `True`, download and extract the MIBiG metadata and override the existing MIBiG metadata if it exists. This ensures that the MIBiG metadata is always up-to-date to the specified version in the configuration.

Source code in `src/nplinker/arranger.py`:
def arrange_mibig(self) -> None:\n \"\"\"Arrange the MIBiG metadata.\n\n If `config.mibig.to_use` is `True`, download and extract the MIBiG metadata and override\n the existing MIBiG metadata if it exists. This ensures that the MIBiG metadata is always\n up-to-date to the specified version in the configuration.\n \"\"\"\n if self.config.mibig.to_use:\n if self.mibig_dir.exists():\n # remove existing mibig data\n shutil.rmtree(self.mibig_dir)\n download_and_extract_mibig_metadata(\n self.downloads_dir,\n self.mibig_dir,\n version=self.config.mibig.version,\n )\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_gnps","title":"arrange_gnps","text":"arrange_gnps() -> None\n
Arrange the GNPS data.
For `local` mode, validate the GNPS data directory.

For `podp` mode, if the GNPS data does not exist, download it; if it exists but is not valid, remove the data and re-download it.

The validation process includes:

- Check if the GNPS data directory exists.
- Check if the required files exist in the GNPS data directory, including:
    - `file_mappings.tsv` or `file_mappings.csv`
    - `spectra.mgf`
    - `molecular_families.tsv`
    - `annotations.tsv`

Source code in `src/nplinker/arranger.py`:
def arrange_gnps(self) -> None:\n \"\"\"Arrange the GNPS data.\n\n For `local` mode, validate the GNPS data directory.\n\n For `podp` mode, if the GNPS data does not exist, download it; if it exists but not valid,\n remove the data and re-downloads it.\n\n The validation process includes:\n\n - Check if the GNPS data directory exists.\n - Check if the required files exist in the GNPS data directory, including:\n - `file_mappings.tsv` or `file_mappings.csv`\n - `spectra.mgf`\n - `molecular_families.tsv`\n - `annotations.tsv`\n \"\"\"\n pass_validation = False\n if self.config.mode == \"podp\":\n # retry downloading at most 3 times if downloaded data has problems\n for _ in range(3):\n try:\n validate_gnps(self.gnps_dir)\n pass_validation = True\n break\n except (FileNotFoundError, ValueError):\n # Don't need to remove downloaded archive, as it'll be overwritten\n shutil.rmtree(self.gnps_dir, ignore_errors=True)\n self._download_and_extract_gnps()\n\n if not pass_validation:\n validate_gnps(self.gnps_dir)\n\n # get the path to file_mappings file (csv or tsv)\n self.gnps_file_mappings_file = self._get_gnps_file_mappings_file()\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_antismash","title":"arrange_antismash","text":"arrange_antismash() -> None\n
Arrange the antiSMASH data.
For `local` mode, validate the antiSMASH data.

For `podp` mode, if the antiSMASH data does not exist, download it; if it exists but is not valid, remove the data and re-download it.

The validation process includes:

- Check if the antiSMASH data directory exists.
- Check if the antiSMASH data directory contains at least one sub-directory, and each sub-directory contains at least one BGC file (with the suffix `.region???.gbk` where `???` is a number).

The antiSMASH BGC directory must follow the structure below:
```shell
antismash
├── genome_id_1 (one AntiSMASH output, e.g. GCF_000514775.1)
│   ├── GCF_000514775.1.gbk
│   ├── NZ_AZWO01000004.region001.gbk
│   └── ...
├── genome_id_2
│   ├── ...
└── ...
```
Source code in src/nplinker/arranger.py
def arrange_antismash(self) -> None:\n \"\"\"Arrange the antiSMASH data.\n\n For `local` mode, validate the antiSMASH data.\n\n For `podp` mode, if the antiSMASH data does not exist, download it; if it exists but not\n valid, remove the data and re-download it.\n\n The validation process includes:\n\n - Check if the antiSMASH data directory exists.\n - Check if the antiSMASH data directory contains at least one sub-directory, and each\n sub-directory contains at least one BGC file (with the suffix `.region???.gbk` where\n `???` is a number).\n\n AntiSMASH BGC directory must follow the structure below:\n ```\n antismash\n \u251c\u2500\u2500 genome_id_1 (one AntiSMASH output, e.g. GCF_000514775.1)\n \u2502\u00a0 \u251c\u2500\u2500 GCF_000514775.1.gbk\n \u2502\u00a0 \u251c\u2500\u2500 NZ_AZWO01000004.region001.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 genome_id_2\n \u2502\u00a0 \u251c\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n ```\n \"\"\"\n pass_validation = False\n if self.config.mode == \"podp\":\n for _ in range(3):\n try:\n validate_antismash(self.antismash_dir)\n pass_validation = True\n break\n except FileNotFoundError:\n shutil.rmtree(self.antismash_dir, ignore_errors=True)\n self._download_and_extract_antismash()\n\n if not pass_validation:\n validate_antismash(self.antismash_dir)\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_bigscape","title":"arrange_bigscape","text":"arrange_bigscape() -> None\n
Arrange the BiG-SCAPE data.
For `local` mode, validate the BiG-SCAPE data.

For `podp` mode, if the BiG-SCAPE data does not exist, run BiG-SCAPE to generate the clustering file; if it exists but is not valid, remove the data and re-run BiG-SCAPE to generate the data.

The running output of BiG-SCAPE will be saved to the directory `bigscape_running_output` in the default BiG-SCAPE directory, and the clustering file `mix_clustering_c{self.config.bigscape.cutoff}.tsv` will be copied to the default BiG-SCAPE directory.

The validation process includes:

- Check if the default BiG-SCAPE data directory exists.
- Check if the clustering file `mix_clustering_c{self.config.bigscape.cutoff}.tsv` exists in the BiG-SCAPE data directory.
- Check if the `data_sqlite.db` file exists in the BiG-SCAPE data directory.

Source code in `src/nplinker/arranger.py`:
def arrange_bigscape(self) -> None:\n \"\"\"Arrange the BiG-SCAPE data.\n\n For `local` mode, validate the BiG-SCAPE data.\n\n For `podp` mode, if the BiG-SCAPE data does not exist, run BiG-SCAPE to generate the\n clustering file; if it exists but not valid, remove the data and re-run BiG-SCAPE to generate\n the data.\n\n The running output of BiG-SCAPE will be saved to the directory `bigscape_running_output`\n in the default BiG-SCAPE directory, and the clustering file\n `mix_clustering_c{self.config.bigscape.cutoff}.tsv` will be copied to the default BiG-SCAPE\n directory.\n\n The validation process includes:\n\n - Check if the default BiG-SCAPE data directory exists.\n - Check if the clustering file `mix_clustering_c{self.config.bigscape.cutoff}.tsv` exists in the\n BiG-SCAPE data directory.\n - Check if the `data_sqlite.db` file exists in the BiG-SCAPE data directory.\n \"\"\"\n pass_validation = False\n if self.config.mode == \"podp\":\n for _ in range(3):\n try:\n validate_bigscape(self.bigscape_dir, self.config.bigscape.cutoff)\n pass_validation = True\n break\n except FileNotFoundError:\n shutil.rmtree(self.bigscape_dir, ignore_errors=True)\n self._run_bigscape()\n\n if not pass_validation:\n validate_bigscape(self.bigscape_dir, self.config.bigscape.cutoff)\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_strain_mappings","title":"arrange_strain_mappings","text":"arrange_strain_mappings() -> None\n
Arrange the strain mappings file.
For `local` mode, validate the strain mappings file.

For `podp` mode, always generate a new strain mappings file and validate it.
The validation checks if the strain mappings file exists and if it is a valid JSON file according to STRAIN_MAPPINGS_SCHEMA.
Source code in `src/nplinker/arranger.py`:
def arrange_strain_mappings(self) -> None:\n \"\"\"Arrange the strain mappings file.\n\n For `local` mode, validate the strain mappings file.\n\n For `podp` mode, always generate the new strain mappings file and validate it.\n\n The validation checks if the strain mappings file exists and if it is a valid JSON file\n according to [STRAIN_MAPPINGS_SCHEMA][nplinker.schemas.STRAIN_MAPPINGS_SCHEMA].\n \"\"\"\n if self.config.mode == \"podp\":\n self._generate_strain_mappings()\n\n self._validate_strain_mappings()\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_strains_selected","title":"arrange_strains_selected","text":"arrange_strains_selected() -> None\n
Arrange the strains selected file.
If the file exists, validate it according to the schema defined in `user_strains.json`.

Source code in `src/nplinker/arranger.py`:
def arrange_strains_selected(self) -> None:\n \"\"\"Arrange the strains selected file.\n\n If the file exists, validate it according to the schema defined in `user_strains.json`.\n \"\"\"\n strains_selected_file = self.root_dir / defaults.STRAINS_SELECTED_FILENAME\n if strains_selected_file.exists():\n with open(strains_selected_file, \"r\") as f:\n json_data = json.load(f)\n validate(instance=json_data, schema=USER_STRAINS_SCHEMA)\n
"},{"location":"api/arranger/#nplinker.arranger.validate_gnps","title":"validate_gnps","text":"validate_gnps(gnps_dir: str | PathLike) -> None\n
Validate the GNPS data directory and its contents.
The GNPS data directory must contain the following files:
- `file_mappings.tsv` or `file_mappings.csv`
- `spectra.mgf`
- `molecular_families.tsv`
- `annotations.tsv`
Parameters:
- `gnps_dir` (`str | PathLike`) – Path to the GNPS data directory.
Raises:
- `FileNotFoundError` – If the GNPS data directory is not found or any of the required files is not found.
- `ValueError` – If both file_mappings.tsv and file_mappings.csv are found.

Source code in `src/nplinker/arranger.py`:
```python
def validate_gnps(gnps_dir: str | PathLike) -> None:
    """Validate the GNPS data directory and its contents.

    The GNPS data directory must contain the following files:

    - `file_mappings.tsv` or `file_mappings.csv`
    - `spectra.mgf`
    - `molecular_families.tsv`
    - `annotations.tsv`

    Args:
        gnps_dir: Path to the GNPS data directory.

    Raises:
        FileNotFoundError: If the GNPS data directory is not found or any of the required files
            is not found.
        ValueError: If both file_mappings.tsv and file_mappings.csv are found.
    """
    gnps_dir = Path(gnps_dir)
    if not gnps_dir.exists():
        raise FileNotFoundError(f"GNPS data directory not found at {gnps_dir}")

    file_mappings_tsv = gnps_dir / defaults.GNPS_FILE_MAPPINGS_TSV
    file_mappings_csv = gnps_dir / defaults.GNPS_FILE_MAPPINGS_CSV
    if file_mappings_tsv.exists() and file_mappings_csv.exists():
        raise ValueError(
            f"Both {file_mappings_tsv.name} and {file_mappings_csv.name} found in GNPS directory "
            f"{gnps_dir}, only one is allowed."
        )
    elif not file_mappings_tsv.exists() and not file_mappings_csv.exists():
        raise FileNotFoundError(
            f"Neither {file_mappings_tsv.name} nor {file_mappings_csv.name} found in GNPS directory"
            f" {gnps_dir}"
        )

    required_files = [
        gnps_dir / defaults.GNPS_SPECTRA_FILENAME,
        gnps_dir / defaults.GNPS_MOLECULAR_FAMILY_FILENAME,
        gnps_dir / defaults.GNPS_ANNOTATIONS_FILENAME,
    ]
    list_not_found = [f.name for f in required_files if not f.exists()]
    if list_not_found:
        # fixed: the joined file names are now interpolated into the f-string
        raise FileNotFoundError(
            f"Files not found in GNPS directory {gnps_dir}: {', '.join(list_not_found)}"
        )
```
"},{"location":"api/arranger/#nplinker.arranger.validate_antismash","title":"validate_antismash","text":"validate_antismash(antismash_dir: str | PathLike) -> None\n
Validate the antiSMASH data directory and its contents.
The validation only checks the structure of the antiSMASH data directory and file names. It does not check:

- the content of the BGC files
- the consistency between the antiSMASH data and the PODP project JSON file for the `podp` mode

The antiSMASH data directory must exist and contain at least one sub-directory. The names of the sub-directories must not contain any space. Each sub-directory must contain at least one BGC file (with the suffix `.region???.gbk` where `???` is the region number).
Parameters:
- `antismash_dir` (`str | PathLike`) – Path to the antiSMASH data directory.
Raises:
- `FileNotFoundError` – If the antiSMASH data directory is not found, or no sub-directories are found in the antiSMASH data directory, or no BGC files are found in any sub-directory.
- `ValueError` – If any sub-directory name contains a space.

Source code in `src/nplinker/arranger.py`:
```python
def validate_antismash(antismash_dir: str | PathLike) -> None:
    """Validate the antiSMASH data directory and its contents.

    The validation only checks the structure of the antiSMASH data directory and file names.
    It does not check

    - the content of the BGC files
    - the consistency between the antiSMASH data and the PODP project JSON file for the `podp` mode

    The antiSMASH data directory must exist and contain at least one sub-directory. The name of the
    sub-directories must not contain any space. Each sub-directory must contain at least one BGC
    file (with the suffix `.region???.gbk` where `???` is the region number).

    Args:
        antismash_dir: Path to the antiSMASH data directory.

    Raises:
        FileNotFoundError: If the antiSMASH data directory is not found, or no sub-directories
            are found in the antiSMASH data directory, or no BGC files are found in any
            sub-directory.
        ValueError: If any sub-directory name contains a space.
    """
    antismash_dir = Path(antismash_dir)
    if not antismash_dir.exists():
        raise FileNotFoundError(f"antiSMASH data directory not found at {antismash_dir}")

    sub_dirs = list_dirs(antismash_dir)
    if not sub_dirs:
        # fixed: added the missing f-prefix so the path is interpolated
        raise FileNotFoundError(
            f"No BGC directories found in antiSMASH data directory {antismash_dir}"
        )

    for sub_dir in sub_dirs:
        dir_name = Path(sub_dir).name
        if " " in dir_name:
            raise ValueError(
                f"antiSMASH sub-directory name {dir_name} contains space, which is not allowed"
            )

        gbk_files = list_files(sub_dir, suffix=".gbk", keep_parent=False)
        bgc_files = fnmatch.filter(gbk_files, "*.region???.gbk")
        if not bgc_files:
            raise FileNotFoundError(f"No BGC files found in antiSMASH sub-directory {sub_dir}")
```
"},{"location":"api/arranger/#nplinker.arranger.validate_bigscape","title":"validate_bigscape","text":"validate_bigscape(\n bigscape_dir: str | PathLike, cutoff: str\n) -> None\n
Validate the BiG-SCAPE data directory and its contents.
The BiG-SCAPE data directory must exist and contain the clustering file `mix_clustering_c{self.config.bigscape.cutoff}.tsv`, where `{self.config.bigscape.cutoff}` is the BiG-SCAPE cutoff value set in the config file.
Alternatively, the directory can contain the BiG-SCAPE database file generated by BiG-SCAPE v2. At the moment, all the family assignments in the database will be used, so this database should contain results from a single run with the desired cutoff.
Parameters:
- `bigscape_dir` (`str | PathLike`) – Path to the BiG-SCAPE data directory.
- `cutoff` (`str`) – The BiG-SCAPE cutoff value.
Raises:
- `FileNotFoundError` – If the BiG-SCAPE data directory or the clustering file is not found.

Source code in `src/nplinker/arranger.py`:
def validate_bigscape(bigscape_dir: str | PathLike, cutoff: str) -> None:\n \"\"\"Validate the BiG-SCAPE data directory and its contents.\n\n The BiG-SCAPE data directory must exist and contain the clustering file\n `mix_clustering_c{self.config.bigscape.cutoff}.tsv` where `{self.config.bigscape.cutoff}` is the\n bigscape cutoff value set in the config file.\n\n Alternatively, the directory can contain the BiG-SCAPE database file generated by BiG-SCAPE v2.\n At the moment, all the family assignments in the database will be used, so this database should\n contain results from a single run with the desired cutoff.\n\n Args:\n bigscape_dir: Path to the BiG-SCAPE data directory.\n cutoff: The BiG-SCAPE cutoff value.\n\n Raises:\n FileNotFoundError: If the BiG-SCAPE data directory or the clustering file is not found.\n \"\"\"\n bigscape_dir = Path(bigscape_dir)\n if not bigscape_dir.exists():\n raise FileNotFoundError(f\"BiG-SCAPE data directory not found at {bigscape_dir}\")\n\n clustering_file = bigscape_dir / f\"mix_clustering_c{cutoff}.tsv\"\n database_file = bigscape_dir / \"data_sqlite.db\"\n if not clustering_file.exists() and not database_file.exists():\n raise FileNotFoundError(f\"BiG-SCAPE data not found in {clustering_file} or {database_file}\")\n
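A usage sketch tying the three validators together (the paths and cutoff are examples; each validator raises on problems instead of returning a status):

```python
from pathlib import Path

from nplinker.arranger import validate_antismash, validate_bigscape, validate_gnps

# Hypothetical working directory laid out as in the quickstart
root = Path("nplinker_quickstart")

try:
    validate_gnps(root / "gnps")
    validate_antismash(root / "antismash")
    validate_bigscape(root / "bigscape", cutoff="0.30")
except (FileNotFoundError, ValueError) as e:
    print(f"Dataset not ready: {e}")
```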
"},{"location":"api/bigscape/","title":"BigScape","text":""},{"location":"api/bigscape/#nplinker.genomics.bigscape","title":"nplinker.genomics.bigscape","text":""},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeGCFLoader","title":"BigscapeGCFLoader","text":"BigscapeGCFLoader(cluster_file: str | PathLike)\n
Bases: GCFLoaderBase
Data loader for BiG-SCAPE GCF cluster file.
Attributes:
- `cluster_file` (`str`) – path to the BiG-SCAPE cluster file.
Parameters:
- `cluster_file` (`str | PathLike`) – Path to the BiG-SCAPE cluster file; the filename has a pattern of `<class>_clustering_c0.xx.tsv`.

Source code in `src/nplinker/genomics/bigscape/bigscape_loader.py`:
def __init__(self, cluster_file: str | PathLike, /) -> None:\n \"\"\"Initialize the BiG-SCAPE GCF loader.\n\n Args:\n cluster_file: Path to the BiG-SCAPE cluster file,\n the filename has a pattern of `<class>_clustering_c0.xx.tsv`.\n \"\"\"\n self.cluster_file: str = str(cluster_file)\n self._gcf_list = self._parse_gcf(self.cluster_file)\n
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeGCFLoader.cluster_file","title":"cluster_file instance-attribute
","text":"cluster_file: str = str(cluster_file)\n
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeGCFLoader.get_gcfs","title":"get_gcfs","text":"get_gcfs(\n keep_mibig_only: bool = False,\n keep_singleton: bool = False,\n) -> list[GCF]\n
Get all GCF objects.
Parameters:
- `keep_mibig_only` (`bool`, default: `False`) – True to keep GCFs that contain only MIBiG BGCs.
- `keep_singleton` (`bool`, default: `False`) – True to keep singleton GCFs. A singleton GCF is a GCF that contains only one BGC.
Returns:
`list[GCF]` – A list of GCF objects.

Source code in `src/nplinker/genomics/bigscape/bigscape_loader.py`:
def get_gcfs(self, keep_mibig_only: bool = False, keep_singleton: bool = False) -> list[GCF]:\n \"\"\"Get all GCF objects.\n\n Args:\n keep_mibig_only: True to keep GCFs that contain only MIBiG\n BGCs.\n keep_singleton: True to keep singleton GCFs. A singleton GCF\n is a GCF that contains only one BGC.\n\n Returns:\n A list of GCF objects.\n \"\"\"\n gcf_list = self._gcf_list\n if not keep_mibig_only:\n gcf_list = [gcf for gcf in gcf_list if not gcf.has_mibig_only()]\n if not keep_singleton:\n gcf_list = [gcf for gcf in gcf_list if not gcf.is_singleton()]\n return gcf_list\n
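A short usage sketch (the file path is an example following the naming pattern described above):

```python
from nplinker.genomics.bigscape import BigscapeGCFLoader

loader = BigscapeGCFLoader("bigscape/mix_clustering_c0.30.tsv")

# By default, MIBiG-only and singleton GCFs are filtered out
gcfs = loader.get_gcfs()

# Keep everything instead
all_gcfs = loader.get_gcfs(keep_mibig_only=True, keep_singleton=True)
```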
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeV2GCFLoader","title":"BigscapeV2GCFLoader","text":"BigscapeV2GCFLoader(db_file: str | PathLike)\n
Bases: GCFLoaderBase
Data loader for BiG-SCAPE v2 database file.
Attributes:
- `db_file` – Path to the BiG-SCAPE database file.
Parameters:
- `db_file` (`str | PathLike`) – Path to the BiG-SCAPE v2 database file.

Source code in `src/nplinker/genomics/bigscape/bigscape_loader.py`:
def __init__(self, db_file: str | PathLike, /) -> None:\n \"\"\"Initialize the BiG-SCAPE v2 GCF loader.\n\n Args:\n db_file: Path to the BiG-SCAPE v2 database file\n \"\"\"\n self.db_file = str(db_file)\n self._gcf_list = self._parse_gcf(self.db_file)\n
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeV2GCFLoader.db_file","title":"db_file instance-attribute
","text":"db_file = str(db_file)\n
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeV2GCFLoader.get_gcfs","title":"get_gcfs","text":"get_gcfs(\n keep_mibig_only: bool = False,\n keep_singleton: bool = False,\n) -> list[GCF]\n
Get all GCF objects.
Parameters:

- keep_mibig_only (bool, default: False) – True to keep GCFs that contain only MIBiG BGCs.
- keep_singleton (bool, default: False) – True to keep singleton GCFs. A singleton GCF is a GCF that contains only one BGC.

Returns:

- list[GCF] – A list of GCF objects.

Source code in src/nplinker/genomics/bigscape/bigscape_loader.py

```python
def get_gcfs(self, keep_mibig_only: bool = False, keep_singleton: bool = False) -> list[GCF]:
    """Get all GCF objects.

    Args:
        keep_mibig_only: True to keep GCFs that contain only MIBiG BGCs.
        keep_singleton: True to keep singleton GCFs.
            A singleton GCF is a GCF that contains only one BGC.

    Returns:
        A list of GCF objects.
    """
    gcf_list = self._gcf_list
    if not keep_mibig_only:
        gcf_list = [gcf for gcf in gcf_list if not gcf.has_mibig_only()]
    if not keep_singleton:
        gcf_list = [gcf for gcf in gcf_list if not gcf.is_singleton()]
    return gcf_list
```
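Usage mirrors the v1 loader; a sketch assuming a `data_sqlite.db` file from a single BiG-SCAPE v2 run (the path is hypothetical):

```python
from nplinker.genomics.bigscape import BigscapeV2GCFLoader

# Hypothetical path to the SQLite database written by a BiG-SCAPE v2 run.
loader = BigscapeV2GCFLoader("./bigscape/data_sqlite.db")
gcfs = loader.get_gcfs(keep_singleton=True)  # keep singleton GCFs this time
```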
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.run_bigscape","title":"run_bigscape","text":"run_bigscape(\n antismash_path: str | PathLike,\n output_path: str | PathLike,\n extra_params: str,\n version: Literal[1, 2] = 1,\n) -> bool\n
Runs BiG-SCAPE to cluster BGCs.
The behavior of this function is slightly different depending on the version of BiG-SCAPE that is set to run using the configuration file. Mostly this means a different set of parameters is used between the two versions.
The AntiSMASH output directory should be a directory that contains GBK files. The directory can contain subdirectories, in which case BiG-SCAPE will search recursively for GBK files. E.g.:
example_folder\n \u251c\u2500\u2500 organism_1\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region001.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region002.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region003.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.final.gbk <- skipped!\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 organism_2\n \u2502\u00a0 \u251c\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n
By default, only GBK Files with \"cluster\" or \"region\" in the filename are accepted. GBK Files with \"final\" in the filename are excluded.
Parameters:

- antismash_path (str | PathLike) – Path to the antismash output directory.
- output_path (str | PathLike) – Path to the output directory where BiG-SCAPE will write its results.
- extra_params (str) – Additional parameters to pass to BiG-SCAPE.
- version (Literal[1, 2], default: 1) – The version of BiG-SCAPE to run. Must be 1 or 2.

Returns:

- bool – True if BiG-SCAPE ran successfully, False otherwise.

Raises:

- ValueError – If an unexpected BiG-SCAPE version number is specified.
- FileNotFoundError – If the antismash_path does not exist or if the BiG-SCAPE python script could not be found.
- RuntimeError – If BiG-SCAPE fails to run.

Examples:

```python
>>> from nplinker.genomics.bigscape import run_bigscape
>>> run_bigscape(antismash_path="./antismash", output_path="./output",
... extra_params="--help", version=1)
```
Source code in src/nplinker/genomics/bigscape/runbigscape.py
````python
def run_bigscape(
    antismash_path: str | PathLike,
    output_path: str | PathLike,
    extra_params: str,
    version: Literal[1, 2] = 1,
) -> bool:
    """Runs BiG-SCAPE to cluster BGCs.

    The behavior of this function is slightly different depending on the version of
    BiG-SCAPE that is set to run using the configuration file.
    Mostly this means a different set of parameters is used between the two versions.

    The AntiSMASH output directory should be a directory that contains GBK files.
    The directory can contain subdirectories, in which case BiG-SCAPE will search
    recursively for GBK files. E.g.:

    ```
    example_folder
    ├── organism_1
    │   ├── organism_1.region001.gbk
    │   ├── organism_1.region002.gbk
    │   ├── organism_1.region003.gbk
    │   ├── organism_1.final.gbk <- skipped!
    │   └── ...
    ├── organism_2
    │   ├── ...
    └── ...
    ```

    By default, only GBK files with "cluster" or "region" in the filename are
    accepted. GBK files with "final" in the filename are excluded.

    Args:
        antismash_path: Path to the antismash output directory.
        output_path: Path to the output directory where BiG-SCAPE will write its results.
        extra_params: Additional parameters to pass to BiG-SCAPE.
        version: The version of BiG-SCAPE to run. Must be 1 or 2.

    Returns:
        True if BiG-SCAPE ran successfully, False otherwise.

    Raises:
        ValueError: If an unexpected BiG-SCAPE version number is specified.
        FileNotFoundError: If the antismash_path does not exist or if the BiG-SCAPE python
            script could not be found.
        RuntimeError: If BiG-SCAPE fails to run.

    Examples:
        >>> from nplinker.genomics.bigscape import run_bigscape
        >>> run_bigscape(antismash_path="./antismash", output_path="./output",
        ... extra_params="--help", version=1)
    """
    # switch to correct version of BiG-SCAPE
    if version == 1:
        bigscape_py_path = "bigscape.py"
    elif version == 2:
        bigscape_py_path = "bigscape-v2.py"
    else:
        raise ValueError("Invalid BiG-SCAPE version number. Expected: 1 or 2.")

    try:
        subprocess.run([bigscape_py_path, "-h"], capture_output=True, check=True)
    except Exception as e:
        raise FileNotFoundError(
            f"Failed to find/run BiG-SCAPE executable program (path={bigscape_py_path}, err={e})"
        ) from e

    if not os.path.exists(antismash_path):
        raise FileNotFoundError(f'antismash_path "{antismash_path}" does not exist!')

    logger.info(f"Running BiG-SCAPE version {version}")
    logger.info(
        f'run_bigscape: input="{antismash_path}", output="{output_path}", '
        f"extra_params={extra_params}"
    )

    # assemble arguments. first argument is the python file
    args = [bigscape_py_path]

    # version 2 points to specific Pfam file, version 1 points to directory
    # version 2 also requires the cluster subcommand
    if version == 1:
        args.extend(["--pfam_dir", PFAM_PATH])
    elif version == 2:
        args.extend(["cluster", "--pfam_path", os.path.join(PFAM_PATH, "Pfam-A.hmm")])

    # add input and output paths; these are unchanged between the two versions
    args.extend(["-i", str(antismash_path), "-o", str(output_path)])

    # append the user supplied params, if any
    if len(extra_params) > 0:
        args.extend(extra_params.split(" "))

    logger.info(f"BiG-SCAPE command: {args}")
    result = subprocess.run(args, stdout=sys.stdout, stderr=sys.stderr)

    # return True on a zero (success) return code
    if result.returncode == 0:
        logger.info(f"BiG-SCAPE completed with return code {result.returncode}")
        return True

    # otherwise log details and raise a runtime error
    logger.error(f"BiG-SCAPE failed with return code {result.returncode}")
    logger.error(f"output: {str(result.stdout)}")
    logger.error(f"stderr: {str(result.stderr)}")

    raise RuntimeError(f"Failed to run BiG-SCAPE with error code {result.returncode}")
````
"},{"location":"api/genomics/","title":"Data Models","text":""},{"location":"api/genomics/#nplinker.genomics","title":"nplinker.genomics","text":""},{"location":"api/genomics/#nplinker.genomics.BGC","title":"BGC","text":"BGC(id: str, /, *product_prediction: str)\n
Class to model BGC (biosynthetic gene cluster) data.
BGC data include both annotations and sequence data. This class is mainly designed to model the annotations or metadata.
The raw BGC data is stored in GenBank format (.gbk). Additional GenBank features could be added to the GenBank file to annotate BGCs, e.g. antiSMASH has some self-defined features (like region
) in its output GenBank files.
The annotations of BGC can be stored in JSON format, which is defined and used by MIBiG.
Attributes:

- id – BGC identifier, e.g. MIBiG accession, GenBank accession.
- product_prediction – A tuple of (predicted) natural products or product classes of the BGC. For antiSMASH's GenBank data, the `/product` qualifier of the feature `region` gives product information. For MIBiG metadata, its biosynthetic class provides such info.
- mibig_bgc_class (tuple[str] | None) – A tuple of MIBiG biosynthetic classes to which the BGC belongs. Defaults to None, which means the class is unknown. MIBiG defines 6 major biosynthetic classes for natural products, including `NRP`, `Polyketide`, `RiPP`, `Terpene`, `Saccharide` and `Alkaloid`. Note that natural products created by other biosynthetic mechanisms fall under the category `Other`. For more details see the paper.
- description (str | None) – Brief description of the BGC. Defaults to None.
- smiles (tuple[str] | None) – A tuple of SMILES formulas of the BGC's products. Defaults to None.
- antismash_file (str | None) – The path to the antiSMASH GenBank file. Defaults to None.
- antismash_id (str | None) – Identifier of the antiSMASH BGC, referring to the feature `VERSION` of the GenBank file. Defaults to None.
- antismash_region (int | None) – AntiSMASH BGC region number, referring to the feature `region` of the GenBank file. Defaults to None.
- parents (set[GCF]) – The set of GCFs that contain the BGC.
- strain (Strain | None) – The strain of the BGC.

Parameters:

- id (str) – BGC identifier, e.g. MIBiG accession, GenBank accession.
- product_prediction (str, default: ()) – BGC's (predicted) natural products or product classes.

Examples:

```python
>>> bgc = BGC("Unique_BGC_ID", "Polyketide", "NRP")
>>> bgc.id
'Unique_BGC_ID'
>>> bgc.product_prediction
('Polyketide', 'NRP')
>>> bgc.is_mibig()
False
```
Source code in src/nplinker/genomics/bgc.py
```python
def __init__(self, id: str, /, *product_prediction: str):
    """Initialize the BGC object.

    Args:
        id: BGC identifier, e.g. MIBiG accession, GenBank accession.
        product_prediction: BGC's (predicted) natural products or product classes.

    Examples:
        >>> bgc = BGC("Unique_BGC_ID", "Polyketide", "NRP")
        >>> bgc.id
        'Unique_BGC_ID'
        >>> bgc.product_prediction
        ('Polyketide', 'NRP')
        >>> bgc.is_mibig()
        False
    """
    # BGC metadata
    self.id = id
    self.product_prediction = product_prediction

    self.mibig_bgc_class: tuple[str] | None = None
    self.description: str | None = None
    self.smiles: tuple[str] | None = None

    # antismash related attributes
    self.antismash_file: str | None = None
    self.antismash_id: str | None = None  # version in .gbk, id in SeqRecord
    self.antismash_region: int | None = None  # antismash region number

    # other attributes
    self.parents: set[GCF] = set()
    self._strain: Strain | None = None
```
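A short sketch of enriching a BGC with the optional metadata attributes after construction; the accession and values are hypothetical:

```python
from nplinker.genomics import BGC

# Hypothetical BGC; all attribute values below are illustrative only.
bgc = BGC("BGC0000001", "Polyketide")
bgc.description = "Example polyketide cluster"
bgc.smiles = ("CC(=O)OC1=CC=CC=C1C(=O)O",)  # a tuple, one SMILES string per product
bgc.antismash_region = 1
print(bgc.is_mibig())  # True: the id follows the "BGC..." naming pattern
```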
"},{"location":"api/genomics/#nplinker.genomics.BGC.id","title":"id instance-attribute
","text":"id = id\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.product_prediction","title":"product_prediction instance-attribute
","text":"product_prediction = product_prediction\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.mibig_bgc_class","title":"mibig_bgc_class instance-attribute
","text":"mibig_bgc_class: tuple[str] | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.description","title":"description instance-attribute
","text":"description: str | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.smiles","title":"smiles instance-attribute
","text":"smiles: tuple[str] | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.antismash_file","title":"antismash_file instance-attribute
","text":"antismash_file: str | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.antismash_id","title":"antismash_id instance-attribute
","text":"antismash_id: str | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.antismash_region","title":"antismash_region instance-attribute
","text":"antismash_region: int | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.parents","title":"parents instance-attribute
","text":"parents: set[GCF] = set()\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.strain","title":"strain property
writable
","text":"strain: Strain | None\n
Get the strain of the BGC.
"},{"location":"api/genomics/#nplinker.genomics.BGC.bigscape_classes","title":"bigscape_classesproperty
","text":"bigscape_classes: set[str | None]\n
Get BiG-SCAPE's BGC classes.
BiG-SCAPE's BGC classes are similar to those defined in MiBIG but have more categories (7 classes), including:
For BGC falls outside of these categories, the value is \"Others\".
Default is None, which means the class is unknown.
More details see: https://doi.org/10.1038%2Fs41589-019-0400-9.
"},{"location":"api/genomics/#nplinker.genomics.BGC.aa_predictions","title":"aa_predictionsproperty
","text":"aa_predictions: list\n
Amino acids as predicted monomers of product.
Returns:
list
\u2013 list of dicts with key as amino acid and value as prediction
list
\u2013 probability.
__repr__

```python
__repr__()
```

Source code in src/nplinker/genomics/bgc.py

```python
def __repr__(self):
    return str(self)
```
"},{"location":"api/genomics/#nplinker.genomics.BGC.__str__","title":"__str__","text":"__str__()\n
Source code in src/nplinker/genomics/bgc.py
def __str__(self):\n return \"{}(id={}, strain={}, asid={}, region={})\".format(\n self.__class__.__name__,\n self.id,\n self.strain,\n self.antismash_id,\n self.antismash_region,\n )\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/genomics/bgc.py
def __eq__(self, other) -> bool:\n if isinstance(other, BGC):\n return self.id == other.id and self.product_prediction == other.product_prediction\n return NotImplemented\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.__hash__","title":"__hash__","text":"__hash__() -> int\n
Source code in src/nplinker/genomics/bgc.py
def __hash__(self) -> int:\n return hash((self.id, self.product_prediction))\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.__reduce__","title":"__reduce__","text":"__reduce__() -> tuple\n
Reduce function for pickling.
Source code insrc/nplinker/genomics/bgc.py
def __reduce__(self) -> tuple:\n \"\"\"Reduce function for pickling.\"\"\"\n return (self.__class__, (self.id, *self.product_prediction), self.__dict__)\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.add_parent","title":"add_parent","text":"add_parent(gcf: GCF) -> None\n
Add a parent GCF to the BGC.
Parameters:
gcf
(GCF
) \u2013 gene cluster family
src/nplinker/genomics/bgc.py
def add_parent(self, gcf: GCF) -> None:\n \"\"\"Add a parent GCF to the BGC.\n\n Args:\n gcf: gene cluster family\n \"\"\"\n gcf.add_bgc(self)\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.detach_parent","title":"detach_parent","text":"detach_parent(gcf: GCF) -> None\n
Remove a parent GCF.
Source code insrc/nplinker/genomics/bgc.py
def detach_parent(self, gcf: GCF) -> None:\n \"\"\"Remove a parent GCF.\"\"\"\n gcf.detach_bgc(self)\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.is_mibig","title":"is_mibig","text":"is_mibig() -> bool\n
Check if the BGC is a MIBiG reference BGC or not.
WarningThis method evaluates MIBiG BGC based on the pattern that MIBiG BGC names start with \"BGC\". It might give false positive result.
Returns:
bool
\u2013 True if it's MIBiG reference BGC
src/nplinker/genomics/bgc.py
def is_mibig(self) -> bool:\n \"\"\"Check if the BGC is a MIBiG reference BGC or not.\n\n Warning:\n This method evaluates MIBiG BGC based on the pattern that MIBiG\n BGC names start with \"BGC\". It might give false positive result.\n\n Returns:\n True if it's MIBiG reference BGC\n \"\"\"\n return self.id.startswith(\"BGC\")\n
"},{"location":"api/genomics/#nplinker.genomics.GCF","title":"GCF","text":"GCF(id: str)\n
Class to model gene cluster family (GCF).
GCF is a group of similar BGCs and generated by clustering BGCs with tools such as BiG-SCAPE and BiG-SLICE.
Attributes:

- id – id of the GCF object.
- bgc_ids (set[str]) – a set of BGC ids that belong to the GCF.
- bigscape_class (str | None) – BiG-SCAPE's BGC class. BiG-SCAPE's BGC classes are similar to those defined in MIBiG but have more categories (7 classes). For a BGC that falls outside of these categories, the value is "Others". Default is None, which means the class is unknown. For more details see: https://doi.org/10.1038%2Fs41589-019-0400-9.

Parameters:

- id (str) – id of the GCF object.

Examples:

```python
>>> gcf = GCF("Unique_GCF_ID")
>>> gcf.id
'Unique_GCF_ID'
```
Source code in src/nplinker/genomics/gcf.py
```python
def __init__(self, id: str, /) -> None:
    """Initialize the GCF object.

    Args:
        id: id of the GCF object.

    Examples:
        >>> gcf = GCF("Unique_GCF_ID")
        >>> gcf.id
        'Unique_GCF_ID'
    """
    self.id = id
    self.bgc_ids: set[str] = set()
    self.bigscape_class: str | None = None
    self._bgcs: set[BGC] = set()
    self._strains: StrainCollection = StrainCollection()
```
"},{"location":"api/genomics/#nplinker.genomics.GCF.id","title":"id instance-attribute
","text":"id = id\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.bgc_ids","title":"bgc_ids instance-attribute
","text":"bgc_ids: set[str] = set()\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.bigscape_class","title":"bigscape_class instance-attribute
","text":"bigscape_class: str | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.bgcs","title":"bgcs property
","text":"bgcs: set[BGC]\n
Get the BGC objects.
"},{"location":"api/genomics/#nplinker.genomics.GCF.strains","title":"strainsproperty
","text":"strains: StrainCollection\n
Get the strains in the GCF.
"},{"location":"api/genomics/#nplinker.genomics.GCF.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/genomics/gcf.py
def __str__(self) -> str:\n return (\n f\"GCF(id={self.id}, #BGC_objects={len(self.bgcs)}, #bgc_ids={len(self.bgc_ids)},\"\n f\"#strains={len(self._strains)}).\"\n )\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/genomics/gcf.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/genomics/gcf.py
def __eq__(self, other) -> bool:\n if isinstance(other, GCF):\n return self.id == other.id and self.bgcs == other.bgcs\n return NotImplemented\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.__hash__","title":"__hash__","text":"__hash__() -> int\n
Hash function for GCF.
Note that GCF class is a mutable container. We only hash the GCF id to avoid the hash value changes when self._bgcs
is updated.
src/nplinker/genomics/gcf.py
def __hash__(self) -> int:\n \"\"\"Hash function for GCF.\n\n Note that GCF class is a mutable container. We only hash the GCF id to\n avoid the hash value changes when `self._bgcs` is updated.\n \"\"\"\n return hash(self.id)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.__reduce__","title":"__reduce__","text":"__reduce__() -> tuple\n
Reduce function for pickling.
Source code insrc/nplinker/genomics/gcf.py
def __reduce__(self) -> tuple:\n \"\"\"Reduce function for pickling.\"\"\"\n return (self.__class__, (self.id,), self.__dict__)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.add_bgc","title":"add_bgc","text":"add_bgc(bgc: BGC) -> None\n
Add a BGC object to the GCF.
Source code insrc/nplinker/genomics/gcf.py
def add_bgc(self, bgc: BGC) -> None:\n \"\"\"Add a BGC object to the GCF.\"\"\"\n bgc.parents.add(self)\n self._bgcs.add(bgc)\n self.bgc_ids.add(bgc.id)\n if bgc.strain is not None:\n self._strains.add(bgc.strain)\n else:\n logger.warning(\"No strain specified for the BGC %s\", bgc.id)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.detach_bgc","title":"detach_bgc","text":"detach_bgc(bgc: BGC) -> None\n
Remove a child BGC object.
Source code insrc/nplinker/genomics/gcf.py
def detach_bgc(self, bgc: BGC) -> None:\n \"\"\"Remove a child BGC object.\"\"\"\n bgc.parents.remove(self)\n self._bgcs.remove(bgc)\n self.bgc_ids.remove(bgc.id)\n if bgc.strain is not None:\n for other_bgc in self._bgcs:\n if other_bgc.strain == bgc.strain:\n return\n self._strains.remove(bgc.strain)\n
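A small sketch of linking a BGC to a GCF and inspecting the back-references; the ids are hypothetical, and since no strain is attached, `add_bgc` logs a "No strain specified" warning:

```python
from nplinker.genomics import BGC, GCF

gcf = GCF("GCF_0001")
bgc = BGC("GENOME_1.region001", "NRP")

gcf.add_bgc(bgc)           # also registers gcf in bgc.parents
assert gcf in bgc.parents
assert bgc.id in gcf.bgc_ids

gcf.detach_bgc(bgc)        # removes the BGC and its back-reference
assert gcf not in bgc.parents
```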
"},{"location":"api/genomics/#nplinker.genomics.GCF.has_strain","title":"has_strain","text":"has_strain(strain: Strain) -> bool\n
Check if the given strain exists.
Parameters:
strain
(Strain
) \u2013 Strain
object.
Returns:
bool
\u2013 True when the given strain exist.
src/nplinker/genomics/gcf.py
def has_strain(self, strain: Strain) -> bool:\n \"\"\"Check if the given strain exists.\n\n Args:\n strain: `Strain` object.\n\n Returns:\n True when the given strain exist.\n \"\"\"\n return strain in self._strains\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.has_mibig_only","title":"has_mibig_only","text":"has_mibig_only() -> bool\n
Check if the GCF's children are only MIBiG BGCs.
Returns:
bool
\u2013 True if GCF.bgc_ids
are only MIBiG BGC ids.
src/nplinker/genomics/gcf.py
def has_mibig_only(self) -> bool:\n \"\"\"Check if the GCF's children are only MIBiG BGCs.\n\n Returns:\n True if `GCF.bgc_ids` are only MIBiG BGC ids.\n \"\"\"\n return all(map(lambda id: id.startswith(\"BGC\"), self.bgc_ids))\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.is_singleton","title":"is_singleton","text":"is_singleton() -> bool\n
Check if the GCF contains only one BGC.
Returns:
bool
\u2013 True if GCF.bgc_ids
contains only one BGC id.
src/nplinker/genomics/gcf.py
def is_singleton(self) -> bool:\n \"\"\"Check if the GCF contains only one BGC.\n\n Returns:\n True if `GCF.bgc_ids` contains only one BGC id.\n \"\"\"\n return len(self.bgc_ids) == 1\n
"},{"location":"api/genomics_abc/","title":"Abstract Base Classes","text":""},{"location":"api/genomics_abc/#nplinker.genomics.abc","title":"nplinker.genomics.abc","text":""},{"location":"api/genomics_abc/#nplinker.genomics.abc.BGCLoaderBase","title":"BGCLoaderBase","text":"BGCLoaderBase(data_dir: str | PathLike)\n
Bases: ABC
Abstract base class for BGC loader.
Parameters:
data_dir
(str | PathLike
) \u2013 Path to directory that contains BGC metadata files (.json) or full data genbank files (.gbk).
src/nplinker/genomics/abc.py
def __init__(self, data_dir: str | PathLike) -> None:\n \"\"\"Initialize the BGC loader.\n\n Args:\n data_dir: Path to directory that contains BGC metadata files\n (.json) or full data genbank files (.gbk).\n \"\"\"\n self.data_dir = str(data_dir)\n
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.BGCLoaderBase.data_dir","title":"data_dir instance-attribute
","text":"data_dir = str(data_dir)\n
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.BGCLoaderBase.get_files","title":"get_files abstractmethod
","text":"get_files() -> dict[str, str]\n
Get path to BGC files.
Returns:
dict[str, str]
\u2013 The key is BGC name and value is path to BGC file
src/nplinker/genomics/abc.py
@abstractmethod\ndef get_files(self) -> dict[str, str]:\n \"\"\"Get path to BGC files.\n\n Returns:\n The key is BGC name and value is path to BGC file\n \"\"\"\n
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.BGCLoaderBase.get_bgcs","title":"get_bgcs abstractmethod
","text":"get_bgcs() -> list[BGC]\n
Get BGC objects.
Returns:
list[BGC]
\u2013 A list of BGC objects
src/nplinker/genomics/abc.py
@abstractmethod\ndef get_bgcs(self) -> list[BGC]:\n \"\"\"Get BGC objects.\n\n Returns:\n A list of BGC objects\n \"\"\"\n
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.GCFLoaderBase","title":"GCFLoaderBase","text":" Bases: ABC
Abstract base class for GCF loader.
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.GCFLoaderBase.get_gcfs","title":"get_gcfsabstractmethod
","text":"get_gcfs(\n keep_mibig_only: bool, keep_singleton: bool\n) -> list[GCF]\n
Get GCF objects.
Parameters:
keep_mibig_only
(bool
) \u2013 True to keep GCFs that contain only MIBiG BGCs.
keep_singleton
(bool
) \u2013 True to keep singleton GCFs. A singleton GCF is a GCF that contains only one BGC.
Returns:
list[GCF]
\u2013 A list of GCF objects
src/nplinker/genomics/abc.py
@abstractmethod\ndef get_gcfs(self, keep_mibig_only: bool, keep_singleton: bool) -> list[GCF]:\n \"\"\"Get GCF objects.\n\n Args:\n keep_mibig_only: True to keep GCFs that contain only MIBiG\n BGCs.\n keep_singleton: True to keep singleton GCFs. A singleton GCF\n is a GCF that contains only one BGC.\n\n Returns:\n A list of GCF objects\n \"\"\"\n
"},{"location":"api/genomics_utils/","title":"Utilities","text":""},{"location":"api/genomics_utils/#nplinker.genomics.utils","title":"nplinker.genomics.utils","text":""},{"location":"api/genomics_utils/#nplinker.genomics.utils.generate_mappings_genome_id_bgc_id","title":"generate_mappings_genome_id_bgc_id","text":"generate_mappings_genome_id_bgc_id(\n bgc_dir: str | PathLike,\n output_file: str | PathLike | None = None,\n) -> None\n
Generate a file that maps genome id to BGC id.
The input bgc_dir
must follow the structure of the antismash
directory defined in Working Directory Structure, e.g.:
bgc_dir\n \u251c\u2500\u2500 genome_id_1\n \u2502\u00a0 \u251c\u2500\u2500 bgc_id_1.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 genome_id_2\n \u2502\u00a0 \u251c\u2500\u2500 bgc_id_2.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n
Parameters:

- bgc_dir (str | PathLike) – The directory has one layer of subfolders and each subfolder contains BGC files in `.gbk` format. It assumes that the subfolder name is the genome id (e.g. refseq) and the BGC file name is the BGC id.
- output_file (str | PathLike | None, default: None) – The path to the output file. The file will be overwritten if it already exists. Defaults to None, in which case the output file will be placed in the directory `bgc_dir` with the file name GENOME_BGC_MAPPINGS_FILENAME.

Source code in src/nplinker/genomics/utils.py

````python
def generate_mappings_genome_id_bgc_id(
    bgc_dir: str | PathLike, output_file: str | PathLike | None = None
) -> None:
    """Generate a file that maps genome id to BGC id.

    The input `bgc_dir` must follow the structure of the `antismash` directory defined in
    [Working Directory Structure][working-directory-structure], e.g.:
    ```shell
    bgc_dir
    ├── genome_id_1
    │   ├── bgc_id_1.gbk
    │   └── ...
    ├── genome_id_2
    │   ├── bgc_id_2.gbk
    │   └── ...
    └── ...
    ```

    Args:
        bgc_dir: The directory has one-layer of subfolders and each subfolder contains BGC files
            in `.gbk` format.

            It assumes that

            - the subfolder name is the genome id (e.g. refseq),
            - the BGC file name is the BGC id.
        output_file: The path to the output file.
            The file will be overwritten if it already exists.

            Defaults to None, in which case the output file will be placed in
            the directory `bgc_dir` with the file name
            [GENOME_BGC_MAPPINGS_FILENAME][nplinker.defaults.GENOME_BGC_MAPPINGS_FILENAME].
    """
    bgc_dir = Path(bgc_dir)
    genome_bgc_mappings = {}

    for subdir in list_dirs(bgc_dir):
        genome_id = Path(subdir).name
        bgc_files = list_files(subdir, suffix=(".gbk"), keep_parent=False)
        bgc_ids = [bgc_id for f in bgc_files if (bgc_id := Path(f).stem) != genome_id]
        if bgc_ids:
            genome_bgc_mappings[genome_id] = bgc_ids
        else:
            logger.warning("No BGC files found in %s", subdir)

    # sort mappings by genome_id and construct json data
    genome_bgc_mappings = dict(sorted(genome_bgc_mappings.items()))
    json_data_mappings = [{"genome_ID": k, "BGC_ID": v} for k, v in genome_bgc_mappings.items()]
    json_data = {"mappings": json_data_mappings, "version": "1.0"}

    # validate json data
    validate(instance=json_data, schema=GENOME_BGC_MAPPINGS_SCHEMA)

    if output_file is None:
        output_file = bgc_dir / GENOME_BGC_MAPPINGS_FILENAME
    with open(output_file, "w") as f:
        json.dump(json_data, f)
    logger.info("Generated genome-BGC mappings file: %s", output_file)
````
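A usage sketch; the `./antismash` directory is hypothetical and must already follow the genome-id/BGC-id layout shown above:

```python
from nplinker.genomics.utils import generate_mappings_genome_id_bgc_id

# Writes the mappings JSON (named by GENOME_BGC_MAPPINGS_FILENAME) into ./antismash.
generate_mappings_genome_id_bgc_id("./antismash")
```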
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.add_strain_to_bgc","title":"add_strain_to_bgc","text":"add_strain_to_bgc(\n strains: StrainCollection, bgcs: Sequence[BGC]\n) -> tuple[list[BGC], list[BGC]]\n
Assign a Strain object to BGC.strain
for input BGCs.
BGC id is used to find the corresponding Strain object. It's possible that no Strain object is found for a BGC id.
Note
The input bgcs
will be changed in place.
Parameters:

- strains (StrainCollection) – A collection of all strain objects.
- bgcs (Sequence[BGC]) – A list of BGC objects.

Returns:

- tuple[list[BGC], list[BGC]] – A tuple of two lists of BGC objects: the first list contains BGC objects that are updated with a Strain object; the second list contains BGC objects that are not updated with a Strain object because no Strain object is found.

Raises:

- ValueError – Multiple strain objects found for a BGC id.

Source code in src/nplinker/genomics/utils.py

```python
def add_strain_to_bgc(
    strains: StrainCollection, bgcs: Sequence[BGC]
) -> tuple[list[BGC], list[BGC]]:
    """Assign a Strain object to `BGC.strain` for input BGCs.

    BGC id is used to find the corresponding Strain object. It's possible that
    no Strain object is found for a BGC id.

    !!! Note
        The input `bgcs` will be changed in place.

    Args:
        strains: A collection of all strain objects.
        bgcs: A list of BGC objects.

    Returns:
        A tuple of two lists of BGC objects,

        - the first list contains BGC objects that are updated with a Strain object;
        - the second list contains BGC objects that are not updated with
            a Strain object because no Strain object is found.

    Raises:
        ValueError: Multiple strain objects found for a BGC id.
    """
    bgc_with_strain = []
    bgc_without_strain = []
    for bgc in bgcs:
        try:
            strain_list = strains.lookup(bgc.id)
        except ValueError:
            bgc_without_strain.append(bgc)
            continue
        if len(strain_list) > 1:
            raise ValueError(
                f"Multiple strain objects found for BGC id '{bgc.id}'. "
                f"BGC object accepts only one strain."
            )
        bgc.strain = strain_list[0]
        bgc_with_strain.append(bgc)

    logger.info(
        f"{len(bgc_with_strain)} BGC objects updated with Strain object.\n"
        f"{len(bgc_without_strain)} BGC objects not updated with Strain object."
    )
    return bgc_with_strain, bgc_without_strain
```
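A sketch of the call; it assumes the `Strain`/`StrainCollection` API (a `Strain` constructed from a strain id, with an `add_alias` method) and the `nplinker.strain` import path, neither of which is documented on this page, and all ids are hypothetical:

```python
from nplinker.genomics import BGC
from nplinker.genomics.utils import add_strain_to_bgc
from nplinker.strain import Strain, StrainCollection  # assumed import path

strain = Strain("strain_1")
strain.add_alias("GENOME_1.region001")  # alias matching the first BGC id below
strains = StrainCollection()
strains.add(strain)

bgcs = [BGC("GENOME_1.region001", "NRP"), BGC("GENOME_2.region001", "Polyketide")]
with_strain, without_strain = add_strain_to_bgc(strains, bgcs)
# with_strain holds the first BGC; without_strain holds the unmatched second BGC
```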
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.add_bgc_to_gcf","title":"add_bgc_to_gcf","text":"add_bgc_to_gcf(\n bgcs: Sequence[BGC], gcfs: Sequence[GCF]\n) -> tuple[list[GCF], list[GCF], dict[GCF, set[str]]]\n
Add BGC objects to GCF object based on GCF's BGC ids.
The attribute of GCF.bgc_ids
contains the ids of BGC objects. These ids are used to find BGC objects from the input bgcs
list. The found BGC objects are added to the bgcs
attribute of GCF object. It is possible that some BGC ids are not found in the input bgcs
list, and so their BGC objects are missing in the GCF object.
Note
This method changes the lists bgcs
and gcfs
in place.
Parameters:

- bgcs (Sequence[BGC]) – A list of BGC objects.
- gcfs (Sequence[GCF]) – A list of GCF objects.

Returns:

- tuple[list[GCF], list[GCF], dict[GCF, set[str]]] – A tuple of two lists and a dictionary: the first list contains GCF objects that are updated with BGC objects; the second list contains GCF objects that are not updated with BGC objects because no BGC objects are found; the dictionary contains GCF objects as keys and a set of ids of missing BGC objects as values.

Source code in src/nplinker/genomics/utils.py

```python
def add_bgc_to_gcf(
    bgcs: Sequence[BGC], gcfs: Sequence[GCF]
) -> tuple[list[GCF], list[GCF], dict[GCF, set[str]]]:
    """Add BGC objects to GCF object based on GCF's BGC ids.

    The attribute of `GCF.bgc_ids` contains the ids of BGC objects. These ids
    are used to find BGC objects from the input `bgcs` list. The found BGC
    objects are added to the `bgcs` attribute of GCF object. It is possible that
    some BGC ids are not found in the input `bgcs` list, and so their BGC
    objects are missing in the GCF object.

    !!! note
        This method changes the lists `bgcs` and `gcfs` in place.

    Args:
        bgcs: A list of BGC objects.
        gcfs: A list of GCF objects.

    Returns:
        A tuple of two lists and a dictionary,

        - The first list contains GCF objects that are updated with BGC objects;
        - The second list contains GCF objects that are not updated with BGC objects
            because no BGC objects are found;
        - The dictionary contains GCF objects as keys and a set of ids of missing
            BGC objects as values.
    """
    bgc_dict = {bgc.id: bgc for bgc in bgcs}
    gcf_with_bgc = []
    gcf_without_bgc = []
    gcf_missing_bgc: dict[GCF, set[str]] = {}
    for gcf in gcfs:
        for bgc_id in gcf.bgc_ids:
            try:
                bgc = bgc_dict[bgc_id]
            except KeyError:
                if gcf not in gcf_missing_bgc:
                    gcf_missing_bgc[gcf] = {bgc_id}
                else:
                    gcf_missing_bgc[gcf].add(bgc_id)
                continue
            gcf.add_bgc(bgc)

        if gcf.bgcs:
            gcf_with_bgc.append(gcf)
        else:
            gcf_without_bgc.append(gcf)

    logger.info(
        f"{len(gcf_with_bgc)} GCF objects updated with BGC objects.\n"
        f"{len(gcf_without_bgc)} GCF objects not updated with BGC objects.\n"
        f"{len(gcf_missing_bgc)} GCF objects have missing BGC objects."
    )
    return gcf_with_bgc, gcf_without_bgc, gcf_missing_bgc
```
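A sketch with one GCF that references two BGC ids, only one of which has a BGC object; the ids are hypothetical:

```python
from nplinker.genomics import BGC, GCF
from nplinker.genomics.utils import add_bgc_to_gcf

bgcs = [BGC("GENOME_1.region001", "NRP")]
gcf = GCF("GCF_0001")
gcf.bgc_ids = {"GENOME_1.region001", "GENOME_1.region002"}  # second id has no BGC object

gcf_with_bgc, gcf_without_bgc, gcf_missing_bgc = add_bgc_to_gcf(bgcs, [gcf])
# gcf_missing_bgc == {gcf: {"GENOME_1.region002"}}
```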
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.get_mibig_from_gcf","title":"get_mibig_from_gcf","text":"get_mibig_from_gcf(\n gcfs: Sequence[GCF],\n) -> tuple[list[BGC], StrainCollection]\n
Get MIBiG BGCs and strains from GCF objects.
Parameters:
gcfs
(Sequence[GCF]
) \u2013 A list of GCF objects.
Returns:
tuple[list[BGC], StrainCollection]
\u2013 A tuple of two objects,
src/nplinker/genomics/utils.py
def get_mibig_from_gcf(gcfs: Sequence[GCF]) -> tuple[list[BGC], StrainCollection]:\n \"\"\"Get MIBiG BGCs and strains from GCF objects.\n\n Args:\n gcfs: A list of GCF objects.\n\n Returns:\n A tuple of two objects,\n\n - the first is a list of MIBiG BGC objects used in the GCFs;\n - the second is a StrainCollection object that contains all Strain objects used in the\n GCFs.\n \"\"\"\n mibig_bgcs_in_use = []\n mibig_strains_in_use = StrainCollection()\n for gcf in gcfs:\n for bgc in gcf.bgcs:\n if bgc.is_mibig():\n mibig_bgcs_in_use.append(bgc)\n if bgc.strain is not None:\n mibig_strains_in_use.add(bgc.strain)\n return mibig_bgcs_in_use, mibig_strains_in_use\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.extract_mappings_strain_id_original_genome_id","title":"extract_mappings_strain_id_original_genome_id","text":"extract_mappings_strain_id_original_genome_id(\n podp_project_json_file: str | PathLike,\n) -> dict[str, set[str]]\n
Extract mappings \"strain_id <-> original_genome_id\".
Tip
The podp_project_json_file
is the JSON file downloaded from PODP platform.
For example, for PODP project MSV000079284, its JSON file is https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.
Parameters:
podp_project_json_file
(str | PathLike
) \u2013 The path to the PODP project JSON file.
Returns:
dict[str, set[str]]
\u2013 Key is strain id and value is a set of original genome ids.
src/nplinker/genomics/utils.py
def extract_mappings_strain_id_original_genome_id(\n podp_project_json_file: str | PathLike,\n) -> dict[str, set[str]]:\n \"\"\"Extract mappings \"strain_id <-> original_genome_id\".\n\n !!! tip\n The `podp_project_json_file` is the JSON file downloaded from PODP platform.\n\n For example, for PODP project MSV000079284, its JSON file is\n https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.\n\n Args:\n podp_project_json_file: The path to the PODP project\n JSON file.\n\n Returns:\n Key is strain id and value is a set of original genome ids.\n\n See Also:\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n mappings_dict: dict[str, set[str]] = {}\n with open(podp_project_json_file, \"r\") as f:\n json_data = json.load(f)\n\n validate_podp_json(json_data)\n\n for record in json_data[\"genomes\"]:\n strain_id = record[\"genome_label\"]\n genome_id = get_best_available_genome_id(record[\"genome_ID\"])\n if genome_id is None:\n logger.warning(\"Failed to extract genome ID from genome with label %s\", strain_id)\n continue\n if strain_id in mappings_dict:\n mappings_dict[strain_id].add(genome_id)\n else:\n mappings_dict[strain_id] = {genome_id}\n return mappings_dict\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.extract_mappings_original_genome_id_resolved_genome_id","title":"extract_mappings_original_genome_id_resolved_genome_id","text":"extract_mappings_original_genome_id_resolved_genome_id(\n genome_status_json_file: str | PathLike,\n) -> dict[str, str]\n
Extract mappings \"original_genome_id <-> resolved_genome_id\".
Tip
The genome_status_json_file
is generated by the podp_download_and_extract_antismash_data function with a default file name GENOME_STATUS_FILENAME.
Parameters:
genome_status_json_file
(str | PathLike
) \u2013 The path to the genome status JSON file.
Returns:
dict[str, str]
\u2013 Key is original genome id and value is resolved genome id.
src/nplinker/genomics/utils.py
def extract_mappings_original_genome_id_resolved_genome_id(\n genome_status_json_file: str | PathLike,\n) -> dict[str, str]:\n \"\"\"Extract mappings \"original_genome_id <-> resolved_genome_id\".\n\n !!! tip\n The `genome_status_json_file` is generated by the [podp_download_and_extract_antismash_data]\n [nplinker.genomics.antismash.podp_antismash_downloader.podp_download_and_extract_antismash_data]\n function with a default file name [GENOME_STATUS_FILENAME][nplinker.defaults.GENOME_STATUS_FILENAME].\n\n Args:\n genome_status_json_file: The path to the genome status JSON file.\n\n\n Returns:\n Key is original genome id and value is resolved genome id.\n\n See Also:\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n gs_mappings_dict = GenomeStatus.read_json(genome_status_json_file)\n return {gs.original_id: gs.resolved_refseq_id for gs in gs_mappings_dict.values()}\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.extract_mappings_resolved_genome_id_bgc_id","title":"extract_mappings_resolved_genome_id_bgc_id","text":"extract_mappings_resolved_genome_id_bgc_id(\n genome_bgc_mappings_file: str | PathLike,\n) -> dict[str, set[str]]\n
Extract mappings \"resolved_genome_id <-> bgc_id\".
Tip
The genome_bgc_mappings_file
is usually generated by the generate_mappings_genome_id_bgc_id function with a default file name GENOME_BGC_MAPPINGS_FILENAME.
Parameters:
genome_bgc_mappings_file
(str | PathLike
) \u2013 The path to the genome BGC mappings JSON file.
Returns:
dict[str, set[str]]
\u2013 Key is resolved genome id and value is a set of BGC ids.
src/nplinker/genomics/utils.py
def extract_mappings_resolved_genome_id_bgc_id(\n genome_bgc_mappings_file: str | PathLike,\n) -> dict[str, set[str]]:\n \"\"\"Extract mappings \"resolved_genome_id <-> bgc_id\".\n\n !!! tip\n The `genome_bgc_mappings_file` is usually generated by the\n [generate_mappings_genome_id_bgc_id][nplinker.genomics.utils.generate_mappings_genome_id_bgc_id]\n function with a default file name [GENOME_BGC_MAPPINGS_FILENAME][nplinker.defaults.GENOME_BGC_MAPPINGS_FILENAME].\n\n Args:\n genome_bgc_mappings_file: The path to the genome BGC\n mappings JSON file.\n\n Returns:\n Key is resolved genome id and value is a set of BGC ids.\n\n See Also:\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n with open(genome_bgc_mappings_file, \"r\") as f:\n json_data = json.load(f)\n\n # validate the JSON data\n validate(json_data, GENOME_BGC_MAPPINGS_SCHEMA)\n\n return {mapping[\"genome_ID\"]: set(mapping[\"BGC_ID\"]) for mapping in json_data[\"mappings\"]}\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.get_mappings_strain_id_bgc_id","title":"get_mappings_strain_id_bgc_id","text":"get_mappings_strain_id_bgc_id(\n mappings_strain_id_original_genome_id: Mapping[\n str, set[str]\n ],\n mappings_original_genome_id_resolved_genome_id: Mapping[\n str, str\n ],\n mappings_resolved_genome_id_bgc_id: Mapping[\n str, set[str]\n ],\n) -> dict[str, set[str]]\n
Get mappings \"strain_id <-> bgc_id\".
Parameters:
mappings_strain_id_original_genome_id
(Mapping[str, set[str]]
) \u2013 Mappings \"strain_id <-> original_genome_id\".
mappings_original_genome_id_resolved_genome_id
(Mapping[str, str]
) \u2013 Mappings \"original_genome_id <-> resolved_genome_id\".
mappings_resolved_genome_id_bgc_id
(Mapping[str, set[str]]
) \u2013 Mappings \"resolved_genome_id <-> bgc_id\".
Returns:
dict[str, set[str]]
\u2013 Key is strain id and value is a set of BGC ids.
extract_mappings_strain_id_original_genome_id
: Extract mappings \"strain_id <-> original_genome_id\".extract_mappings_original_genome_id_resolved_genome_id
: Extract mappings \"original_genome_id <-> resolved_genome_id\".extract_mappings_resolved_genome_id_bgc_id
: Extract mappings \"resolved_genome_id <-> bgc_id\".src/nplinker/genomics/utils.py
def get_mappings_strain_id_bgc_id(\n mappings_strain_id_original_genome_id: Mapping[str, set[str]],\n mappings_original_genome_id_resolved_genome_id: Mapping[str, str],\n mappings_resolved_genome_id_bgc_id: Mapping[str, set[str]],\n) -> dict[str, set[str]]:\n \"\"\"Get mappings \"strain_id <-> bgc_id\".\n\n Args:\n mappings_strain_id_original_genome_id: Mappings \"strain_id <-> original_genome_id\".\n mappings_original_genome_id_resolved_genome_id: Mappings \"original_genome_id <-> resolved_genome_id\".\n mappings_resolved_genome_id_bgc_id: Mappings \"resolved_genome_id <-> bgc_id\".\n\n Returns:\n Key is strain id and value is a set of BGC ids.\n\n See Also:\n - `extract_mappings_strain_id_original_genome_id`: Extract mappings\n \"strain_id <-> original_genome_id\".\n - `extract_mappings_original_genome_id_resolved_genome_id`: Extract mappings\n \"original_genome_id <-> resolved_genome_id\".\n - `extract_mappings_resolved_genome_id_bgc_id`: Extract mappings\n \"resolved_genome_id <-> bgc_id\".\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n mappings_dict = {}\n for strain_id, original_genome_ids in mappings_strain_id_original_genome_id.items():\n bgc_ids = set()\n for original_genome_id in original_genome_ids:\n resolved_genome_id = mappings_original_genome_id_resolved_genome_id[original_genome_id]\n if (bgc_id := mappings_resolved_genome_id_bgc_id.get(resolved_genome_id)) is not None:\n bgc_ids.update(bgc_id)\n if bgc_ids:\n mappings_dict[strain_id] = bgc_ids\n return mappings_dict\n
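The three extract functions feed directly into this one; a sketch of the full chain with hypothetical file paths:

```python
from nplinker.genomics.utils import (
    extract_mappings_original_genome_id_resolved_genome_id,
    extract_mappings_resolved_genome_id_bgc_id,
    extract_mappings_strain_id_original_genome_id,
    get_mappings_strain_id_bgc_id,
)

strain_to_bgc = get_mappings_strain_id_bgc_id(
    extract_mappings_strain_id_original_genome_id("./podp_project.json"),
    extract_mappings_original_genome_id_resolved_genome_id("./genome_status.json"),
    extract_mappings_resolved_genome_id_bgc_id("./genome_bgc_mappings.json"),
)
```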
"},{"location":"api/gnps/","title":"GNPS","text":""},{"location":"api/gnps/#nplinker.metabolomics.gnps","title":"nplinker.metabolomics.gnps","text":""},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat","title":"GNPSFormat","text":" Bases: Enum
Enum class for GNPS formats or workflows.
ConceptGNPS data
The name of the enum is a short name for the workflow, and the value of the enum is the workflow name used on the GNPS website.
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat.SNETS","title":"SNETSclass-attribute
instance-attribute
","text":"SNETS = 'METABOLOMICS-SNETS'\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat.SNETSV2","title":"SNETSV2 class-attribute
instance-attribute
","text":"SNETSV2 = 'METABOLOMICS-SNETS-V2'\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat.FBMN","title":"FBMN class-attribute
instance-attribute
","text":"FBMN = 'FEATURE-BASED-MOLECULAR-NETWORKING'\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat.Unknown","title":"Unknown class-attribute
instance-attribute
","text":"Unknown = 'Unknown-GNPS-Workflow'\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader","title":"GNPSDownloader","text":"GNPSDownloader(task_id: str, download_root: str | PathLike)\n
Download GNPS zip archive for the given task id.
ConceptGNPS data
Note that only GNPS workflows listed in the GNPSFormat enum are supported.
Attributes:
GNPS_DATA_DOWNLOAD_URL
(str
) \u2013 URL template for downloading GNPS data.
GNPS_DATA_DOWNLOAD_URL_FBMN
(str
) \u2013 URL template for downloading GNPS data for FBMN.
gnps_format
(GNPSFormat
) \u2013 GNPS workflow type.
Parameters:
task_id
(str
) \u2013 GNPS task id, identifying the data to be downloaded.
download_root
(str | PathLike
) \u2013 Path where to store the downloaded archive.
Raises:
ValueError
\u2013 If the given task id does not correspond to a supported GNPS workflow.
Examples:
>>> GNPSDownloader(\"c22f44b14a3d450eb836d607cb9521bb\", \"~/downloads\")\n
Source code in src/nplinker/metabolomics/gnps/gnps_downloader.py
def __init__(self, task_id: str, download_root: str | PathLike):\n \"\"\"Initialize the GNPSDownloader.\n\n Args:\n task_id: GNPS task id, identifying the data to be downloaded.\n download_root: Path where to store the downloaded archive.\n\n Raises:\n ValueError: If the given task id does not correspond to a supported\n GNPS workflow.\n\n Examples:\n >>> GNPSDownloader(\"c22f44b14a3d450eb836d607cb9521bb\", \"~/downloads\")\n \"\"\"\n gnps_format = gnps_format_from_task_id(task_id)\n if gnps_format == GNPSFormat.Unknown:\n raise ValueError(\n f\"Unknown workflow type for GNPS task '{task_id}'.\"\n f\"Supported GNPS workflows are described in the GNPSFormat enum, \"\n f\"including such as 'METABOLOMICS-SNETS', 'METABOLOMICS-SNETS-V2' \"\n f\"and 'FEATURE-BASED-MOLECULAR-NETWORKING'.\"\n )\n\n self._task_id = task_id\n self._download_root: Path = Path(download_root)\n self._gnps_format = gnps_format\n self._file_name = gnps_format.value + \"-\" + self._task_id + \".zip\"\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.GNPS_DATA_DOWNLOAD_URL","title":"GNPS_DATA_DOWNLOAD_URL class-attribute
instance-attribute
","text":"GNPS_DATA_DOWNLOAD_URL: str = (\n \"https://gnps.ucsd.edu/ProteoSAFe/DownloadResult?task={}&view=download_clustered_spectra\"\n)\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.GNPS_DATA_DOWNLOAD_URL_FBMN","title":"GNPS_DATA_DOWNLOAD_URL_FBMN class-attribute
instance-attribute
","text":"GNPS_DATA_DOWNLOAD_URL_FBMN: str = (\n \"https://gnps.ucsd.edu/ProteoSAFe/DownloadResult?task={}&view=download_cytoscape_data\"\n)\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.gnps_format","title":"gnps_format property
","text":"gnps_format: GNPSFormat\n
Get the GNPS workflow type.
Returns:
GNPSFormat
\u2013 GNPS workflow type.
download() -> Self\n
Download GNPS data.
Note: GNPS data is downloaded using the POST method (empty payload is OK).
Source code insrc/nplinker/metabolomics/gnps/gnps_downloader.py
def download(self) -> Self:\n \"\"\"Download GNPS data.\n\n Note: GNPS data is downloaded using the POST method (empty payload is OK).\n \"\"\"\n download_url(\n self.get_url(), self._download_root, filename=self._file_name, http_method=\"POST\"\n )\n return self\n
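Since `download()` returns the downloader itself, calls can be chained; a sketch using the example task id from this documentation and a hypothetical download root:

```python
from nplinker.metabolomics.gnps import GNPSDownloader

downloader = GNPSDownloader("c22f44b14a3d450eb836d607cb9521bb", "./downloads")
zip_path = downloader.download().get_download_file()  # download() returns self
```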
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.get_download_file","title":"get_download_file","text":"get_download_file() -> str\n
Get the path to the downloaded file.
Returns:
str
\u2013 Download path as string
src/nplinker/metabolomics/gnps/gnps_downloader.py
def get_download_file(self) -> str:\n \"\"\"Get the path to the downloaded file.\n\n Returns:\n Download path as string\n \"\"\"\n return str(Path(self._download_root) / self._file_name)\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.get_task_id","title":"get_task_id","text":"get_task_id() -> str\n
Get the GNPS task id.
Returns:
str
\u2013 Task id as string.
src/nplinker/metabolomics/gnps/gnps_downloader.py
def get_task_id(self) -> str:\n \"\"\"Get the GNPS task id.\n\n Returns:\n Task id as string.\n \"\"\"\n return self._task_id\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.get_url","title":"get_url","text":"get_url() -> str\n
Get the download URL.
Returns:
str
\u2013 URL pointing to the GNPS data to be downloaded.
src/nplinker/metabolomics/gnps/gnps_downloader.py
def get_url(self) -> str:\n \"\"\"Get the download URL.\n\n Returns:\n URL pointing to the GNPS data to be downloaded.\n \"\"\"\n if self.gnps_format == GNPSFormat.FBMN:\n return GNPSDownloader.GNPS_DATA_DOWNLOAD_URL_FBMN.format(self._task_id)\n return GNPSDownloader.GNPS_DATA_DOWNLOAD_URL.format(self._task_id)\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSExtractor","title":"GNPSExtractor","text":"GNPSExtractor(\n file: str | PathLike, extract_dir: str | PathLike\n)\n
Extract files from a GNPS molecular networking archive (.zip).
ConceptGNPS data
Four files are extracted and renamed to the following names:
The files to be extracted are selected based on the GNPS workflow type, as described below (in the order of the files above):
Attributes:
gnps_format
(GNPSFormat
) \u2013 The GNPS workflow type.
extract_dir
(str
) \u2013 The path where to extract the files to.
Parameters:
file
(str | PathLike
) \u2013 The path to the GNPS zip file.
extract_dir
(str | PathLike
) \u2013 path to the directory where to extract the files to.
Raises:
ValueError
\u2013 If the given file is an invalid GNPS archive.
Examples:
>>> gnps_extractor = GNPSExtractor(\"path/to/gnps_archive.zip\", \"path/to/extract_dir\")\n>>> gnps_extractor.gnps_format\n<GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>\n>>> gnps_extractor.extract_dir\n'path/to/extract_dir'\n
Source code in src/nplinker/metabolomics/gnps/gnps_extractor.py
def __init__(self, file: str | PathLike, extract_dir: str | PathLike):\n \"\"\"Initialize the GNPSExtractor.\n\n Args:\n file: The path to the GNPS zip file.\n extract_dir: path to the directory where to extract the files to.\n\n Raises:\n ValueError: If the given file is an invalid GNPS archive.\n\n Examples:\n >>> gnps_extractor = GNPSExtractor(\"path/to/gnps_archive.zip\", \"path/to/extract_dir\")\n >>> gnps_extractor.gnps_format\n <GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>\n >>> gnps_extractor.extract_dir\n 'path/to/extract_dir'\n \"\"\"\n gnps_format = gnps_format_from_archive(file)\n if gnps_format == GNPSFormat.Unknown:\n raise ValueError(\n f\"Unknown workflow type for GNPS archive '{file}'.\"\n f\"Supported GNPS workflows are described in the GNPSFormat enum, \"\n f\"including such as 'METABOLOMICS-SNETS', 'METABOLOMICS-SNETS-V2' \"\n f\"and 'FEATURE-BASED-MOLECULAR-NETWORKING'.\"\n )\n\n self._file = Path(file)\n self._extract_path = Path(extract_dir)\n self._gnps_format = gnps_format\n # the order of filenames matters\n self._target_files = [\n \"file_mappings\",\n \"spectra.mgf\",\n \"molecular_families.tsv\",\n \"annotations.tsv\",\n ]\n\n self._extract()\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSExtractor.gnps_format","title":"gnps_format property
","text":"gnps_format: GNPSFormat\n
Get the GNPS workflow type.
Returns:
GNPSFormat
\u2013 GNPS workflow type.
property
","text":"extract_dir: str\n
Get the path where to extract the files to.
Returns:
str
\u2013 Path where to extract files as string.
GNPSSpectrumLoader

```python
GNPSSpectrumLoader(file: str | PathLike)
```

Bases: SpectrumLoaderBase

Load mass spectra from the given GNPS MGF file.

Concept: GNPS data

The MGF file comes from the GNPS output archive; which file it is depends on the GNPS workflow type.

Parameters:

- file (str | PathLike) – Path to the MGF file.

Raises:

- ValueError – Raises ValueError if the file is not valid.

Examples:

```python
>>> loader = GNPSSpectrumLoader("gnps_spectra.mgf")
>>> print(loader.spectra[0])
```

Source code in src/nplinker/metabolomics/gnps/gnps_spectrum_loader.py

```python
def __init__(self, file: str | PathLike) -> None:
    """Initialize the GNPSSpectrumLoader.

    Args:
        file: path to the MGF file.

    Raises:
        ValueError: Raises ValueError if the file is not valid.

    Examples:
        >>> loader = GNPSSpectrumLoader("gnps_spectra.mgf")
        >>> print(loader.spectra[0])
    """
    self._file = str(file)
    self._spectra: list[Spectrum] = []

    self._validate()
    self._load()
```
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSSpectrumLoader.spectra","title":"spectra property
","text":"spectra: list[Spectrum]\n
Get the list of Spectrum objects.
Returns:
list[Spectrum]
\u2013 list[Spectrum]: the loaded spectra as a list of Spectrum
objects.
GNPSMolecularFamilyLoader(file: str | PathLike)\n
Bases: MolecularFamilyLoaderBase
Load molecular families from GNPS data.
ConceptGNPS data
The molecular family file is from GNPS output archive, as described below for each GNPS workflow type:
The ComponentIndex
column in the GNPS molecular family file is treated as family id.
But for molecular families that have only one member (i.e. spectrum), named singleton molecular families, their files have the same value of -1
in the ComponentIndex
column. To make the family id unique,the spectrum id plus a prefix singleton-
is used as the family id of singleton molecular families.
Parameters:
file
(str | PathLike
) \u2013 Path to the GNPS molecular family file.
Raises:
ValueError
\u2013 Raises ValueError if the file is not valid.
Examples:
>>> loader = GNPSMolecularFamilyLoader(\"gnps_molecular_families.tsv\")\n>>> print(loader.families)\n[<MolecularFamily 1>, <MolecularFamily 2>, ...]\n>>> print(loader.families[0].spectra_ids)\n{'1', '3', '7', ...}\n
Source code in src/nplinker/metabolomics/gnps/gnps_molecular_family_loader.py
def __init__(self, file: str | PathLike) -> None:\n \"\"\"Initialize the GNPSMolecularFamilyLoader.\n\n Args:\n file: Path to the GNPS molecular family file.\n\n Raises:\n ValueError: Raises ValueError if the file is not valid.\n\n Examples:\n >>> loader = GNPSMolecularFamilyLoader(\"gnps_molecular_families.tsv\")\n >>> print(loader.families)\n [<MolecularFamily 1>, <MolecularFamily 2>, ...]\n >>> print(loader.families[0].spectra_ids)\n {'1', '3', '7', ...}\n \"\"\"\n self._mfs: list[MolecularFamily] = []\n self._file = file\n\n self._validate()\n self._load()\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSMolecularFamilyLoader.get_mfs","title":"get_mfs","text":"get_mfs(\n keep_singleton: bool = False,\n) -> list[MolecularFamily]\n
Get MolecularFamily objects.
Parameters:
keep_singleton
(bool
, default: False
) \u2013 True to keep singleton molecular families. A singleton molecular family is a molecular family that contains only one spectrum.
Returns:
list[MolecularFamily]
\u2013 A list of MolecularFamily objects with their spectra ids.
src/nplinker/metabolomics/gnps/gnps_molecular_family_loader.py
def get_mfs(self, keep_singleton: bool = False) -> list[MolecularFamily]:\n \"\"\"Get MolecularFamily objects.\n\n Args:\n keep_singleton: True to keep singleton molecular families. A\n singleton molecular family is a molecular family that contains\n only one spectrum.\n\n Returns:\n A list of MolecularFamily objects with their spectra ids.\n \"\"\"\n mfs = self._mfs\n if not keep_singleton:\n mfs = [mf for mf in mfs if not mf.is_singleton()]\n return mfs\n
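A usage sketch; the file path is hypothetical and points at the `molecular_families.tsv` produced by GNPSExtractor:

```python
from nplinker.metabolomics.gnps import GNPSMolecularFamilyLoader

loader = GNPSMolecularFamilyLoader("./gnps/molecular_families.tsv")
mfs = loader.get_mfs()  # singleton molecular families are dropped by default
```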
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSAnnotationLoader","title":"GNPSAnnotationLoader","text":"GNPSAnnotationLoader(file: str | PathLike)\n
Bases: AnnotationLoaderBase
Load annotations from GNPS output file.
ConceptGNPS data
The annotation file is a .tsv
file from GNPS output archive, as described below for each GNPS workflow type:
Parameters:
file
(str | PathLike
) \u2013 The GNPS annotation file.
Examples:
>>> loader = GNPSAnnotationLoader(\"gnps_annotations.tsv\")\n>>> print(loader.annotations[\"100\"])\n{'#Scan#': '100',\n'Adduct': 'M+H',\n'CAS_Number': 'N/A',\n'Charge': '1',\n'Compound_Name': 'MLS002153841-01!Iobenguane sulfate',\n'Compound_Source': 'NIH Pharmacologically Active Library',\n'Data_Collector': 'VP/LMS',\n'ExactMass': '274.992',\n'INCHI': 'N/A',\n'INCHI_AUX': 'N/A',\n'Instrument': 'qTof',\n'IonMode': 'Positive',\n'Ion_Source': 'LC-ESI',\n'LibMZ': '276.003',\n'LibraryName': 'lib-00014.mgf',\n'LibraryQualityString': 'Gold',\n'Library_Class': '1',\n'MQScore': '0.704152',\n'MZErrorPPM': '405416',\n'MassDiff': '111.896',\n'Organism': 'GNPS-NIH-SMALLMOLECULEPHARMACOLOGICALLYACTIVE',\n'PI': 'Dorrestein',\n'Precursor_MZ': '276.003',\n'Pubmed_ID': 'N/A',\n'RT_Query': '795.979',\n'SharedPeaks': '7',\n'Smiles': 'NC(=N)NCc1cccc(I)c1.OS(=O)(=O)O',\n'SpecCharge': '1',\n'SpecMZ': '164.107',\n'SpectrumFile': 'spectra/specs_ms.pklbin',\n'SpectrumID': 'CCMSLIB00000086167',\n'TIC_Query': '986.997',\n'UpdateWorkflowName': 'UPDATE-SINGLE-ANNOTATED-GOLD',\n'tags': ' ',\n'png_url': 'https://metabolomics-usi.gnps2.org/png/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n'json_url': 'https://metabolomics-usi.gnps2.org/json/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n'svg_url': 'https://metabolomics-usi.gnps2.org/svg/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n'spectrum_url': 'https://metabolomics-usi.gnps2.org/spectrum/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167'}\n
Source code in src/nplinker/metabolomics/gnps/gnps_annotation_loader.py
def __init__(self, file: str | PathLike) -> None:\n \"\"\"Initialize the GNPSAnnotationLoader.\n\n Args:\n file: The GNPS annotation file.\n\n Examples:\n >>> loader = GNPSAnnotationLoader(\"gnps_annotations.tsv\")\n >>> print(loader.annotations[\"100\"])\n {'#Scan#': '100',\n 'Adduct': 'M+H',\n 'CAS_Number': 'N/A',\n 'Charge': '1',\n 'Compound_Name': 'MLS002153841-01!Iobenguane sulfate',\n 'Compound_Source': 'NIH Pharmacologically Active Library',\n 'Data_Collector': 'VP/LMS',\n 'ExactMass': '274.992',\n 'INCHI': 'N/A',\n 'INCHI_AUX': 'N/A',\n 'Instrument': 'qTof',\n 'IonMode': 'Positive',\n 'Ion_Source': 'LC-ESI',\n 'LibMZ': '276.003',\n 'LibraryName': 'lib-00014.mgf',\n 'LibraryQualityString': 'Gold',\n 'Library_Class': '1',\n 'MQScore': '0.704152',\n 'MZErrorPPM': '405416',\n 'MassDiff': '111.896',\n 'Organism': 'GNPS-NIH-SMALLMOLECULEPHARMACOLOGICALLYACTIVE',\n 'PI': 'Dorrestein',\n 'Precursor_MZ': '276.003',\n 'Pubmed_ID': 'N/A',\n 'RT_Query': '795.979',\n 'SharedPeaks': '7',\n 'Smiles': 'NC(=N)NCc1cccc(I)c1.OS(=O)(=O)O',\n 'SpecCharge': '1',\n 'SpecMZ': '164.107',\n 'SpectrumFile': 'spectra/specs_ms.pklbin',\n 'SpectrumID': 'CCMSLIB00000086167',\n 'TIC_Query': '986.997',\n 'UpdateWorkflowName': 'UPDATE-SINGLE-ANNOTATED-GOLD',\n 'tags': ' ',\n 'png_url': 'https://metabolomics-usi.gnps2.org/png/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n 'json_url': 'https://metabolomics-usi.gnps2.org/json/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n 'svg_url': 'https://metabolomics-usi.gnps2.org/svg/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n 'spectrum_url': 'https://metabolomics-usi.gnps2.org/spectrum/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167'}\n \"\"\"\n self._file = Path(file)\n self._annotations: dict[str, dict] = {}\n\n self._validate()\n self._load()\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSAnnotationLoader.annotations","title":"annotations property
","text":"annotations: dict[str, dict]\n
Get annotations.
Returns:
dict[str, dict]
\u2013 Keys are spectrum ids (\"#Scan#\" in the annotation file) and values are the annotation dicts for each spectrum.
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFileMappingLoader","title":"GNPSFileMappingLoader","text":"GNPSFileMappingLoader(file: str | PathLike)\n
Bases: FileMappingLoaderBase
Class to load file mappings from GNPS output file.
Concept: GNPS data
File mappings refer to the mapping from a spectrum id to the files in which that spectrum occurs.
The file mappings file comes from the GNPS output archive; its location within the archive depends on the GNPS workflow type (see gnps_format_from_file_mapping below).
Parameters:
file
(str | PathLike
) \u2013 Path to the GNPS file mappings file.
Raises:
ValueError
\u2013 Raises ValueError if the file is not valid.
Examples:
>>> loader = GNPSFileMappingLoader(\"gnps_file_mappings.tsv\")\n>>> print(loader.mappings[\"1\"])\n['26c.mzXML']\n>>> print(loader.mapping_reversed[\"26c.mzXML\"])\n{'1', '3', '7', ...}\n
Source code in src/nplinker/metabolomics/gnps/gnps_file_mapping_loader.py
def __init__(self, file: str | PathLike) -> None:\n \"\"\"Initialize the GNPSFileMappingLoader.\n\n Args:\n file: Path to the GNPS file mappings file.\n\n Raises:\n ValueError: Raises ValueError if the file is not valid.\n\n Examples:\n >>> loader = GNPSFileMappingLoader(\"gnps_file_mappings.tsv\")\n >>> print(loader.mappings[\"1\"])\n ['26c.mzXML']\n >>> print(loader.mapping_reversed[\"26c.mzXML\"])\n {'1', '3', '7', ...}\n \"\"\"\n self._gnps_format = gnps_format_from_file_mapping(file)\n if self._gnps_format is GNPSFormat.Unknown:\n raise ValueError(\"Unknown workflow type for GNPS file mappings file \")\n\n self._file = Path(file)\n self._mapping: dict[str, list[str]] = {}\n\n self._validate()\n self._load()\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFileMappingLoader.mappings","title":"mappings property
","text":"mappings: dict[str, list[str]]\n
Return mapping from spectrum id to files in which this spectrum occurs.
Returns:
dict[str, list[str]]
\u2013 Mapping from spectrum id to names of all files in which this spectrum occurs.
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFileMappingLoader.mapping_reversed","title":"mapping_reversed property
","text":"mapping_reversed: dict[str, set[str]]\n
Return mapping from file name to all spectra that occur in this file.
Returns:
dict[str, set[str]]
\u2013 Mapping from file name to all spectra ids that occur in this file.
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.gnps_format_from_archive","title":"gnps_format_from_archive","text":"gnps_format_from_archive(\n zip_file: str | PathLike,\n) -> GNPSFormat\n
Detect GNPS format from GNPS zip archive.
The detection is based on the filename of the zip file and the names of the files contained in the zip file.
Parameters:
zip_file
(str | PathLike
) \u2013 Path to the GNPS zip file.
Returns:
GNPSFormat
\u2013 The format identified in the GNPS zip file.
Examples:
>>> gnps_format_from_archive(\"ProteoSAFe-METABOLOMICS-SNETS-c22f44b1-download_clustered_spectra.zip\")\n<GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>\n>>> gnps_format_from_archive(\"ProteoSAFe-METABOLOMICS-SNETS-V2-189e8bf1-download_clustered_spectra.zip\")\n<GNPSFormat.SNETSV2: 'METABOLOMICS-SNETS-V2'>\n>>> gnps_format_from_archive(\"ProteoSAFe-FEATURE-BASED-MOLECULAR-NETWORKING-672d0a53-download_cytoscape_data.zip\")\n<GNPSFormat.FBMN: 'FEATURE-BASED-MOLECULAR-NETWORKING'>\n
Source code in src/nplinker/metabolomics/gnps/gnps_format.py
def gnps_format_from_archive(zip_file: str | PathLike) -> GNPSFormat:\n \"\"\"Detect GNPS format from GNPS zip archive.\n\n The detection is based on the filename of the zip file and the names of the\n files contained in the zip file.\n\n Args:\n zip_file: Path to the GNPS zip file.\n\n Returns:\n The format identified in the GNPS zip file.\n\n Examples:\n >>> gnps_format_from_archive(\"ProteoSAFe-METABOLOMICS-SNETS-c22f44b1-download_clustered_spectra.zip\")\n <GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>\n >>> gnps_format_from_archive(\"ProteoSAFe-METABOLOMICS-SNETS-V2-189e8bf1-download_clustered_spectra.zip\")\n <GNPSFormat.SNETSV2: 'METABOLOMICS-SNETS-V2'>\n >>> gnps_format_from_archive(\"ProteoSAFe-FEATURE-BASED-MOLECULAR-NETWORKING-672d0a53-download_cytoscape_data.zip\")\n <GNPSFormat.FBMN: 'FEATURE-BASED-MOLECULAR-NETWORKING'>\n \"\"\"\n file = Path(zip_file)\n # Guess the format from the filename of the zip file\n if GNPSFormat.FBMN.value in file.name:\n return GNPSFormat.FBMN\n # the order of the if statements matters for the following two\n if GNPSFormat.SNETSV2.value in file.name:\n return GNPSFormat.SNETSV2\n if GNPSFormat.SNETS.value in file.name:\n return GNPSFormat.SNETS\n\n # Guess the format from the names of the files in the zip file\n with zipfile.ZipFile(file) as archive:\n filenames = archive.namelist()\n if any(GNPSFormat.FBMN.value in x for x in filenames):\n return GNPSFormat.FBMN\n # the order of the if statements matters for the following two\n if any(GNPSFormat.SNETSV2.value in x for x in filenames):\n return GNPSFormat.SNETSV2\n if any(GNPSFormat.SNETS.value in x for x in filenames):\n return GNPSFormat.SNETS\n\n return GNPSFormat.Unknown\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.gnps_format_from_file_mapping","title":"gnps_format_from_file_mapping","text":"gnps_format_from_file_mapping(\n file: str | PathLike,\n) -> GNPSFormat\n
Detect GNPS format from the given file mapping file.
The GNPS file mapping file is located in different folders depending on the GNPS workflow. Here are the locations in corresponding GNPS zip archives:
- METABOLOMICS-SNETS workflow: the .tsv file in the folder clusterinfosummarygroup_attributes_withIDs_withcomponentID
- METABOLOMICS-SNETS-V2 workflow: the .clustersummary file (tsv) in the folder clusterinfosummarygroup_attributes_withIDs_withcomponentID
- FEATURE-BASED-MOLECULAR-NETWORKING workflow: the .csv file in the folder quantification_table
Parameters:
file
(str | PathLike
) \u2013 Path to the file whose GNPS format should be detected.
Returns:
GNPSFormat
\u2013 GNPS format identified in the file.
src/nplinker/metabolomics/gnps/gnps_format.py
def gnps_format_from_file_mapping(file: str | PathLike) -> GNPSFormat:\n \"\"\"Detect GNPS format from the given file mapping file.\n\n The GNPS file mapping file is located in different folders depending on the\n GNPS workflow. Here are the locations in corresponding GNPS zip archives:\n\n - `METABOLOMICS-SNETS` workflow: the `.tsv` file in the folder\n `clusterinfosummarygroup_attributes_withIDs_withcomponentID`\n - `METABOLOMICS-SNETS-V2` workflow: the `.clustersummary` file (tsv) in the folder\n `clusterinfosummarygroup_attributes_withIDs_withcomponentID`\n - `FEATURE-BASED-MOLECULAR-NETWORKING` workflow: the `.csv` file in the folder\n `quantification_table`\n\n Args:\n file: Path to the file to peek the format for.\n\n Returns:\n GNPS format identified in the file.\n \"\"\"\n with open(file, \"r\") as f:\n header = f.readline().strip()\n\n if re.search(r\"\\bAllFiles\\b\", header):\n return GNPSFormat.SNETS\n if re.search(r\"\\bUniqueFileSources\\b\", header):\n return GNPSFormat.SNETSV2\n if re.search(r\"\\b{}\\b\".format(re.escape(\"row ID\")), header):\n return GNPSFormat.FBMN\n return GNPSFormat.Unknown\n
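A usage sketch (hypothetical file path; the detection relies only on the header line of the given file, so any .tsv whose header contains the column AllFiles is reported as SNETS):
>>> from nplinker.metabolomics.gnps import gnps_format_from_file_mapping
>>> gnps_format_from_file_mapping(\"clusterinfosummarygroup_attributes_withIDs_withcomponentID/file_mappings.tsv\")
<GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>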
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.gnps_format_from_task_id","title":"gnps_format_from_task_id","text":"gnps_format_from_task_id(task_id: str) -> GNPSFormat\n
Detect GNPS format for the given task id.
Parameters:
task_id
(str
) \u2013 GNPS task id.
Returns:
GNPSFormat
\u2013 The format identified in the GNPS task.
Examples:
>>> gnps_format_from_task_id(\"c22f44b14a3d450eb836d607cb9521bb\")\n<GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>\n>>> gnps_format_from_task_id(\"189e8bf16af145758b0a900f1c44ff4a\")\n<GNPSFormat.SNETSV2: 'METABOLOMICS-SNETS-V2'>\n>>> gnps_format_from_task_id(\"92036537c21b44c29e509291e53f6382\")\n<GNPSFormat.FBMN: 'FEATURE-BASED-MOLECULAR-NETWORKING'>\n>>> gnps_format_from_task_id(\"0ad6535e34d449788f297e712f43068a\")\n<GNPSFormat.Unknown: 'Unknown-GNPS-Workflow'>\n
Source code in src/nplinker/metabolomics/gnps/gnps_format.py
def gnps_format_from_task_id(task_id: str) -> GNPSFormat:\n \"\"\"Detect GNPS format for the given task id.\n\n Args:\n task_id: GNPS task id.\n\n Returns:\n The format identified in the GNPS task.\n\n Examples:\n >>> gnps_format_from_task_id(\"c22f44b14a3d450eb836d607cb9521bb\")\n <GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>\n >>> gnps_format_from_task_id(\"189e8bf16af145758b0a900f1c44ff4a\")\n <GNPSFormat.SNETSV2: 'METABOLOMICS-SNETS-V2'>\n >>> gnps_format_from_task_id(\"92036537c21b44c29e509291e53f6382\")\n <GNPSFormat.FBMN: 'FEATURE-BASED-MOLECULAR-NETWORKING'>\n >>> gnps_format_from_task_id(\"0ad6535e34d449788f297e712f43068a\")\n <GNPSFormat.Unknown: 'Unknown-GNPS-Workflow'>\n \"\"\"\n task_html = httpx.get(GNPS_TASK_URL.format(task_id))\n soup = BeautifulSoup(task_html.text, features=\"html.parser\")\n try:\n # find the td tag that follows the th tag containing 'Workflow'\n workflow_tag = soup.find(\"th\", string=\"Workflow\").find_next_sibling(\"td\") # type: ignore\n workflow_format = workflow_tag.contents[0].strip() # type: ignore\n except AttributeError:\n return GNPSFormat.Unknown\n\n if workflow_format == GNPSFormat.FBMN.value:\n return GNPSFormat.FBMN\n if workflow_format == GNPSFormat.SNETSV2.value:\n return GNPSFormat.SNETSV2\n if workflow_format == GNPSFormat.SNETS.value:\n return GNPSFormat.SNETS\n return GNPSFormat.Unknown\n
"},{"location":"api/loader/","title":"Dataset Loader","text":""},{"location":"api/loader/#nplinker.loader","title":"nplinker.loader","text":""},{"location":"api/loader/#nplinker.loader.DatasetLoader","title":"DatasetLoader","text":"DatasetLoader(config: Dynaconf)\n
Load datasets from the working directory with the given configuration.
Concept and Diagram: Working Directory Structure, Dataset Loading Pipeline
Loaded data are stored in the data containers (attributes), e.g. self.bgcs, self.gcfs, etc.
Attributes:
config
\u2013 A Dynaconf object that contains the configuration settings.
bgcs
(list[BGC]
) \u2013 A list of BGC objects.
gcfs
(list[GCF]
) \u2013 A list of GCF objects.
spectra
(list[Spectrum]
) \u2013 A list of Spectrum objects.
mfs
(list[MolecularFamily]
) \u2013 A list of MolecularFamily objects.
mibig_bgcs
(list[BGC]
) \u2013 A list of MIBiG BGC objects.
mibig_strains_in_use
(StrainCollection
) \u2013 A StrainCollection object that contains the strains in use from MIBiG.
product_types
(list
) \u2013 A list of product types.
strains
(StrainCollection
) \u2013 A StrainCollection object that contains all strains.
class_matches
\u2013 A ClassMatches object that contains class match info.
chem_classes
\u2013 A ChemClassPredictions object that contains chemical class predictions.
Parameters:
config
(Dynaconf
) \u2013 A Dynaconf object that contains the configuration settings.
Examples:
>>> from nplinker.config import load_config\n>>> from nplinker.loader import DatasetLoader\n>>> config = load_config(\"nplinker.toml\")\n>>> loader = DatasetLoader(config)\n>>> loader.load()\n
See Also DatasetArranger: Download, generate and/or validate datasets to ensure they are ready for loading.
Source code in src/nplinker/loader.py
def __init__(self, config: Dynaconf) -> None:\n \"\"\"Initialize the DatasetLoader.\n\n Args:\n config: A Dynaconf object that contains the configuration settings.\n\n Examples:\n >>> from nplinker.config import load_config\n >>> from nplinker.loader import DatasetLoader\n >>> config = load_config(\"nplinker.toml\")\n >>> loader = DatasetLoader(config)\n >>> loader.load()\n\n See Also:\n [DatasetArranger][nplinker.arranger.DatasetArranger]: Download, generate and/or validate\n datasets to ensure they are ready for loading.\n \"\"\"\n self.config = config\n\n self.bgcs: list[BGC] = []\n self.gcfs: list[GCF] = []\n self.spectra: list[Spectrum] = []\n self.mfs: list[MolecularFamily] = []\n self.mibig_bgcs: list[BGC] = []\n self.mibig_strains_in_use: StrainCollection = StrainCollection()\n self.product_types: list = []\n self.strains: StrainCollection = StrainCollection()\n\n self.class_matches = None\n self.chem_classes = None\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.RUN_CANOPUS_DEFAULT","title":"RUN_CANOPUS_DEFAULT class-attribute
instance-attribute
","text":"RUN_CANOPUS_DEFAULT = False\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.EXTRA_CANOPUS_PARAMS_DEFAULT","title":"EXTRA_CANOPUS_PARAMS_DEFAULT class-attribute
instance-attribute
","text":"EXTRA_CANOPUS_PARAMS_DEFAULT = (\n \"--maxmz 600 formula zodiac structure canopus\"\n)\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.OR_CANOPUS","title":"OR_CANOPUS class-attribute
instance-attribute
","text":"OR_CANOPUS = 'canopus_dir'\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.OR_MOLNETENHANCER","title":"OR_MOLNETENHANCER class-attribute
instance-attribute
","text":"OR_MOLNETENHANCER = 'molnetenhancer_dir'\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.config","title":"config instance-attribute
","text":"config = config\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.bgcs","title":"bgcs instance-attribute
","text":"bgcs: list[BGC] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.gcfs","title":"gcfs instance-attribute
","text":"gcfs: list[GCF] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.spectra","title":"spectra instance-attribute
","text":"spectra: list[Spectrum] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.mfs","title":"mfs instance-attribute
","text":"mfs: list[MolecularFamily] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.mibig_bgcs","title":"mibig_bgcs instance-attribute
","text":"mibig_bgcs: list[BGC] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.mibig_strains_in_use","title":"mibig_strains_in_use instance-attribute
","text":"mibig_strains_in_use: StrainCollection = StrainCollection()\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.product_types","title":"product_types instance-attribute
","text":"product_types: list = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.strains","title":"strains instance-attribute
","text":"strains: StrainCollection = StrainCollection()\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.class_matches","title":"class_matches instance-attribute
","text":"class_matches = None\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.chem_classes","title":"chem_classes instance-attribute
","text":"chem_classes = None\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.load","title":"load","text":"load() -> bool\n
Load all data from data files in the working directory.
See Dataset Loading Pipeline for the detailed steps.
Returns:
bool
\u2013 True if all data are loaded successfully.
src/nplinker/loader.py
def load(self) -> bool:\n \"\"\"Load all data from data files in the working directory.\n\n See [Dataset Loading Pipeline][dataset-loading-pipeline] for the detailed steps.\n\n Returns:\n True if all data are loaded successfully.\n \"\"\"\n if not self._load_strain_mappings():\n return False\n\n if not self._load_metabolomics():\n return False\n\n if not self._load_genomics():\n return False\n\n # set self.strains with all strains from input plus mibig strains in use\n self.strains = self.strains + self.mibig_strains_in_use\n\n if len(self.strains) == 0:\n raise Exception(\"Failed to find *ANY* strains.\")\n\n return True\n
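A minimal usage sketch, reusing the config object from the class example above:
>>> loader = DatasetLoader(config)
>>> if loader.load():
...     print(len(loader.bgcs), len(loader.gcfs), len(loader.spectra), len(loader.mfs))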
"},{"location":"api/metabolomics/","title":"Data Models","text":""},{"location":"api/metabolomics/#nplinker.metabolomics","title":"nplinker.metabolomics","text":""},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily","title":"MolecularFamily","text":"MolecularFamily(id: str)\n
Class to model a molecular family.
Attributes:
id
(str
) \u2013 Unique id for the molecular family.
spectra_ids
(set[str]
) \u2013 Set of spectrum ids in the molecular family.
spectra
(set[Spectrum]
) \u2013 Set of Spectrum objects in the molecular family.
strains
(StrainCollection
) \u2013 StrainCollection object that contains strains in the molecular family.
Parameters:
id
(str
) \u2013 Unique id for the molecular family.
src/nplinker/metabolomics/molecular_family.py
def __init__(self, id: str):\n \"\"\"Initialize the MolecularFamily.\n\n Args:\n id: Unique id for the molecular family.\n \"\"\"\n self.id: str = id\n self.spectra_ids: set[str] = set()\n self._spectra: set[Spectrum] = set()\n self._strains: StrainCollection = StrainCollection()\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.id","title":"id instance-attribute
","text":"id: str = id\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.spectra_ids","title":"spectra_ids instance-attribute
","text":"spectra_ids: set[str] = set()\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.spectra","title":"spectra property
","text":"spectra: set[Spectrum]\n
Get Spectrum objects in the molecular family.
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.strains","title":"strainsproperty
","text":"strains: StrainCollection\n
Get strains in the molecular family.
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/metabolomics/molecular_family.py
def __str__(self) -> str:\n return (\n f\"MolecularFamily(id={self.id}, #Spectrum_objects={len(self._spectra)}, \"\n f\"#spectrum_ids={len(self.spectra_ids)}, #strains={len(self._strains)})\"\n )\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/metabolomics/molecular_family.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/metabolomics/molecular_family.py
def __eq__(self, other) -> bool:\n if isinstance(other, MolecularFamily):\n return self.id == other.id\n return NotImplemented\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__hash__","title":"__hash__","text":"__hash__() -> int\n
Source code in src/nplinker/metabolomics/molecular_family.py
def __hash__(self) -> int:\n return hash(self.id)\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__reduce__","title":"__reduce__","text":"__reduce__() -> tuple\n
Reduce function for pickling.
Source code in src/nplinker/metabolomics/molecular_family.py
def __reduce__(self) -> tuple:\n \"\"\"Reduce function for pickling.\"\"\"\n return (self.__class__, (self.id,), self.__dict__)\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.add_spectrum","title":"add_spectrum","text":"add_spectrum(spectrum: Spectrum) -> None\n
Add a Spectrum object to the molecular family.
Parameters:
spectrum
(Spectrum
) \u2013 Spectrum
object to add to the molecular family.
src/nplinker/metabolomics/molecular_family.py
def add_spectrum(self, spectrum: Spectrum) -> None:\n \"\"\"Add a Spectrum object to the molecular family.\n\n Args:\n spectrum: `Spectrum` object to add to the molecular family.\n \"\"\"\n self._spectra.add(spectrum)\n self.spectra_ids.add(spectrum.id)\n self._strains = self._strains + spectrum.strains\n # add the molecular family to the spectrum\n spectrum.family = self\n
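A small sketch of the bookkeeping done by add_spectrum (hypothetical ids and peak values):
>>> from nplinker.metabolomics import MolecularFamily, Spectrum
>>> mf = MolecularFamily(\"1\")
>>> spec = Spectrum(id=\"100\", mz=[100.0], intensity=[1.0], precursor_mz=150.0)
>>> mf.add_spectrum(spec)
>>> spec.family is mf  # back-reference is set on the spectrum
True
>>> \"100\" in mf.spectra_ids
True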
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.detach_spectrum","title":"detach_spectrum","text":"detach_spectrum(spectrum: Spectrum) -> None\n
Remove a Spectrum object from the molecular family.
Parameters:
spectrum
(Spectrum
) \u2013 Spectrum
object to remove from the molecular family.
src/nplinker/metabolomics/molecular_family.py
def detach_spectrum(self, spectrum: Spectrum) -> None:\n \"\"\"Remove a Spectrum object from the molecular family.\n\n Args:\n spectrum: `Spectrum` object to remove from the molecular family.\n \"\"\"\n self._spectra.remove(spectrum)\n self.spectra_ids.remove(spectrum.id)\n self._strains = self._update_strains()\n # remove the molecular family from the spectrum\n spectrum.family = None\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.has_strain","title":"has_strain","text":"has_strain(strain: Strain) -> bool\n
Check if the given strain exists.
Parameters:
strain
(Strain
) \u2013 Strain
object.
Returns:
bool
\u2013 True when the given strain exists.
src/nplinker/metabolomics/molecular_family.py
def has_strain(self, strain: Strain) -> bool:\n \"\"\"Check if the given strain exists.\n\n Args:\n strain: `Strain` object.\n\n Returns:\n True when the given strain exists.\n \"\"\"\n return strain in self._strains\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.is_singleton","title":"is_singleton","text":"is_singleton() -> bool\n
Check if the molecular family contains only one spectrum.
Returns:
bool
\u2013 True when the molecular family has only one spectrum.
src/nplinker/metabolomics/molecular_family.py
def is_singleton(self) -> bool:\n \"\"\"Check if the molecular family contains only one spectrum.\n\n Returns:\n True when the molecular family has only one spectrum.\n \"\"\"\n return len(self.spectra_ids) == 1\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum","title":"Spectrum","text":"Spectrum(\n id: str,\n mz: list[float],\n intensity: list[float],\n precursor_mz: float,\n rt: float = 0,\n metadata: dict | None = None,\n)\n
Class to model MS/MS Spectrum.
Attributes:
id
\u2013 the spectrum ID.
mz
\u2013 the list of m/z values.
intensity
\u2013 the list of intensity values.
precursor_mz
\u2013 the m/z value of the precursor.
rt
\u2013 the retention time in seconds.
metadata
\u2013 the metadata of the spectrum, i.e. the header information in the MGF file.
gnps_annotations
(dict
) \u2013 the GNPS annotations of the spectrum.
gnps_id
(str | None
) \u2013 the GNPS ID of the spectrum.
strains
(StrainCollection
) \u2013 the strains that this spectrum belongs to.
family
(MolecularFamily | None
) \u2013 the molecular family that this spectrum belongs to.
peaks
(ndarray
) \u2013 2D array of peaks, each row is a peak of (m/z, intensity) values.
Parameters:
id
(str
) \u2013 the spectrum ID.
mz
(list[float]
) \u2013 the list of m/z values.
intensity
(list[float]
) \u2013 the list of intensity values.
precursor_mz
(float
) \u2013 the precursor m/z.
rt
(float
, default: 0
) \u2013 the retention time in seconds. Defaults to 0.
metadata
(dict | None
, default: None
) \u2013 the metadata of the spectrum, i.e. the header information in the MGF file.
src/nplinker/metabolomics/spectrum.py
def __init__(\n self,\n id: str,\n mz: list[float],\n intensity: list[float],\n precursor_mz: float,\n rt: float = 0,\n metadata: dict | None = None,\n) -> None:\n \"\"\"Initialize the Spectrum.\n\n Args:\n id: the spectrum ID.\n mz: the list of m/z values.\n intensity: the list of intensity values.\n precursor_mz: the precursor m/z.\n rt: the retention time in seconds. Defaults to 0.\n metadata: the metadata of the spectrum, i.e. the header information\n in the MGF file.\n \"\"\"\n self.id = id\n self.mz = mz\n self.intensity = intensity\n self.precursor_mz = precursor_mz\n self.rt = rt\n self.metadata = metadata or {}\n\n self.gnps_annotations: dict = {}\n self.gnps_id: str | None = None\n self.strains: StrainCollection = StrainCollection()\n self.family: MolecularFamily | None = None\n
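For illustration (hypothetical values; the peaks property below is derived from the mz and intensity lists):
>>> spec = Spectrum(id=\"1\", mz=[100.0, 200.0], intensity=[10.0, 20.0], precursor_mz=250.0, rt=5.0)
>>> spec.peaks.shape  # one (m/z, intensity) row per peak
(2, 2)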
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.id","title":"id instance-attribute
","text":"id = id\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.mz","title":"mz instance-attribute
","text":"mz = mz\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.intensity","title":"intensity instance-attribute
","text":"intensity = intensity\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.precursor_mz","title":"precursor_mz instance-attribute
","text":"precursor_mz = precursor_mz\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.rt","title":"rt instance-attribute
","text":"rt = rt\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.metadata","title":"metadata instance-attribute
","text":"metadata = metadata or {}\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.gnps_annotations","title":"gnps_annotations instance-attribute
","text":"gnps_annotations: dict = {}\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.gnps_id","title":"gnps_id instance-attribute
","text":"gnps_id: str | None = None\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.strains","title":"strains instance-attribute
","text":"strains: StrainCollection = StrainCollection()\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.family","title":"family instance-attribute
","text":"family: MolecularFamily | None = None\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.peaks","title":"peaks cached
property
","text":"peaks: ndarray\n
Get the peaks, a 2D array with each row containing the values of (m/z, intensity).
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/metabolomics/spectrum.py
def __str__(self) -> str:\n return f\"Spectrum(id={self.id}, #strains={len(self.strains)})\"\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/metabolomics/spectrum.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/metabolomics/spectrum.py
def __eq__(self, other) -> bool:\n if isinstance(other, Spectrum):\n return self.id == other.id and self.precursor_mz == other.precursor_mz\n return NotImplemented\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__hash__","title":"__hash__","text":"__hash__() -> int\n
Source code in src/nplinker/metabolomics/spectrum.py
def __hash__(self) -> int:\n return hash((self.id, self.precursor_mz))\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__reduce__","title":"__reduce__","text":"__reduce__() -> tuple\n
Reduce function for pickling.
Source code in src/nplinker/metabolomics/spectrum.py
def __reduce__(self) -> tuple:\n \"\"\"Reduce function for pickling.\"\"\"\n return (\n self.__class__,\n (self.id, self.mz, self.intensity, self.precursor_mz, self.rt, self.metadata),\n self.__dict__,\n )\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.has_strain","title":"has_strain","text":"has_strain(strain: Strain) -> bool\n
Check if the given strain exists in the spectrum.
Parameters:
strain
(Strain
) \u2013 Strain
object.
Returns:
bool
\u2013 True when the given strain exists in the spectrum.
src/nplinker/metabolomics/spectrum.py
def has_strain(self, strain: Strain) -> bool:\n \"\"\"Check if the given strain exists in the spectrum.\n\n Args:\n strain: `Strain` object.\n\n Returns:\n True when the given strain exists in the spectrum.\n \"\"\"\n return strain in self.strains\n
"},{"location":"api/metabolomics_abc/","title":"Abstract Base Classes","text":""},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc","title":"nplinker.metabolomics.abc","text":""},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.SpectrumLoaderBase","title":"SpectrumLoaderBase","text":" Bases: ABC
Abstract base class for SpectrumLoader.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.SpectrumLoaderBase.spectra","title":"spectraabstractmethod
property
","text":"spectra: list[Spectrum]\n
Get Spectrum objects.
Returns:
list[Spectrum]
\u2013 A sequence of Spectrum objects.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.MolecularFamilyLoaderBase","title":"MolecularFamilyLoaderBase","text":" Bases: ABC
Abstract base class for MolecularFamilyLoader.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.MolecularFamilyLoaderBase.get_mfs","title":"get_mfsabstractmethod
","text":"get_mfs(keep_singleton: bool) -> list[MolecularFamily]\n
Get MolecularFamily objects.
Parameters:
keep_singleton
(bool
) \u2013 True to keep singleton molecular families. A singleton molecular family is a molecular family that contains only one spectrum.
Returns:
list[MolecularFamily]
\u2013 A sequence of MolecularFamily objects.
src/nplinker/metabolomics/abc.py
@abstractmethod\ndef get_mfs(self, keep_singleton: bool) -> list[MolecularFamily]:\n \"\"\"Get MolecularFamily objects.\n\n Args:\n keep_singleton: True to keep singleton molecular families. A\n singleton molecular family is a molecular family that contains\n only one spectrum.\n\n Returns:\n A sequence of MolecularFamily objects.\n \"\"\"\n
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.FileMappingLoaderBase","title":"FileMappingLoaderBase","text":" Bases: ABC
Abstract base class for FileMappingLoader.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.FileMappingLoaderBase.mappings","title":"mappingsabstractmethod
property
","text":"mappings: dict[str, list[str]]\n
Get file mappings.
Returns:
dict[str, list[str]]
\u2013 A mapping from spectrum ID to the names of files where the spectrum occurs.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.AnnotationLoaderBase","title":"AnnotationLoaderBase","text":" Bases: ABC
Abstract base class for AnnotationLoader.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.AnnotationLoaderBase.annotations","title":"annotationsabstractmethod
property
","text":"annotations: dict[str, dict]\n
Get annotations.
Returns:
dict[str, dict]
\u2013 A mapping from spectrum ID to its annotations.
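These abstract base classes only fix the loader interface; the GNPS loaders above are the concrete implementations. A minimal sketch of a custom loader (hypothetical class, not part of NPLinker) only needs to implement the abstract members:
>>> from nplinker.metabolomics.abc import AnnotationLoaderBase
>>> class MyAnnotationLoader(AnnotationLoaderBase):
...     def __init__(self):
...         self._annotations = {\"100\": {\"Compound_Name\": \"example\"}}
...     @property
...     def annotations(self) -> dict[str, dict]:
...         return self._annotations
>>> MyAnnotationLoader().annotations[\"100\"][\"Compound_Name\"]
'example'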
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.add_annotation_to_spectrum","title":"add_annotation_to_spectrum","text":"add_annotation_to_spectrum(\n annotations: Mapping[str, dict],\n spectra: Sequence[Spectrum],\n) -> None\n
Add annotations to the Spectrum.gnps_annotations attribute for input spectra.
It is possible that some spectra don't have annotations.
Note: The input spectra list is changed in place.
Parameters:
annotations
(Mapping[str, dict]
) \u2013 A dictionary of GNPS annotations, where the keys are spectrum ids and the values are GNPS annotations.
spectra
(Sequence[Spectrum]
) \u2013 A list of Spectrum objects.
src/nplinker/metabolomics/utils.py
def add_annotation_to_spectrum(\n annotations: Mapping[str, dict], spectra: Sequence[Spectrum]\n) -> None:\n \"\"\"Add annotations to the `Spectrum.gnps_annotations` attribute for input spectra.\n\n It is possible that some spectra don't have annotations.\n\n !!! note\n The input `spectra` list is changed in place.\n\n Args:\n annotations: A dictionary of GNPS annotations, where the keys are\n spectrum ids and the values are GNPS annotations.\n spectra: A list of Spectrum objects.\n \"\"\"\n for spec in spectra:\n if spec.id in annotations:\n spec.gnps_annotations = annotations[spec.id]\n
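A usage sketch (assuming spectra is a list of Spectrum objects and using the GNPSAnnotationLoader shown earlier):
>>> annotations = GNPSAnnotationLoader(\"gnps_annotations.tsv\").annotations
>>> add_annotation_to_spectrum(annotations, spectra)  # updates spectra in place
>>> spectra[0].gnps_annotations  # stays an empty dict if this spectrum has no annotation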
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.add_strains_to_spectrum","title":"add_strains_to_spectrum","text":"add_strains_to_spectrum(\n strains: StrainCollection, spectra: Sequence[Spectrum]\n) -> tuple[list[Spectrum], list[Spectrum]]\n
Add Strain objects to the Spectrum.strains attribute for input spectra.
Note: The input spectra list is changed in place.
Parameters:
strains
(StrainCollection
) \u2013 A collection of strain objects.
spectra
(Sequence[Spectrum]
) \u2013 A list of Spectrum objects.
Returns:
tuple[list[Spectrum], list[Spectrum]]
\u2013 A tuple of two lists of Spectrum objects: the first list contains Spectrum objects that are updated with Strain objects; the second list contains Spectrum objects that are not updated because no Strain objects were found.
src/nplinker/metabolomics/utils.py
def add_strains_to_spectrum(\n strains: StrainCollection, spectra: Sequence[Spectrum]\n) -> tuple[list[Spectrum], list[Spectrum]]:\n \"\"\"Add `Strain` objects to the `Spectrum.strains` attribute for input spectra.\n\n !!! note\n The input `spectra` list is changed in place.\n\n Args:\n strains: A collection of strain objects.\n spectra: A list of Spectrum objects.\n\n Returns:\n A tuple of two lists of Spectrum objects,\n\n - the first list contains Spectrum objects that are updated with Strain objects;\n - the second list contains Spectrum objects that are not updated with Strain objects\n because no Strain objects are found.\n \"\"\"\n spectra_with_strains = []\n spectra_without_strains = []\n for spec in spectra:\n try:\n strain_list = strains.lookup(spec.id)\n except ValueError:\n spectra_without_strains.append(spec)\n continue\n\n for strain in strain_list:\n spec.strains.add(strain)\n spectra_with_strains.append(spec)\n\n logger.info(\n f\"{len(spectra_with_strains)} Spectrum objects updated with Strain objects.\\n\"\n f\"{len(spectra_without_strains)} Spectrum objects not updated with Strain objects.\"\n )\n\n return spectra_with_strains, spectra_without_strains\n
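A usage sketch (assuming strains is a StrainCollection loaded from the strain mappings file and spectra a list of Spectrum objects); the two returned lists partition the input:
>>> with_strains, without_strains = add_strains_to_spectrum(strains, spectra)
>>> len(with_strains) + len(without_strains) == len(spectra)
True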
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.add_spectrum_to_mf","title":"add_spectrum_to_mf","text":"add_spectrum_to_mf(\n spectra: Sequence[Spectrum],\n mfs: Sequence[MolecularFamily],\n) -> tuple[\n list[MolecularFamily],\n list[MolecularFamily],\n dict[MolecularFamily, set[str]],\n]\n
Add Spectrum objects to MolecularFamily objects.
The attribute MolecularFamily.spectra_ids contains the ids of Spectrum objects. These ids are used to find Spectrum objects from the input spectra list. The found Spectrum objects are added to the MolecularFamily.spectra attribute.
It is possible that some spectrum ids are not found in the input spectra list, and so their Spectrum objects are missing in the MolecularFamily object.
Note: The input mfs list is changed in place.
Parameters:
spectra
(Sequence[Spectrum]
) \u2013 A list of Spectrum objects.
mfs
(Sequence[MolecularFamily]
) \u2013 A list of MolecularFamily objects.
Returns:
tuple[list[MolecularFamily], list[MolecularFamily], dict[MolecularFamily, set[str]]]
\u2013 A tuple of three elements:
- the first list contains MolecularFamily objects that are updated with Spectrum objects;
- the second list contains MolecularFamily objects that are not updated with Spectrum objects (all Spectrum objects are missing);
- the third is a dictionary with MolecularFamily objects as keys and a set of ids of missing Spectrum objects as values.
src/nplinker/metabolomics/utils.py
def add_spectrum_to_mf(\n spectra: Sequence[Spectrum], mfs: Sequence[MolecularFamily]\n) -> tuple[list[MolecularFamily], list[MolecularFamily], dict[MolecularFamily, set[str]]]:\n \"\"\"Add Spectrum objects to MolecularFamily objects.\n\n The attribute `MolecularFamily.spectra_ids` contains the ids of `Spectrum` objects.\n These ids are used to find `Spectrum` objects from the input `spectra` list. The found `Spectrum`\n objects are added to the `MolecularFamily.spectra` attribute.\n\n It is possible that some spectrum ids are not found in the input `spectra` list, and so their\n `Spectrum` objects are missing in the `MolecularFamily` object.\n\n\n !!! note\n The input `mfs` list is changed in place.\n\n Args:\n spectra: A list of Spectrum objects.\n mfs: A list of MolecularFamily objects.\n\n Returns:\n A tuple of three elements,\n\n - the first list contains `MolecularFamily` objects that are updated with `Spectrum` objects\n - the second list contains `MolecularFamily` objects that are not updated with `Spectrum`\n objects (all `Spectrum` objects are missing).\n - the third is a dictionary containing `MolecularFamily` objects as keys and a set of ids\n of missing `Spectrum` objects as values.\n \"\"\"\n spec_dict = {spec.id: spec for spec in spectra}\n mf_with_spec = []\n mf_without_spec = []\n mf_missing_spec: dict[MolecularFamily, set[str]] = {}\n for mf in mfs:\n for spec_id in mf.spectra_ids:\n try:\n spec = spec_dict[spec_id]\n except KeyError:\n if mf not in mf_missing_spec:\n mf_missing_spec[mf] = {spec_id}\n else:\n mf_missing_spec[mf].add(spec_id)\n continue\n mf.add_spectrum(spec)\n\n if mf.spectra:\n mf_with_spec.append(mf)\n else:\n mf_without_spec.append(mf)\n\n logger.info(\n f\"{len(mf_with_spec)} MolecularFamily objects updated with Spectrum objects.\\n\"\n f\"{len(mf_without_spec)} MolecularFamily objects not updated with Spectrum objects.\\n\"\n f\"{len(mf_missing_spec)} MolecularFamily objects have missing Spectrum objects.\"\n )\n return mf_with_spec, mf_without_spec, mf_missing_spec\n
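For example (hypothetical ids; the second family references a spectrum id that is absent from the input list):
>>> mf1, mf2 = MolecularFamily(\"1\"), MolecularFamily(\"2\")
>>> mf1.spectra_ids.add(\"100\")
>>> mf2.spectra_ids.add(\"999\")
>>> spectra = [Spectrum(id=\"100\", mz=[100.0], intensity=[1.0], precursor_mz=150.0)]
>>> with_spec, without_spec, missing = add_spectrum_to_mf(spectra, [mf1, mf2])
>>> missing[mf2]
{'999'}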
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.extract_mappings_strain_id_ms_filename","title":"extract_mappings_strain_id_ms_filename","text":"extract_mappings_strain_id_ms_filename(\n podp_project_json_file: str | PathLike,\n) -> dict[str, set[str]]\n
Extract mappings \"strain_id <-> MS_filename\".
Parameters:
podp_project_json_file
(str | PathLike
) \u2013 The path to the PODP project JSON file.
Returns:
dict[str, set[str]]
\u2013 Key is strain id and value is a set of MS filenames.
The podp_project_json_file is the project JSON file downloaded from the PODP platform. For example, for project MSV000079284, its json file is https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.
src/nplinker/metabolomics/utils.py
def extract_mappings_strain_id_ms_filename(\n podp_project_json_file: str | PathLike,\n) -> dict[str, set[str]]:\n \"\"\"Extract mappings \"strain_id <-> MS_filename\".\n\n Args:\n podp_project_json_file: The path to the PODP project JSON file.\n\n Returns:\n Key is strain id and value is a set of MS filenames.\n\n Notes:\n The `podp_project_json_file` is the project JSON file downloaded from\n PODP platform. For example, for project MSV000079284, its json file is\n https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.\n\n See Also:\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n mappings_dict: dict[str, set[str]] = {}\n with open(podp_project_json_file, \"r\") as f:\n json_data = json.load(f)\n\n validate_podp_json(json_data)\n\n # Extract mappings strain id <-> metabolomics filename\n for record in json_data[\"genome_metabolome_links\"]:\n strain_id = record[\"genome_label\"]\n # get the actual filename of the mzXML URL\n filename = Path(record[\"metabolomics_file\"]).name\n if strain_id in mappings_dict:\n mappings_dict[strain_id].add(filename)\n else:\n mappings_dict[strain_id] = {filename}\n return mappings_dict\n
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.extract_mappings_ms_filename_spectrum_id","title":"extract_mappings_ms_filename_spectrum_id","text":"extract_mappings_ms_filename_spectrum_id(\n gnps_file_mappings_file: str | PathLike,\n) -> dict[str, set[str]]\n
Extract mappings \"MS_filename <-> spectrum_id\".
Parameters:
gnps_file_mappings_file
(str | PathLike
) \u2013 The path to the GNPS file mappings file (csv or tsv).
Returns:
dict[str, set[str]]
\u2013 Key is MS filename and value is a set of spectrum ids.
The gnps_file_mappings_file is downloaded from the GNPS website and named GNPS_FILE_MAPPINGS_TSV or GNPS_FILE_MAPPINGS_CSV. For more details, see GNPS data.
src/nplinker/metabolomics/utils.py
def extract_mappings_ms_filename_spectrum_id(\n gnps_file_mappings_file: str | PathLike,\n) -> dict[str, set[str]]:\n \"\"\"Extract mappings \"MS_filename <-> spectrum_id\".\n\n Args:\n gnps_file_mappings_file: The path to the GNPS file mappings file (csv or tsv).\n\n Returns:\n Key is MS filename and value is a set of spectrum ids.\n\n Notes:\n The `gnps_file_mappings_file` is downloaded from GNPS website and named as\n [GNPS_FILE_MAPPINGS_TSV][nplinker.defaults.GNPS_FILE_MAPPINGS_TSV] or\n [GNPS_FILE_MAPPINGS_CSV][nplinker.defaults.GNPS_FILE_MAPPINGS_CSV].\n For more details, see [GNPS data][gnps-data].\n\n See Also:\n - [GNPSFileMappingLoader][nplinker.metabolomics.gnps.gnps_file_mapping_loader.GNPSFileMappingLoader]:\n Load GNPS file mappings file.\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n loader = GNPSFileMappingLoader(gnps_file_mappings_file)\n return loader.mapping_reversed\n
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.get_mappings_strain_id_spectrum_id","title":"get_mappings_strain_id_spectrum_id","text":"get_mappings_strain_id_spectrum_id(\n mappings_strain_id_ms_filename: Mapping[str, set[str]],\n mappings_ms_filename_spectrum_id: Mapping[\n str, set[str]\n ],\n) -> dict[str, set[str]]\n
Get mappings \"strain_id <-> spectrum_id\".
Parameters:
mappings_strain_id_ms_filename
(Mapping[str, set[str]]
) \u2013 Mappings \"strain_id <-> MS_filename\".
mappings_ms_filename_spectrum_id
(Mapping[str, set[str]]
) \u2013 Mappings \"MS_filename <-> spectrum_id\".
Returns:
dict[str, set[str]]
\u2013 Key is strain id and value is a set of spectrum ids.
See Also:
- extract_mappings_strain_id_ms_filename: Extract mappings \"strain_id <-> MS_filename\".
- extract_mappings_ms_filename_spectrum_id: Extract mappings \"MS_filename <-> spectrum_id\".
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
def get_mappings_strain_id_spectrum_id(\n mappings_strain_id_ms_filename: Mapping[str, set[str]],\n mappings_ms_filename_spectrum_id: Mapping[str, set[str]],\n) -> dict[str, set[str]]:\n \"\"\"Get mappings \"strain_id <-> spectrum_id\".\n\n Args:\n mappings_strain_id_ms_filename: Mappings\n \"strain_id <-> MS_filename\".\n mappings_ms_filename_spectrum_id: Mappings\n \"MS_filename <-> spectrum_id\".\n\n Returns:\n Key is strain id and value is a set of spectrum ids.\n\n\n See Also:\n - `extract_mappings_strain_id_ms_filename`: Extract mappings \"strain_id <-> MS_filename\".\n - `extract_mappings_ms_filename_spectrum_id`: Extract mappings \"MS_filename <-> spectrum_id\".\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n mappings_dict = {}\n for strain_id, ms_filenames in mappings_strain_id_ms_filename.items():\n spectrum_ids = set()\n for ms_filename in ms_filenames:\n if (sid := mappings_ms_filename_spectrum_id.get(ms_filename)) is not None:\n spectrum_ids.update(sid)\n if spectrum_ids:\n mappings_dict[strain_id] = spectrum_ids\n return mappings_dict\n
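Chaining the three helpers (a sketch with hypothetical file paths) yields the strain-to-spectrum mappings used by the PODP pipeline:
>>> from nplinker.metabolomics.utils import (
...     extract_mappings_strain_id_ms_filename,
...     extract_mappings_ms_filename_spectrum_id,
...     get_mappings_strain_id_spectrum_id,
... )
>>> strain_to_file = extract_mappings_strain_id_ms_filename(\"podp_project.json\")
>>> file_to_spec = extract_mappings_ms_filename_spectrum_id(\"gnps_file_mappings.tsv\")
>>> strain_to_spec = get_mappings_strain_id_spectrum_id(strain_to_file, file_to_spec)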
"},{"location":"api/mibig/","title":"MiBIG","text":""},{"location":"api/mibig/#nplinker.genomics.mibig","title":"nplinker.genomics.mibig","text":""},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader","title":"MibigLoader","text":"MibigLoader(data_dir: str | PathLike)\n
Bases: BGCLoaderBase
Parse MIBiG metadata files and return BGC objects.
A MIBiG metadata file (json) contains annotations/metadata for each BGC. See https://mibig.secondarymetabolites.org/download.
The MiBIG accession is used as the BGC id and strain name. The loaded BGC objects have a Strain object as their strain attribute (i.e. BGC.strain).
Parameters:
data_dir
(str | PathLike
) \u2013 Path to the directory of MIBiG metadata json files
Examples:
>>> loader = MibigLoader(\"path/to/mibig/data/dir\")\n>>> loader.data_dir\n'path/to/mibig/data/dir'\n>>> loader.get_bgcs()\n[BGC('BGC000001', 'NRP'), BGC('BGC000002', 'Polyketide')]\n
Source code in src/nplinker/genomics/mibig/mibig_loader.py
def __init__(self, data_dir: str | PathLike):\n \"\"\"Initialize the MIBiG metadata loader.\n\n Args:\n data_dir: Path to the directory of MIBiG metadata json files\n\n Examples:\n >>> loader = MibigLoader(\"path/to/mibig/data/dir\")\n >>> loader.data_dir\n 'path/to/mibig/data/dir'\n >>> loader.get_bgcs()\n [BGC('BGC000001', 'NRP'), BGC('BGC000002', 'Polyketide')]\n \"\"\"\n self.data_dir = str(data_dir)\n self._file_dict = self.parse_data_dir(self.data_dir)\n self._metadata_dict = self._parse_metadata()\n self._bgcs = self._parse_bgcs()\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.data_dir","title":"data_dir instance-attribute
","text":"data_dir = str(data_dir)\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.get_files","title":"get_files","text":"get_files() -> dict[str, str]\n
Get the path of all MIBiG metadata json files.
Returns:
dict[str, str]
\u2013 The key is the metadata file name (BGC accession), and the value is the path to the metadata json file.
src/nplinker/genomics/mibig/mibig_loader.py
def get_files(self) -> dict[str, str]:\n \"\"\"Get the path of all MIBiG metadata json files.\n\n Returns:\n The key is metadata file name (BGC accession), and the value is path to the metadata\n json file\n \"\"\"\n return self._file_dict\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.parse_data_dir","title":"parse_data_dir staticmethod
","text":"parse_data_dir(data_dir: str | PathLike) -> dict[str, str]\n
Parse metadata directory and return paths to all metadata json files.
Parameters:
data_dir
(str | PathLike
) \u2013 path to the directory of MIBiG metadata json files
Returns:
dict[str, str]
\u2013 The key is the metadata file name (BGC accession), and the value is the path to the metadata json file.
src/nplinker/genomics/mibig/mibig_loader.py
@staticmethod\ndef parse_data_dir(data_dir: str | PathLike) -> dict[str, str]:\n \"\"\"Parse metadata directory and return paths to all metadata json files.\n\n Args:\n data_dir: path to the directory of MIBiG metadata json files\n\n Returns:\n The key is metadata file name (BGC accession), and the value is path to the metadata\n json file\n \"\"\"\n file_dict = {}\n json_files = list_files(data_dir, prefix=\"BGC\", suffix=\".json\")\n for file in json_files:\n fname = Path(file).stem\n file_dict[fname] = file\n return file_dict\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.get_metadata","title":"get_metadata","text":"get_metadata() -> dict[str, MibigMetadata]\n
Get MibigMetadata objects.
Returns:
dict[str, MibigMetadata]
\u2013 The key is BGC accession (file name) and the value is MibigMetadata object
src/nplinker/genomics/mibig/mibig_loader.py
def get_metadata(self) -> dict[str, MibigMetadata]:\n \"\"\"Get MibigMetadata objects.\n\n Returns:\n The key is BGC accession (file name) and the value is MibigMetadata object\n \"\"\"\n return self._metadata_dict\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.get_bgcs","title":"get_bgcs","text":"get_bgcs() -> list[BGC]\n
Get BGC objects.
The BGC objects use the MiBIG accession as id and have a Strain object as their strain attribute (i.e. BGC.strain), where the name of the Strain object is also the MiBIG accession.
Returns:
list[BGC]
\u2013 A list of BGC objects
src/nplinker/genomics/mibig/mibig_loader.py
def get_bgcs(self) -> list[BGC]:\n \"\"\"Get BGC objects.\n\n The BGC objects use MiBIG accession as id and have Strain object as\n their strain attribute (i.e. `BGC.strain`), where the name of the Strain\n object is also MiBIG accession.\n\n Returns:\n A list of BGC objects\n \"\"\"\n return self._bgcs\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata","title":"MibigMetadata","text":"MibigMetadata(file: str | PathLike)\n
Class to model the BGC metadata/annotations defined in MIBiG.
MIBiG is a specification of BGC metadata and uses JSON schema to represent it. For more details see: https://mibig.secondarymetabolites.org/download.
Parameters:
file
(str | PathLike
) \u2013 Path to the json file of MIBiG BGC metadata
Examples:
>>> metadata = MibigMetadata(\"/data/BGC0000001.json\")\n
Source code in src/nplinker/genomics/mibig/mibig_metadata.py
def __init__(self, file: str | PathLike) -> None:\n \"\"\"Initialize the MIBiG metadata object.\n\n Args:\n file: Path to the json file of MIBiG BGC metadata\n\n Examples:\n >>> metadata = MibigMetadata(\"/data/BGC0000001.json\")\n \"\"\"\n self.file = str(file)\n with open(self.file, \"rb\") as f:\n self.metadata = json.load(f)\n\n self._mibig_accession: str\n self._biosyn_class: tuple[str]\n self._parse_metadata()\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata.file","title":"file instance-attribute
","text":"file = str(file)\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata.metadata","title":"metadata instance-attribute
","text":"metadata = load(f)\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata.mibig_accession","title":"mibig_accession property
","text":"mibig_accession: str\n
Get the value of metadata item 'mibig_accession'.
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata.biosyn_class","title":"biosyn_classproperty
","text":"biosyn_class: tuple[str]\n
Get the value of metadata item 'biosyn_class'.
The 'biosyn_class' is the biosynthetic class(es), namely the type of natural product or secondary metabolite.
MIBiG defines 6 major biosynthetic classes for natural products: NRP, Polyketide, RiPP, Terpene, Saccharide and Alkaloid. Natural products created by other biosynthetic mechanisms fall under the category Other. For more details see the paper.
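For illustration, the two properties can be read as follows (a sketch; the actual class tuple depends on the metadata file):
>>> metadata = MibigMetadata(\"/data/BGC0000001.json\")
>>> metadata.mibig_accession
'BGC0000001'
>>> metadata.biosyn_class  # e.g. ('NRP',); a tuple, since a BGC may have several classes
('NRP',)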
"},{"location":"api/mibig/#nplinker.genomics.mibig.download_and_extract_mibig_metadata","title":"download_and_extract_mibig_metadata","text":"download_and_extract_mibig_metadata(\n download_root: str | PathLike,\n extract_path: str | PathLike,\n version: str = \"3.1\",\n)\n
Download and extract MIBiG metadata json files.
Note that it does not matter whether the metadata json files are nested in folders inside the archive; all json files will be extracted to the same location, i.e. extract_path. Nested folders will be removed if they exist, so extract_path will contain only json files.
Parameters:
download_root
(str | PathLike
) \u2013 Path to the directory in which to place the downloaded archive.
extract_path
(str | PathLike
) \u2013 Path to an empty directory where the json files will be extracted. The directory must be empty if it exists. If it doesn't exist, the directory will be created.
version
(str
, default: '3.1'
) \u2013 The version of the MIBiG metadata to download. Defaults to \"3.1\".
Examples:
>>> download_and_extract_mibig_metadata(\"/data/download\", \"/data/mibig_metadata\")\n
Source code in src/nplinker/genomics/mibig/mibig_downloader.py
def download_and_extract_mibig_metadata(\n download_root: str | os.PathLike,\n extract_path: str | os.PathLike,\n version: str = \"3.1\",\n):\n \"\"\"Download and extract MIBiG metadata json files.\n\n Note that it does not matter whether the metadata json files are in nested folders or not in the archive,\n all json files will be extracted to the same location, i.e. `extract_path`. The nested\n folders will be removed if they exist. So the `extract_path` will have only json files.\n\n Args:\n download_root: Path to the directory in which to place the downloaded archive.\n extract_path: Path to an empty directory where the json files will be extracted.\n The directory must be empty if it exists. If it doesn't exist, the directory will be created.\n version: _description_. Defaults to \"3.1\".\n\n Examples:\n >>> download_and_extract_mibig_metadata(\"/data/download\", \"/data/mibig_metadata\")\n \"\"\"\n download_root = Path(download_root)\n extract_path = Path(extract_path)\n\n if download_root == extract_path:\n raise ValueError(\"Identical path of download directory and extract directory\")\n\n # check if extract_path is empty\n if not extract_path.exists():\n extract_path.mkdir(parents=True)\n else:\n if len(list(extract_path.iterdir())) != 0:\n raise ValueError(f'Nonempty directory: \"{extract_path}\"')\n\n # download and extract\n md5 = _MD5_MIBIG_METADATA[version]\n download_and_extract_archive(\n url=MIBIG_METADATA_URL.format(version=version),\n download_root=download_root,\n extract_root=extract_path,\n md5=md5,\n )\n\n # After extracting mibig archive, it's either one dir or many json files,\n # if it's a dir, then move all json files from it to extract_path\n subdirs = list_dirs(extract_path)\n if len(subdirs) > 1:\n raise ValueError(f\"Expected one extracted directory, got {len(subdirs)}\")\n\n if len(subdirs) == 1:\n subdir_path = subdirs[0]\n for fname in list_files(subdir_path, prefix=\"BGC\", suffix=\".json\", keep_parent=False):\n shutil.move(os.path.join(subdir_path, fname), os.path.join(extract_path, fname))\n # delete subdir\n if subdir_path != extract_path:\n shutil.rmtree(subdir_path)\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.parse_bgc_metadata_json","title":"parse_bgc_metadata_json","text":"parse_bgc_metadata_json(file: str | PathLike) -> BGC\n
Parse MIBiG metadata file and return BGC object.
Note that the MiBIG accession is used as the BGC id and strain name. The BGC object has a Strain object as its strain attribute.
Parameters:
file
(str | PathLike
) \u2013 Path to the MIBiG metadata json file
Returns:
BGC
\u2013 BGC object
src/nplinker/genomics/mibig/mibig_loader.py
def parse_bgc_metadata_json(file: str | PathLike) -> BGC:\n \"\"\"Parse MIBiG metadata file and return BGC object.\n\n Note that the MiBIG accession is used as the BGC id and strain name. The BGC\n object has Strain object as its strain attribute.\n\n Args:\n file: Path to the MIBiG metadata json file\n\n Returns:\n BGC object\n \"\"\"\n metadata = MibigMetadata(str(file))\n mibig_bgc = BGC(metadata.mibig_accession, *metadata.biosyn_class)\n mibig_bgc.mibig_bgc_class = metadata.biosyn_class\n mibig_bgc.strain = Strain(metadata.mibig_accession)\n return mibig_bgc\n
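A usage sketch (hypothetical path):
>>> bgc = parse_bgc_metadata_json(\"/data/BGC0000001.json\")
>>> bgc.id  # the MiBIG accession doubles as the BGC id
'BGC0000001'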
"},{"location":"api/nplinker/","title":"NPLinker","text":""},{"location":"api/nplinker/#nplinker","title":"nplinker","text":""},{"location":"api/nplinker/#nplinker.NPLinker","title":"NPLinker","text":"NPLinker(config_file: str | PathLike)\n
The central class of the NPLinker application.
Attributes:
config
(Dynaconf
) \u2013 The configuration object for the current NPLinker application.
root_dir
(str
) \u2013 The path to the root directory of the current NPLinker application.
output_dir
(str
) \u2013 The path to the output directory of the current NPLinker application.
bgcs
(list[BGC]
) \u2013 A list of all BGC objects.
gcfs
(list[GCF]
) \u2013 A list of all GCF objects.
spectra
(list[Spectrum]
) \u2013 A list of all Spectrum objects.
mfs
(list[MolecularFamily]
) \u2013 A list of all MolecularFamily objects.
mibig_bgcs
(list[BGC]
) \u2013 A list of all MiBIG BGC objects.
strains
(StrainCollection
) \u2013 A StrainCollection object containing all Strain objects.
product_types
(list[str]
) \u2013 A list of all BiGSCAPE product types.
scoring_methods
(list[str]
) \u2013 A list of all valid scoring methods.
Parameters:
config_file
(str | PathLike
) \u2013 Path to the configuration file to use.
Examples:
Starting the NPLinker application:
>>> from nplinker import NPLinker\n>>> npl = NPLinker(\"path/to/config.toml\")\n
Loading data from files to python objects:
>>> npl.load_data()\n
Checking the number of GCF objects:
>>> len(npl.gcfs)\n
Getting the links for all GCF objects using the Metcalf scoring method, and the result is stored in a LinkGraph object:
>>> lg = npl.get_links(npl.gcfs, \"metcalf\")\n
Getting the link data between two objects:
>>> link_data = lg.get_link_data(npl.gcfs[0], npl.spectra[0])\n{\"metcalf\": Score(\"metcalf\", 1.0, {\"cutoff\": 0, \"standardised\": False})}\n
Saving the data to a pickle file:
>>> npl.save_data(\"path/to/output.pkl\", lg)\n
Source code in src/nplinker/nplinker.py
def __init__(self, config_file: str | PathLike):\n \"\"\"Initialise an NPLinker instance.\n\n Args:\n config_file: Path to the configuration file to use.\n\n\n Examples:\n Starting the NPLinker application:\n >>> from nplinker import NPLinker\n >>> npl = NPLinker(\"path/to/config.toml\")\n\n Loading data from files to python objects:\n >>> npl.load_data()\n\n Checking the number of GCF objects:\n >>> len(npl.gcfs)\n\n Getting the links for all GCF objects using the Metcalf scoring method, and the result\n is stored in a [LinkGraph][nplinker.scoring.LinkGraph] object:\n >>> lg = npl.get_links(npl.gcfs, \"metcalf\")\n\n Getting the link data between two objects:\n >>> link_data = lg.get_link_data(npl.gcfs[0], npl.spectra[0])\n {\"metcalf\": Score(\"metcalf\", 1.0, {\"cutoff\": 0, \"standardised\": False})}\n\n Saving the data to a pickle file:\n >>> npl.save_data(\"path/to/output.pkl\", lg)\n \"\"\"\n # Load the configuration file\n self.config: Dynaconf = load_config(config_file)\n\n # Setup logging for the application\n setup_logging(\n level=self.config.log.level,\n file=self.config.log.get(\"file\", \"\"),\n use_console=self.config.log.use_console,\n )\n logger.info(\n \"Configuration:\\n %s\", pformat(self.config.as_dict(), width=20, sort_dicts=False)\n )\n\n # Setup the output directory\n self._output_dir = self.config.root_dir / OUTPUT_DIRNAME\n self._output_dir.mkdir(exist_ok=True)\n\n # Initialise data containers that will be populated by the `load_data` method\n self._bgc_dict: dict[str, BGC] = {}\n self._gcf_dict: dict[str, GCF] = {}\n self._spec_dict: dict[str, Spectrum] = {}\n self._mf_dict: dict[str, MolecularFamily] = {}\n self._mibig_bgcs: list[BGC] = []\n self._strains: StrainCollection = StrainCollection()\n self._product_types: list = []\n self._chem_classes = None # TODO: to be refactored\n self._class_matches = None # TODO: to be refactored\n\n # Flags to keep track of whether the scoring methods have been set up\n self._scoring_methods_setup_done = {name: False for name in self._valid_scoring_methods}\n
"},{"location":"api/nplinker/#nplinker.NPLinker.config","title":"config instance-attribute
","text":"config: Dynaconf = load_config(config_file)\n
"},{"location":"api/nplinker/#nplinker.NPLinker.root_dir","title":"root_dir property
","text":"root_dir: str\n
Get the path to the root directory of the current NPLinker instance.
"},{"location":"api/nplinker/#nplinker.NPLinker.output_dir","title":"output_dirproperty
","text":"output_dir: str\n
Get the path to the output directory of the current NPLinker instance.
"},{"location":"api/nplinker/#nplinker.NPLinker.bgcs","title":"bgcsproperty
","text":"bgcs: list[BGC]\n
Get all BGC objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.gcfs","title":"gcfsproperty
","text":"gcfs: list[GCF]\n
Get all GCF objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.spectra","title":"spectraproperty
","text":"spectra: list[Spectrum]\n
Get all Spectrum objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.mfs","title":"mfsproperty
","text":"mfs: list[MolecularFamily]\n
Get all MolecularFamily objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.mibig_bgcs","title":"mibig_bgcsproperty
","text":"mibig_bgcs: list[BGC]\n
Get all MiBIG BGC objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.strains","title":"strainsproperty
","text":"strains: StrainCollection\n
Get all Strain objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.product_types","title":"product_typesproperty
","text":"product_types: list[str]\n
Get all BiGSCAPE product types.
"},{"location":"api/nplinker/#nplinker.NPLinker.chem_classes","title":"chem_classesproperty
","text":"chem_classes\n
Returns loaded ChemClassPredictions with the class predictions.
"},{"location":"api/nplinker/#nplinker.NPLinker.class_matches","title":"class_matchesproperty
","text":"class_matches\n
ClassMatches with the matched classes and scoring tables from MIBiG.
"},{"location":"api/nplinker/#nplinker.NPLinker.scoring_methods","title":"scoring_methodsproperty
","text":"scoring_methods: list[str]\n
Get names of all valid scoring methods.
"},{"location":"api/nplinker/#nplinker.NPLinker.load_data","title":"load_data","text":"load_data()\n
Load all data from files into memory.
This method is a convenience function that calls the DatasetArranger
class to arrange data files (download, generate and/or validate data) in the correct directory structure, and then calls the DatasetLoader
class to load all data from the files into memory.
The loaded data is stored in various data containers for easy access, e.g. self.bgcs
for all BGC objects, self.strains
for all Strain objects, etc.
src/nplinker/nplinker.py
def load_data(self):
    """Load all data from files into memory.

    This method is a convenience function that calls the
    [`DatasetArranger`][nplinker.arranger.DatasetArranger] class to arrange data files
    (download, generate and/or validate data) in the [correct directory structure][working-directory-structure],
    and then calls the [`DatasetLoader`][nplinker.loader.DatasetLoader] class to load all data
    from the files into memory.

    The loaded data is stored in various data containers for easy access, e.g.
    [`self.bgcs`][nplinker.NPLinker.bgcs] for all BGC objects,
    [`self.strains`][nplinker.NPLinker.strains] for all Strain objects, etc.
    """
    arranger = DatasetArranger(self.config)
    arranger.arrange()
    loader = DatasetLoader(self.config)
    loader.load()

    self._bgc_dict = {bgc.id: bgc for bgc in loader.bgcs}
    self._gcf_dict = {gcf.id: gcf for gcf in loader.gcfs}
    self._spec_dict = {spec.id: spec for spec in loader.spectra}
    self._mf_dict = {mf.id: mf for mf in loader.mfs}

    self._mibig_bgcs = loader.mibig_bgcs
    self._strains = loader.strains
    self._product_types = loader.product_types
    self._chem_classes = loader.chem_classes
    self._class_matches = loader.class_matches
"},{"location":"api/nplinker/#nplinker.NPLinker.get_links","title":"get_links","text":"get_links(\n objects: (\n Sequence[BGC]\n | Sequence[GCF]\n | Sequence[Spectrum]\n | Sequence[MolecularFamily]\n ),\n scoring_method: str,\n **scoring_params: Any\n) -> LinkGraph\n
Get links for the given objects using the specified scoring method and parameters.
Parameters:
objects (Sequence[BGC] | Sequence[GCF] | Sequence[Spectrum] | Sequence[MolecularFamily]) – A sequence of objects to get links for. The objects must be of the same type, i.e. BGC, GCF, Spectrum or MolecularFamily type.
Warning: for scoring method metcalf, the BGC objects are not supported.
scoring_method (str) – The scoring method to use. Must be one of the valid scoring methods self.scoring_methods, such as metcalf.
scoring_params (Any, default: {}) – Parameters to pass to the scoring method. If not given, the default parameters of the specified scoring method will be used. Check the get_links method of the scoring method class for the available parameters and their default values.

| Scoring Method | Scoring Parameters |
| -------------- | ------------------ |
| metcalf | cutoff, standardised |

Returns:
LinkGraph – A LinkGraph object containing the links for the given objects.
Raises:
ValueError – If input objects are empty or if the scoring method is invalid.
TypeError – If the input objects are not of the same type or if the object type is invalid.
Examples:
Using default scoring parameters:
>>> lg = npl.get_links(npl.gcfs, "metcalf")
Scoring parameters provided:
>>> lg = npl.get_links(npl.gcfs, "metcalf", cutoff=0.5, standardised=True)
Source code in src/nplinker/nplinker.py
def get_links(
    self,
    objects: Sequence[BGC] | Sequence[GCF] | Sequence[Spectrum] | Sequence[MolecularFamily],
    scoring_method: str,
    **scoring_params: Any,
) -> LinkGraph:
    """Get links for the given objects using the specified scoring method and parameters.

    Args:
        objects: A sequence of objects to get links for. The objects must be of the same
            type, i.e. `BGC`, `GCF`, `Spectrum` or `MolecularFamily` type.
            !!! Warning
                For scoring method `metcalf`, the `BGC` objects are not supported.
        scoring_method: The scoring method to use. Must be one of the valid scoring methods
            [`self.scoring_methods`][nplinker.NPLinker.scoring_methods], such as `metcalf`.
        scoring_params: Parameters to pass to the scoring method. If not given, the default
            parameters of the specified scoring method will be used.

            Check the `get_links` method of the scoring method class for the available
            parameters and their default values.

            | Scoring Method | Scoring Parameters |
            | -------------- | ------------------ |
            | `metcalf` | [`cutoff`, `standardised`][nplinker.scoring.MetcalfScoring.get_links] |

    Returns:
        A LinkGraph object containing the links for the given objects.

    Raises:
        ValueError: If input objects are empty or if the scoring method is invalid.
        TypeError: If the input objects are not of the same type or if the object type is invalid.

    Examples:
        Using default scoring parameters:
        >>> lg = npl.get_links(npl.gcfs, "metcalf")

        Scoring parameters provided:
        >>> lg = npl.get_links(npl.gcfs, "metcalf", cutoff=0.5, standardised=True)
    """
    # Validate objects
    if len(objects) == 0:
        raise ValueError("No objects provided to get links for")
    # check if all objects are of the same type
    types = {type(i) for i in objects}
    if len(types) > 1:
        raise TypeError("Input objects must be of the same type.")
    # check if the object type is valid
    obj_type = next(iter(types))
    if obj_type not in (BGC, GCF, Spectrum, MolecularFamily):
        raise TypeError(
            f"Invalid type {obj_type}. Input objects must be BGC, GCF, Spectrum or MolecularFamily objects."
        )

    # Validate scoring method
    if scoring_method not in self._valid_scoring_methods:
        raise ValueError(f"Invalid scoring method {scoring_method}.")

    # Check if the scoring method has been set up
    if not self._scoring_methods_setup_done[scoring_method]:
        self._valid_scoring_methods[scoring_method].setup(self)
        self._scoring_methods_setup_done[scoring_method] = True

    # Initialise the scoring method
    scoring = self._valid_scoring_methods[scoring_method]()

    return scoring.get_links(*objects, **scoring_params)
"},{"location":"api/nplinker/#nplinker.NPLinker.lookup_bgc","title":"lookup_bgc","text":"lookup_bgc(id: str) -> BGC | None\n
Get the BGC object with the given ID.
Parameters:
id (str) – the ID of the BGC to look up.
Returns:
BGC | None – The BGC object with the given ID, or None if no such object exists.
Examples:
>>> bgc = npl.lookup_bgc("BGC000001")
>>> bgc
BGC(id="BGC000001", ...)
Source code in src/nplinker/nplinker.py
def lookup_bgc(self, id: str) -> BGC | None:
    """Get the BGC object with the given ID.

    Args:
        id: the ID of the BGC to look up.

    Returns:
        The BGC object with the given ID, or None if no such object exists.

    Examples:
        >>> bgc = npl.lookup_bgc("BGC000001")
        >>> bgc
        BGC(id="BGC000001", ...)
    """
    return self._bgc_dict.get(id, None)
"},{"location":"api/nplinker/#nplinker.NPLinker.lookup_gcf","title":"lookup_gcf","text":"lookup_gcf(id: str) -> GCF | None\n
Get the GCF object with the given ID.
Parameters:
id (str) – the ID of the GCF to look up.
Returns:
GCF | None – The GCF object with the given ID, or None if no such object exists.
Source code in src/nplinker/nplinker.py
def lookup_gcf(self, id: str) -> GCF | None:
    """Get the GCF object with the given ID.

    Args:
        id: the ID of the GCF to look up.

    Returns:
        The GCF object with the given ID, or None if no such object exists.
    """
    return self._gcf_dict.get(id, None)
"},{"location":"api/nplinker/#nplinker.NPLinker.lookup_spectrum","title":"lookup_spectrum","text":"lookup_spectrum(id: str) -> Spectrum | None\n
Get the Spectrum object with the given ID.
Parameters:
id (str) – the ID of the Spectrum to look up.
Returns:
Spectrum | None – The Spectrum object with the given ID, or None if no such object exists.
Source code in src/nplinker/nplinker.py
def lookup_spectrum(self, id: str) -> Spectrum | None:
    """Get the Spectrum object with the given ID.

    Args:
        id: the ID of the Spectrum to look up.

    Returns:
        The Spectrum object with the given ID, or None if no such object exists.
    """
    return self._spec_dict.get(id, None)
"},{"location":"api/nplinker/#nplinker.NPLinker.lookup_mf","title":"lookup_mf","text":"lookup_mf(id: str) -> MolecularFamily | None\n
Get the MolecularFamily object with the given ID.
Parameters:
id (str) – the ID of the MolecularFamily to look up.
Returns:
MolecularFamily | None – The MolecularFamily object with the given ID, or None if no such object exists.
Source code in src/nplinker/nplinker.py
def lookup_mf(self, id: str) -> MolecularFamily | None:
    """Get the MolecularFamily object with the given ID.

    Args:
        id: the ID of the MolecularFamily to look up.

    Returns:
        The MolecularFamily object with the given ID, or None if no such object exists.
    """
    return self._mf_dict.get(id, None)
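For quick reference, a minimal sketch combining the lookup methods above; it assumes npl is an NPLinker instance on which load_data() has already been called, and the IDs shown are hypothetical:

# A minimal sketch, assuming `npl` is a loaded NPLinker instance.
# The IDs ("GCF_1", "spectrum_123", "MF_5") are hypothetical; real IDs depend on your dataset.
gcf = npl.lookup_gcf("GCF_1")
spectrum = npl.lookup_spectrum("spectrum_123")
mf = npl.lookup_mf("MF_5")

# Each lookup returns None for an unknown ID, so guard before use.
for obj in (gcf, spectrum, mf):
    if obj is not None:
        print(obj)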
"},{"location":"api/nplinker/#nplinker.NPLinker.save_data","title":"save_data","text":"save_data(\n file: str | PathLike, links: LinkGraph | None = None\n) -> None\n
Pickle data to a file.
The pickled data is a tuple of BGCs, GCFs, Spectra, MolecularFamilies, StrainCollection and links, i.e. (bgcs, gcfs, spectra, mfs, strains, links).
Parameters:
file (str | PathLike) – The path to the pickle file to save the data to.
links (LinkGraph | None, default: None) – The LinkGraph object to save.
Examples:
Saving the data to a pickle file, links data is None:
>>> npl.save_data("path/to/output.pkl")
Also saving the links data:
>>> lg = npl.get_links(npl.gcfs, "metcalf")
>>> npl.save_data("path/to/output.pkl", lg)
Source code in src/nplinker/nplinker.py
def save_data(
    self,
    file: str | PathLike,
    links: LinkGraph | None = None,
) -> None:
    """Pickle data to a file.

    The pickled data is a tuple of BGCs, GCFs, Spectra, MolecularFamilies, StrainCollection and
    links, i.e. `(bgcs, gcfs, spectra, mfs, strains, links)`.

    Args:
        file: The path to the pickle file to save the data to.
        links: The LinkGraph object to save.

    Examples:
        Saving the data to a pickle file, links data is `None`:
        >>> npl.save_data("path/to/output.pkl")

        Also saving the links data:
        >>> lg = npl.get_links(npl.gcfs, "metcalf")
        >>> npl.save_data("path/to/output.pkl", lg)
    """
    data = (self.bgcs, self.gcfs, self.spectra, self.mfs, self.strains, links)
    with open(file, "wb") as f:
        pickle.dump(data, f)
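Because save_data writes a plain pickle, the file can be read back with the standard library pickle module; a minimal sketch (the output path is hypothetical):

import pickle

# The pickled object is the tuple `(bgcs, gcfs, spectra, mfs, strains, links)`,
# in the order documented above. `links` is None if no LinkGraph was saved.
with open("path/to/output.pkl", "rb") as f:
    bgcs, gcfs, spectra, mfs, strains, links = pickle.load(f)

print(len(bgcs), len(gcfs), len(spectra), len(mfs))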
"},{"location":"api/nplinker/#nplinker.setup_logging","title":"setup_logging","text":"setup_logging(\n level: str = \"INFO\",\n file: str = \"\",\n use_console: bool = True,\n) -> None\n
Setup logging configuration for the ancestor logger \"nplinker\".
Usage Documentation: How to setup logging
Parameters:
level (str, default: 'INFO') – The log level, use the logging module's log level constants. Valid levels are: NOTSET, DEBUG, INFO, WARNING, ERROR, CRITICAL.
file (str, default: '') – The file to write the log to. If the file is an empty string (by default), the log will not be written to a file. If the file does not exist, it will be created. The log will be written to the file in append mode.
use_console (bool, default: True) – Whether to log to the console.
Source code in src/nplinker/logger.py
def setup_logging(level: str = "INFO", file: str = "", use_console: bool = True) -> None:
    """Setup logging configuration for the ancestor logger "nplinker".

    ??? info "Usage Documentation"
        [How to setup logging][how-to-setup-logging]

    Args:
        level: The log level, use the logging module's log level constants.
            Valid levels are: `NOTSET`, `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
        file: The file to write the log to.
            If the file is an empty string (by default), the log will not be written to a file.
            If the file does not exist, it will be created.
            The log will be written to the file in append mode.
        use_console: Whether to log to the console.
    """
    # Get the ancestor logger "nplinker"
    logger = logging.getLogger("nplinker")
    logger.setLevel(level)

    # File handler
    if file:
        logger.addHandler(
            RichHandler(
                console=Console(file=open(file, "a"), width=120),  # force the line width to 120
                omit_repeated_times=False,
                rich_tracebacks=True,
                tracebacks_show_locals=True,
                log_time_format="[%Y-%m-%d %X]",
            )
        )

    # Console handler
    if use_console:
        logger.addHandler(
            RichHandler(
                omit_repeated_times=False,
                rich_tracebacks=True,
                tracebacks_show_locals=True,
                log_time_format="[%Y-%m-%d %X]",
            )
        )
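A short usage sketch for library users, based on the parameters documented above (the log file name is arbitrary):

import logging

from nplinker import setup_logging

# Log DEBUG and above to the console and append to a file.
setup_logging(level="DEBUG", file="nplinker.log", use_console=True)

# All loggers under the "nplinker" ancestor logger inherit this configuration.
logging.getLogger("nplinker").info("Logging is configured.")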
"},{"location":"api/nplinker/#nplinker.defaults","title":"nplinker.defaults","text":""},{"location":"api/nplinker/#nplinker.defaults.NPLINKER_APP_DATA_DIR","title":"NPLINKER_APP_DATA_DIR module-attribute
","text":"NPLINKER_APP_DATA_DIR: Final = parent / 'data'\n
"},{"location":"api/nplinker/#nplinker.defaults.STRAIN_MAPPINGS_FILENAME","title":"STRAIN_MAPPINGS_FILENAME module-attribute
","text":"STRAIN_MAPPINGS_FILENAME: Final = 'strain_mappings.json'\n
"},{"location":"api/nplinker/#nplinker.defaults.GENOME_BGC_MAPPINGS_FILENAME","title":"GENOME_BGC_MAPPINGS_FILENAME module-attribute
","text":"GENOME_BGC_MAPPINGS_FILENAME: Final = (\n \"genome_bgc_mappings.json\"\n)\n
"},{"location":"api/nplinker/#nplinker.defaults.GENOME_STATUS_FILENAME","title":"GENOME_STATUS_FILENAME module-attribute
","text":"GENOME_STATUS_FILENAME: Final = 'genome_status.json'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_SPECTRA_FILENAME","title":"GNPS_SPECTRA_FILENAME module-attribute
","text":"GNPS_SPECTRA_FILENAME: Final = 'spectra.mgf'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_MOLECULAR_FAMILY_FILENAME","title":"GNPS_MOLECULAR_FAMILY_FILENAME module-attribute
","text":"GNPS_MOLECULAR_FAMILY_FILENAME: Final = (\n \"molecular_families.tsv\"\n)\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_ANNOTATIONS_FILENAME","title":"GNPS_ANNOTATIONS_FILENAME module-attribute
","text":"GNPS_ANNOTATIONS_FILENAME: Final = 'annotations.tsv'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_FILE_MAPPINGS_TSV","title":"GNPS_FILE_MAPPINGS_TSV module-attribute
","text":"GNPS_FILE_MAPPINGS_TSV: Final = 'file_mappings.tsv'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_FILE_MAPPINGS_CSV","title":"GNPS_FILE_MAPPINGS_CSV module-attribute
","text":"GNPS_FILE_MAPPINGS_CSV: Final = 'file_mappings.csv'\n
"},{"location":"api/nplinker/#nplinker.defaults.STRAINS_SELECTED_FILENAME","title":"STRAINS_SELECTED_FILENAME module-attribute
","text":"STRAINS_SELECTED_FILENAME: Final = 'strains_selected.json'\n
"},{"location":"api/nplinker/#nplinker.defaults.DOWNLOADS_DIRNAME","title":"DOWNLOADS_DIRNAME module-attribute
","text":"DOWNLOADS_DIRNAME: Final = 'downloads'\n
"},{"location":"api/nplinker/#nplinker.defaults.MIBIG_DIRNAME","title":"MIBIG_DIRNAME module-attribute
","text":"MIBIG_DIRNAME: Final = 'mibig'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_DIRNAME","title":"GNPS_DIRNAME module-attribute
","text":"GNPS_DIRNAME: Final = 'gnps'\n
"},{"location":"api/nplinker/#nplinker.defaults.ANTISMASH_DIRNAME","title":"ANTISMASH_DIRNAME module-attribute
","text":"ANTISMASH_DIRNAME: Final = 'antismash'\n
"},{"location":"api/nplinker/#nplinker.defaults.BIGSCAPE_DIRNAME","title":"BIGSCAPE_DIRNAME module-attribute
","text":"BIGSCAPE_DIRNAME: Final = 'bigscape'\n
"},{"location":"api/nplinker/#nplinker.defaults.BIGSCAPE_RUNNING_OUTPUT_DIRNAME","title":"BIGSCAPE_RUNNING_OUTPUT_DIRNAME module-attribute
","text":"BIGSCAPE_RUNNING_OUTPUT_DIRNAME: Final = (\n \"bigscape_running_output\"\n)\n
"},{"location":"api/nplinker/#nplinker.defaults.OUTPUT_DIRNAME","title":"OUTPUT_DIRNAME module-attribute
","text":"OUTPUT_DIRNAME: Final = 'output'\n
"},{"location":"api/nplinker/#nplinker.config","title":"nplinker.config","text":""},{"location":"api/nplinker/#nplinker.config.CONFIG_VALIDATORS","title":"CONFIG_VALIDATORS module-attribute
","text":"CONFIG_VALIDATORS = [\n Validator(\n \"root_dir\",\n required=True,\n cast=transform_to_full_path,\n condition=lambda v: is_dir(),\n ),\n Validator(\n \"mode\",\n required=True,\n cast=lambda v: lower(),\n is_in=[\"local\", \"podp\"],\n ),\n Validator(\n \"podp_id\",\n required=True,\n when=Validator(\"mode\", eq=\"podp\"),\n ),\n Validator(\n \"podp_id\",\n required=False,\n when=Validator(\"mode\", eq=\"local\"),\n ),\n Validator(\n \"log.level\",\n is_type_of=str,\n cast=lambda v: upper(),\n is_in=[\n \"NOTSET\",\n \"DEBUG\",\n \"INFO\",\n \"WARNING\",\n \"ERROR\",\n \"CRITICAL\",\n ],\n ),\n Validator(\"log.file\", is_type_of=str),\n Validator(\"log.use_console\", is_type_of=bool),\n Validator(\n \"mibig.to_use\", required=True, is_type_of=bool\n ),\n Validator(\n \"mibig.version\",\n required=True,\n is_type_of=str,\n when=Validator(\"mibig.to_use\", eq=True),\n ),\n Validator(\n \"bigscape.parameters\", required=True, is_type_of=str\n ),\n Validator(\n \"bigscape.cutoff\", required=True, is_type_of=str\n ),\n Validator(\n \"bigscape.version\", required=True, is_type_of=int\n ),\n Validator(\n \"scoring.methods\",\n required=True,\n cast=lambda v: [lower() for i in v],\n is_type_of=list,\n len_min=1,\n condition=lambda v: issubset(\n {\"metcalf\", \"rosetta\"}\n ),\n ),\n]\n
"},{"location":"api/nplinker/#nplinker.config.load_config","title":"load_config","text":"load_config(config_file: str | PathLike) -> Dynaconf\n
Load and validate the configuration file.
Usage Documentation: Config Loader
Parameters:
config_file (str | PathLike) – Path to the configuration file.
Returns:
Dynaconf (Dynaconf) – A Dynaconf object containing the configuration settings.
Raises:
FileNotFoundError – If the configuration file does not exist.
Source code in src/nplinker/config.py
def load_config(config_file: str | PathLike) -> Dynaconf:
    """Load and validate the configuration file.

    ??? info "Usage Documentation"
        [Config Loader][config-loader]

    Args:
        config_file: Path to the configuration file.

    Returns:
        Dynaconf: A Dynaconf object containing the configuration settings.

    Raises:
        FileNotFoundError: If the configuration file does not exist.
    """
    config_file = transform_to_full_path(config_file)
    if not config_file.exists():
        raise FileNotFoundError(f"Config file '{config_file}' not found")

    # Locate the default config file
    default_config_file = Path(__file__).resolve().parent / "nplinker_default.toml"

    # Load config files
    config = Dynaconf(settings_files=[config_file], preload=[default_config_file])

    # Validate configs
    config.validators.register(*CONFIG_VALIDATORS)
    config.validators.validate()

    return config
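A minimal usage sketch; the config file path is hypothetical, and the accessed settings are those required by CONFIG_VALIDATORS above:

from nplinker.config import load_config

config = load_config("path/to/nplinker.toml")  # raises FileNotFoundError if missing
print(config.mode)      # "local" or "podp", validated by CONFIG_VALIDATORS
print(config.root_dir)  # cast to a full path and required to be a directory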
"},{"location":"api/schema/","title":"Schemas","text":""},{"location":"api/schema/#nplinker.schemas","title":"nplinker.schemas","text":""},{"location":"api/schema/#nplinker.schemas.GENOME_STATUS_SCHEMA","title":"GENOME_STATUS_SCHEMA module-attribute
","text":"GENOME_STATUS_SCHEMA = load(f)\n
Schema for the genome status JSON file.
Schema Content: genome_status_schema.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/genome_status_schema.json",
  "title": "Status of genomes",
  "description": "A list of genome status objects, each of which contains information about a single genome",
  "type": "object",
  "required": ["genome_status", "version"],
  "properties": {
    "genome_status": {
      "type": "array",
      "title": "Genome status",
      "description": "A list of genome status objects",
      "items": {
        "type": "object",
        "required": ["original_id", "resolved_refseq_id", "resolve_attempted", "bgc_path"],
        "properties": {
          "original_id": {
            "type": "string",
            "title": "Original ID",
            "description": "The original ID of the genome",
            "minLength": 1
          },
          "resolved_refseq_id": {
            "type": "string",
            "title": "Resolved RefSeq ID",
            "description": "The RefSeq ID that was resolved for this genome"
          },
          "resolve_attempted": {
            "type": "boolean",
            "title": "Resolve Attempted",
            "description": "Whether or not an attempt was made to resolve this genome"
          },
          "bgc_path": {
            "type": "string",
            "title": "BGC Path",
            "description": "The path to the downloaded BGC file for this genome"
          }
        }
      },
      "minItems": 1,
      "uniqueItems": true
    },
    "version": {"type": "string", "enum": ["1.0"]}
  },
  "additionalProperties": false
}
"},{"location":"api/schema/#nplinker.schemas.GENOME_BGC_MAPPINGS_SCHEMA","title":"GENOME_BGC_MAPPINGS_SCHEMA module-attribute
","text":"GENOME_BGC_MAPPINGS_SCHEMA = load(f)\n
Schema for genome BGC mappings JSON file.
Schema Content: genome_bgc_mappings_schema.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/genome_bgc_mappings_schema.json",
  "title": "Mappings from genome ID to BGC IDs",
  "description": "A list of mappings from genome ID to BGC (biosynthetic gene cluster) IDs",
  "type": "object",
  "required": ["mappings", "version"],
  "properties": {
    "mappings": {
      "type": "array",
      "title": "Mappings from genome ID to BGC IDs",
      "description": "A list of mappings from genome ID to BGC IDs",
      "items": {
        "type": "object",
        "required": ["genome_ID", "BGC_ID"],
        "properties": {
          "genome_ID": {
            "type": "string",
            "title": "Genome ID",
            "description": "The genome ID used in BGC database such as antiSMASH",
            "minLength": 1
          },
          "BGC_ID": {
            "type": "array",
            "title": "BGC ID",
            "description": "A list of BGC IDs",
            "items": {"type": "string", "minLength": 1},
            "minItems": 1,
            "uniqueItems": true
          }
        }
      },
      "minItems": 1,
      "uniqueItems": true
    },
    "version": {"type": "string", "enum": ["1.0"]}
  },
  "additionalProperties": false
}
"},{"location":"api/schema/#nplinker.schemas.STRAIN_MAPPINGS_SCHEMA","title":"STRAIN_MAPPINGS_SCHEMA module-attribute
","text":"STRAIN_MAPPINGS_SCHEMA = load(f)\n
Schema for strain mappings JSON file.
Schema Content: strain_mappings_schema.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/strain_mappings_schema.json",
  "title": "Strain mappings",
  "description": "A list of mappings from strain ID to strain aliases",
  "type": "object",
  "required": ["strain_mappings", "version"],
  "properties": {
    "strain_mappings": {
      "type": "array",
      "title": "Strain mappings",
      "description": "A list of strain mappings",
      "items": {
        "type": "object",
        "required": ["strain_id", "strain_alias"],
        "properties": {
          "strain_id": {
            "type": "string",
            "title": "Strain ID",
            "description": "Strain ID, which could be any strain name or accession number",
            "minLength": 1
          },
          "strain_alias": {
            "type": "array",
            "title": "Strain aliases",
            "description": "A list of strain aliases, which could be any names that refer to the same strain",
            "items": {"type": "string", "minLength": 1},
            "minItems": 1,
            "uniqueItems": true
          }
        }
      },
      "minItems": 1,
      "uniqueItems": true
    },
    "version": {"type": "string", "enum": ["1.0"]}
  },
  "additionalProperties": false
}
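For illustration, a minimal document that satisfies this schema can be checked with the jsonschema library; the strain ID and aliases below are made up:

from jsonschema import validate

from nplinker.schemas import STRAIN_MAPPINGS_SCHEMA

# A made-up but schema-conformant strain mappings document.
doc = {
    "strain_mappings": [
        {"strain_id": "strain_1", "strain_alias": ["alias_a", "alias_b"]}
    ],
    "version": "1.0",
}
validate(instance=doc, schema=STRAIN_MAPPINGS_SCHEMA)  # raises ValidationError if invalid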
"},{"location":"api/schema/#nplinker.schemas.USER_STRAINS_SCHEMA","title":"USER_STRAINS_SCHEMA module-attribute
","text":"USER_STRAINS_SCHEMA = load(f)\n
Schema for user strains JSON file.
Schema Content: user_strains.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/user_strains.json",
  "title": "User specified strains",
  "description": "A list of strain IDs specified by user",
  "type": "object",
  "required": ["strain_ids"],
  "properties": {
    "strain_ids": {
      "type": "array",
      "title": "Strain IDs",
      "description": "A list of strain IDs specified by user. The strain IDs must be the same as the ones in the strain mappings file.",
      "items": {"type": "string", "minLength": 1},
      "minItems": 1,
      "uniqueItems": true
    },
    "version": {"type": "string", "enum": ["1.0"]}
  },
  "additionalProperties": false
}
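Similarly, a minimal user strains document can be validated in the same way (the IDs are made up and must match those in the strain mappings file):

from jsonschema import validate

from nplinker.schemas import USER_STRAINS_SCHEMA

doc = {"strain_ids": ["strain_1", "strain_2"], "version": "1.0"}
validate(instance=doc, schema=USER_STRAINS_SCHEMA)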
"},{"location":"api/schema/#nplinker.schemas.PODP_ADAPTED_SCHEMA","title":"PODP_ADAPTED_SCHEMA module-attribute
","text":"PODP_ADAPTED_SCHEMA = load(f)\n
Schema for PODP JSON file.
The PODP JSON file is the project JSON file downloaded from the PODP platform. For example, for PODP project MSV000079284, its JSON file is https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.
Schema Content: podp_adapted_schema.json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/podp_adapted_schema.json",
  "title": "Adapted Paired Omics Data Platform Schema for NPLinker",
  "description": "This schema is adapted from PODP schema (https://pairedomicsdata.bioinformatics.nl/schema.json) for NPLinker. It's used to validate the input data for NPLinker. Thus, only required fields for NPLinker are kept in this schema, and some fields are modified to fit NPLinker's requirements.",
  "type": "object",
  "required": ["version", "metabolomics", "genomes", "genome_metabolome_links"],
  "properties": {
    "version": {"type": "string", "readOnly": true, "default": "3", "enum": ["3"]},
    "metabolomics": {
      "type": "object",
      "title": "2. Metabolomics Information",
      "description": "Please provide basic information on the publicly available metabolomics project from which paired data is available. Currently, we allow for links to mass spectrometry data deposited in GNPS-MaSSIVE or MetaboLights.",
      "properties": {
        "project": {
          "type": "object",
          "required": ["molecular_network"],
          "title": "GNPS-MassIVE",
          "properties": {
            "GNPSMassIVE_ID": {
              "type": "string",
              "title": "GNPS-MassIVE identifier",
              "description": "Please provide the GNPS-MassIVE identifier of your metabolomics data set, e.g., MSV000078839.",
              "pattern": "^MSV[0-9]{9}$"
            },
            "MaSSIVE_URL": {
              "type": "string",
              "title": "Link to MassIVE upload",
              "description": "Please provide the link to the MassIVE upload, e.g., <a target=\"_blank\" rel=\"noopener noreferrer\" href=\"https://gnps.ucsd.edu/ProteoSAFe/result.jsp?task=a507232a787243a5afd69a6c6fa1e508&view=advanced_view\">https://gnps.ucsd.edu/ProteoSAFe/result.jsp?task=a507232a787243a5afd69a6c6fa1e508&view=advanced_view</a>. Warning, there cannot be spaces in the URI.",
              "format": "uri"
            },
            "molecular_network": {
              "type": "string",
              "pattern": "^[0-9a-z]{32}$",
              "title": "Molecular Network Task ID",
              "description": "If you have run a Molecular Network on GNPS, please provide the task ID of the Molecular Network job. It can be found in the URL of the Molecular Networking job, e.g., in <a target=\"_blank\" rel=\"noopener noreferrer\" href=\"https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c36f90ba29fe44c18e96db802de0c6b9\">https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c36f90ba29fe44c18e96db802de0c6b9</a> the task ID is c36f90ba29fe44c18e96db802de0c6b9."
            }
          }
        }
      },
      "required": ["project"],
      "additionalProperties": true
    },
    "genomes": {
      "type": "array",
      "title": "3. (Meta)genomics Information",
      "description": "Please add all genomes and/or metagenomes for which paired data is available as separate entries.",
      "items": {
        "type": "object",
        "required": ["genome_ID", "genome_label"],
        "properties": {
          "genome_ID": {
            "type": "object",
            "title": "Genome accession",
            "description": "At least one of the three identifiers is required.",
            "anyOf": [
              {"required": ["GenBank_accession"]},
              {"required": ["RefSeq_accession"]},
              {"required": ["JGI_Genome_ID"]}
            ],
            "properties": {
              "GenBank_accession": {
                "type": "string",
                "title": "GenBank accession number",
                "description": "If the publicly available genome got a GenBank accession number assigned, e.g., <a href=\"https://www.ncbi.nlm.nih.gov/nuccore/AL645882\" target=\"_blank\" rel=\"noopener noreferrer\">AL645882</a>, please provide it here. The genome sequence must be submitted to GenBank/ENA/DDBJ (and an accession number must be received) before this form can be filled out. In case of a whole genome sequence, please use master records. At least one identifier must be entered.",
                "minLength": 1
              },
              "RefSeq_accession": {
                "type": "string",
                "title": "RefSeq accession number",
                "description": "For example: <a target=\"_blank\" rel=\"noopener noreferrer\" href=\"https://www.ncbi.nlm.nih.gov/nuccore/NC_003888.3\">NC_003888.3</a>",
                "minLength": 1
              },
              "JGI_Genome_ID": {
                "type": "string",
                "title": "JGI IMG genome ID",
                "description": "For example: <a target=\"_blank\" rel=\"noopener noreferrer\" href=\"https://img.jgi.doe.gov/cgi-bin/m/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=641228474\">641228474</a>",
                "minLength": 1
              }
            }
          },
          "genome_label": {
            "type": "string",
            "title": "Genome label",
            "description": "Please assign a unique Genome Label for this genome or metagenome to help you recall it during the linking step. For example 'Streptomyces sp. CNB091'",
            "minLength": 1
          }
        }
      },
      "minItems": 1
    },
    "genome_metabolome_links": {
      "type": "array",
      "title": "6. Genome - Proteome - Metabolome Links",
      "description": "Create a linked pair by selecting the Genome Label and optional Proteome label as provided earlier. Subsequently links to the metabolomics data file belonging to that genome/proteome with appropriate experimental methods.",
      "items": {
        "type": "object",
        "required": ["genome_label", "metabolomics_file"],
        "properties": {
          "genome_label": {
            "type": "string",
            "title": "Genome/Metagenome",
            "description": "Please select the Genome Label to be linked to a metabolomics data file."
          },
          "metabolomics_file": {
            "type": "string",
            "title": "Location of metabolomics data file",
            "description": "Please provide a direct link to the metabolomics data file location, e.g. <a href=\"ftp://massive.ucsd.edu/MSV000078839/spectrum/R5/CNB091_R5_M.mzXML\" target=\"_blank\" rel=\"noopener noreferrer\">ftp://massive.ucsd.edu/MSV000078839/spectrum/R5/CNB091_R5_M.mzXML</a> found in the FTP download of a MassIVE dataset or <a target=\"_blank\" rel=\"noopener noreferrer\" href=\"https://www.ebi.ac.uk/metabolights/MTBLS307/files/Urine_44_fullscan1_pos.mzXML\">https://www.ebi.ac.uk/metabolights/MTBLS307/files/Urine_44_fullscan1_pos.mzXML</a> found in the Files section of a MetaboLights study. Warning, there cannot be spaces in the URI.",
            "format": "uri"
          }
        },
        "additionalProperties": true
      },
      "minItems": 1
    }
  },
  "additionalProperties": true
}
"},{"location":"api/schema/#nplinker.schemas.validate_podp_json","title":"validate_podp_json","text":"validate_podp_json(json_data: dict) -> None\n
Validate JSON data against the PODP JSON schema.
All validation error messages are collected and raised as a single ValueError.
Parameters:
json_data (dict) – The JSON data to validate.
Raises:
ValueError – If the JSON data does not match the schema.
Examples:
Download the PODP JSON file for project MSV000079284 from https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4 and save it as podp_project.json.
Validate it:
>>> with open("podp_project.json", "r") as f:
...     json_data = json.load(f)
>>> validate_podp_json(json_data)
Source code in src/nplinker/schemas/__init__.py
def validate_podp_json(json_data: dict) -> None:
    """Validate JSON data against the PODP JSON schema.

    All validation error messages are collected and raised as a single
    ValueError.

    Args:
        json_data: The JSON data to validate.

    Raises:
        ValueError: If the JSON data does not match the schema.

    Examples:
        Download PODP JSON file for project MSV000079284 from
        https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4
        and save it as `podp_project.json`.

        Validate it:
        >>> with open("podp_project.json", "r") as f:
        ...     json_data = json.load(f)
        >>> validate_podp_json(json_data)
    """
    validator = Draft7Validator(PODP_ADAPTED_SCHEMA)
    errors = sorted(validator.iter_errors(json_data), key=lambda e: e.path)
    if errors:
        error_messages = [f"{e.json_path}: {e.message}" for e in errors]
        raise ValueError(
            "Not match PODP adapted schema, here are the detailed error:\n - "
            + "\n - ".join(error_messages)
        )
"},{"location":"api/scoring/","title":"Data Models","text":""},{"location":"api/scoring/#nplinker.scoring","title":"nplinker.scoring","text":""},{"location":"api/scoring/#nplinker.scoring.LinkGraph","title":"LinkGraph","text":"LinkGraph()\n
Class to represent the links between objects in NPLinker.
This class wraps the networkx.Graph class to provide a more user-friendly interface for working with the links.
The links between objects are stored as edges in a graph, while the objects themselves are stored as nodes.
The scoring data for each link (or link data) is stored as the key/value attributes of the edge.
Examples:
Create a LinkGraph object:
>>> lg = LinkGraph()
Add a link between a GCF and a Spectrum object:
>>> lg.add_link(gcf, spectrum, metcalf=Score("metcalf", 1.0, {"cutoff": 0.5}))
Get all links for a given object:
>>> lg[gcf]
{spectrum: {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}}
Get all links in the LinkGraph:
>>> lg.links
[(gcf, spectrum, {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})})]
Check if there is a link between two objects:
>>> lg.has_link(gcf, spectrum)
True
Get the link data between two objects:
>>> lg.get_link_data(gcf, spectrum)
{"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}
Source code in src/nplinker/scoring/link_graph.py
def __init__(self) -> None:
    """Initialize a LinkGraph object.

    Examples:
        Create a LinkGraph object:
        >>> lg = LinkGraph()

        Add a link between a GCF and a Spectrum object:
        >>> lg.add_link(gcf, spectrum, metcalf=Score("metcalf", 1.0, {"cutoff": 0.5}))

        Get all links for a given object:
        >>> lg[gcf]
        {spectrum: {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}}

        Get all links in the LinkGraph:
        >>> lg.links
        [(gcf, spectrum, {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})})]

        Check if there is a link between two objects:
        >>> lg.has_link(gcf, spectrum)
        True

        Get the link data between two objects:
        >>> lg.get_link_data(gcf, spectrum)
        {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}
    """
    self._g: Graph = Graph()
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.links","title":"links property
","text":"links: list[LINK]\n
Get all links.
Returns:
list[LINK] – A list of tuples containing the links between objects.
Examples:
>>> lg.links
[(gcf, spectrum, {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})})]
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.__str__","title":"__str__","text":"__str__() -> str\n
Get a short summary of the LinkGraph.
Source code in src/nplinker/scoring/link_graph.py
def __str__(self) -> str:
    """Get a short summary of the LinkGraph."""
    return f"{self.__class__.__name__}(#links={len(self.links)}, #objects={len(self)})"
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.__len__","title":"__len__","text":"__len__() -> int\n
Get the number of objects.
Source code in src/nplinker/scoring/link_graph.py
def __len__(self) -> int:
    """Get the number of objects."""
    return len(self._g)
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.__getitem__","title":"__getitem__","text":"__getitem__(u: Entity) -> dict[Entity, LINK_DATA]\n
Get all links for a given object.
Parameters:
u (Entity) – the given object
Returns:
dict[Entity, LINK_DATA] – A dictionary of links for the given object.
Raises:
KeyError – if the input object is not found in the link graph.
Source code in src/nplinker/scoring/link_graph.py
@validate_u
def __getitem__(self, u: Entity) -> dict[Entity, LINK_DATA]:
    """Get all links for a given object.

    Args:
        u: the given object

    Returns:
        A dictionary of links for the given object.

    Raises:
        KeyError: if the input object is not found in the link graph.
    """
    try:
        links = self._g[u]
    except KeyError:
        raise KeyError(f"{u} not found in the link graph.")

    return {**links}  # type: ignore
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.add_link","title":"add_link","text":"add_link(u: Entity, v: Entity, **data: Score) -> None\n
Add a link between two objects.
The objects u and v must be different types, i.e. one must be a GCF and the other must be a Spectrum or MolecularFamily.
Parameters:
u (Entity) – the first object, either a GCF, Spectrum, or MolecularFamily
v (Entity) – the second object, either a GCF, Spectrum, or MolecularFamily
data (Score, default: {}) – keyword arguments. At least one scoring method and its data must be provided. The key must be the name of the scoring method defined in ScoringMethod, and the value is a Score object, e.g. metcalf=Score("metcalf", 1.0, {"cutoff": 0.5}).
Examples:
>>> lg.add_link(gcf, spectrum, metcalf=Score("metcalf", 1.0, {"cutoff": 0.5}))
Source code in src/nplinker/scoring/link_graph.py
@validate_uv
def add_link(
    self,
    u: Entity,
    v: Entity,
    **data: Score,
) -> None:
    """Add a link between two objects.

    The objects `u` and `v` must be different types, i.e. one must be a GCF and the other must be
    a Spectrum or MolecularFamily.

    Args:
        u: the first object, either a GCF, Spectrum, or MolecularFamily
        v: the second object, either a GCF, Spectrum, or MolecularFamily
        data: keyword arguments. At least one scoring method and its data must be provided.
            The key must be the name of the scoring method defined in `ScoringMethod`, and the
            value is a `Score` object, e.g. `metcalf=Score("metcalf", 1.0, {"cutoff": 0.5})`.

    Examples:
        >>> lg.add_link(gcf, spectrum, metcalf=Score("metcalf", 1.0, {"cutoff": 0.5}))
    """
    # validate the data
    if not data:
        raise ValueError("At least one scoring method and its data must be provided.")
    for key, value in data.items():
        if not ScoringMethod.has_value(key):
            raise ValueError(
                f"{key} is not a valid name of scoring method. See `ScoringMethod` for valid names."
            )
        if not isinstance(value, Score):
            raise TypeError(f"{value} is not a Score object.")

    self._g.add_edge(u, v, **data)
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.has_link","title":"has_link","text":"has_link(u: Entity, v: Entity) -> bool\n
Check if there is a link between two objects.
Parameters:
u (Entity) – the first object, either a GCF, Spectrum, or MolecularFamily
v (Entity) – the second object, either a GCF, Spectrum, or MolecularFamily
Returns:
bool – True if there is a link between the two objects, False otherwise
Examples:
>>> lg.has_link(gcf, spectrum)
True
Source code in src/nplinker/scoring/link_graph.py
@validate_uv
def has_link(self, u: Entity, v: Entity) -> bool:
    """Check if there is a link between two objects.

    Args:
        u: the first object, either a GCF, Spectrum, or MolecularFamily
        v: the second object, either a GCF, Spectrum, or MolecularFamily

    Returns:
        True if there is a link between the two objects, False otherwise

    Examples:
        >>> lg.has_link(gcf, spectrum)
        True
    """
    return self._g.has_edge(u, v)
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.get_link_data","title":"get_link_data","text":"get_link_data(u: Entity, v: Entity) -> LINK_DATA | None\n
Get the data for a link between two objects.
Parameters:
u (Entity) – the first object, either a GCF, Spectrum, or MolecularFamily
v (Entity) – the second object, either a GCF, Spectrum, or MolecularFamily
Returns:
LINK_DATA | None – A dictionary of scoring methods and their data for the link between the two objects, or None if there is no link between the two objects.
Examples:
>>> lg.get_link_data(gcf, spectrum)
{"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}
Source code in src/nplinker/scoring/link_graph.py
@validate_uv
def get_link_data(
    self,
    u: Entity,
    v: Entity,
) -> LINK_DATA | None:
    """Get the data for a link between two objects.

    Args:
        u: the first object, either a GCF, Spectrum, or MolecularFamily
        v: the second object, either a GCF, Spectrum, or MolecularFamily

    Returns:
        A dictionary of scoring methods and their data for the link between the two objects, or
        None if there is no link between the two objects.

    Examples:
        >>> lg.get_link_data(gcf, spectrum)
        {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}
    """
    return self._g.get_edge_data(u, v)  # type: ignore
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.filter","title":"filter","text":"filter(\n u_nodes: Sequence[Entity],\n v_nodes: Sequence[Entity] = [],\n) -> LinkGraph\n
Return a new LinkGraph object with the filtered links between the given objects.
The new LinkGraph object will only contain the links between u_nodes and v_nodes.
If u_nodes or v_nodes is empty, the new LinkGraph object will contain the links for the given objects in v_nodes or u_nodes, respectively. If both are empty, return an empty LinkGraph object.
Note that not all objects in u_nodes and v_nodes need to be present in the original LinkGraph.
Parameters:
u_nodes (Sequence[Entity]) – a sequence of objects used as the first object in the links
v_nodes (Sequence[Entity], default: []) – a sequence of objects used as the second object in the links
Returns:
LinkGraph – A new LinkGraph object with the filtered links between the given objects.
Examples:
Filter the links for gcf1 and gcf2:
>>> new_lg = lg.filter([gcf1, gcf2])
Filter the links for spectrum1 and spectrum2:
>>> new_lg = lg.filter([spectrum1, spectrum2])
Filter the links between two lists of objects:
>>> new_lg = lg.filter([gcf1, gcf2], [spectrum1, spectrum2])
Source code in src/nplinker/scoring/link_graph.py
def filter(self, u_nodes: Sequence[Entity], v_nodes: Sequence[Entity] = [], /) -> LinkGraph:
    """Return a new LinkGraph object with the filtered links between the given objects.

    The new LinkGraph object will only contain the links between `u_nodes` and `v_nodes`.

    If `u_nodes` or `v_nodes` is empty, the new LinkGraph object will contain the links for
    the given objects in `v_nodes` or `u_nodes`, respectively. If both are empty, return an
    empty LinkGraph object.

    Note that not all objects in `u_nodes` and `v_nodes` need to be present in the original
    LinkGraph.

    Args:
        u_nodes: a sequence of objects used as the first object in the links
        v_nodes: a sequence of objects used as the second object in the links

    Returns:
        A new LinkGraph object with the filtered links between the given objects.

    Examples:
        Filter the links for `gcf1` and `gcf2`:
        >>> new_lg = lg.filter([gcf1, gcf2])
        Filter the links for `spectrum1` and `spectrum2`:
        >>> new_lg = lg.filter([spectrum1, spectrum2])
        Filter the links between two lists of objects:
        >>> new_lg = lg.filter([gcf1, gcf2], [spectrum1, spectrum2])
    """
    lg = LinkGraph()

    # exchange u_nodes and v_nodes if u_nodes is empty but v_nodes not
    if len(u_nodes) == 0 and len(v_nodes) != 0:
        u_nodes = v_nodes
        v_nodes = []

    if len(v_nodes) == 0:
        for u in u_nodes:
            self._filter_one_node(u, lg)

    for u in u_nodes:
        for v in v_nodes:
            self._filter_two_nodes(u, v, lg)

    return lg
"},{"location":"api/scoring/#nplinker.scoring.Score","title":"Score dataclass
","text":"Score(name: str, value: float, parameter: dict)\n
A data class to represent score data.
Attributes:
name (str) – the name of the scoring method. See ScoringMethod for valid values.
value (float) – the score value.
parameter (dict) – the parameters used for the scoring method.
name instance-attribute
name: str
value instance-attribute
value: float
parameter instance-attribute
parameter: dict
"},{"location":"api/scoring/#nplinker.scoring.Score.__post_init__","title":"__post_init__","text":"__post_init__() -> None\n
Check if the value of name is valid.
Raises:
ValueError – if the value of name is not valid.
Source code in src/nplinker/scoring/score.py
def __post_init__(self) -> None:
    """Check if the value of `name` is valid.

    Raises:
        ValueError: if the value of `name` is not valid.
    """
    if ScoringMethod.has_value(self.name) is False:
        raise ValueError(
            f"{self.name} is not a valid value. Valid values are: {[e.value for e in ScoringMethod]}"
        )
"},{"location":"api/scoring/#nplinker.scoring.Score.__getitem__","title":"__getitem__","text":"__getitem__(key)\n
Source code in src/nplinker/scoring/score.py
def __getitem__(self, key):
    if key in {field.name for field in fields(self)}:
        return getattr(self, key)
    else:
        raise KeyError(f"{key} not found in {self.__class__.__name__}")
"},{"location":"api/scoring/#nplinker.scoring.Score.__setitem__","title":"__setitem__","text":"__setitem__(key, value)\n
Source code in src/nplinker/scoring/score.py
def __setitem__(self, key, value):
    # validate the value of `name`
    if key == "name" and ScoringMethod.has_value(value) is False:
        raise ValueError(
            f"{value} is not a valid value. Valid values are: {[e.value for e in ScoringMethod]}"
        )

    if key in {field.name for field in fields(self)}:
        setattr(self, key, value)
    else:
        raise KeyError(f"{key} not found in {self.__class__.__name__}")
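Taken together, Score behaves as both a dataclass and a mapping-like object; a brief sketch:

from nplinker.scoring import Score

score = Score("metcalf", 1.0, {"cutoff": 0, "standardised": False})
print(score.value)     # 1.0, regular attribute access
print(score["value"])  # 1.0, via __getitem__

score["value"] = 2.5   # allowed, via __setitem__
try:
    score["name"] = "bogus"  # rejected by the name validation above
except ValueError as err:
    print(err)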
"},{"location":"api/scoring_abc/","title":"Abstract Base Classes","text":""},{"location":"api/scoring_abc/#nplinker.scoring.abc","title":"nplinker.scoring.abc","text":""},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase","title":"ScoringBase","text":" Bases: ABC
Abstract base class of scoring methods.
Attributes:
name (str) – The name of the scoring method.
npl (NPLinker | None) – The NPLinker object.
name class-attribute instance-attribute
name: str = 'ScoringBase'
npl class-attribute instance-attribute
npl: NPLinker | None = None
setup abstractmethod classmethod
setup(npl: NPLinker)
Setup class level attributes.
Source code in src/nplinker/scoring/abc.py
@classmethod
@abstractmethod
def setup(cls, npl: NPLinker):
    """Setup class level attributes."""
"},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase.get_links","title":"get_links abstractmethod
","text":"get_links(*objects, **parameters) -> LinkGraph\n
Get links information for the given objects.
Parameters:
objects – A list of objects to get links for.
parameters – The parameters used for scoring.
Returns:
LinkGraph – The LinkGraph object.
Source code in src/nplinker/scoring/abc.py
@abstractmethod
def get_links(
    self,
    *objects,
    **parameters,
) -> LinkGraph:
    """Get links information for the given objects.

    Args:
        objects: A list of objects to get links for.
        parameters: The parameters used for scoring.

    Returns:
        The LinkGraph object.
    """
"},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase.format_data","title":"format_data abstractmethod
","text":"format_data(data) -> str\n
Format the scoring data to a string.
Source code in src/nplinker/scoring/abc.py
@abstractmethod
def format_data(self, data) -> str:
    """Format the scoring data to a string."""
"},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase.sort","title":"sort abstractmethod
","text":"sort(objects, reverse=True) -> list\n
Sort the given objects based on the scoring data.
Source code in src/nplinker/scoring/abc.py
@abstractmethod
def sort(self, objects, reverse=True) -> list:
    """Sort the given objects based on the scoring data."""
"},{"location":"api/scoring_methods/","title":"Scoring Methods","text":""},{"location":"api/scoring_methods/#nplinker.scoring","title":"nplinker.scoring","text":""},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod","title":"ScoringMethod","text":" Bases: Enum
Enum class for scoring methods.
"},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod.METCALF","title":"METCALFclass-attribute
instance-attribute
","text":"METCALF = 'metcalf'\n
"},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod.ROSETTA","title":"ROSETTA class-attribute
instance-attribute
","text":"ROSETTA = 'rosetta'\n
"},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod.NPLCLASS","title":"NPLCLASS class-attribute
instance-attribute
","text":"NPLCLASS = 'nplclass'\n
"},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod.has_value","title":"has_value classmethod
","text":"has_value(value: str) -> bool\n
Check if the enum has a value.
Source code in src/nplinker/scoring/scoring_method.py
@classmethod
def has_value(cls, value: str) -> bool:
    """Check if the enum has a value."""
    return any(value == item.value for item in cls)
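A brief sketch of has_value, which checks a string against the enum member values rather than the member names:

from nplinker.scoring import ScoringMethod

print(ScoringMethod.has_value("metcalf"))  # True
print(ScoringMethod.has_value("unknown"))  # False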
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring","title":"MetcalfScoring","text":" Bases: ScoringBase
Metcalf scoring method.
Attributes:
name – The name of this scoring method, set to a fixed value metcalf.
npl (NPLinker | None) – The NPLinker object.
CACHE (str) – The name of the cache file to use for storing the MetcalfScoring.
presence_gcf_strain (DataFrame) – A DataFrame to store presence of gcfs with respect to strains. The index of the DataFrame are the GCF objects and the columns are Strain objects. The values are 1 where the gcf occurs in the strain, 0 otherwise.
presence_spec_strain (DataFrame) – A DataFrame to store presence of spectra with respect to strains. The index of the DataFrame are the Spectrum objects and the columns are Strain objects. The values are 1 where the spectrum occurs in the strain, 0 otherwise.
presence_mf_strain (DataFrame) – A DataFrame to store presence of molecular families with respect to strains. The index of the DataFrame are the MolecularFamily objects and the columns are Strain objects. The values are 1 where the molecular family occurs in the strain, 0 otherwise.
raw_score_spec_gcf (DataFrame) – A DataFrame to store the raw Metcalf scores for spectrum-gcf links. The columns are "spec", "gcf" and "score".
raw_score_mf_gcf (DataFrame) – A DataFrame to store the raw Metcalf scores for molecular family-gcf links. The columns are "mf", "gcf" and "score".
metcalf_mean (ndarray | None) – A numpy array to store the mean value used for standardising Metcalf scores. The array has shape (n_strains+1, n_strains+1), where n_strains is the number of strains.
metcalf_std (ndarray | None) – A numpy array to store the standard deviation value used for standardising Metcalf scores. The array has shape (n_strains+1, n_strains+1), where n_strains is the number of strains.
name class-attribute instance-attribute
name = METCALF.value
npl class-attribute instance-attribute
npl: NPLinker | None = None
CACHE class-attribute instance-attribute
CACHE: str = 'cache_metcalf_scoring.pckl'
metcalf_weights class-attribute instance-attribute
metcalf_weights: tuple[int, int, int, int] = (10, -10, 0, 1)
presence_gcf_strain class-attribute instance-attribute
presence_gcf_strain: DataFrame = DataFrame()
presence_spec_strain class-attribute instance-attribute
presence_spec_strain: DataFrame = DataFrame()
presence_mf_strain class-attribute instance-attribute
presence_mf_strain: DataFrame = DataFrame()
raw_score_spec_gcf class-attribute instance-attribute
raw_score_spec_gcf: DataFrame = DataFrame(columns=["spec", "gcf", "score"])
raw_score_mf_gcf class-attribute instance-attribute
raw_score_mf_gcf: DataFrame = DataFrame(columns=["mf", "gcf", "score"])
metcalf_mean class-attribute instance-attribute
metcalf_mean: ndarray | None = None
metcalf_std class-attribute instance-attribute
metcalf_std: ndarray | None = None
setup classmethod
setup(npl: NPLinker) -> None
Setup the MetcalfScoring object.
This method is only called once to setup the MetcalfScoring object.
Parameters:
npl (NPLinker) – The NPLinker object.
Source code in src/nplinker/scoring/metcalf_scoring.py
@classmethod
def setup(cls, npl: NPLinker) -> None:
    """Setup the MetcalfScoring object.

    This method is only called once to setup the MetcalfScoring object.

    Args:
        npl: The NPLinker object.
    """
    if cls.npl is not None:
        logger.info("MetcalfScoring.setup already called, skipping.")
        return

    logger.info(
        f"MetcalfScoring.setup starts: #bgcs={len(npl.bgcs)}, #gcfs={len(npl.gcfs)}, "
        f"#spectra={len(npl.spectra)}, #mfs={len(npl.mfs)}, #strains={npl.strains}"
    )
    cls.npl = npl

    # calculate presence of gcfs/spectra/mfs with respect to strains
    cls.presence_gcf_strain = get_presence_gcf_strain(npl.gcfs, npl.strains)
    cls.presence_spec_strain = get_presence_spec_strain(npl.spectra, npl.strains)
    cls.presence_mf_strain = get_presence_mf_strain(npl.mfs, npl.strains)

    # calculate raw Metcalf scores for spec-gcf links
    raw_score_spec_gcf = cls._calc_raw_score(
        cls.presence_spec_strain, cls.presence_gcf_strain, cls.metcalf_weights
    )
    cls.raw_score_spec_gcf = raw_score_spec_gcf.reset_index().melt(id_vars="index")
    cls.raw_score_spec_gcf.columns = ["spec", "gcf", "score"]  # type: ignore

    # calculate raw Metcalf scores for mf-gcf links
    raw_score_mf_gcf = cls._calc_raw_score(
        cls.presence_mf_strain, cls.presence_gcf_strain, cls.metcalf_weights
    )
    cls.raw_score_mf_gcf = raw_score_mf_gcf.reset_index().melt(id_vars="index")
    cls.raw_score_mf_gcf.columns = ["mf", "gcf", "score"]  # type: ignore

    # calculate mean and std for standardising Metcalf scores
    cls.metcalf_mean, cls.metcalf_std = cls._calc_mean_std(
        len(npl.strains), cls.metcalf_weights
    )

    logger.info("MetcalfScoring.setup completed")
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring.get_links","title":"get_links","text":"get_links(*objects, **parameters)\n
Get links for the given objects.
Parameters:
objects
\u2013 The objects to get links for. All objects must be of the same type, i.e. GCF
, Spectrum
or MolecularFamily
type. If no objects are provided, all detected objects (npl.gcfs
) will be used.
parameters
\u2013 The scoring parameters to use for the links. The parameters are:
cutoff
: The minimum score to consider a link (\u2265cutoff). Default is 0. standardised
: Whether to use standardised scores. Default is False. Returns:
The LinkGraph
object containing the links involving the input objects with the Metcalf scores.
Raises:
TypeError
\u2013 If the input objects are not of the same type or the object type is invalid.
src/nplinker/scoring/metcalf_scoring.py
def get_links(self, *objects, **parameters):\n \"\"\"Get links for the given objects.\n\n Args:\n objects: The objects to get links for. All objects must be of the same type, i.e. `GCF`,\n `Spectrum` or `MolecularFamily` type.\n If no objects are provided, all detected objects (`npl.gcfs`) will be used.\n parameters: The scoring parameters to use for the links.\n The parameters are:\n\n - `cutoff`: The minimum score to consider a link (\u2265cutoff). Default is 0.\n - `standardised`: Whether to use standardised scores. Default is False.\n\n Returns:\n The [`LinkGraph`][nplinker.scoring.LinkGraph] object containing the links involving the\n input objects with the Metcalf scores.\n\n Raises:\n TypeError: If the input objects are not of the same type or the object type is invalid.\n \"\"\"\n # validate input objects\n if len(objects) == 0:\n objects = self.npl.gcfs\n # check if all objects are of the same type\n types = {type(i) for i in objects}\n if len(types) > 1:\n raise TypeError(\"Input objects must be of the same type.\")\n # check if the object type is valid\n obj_type = next(iter(types))\n if obj_type not in (GCF, Spectrum, MolecularFamily):\n raise TypeError(\n f\"Invalid type {obj_type}. Input objects must be GCF, Spectrum or MolecularFamily objects.\"\n )\n\n # validate scoring parameters\n self._cutoff: float = parameters.get(\"cutoff\", 0)\n self._standardised: bool = parameters.get(\"standardised\", False)\n parameters.update({\"cutoff\": self._cutoff, \"standardised\": self._standardised})\n\n logger.info(\n f\"MetcalfScoring: #objects={len(objects)}, type={obj_type}, cutoff={self._cutoff}, \"\n f\"standardised={self._standardised}\"\n )\n if not self._standardised:\n scores_list = self._get_links(*objects, obj_type=obj_type, score_cutoff=self._cutoff)\n else:\n if self.metcalf_mean is None or self.metcalf_std is None:\n raise ValueError(\n \"MetcalfScoring.metcalf_mean and metcalf_std are not set. Run MetcalfScoring.setup first.\"\n )\n # use negative infinity as the score cutoff to ensure we get all links\n scores_list = self._get_links(*objects, obj_type=obj_type, score_cutoff=-np.inf)\n scores_list = self._calc_standardised_score(scores_list)\n\n links = LinkGraph()\n for score_df in scores_list:\n for row in score_df.itertuples(index=False): # row has attributes: spec/mf, gcf, score\n met = row.spec if score_df.name == LinkType.SPEC_GCF else row.mf\n links.add_link(\n row.gcf,\n met,\n metcalf=Score(self.name, row.score, parameters),\n )\n\n logger.info(f\"MetcalfScoring: completed! Found {len(links.links)} links in total.\")\n return links\n
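A minimal hedged sketch of running Metcalf scoring (it assumes npl is an already-loaded NPLinker object and that MetcalfScoring takes no constructor arguments):
from nplinker.scoring import MetcalfScoring\n\nMetcalfScoring.setup(npl)  # npl is assumed to be a loaded NPLinker object\nscoring = MetcalfScoring()  # assumed no-argument constructor\nlg = scoring.get_links(*npl.gcfs, cutoff=0, standardised=False)  # returns a LinkGraph\n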
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring.format_data","title":"format_data","text":"format_data(data)\n
Format the data for display.
Source code in src/nplinker/scoring/metcalf_scoring.py
def format_data(self, data):\n \"\"\"Format the data for display.\"\"\"\n # for metcalf the data will just be a floating point value (i.e. the score)\n return f\"{data:.4f}\"\n
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring.sort","title":"sort","text":"sort(objects, reverse=True)\n
Sort the objects based on the score.
Source code in src/nplinker/scoring/metcalf_scoring.py
def sort(self, objects, reverse=True):\n \"\"\"Sort the objects based on the score.\"\"\"\n # sort based on score\n return sorted(objects, key=lambda objlink: objlink[self], reverse=reverse)\n
"},{"location":"api/scoring_utils/","title":"Utilities","text":""},{"location":"api/scoring_utils/#nplinker.scoring.utils","title":"nplinker.scoring.utils","text":""},{"location":"api/scoring_utils/#nplinker.scoring.utils.get_presence_gcf_strain","title":"get_presence_gcf_strain","text":"get_presence_gcf_strain(\n gcfs: Sequence[GCF], strains: StrainCollection\n) -> DataFrame\n
Get the occurrence of strains in gcfs.
The occurrence is a DataFrame with GCF objects as index and Strain objects as columns, and the values are 1 if the gcf occurs in the strain, 0 otherwise.
Source code in src/nplinker/scoring/utils.py
def get_presence_gcf_strain(gcfs: Sequence[GCF], strains: StrainCollection) -> pd.DataFrame:\n \"\"\"Get the occurrence of strains in gcfs.\n\n The occurrence is a DataFrame with GCF objects as index and Strain objects as columns, and the\n values are 1 if the gcf occurs in the strain, 0 otherwise.\n \"\"\"\n df_gcf_strain = pd.DataFrame(\n 0,\n index=gcfs,\n columns=list(strains),\n dtype=int,\n ) # type: ignore\n for gcf in gcfs:\n for strain in strains:\n if gcf.has_strain(strain):\n df_gcf_strain.loc[gcf, strain] = 1\n return df_gcf_strain # type: ignore\n
"},{"location":"api/scoring_utils/#nplinker.scoring.utils.get_presence_spec_strain","title":"get_presence_spec_strain","text":"get_presence_spec_strain(\n spectra: Sequence[Spectrum], strains: StrainCollection\n) -> DataFrame\n
Get the occurrence of strains in spectra.
The occurrence is a DataFrame with Spectrum objects as index and Strain objects as columns, and the values are 1 if the spectrum occurs in the strain, 0 otherwise.
Source code in src/nplinker/scoring/utils.py
def get_presence_spec_strain(\n spectra: Sequence[Spectrum], strains: StrainCollection\n) -> pd.DataFrame:\n \"\"\"Get the occurrence of strains in spectra.\n\n The occurrence is a DataFrame with Spectrum objects as index and Strain objects as columns, and\n the values are 1 if the spectrum occurs in the strain, 0 otherwise.\n \"\"\"\n df_spec_strain = pd.DataFrame(\n 0,\n index=spectra,\n columns=list(strains),\n dtype=int,\n ) # type: ignore\n for spectrum in spectra:\n for strain in strains:\n if spectrum.has_strain(strain):\n df_spec_strain.loc[spectrum, strain] = 1\n return df_spec_strain # type: ignore\n
"},{"location":"api/scoring_utils/#nplinker.scoring.utils.get_presence_mf_strain","title":"get_presence_mf_strain","text":"get_presence_mf_strain(\n mfs: Sequence[MolecularFamily],\n strains: StrainCollection,\n) -> DataFrame\n
Get the occurrence of strains in molecular families.
The occurrence is a DataFrame with MolecularFamily objects as index and Strain objects as columns, and the values are 1 if the molecular family occurs in the strain, 0 otherwise.
Source code in src/nplinker/scoring/utils.py
def get_presence_mf_strain(\n mfs: Sequence[MolecularFamily], strains: StrainCollection\n) -> pd.DataFrame:\n \"\"\"Get the occurrence of strains in molecular families.\n\n The occurrence is a DataFrame with MolecularFamily objects as index and Strain objects as\n columns, and the values are 1 if the molecular family occurs in the strain, 0 otherwise.\n \"\"\"\n df_mf_strain = pd.DataFrame(\n 0,\n index=mfs,\n columns=list(strains),\n dtype=int,\n ) # type: ignore\n for mf in mfs:\n for strain in strains:\n if mf.has_strain(strain):\n df_mf_strain.loc[mf, strain] = 1\n return df_mf_strain # type: ignore\n
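A hedged sketch of how these helpers could be combined (it assumes npl is a loaded NPLinker object):
from nplinker.scoring.utils import get_presence_gcf_strain\n\npresence = get_presence_gcf_strain(npl.gcfs, npl.strains)  # 0/1 DataFrame: GCFs x strains\nprint(presence.sum(axis=1))  # number of strains each GCF occurs in\n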
"},{"location":"api/strain/","title":"Data Models","text":""},{"location":"api/strain/#nplinker.strain","title":"nplinker.strain","text":""},{"location":"api/strain/#nplinker.strain.Strain","title":"Strain","text":"Strain(id: str)\n
Class to model the mapping between strain id and its aliases.
It's recommended to use NCBI taxonomy strain id or name as the primary id.
Attributes:
id
(str
) \u2013 The representative id of the strain.
names
(set[str]
) \u2013 A set of names associated with the strain.
aliases
(set[str]
) \u2013 A set of aliases associated with the strain.
Parameters:
id
(str
) \u2013 the representative id of the strain.
src/nplinker/strain/strain.py
def __init__(self, id: str) -> None:\n \"\"\"To model the mapping between strain id and its aliases.\n\n Args:\n id: the representative id of the strain.\n \"\"\"\n self.id: str = id\n self._aliases: set[str] = set()\n
"},{"location":"api/strain/#nplinker.strain.Strain.id","title":"id instance-attribute
","text":"id: str = id\n
"},{"location":"api/strain/#nplinker.strain.Strain.names","title":"names property
","text":"names: set[str]\n
Get the set of strain names including id and aliases.
Returns:
set[str]
\u2013 A set of names associated with the strain.
property
","text":"aliases: set[str]\n
Get the set of known aliases.
Returns:
set[str]
\u2013 A set of aliases associated with the strain.
__repr__() -> str\n
Source code in src/nplinker/strain/strain.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/strain/#nplinker.strain.Strain.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/strain/strain.py
def __str__(self) -> str:\n return f\"Strain({self.id}) [{len(self._aliases)} aliases]\"\n
"},{"location":"api/strain/#nplinker.strain.Strain.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/strain/strain.py
def __eq__(self, other) -> bool:\n if isinstance(other, Strain):\n return self.id == other.id\n return NotImplemented\n
"},{"location":"api/strain/#nplinker.strain.Strain.__hash__","title":"__hash__","text":"__hash__() -> int\n
Hash function for Strain.
Note that Strain is a mutable container, so we hash only on the id so that the hash value does not change when self._aliases
is updated.
src/nplinker/strain/strain.py
def __hash__(self) -> int:\n \"\"\"Hash function for Strain.\n\n Note that Strain is a mutable container, so here we hash on only the id\n so that the hash value does not change when `self._aliases` is updated.\n \"\"\"\n return hash(self.id)\n
"},{"location":"api/strain/#nplinker.strain.Strain.__contains__","title":"__contains__","text":"__contains__(alias: str) -> bool\n
Source code in src/nplinker/strain/strain.py
def __contains__(self, alias: str) -> bool:\n if not isinstance(alias, str):\n raise TypeError(f\"Expected str, got {type(alias)}\")\n return alias in self._aliases\n
"},{"location":"api/strain/#nplinker.strain.Strain.add_alias","title":"add_alias","text":"add_alias(alias: str) -> None\n
Add an alias for the strain.
Parameters:
alias
(str
) \u2013 The alias to add for the strain.
src/nplinker/strain/strain.py
def add_alias(self, alias: str) -> None:\n \"\"\"Add an alias for the strain.\n\n Args:\n alias: The alias to add for the strain.\n \"\"\"\n if not isinstance(alias, str):\n raise TypeError(f\"Expected str, got {type(alias)}\")\n if len(alias) == 0:\n logger.warning(\"Refusing to add an empty-string alias to strain {%s}\", self)\n else:\n self._aliases.add(alias)\n
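For example, a minimal sketch (the strain id and alias below are made up for illustration):
from nplinker.strain import Strain\n\nstrain = Strain(\"strain1\")  # hypothetical strain id\nstrain.add_alias(\"S. example A1\")  # hypothetical alias\nprint(\"S. example A1\" in strain)  # True\nprint(strain.names)  # {'strain1', 'S. example A1'}\n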
"},{"location":"api/strain/#nplinker.strain.StrainCollection","title":"StrainCollection","text":"StrainCollection()\n
A collection of Strain
objects.
src/nplinker/strain/strain_collection.py
def __init__(self) -> None:\n # the order of strains is needed for scoring part, so use a list\n self._strains: list[Strain] = []\n self._strain_dict_name: dict[str, list[Strain]] = {}\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/strain/strain_collection.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/strain/strain_collection.py
def __str__(self) -> str:\n if len(self) > 20:\n return f\"StrainCollection(n={len(self)})\"\n\n return f\"StrainCollection(n={len(self)}) [\" + \",\".join(s.id for s in self._strains) + \"]\"\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__len__","title":"__len__","text":"__len__() -> int\n
Source code in src/nplinker/strain/strain_collection.py
def __len__(self) -> int:\n return len(self._strains)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/strain/strain_collection.py
def __eq__(self, other) -> bool:\n if isinstance(other, StrainCollection):\n return (\n self._strains == other._strains\n and self._strain_dict_name == other._strain_dict_name\n )\n return NotImplemented\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__add__","title":"__add__","text":"__add__(other) -> StrainCollection\n
Source code in src/nplinker/strain/strain_collection.py
def __add__(self, other) -> StrainCollection:\n if isinstance(other, StrainCollection):\n sc = StrainCollection()\n for strain in self._strains:\n sc.add(strain)\n for strain in other._strains:\n sc.add(strain)\n return sc\n return NotImplemented\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__contains__","title":"__contains__","text":"__contains__(item: Strain) -> bool\n
Check if the strain collection contains the given Strain object.
Source code in src/nplinker/strain/strain_collection.py
def __contains__(self, item: Strain) -> bool:\n \"\"\"Check if the strain collection contains the given Strain object.\"\"\"\n if isinstance(item, Strain):\n return item.id in self._strain_dict_name\n raise TypeError(f\"Expected Strain, got {type(item)}\")\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__iter__","title":"__iter__","text":"__iter__() -> Iterator[Strain]\n
Source code in src/nplinker/strain/strain_collection.py
def __iter__(self) -> Iterator[Strain]:\n return iter(self._strains)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.add","title":"add","text":"add(strain: Strain) -> None\n
Add strain to the collection.
If the strain already exists, merge the aliases.
Parameters:
strain
(Strain
) \u2013 The strain to add.
src/nplinker/strain/strain_collection.py
def add(self, strain: Strain) -> None:\n \"\"\"Add strain to the collection.\n\n If the strain already exists, merge the aliases.\n\n Args:\n strain: The strain to add.\n \"\"\"\n if strain in self._strains:\n # only one strain object per id\n strain_ref = self._strain_dict_name[strain.id][0]\n new_aliases = [alias for alias in strain.aliases if alias not in strain_ref.aliases]\n for alias in new_aliases:\n strain_ref.add_alias(alias)\n if alias not in self._strain_dict_name:\n self._strain_dict_name[alias] = [strain_ref]\n else:\n self._strain_dict_name[alias].append(strain_ref)\n else:\n self._strains.append(strain)\n for name in strain.names:\n if name not in self._strain_dict_name:\n self._strain_dict_name[name] = [strain]\n else:\n self._strain_dict_name[name].append(strain)\n
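A small hedged sketch of the merge behaviour (ids and aliases are hypothetical):
from nplinker.strain import Strain, StrainCollection\n\nsc = StrainCollection()\nsc.add(Strain(\"strain1\"))\nduplicate = Strain(\"strain1\")\nduplicate.add_alias(\"alias1\")\nsc.add(duplicate)  # same id, so \"alias1\" is merged into the existing strain\nprint(len(sc))  # 1\n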
"},{"location":"api/strain/#nplinker.strain.StrainCollection.remove","title":"remove","text":"remove(strain: Strain) -> None\n
Remove a strain from the collection.
It removes the given strain object from the collection by strain id. If the strain id is not found, raise ValueError
.
Parameters:
strain
(Strain
) \u2013 The strain to remove.
Raises:
ValueError
\u2013 If the strain is not found in the collection.
src/nplinker/strain/strain_collection.py
def remove(self, strain: Strain) -> None:\n \"\"\"Remove a strain from the collection.\n\n It removes the given strain object from the collection by strain id.\n If the strain id is not found, raise `ValueError`.\n\n Args:\n strain: The strain to remove.\n\n Raises:\n ValueError: If the strain is not found in the collection.\n \"\"\"\n if strain in self._strains:\n self._strains.remove(strain)\n # only one strain object per id\n strain_ref = self._strain_dict_name[strain.id][0]\n for name in strain_ref.names:\n if name in self._strain_dict_name:\n new_strain_list = [s for s in self._strain_dict_name[name] if s.id != strain.id]\n if not new_strain_list:\n del self._strain_dict_name[name]\n else:\n self._strain_dict_name[name] = new_strain_list\n else:\n raise ValueError(f\"Strain {strain} not found in the strain collection.\")\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.filter","title":"filter","text":"filter(strain_set: set[Strain])\n
Remove all strains that are not in strain_set
from the strain collection.
Parameters:
strain_set
(set[Strain]
) \u2013 Set of strains to keep.
src/nplinker/strain/strain_collection.py
def filter(self, strain_set: set[Strain]):\n \"\"\"Remove all strains that are not in `strain_set` from the strain collection.\n\n Args:\n strain_set: Set of strains to keep.\n \"\"\"\n # note that we need to copy the list of strains, as we are modifying it\n for strain in self._strains.copy():\n if strain not in strain_set:\n self.remove(strain)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.intersection","title":"intersection","text":"intersection(other: StrainCollection) -> StrainCollection\n
Get the intersection of two strain collections.
Parameters:
other
(StrainCollection
) \u2013 The other strain collection to compare.
Returns:
StrainCollection
\u2013 StrainCollection object containing the strains that are in both collections.
src/nplinker/strain/strain_collection.py
def intersection(self, other: StrainCollection) -> StrainCollection:\n \"\"\"Get the intersection of two strain collections.\n\n Args:\n other: The other strain collection to compare.\n\n Returns:\n StrainCollection object containing the strains that are in both collections.\n \"\"\"\n intersection = StrainCollection()\n for strain in self:\n if strain in other:\n intersection.add(strain)\n return intersection\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.has_name","title":"has_name","text":"has_name(name: str) -> bool\n
Check if the strain collection contains the given strain name (id or alias).
Parameters:
name
(str
) \u2013 Strain name (id or alias) to check.
Returns:
bool
\u2013 True if the strain name is in the collection, False otherwise.
src/nplinker/strain/strain_collection.py
def has_name(self, name: str) -> bool:\n \"\"\"Check if the strain collection contains the given strain name (id or alias).\n\n Args:\n name: Strain name (id or alias) to check.\n\n Returns:\n True if the strain name is in the collection, False otherwise.\n \"\"\"\n return name in self._strain_dict_name\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.lookup","title":"lookup","text":"lookup(name: str) -> list[Strain]\n
Look up a strain by name (id or alias).
Parameters:
name
(str
) \u2013 Strain name (id or alias) to lookup.
Returns:
list[Strain]
\u2013 List of Strain objects with the given name.
Raises:
ValueError
\u2013 If the strain name is not found.
src/nplinker/strain/strain_collection.py
def lookup(self, name: str) -> list[Strain]:\n \"\"\"Lookup a strain by name (id or alias).\n\n Args:\n name: Strain name (id or alias) to lookup.\n\n Returns:\n List of Strain objects with the given name.\n\n Raises:\n ValueError: If the strain name is not found.\n \"\"\"\n if name in self._strain_dict_name:\n return self._strain_dict_name[name]\n raise ValueError(f\"Strain {name} not found in the strain collection.\")\n
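For example, continuing the hypothetical collection sc from the add example above:
if sc.has_name(\"alias1\"):\n    strains = sc.lookup(\"alias1\")  # list of Strain objects registered under this name\n    print(strains[0].id)  # strain1\n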
"},{"location":"api/strain/#nplinker.strain.StrainCollection.read_json","title":"read_json staticmethod
","text":"read_json(file: str | PathLike) -> StrainCollection\n
Read a strain mappings JSON file and return a StrainCollection
object.
Parameters:
file
(str | PathLike
) \u2013 Path to the strain mappings JSON file.
Returns:
StrainCollection
\u2013 StrainCollection
object.
src/nplinker/strain/strain_collection.py
@staticmethod\ndef read_json(file: str | PathLike) -> StrainCollection:\n \"\"\"Read a strain mappings JSON file and return a `StrainCollection` object.\n\n Args:\n file: Path to the strain mappings JSON file.\n\n Returns:\n `StrainCollection` object.\n \"\"\"\n with open(file, \"r\") as f:\n json_data = json.load(f)\n\n # validate json data\n validate(instance=json_data, schema=STRAIN_MAPPINGS_SCHEMA)\n\n strain_collection = StrainCollection()\n for data in json_data[\"strain_mappings\"]:\n strain = Strain(data[\"strain_id\"])\n for alias in data[\"strain_alias\"]:\n strain.add_alias(alias)\n strain_collection.add(strain)\n return strain_collection\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.to_json","title":"to_json","text":"to_json(file: str | PathLike | None = None) -> str | None\n
Convert the StrainCollection
object to a JSON string.
Parameters:
file
(str | PathLike | None
, default: None
) \u2013 Path to output JSON file. If None, return the JSON string instead.
Returns:
str | None
\u2013 If input file
is None, return the JSON string. Otherwise, write the JSON string to the given file.
src/nplinker/strain/strain_collection.py
def to_json(self, file: str | PathLike | None = None) -> str | None:\n \"\"\"Convert the `StrainCollection` object to a JSON string.\n\n Args:\n file: Path to output JSON file. If None, return the JSON string instead.\n\n Returns:\n If input `file` is None, return the JSON string. Otherwise, write the JSON string to the given\n file.\n \"\"\"\n data_list = [\n {\"strain_id\": strain.id, \"strain_alias\": list(strain.aliases)} for strain in self\n ]\n json_data = {\"strain_mappings\": data_list, \"version\": \"1.0\"}\n\n # validate json data\n validate(instance=json_data, schema=STRAIN_MAPPINGS_SCHEMA)\n\n if file is not None:\n with open(file, \"w\") as f:\n json.dump(json_data, f)\n return None\n return json.dumps(json_data)\n
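A hedged round-trip sketch (the file path is hypothetical):
sc.to_json(\"strain_mappings.json\")  # writes the file and returns None\nsc2 = StrainCollection.read_json(\"strain_mappings.json\")\nprint(sc2.has_name(\"strain1\"))  # True\n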
"},{"location":"api/strain_utils/","title":"Utilities","text":""},{"location":"api/strain_utils/#nplinker.strain.utils","title":"nplinker.strain.utils","text":""},{"location":"api/strain_utils/#nplinker.strain.utils.load_user_strains","title":"load_user_strains","text":"load_user_strains(json_file: str | PathLike) -> set[Strain]\n
Load user specified strains from a JSON file.
The JSON file will be validated against the schema USER_STRAINS_SCHEMA.
The content of the JSON file could be, for example:
{\"strain_ids\": [\"strain1\", \"strain2\"]}\n
Parameters:
json_file
(str | PathLike
) \u2013 Path to the JSON file containing user specified strains.
Returns:
set[Strain]
\u2013 A set of user specified strains.
src/nplinker/strain/utils.py
def load_user_strains(json_file: str | PathLike) -> set[Strain]:\n \"\"\"Load user specified strains from a JSON file.\n\n The JSON file will be validated against the schema\n [USER_STRAINS_SCHEMA][nplinker.schemas.USER_STRAINS_SCHEMA]\n\n The content of the JSON file could be, for example:\n ```\n {\"strain_ids\": [\"strain1\", \"strain2\"]}\n ```\n\n Args:\n json_file: Path to the JSON file containing user specified strains.\n\n Returns:\n A set of user specified strains.\n \"\"\"\n with open(json_file, \"r\") as f:\n json_data = json.load(f)\n\n # validate json data\n validate(instance=json_data, schema=USER_STRAINS_SCHEMA)\n\n strains = set()\n for strain_id in json_data[\"strain_ids\"]:\n strains.add(Strain(strain_id))\n\n return strains\n
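For example, assuming a file strains_selected.json with the content shown above:
from nplinker.strain.utils import load_user_strains\n\nstrains = load_user_strains(\"strains_selected.json\")  # validated against USER_STRAINS_SCHEMA\nprint(len(strains))  # 2 for the example content above\n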
"},{"location":"api/strain_utils/#nplinker.strain.utils.podp_generate_strain_mappings","title":"podp_generate_strain_mappings","text":"podp_generate_strain_mappings(\n podp_project_json_file: str | PathLike,\n genome_status_json_file: str | PathLike,\n genome_bgc_mappings_file: str | PathLike,\n gnps_file_mappings_file: str | PathLike,\n output_json_file: str | PathLike,\n) -> StrainCollection\n
Generate strain mappings JSON file for PODP pipeline.
To get the strain mappings, we need to combine the following mappings:
strain_id <-> original_genome_id <-> resolved_genome_id <-> bgc_id
strain_id <-> MS_filename <-> spectrum_id
These mappings are extracted from the following files:
\"strain_id <-> original_genome_id\" is extracted from podp_project_json_file.
\"original_genome_id <-> resolved_genome_id\" is extracted from genome_status_json_file.
\"resolved_genome_id <-> bgc_id\" is extracted from genome_bgc_mappings_file.
\"strain_id <-> MS_filename\" is extracted from podp_project_json_file.
\"MS_filename <-> spectrum_id\" is extracted from gnps_file_mappings_file.
Parameters:
podp_project_json_file
(str | PathLike
) \u2013 The path to the PODP project JSON file.
genome_status_json_file
(str | PathLike
) \u2013 The path to the genome status JSON file.
genome_bgc_mappings_file
(str | PathLike
) \u2013 The path to the genome BGC mappings JSON file.
gnps_file_mappings_file
(str | PathLike
) \u2013 The path to the GNPS file mappings file (csv or tsv).
output_json_file
(str | PathLike
) \u2013 The path to the output JSON file.
Returns:
StrainCollection
\u2013 The strain mappings stored in a StrainCollection object.
See Also:
extract_mappings_strain_id_original_genome_id: Extract mappings \"strain_id <-> original_genome_id\".
extract_mappings_original_genome_id_resolved_genome_id: Extract mappings \"original_genome_id <-> resolved_genome_id\".
extract_mappings_resolved_genome_id_bgc_id: Extract mappings \"resolved_genome_id <-> bgc_id\".
get_mappings_strain_id_bgc_id: Get mappings \"strain_id <-> bgc_id\".
extract_mappings_strain_id_ms_filename: Extract mappings \"strain_id <-> MS_filename\".
extract_mappings_ms_filename_spectrum_id: Extract mappings \"MS_filename <-> spectrum_id\".
get_mappings_strain_id_spectrum_id: Get mappings \"strain_id <-> spectrum_id\".
src/nplinker/strain/utils.py
def podp_generate_strain_mappings(\n podp_project_json_file: str | PathLike,\n genome_status_json_file: str | PathLike,\n genome_bgc_mappings_file: str | PathLike,\n gnps_file_mappings_file: str | PathLike,\n output_json_file: str | PathLike,\n) -> StrainCollection:\n \"\"\"Generate strain mappings JSON file for PODP pipeline.\n\n To get the strain mappings, we need to combine the following mappings:\n\n - strain_id <-> original_genome_id <-> resolved_genome_id <-> bgc_id\n - strain_id <-> MS_filename <-> spectrum_id\n\n These mappings are extracted from the following files:\n\n - \"strain_id <-> original_genome_id\" is extracted from `podp_project_json_file`.\n - \"original_genome_id <-> resolved_genome_id\" is extracted from `genome_status_json_file`.\n - \"resolved_genome_id <-> bgc_id\" is extracted from `genome_bgc_mappings_file`.\n - \"strain_id <-> MS_filename\" is extracted from `podp_project_json_file`.\n - \"MS_filename <-> spectrum_id\" is extracted from `gnps_file_mappings_file`.\n\n Args:\n podp_project_json_file: The path to the PODP project\n JSON file.\n genome_status_json_file: The path to the genome status\n JSON file.\n genome_bgc_mappings_file: The path to the genome BGC\n mappings JSON file.\n gnps_file_mappings_file: The path to the GNPS file\n mappings file (csv or tsv).\n output_json_file: The path to the output JSON file.\n\n Returns:\n The strain mappings stored in a StrainCollection object.\n\n See Also:\n - `extract_mappings_strain_id_original_genome_id`: Extract mappings\n \"strain_id <-> original_genome_id\".\n - `extract_mappings_original_genome_id_resolved_genome_id`: Extract mappings\n \"original_genome_id <-> resolved_genome_id\".\n - `extract_mappings_resolved_genome_id_bgc_id`: Extract mappings\n \"resolved_genome_id <-> bgc_id\".\n - `get_mappings_strain_id_bgc_id`: Get mappings \"strain_id <-> bgc_id\".\n - `extract_mappings_strain_id_ms_filename`: Extract mappings\n \"strain_id <-> MS_filename\".\n - `extract_mappings_ms_filename_spectrum_id`: Extract mappings\n \"MS_filename <-> spectrum_id\".\n - `get_mappings_strain_id_spectrum_id`: Get mappings \"strain_id <-> spectrum_id\".\n \"\"\"\n # Get mappings strain_id <-> original_genome_id <-> resolved_genome_id <-> bgc_id\n mappings_strain_id_bgc_id = get_mappings_strain_id_bgc_id(\n extract_mappings_strain_id_original_genome_id(podp_project_json_file),\n extract_mappings_original_genome_id_resolved_genome_id(genome_status_json_file),\n extract_mappings_resolved_genome_id_bgc_id(genome_bgc_mappings_file),\n )\n\n # Get mappings strain_id <-> MS_filename <-> spectrum_id\n mappings_strain_id_spectrum_id = get_mappings_strain_id_spectrum_id(\n extract_mappings_strain_id_ms_filename(podp_project_json_file),\n extract_mappings_ms_filename_spectrum_id(gnps_file_mappings_file),\n )\n\n # Get mappings strain_id <-> bgc_id / spectrum_id\n mappings = mappings_strain_id_bgc_id.copy()\n for strain_id, spectrum_ids in mappings_strain_id_spectrum_id.items():\n if strain_id in mappings:\n mappings[strain_id].update(spectrum_ids)\n else:\n mappings[strain_id] = spectrum_ids.copy()\n\n # Create StrainCollection\n sc = StrainCollection()\n for strain_id, bgc_ids in mappings.items():\n if not sc.has_name(strain_id):\n strain = Strain(strain_id)\n for bgc_id in bgc_ids:\n strain.add_alias(bgc_id)\n sc.add(strain)\n else:\n # strain_list has only one element\n strain_list = sc.lookup(strain_id)\n for bgc_id in bgc_ids:\n strain_list[0].add_alias(bgc_id)\n\n # Write strain mappings JSON file\n sc.to_json(output_json_file)\n 
logger.info(\"Generated strain mappings JSON file: %s\", output_json_file)\n\n return sc\n
"},{"location":"api/utils/","title":"Utilities","text":""},{"location":"api/utils/#nplinker.utils","title":"nplinker.utils","text":""},{"location":"api/utils/#nplinker.utils.calculate_md5","title":"calculate_md5","text":"calculate_md5(\n fpath: str | PathLike, chunk_size: int = 1024 * 1024\n) -> str\n
Calculate the MD5 checksum of a file.
Parameters:
fpath
(str | PathLike
) \u2013 Path to the file.
chunk_size
(int
, default: 1024 * 1024
) \u2013 Chunk size for reading the file. Defaults to 1024*1024.
Returns:
str
\u2013 MD5 checksum of the file.
src/nplinker/utils.py
def calculate_md5(fpath: str | PathLike, chunk_size: int = 1024 * 1024) -> str:\n \"\"\"Calculate the MD5 checksum of a file.\n\n Args:\n fpath: Path to the file.\n chunk_size: Chunk size for reading the file. Defaults to 1024*1024.\n\n Returns:\n MD5 checksum of the file.\n \"\"\"\n if sys.version_info >= (3, 9):\n md5 = hashlib.md5(usedforsecurity=False)\n else:\n md5 = hashlib.md5()\n with open(fpath, \"rb\") as f:\n for chunk in iter(lambda: f.read(chunk_size), b\"\"):\n md5.update(chunk)\n return md5.hexdigest()\n
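For example (the file path is hypothetical):
from nplinker.utils import calculate_md5\n\nprint(calculate_md5(\"downloads/archive.zip\"))  # hex digest string\n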
"},{"location":"api/utils/#nplinker.utils.check_disk_space","title":"check_disk_space","text":"check_disk_space(func)\n
A decorator to check available disk space.
If the available disk space is less than 50GB, log and raise a warning.
Warns:
UserWarning
\u2013 If the available disk space is less than 50GB.
src/nplinker/utils.py
def check_disk_space(func):\n \"\"\"A decorator to check available disk space.\n\n If the available disk space is less than 50GB, log and raise a warning.\n\n Warnings:\n UserWarning: If the available disk space is less than 50GB.\n \"\"\"\n\n @functools.wraps(func)\n def wrapper_check_disk_space(*args, **kwargs):\n _, _, free = shutil.disk_usage(\"/\")\n free_gb = free // (2**30)\n if free_gb < 50:\n warning_message = f\"Available disk space is {free_gb}GB. Is it enough for your project?\"\n logger.warning(warning_message)\n warnings.warn(warning_message, UserWarning)\n return func(*args, **kwargs)\n\n return wrapper_check_disk_space\n
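Because it is a decorator, a hedged usage sketch looks like this (the decorated function is hypothetical):
from nplinker.utils import check_disk_space\n\n@check_disk_space\ndef prepare_data():  # hypothetical function; warns if free disk space < 50GB\n    ...\n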
"},{"location":"api/utils/#nplinker.utils.check_md5","title":"check_md5","text":"check_md5(fpath: str | PathLike, md5: str) -> bool\n
Verify the MD5 checksum of a file.
Parameters:
fpath
(str | PathLike
) \u2013 Path to the file.
md5
(str
) \u2013 MD5 checksum to verify.
Returns:
bool
\u2013 True if the MD5 checksum matches, False otherwise.
src/nplinker/utils.py
def check_md5(fpath: str | PathLike, md5: str) -> bool:\n \"\"\"Verify the MD5 checksum of a file.\n\n Args:\n fpath: Path to the file.\n md5: MD5 checksum to verify.\n\n Returns:\n True if the MD5 checksum matches, False otherwise.\n \"\"\"\n return md5 == calculate_md5(fpath)\n
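For example (path and checksum are hypothetical placeholders):
from nplinker.utils import check_md5\n\nok = check_md5(\"downloads/archive.zip\", \"0123456789abcdef0123456789abcdef\")\nprint(ok)  # True only if the file's MD5 matches\n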
"},{"location":"api/utils/#nplinker.utils.download_and_extract_archive","title":"download_and_extract_archive","text":"download_and_extract_archive(\n url: str,\n download_root: str | PathLike,\n extract_root: str | Path | None = None,\n filename: str | None = None,\n md5: str | None = None,\n remove_finished: bool = False,\n) -> None\n
Download an archive file and then extract it.
This method is a wrapper around the download_url
and extract_archive
functions.
Parameters:
url
(str
) \u2013 URL to download file from
download_root
(str | PathLike
) \u2013 Path to the directory to place downloaded file in. If it doesn't exist, it will be created.
extract_root
(str | Path | None
, default: None
) \u2013 Path to the directory the file will be extracted to. The given directory will be created if it does not exist. If omitted, the download_root
is used.
filename
(str | None
, default: None
) \u2013 Name to save the downloaded file under. If None, use the basename of the URL
md5
(str | None
, default: None
) \u2013 MD5 checksum of the download. If None, do not check
remove_finished
(bool
, default: False
) \u2013 If True
, remove the downloaded file after the extraction. Defaults to False.
src/nplinker/utils.py
def download_and_extract_archive(\n url: str,\n download_root: str | PathLike,\n extract_root: str | Path | None = None,\n filename: str | None = None,\n md5: str | None = None,\n remove_finished: bool = False,\n) -> None:\n \"\"\"Download an archive file and then extract it.\n\n This method is a wrapper of [`download_url`][nplinker.utils.download_url] and\n [`extract_archive`][nplinker.utils.extract_archive] functions.\n\n Args:\n url: URL to download file from\n download_root: Path to the directory to place downloaded\n file in. If it doesn't exist, it will be created.\n extract_root: Path to the directory the file\n will be extracted to. The given directory will be created if not exist.\n If omitted, the `download_root` is used.\n filename: Name to save the downloaded file under.\n If None, use the basename of the URL\n md5: MD5 checksum of the download. If None, do not check\n remove_finished: If `True`, remove the downloaded file\n after the extraction. Defaults to False.\n \"\"\"\n download_root = Path(download_root)\n if extract_root is None:\n extract_root = download_root\n else:\n extract_root = Path(extract_root)\n if not filename:\n filename = Path(url).name\n\n download_url(url, download_root, filename, md5)\n\n archive = download_root / filename\n extract_archive(archive, extract_root, remove_finished=remove_finished)\n
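A hedged usage sketch (the URL and directories are placeholders):
from nplinker.utils import download_and_extract_archive\n\ndownload_and_extract_archive(\n    url=\"https://example.org/data.zip\",  # placeholder URL\n    download_root=\"downloads\",\n    extract_root=\"gnps\",\n    remove_finished=True,\n)\n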
"},{"location":"api/utils/#nplinker.utils.download_url","title":"download_url","text":"download_url(\n url: str,\n root: str | PathLike,\n filename: str | None = None,\n md5: str | None = None,\n http_method: str = \"GET\",\n allow_http_redirect: bool = True,\n) -> None\n
Download a file from a url and place it in root.
Parameters:
url
(str
) \u2013 URL to download file from
root
(str | PathLike
) \u2013 Directory to place downloaded file in. If it doesn't exist, it will be created.
filename
(str | None
, default: None
) \u2013 Name to save the file under. If None, use the basename of the URL.
md5
(str | None
, default: None
) \u2013 MD5 checksum of the download. If None, do not check.
http_method
(str
, default: 'GET'
) \u2013 HTTP request method, e.g. \"GET\", \"POST\". Defaults to \"GET\".
allow_http_redirect
(bool
, default: True
) \u2013 If true, enable following redirects for all HTTP (\"http:\") methods.
src/nplinker/utils.py
@check_disk_space\ndef download_url(\n url: str,\n root: str | PathLike,\n filename: str | None = None,\n md5: str | None = None,\n http_method: str = \"GET\",\n allow_http_redirect: bool = True,\n) -> None:\n \"\"\"Download a file from a url and place it in root.\n\n Args:\n url: URL to download file from\n root: Directory to place downloaded file in. If it doesn't exist, it will be created.\n filename: Name to save the file under. If None, use the\n basename of the URL.\n md5: MD5 checksum of the download. If None, do not check.\n http_method: HTTP request method, e.g. \"GET\", \"POST\".\n Defaults to \"GET\".\n allow_http_redirect: If true, enable following redirects for all HTTP (\"http:\") methods.\n \"\"\"\n root = transform_to_full_path(root)\n # create the download directory if not exist\n root.mkdir(exist_ok=True)\n if not filename:\n filename = Path(url).name\n fpath = root / filename\n\n # check if file is already present locally\n if fpath.is_file() and md5 is not None and check_md5(fpath, md5):\n logger.info(\"Using downloaded and verified file: \" + str(fpath))\n return\n\n # download the file\n logger.info(f\"Downloading {filename} to {root}\")\n with open(fpath, \"wb\") as fh:\n with httpx.stream(http_method, url, follow_redirects=allow_http_redirect) as response:\n if not response.is_success:\n fpath.unlink(missing_ok=True)\n raise RuntimeError(\n f\"Failed to download url {url} with status code {response.status_code}\"\n )\n total = int(response.headers.get(\"Content-Length\", 0))\n\n with Progress(\n TextColumn(\"[progress.description]{task.description}\"),\n BarColumn(bar_width=None),\n \"[progress.percentage]{task.percentage:>3.1f}%\",\n \"\u2022\",\n DownloadColumn(),\n \"\u2022\",\n TransferSpeedColumn(),\n \"\u2022\",\n TimeRemainingColumn(),\n \"\u2022\",\n TimeElapsedColumn(),\n ) as progress:\n task = progress.add_task(f\"[hot_pink]Downloading {fpath.name}\", total=total)\n for chunk in response.iter_bytes():\n fh.write(chunk)\n progress.update(task, advance=len(chunk))\n\n # check integrity of downloaded file\n if md5 is not None and not check_md5(fpath, md5):\n raise RuntimeError(\"MD5 validation failed.\")\n
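For example (URL and checksum are placeholders):
from nplinker.utils import download_url\n\ndownload_url(\"https://example.org/data.zip\", \"downloads\", md5=\"0123456789abcdef0123456789abcdef\")  # placeholders\n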
"},{"location":"api/utils/#nplinker.utils.extract_archive","title":"extract_archive","text":"extract_archive(\n from_path: str | PathLike,\n extract_root: str | PathLike | None = None,\n members: list | None = None,\n remove_finished: bool = False,\n) -> str\n
Extract an archive.
The archive type and a possible compression are automatically detected from the file name.
If the file is compressed but not an archive, the call is dispatched to _decompress
function.
Parameters:
from_path
(str | PathLike
) \u2013 Path to the file to be extracted.
extract_root
(str | PathLike | None
, default: None
) \u2013 Path to the directory the file will be extracted to. The given directory will be created if it does not exist. If omitted, the directory of the archive file is used.
members
(list | None
, default: None
) \u2013 Optional selection of members to extract. If not specified, all members are extracted. Members must be a subset of the list returned by zipfile.ZipFile.namelist()
(or a list of strings) for a zip file, or by tarfile.TarFile.getmembers()
for a tar file.
remove_finished
(bool
, default: False
) \u2013 If True
, remove the file after the extraction.
Returns:
str
\u2013 Path to the directory the file was extracted to.
src/nplinker/utils.py
def extract_archive(\n from_path: str | PathLike,\n extract_root: str | PathLike | None = None,\n members: list | None = None,\n remove_finished: bool = False,\n) -> str:\n \"\"\"Extract an archive.\n\n The archive type and a possible compression is automatically detected from\n the file name.\n\n If the file is compressed but not an archive, the call is dispatched to `_decompress` function.\n\n Args:\n from_path: Path to the file to be extracted.\n extract_root: Path to the directory the file will be extracted to.\n The given directory will be created if not exist.\n If omitted, the directory of the archive file is used.\n members: Optional selection of members to extract. If not specified,\n all members are extracted.\n Members must be a subset of the list returned by\n - `zipfile.ZipFile.namelist()` or a list of strings for zip file\n - `tarfile.TarFile.getmembers()` for tar file\n remove_finished: If `True`, remove the file after the extraction.\n\n Returns:\n Path to the directory the file was extracted to.\n \"\"\"\n from_path = Path(from_path)\n\n if extract_root is None:\n extract_root = from_path.parent\n else:\n extract_root = Path(extract_root)\n\n # create the extract directory if not exist\n extract_root.mkdir(exist_ok=True)\n\n logger.info(f\"Extracting {from_path} to {extract_root}\")\n suffix, archive_type, compression = _detect_file_type(from_path)\n if not archive_type:\n return _decompress(\n from_path,\n extract_root / from_path.name.replace(suffix, \"\"),\n remove_finished=remove_finished,\n )\n\n extractor = _ARCHIVE_EXTRACTORS[archive_type]\n\n extractor(str(from_path), str(extract_root), members, compression)\n if remove_finished:\n from_path.unlink()\n\n return str(extract_root)\n
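For example (the archive path is hypothetical):
from nplinker.utils import extract_archive\n\nout_dir = extract_archive(\"downloads/data.zip\", extract_root=\"gnps\")  # returns the extraction directory as str\n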
"},{"location":"api/utils/#nplinker.utils.is_file_format","title":"is_file_format","text":"is_file_format(\n file: str | PathLike, format: str = \"tsv\"\n) -> bool\n
Check if the file is in the given format.
Parameters:
file
(str | PathLike
) \u2013 Path to the file to check.
format
(str
, default: 'tsv'
) \u2013 The format to check for, either \"tsv\" or \"csv\".
Returns:
bool
\u2013 True if the file is in the given format, False otherwise.
src/nplinker/utils.py
def is_file_format(file: str | PathLike, format: str = \"tsv\") -> bool:\n \"\"\"Check if the file is in the given format.\n\n Args:\n file: Path to the file to check.\n format: The format to check for, either \"tsv\" or \"csv\".\n\n Returns:\n True if the file is in the given format, False otherwise.\n \"\"\"\n try:\n with open(file, \"rt\") as f:\n if format == \"tsv\":\n reader = csv.reader(f, delimiter=\"\\t\")\n elif format == \"csv\":\n reader = csv.reader(f, delimiter=\",\")\n else:\n raise ValueError(f\"Unknown format '{format}'.\")\n for _ in reader:\n pass\n return True\n except csv.Error:\n return False\n
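For example (the file path is hypothetical):
from nplinker.utils import is_file_format\n\nprint(is_file_format(\"gnps/file_mappings.csv\", format=\"csv\"))  # True if the file parses as CSV\n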
"},{"location":"api/utils/#nplinker.utils.list_dirs","title":"list_dirs","text":"list_dirs(\n root: str | PathLike, keep_parent: bool = True\n) -> list[str]\n
List all directories at a given root.
Parameters:
root
(str | PathLike
) \u2013 Path to directory whose folders need to be listed
keep_parent
(bool
, default: True
) \u2013 If true, prepends the path to each result, otherwise only returns the name of the directories found
src/nplinker/utils.py
def list_dirs(root: str | PathLike, keep_parent: bool = True) -> list[str]:\n \"\"\"List all directories at a given root.\n\n Args:\n root: Path to directory whose folders need to be listed\n keep_parent: If true, prepends the path to each result, otherwise\n only returns the name of the directories found\n \"\"\"\n root = transform_to_full_path(root)\n directories = [str(p) for p in root.iterdir() if p.is_dir()]\n if not keep_parent:\n directories = [os.path.basename(d) for d in directories]\n return directories\n
"},{"location":"api/utils/#nplinker.utils.list_files","title":"list_files","text":"list_files(\n root: str | PathLike,\n prefix: str | tuple[str, ...] = \"\",\n suffix: str | tuple[str, ...] = \"\",\n keep_parent: bool = True,\n) -> list[str]\n
List all files at a given root.
Parameters:
root
(str | PathLike
) \u2013 Path to directory whose files need to be listed
prefix
(str | tuple[str, ...]
, default: ''
) \u2013 Prefix of the file names to match. Defaults to empty string '\"\"'.
suffix
(str | tuple[str, ...]
, default: ''
) \u2013 Suffix of the files to match, e.g. \".png\" or (\".jpg\", \".png\"). Defaults to empty string '\"\"'.
keep_parent
(bool
, default: True
) \u2013 If true, prepends the parent path to each result, otherwise only returns the name of the files found. Defaults to True.
src/nplinker/utils.py
def list_files(\n root: str | PathLike,\n prefix: str | tuple[str, ...] = \"\",\n suffix: str | tuple[str, ...] = \"\",\n keep_parent: bool = True,\n) -> list[str]:\n \"\"\"List all files at a given root.\n\n Args:\n root: Path to directory whose files need to be listed\n prefix: Prefix of the file names to match,\n Defaults to empty string '\"\"'.\n suffix: Suffix of the files to match, e.g. \".png\" or\n (\".jpg\", \".png\").\n Defaults to empty string '\"\"'.\n keep_parent: If true, prepends the parent path to each\n result, otherwise only returns the name of the files found.\n Defaults to True.\n \"\"\"\n root = Path(root)\n files = [\n str(p)\n for p in root.iterdir()\n if p.is_file() and p.name.startswith(prefix) and p.name.endswith(suffix)\n ]\n\n if not keep_parent:\n files = [os.path.basename(f) for f in files]\n\n return files\n
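A hedged sketch of both listing helpers (the directory layout is hypothetical):
from nplinker.utils import list_dirs, list_files\n\nbgc_dirs = list_dirs(\"antismash\", keep_parent=False)  # e.g. ['GCF_000514975.1', ...]\ngbk_files = list_files(\"antismash/GCF_000514975.1\", suffix=\".gbk\", keep_parent=False)\n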
"},{"location":"api/utils/#nplinker.utils.transform_to_full_path","title":"transform_to_full_path","text":"transform_to_full_path(p: str | PathLike) -> Path\n
Transform a path to a full path.
The path is expanded (i.e. the ~
will be replaced with actual path) and converted to an absolute path (i.e. .
or ..
will be replaced with actual path).
Parameters:
p
(str | PathLike
) \u2013 The path to transform.
Returns:
Path
\u2013 The transformed full path.
src/nplinker/utils.py
def transform_to_full_path(p: str | PathLike) -> Path:\n \"\"\"Transform a path to a full path.\n\n The path is expanded (i.e. the `~` will be replaced with actual path) and converted to an\n absolute path (i.e. `.` or `..` will be replaced with actual path).\n\n Args:\n p: The path to transform.\n\n Returns:\n The transformed full path.\n \"\"\"\n # Multiple calls to `Path` are used to ensure static typing compatibility.\n p = Path(p).expanduser()\n p = Path(p).resolve()\n return Path(p)\n
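For example:
from nplinker.utils import transform_to_full_path\n\nprint(transform_to_full_path(\"~/nplinker_project\"))  # e.g. /home/user/nplinker_project\n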
"},{"location":"concepts/bigscape/","title":"BigScape","text":"NPLinker can run BigScape automatically if the bigscape
directory does not exist in the working directory. Both version 1 and version 2 of BigScape are supported.
See the configuration template for how to set parameters for running BigScape.
See the default configurations for the default parameters used in NPLinker.
"},{"location":"concepts/config_file/","title":"Config File","text":""},{"location":"concepts/config_file/#configuration-template","title":"Configuration Template","text":"#############################\n# NPLinker configuration file\n#############################\n\n# The root directory of the NPLinker project. You need to create it first.\n# The value is required and must be a full path.\nroot_dir = \"<NPLinker root directory>\"\n# The mode for preparing dataset.\n# The available modes are \"podp\" and \"local\".\n# \"podp\" mode is for using the PODP platform (https://pairedomicsdata.bioinformatics.nl/) to prepare the dataset.\n# \"local\" mode is for preparing the dataset locally. So uers do not need to upload their data to the PODP platform.\n# The value is required.\nmode = \"podp\"\n# The PODP project identifier.\n# The value is required if the mode is \"podp\".\npodp_id = \"\"\n\n\n[log]\n# Log level. The available levels are same as the levels in python package `logging`:\n# \"DEBUG\", \"INFO\", \"WARNING\", \"ERROR\", \"CRITICAL\".\n# The default value is \"INFO\".\nlevel = \"INFO\"\n# The log file to append log messages.\n# The value is optional.\n# If not set or use empty string, log messages will not be written to a file.\n# The file will be created if it does not exist. Log messages will be appended to the file if it exists.\nfile = \"path/to/logfile\"\n# Whether to write log meesages to console.\n# The default value is true.\nuse_console = true\n\n\n[mibig]\n# Whether to use mibig metadta (json).\n# The default value is true.\nto_use = true\n# The version of mibig metadata.\n# Make sure using the same version of mibig in bigscape.\n# The default value is \"3.1\"\nversion = \"3.1\"\n\n\n[bigscape]\n# The parameters to use for running BiG-SCAPE.\n# Version of BiG-SCAPE to run. Make sure to change the parameters property below as well\n# when changing versions.\nversion = 1\n# Required BiG-SCAPE parameters.\n# --------------\n# For version 1:\n# -------------\n# Required parameters are: `--mix`, `--include_singletons` and `--cutoffs`. NPLinker needs them to run the analysis properly.\n# Do NOT set these parameters: `--inputdir`, `--outputdir`, `--pfam_dir`. NPLinker will automatically configure them.\n# If parameter `--mibig` is set, make sure to set the config `mibig.to_use` to true and `mibig.version` to the version of mibig in BiG-SCAPE.\n# The default value is \"--mibig --clans-off --mix --include_singletons --cutoffs 0.30\".\n# --------------\n# For version 2:\n# --------------\n# Note that BiG-SCAPE v2 has subcommands. NPLinker requires the `cluster` subcommand and its parameters.\n# Required parameters of `cluster` subcommand are: `--mibig_version`, `--include_singletons` and `--gcf_cutoffs`.\n# DO NOT set these parameters: `--pfam_path`, `--inputdir`, `--outputdir`. NPLinker will automatically configure them.\n# BiG-SCPAPE v2 also runs a `--mix` analysis by default, so you don't need to set this parameter here.\n# Example parameters for BiG-SCAPE v2: \"--mibig_version 3.1 --include_singletons --gcf_cutoffs 0.30\"\nparameters = \"--mibig --clans-off --mix --include_singletons --cutoffs 0.30\"\n# Which bigscape cutoff to use for NPLinker analysis.\n# There might be multiple cutoffs in bigscape output.\n# Note that this value must be a string.\n# The default value is \"0.30\".\ncutoff = \"0.30\"\n\n\n[scoring]\n# Scoring methods.\n# Valid values are \"metcalf\" and \"rosetta\".\n# The default value is \"metcalf\".\nmethods = [\"metcalf\"]\n
"},{"location":"concepts/config_file/#default-configurations","title":"Default Configurations","text":"The default configurations are automatically used by NPLinker if you don't set them in your config file.
# NPLinker default configurations\n\n[log]\nlevel = \"INFO\"\nuse_console = true\n\n[mibig]\nto_use = true\nversion = \"3.1\"\n\n[bigscape]\nversion = 1\nparameters = \"--mibig --clans-off --mix --include_singletons --cutoffs 0.30\"\ncutoff = \"0.30\"\n\n[scoring]\nmethods = [\"metcalf\"]\n
"},{"location":"concepts/config_file/#config-loader","title":"Config loader","text":"You can load the configuration file using the load_config function.
from nplinker.config import load_config\nconfig = load_config('path/to/nplinker.toml')\n
When you use NPLinker as an application, you can get access to the configuration object directly:
from nplinker import NPLinker\nnpl = NPLinker('path/to/nplinker.toml')\nprint(npl.config)\n
"},{"location":"concepts/gnps_data/","title":"GNPS data","text":"NPLinker requires GNPS molecular networking data as input. It currently accepts data from the following GNPS workflows:
METABOLOMICS-SNETS
(data should be downloaded from the option Download Clustered Spectra as MGF
)METABOLOMICS-SNETS-V2
(Download Clustered Spectra as MGF
)FEATURE-BASED-MOLECULAR-NETWORKING
(Download Cytoscape Data
)METABOLOMICS-SNETS
workflowMETABOLOMICS-SNETS-V2
FEATURE-BASED-MOLECULAR-NETWORKING
NPLinker input GNPS file in the archive of Download Clustered Spectra as MGF
spectra.mgf: METABOLOMICS-SNETS*.mgf; molecular_families.tsv: networkedges_selfloop/*.pairsinfo; annotations.tsv: result_specnets_DB/*.tsv; file_mappings.tsv: clusterinfosummarygroup_attributes_withIDs_withcomponentID/*.tsv. For example, the file METABOLOMICS-SNETS*.mgf
from the downloaded zip archive is used as the spectra.mgf
input file of NPLinker.
When manually preparing GNPS data for NPLinker, the METABOLOMICS-SNETS*.mgf
must be renamed to spectra.mgf
and placed in the gnps
sub-directory of the NPLinker working directory.
Download Clustered Spectra as MGF
spectra.mgf: METABOLOMICS-SNETS-V2*.mgf; molecular_families.tsv: networkedges_selfloop/*.selfloop; annotations.tsv: result_specnets_DB/*.tsv; file_mappings.tsv: clusterinfosummarygroup_attributes_withIDs_withcomponentID/*.clustersummary NPLinker input GNPS file in the archive of Download Cytoscape Data
spectra.mgf: spectra/*.mgf; molecular_families.tsv: networkedges_selfloop/*.selfloop; annotations.tsv: DB_result/*.tsv; file_mappings.csv: quantification_table/*.csv. Note that file_mappings.csv
is a CSV file, not a TSV file, unlike in the other workflows.
NPLinker requires a fixed structure of working directory with fixed names for the input and output data.
root_dir # (1)!\n \u2502\n \u251c\u2500\u2500 nplinker.toml # (2)!\n \u251c\u2500\u2500 strain_mappings.json [F] # (3)!\n \u251c\u2500\u2500 strains_selected.json [F][O] # (4)!\n \u2502\n \u251c\u2500\u2500 gnps [F] # (5)!\n \u2502 \u251c\u2500\u2500 spectra.mgf [F]\n \u2502 \u251c\u2500\u2500 molecular_families.tsv [F]\n \u2502 \u251c\u2500\u2500 annotations.tsv [F]\n \u2502 \u2514\u2500\u2500 file_mappings.tsv (.csv) [F] # (6)!\n \u2502\n \u251c\u2500\u2500 antismash [F] # (7)!\n \u2502 \u251c\u2500\u2500 GCF_000514975.1\n \u2502 \u2502 \u251c\u2500\u2500 xxx.region001.gbk\n \u2502 \u2502 \u2514\u2500\u2500 ...\n \u2502 \u251c\u2500\u2500 GCF_000016425.1\n \u2502 \u2502 \u251c\u2500\u2500 xxxx.region001.gbk\n \u2502 \u2502 \u2514\u2500\u2500 ...\n \u2502 \u2514\u2500\u2500 ...\n \u2502\n \u251c\u2500\u2500 bigscape [F][O] # (8)!\n \u2502 \u251c\u2500\u2500 mix_clustering_c0.30.tsv [F] # (9)!\n \u2502 \u2514\u2500\u2500 bigscape_running_output\n \u2502 \u2514\u2500\u2500 ...\n \u2502\n \u251c\u2500\u2500 downloads [F][A] # (10)!\n \u2502 \u251c\u2500\u2500 paired_datarecord_4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.json # (11)!\n \u2502 \u251c\u2500\u2500 GCF_000016425.1.zip\n \u2502 \u251c\u2500\u2500 GCF_0000514975.1.zip\n \u2502 \u251c\u2500\u2500 c22f44b14a3d450eb836d607cb9521bb.zip\n \u2502 \u251c\u2500\u2500 genome_status.json\n \u2502 \u2514\u2500\u2500 mibig_json_3.1.tar.gz\n \u2502\n \u251c\u2500\u2500 mibig [F][A] # (12)!\n \u2502 \u251c\u2500\u2500 BGC0000001.json\n \u2502 \u251c\u2500\u2500 BGC0000002.json\n \u2502 \u2514\u2500\u2500 ...\n \u2502\n \u251c\u2500\u2500 output [F][A] # (13)!\n \u2502 \u2514\u2500\u2500 ...\n \u2502\n \u2514\u2500\u2500 ... # (14)!\n
root_dir
is the working directory you created, used as the root directory for NPLinker. nplinker.toml
is the configuration file (toml format) provided by the user for running NPLinker. strain_mappings.json
contains the mappings from strain to genomics and metabolomics data. It is generated by NPLinker for podp
mode; for local
mode, users need to create it manually. [F]
means the file name nplinker.toml
is a fixed name (including the extension) and must be named as shown. strains_selected.json
is an optional file containing the list of strains to be used in the analysis. If it is not provided, NPLinker will use all strains detected from the input data. [O]
means the file strains_selected.json
is optional for users to provide. gnps
directory contains the GNPS data. The files in this directory must be named as shown. See the GNPS Data page for more information about the GNPS data. The file_mappings file can be in .tsv
or .csv
format.antismash
directory contains a collection of AntiSMASH BGC data. The BGC data (*.region*.gbk
files) must be stored in subdirectories named after NCBI accession number (e.g. GCF_000514975.1
).bigscape
directory is optional and contains the output of BigScape. If the directory is not provided, NPLinker will run BigScape automatically to generate the data using the AntiSMASH BGC data.mix_clustering_c0.30.tsv
is an example output of BigScape. The file name must follow the pattern mix_clustering_c{cutoff}.tsv
, where {cutoff}
is the cutoff value used in the BigScape run.downloads
directory is automatically created and managed by NPLinker. It stores the downloaded data from the internet. Users can also use it to store their own downloaded data. [A]
means the directory is automatically created and/or managed by NPLinker.downloads
directory.mibig
directory contains the MIBiG metadata, which is automatically created and downloaded by NPLinker. Users should not interfere with this directory and its content.output
directory is automatically created by NPLinker. It stores the output data of NPLinker.Tip
[F]
means the file or directory name is fixed and must be named as shown. The names are defined in the defaults module.[O]
means the file or directory is optional for users to provide. It does not mean the file or directory is optional for NPLinker to use. If it's not provided by the user, NPLinker may generate it.[A]
means the directory is automatically created and/or managed by NPLinker.The DatasetArranger is implemented according to the following flowcharts.
"},{"location":"diagrams/arranger/#strain-mappings-file","title":"Strain mappings file","text":"flowchart TD\n StrainMappings[`strain_mappings.json`] --> SM{Is the mode PODP?}\n SM --> |No |SM0[Validate the file]\n SM --> |Yes|SM1[Generate the file] --> SM0
"},{"location":"diagrams/arranger/#strain-selection-file","title":"Strain selection file","text":"flowchart TD\n StrainsSelected[`strains_selected.json`] --> S{Does the file exist?}\n S --> |No | S0[Nothing to do]\n S --> |Yes| S1[Validate the file]
"},{"location":"diagrams/arranger/#podp-project-metadata-json-file","title":"PODP project metadata json file","text":"flowchart TD\n podp[PODP project metadata json file] --> A{Is the mode PODP?}\n A --> |No | A0[Nothing to do]\n A --> |Yes| P{Does the file exist?}\n P --> |No | P0[Download the file] --> P1\n P --> |Yes| P1[Validate the file]
"},{"location":"diagrams/arranger/#gnps-antismash-and-bigscape","title":"GNPS, AntiSMASH and BigScape","text":"flowchart TD\n ConfigError[Dynaconf config validation error]\n DataError[Data validation error]\n UseIt[Use the data]\n Download[First remove existing data if relevent, then download or generate data]\n\n A[GNPS, antiSMASH and BigSCape] --> B{Pass Dynaconf config validation?}\n B -->|No | ConfigError\n B -->|Yes| G{Is the mode PODP?}\n\n G -->|No, local mode| G1{Does data dir exist?}\n G1 -->|No | DataError\n G1 -->|Yes| H{Pass data validation?}\n H --> |No | DataError\n H --> |Yes| UseIt \n\n G -->|Yes, podp mode| G2{Does data dir exist?}\n G2 --> |No | Download\n G2 --> |Yes | J{Pass data validation?}\n J -->|No | Download --> |try max 2 times| J\n J -->|Yes| UseIt
"},{"location":"diagrams/arranger/#mibig-data","title":"MIBiG Data","text":"MIBiG data is always downloaded automatically. Users cannot provide their own MIBiG data.
flowchart TD\n Mibig[MIBiG] --> M0{Pass Dynaconf config validation?}\n M0 -->|No | M01[Dynaconf config validation error]\n M0 -->|Yes | MibigDownload[First remove existing data if relevant and then download data]
"},{"location":"diagrams/loader/","title":"Dataset Loading Pipeline","text":"The DatasetLoader is implemented according to the following pipeline.
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"NPLinker","text":"NPLinker is a python framework for data mining microbial natural products by integrating genomics and metabolomics data.
For a deep understanding of NPLinker, please refer to the original paper.
Under Development
NPLinker v2 is under active development (see its pre-releases). The documentation is not complete yet. If you have any questions, please contact us via GitHub Issues.
"},{"location":"install/","title":"Installation","text":"RequirementsNPLinker is a python package that has both pypi packages and non-pypi packages as dependencies. It requires ~4.5GB of disk space to install all the dependencies.
Install nplinker
package as follows:
# Check python version (\u22653.9)\npython --version\n\n# Create a new virtual environment\npython -m venv env # (1)!\nsource env/bin/activate # (2)! \n\n# install nplinker package (requiring ~300MB of disk space)\npip install --pre nplinker # (3)!\n\n# install nplinker non-pypi dependencies and databases (~4GB)\ninstall-nplinker-deps\n
You can also use conda
to create a new environment, but NPLinker is not available on conda yet. Use the pip
command and make sure it is provided by the activated virtual environment. Installing the pre-release requires the --pre
option. You can also install NPLinker from source code:
Install from latest source codepip install git+https://github.com/nplinker/nplinker@dev # (1)!\ninstall-nplinker-deps\n
@dev
is the branch name. You can replace it with another branch name, a commit, or a tag.NPLinker uses the standard library logging module for managing log messages and the python library rich to colorize the log messages. Depending on how you use NPLinker, you can set up logging in different ways.
"},{"location":"logging/#nplinker-as-an-application","title":"NPLinker as an application","text":"If you're using NPLinker as an application, you're running the whole workflow of NPLinker as described in the Quickstart. In this case, you can set up logging in the nplinker configuration file nplinker.toml
.
If you're using NPLinker as a library, you're using only some functions and classes of NPLinker in your script. By default, NPLinker will not log any messages. However, you can set up logging in your script to log messages.
Set up logging in 'your_script.py'# Set up logging configuration first\nfrom nplinker import setup_logging\n\nsetup_logging(level=\"DEBUG\", file=\"nplinker.log\", use_console=True) # (1)!\n\n# Your business code here\n# e.g. download and extract nplinker example data\nfrom nplinker.utils import download_and_extract_archive\n\ndownload_and_extract_archive(\n url=\"https://zenodo.org/records/10822604/files/nplinker_local_mode_example.zip\",\n download_root=\".\",\n)\n
setup_logging
function sets up the logging configuration. The level
argument sets the logging level. The file
argument sets the log file. The use_console
argument sets whether to log messages to the console.The log messages will be written to the log file nplinker.log
and displayed in the console with a format like this: [Date Time] Level Log-message Module:Line
.
# Run your script\n$ python your_script.py\nDownloading nplinker_local_mode_example.zip \u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501 100.0% \u2022 195.3/195.3 MB \u2022 2.6 MB/s \u2022 0:00:00 \u2022 0:01:02 # (1)!\n[2024-05-10 15:14:48] INFO Extracting nplinker_local_mode_example.zip to . utils.py:401\n\n# Check the log file\n$ cat nplinker.log\n[2024-05-10 15:14:48] INFO Extracting nplinker_local_mode_example.zip to . utils.py:401\n
NPLinker allows you to run in two modes:
local
modepodp
mode The local
mode assumes that the data required by NPLinker is available on your local machine.
The required input data includes:
METABOLOMICS-SNETS
,METABOLOMICS-SNETS-V2
FEATURE-BASED-MOLECULAR-NETWORKING
The podp
mode assumes that you use an identifier of the Paired Omics Data Platform (PODP) as the input for NPLinker. NPLinker will then download and prepare all necessary data based on the PODP id, which refers to the metadata of the dataset.
So, which mode will you use? The answer is important for the next steps.
"},{"location":"quickstart/#1-create-a-working-directory","title":"1. Create a working directory","text":"The working directory is used to store all input and output data for NPLinker. You can name this directory as you like, for example nplinker_quickstart
:
mkdir nplinker_quickstart\n
Important
Before going to the next step, make sure you get familiar with how NPLinker organizes data in the working directory, see Working Directory Structure page.
"},{"location":"quickstart/#2-prepare-input-data-local-mode-only","title":"2. Prepare input data (local
mode only)","text":"Details Skip this step if you choose to use the podp
mode.
If you choose to use the local
mode, meaning you have input data of NPLinker stored on your local machine, you need to move the input data to the working directory created in the previous step.
NPLinker accepts data from the output of the following GNPS workflows:
METABOLOMICS-SNETS
METABOLOMICS-SNETS-V2
FEATURE-BASED-MOLECULAR-NETWORKING
.NPLinker provides the tools GNPSDownloader
and GNPSExtractor
to download and extract the GNPS data with ease. All you need to provide is a valid GNPS task ID, referring to a task of one of the GNPS workflows supported by NPLinker.
Given an example of GNPS task at https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c22f44b14a3d450eb836d607cb9521bb, the task id is the last part of this url, i.e. c22f44b14a3d450eb836d607cb9521bb
. Opening this link, you can find the workflow info in the \"Workflow\" row of the \"Job Status\" table; in this case, it is METABOLOMICS-SNETS
.
import os\nfrom nplinker.metabolomics.gnps import GNPSDownloader, GNPSExtractor\n\n# Go to the working directory\nos.chdir(\"nplinker_quickstart\")\n\n# Download GNPS data & get the path to the downloaded archive\ndownloader = GNPSDownloader(\"gnps_task_id\", \"downloads\") # (1)!\ndownloaded_archive = downloader.download().get_download_file()\n\n# Extract GNPS data to `gnps` directory\nextractor = GNPSExtractor(downloaded_archive, \"gnps\") # (2)!\n
downloaded_archive
with the actual path to your GNPS data archive if you skipped the download steps.The required data for NPLinker will be extracted to the gnps
subdirectory of the working directory.
Info
Not all GNPS data are required by NPLinker, and only the necessary data will be extracted. During the extraction, these data will be renamed to the standard names used by NPLinker. See the page GNPS Data for more information.
Prepare GNPS data manuallyIf you have GNPS data but it is not in the archive format as downloaded from GNPS, it's recommended to re-download the data from GNPS.
If (re-)downloading is not possible, you could manually prepare data for the gnps
directory. In this case, you must make sure that the data is organized as expected by NPLinker. See the page GNPS Data for examples of how to prepare the data.
NPLinker requires AntiSMASH BGC data as input, which are organized in the antismash
subdirectory of the working directory.
For each AntiSMASH run output, the BGC data must be stored in a subdirectory named after the NCBI accession number (e.g. GCF_000514975.1
). Only the *.region*.gbk
files are required by NPLinker.
When manually preparing AntiSMASH data for NPLinker, you must make sure that the data is organized as expected by NPLinker. See the page Working Directory Structure for more information.
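As an illustration, here is a minimal sketch of copying the BGC files of one genome into that layout (the source directory and the accession number are hypothetical examples):
from pathlib import Path\nimport shutil\n\n# Hypothetical paths: one AntiSMASH run output and the NPLinker working directory\nantismash_output = Path(\"my_antismash_run\")\ntarget = Path(\"nplinker_quickstart/antismash/GCF_000514975.1\")\ntarget.mkdir(parents=True, exist_ok=True)\n\n# Only the *.region*.gbk files are needed by NPLinker\nfor gbk in antismash_output.glob(\"*.region*.gbk\"):\n    shutil.copy(gbk, target / gbk.name)\n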
"},{"location":"quickstart/#bigscape-data-optional","title":"BigScape data (optional)","text":"It is optional to provide the output of BigScape to NPLinker. If the output of BigScape is not provided, NPLinker will run BigScape automatically to generate the data using the AntiSMASH BGC data.
If you have the output of BigScape, you can put its mix_clustering_c{cutoff}.tsv
file in the bigscape
subdirectory of the NPLinker working directory, where {cutoff}
is the cutoff value used in the BigScape run.
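As an illustration, a minimal sketch of copying an existing BigScape clustering file into the working directory (both paths are hypothetical examples; only the naming pattern is fixed):
from pathlib import Path\nimport shutil\n\n# Hypothetical paths: adjust to your own BigScape output and working directory\nbigscape_output = Path(\"my_bigscape_run/mix_clustering_c0.30.tsv\")\nbigscape_dir = Path(\"nplinker_quickstart/bigscape\")\nbigscape_dir.mkdir(parents=True, exist_ok=True)\n\n# The copied file must keep the `mix_clustering_c{cutoff}.tsv` naming pattern\nshutil.copy(bigscape_output, bigscape_dir / bigscape_output.name)\n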
The strain mappings file strain_mappings.json
is required by NPLinker to map the strain to genomics and metabolomics data.
{\n \"strain_mappings\": [\n {\n \"strain_id\": \"strain_id_1\", # (1)!\n \"strain_alias\": [\"bgc_id_1\", \"spectrum_id_1\", ...] # (2)!\n },\n {\n \"strain_id\": \"strain_id_2\",\n \"strain_alias\": [\"bgc_id_2\", \"spectrum_id_2\", ...]\n },\n ...\n ],\n \"version\": \"1.0\" # (3)!\n}\n
strain_id
is the unique identifier of the strain.strain_alias
is a list of aliases of the strain, which are the identifiers of the BGCs and spectra of the strain.version
is the schema version of this file. It is recommended to use the latest version of the schema. The current latest version is 1.0
. The BGC id is the same as the name of the BGC file in the antismash
directory, for example, given a BGC file xxxx.region001.gbk
, the BGC id is xxxx.region001
.
The spectrum id is the same as the scan number in the spectra.mgf
file in the gnps
directory, for example, given a spectrum in the mgf file with a scan SCANS=1
, the spectrum id is 1
.
If you labelled the mzXML files (input for GNPS) with the strain id, you may need the function extract_mappings_ms_filename_spectrum_id to extract the mappings from mzXML files to the spectrum ids.
For the local
mode, you need to create this file manually and put it in the working directory. It takes some effort to prepare this file manually, especially when you have a large number of strains.
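As a starting point, here is a minimal sketch of generating strain_mappings.json with the standard library, following the schema shown above (the strain ids and aliases are hypothetical placeholders):
import json\n\n# Hypothetical ids: replace with your own strain ids, BGC ids and spectrum ids\nmappings = {\n    \"strain_mappings\": [\n        {\"strain_id\": \"strain_1\", \"strain_alias\": [\"strain_1.region001\", \"1\"]},\n        {\"strain_id\": \"strain_2\", \"strain_alias\": [\"strain_2.region001\", \"2\"]},\n    ],\n    \"version\": \"1.0\",\n}\n\n# Write the file to the working directory\nwith open(\"nplinker_quickstart/strain_mappings.json\", \"w\") as f:\n    json.dump(mappings, f, indent=4)\n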
The configuration file nplinker.toml
is required by NPLinker to specify the working directory, mode, and other settings for the run of NPLinker. You can put the nplinker.toml
file in any place, but it is recommended to put it in the working directory created in step 2.
The details of all settings can be found at this page Config File.
To keep it simple, default settings will be used automatically by NPLinker if you don't set them in your nplinker.toml
config file.
What you need to do is to set the root_dir
and mode
in the nplinker.toml
file.
local
modepodp
mode nplinker.tomlroot_dir = \"absolute/path/to/working/directory\" # (1)!\nmode = \"local\"\n# and other settings you want to override the default settings \n
absolute/path/to/working/directory
with the absolute path to the working directory created in step 2.root_dir = \"absolute/path/to/working/directory\" # (1)!\nmode = \"podp\"\npodp_id = \"podp_id\" # (2)!\n# and other settings you want to override the default settings \n
absolute/path/to/working/directory
with the absolute path to the working directory created in step 2.podp_id
with the identifier of the dataset in the Paired Omics Data Platform (PODP).Before running NPLinker, make sure your working directory has the correct directory structure and names as described in the Working Directory Structure page.
Run NPLinker in your working directoryfrom nplinker import NPLinker\n\n# create an instance of NPLinker\nnpl = NPLinker(\"nplinker.toml\") # (1)!\n\n# load data\nnpl.load_data()\n\n# check loaded data\nprint(npl.bgcs)\nprint(npl.gcfs)\nprint(npl.spectra)\nprint(npl.mfs)\nprint(npl.strains)\n\n# compute the links for the first 3 GCFs using metcalf scoring method\nlink_graph = npl.get_links(npl.gcfs[:3], \"metcalf\") # (2)!\n\n# get links as a list of tuples\nlink_graph.links \n\n# get the link data between two objects or entities\nlink_graph.get_link_data(npl.gcfs[0], npl.spectra[0]) \n\n# Save data to a pickle file\nnpl.save_data(\"npl.pkl\", link_graph)\n
nplinker.toml
with the actual path to your configuration file.get_links
returns a LinkGraph object that represents the calculated links between the GCFs and other entities as a graph.For more info about the classes and methods, see the API Documentation.
"},{"location":"api/antismash/","title":"AntiSMASH","text":""},{"location":"api/antismash/#nplinker.genomics.antismash","title":"nplinker.genomics.antismash","text":""},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader","title":"AntismashBGCLoader","text":"AntismashBGCLoader(data_dir: str | PathLike)\n
Bases: BGCLoaderBase
Data loader for AntiSMASH BGC genbank (.gbk) files.
Parameters:
data_dir
(str | PathLike
) \u2013 Path to AntiSMASH directory that contains a collection of AntiSMASH outputs.
The input data_dir
must follow the structure defined in the Working Directory Structure for AntiSMASH data, e.g.:
antismash\n \u251c\u2500\u2500 genome_id_1 # one AntiSMASH output, e.g. GCF_000514775.1\n \u2502\u00a0 \u251c\u2500\u2500 NZ_AZWO01000004.region001.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 genome_id_2\n \u2502\u00a0 \u251c\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n
Source code in src/nplinker/genomics/antismash/antismash_loader.py
def __init__(self, data_dir: str | PathLike) -> None:\n \"\"\"Initialize the AntiSMASH BGC loader.\n\n Args:\n data_dir: Path to AntiSMASH directory that contains a collection of AntiSMASH outputs.\n\n Notes:\n The input `data_dir` must follow the structure defined in the\n [Working Directory Structure][working-directory-structure] for AntiSMASH data, e.g.:\n ```shell\n antismash\n \u251c\u2500\u2500 genome_id_1 # one AntiSMASH output, e.g. GCF_000514775.1\n \u2502\u00a0 \u251c\u2500\u2500 NZ_AZWO01000004.region001.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 genome_id_2\n \u2502\u00a0 \u251c\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n ```\n \"\"\"\n self.data_dir = str(data_dir)\n self._file_dict = self._parse_data_dir(self.data_dir)\n self._bgcs = self._parse_bgcs(self._file_dict)\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader.data_dir","title":"data_dir instance-attribute
","text":"data_dir = str(data_dir)\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader.get_bgc_genome_mapping","title":"get_bgc_genome_mapping","text":"get_bgc_genome_mapping() -> dict[str, str]\n
Get the mapping from BGC to genome.
Info
The directory name of the gbk files is treated as genome id.
Returns:
dict[str, str]
\u2013 The key is BGC name (gbk file name) and value is genome id (the directory name of the
dict[str, str]
\u2013 gbk file).
src/nplinker/genomics/antismash/antismash_loader.py
def get_bgc_genome_mapping(self) -> dict[str, str]:\n \"\"\"Get the mapping from BGC to genome.\n\n !!! info\n The directory name of the gbk files is treated as genome id.\n\n Returns:\n The key is BGC name (gbk file name) and value is genome id (the directory name of the\n gbk file).\n \"\"\"\n return {\n bid: os.path.basename(os.path.dirname(bpath)) for bid, bpath in self._file_dict.items()\n }\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader.get_files","title":"get_files","text":"get_files() -> dict[str, str]\n
Get BGC gbk files.
Returns:
dict[str, str]
\u2013 The key is BGC name (gbk file name) and value is path to the gbk file.
src/nplinker/genomics/antismash/antismash_loader.py
def get_files(self) -> dict[str, str]:\n \"\"\"Get BGC gbk files.\n\n Returns:\n The key is BGC name (gbk file name) and value is path to the gbk file.\n \"\"\"\n return self._file_dict\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.AntismashBGCLoader.get_bgcs","title":"get_bgcs","text":"get_bgcs() -> list[BGC]\n
Get all BGC objects.
Returns:
list[BGC]
\u2013 A list of BGC objects
src/nplinker/genomics/antismash/antismash_loader.py
def get_bgcs(self) -> list[BGC]:\n \"\"\"Get all BGC objects.\n\n Returns:\n A list of BGC objects\n \"\"\"\n return self._bgcs\n
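For instance, a short usage sketch of this loader (the data directory path is a hypothetical example):
from nplinker.genomics.antismash import AntismashBGCLoader\n\n# Hypothetical path: an `antismash` directory following the structure above\nloader = AntismashBGCLoader(\"nplinker_quickstart/antismash\")\n\n# Map each BGC name (gbk file name) to its genome id (sub-directory name)\nbgc2genome = loader.get_bgc_genome_mapping()\n\n# Parsed BGC objects for all gbk files found in the directory\nbgcs = loader.get_bgcs()\nprint(len(bgcs))\n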
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus","title":"GenomeStatus","text":"GenomeStatus(\n original_id: str,\n resolved_refseq_id: str = \"\",\n resolve_attempted: bool = False,\n bgc_path: str = \"\",\n)\n
Class to represent the status of a single genome.
The status of genomes is tracked in the file GENOME_STATUS_FILENAME.
Parameters:
original_id
(str
) \u2013 The original ID of the genome.
resolved_refseq_id
(str
, default: ''
) \u2013 The resolved RefSeq ID of the genome. Defaults to \"\".
resolve_attempted
(bool
, default: False
) \u2013 A flag indicating whether an attempt to resolve the RefSeq ID has been made. Defaults to False.
bgc_path
(str
, default: ''
) \u2013 The path to the downloaded BGC file for the genome. Defaults to \"\".
src/nplinker/genomics/antismash/podp_antismash_downloader.py
def __init__(\n self,\n original_id: str,\n resolved_refseq_id: str = \"\",\n resolve_attempted: bool = False,\n bgc_path: str = \"\",\n):\n \"\"\"Initialize a GenomeStatus object for the given genome.\n\n Args:\n original_id: The original ID of the genome.\n resolved_refseq_id: The resolved RefSeq ID of the\n genome. Defaults to \"\".\n resolve_attempted: A flag indicating whether an\n attempt to resolve the RefSeq ID has been made. Defaults to False.\n bgc_path: The path to the downloaded BGC file for\n the genome. Defaults to \"\".\n \"\"\"\n self.original_id = original_id\n self.resolved_refseq_id = \"\" if resolved_refseq_id == \"None\" else resolved_refseq_id\n self.resolve_attempted = resolve_attempted\n self.bgc_path = bgc_path\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.original_id","title":"original_id instance-attribute
","text":"original_id = original_id\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.resolved_refseq_id","title":"resolved_refseq_id instance-attribute
","text":"resolved_refseq_id = (\n \"\"\n if resolved_refseq_id == \"None\"\n else resolved_refseq_id\n)\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.resolve_attempted","title":"resolve_attempted instance-attribute
","text":"resolve_attempted = resolve_attempted\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.bgc_path","title":"bgc_path instance-attribute
","text":"bgc_path = bgc_path\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.read_json","title":"read_json staticmethod
","text":"read_json(\n file: str | PathLike,\n) -> dict[str, \"GenomeStatus\"]\n
Get a dict of GenomeStatus objects by loading given genome status file.
Note that an empty dict is returned if the given file doesn't exist.
Parameters:
file
(str | PathLike
) \u2013 Path to genome status file.
Returns:
dict[str, 'GenomeStatus']
\u2013 Dict keys are genome original id and values are GenomeStatus objects. An empty dict is returned if the given file doesn't exist.
src/nplinker/genomics/antismash/podp_antismash_downloader.py
@staticmethod\ndef read_json(file: str | PathLike) -> dict[str, \"GenomeStatus\"]:\n \"\"\"Get a dict of GenomeStatus objects by loading given genome status file.\n\n Note that an empty dict is returned if the given file doesn't exist.\n\n Args:\n file: Path to genome status file.\n\n Returns:\n Dict keys are genome original id and values are GenomeStatus\n objects. An empty dict is returned if the given file doesn't exist.\n \"\"\"\n genome_status_dict = {}\n if Path(file).exists():\n with open(file, \"r\") as f:\n data = json.load(f)\n\n # validate json data before using it\n validate(data, schema=GENOME_STATUS_SCHEMA)\n\n genome_status_dict = {\n gs[\"original_id\"]: GenomeStatus(**gs) for gs in data[\"genome_status\"]\n }\n return genome_status_dict\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.GenomeStatus.to_json","title":"to_json staticmethod
","text":"to_json(\n genome_status_dict: Mapping[str, \"GenomeStatus\"],\n file: str | PathLike | None = None,\n) -> str | None\n
Convert the genome status dictionary to a JSON string.
If a file path is provided, the JSON string is written to the file. If the file already exists, it is overwritten.
Parameters:
genome_status_dict
(Mapping[str, 'GenomeStatus']
) \u2013 A dictionary of genome status objects. The keys are the original genome IDs and the values are GenomeStatus objects.
file
(str | PathLike | None
, default: None
) \u2013 The path to the output JSON file. If None, the JSON string is returned but not written to a file.
Returns:
str | None
\u2013 The JSON string if file
is None, otherwise None.
src/nplinker/genomics/antismash/podp_antismash_downloader.py
@staticmethod\ndef to_json(\n genome_status_dict: Mapping[str, \"GenomeStatus\"], file: str | PathLike | None = None\n) -> str | None:\n \"\"\"Convert the genome status dictionary to a JSON string.\n\n If a file path is provided, the JSON string is written to the file. If\n the file already exists, it is overwritten.\n\n Args:\n genome_status_dict: A dictionary of genome\n status objects. The keys are the original genome IDs and the values\n are GenomeStatus objects.\n file: The path to the output JSON file.\n If None, the JSON string is returned but not written to a file.\n\n Returns:\n The JSON string if `file` is None, otherwise None.\n \"\"\"\n gs_list = [gs._to_dict() for gs in genome_status_dict.values()]\n json_data = {\"genome_status\": gs_list, \"version\": \"1.0\"}\n\n # validate json object before dumping\n validate(json_data, schema=GENOME_STATUS_SCHEMA)\n\n if file is not None:\n with open(file, \"w\") as f:\n json.dump(json_data, f)\n return None\n return json.dumps(json_data)\n
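A small round-trip sketch of these two helpers (the file name is a hypothetical example):
from nplinker.genomics.antismash import GenomeStatus\n\n# Track the status of one genome and serialize it\nstatus = {\"GCF_004339725.1\": GenomeStatus(\"GCF_004339725.1\", resolve_attempted=True)}\nprint(GenomeStatus.to_json(status))  # no file given, so the JSON string is returned\n\n# Write to a file and load it back (hypothetical file name)\nGenomeStatus.to_json(status, \"genome_status.json\")\nloaded = GenomeStatus.read_json(\"genome_status.json\")\nprint(loaded[\"GCF_004339725.1\"].resolve_attempted)\n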
"},{"location":"api/antismash/#nplinker.genomics.antismash.download_and_extract_antismash_data","title":"download_and_extract_antismash_data","text":"download_and_extract_antismash_data(\n antismash_id: str,\n download_root: str | PathLike,\n extract_root: str | PathLike,\n) -> None\n
Download and extract antiSMASH BGC archive for a specified genome.
The antiSMASH database (https://antismash-db.secondarymetabolites.org/) is used to download the BGC archive, and antiSMASH uses the RefSeq assembly id of a genome as the id of the archive.
Parameters:
antismash_id
(str
) \u2013 The id used to download BGC archive from antiSMASH database. If the id is versioned (e.g., \"GCF_004339725.1\") please be sure to specify the version as well.
download_root
(str | PathLike
) \u2013 Path to the directory to place downloaded archive in.
extract_root
(str | PathLike
) \u2013 Path to the directory data files will be extracted to. Note that an antismash
directory will be created in the specified extract_root
if it doesn't exist. The files will be extracted to <extract_root>/antismash/<antismash_id>
directory.
Raises:
ValueError
\u2013 if <extract_root>/antismash/<refseq_assembly_id>
dir is not empty.
Examples:
>>> download_and_extract_antismash_data(\"GCF_004339725.1\", \"/data/download\", \"/data/extracted\")\n
Source code in src/nplinker/genomics/antismash/antismash_downloader.py
def download_and_extract_antismash_data(\n antismash_id: str, download_root: str | PathLike, extract_root: str | PathLike\n) -> None:\n \"\"\"Download and extract antiSMASH BGC archive for a specified genome.\n\n The antiSMASH database (https://antismash-db.secondarymetabolites.org/)\n is used to download the BGC archive. And antiSMASH use RefSeq assembly id\n of a genome as the id of the archive.\n\n Args:\n antismash_id: The id used to download BGC archive from antiSMASH database.\n If the id is versioned (e.g., \"GCF_004339725.1\") please be sure to\n specify the version as well.\n download_root: Path to the directory to place downloaded archive in.\n extract_root: Path to the directory data files will be extracted to.\n Note that an `antismash` directory will be created in the specified `extract_root` if\n it doesn't exist. The files will be extracted to `<extract_root>/antismash/<antismash_id>` directory.\n\n Raises:\n ValueError: if `<extract_root>/antismash/<refseq_assembly_id>` dir is not empty.\n\n Examples:\n >>> download_and_extract_antismash_metadata(\"GCF_004339725.1\", \"/data/download\", \"/data/extracted\")\n \"\"\"\n download_root = Path(download_root)\n extract_root = Path(extract_root)\n extract_path = extract_root / \"antismash\" / antismash_id\n\n try:\n if extract_path.exists():\n _check_extract_path(extract_path)\n else:\n extract_path.mkdir(parents=True, exist_ok=True)\n\n for base_url in [ANTISMASH_DB_DOWNLOAD_URL, ANTISMASH_DBV2_DOWNLOAD_URL]:\n url = base_url.format(antismash_id, antismash_id + \".zip\")\n download_and_extract_archive(url, download_root, extract_path, antismash_id + \".zip\")\n break\n\n # delete subdirs\n for subdir_path in list_dirs(extract_path):\n shutil.rmtree(subdir_path)\n\n # delete unnecessary files\n files_to_keep = list_files(extract_path, suffix=(\".json\", \".gbk\"))\n for file in list_files(extract_path):\n if file not in files_to_keep:\n os.remove(file)\n\n logger.info(\"antiSMASH BGC data of %s is downloaded and extracted.\", antismash_id)\n\n except Exception as e:\n shutil.rmtree(extract_path)\n logger.warning(e)\n raise e\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.parse_bgc_genbank","title":"parse_bgc_genbank","text":"parse_bgc_genbank(file: str | PathLike) -> BGC\n
Parse a single BGC gbk file to BGC object.
Parameters:
file
(str | PathLike
) \u2013 Path to BGC gbk file
Returns:
BGC
\u2013 BGC object
Examples:
>>> bgc = parse_bgc_genbank(\n... \"/data/antismash/GCF_000016425.1/NC_009380.1.region001.gbk\")\n
Source code in src/nplinker/genomics/antismash/antismash_loader.py
def parse_bgc_genbank(file: str | PathLike) -> BGC:\n \"\"\"Parse a single BGC gbk file to BGC object.\n\n Args:\n file: Path to BGC gbk file\n\n Returns:\n BGC object\n\n Examples:\n >>> bgc = AntismashBGCLoader.parse_bgc(\n ... \"/data/antismash/GCF_000016425.1/NC_009380.1.region001.gbk\")\n \"\"\"\n file = Path(file)\n fname = file.stem\n\n record = SeqIO.read(file, format=\"genbank\")\n description = record.description # \"DEFINITION\" in gbk file\n antismash_id = record.id # \"VERSION\" in gbk file\n features = _parse_antismash_genbank(record)\n product_prediction = features.get(\"product\")\n if product_prediction is None:\n raise ValueError(f\"Not found product prediction in antiSMASH Genbank file {file}\")\n\n # init BGC\n bgc = BGC(fname, *product_prediction)\n bgc.description = description\n bgc.antismash_id = antismash_id\n bgc.antismash_file = str(file)\n bgc.antismash_region = features.get(\"region_number\")\n bgc.smiles = features.get(\"smiles\")\n bgc.strain = Strain(fname)\n return bgc\n
"},{"location":"api/antismash/#nplinker.genomics.antismash.get_best_available_genome_id","title":"get_best_available_genome_id","text":"get_best_available_genome_id(\n genome_id_data: Mapping[str, str]\n) -> str | None\n
Get the best available ID from genome_id_data dict.
Parameters:
genome_id_data
(Mapping[str, str]
) \u2013 dictionary containing information for each genome record present.
Returns:
str | None
\u2013 ID for the genome, if present, otherwise None.
src/nplinker/genomics/antismash/podp_antismash_downloader.py
def get_best_available_genome_id(genome_id_data: Mapping[str, str]) -> str | None:\n \"\"\"Get the best available ID from genome_id_data dict.\n\n Args:\n genome_id_data: dictionary containing information for each genome record present.\n\n Returns:\n ID for the genome, if present, otherwise None.\n \"\"\"\n if \"RefSeq_accession\" in genome_id_data:\n best_id = genome_id_data[\"RefSeq_accession\"]\n elif \"GenBank_accession\" in genome_id_data:\n best_id = genome_id_data[\"GenBank_accession\"]\n elif \"JGI_Genome_ID\" in genome_id_data:\n best_id = genome_id_data[\"JGI_Genome_ID\"]\n else:\n best_id = None\n\n if best_id is None or len(best_id) == 0:\n logger.warning(f\"Failed to get valid genome ID in genome data: {genome_id_data}\")\n return None\n return best_id\n
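For example (the genome record data is hypothetical):
from nplinker.genomics.antismash import get_best_available_genome_id\n\n# The RefSeq accession takes priority over GenBank and JGI ids\ngenome_id_data = {\n    \"RefSeq_accession\": \"GCF_004339725.1\",\n    \"GenBank_accession\": \"GCA_004339725.1\",\n}\nprint(get_best_available_genome_id(genome_id_data))  # GCF_004339725.1\n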
"},{"location":"api/antismash/#nplinker.genomics.antismash.podp_download_and_extract_antismash_data","title":"podp_download_and_extract_antismash_data","text":"podp_download_and_extract_antismash_data(\n genome_records: Sequence[\n Mapping[str, Mapping[str, str]]\n ],\n project_download_root: str | PathLike,\n project_extract_root: str | PathLike,\n)\n
Download and extract antiSMASH BGC archive for the given genome records.
Parameters:
genome_records
(Sequence[Mapping[str, Mapping[str, str]]]
) \u2013 list of dicts representing genome records.
The dict of each genome record contains a key of genome ID with a value of another dict containing information about genome type, label and accession ids (RefSeq, GenBank, and/or JGI).
project_download_root
(str | PathLike
) \u2013 Path to the directory to place downloaded archive in.
project_extract_root
(str | PathLike
) \u2013 Path to the directory downloaded archive will be extracted to.
Note that an antismash
directory will be created in the specified extract_root
if it doesn't exist. The files will be extracted to <extract_root>/antismash/<antismash_id>
directory.
Warns:
UserWarning
\u2013 when no antiSMASH data is found for some genomes.
src/nplinker/genomics/antismash/podp_antismash_downloader.py
def podp_download_and_extract_antismash_data(\n genome_records: Sequence[Mapping[str, Mapping[str, str]]],\n project_download_root: str | PathLike,\n project_extract_root: str | PathLike,\n):\n \"\"\"Download and extract antiSMASH BGC archive for the given genome records.\n\n Args:\n genome_records: list of dicts representing genome records.\n\n The dict of each genome record contains a key of genome ID with a value\n of another dict containing information about genome type, label and\n accession ids (RefSeq, GenBank, and/or JGI).\n project_download_root: Path to the directory to place\n downloaded archive in.\n project_extract_root: Path to the directory downloaded archive will be extracted to.\n\n Note that an `antismash` directory will be created in the specified\n `extract_root` if it doesn't exist. The files will be extracted to\n `<extract_root>/antismash/<antismash_id>` directory.\n\n Warnings:\n UserWarning: when no antiSMASH data is found for some genomes.\n \"\"\"\n if not Path(project_download_root).exists():\n # otherwise in case of failed first download, the folder doesn't exist and\n # genome_status_file can't be written\n Path(project_download_root).mkdir(parents=True, exist_ok=True)\n\n gs_file = Path(project_download_root, GENOME_STATUS_FILENAME)\n gs_dict = GenomeStatus.read_json(gs_file)\n\n for i, genome_record in enumerate(genome_records):\n # get the best available ID from the dict\n genome_id_data = genome_record[\"genome_ID\"]\n raw_genome_id = get_best_available_genome_id(genome_id_data)\n if raw_genome_id is None or len(raw_genome_id) == 0:\n logger.warning(f'Invalid input genome record \"{genome_record}\"')\n continue\n\n # check if genome ID exist in the genome status file\n if raw_genome_id not in gs_dict:\n gs_dict[raw_genome_id] = GenomeStatus(raw_genome_id)\n\n gs_obj = gs_dict[raw_genome_id]\n\n logger.info(\n f\"Checking for antismash data {i + 1}/{len(genome_records)}, \"\n f\"current genome ID={raw_genome_id}\"\n )\n # first, check if BGC data is downloaded\n if gs_obj.bgc_path and Path(gs_obj.bgc_path).exists():\n logger.info(f\"Genome ID {raw_genome_id} already downloaded to {gs_obj.bgc_path}\")\n continue\n # second, check if lookup attempted previously\n if gs_obj.resolve_attempted:\n logger.info(f\"Genome ID {raw_genome_id} skipped due to previous failed attempt\")\n continue\n\n # if not downloaded or lookup attempted, then try to resolve the ID\n # and download\n logger.info(f\"Start lookup process for genome ID {raw_genome_id}\")\n gs_obj.resolved_refseq_id = _resolve_refseq_id(genome_id_data)\n gs_obj.resolve_attempted = True\n\n if gs_obj.resolved_refseq_id == \"\":\n # give up on this one\n logger.warning(f\"Failed lookup for genome ID {raw_genome_id}\")\n continue\n\n # if resolved id is valid, try to download and extract antismash data\n try:\n download_and_extract_antismash_data(\n gs_obj.resolved_refseq_id, project_download_root, project_extract_root\n )\n\n gs_obj.bgc_path = str(\n Path(project_download_root, gs_obj.resolved_refseq_id + \".zip\").absolute()\n )\n\n output_path = Path(project_extract_root, \"antismash\", gs_obj.resolved_refseq_id)\n if output_path.exists():\n Path.touch(output_path / \"completed\", exist_ok=True)\n\n except Exception:\n gs_obj.bgc_path = \"\"\n\n # raise and log warning for failed downloads\n failed_ids = [gs.original_id for gs in gs_dict.values() if not gs.bgc_path]\n if failed_ids:\n warning_message = (\n f\"Failed to download antiSMASH data for the following genome IDs: {failed_ids}\"\n )\n 
logger.warning(warning_message)\n warnings.warn(warning_message, UserWarning)\n\n # save updated genome status to json file\n GenomeStatus.to_json(gs_dict, gs_file)\n\n if len(failed_ids) == len(genome_records):\n raise ValueError(\"No antiSMASH data found for any genome\")\n
"},{"location":"api/arranger/","title":"Dataset Arranger","text":""},{"location":"api/arranger/#nplinker.arranger","title":"nplinker.arranger","text":""},{"location":"api/arranger/#nplinker.arranger.PODP_PROJECT_URL","title":"PODP_PROJECT_URL module-attribute
","text":"PODP_PROJECT_URL = \"https://pairedomicsdata.bioinformatics.nl/api/projects/{}\"\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger","title":"DatasetArranger","text":"DatasetArranger(config: Dynaconf)\n
Arrange datasets based on the fixed working directory structure with the given configuration.
Concept and DiagramWorking Directory Structure
Dataset Arranging Pipeline
\"Arrange datasets\" means:
local
mode (config.mode
is local
), the datasets provided by users are validated.podp
mode (config.mode
is podp
), the datasets are automatically downloaded or generated, then validated.The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE data.
Attributes:
config
\u2013 A Dynaconf object that contains the configuration settings.
root_dir
\u2013 The root directory of the datasets.
downloads_dir
\u2013 The directory to store downloaded files.
mibig_dir
\u2013 The directory to store MIBiG metadata.
gnps_dir
\u2013 The directory to store GNPS data.
antismash_dir
\u2013 The directory to store antiSMASH data.
bigscape_dir
\u2013 The directory to store BiG-SCAPE data.
bigscape_running_output_dir
\u2013 The directory to store the running output of BiG-SCAPE.
Parameters:
config
(Dynaconf
) \u2013 A Dynaconf object that contains the configuration settings.
Examples:
>>> from nplinker.config import load_config\n>>> from nplinker.arranger import DatasetArranger\n>>> config = load_config(\"nplinker.toml\")\n>>> arranger = DatasetArranger(config)\n>>> arranger.arrange()\n
See Also DatasetLoader: Load all data from files to memory.
Source code insrc/nplinker/arranger.py
def __init__(self, config: Dynaconf) -> None:\n \"\"\"Initialize the DatasetArranger.\n\n Args:\n config: A Dynaconf object that contains the configuration settings.\n\n\n Examples:\n >>> from nplinker.config import load_config\n >>> from nplinker.arranger import DatasetArranger\n >>> config = load_config(\"nplinker.toml\")\n >>> arranger = DatasetArranger(config)\n >>> arranger.arrange()\n\n See Also:\n [DatasetLoader][nplinker.loader.DatasetLoader]: Load all data from files to memory.\n \"\"\"\n self.config = config\n self.root_dir = config.root_dir\n self.downloads_dir = self.root_dir / defaults.DOWNLOADS_DIRNAME\n self.downloads_dir.mkdir(exist_ok=True)\n\n self.mibig_dir = self.root_dir / defaults.MIBIG_DIRNAME\n self.gnps_dir = self.root_dir / defaults.GNPS_DIRNAME\n self.antismash_dir = self.root_dir / defaults.ANTISMASH_DIRNAME\n self.bigscape_dir = self.root_dir / defaults.BIGSCAPE_DIRNAME\n self.bigscape_running_output_dir = (\n self.bigscape_dir / defaults.BIGSCAPE_RUNNING_OUTPUT_DIRNAME\n )\n\n self.arrange_podp_project_json()\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.config","title":"config instance-attribute
","text":"config = config\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.root_dir","title":"root_dir instance-attribute
","text":"root_dir = root_dir\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.downloads_dir","title":"downloads_dir instance-attribute
","text":"downloads_dir = root_dir / DOWNLOADS_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.mibig_dir","title":"mibig_dir instance-attribute
","text":"mibig_dir = root_dir / MIBIG_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.gnps_dir","title":"gnps_dir instance-attribute
","text":"gnps_dir = root_dir / GNPS_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.antismash_dir","title":"antismash_dir instance-attribute
","text":"antismash_dir = root_dir / ANTISMASH_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.bigscape_dir","title":"bigscape_dir instance-attribute
","text":"bigscape_dir = root_dir / BIGSCAPE_DIRNAME\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.bigscape_running_output_dir","title":"bigscape_running_output_dir instance-attribute
","text":"bigscape_running_output_dir = (\n bigscape_dir / BIGSCAPE_RUNNING_OUTPUT_DIRNAME\n)\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange","title":"arrange","text":"arrange() -> None\n
Arrange all datasets according to the configuration.
The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE.
Source code insrc/nplinker/arranger.py
def arrange(self) -> None:\n \"\"\"Arrange all datasets according to the configuration.\n\n The datasets include MIBiG, GNPS, antiSMASH, and BiG-SCAPE.\n \"\"\"\n # The order of arranging the datasets matters, as some datasets depend on others\n self.arrange_mibig()\n self.arrange_gnps()\n self.arrange_antismash()\n self.arrange_bigscape()\n self.arrange_strain_mappings()\n self.arrange_strains_selected()\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_podp_project_json","title":"arrange_podp_project_json","text":"arrange_podp_project_json() -> None\n
Arrange the PODP project JSON file.
This method only works for the podp
mode. If the JSON file does not exist, download it first; then the downloaded or existing JSON file will be validated according to the PODP_ADAPTED_SCHEMA.
src/nplinker/arranger.py
def arrange_podp_project_json(self) -> None:\n \"\"\"Arrange the PODP project JSON file.\n\n This method only works for the `podp` mode. If the JSON file does not exist, download it\n first; then the downloaded or existing JSON file will be validated according to the\n [PODP_ADAPTED_SCHEMA][nplinker.schemas.PODP_ADAPTED_SCHEMA].\n \"\"\"\n if self.config.mode == \"podp\":\n file_name = f\"paired_datarecord_{self.config.podp_id}.json\"\n podp_file = self.downloads_dir / file_name\n if not podp_file.exists():\n download_url(\n PODP_PROJECT_URL.format(self.config.podp_id),\n self.downloads_dir,\n file_name,\n )\n\n with open(podp_file, \"r\") as f:\n json_data = json.load(f)\n validate_podp_json(json_data)\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_mibig","title":"arrange_mibig","text":"arrange_mibig() -> None\n
Arrange the MIBiG metadata.
If config.mibig.to_use
is True
, download and extract the MIBiG metadata and override the existing MIBiG metadata if it exists. This ensures that the MIBiG metadata is always up-to-date to the specified version in the configuration.
src/nplinker/arranger.py
def arrange_mibig(self) -> None:\n \"\"\"Arrange the MIBiG metadata.\n\n If `config.mibig.to_use` is `True`, download and extract the MIBiG metadata and override\n the existing MIBiG metadata if it exists. This ensures that the MIBiG metadata is always\n up-to-date to the specified version in the configuration.\n \"\"\"\n if self.config.mibig.to_use:\n if self.mibig_dir.exists():\n # remove existing mibig data\n shutil.rmtree(self.mibig_dir)\n download_and_extract_mibig_metadata(\n self.downloads_dir,\n self.mibig_dir,\n version=self.config.mibig.version,\n )\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_gnps","title":"arrange_gnps","text":"arrange_gnps() -> None\n
Arrange the GNPS data.
For local
mode, validate the GNPS data directory.
For podp
mode, if the GNPS data does not exist, download it; if it exists but not valid, remove the data and re-downloads it.
The validation process includes:
file_mappings.tsv
or file_mappings.csv
spectra.mgf
molecular_families.tsv
annotations.tsv
src/nplinker/arranger.py
def arrange_gnps(self) -> None:\n \"\"\"Arrange the GNPS data.\n\n For `local` mode, validate the GNPS data directory.\n\n For `podp` mode, if the GNPS data does not exist, download it; if it exists but not valid,\n remove the data and re-downloads it.\n\n The validation process includes:\n\n - Check if the GNPS data directory exists.\n - Check if the required files exist in the GNPS data directory, including:\n - `file_mappings.tsv` or `file_mappings.csv`\n - `spectra.mgf`\n - `molecular_families.tsv`\n - `annotations.tsv`\n \"\"\"\n pass_validation = False\n if self.config.mode == \"podp\":\n # retry downloading at most 3 times if downloaded data has problems\n for _ in range(3):\n try:\n validate_gnps(self.gnps_dir)\n pass_validation = True\n break\n except (FileNotFoundError, ValueError):\n # Don't need to remove downloaded archive, as it'll be overwritten\n shutil.rmtree(self.gnps_dir, ignore_errors=True)\n self._download_and_extract_gnps()\n\n if not pass_validation:\n validate_gnps(self.gnps_dir)\n\n # get the path to file_mappings file (csv or tsv)\n self.gnps_file_mappings_file = self._get_gnps_file_mappings_file()\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_antismash","title":"arrange_antismash","text":"arrange_antismash() -> None\n
Arrange the antiSMASH data.
For local
mode, validate the antiSMASH data.
For podp
mode, if the antiSMASH data does not exist, download it; if it exists but is not valid, remove the data and re-download it.
The validation process includes:
.region???.gbk
where ???
is a number).AntiSMASH BGC directory must follow the structure below:
antismash\n \u251c\u2500\u2500 genome_id_1 (one AntiSMASH output, e.g. GCF_000514775.1)\n \u2502\u00a0 \u251c\u2500\u2500 GCF_000514775.1.gbk\n \u2502\u00a0 \u251c\u2500\u2500 NZ_AZWO01000004.region001.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 genome_id_2\n \u2502\u00a0 \u251c\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n
Source code in src/nplinker/arranger.py
def arrange_antismash(self) -> None:\n \"\"\"Arrange the antiSMASH data.\n\n For `local` mode, validate the antiSMASH data.\n\n For `podp` mode, if the antiSMASH data does not exist, download it; if it exists but not\n valid, remove the data and re-download it.\n\n The validation process includes:\n\n - Check if the antiSMASH data directory exists.\n - Check if the antiSMASH data directory contains at least one sub-directory, and each\n sub-directory contains at least one BGC file (with the suffix `.region???.gbk` where\n `???` is a number).\n\n AntiSMASH BGC directory must follow the structure below:\n ```\n antismash\n \u251c\u2500\u2500 genome_id_1 (one AntiSMASH output, e.g. GCF_000514775.1)\n \u2502\u00a0 \u251c\u2500\u2500 GCF_000514775.1.gbk\n \u2502\u00a0 \u251c\u2500\u2500 NZ_AZWO01000004.region001.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 genome_id_2\n \u2502\u00a0 \u251c\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n ```\n \"\"\"\n pass_validation = False\n if self.config.mode == \"podp\":\n for _ in range(3):\n try:\n validate_antismash(self.antismash_dir)\n pass_validation = True\n break\n except FileNotFoundError:\n shutil.rmtree(self.antismash_dir, ignore_errors=True)\n self._download_and_extract_antismash()\n\n if not pass_validation:\n validate_antismash(self.antismash_dir)\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_bigscape","title":"arrange_bigscape","text":"arrange_bigscape() -> None\n
Arrange the BiG-SCAPE data.
For local
mode, validate the BiG-SCAPE data.
For podp
mode, if the BiG-SCAPE data does not exist, run BiG-SCAPE to generate the clustering file; if it exists but is not valid, remove the data and re-run BiG-SCAPE to generate the data.
The running output of BiG-SCAPE will be saved to the directory bigscape_running_output
in the default BiG-SCAPE directory, and the clustering file mix_clustering_c{self.config.bigscape.cutoff}.tsv
will be copied to the default BiG-SCAPE directory.
The validation process includes:
mix_clustering_c{self.config.bigscape.cutoff}.tsv
exists in the BiG-SCAPE data directory.data_sqlite.db
file exists in the BiG-SCAPE data directory.src/nplinker/arranger.py
def arrange_bigscape(self) -> None:\n \"\"\"Arrange the BiG-SCAPE data.\n\n For `local` mode, validate the BiG-SCAPE data.\n\n For `podp` mode, if the BiG-SCAPE data does not exist, run BiG-SCAPE to generate the\n clustering file; if it exists but not valid, remove the data and re-run BiG-SCAPE to generate\n the data.\n\n The running output of BiG-SCAPE will be saved to the directory `bigscape_running_output`\n in the default BiG-SCAPE directory, and the clustering file\n `mix_clustering_c{self.config.bigscape.cutoff}.tsv` will be copied to the default BiG-SCAPE\n directory.\n\n The validation process includes:\n\n - Check if the default BiG-SCAPE data directory exists.\n - Check if the clustering file `mix_clustering_c{self.config.bigscape.cutoff}.tsv` exists in the\n BiG-SCAPE data directory.\n - Check if the `data_sqlite.db` file exists in the BiG-SCAPE data directory.\n \"\"\"\n pass_validation = False\n if self.config.mode == \"podp\":\n for _ in range(3):\n try:\n validate_bigscape(self.bigscape_dir, self.config.bigscape.cutoff)\n pass_validation = True\n break\n except FileNotFoundError:\n shutil.rmtree(self.bigscape_dir, ignore_errors=True)\n self._run_bigscape()\n\n if not pass_validation:\n validate_bigscape(self.bigscape_dir, self.config.bigscape.cutoff)\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_strain_mappings","title":"arrange_strain_mappings","text":"arrange_strain_mappings() -> None\n
Arrange the strain mappings file.
For local
mode, validate the strain mappings file.
For podp
mode, always generate the new strain mappings file and validate it.
The validation checks if the strain mappings file exists and if it is a valid JSON file according to STRAIN_MAPPINGS_SCHEMA.
Source code insrc/nplinker/arranger.py
def arrange_strain_mappings(self) -> None:\n \"\"\"Arrange the strain mappings file.\n\n For `local` mode, validate the strain mappings file.\n\n For `podp` mode, always generate the new strain mappings file and validate it.\n\n The validation checks if the strain mappings file exists and if it is a valid JSON file\n according to [STRAIN_MAPPINGS_SCHEMA][nplinker.schemas.STRAIN_MAPPINGS_SCHEMA].\n \"\"\"\n if self.config.mode == \"podp\":\n self._generate_strain_mappings()\n\n self._validate_strain_mappings()\n
"},{"location":"api/arranger/#nplinker.arranger.DatasetArranger.arrange_strains_selected","title":"arrange_strains_selected","text":"arrange_strains_selected() -> None\n
Arrange the strains selected file.
If the file exists, validate it according to the schema defined in user_strains.json
.
src/nplinker/arranger.py
def arrange_strains_selected(self) -> None:\n \"\"\"Arrange the strains selected file.\n\n If the file exists, validate it according to the schema defined in `user_strains.json`.\n \"\"\"\n strains_selected_file = self.root_dir / defaults.STRAINS_SELECTED_FILENAME\n if strains_selected_file.exists():\n with open(strains_selected_file, \"r\") as f:\n json_data = json.load(f)\n validate(instance=json_data, schema=USER_STRAINS_SCHEMA)\n
"},{"location":"api/arranger/#nplinker.arranger.validate_gnps","title":"validate_gnps","text":"validate_gnps(gnps_dir: str | PathLike) -> None\n
Validate the GNPS data directory and its contents.
The GNPS data directory must contain the following files:
file_mappings.tsv
or file_mappings.csv
spectra.mgf
molecular_families.tsv
annotations.tsv
Parameters:
gnps_dir
(str | PathLike
) \u2013 Path to the GNPS data directory.
Raises:
FileNotFoundError
\u2013 If the GNPS data directory is not found or any of the required files is not found.
ValueError
\u2013 If both file_mappings.tsv and file_mappings.csv are found.
src/nplinker/arranger.py
def validate_gnps(gnps_dir: str | PathLike) -> None:\n    \"\"\"Validate the GNPS data directory and its contents.\n\n    The GNPS data directory must contain the following files:\n\n    - `file_mappings.tsv` or `file_mappings.csv`\n    - `spectra.mgf`\n    - `molecular_families.tsv`\n    - `annotations.tsv`\n\n    Args:\n        gnps_dir: Path to the GNPS data directory.\n\n    Raises:\n        FileNotFoundError: If the GNPS data directory is not found or any of the required files\n            is not found.\n        ValueError: If both file_mappings.tsv and file_mappings.csv are found.\n    \"\"\"\n    gnps_dir = Path(gnps_dir)\n    if not gnps_dir.exists():\n        raise FileNotFoundError(f\"GNPS data directory not found at {gnps_dir}\")\n\n    file_mappings_tsv = gnps_dir / defaults.GNPS_FILE_MAPPINGS_TSV\n    file_mappings_csv = gnps_dir / defaults.GNPS_FILE_MAPPINGS_CSV\n    if file_mappings_tsv.exists() and file_mappings_csv.exists():\n        raise ValueError(\n            f\"Both {file_mappings_tsv.name} and {file_mappings_csv.name} found in GNPS directory \"\n            f\"{gnps_dir}, only one is allowed.\"\n        )\n    elif not file_mappings_tsv.exists() and not file_mappings_csv.exists():\n        raise FileNotFoundError(\n            f\"Neither {file_mappings_tsv.name} nor {file_mappings_csv.name} found in GNPS directory\"\n            f\" {gnps_dir}\"\n        )\n\n    required_files = [\n        gnps_dir / defaults.GNPS_SPECTRA_FILENAME,\n        gnps_dir / defaults.GNPS_MOLECULAR_FAMILY_FILENAME,\n        gnps_dir / defaults.GNPS_ANNOTATIONS_FILENAME,\n    ]\n    list_not_found = [f.name for f in required_files if not f.exists()]\n    if list_not_found:\n        raise FileNotFoundError(\n            f\"Files not found in GNPS directory {gnps_dir}: {', '.join(list_not_found)}\"\n        )\n
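A short sketch of calling this validator directly on a manually prepared directory (the path is a hypothetical example):
from nplinker.arranger import validate_gnps\n\n# Check a manually prepared GNPS directory before running NPLinker\ntry:\n    validate_gnps(\"nplinker_quickstart/gnps\")\nexcept (FileNotFoundError, ValueError) as e:\n    print(f\"GNPS data is not valid: {e}\")\n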
"},{"location":"api/arranger/#nplinker.arranger.validate_antismash","title":"validate_antismash","text":"validate_antismash(antismash_dir: str | PathLike) -> None\n
Validate the antiSMASH data directory and its contents.
The validation only checks the structure of the antiSMASH data directory and file names. It does not check
podp
modeThe antiSMASH data directory must exist and contain at least one sub-directory. The name of the sub-directories must not contain any space. Each sub-directory must contain at least one BGC file (with the suffix .region???.gbk
where ???
is the region number).
Parameters:
antismash_dir
(str | PathLike
) \u2013 Path to the antiSMASH data directory.
Raises:
FileNotFoundError
\u2013 If the antiSMASH data directory is not found, or no sub-directories are found in the antiSMASH data directory, or no BGC files are found in any sub-directory.
ValueError
\u2013 If any sub-directory name contains a space.
src/nplinker/arranger.py
def validate_antismash(antismash_dir: str | PathLike) -> None:\n    \"\"\"Validate the antiSMASH data directory and its contents.\n\n    The validation only checks the structure of the antiSMASH data directory and file names.\n    It does not check\n\n    - the content of the BGC files\n    - the consistency between the antiSMASH data and the PODP project JSON file for the `podp` mode\n\n    The antiSMASH data directory must exist and contain at least one sub-directory. The name of the\n    sub-directories must not contain any space. Each sub-directory must contain at least one BGC\n    file (with the suffix `.region???.gbk` where `???` is the region number).\n\n    Args:\n        antismash_dir: Path to the antiSMASH data directory.\n\n    Raises:\n        FileNotFoundError: If the antiSMASH data directory is not found, or no sub-directories\n            are found in the antiSMASH data directory, or no BGC files are found in any\n            sub-directory.\n        ValueError: If any sub-directory name contains a space.\n    \"\"\"\n    antismash_dir = Path(antismash_dir)\n    if not antismash_dir.exists():\n        raise FileNotFoundError(f\"antiSMASH data directory not found at {antismash_dir}\")\n\n    sub_dirs = list_dirs(antismash_dir)\n    if not sub_dirs:\n        raise FileNotFoundError(\n            f\"No BGC directories found in antiSMASH data directory {antismash_dir}\"\n        )\n\n    for sub_dir in sub_dirs:\n        dir_name = Path(sub_dir).name\n        if \" \" in dir_name:\n            raise ValueError(\n                f\"antiSMASH sub-directory name {dir_name} contains space, which is not allowed\"\n            )\n\n        gbk_files = list_files(sub_dir, suffix=\".gbk\", keep_parent=False)\n        bgc_files = fnmatch.filter(gbk_files, \"*.region???.gbk\")\n        if not bgc_files:\n            raise FileNotFoundError(f\"No BGC files found in antiSMASH sub-directory {sub_dir}\")\n
"},{"location":"api/arranger/#nplinker.arranger.validate_bigscape","title":"validate_bigscape","text":"validate_bigscape(\n bigscape_dir: str | PathLike, cutoff: str\n) -> None\n
Validate the BiG-SCAPE data directory and its contents.
The BiG-SCAPE data directory must exist and contain the clustering file mix_clustering_c{self.config.bigscape.cutoff}.tsv
where {self.config.bigscape.cutoff}
is the bigscape cutoff value set in the config file.
Alternatively, the directory can contain the BiG-SCAPE database file generated by BiG-SCAPE v2. At the moment, all the family assignments in the database will be used, so this database should contain results from a single run with the desired cutoff.
Parameters:
bigscape_dir
(str | PathLike
) \u2013 Path to the BiG-SCAPE data directory.
cutoff
(str
) \u2013 The BiG-SCAPE cutoff value.
Raises:
FileNotFoundError
\u2013 If the BiG-SCAPE data directory or the clustering file is not found.
src/nplinker/arranger.py
def validate_bigscape(bigscape_dir: str | PathLike, cutoff: str) -> None:\n \"\"\"Validate the BiG-SCAPE data directory and its contents.\n\n The BiG-SCAPE data directory must exist and contain the clustering file\n `mix_clustering_c{self.config.bigscape.cutoff}.tsv` where `{self.config.bigscape.cutoff}` is the\n bigscape cutoff value set in the config file.\n\n Alternatively, the directory can contain the BiG-SCAPE database file generated by BiG-SCAPE v2.\n At the moment, all the family assignments in the database will be used, so this database should\n contain results from a single run with the desired cutoff.\n\n Args:\n bigscape_dir: Path to the BiG-SCAPE data directory.\n cutoff: The BiG-SCAPE cutoff value.\n\n Raises:\n FileNotFoundError: If the BiG-SCAPE data directory or the clustering file is not found.\n \"\"\"\n bigscape_dir = Path(bigscape_dir)\n if not bigscape_dir.exists():\n raise FileNotFoundError(f\"BiG-SCAPE data directory not found at {bigscape_dir}\")\n\n clustering_file = bigscape_dir / f\"mix_clustering_c{cutoff}.tsv\"\n database_file = bigscape_dir / \"data_sqlite.db\"\n if not clustering_file.exists() and not database_file.exists():\n raise FileNotFoundError(f\"BiG-SCAPE data not found in {clustering_file} or {database_file}\")\n
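Similarly, a sketch of validating a BiG-SCAPE directory directly (the path and cutoff are hypothetical examples):
from nplinker.arranger import validate_bigscape\n\n# The cutoff must match the one used in the BiG-SCAPE run, e.g. `0.30`\ntry:\n    validate_bigscape(\"nplinker_quickstart/bigscape\", \"0.30\")\nexcept FileNotFoundError as e:\n    print(f\"BiG-SCAPE data is not valid: {e}\")\n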
"},{"location":"api/bigscape/","title":"BigScape","text":""},{"location":"api/bigscape/#nplinker.genomics.bigscape","title":"nplinker.genomics.bigscape","text":""},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeGCFLoader","title":"BigscapeGCFLoader","text":"BigscapeGCFLoader(cluster_file: str | PathLike)\n
Bases: GCFLoaderBase
Data loader for BiG-SCAPE GCF cluster file.
Attributes:
cluster_file
(str
) \u2013 path to the BiG-SCAPE cluster file.
Parameters:
cluster_file
(str | PathLike
) \u2013 Path to the BiG-SCAPE cluster file, the filename has a pattern of <class>_clustering_c0.xx.tsv
.
src/nplinker/genomics/bigscape/bigscape_loader.py
def __init__(self, cluster_file: str | PathLike, /) -> None:\n \"\"\"Initialize the BiG-SCAPE GCF loader.\n\n Args:\n cluster_file: Path to the BiG-SCAPE cluster file,\n the filename has a pattern of `<class>_clustering_c0.xx.tsv`.\n \"\"\"\n self.cluster_file: str = str(cluster_file)\n self._gcf_list = self._parse_gcf(self.cluster_file)\n
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeGCFLoader.cluster_file","title":"cluster_file instance-attribute
","text":"cluster_file: str = str(cluster_file)\n
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeGCFLoader.get_gcfs","title":"get_gcfs","text":"get_gcfs(\n keep_mibig_only: bool = False,\n keep_singleton: bool = False,\n) -> list[GCF]\n
Get all GCF objects.

Parameters:
    keep_mibig_only (bool, default: False) – True to keep GCFs that contain only MIBiG BGCs.
    keep_singleton (bool, default: False) – True to keep singleton GCFs. A singleton GCF is a GCF that contains only one BGC.

Returns:
    list[GCF] – A list of GCF objects.

Source code in src/nplinker/genomics/bigscape/bigscape_loader.py

def get_gcfs(self, keep_mibig_only: bool = False, keep_singleton: bool = False) -> list[GCF]:
    """Get all GCF objects.

    Args:
        keep_mibig_only: True to keep GCFs that contain only MIBiG BGCs.
        keep_singleton: True to keep singleton GCFs. A singleton GCF
            is a GCF that contains only one BGC.

    Returns:
        A list of GCF objects.
    """
    gcf_list = self._gcf_list
    if not keep_mibig_only:
        gcf_list = [gcf for gcf in gcf_list if not gcf.has_mibig_only()]
    if not keep_singleton:
        gcf_list = [gcf for gcf in gcf_list if not gcf.is_singleton()]
    return gcf_list
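A minimal usage sketch; the cluster file path is illustrative and follows the <class>_clustering_c0.xx.tsv pattern described above. By default both MIBiG-only and singleton GCFs are filtered out:

>>> from nplinker.genomics.bigscape import BigscapeGCFLoader
>>> loader = BigscapeGCFLoader("bigscape/mix/mix_clustering_c0.30.tsv")
>>> gcfs = loader.get_gcfs()  # MIBiG-only and singleton GCFs dropped by default
>>> all(not gcf.has_mibig_only() and not gcf.is_singleton() for gcf in gcfs)
True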
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeV2GCFLoader","title":"BigscapeV2GCFLoader","text":"BigscapeV2GCFLoader(db_file: str | PathLike)\n
Bases: GCFLoaderBase
Data loader for BiG-SCAPE v2 database file.
Attributes:
db_file
\u2013 Path to the BiG-SCAPE database file.
Parameters:
db_file
(str | PathLike
) \u2013 Path to the BiG-SCAPE v2 database file
src/nplinker/genomics/bigscape/bigscape_loader.py
def __init__(self, db_file: str | PathLike, /) -> None:\n \"\"\"Initialize the BiG-SCAPE v2 GCF loader.\n\n Args:\n db_file: Path to the BiG-SCAPE v2 database file\n \"\"\"\n self.db_file = str(db_file)\n self._gcf_list = self._parse_gcf(self.db_file)\n
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeV2GCFLoader.db_file","title":"db_file instance-attribute
","text":"db_file = str(db_file)\n
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.BigscapeV2GCFLoader.get_gcfs","title":"get_gcfs","text":"get_gcfs(\n keep_mibig_only: bool = False,\n keep_singleton: bool = False,\n) -> list[GCF]\n
Get all GCF objects.

Parameters:
    keep_mibig_only (bool, default: False) – True to keep GCFs that contain only MIBiG BGCs.
    keep_singleton (bool, default: False) – True to keep singleton GCFs. A singleton GCF is a GCF that contains only one BGC.

Returns:
    list[GCF] – A list of GCF objects.

Source code in src/nplinker/genomics/bigscape/bigscape_loader.py

def get_gcfs(self, keep_mibig_only: bool = False, keep_singleton: bool = False) -> list[GCF]:
    """Get all GCF objects.

    Args:
        keep_mibig_only: True to keep GCFs that contain only MIBiG BGCs.
        keep_singleton: True to keep singleton GCFs.
            A singleton GCF is a GCF that contains only one BGC.

    Returns:
        A list of GCF objects.
    """
    gcf_list = self._gcf_list
    if not keep_mibig_only:
        gcf_list = [gcf for gcf in gcf_list if not gcf.has_mibig_only()]
    if not keep_singleton:
        gcf_list = [gcf for gcf in gcf_list if not gcf.is_singleton()]
    return gcf_list
"},{"location":"api/bigscape/#nplinker.genomics.bigscape.run_bigscape","title":"run_bigscape","text":"run_bigscape(\n antismash_path: str | PathLike,\n output_path: str | PathLike,\n extra_params: str,\n version: Literal[1, 2] = 1,\n) -> bool\n
Runs BiG-SCAPE to cluster BGCs.
The behavior of this function is slightly different depending on the version of BiG-SCAPE that is set to run using the configuration file. Mostly this means a different set of parameters is used between the two versions.
The AntiSMASH output directory should be a directory that contains GBK files. The directory can contain subdirectories, in which case BiG-SCAPE will search recursively for GBK files. E.g.:
example_folder\n \u251c\u2500\u2500 organism_1\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region001.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region002.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region003.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.final.gbk <- skipped!\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 organism_2\n \u2502\u00a0 \u251c\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n
By default, only GBK Files with \"cluster\" or \"region\" in the filename are accepted. GBK Files with \"final\" in the filename are excluded.
Parameters:
antismash_path
(str | PathLike
) \u2013 Path to the antismash output directory.
output_path
(str | PathLike
) \u2013 Path to the output directory where BiG-SCAPE will write its results.
extra_params
(str
) \u2013 Additional parameters to pass to BiG-SCAPE.
version
(Literal[1, 2]
, default: 1
) \u2013 The version of BiG-SCAPE to run. Must be 1 or 2.
Returns:
bool
\u2013 True if BiG-SCAPE ran successfully, False otherwise.
Raises:
ValueError
\u2013 If an unexpected BiG-SCAPE version number is specified.
FileNotFoundError
\u2013 If the antismash_path does not exist or if the BiG-SCAPE python script could not be found.
RuntimeError
\u2013 If BiG-SCAPE fails to run.
Examples:
>>> from nplinker.genomics.bigscape import run_bigscape\n>>> run_bigscape(antismash_path=\"./antismash\", output_path=\"./output\",\n... extra_params=\"--help\", version=1)\n
Source code in src/nplinker/genomics/bigscape/runbigscape.py
def run_bigscape(\n antismash_path: str | PathLike,\n output_path: str | PathLike,\n extra_params: str,\n version: Literal[1, 2] = 1,\n) -> bool:\n \"\"\"Runs BiG-SCAPE to cluster BGCs.\n\n The behavior of this function is slightly different depending on the version of\n BiG-SCAPE that is set to run using the configuration file.\n Mostly this means a different set of parameters is used between the two versions.\n\n The AntiSMASH output directory should be a directory that contains GBK files.\n The directory can contain subdirectories, in which case BiG-SCAPE will search\n recursively for GBK files. E.g.:\n\n ```\n example_folder\n \u251c\u2500\u2500 organism_1\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region001.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region002.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.region003.gbk\n \u2502\u00a0 \u251c\u2500\u2500 organism_1.final.gbk <- skipped!\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 organism_2\n \u2502\u00a0 \u251c\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n ```\n\n By default, only GBK Files with \"cluster\" or \"region\" in the filename are\n accepted. GBK Files with \"final\" in the filename are excluded.\n\n Args:\n antismash_path: Path to the antismash output directory.\n output_path: Path to the output directory where BiG-SCAPE will write its results.\n extra_params: Additional parameters to pass to BiG-SCAPE.\n version: The version of BiG-SCAPE to run. Must be 1 or 2.\n\n Returns:\n True if BiG-SCAPE ran successfully, False otherwise.\n\n Raises:\n ValueError: If an unexpected BiG-SCAPE version number is specified.\n FileNotFoundError: If the antismash_path does not exist or if the BiG-SCAPE python\n script could not be found.\n RuntimeError: If BiG-SCAPE fails to run.\n\n Examples:\n >>> from nplinker.genomics.bigscape import run_bigscape\n >>> run_bigscape(antismash_path=\"./antismash\", output_path=\"./output\",\n ... extra_params=\"--help\", version=1)\n \"\"\"\n # switch to correct version of BiG-SCAPE\n if version == 1:\n bigscape_py_path = \"bigscape.py\"\n elif version == 2:\n bigscape_py_path = \"bigscape-v2.py\"\n else:\n raise ValueError(\"Invalid BiG-SCAPE version number. Expected: 1 or 2.\")\n\n try:\n subprocess.run([bigscape_py_path, \"-h\"], capture_output=True, check=True)\n except Exception as e:\n raise FileNotFoundError(\n f\"Failed to find/run BiG-SCAPE executable program (path={bigscape_py_path}, err={e})\"\n ) from e\n\n if not os.path.exists(antismash_path):\n raise FileNotFoundError(f'antismash_path \"{antismash_path}\" does not exist!')\n\n logger.info(f\"Running BiG-SCAPE version {version}\")\n logger.info(\n f'run_bigscape: input=\"{antismash_path}\", output=\"{output_path}\", extra_params={extra_params}\"'\n )\n\n # assemble arguments. first argument is the python file\n args = [bigscape_py_path]\n\n # version 2 points to specific Pfam file, version 1 points to directory\n # version 2 also requires the cluster subcommand\n if version == 1:\n args.extend([\"--pfam_dir\", PFAM_PATH])\n elif version == 2:\n args.extend([\"cluster\", \"--pfam_path\", os.path.join(PFAM_PATH, \"Pfam-A.hmm\")])\n\n # add input and output paths. 
these are unchanged\n args.extend([\"-i\", str(antismash_path), \"-o\", str(output_path)])\n\n # append the user supplied params, if any\n if len(extra_params) > 0:\n args.extend(extra_params.split(\" \"))\n\n logger.info(f\"BiG-SCAPE command: {args}\")\n result = subprocess.run(args, stdout=sys.stdout, stderr=sys.stderr)\n\n # return true on any non-error return code\n if result.returncode == 0:\n logger.info(f\"BiG-SCAPE completed with return code {result.returncode}\")\n return True\n\n # otherwise log details and raise a runtime error\n logger.error(f\"BiG-SCAPE failed with return code {result.returncode}\")\n logger.error(f\"output: {str(result.stdout)}\")\n logger.error(f\"stderr: {str(result.stderr)}\")\n\n raise RuntimeError(f\"Failed to run BiG-SCAPE with error code {result.returncode}\")\n
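Beyond the --help smoke test in the example above, a typical invocation passes BiG-SCAPE's own CLI flags through extra_params. A sketch, assuming BiG-SCAPE v1 flags such as --mix and --cutoffs (these flags are not part of NPLinker; verify them against your installed BiG-SCAPE's --help output):

>>> from nplinker.genomics.bigscape import run_bigscape
>>> ok = run_bigscape(
...     antismash_path="./antismash",
...     output_path="./bigscape_output",
...     extra_params="--mix --cutoffs 0.30",  # assumed BiG-SCAPE v1 flags, not NPLinker options
...     version=1,
... )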
"},{"location":"api/genomics/","title":"Data Models","text":""},{"location":"api/genomics/#nplinker.genomics","title":"nplinker.genomics","text":""},{"location":"api/genomics/#nplinker.genomics.BGC","title":"BGC","text":"BGC(id: str, /, *product_prediction: str)\n
Class to model BGC (biosynthetic gene cluster) data.
BGC data include both annotations and sequence data. This class is mainly designed to model the annotations or metadata.
The raw BGC data is stored in GenBank format (.gbk). Additional GenBank features could be added to the GenBank file to annotate BGCs, e.g. antiSMASH has some self-defined features (like region
) in its output GenBank files.
The annotations of BGC can be stored in JSON format, which is defined and used by MIBiG.
Attributes:
id
\u2013 BGC identifier, e.g. MIBiG accession, GenBank accession.
product_prediction
\u2013 A tuple of (predicted) natural products or product classes of the BGC. For antiSMASH's GenBank data, the feature region /product
gives product information. For MIBiG metadata, its biosynthetic class provides such info.
mibig_bgc_class
(tuple[str] | None
) \u2013 A tuple of MIBiG biosynthetic classes to which the BGC belongs. Defaults to None, which means the class is unknown.
MIBiG defines 6 major biosynthetic classes for natural products, including NRP
, Polyketide
, RiPP
, Terpene
, Saccharide
and Alkaloid
. Note that natural products created by the other biosynthetic mechanisms fall under the category Other
. For more details see the paper.
description
(str | None
) \u2013 Brief description of the BGC. Defaults to None.
smiles
(tuple[str] | None
) \u2013 A tuple of SMILES formulas of the BGC's products. Defaults to None.
antismash_file
(str | None
) \u2013 The path to the antiSMASH GenBank file. Defaults to None.
antismash_id
(str | None
) \u2013 Identifier of the antiSMASH BGC, referring to the feature VERSION
of GenBank file. Defaults to None.
antismash_region
(int | None
) \u2013 AntiSMASH BGC region number, referring to the feature region
of GenBank file. Defaults to None.
parents
(set[GCF]
) \u2013 The set of GCFs that contain the BGC.
strain
(Strain | None
) \u2013 The strain of the BGC.
Parameters:
id
(str
) \u2013 BGC identifier, e.g. MIBiG accession, GenBank accession.
product_prediction
(str
, default: ()
) \u2013 BGC's (predicted) natural products or product classes.
Examples:
>>> bgc = BGC(\"Unique_BGC_ID\", \"Polyketide\", \"NRP\")\n>>> bgc.id\n'Unique_BGC_ID'\n>>> bgc.product_prediction\n('Polyketide', 'NRP')\n>>> bgc.is_mibig()\nFalse\n
Source code in src/nplinker/genomics/bgc.py
def __init__(self, id: str, /, *product_prediction: str):\n \"\"\"Initialize the BGC object.\n\n Args:\n id: BGC identifier, e.g. MIBiG accession, GenBank accession.\n product_prediction: BGC's (predicted) natural products or product classes.\n\n Examples:\n >>> bgc = BGC(\"Unique_BGC_ID\", \"Polyketide\", \"NRP\")\n >>> bgc.id\n 'Unique_BGC_ID'\n >>> bgc.product_prediction\n ('Polyketide', 'NRP')\n >>> bgc.is_mibig()\n False\n \"\"\"\n # BGC metadata\n self.id = id\n self.product_prediction = product_prediction\n\n self.mibig_bgc_class: tuple[str] | None = None\n self.description: str | None = None\n self.smiles: tuple[str] | None = None\n\n # antismash related attributes\n self.antismash_file: str | None = None\n self.antismash_id: str | None = None # version in .gbk, id in SeqRecord\n self.antismash_region: int | None = None # antismash region number\n\n # other attributes\n self.parents: set[GCF] = set()\n self._strain: Strain | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.id","title":"id instance-attribute
","text":"id = id\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.product_prediction","title":"product_prediction instance-attribute
","text":"product_prediction = product_prediction\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.mibig_bgc_class","title":"mibig_bgc_class instance-attribute
","text":"mibig_bgc_class: tuple[str] | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.description","title":"description instance-attribute
","text":"description: str | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.smiles","title":"smiles instance-attribute
","text":"smiles: tuple[str] | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.antismash_file","title":"antismash_file instance-attribute
","text":"antismash_file: str | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.antismash_id","title":"antismash_id instance-attribute
","text":"antismash_id: str | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.antismash_region","title":"antismash_region instance-attribute
","text":"antismash_region: int | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.parents","title":"parents instance-attribute
","text":"parents: set[GCF] = set()\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.strain","title":"strain property
writable
","text":"strain: Strain | None\n
Get the strain of the BGC.
"},{"location":"api/genomics/#nplinker.genomics.BGC.bigscape_classes","title":"bigscape_classesproperty
","text":"bigscape_classes: set[str | None]\n
Get BiG-SCAPE's BGC classes.
BiG-SCAPE's BGC classes are similar to those defined in MiBIG but have more categories (7 classes), including:
For BGC falls outside of these categories, the value is \"Others\".
Default is None, which means the class is unknown.
More details see: https://doi.org/10.1038%2Fs41589-019-0400-9.
"},{"location":"api/genomics/#nplinker.genomics.BGC.aa_predictions","title":"aa_predictionsproperty
","text":"aa_predictions: list\n
Amino acids as predicted monomers of product.
Returns:
list
\u2013 list of dicts with key as amino acid and value as prediction
list
\u2013 probability.
__repr__()\n
Source code in src/nplinker/genomics/bgc.py
def __repr__(self):\n return str(self)\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.__str__","title":"__str__","text":"__str__()\n
Source code in src/nplinker/genomics/bgc.py
def __str__(self):\n return \"{}(id={}, strain={}, asid={}, region={})\".format(\n self.__class__.__name__,\n self.id,\n self.strain,\n self.antismash_id,\n self.antismash_region,\n )\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/genomics/bgc.py
def __eq__(self, other) -> bool:\n if isinstance(other, BGC):\n return self.id == other.id and self.product_prediction == other.product_prediction\n return NotImplemented\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.__hash__","title":"__hash__","text":"__hash__() -> int\n
Source code in src/nplinker/genomics/bgc.py
def __hash__(self) -> int:\n return hash((self.id, self.product_prediction))\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.__reduce__","title":"__reduce__","text":"__reduce__() -> tuple\n
Reduce function for pickling.
Source code insrc/nplinker/genomics/bgc.py
def __reduce__(self) -> tuple:\n \"\"\"Reduce function for pickling.\"\"\"\n return (self.__class__, (self.id, *self.product_prediction), self.__dict__)\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.add_parent","title":"add_parent","text":"add_parent(gcf: GCF) -> None\n
Add a parent GCF to the BGC.
Parameters:
gcf
(GCF
) \u2013 gene cluster family
src/nplinker/genomics/bgc.py
def add_parent(self, gcf: GCF) -> None:\n \"\"\"Add a parent GCF to the BGC.\n\n Args:\n gcf: gene cluster family\n \"\"\"\n gcf.add_bgc(self)\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.detach_parent","title":"detach_parent","text":"detach_parent(gcf: GCF) -> None\n
Remove a parent GCF.
Source code insrc/nplinker/genomics/bgc.py
def detach_parent(self, gcf: GCF) -> None:\n \"\"\"Remove a parent GCF.\"\"\"\n gcf.detach_bgc(self)\n
"},{"location":"api/genomics/#nplinker.genomics.BGC.is_mibig","title":"is_mibig","text":"is_mibig() -> bool\n
Check if the BGC is a MIBiG reference BGC or not.
WarningThis method evaluates MIBiG BGC based on the pattern that MIBiG BGC names start with \"BGC\". It might give false positive result.
Returns:
bool
\u2013 True if it's MIBiG reference BGC
src/nplinker/genomics/bgc.py
def is_mibig(self) -> bool:\n \"\"\"Check if the BGC is a MIBiG reference BGC or not.\n\n Warning:\n This method evaluates MIBiG BGC based on the pattern that MIBiG\n BGC names start with \"BGC\". It might give false positive result.\n\n Returns:\n True if it's MIBiG reference BGC\n \"\"\"\n return self.id.startswith(\"BGC\")\n
"},{"location":"api/genomics/#nplinker.genomics.GCF","title":"GCF","text":"GCF(id: str)\n
Class to model gene cluster family (GCF).
GCF is a group of similar BGCs and generated by clustering BGCs with tools such as BiG-SCAPE and BiG-SLICE.
Attributes:
id
\u2013 id of the GCF object.
bgc_ids
(set[str]
) \u2013 a set of BGC ids that belongs to the GCF.
bigscape_class
(str | None
) \u2013 BiG-SCAPE's BGC class. BiG-SCAPE's BGC classes are similar to those defined in MiBIG but have more categories (7 classes), including:
For BGC falls outside of these categories, the value is \"Others\".
Default is None, which means the class is unknown.
More details see: https://doi.org/10.1038%2Fs41589-019-0400-9.
Parameters:
id
(str
) \u2013 id of the GCF object.
Examples:
>>> gcf = GCF(\"Unique_GCF_ID\")\n>>> gcf.id\n'Unique_GCF_ID'\n
Source code in src/nplinker/genomics/gcf.py
def __init__(self, id: str, /) -> None:\n \"\"\"Initialize the GCF object.\n\n Args:\n id: id of the GCF object.\n\n Examples:\n >>> gcf = GCF(\"Unique_GCF_ID\")\n >>> gcf.id\n 'Unique_GCF_ID'\n \"\"\"\n self.id = id\n self.bgc_ids: set[str] = set()\n self.bigscape_class: str | None = None\n self._bgcs: set[BGC] = set()\n self._strains: StrainCollection = StrainCollection()\n
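To see how the BGC and GCF models connect, here is a minimal sketch (the ids are illustrative; a BGC without a strain triggers a log warning, as shown in add_bgc below):

>>> from nplinker.genomics import BGC, GCF
>>> bgc = BGC("BGC0000001", "Polyketide")
>>> gcf = GCF("GCF_1")
>>> gcf.add_bgc(bgc)  # links both directions: GCF tracks the BGC, BGC tracks its parent GCF
>>> bgc.parents == {gcf}
True
>>> gcf.bgc_ids
{'BGC0000001'}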
"},{"location":"api/genomics/#nplinker.genomics.GCF.id","title":"id instance-attribute
","text":"id = id\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.bgc_ids","title":"bgc_ids instance-attribute
","text":"bgc_ids: set[str] = set()\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.bigscape_class","title":"bigscape_class instance-attribute
","text":"bigscape_class: str | None = None\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.bgcs","title":"bgcs property
","text":"bgcs: set[BGC]\n
Get the BGC objects.
"},{"location":"api/genomics/#nplinker.genomics.GCF.strains","title":"strainsproperty
","text":"strains: StrainCollection\n
Get the strains in the GCF.
"},{"location":"api/genomics/#nplinker.genomics.GCF.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/genomics/gcf.py
def __str__(self) -> str:\n return (\n f\"GCF(id={self.id}, #BGC_objects={len(self.bgcs)}, #bgc_ids={len(self.bgc_ids)},\"\n f\"#strains={len(self._strains)}).\"\n )\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/genomics/gcf.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/genomics/gcf.py
def __eq__(self, other) -> bool:\n if isinstance(other, GCF):\n return self.id == other.id and self.bgcs == other.bgcs\n return NotImplemented\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.__hash__","title":"__hash__","text":"__hash__() -> int\n
Hash function for GCF.
Note that GCF class is a mutable container. We only hash the GCF id to avoid the hash value changes when self._bgcs
is updated.
src/nplinker/genomics/gcf.py
def __hash__(self) -> int:\n \"\"\"Hash function for GCF.\n\n Note that GCF class is a mutable container. We only hash the GCF id to\n avoid the hash value changes when `self._bgcs` is updated.\n \"\"\"\n return hash(self.id)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.__reduce__","title":"__reduce__","text":"__reduce__() -> tuple\n
Reduce function for pickling.
Source code insrc/nplinker/genomics/gcf.py
def __reduce__(self) -> tuple:\n \"\"\"Reduce function for pickling.\"\"\"\n return (self.__class__, (self.id,), self.__dict__)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.add_bgc","title":"add_bgc","text":"add_bgc(bgc: BGC) -> None\n
Add a BGC object to the GCF.
Source code insrc/nplinker/genomics/gcf.py
def add_bgc(self, bgc: BGC) -> None:\n \"\"\"Add a BGC object to the GCF.\"\"\"\n bgc.parents.add(self)\n self._bgcs.add(bgc)\n self.bgc_ids.add(bgc.id)\n if bgc.strain is not None:\n self._strains.add(bgc.strain)\n else:\n logger.warning(\"No strain specified for the BGC %s\", bgc.id)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.detach_bgc","title":"detach_bgc","text":"detach_bgc(bgc: BGC) -> None\n
Remove a child BGC object.
Source code insrc/nplinker/genomics/gcf.py
def detach_bgc(self, bgc: BGC) -> None:\n \"\"\"Remove a child BGC object.\"\"\"\n bgc.parents.remove(self)\n self._bgcs.remove(bgc)\n self.bgc_ids.remove(bgc.id)\n if bgc.strain is not None:\n for other_bgc in self._bgcs:\n if other_bgc.strain == bgc.strain:\n return\n self._strains.remove(bgc.strain)\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.has_strain","title":"has_strain","text":"has_strain(strain: Strain) -> bool\n
Check if the given strain exists.
Parameters:
strain
(Strain
) \u2013 Strain
object.
Returns:
bool
\u2013 True when the given strain exist.
src/nplinker/genomics/gcf.py
def has_strain(self, strain: Strain) -> bool:\n \"\"\"Check if the given strain exists.\n\n Args:\n strain: `Strain` object.\n\n Returns:\n True when the given strain exist.\n \"\"\"\n return strain in self._strains\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.has_mibig_only","title":"has_mibig_only","text":"has_mibig_only() -> bool\n
Check if the GCF's children are only MIBiG BGCs.
Returns:
bool
\u2013 True if GCF.bgc_ids
are only MIBiG BGC ids.
src/nplinker/genomics/gcf.py
def has_mibig_only(self) -> bool:\n \"\"\"Check if the GCF's children are only MIBiG BGCs.\n\n Returns:\n True if `GCF.bgc_ids` are only MIBiG BGC ids.\n \"\"\"\n return all(map(lambda id: id.startswith(\"BGC\"), self.bgc_ids))\n
"},{"location":"api/genomics/#nplinker.genomics.GCF.is_singleton","title":"is_singleton","text":"is_singleton() -> bool\n
Check if the GCF contains only one BGC.
Returns:
bool
\u2013 True if GCF.bgc_ids
contains only one BGC id.
src/nplinker/genomics/gcf.py
def is_singleton(self) -> bool:\n \"\"\"Check if the GCF contains only one BGC.\n\n Returns:\n True if `GCF.bgc_ids` contains only one BGC id.\n \"\"\"\n return len(self.bgc_ids) == 1\n
"},{"location":"api/genomics_abc/","title":"Abstract Base Classes","text":""},{"location":"api/genomics_abc/#nplinker.genomics.abc","title":"nplinker.genomics.abc","text":""},{"location":"api/genomics_abc/#nplinker.genomics.abc.BGCLoaderBase","title":"BGCLoaderBase","text":"BGCLoaderBase(data_dir: str | PathLike)\n
Bases: ABC
Abstract base class for BGC loader.
Parameters:
data_dir
(str | PathLike
) \u2013 Path to directory that contains BGC metadata files (.json) or full data genbank files (.gbk).
src/nplinker/genomics/abc.py
def __init__(self, data_dir: str | PathLike) -> None:\n \"\"\"Initialize the BGC loader.\n\n Args:\n data_dir: Path to directory that contains BGC metadata files\n (.json) or full data genbank files (.gbk).\n \"\"\"\n self.data_dir = str(data_dir)\n
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.BGCLoaderBase.data_dir","title":"data_dir instance-attribute
","text":"data_dir = str(data_dir)\n
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.BGCLoaderBase.get_files","title":"get_files abstractmethod
","text":"get_files() -> dict[str, str]\n
Get path to BGC files.
Returns:
dict[str, str]
\u2013 The key is BGC name and value is path to BGC file
src/nplinker/genomics/abc.py
@abstractmethod\ndef get_files(self) -> dict[str, str]:\n \"\"\"Get path to BGC files.\n\n Returns:\n The key is BGC name and value is path to BGC file\n \"\"\"\n
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.BGCLoaderBase.get_bgcs","title":"get_bgcs abstractmethod
","text":"get_bgcs() -> list[BGC]\n
Get BGC objects.
Returns:
list[BGC]
\u2013 A list of BGC objects
src/nplinker/genomics/abc.py
@abstractmethod\ndef get_bgcs(self) -> list[BGC]:\n \"\"\"Get BGC objects.\n\n Returns:\n A list of BGC objects\n \"\"\"\n
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.GCFLoaderBase","title":"GCFLoaderBase","text":" Bases: ABC
Abstract base class for GCF loader.
"},{"location":"api/genomics_abc/#nplinker.genomics.abc.GCFLoaderBase.get_gcfs","title":"get_gcfsabstractmethod
","text":"get_gcfs(\n keep_mibig_only: bool, keep_singleton: bool\n) -> list[GCF]\n
Get GCF objects.
Parameters:
keep_mibig_only
(bool
) \u2013 True to keep GCFs that contain only MIBiG BGCs.
keep_singleton
(bool
) \u2013 True to keep singleton GCFs. A singleton GCF is a GCF that contains only one BGC.
Returns:
list[GCF]
\u2013 A list of GCF objects
src/nplinker/genomics/abc.py
@abstractmethod\ndef get_gcfs(self, keep_mibig_only: bool, keep_singleton: bool) -> list[GCF]:\n \"\"\"Get GCF objects.\n\n Args:\n keep_mibig_only: True to keep GCFs that contain only MIBiG\n BGCs.\n keep_singleton: True to keep singleton GCFs. A singleton GCF\n is a GCF that contains only one BGC.\n\n Returns:\n A list of GCF objects\n \"\"\"\n
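Clustering results from other tools can be plugged in by subclassing GCFLoaderBase. A minimal sketch, assuming a simple two-column TSV of bgc_id and family_id as input; the file format, class name, and parsing logic are illustrative, not part of NPLinker:

import csv
from os import PathLike

from nplinker.genomics import GCF
from nplinker.genomics.abc import GCFLoaderBase


class TsvGCFLoader(GCFLoaderBase):
    """Hypothetical loader for a two-column TSV: bgc_id <tab> family_id."""

    def __init__(self, file: str | PathLike) -> None:
        families: dict[str, GCF] = {}
        with open(file) as f:
            for bgc_id, family_id in csv.reader(f, delimiter="\t"):
                # one GCF object per family id, collecting its member BGC ids
                gcf = families.setdefault(family_id, GCF(family_id))
                gcf.bgc_ids.add(bgc_id)
        self._gcf_list = list(families.values())

    def get_gcfs(self, keep_mibig_only: bool, keep_singleton: bool) -> list[GCF]:
        gcfs = self._gcf_list
        if not keep_mibig_only:
            gcfs = [g for g in gcfs if not g.has_mibig_only()]
        if not keep_singleton:
            gcfs = [g for g in gcfs if not g.is_singleton()]
        return gcfs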
"},{"location":"api/genomics_utils/","title":"Utilities","text":""},{"location":"api/genomics_utils/#nplinker.genomics.utils","title":"nplinker.genomics.utils","text":""},{"location":"api/genomics_utils/#nplinker.genomics.utils.generate_mappings_genome_id_bgc_id","title":"generate_mappings_genome_id_bgc_id","text":"generate_mappings_genome_id_bgc_id(\n bgc_dir: str | PathLike,\n output_file: str | PathLike | None = None,\n) -> None\n
Generate a file that maps genome id to BGC id.
The input bgc_dir
must follow the structure of the antismash
directory defined in Working Directory Structure, e.g.:
bgc_dir\n \u251c\u2500\u2500 genome_id_1\n \u2502\u00a0 \u251c\u2500\u2500 bgc_id_1.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 genome_id_2\n \u2502\u00a0 \u251c\u2500\u2500 bgc_id_2.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n
Parameters:
bgc_dir
(str | PathLike
) \u2013 The directory has one-layer of subfolders and each subfolder contains BGC files in .gbk
format.
It assumes that
output_file
(str | PathLike | None
, default: None
) \u2013 The path to the output file. The file will be overwritten if it already exists.
Defaults to None, in which case the output file will be placed in the directory bgc_dir
with the file name GENOME_BGC_MAPPINGS_FILENAME.
src/nplinker/genomics/utils.py
def generate_mappings_genome_id_bgc_id(\n bgc_dir: str | PathLike, output_file: str | PathLike | None = None\n) -> None:\n \"\"\"Generate a file that maps genome id to BGC id.\n\n The input `bgc_dir` must follow the structure of the `antismash` directory defined in\n [Working Directory Structure][working-directory-structure], e.g.:\n ```shell\n bgc_dir\n \u251c\u2500\u2500 genome_id_1\n \u2502\u00a0 \u251c\u2500\u2500 bgc_id_1.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u251c\u2500\u2500 genome_id_2\n \u2502\u00a0 \u251c\u2500\u2500 bgc_id_2.gbk\n \u2502\u00a0 \u2514\u2500\u2500 ...\n \u2514\u2500\u2500 ...\n ```\n\n Args:\n bgc_dir: The directory has one-layer of subfolders and each subfolder contains BGC files\n in `.gbk` format.\n\n It assumes that\n\n - the subfolder name is the genome id (e.g. refseq),\n - the BGC file name is the BGC id.\n output_file: The path to the output file.\n The file will be overwritten if it already exists.\n\n Defaults to None, in which case the output file will be placed in\n the directory `bgc_dir` with the file name\n [GENOME_BGC_MAPPINGS_FILENAME][nplinker.defaults.GENOME_BGC_MAPPINGS_FILENAME].\n \"\"\"\n bgc_dir = Path(bgc_dir)\n genome_bgc_mappings = {}\n\n for subdir in list_dirs(bgc_dir):\n genome_id = Path(subdir).name\n bgc_files = list_files(subdir, suffix=(\".gbk\"), keep_parent=False)\n bgc_ids = [bgc_id for f in bgc_files if (bgc_id := Path(f).stem) != genome_id]\n if bgc_ids:\n genome_bgc_mappings[genome_id] = bgc_ids\n else:\n logger.warning(\"No BGC files found in %s\", subdir)\n\n # sort mappings by genome_id and construct json data\n genome_bgc_mappings = dict(sorted(genome_bgc_mappings.items()))\n json_data_mappings = [{\"genome_ID\": k, \"BGC_ID\": v} for k, v in genome_bgc_mappings.items()]\n json_data = {\"mappings\": json_data_mappings, \"version\": \"1.0\"}\n\n # validate json data\n validate(instance=json_data, schema=GENOME_BGC_MAPPINGS_SCHEMA)\n\n if output_file is None:\n output_file = bgc_dir / GENOME_BGC_MAPPINGS_FILENAME\n with open(output_file, \"w\") as f:\n json.dump(json_data, f)\n logger.info(\"Generated genome-BGC mappings file: %s\", output_file)\n
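A quick usage sketch against the layout above (the path is illustrative); with output_file left as None, the mappings JSON is written into bgc_dir itself:

>>> from nplinker.genomics.utils import generate_mappings_genome_id_bgc_id
>>> generate_mappings_genome_id_bgc_id("./antismash")  # writes GENOME_BGC_MAPPINGS_FILENAME into ./antismash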
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.add_strain_to_bgc","title":"add_strain_to_bgc","text":"add_strain_to_bgc(\n strains: StrainCollection, bgcs: Sequence[BGC]\n) -> tuple[list[BGC], list[BGC]]\n
Assign a Strain object to BGC.strain
for input BGCs.
BGC id is used to find the corresponding Strain object. It's possible that no Strain object is found for a BGC id.
Note
The input bgcs
will be changed in place.
Parameters:
strains
(StrainCollection
) \u2013 A collection of all strain objects.
bgcs
(Sequence[BGC]
) \u2013 A list of BGC objects.
Returns:
tuple[list[BGC], list[BGC]]
\u2013 A tuple of two lists of BGC objects,
Raises:
ValueError
\u2013 Multiple strain objects found for a BGC id.
src/nplinker/genomics/utils.py
def add_strain_to_bgc(\n strains: StrainCollection, bgcs: Sequence[BGC]\n) -> tuple[list[BGC], list[BGC]]:\n \"\"\"Assign a Strain object to `BGC.strain` for input BGCs.\n\n BGC id is used to find the corresponding Strain object. It's possible that\n no Strain object is found for a BGC id.\n\n !!! Note\n The input `bgcs` will be changed in place.\n\n Args:\n strains: A collection of all strain objects.\n bgcs: A list of BGC objects.\n\n Returns:\n A tuple of two lists of BGC objects,\n\n - the first list contains BGC objects that are updated with Strain object;\n - the second list contains BGC objects that are not updated with\n Strain object because no Strain object is found.\n\n Raises:\n ValueError: Multiple strain objects found for a BGC id.\n \"\"\"\n bgc_with_strain = []\n bgc_without_strain = []\n for bgc in bgcs:\n try:\n strain_list = strains.lookup(bgc.id)\n except ValueError:\n bgc_without_strain.append(bgc)\n continue\n if len(strain_list) > 1:\n raise ValueError(\n f\"Multiple strain objects found for BGC id '{bgc.id}'.\"\n f\"BGC object accept only one strain.\"\n )\n bgc.strain = strain_list[0]\n bgc_with_strain.append(bgc)\n\n logger.info(\n f\"{len(bgc_with_strain)} BGC objects updated with Strain object.\\n\"\n f\"{len(bgc_without_strain)} BGC objects not updated with Strain object.\"\n )\n return bgc_with_strain, bgc_without_strain\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.add_bgc_to_gcf","title":"add_bgc_to_gcf","text":"add_bgc_to_gcf(\n bgcs: Sequence[BGC], gcfs: Sequence[GCF]\n) -> tuple[list[GCF], list[GCF], dict[GCF, set[str]]]\n
Add BGC objects to GCF object based on GCF's BGC ids.
The attribute of GCF.bgc_ids
contains the ids of BGC objects. These ids are used to find BGC objects from the input bgcs
list. The found BGC objects are added to the bgcs
attribute of GCF object. It is possible that some BGC ids are not found in the input bgcs
list, and so their BGC objects are missing in the GCF object.
Note
This method changes the lists bgcs
and gcfs
in place.
Parameters:
bgcs
(Sequence[BGC]
) \u2013 A list of BGC objects.
gcfs
(Sequence[GCF]
) \u2013 A list of GCF objects.
Returns:
tuple[list[GCF], list[GCF], dict[GCF, set[str]]]
\u2013 A tuple of two lists and a dictionary,
src/nplinker/genomics/utils.py
def add_bgc_to_gcf(\n bgcs: Sequence[BGC], gcfs: Sequence[GCF]\n) -> tuple[list[GCF], list[GCF], dict[GCF, set[str]]]:\n \"\"\"Add BGC objects to GCF object based on GCF's BGC ids.\n\n The attribute of `GCF.bgc_ids` contains the ids of BGC objects. These ids\n are used to find BGC objects from the input `bgcs` list. The found BGC\n objects are added to the `bgcs` attribute of GCF object. It is possible that\n some BGC ids are not found in the input `bgcs` list, and so their BGC\n objects are missing in the GCF object.\n\n !!! note\n This method changes the lists `bgcs` and `gcfs` in place.\n\n Args:\n bgcs: A list of BGC objects.\n gcfs: A list of GCF objects.\n\n Returns:\n A tuple of two lists and a dictionary,\n\n - The first list contains GCF objects that are updated with BGC objects;\n - The second list contains GCF objects that are not updated with BGC objects\n because no BGC objects are found;\n - The dictionary contains GCF objects as keys and a set of ids of missing\n BGC objects as values.\n \"\"\"\n bgc_dict = {bgc.id: bgc for bgc in bgcs}\n gcf_with_bgc = []\n gcf_without_bgc = []\n gcf_missing_bgc: dict[GCF, set[str]] = {}\n for gcf in gcfs:\n for bgc_id in gcf.bgc_ids:\n try:\n bgc = bgc_dict[bgc_id]\n except KeyError:\n if gcf not in gcf_missing_bgc:\n gcf_missing_bgc[gcf] = {bgc_id}\n else:\n gcf_missing_bgc[gcf].add(bgc_id)\n continue\n gcf.add_bgc(bgc)\n\n if gcf.bgcs:\n gcf_with_bgc.append(gcf)\n else:\n gcf_without_bgc.append(gcf)\n\n logger.info(\n f\"{len(gcf_with_bgc)} GCF objects updated with BGC objects.\\n\"\n f\"{len(gcf_without_bgc)} GCF objects not updated with BGC objects.\\n\"\n f\"{len(gcf_missing_bgc)} GCF objects have missing BGC objects.\"\n )\n return gcf_with_bgc, gcf_without_bgc, gcf_missing_bgc\n
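A minimal sketch of add_bgc_to_gcf with in-memory objects (ids are illustrative); in a real pipeline add_strain_to_bgc would be applied first so that the BGCs carry strains:

>>> from nplinker.genomics import BGC, GCF
>>> from nplinker.genomics.utils import add_bgc_to_gcf
>>> bgc = BGC("NC_000001.region001", "NRP")
>>> gcf = GCF("1")
>>> gcf.bgc_ids.add(bgc.id)  # GCF initially knows only the BGC id
>>> with_bgc, without_bgc, missing = add_bgc_to_gcf([bgc], [gcf])
>>> [g.id for g in with_bgc]
['1']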
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.get_mibig_from_gcf","title":"get_mibig_from_gcf","text":"get_mibig_from_gcf(\n gcfs: Sequence[GCF],\n) -> tuple[list[BGC], StrainCollection]\n
Get MIBiG BGCs and strains from GCF objects.
Parameters:
gcfs
(Sequence[GCF]
) \u2013 A list of GCF objects.
Returns:
tuple[list[BGC], StrainCollection]
\u2013 A tuple of two objects,
src/nplinker/genomics/utils.py
def get_mibig_from_gcf(gcfs: Sequence[GCF]) -> tuple[list[BGC], StrainCollection]:\n \"\"\"Get MIBiG BGCs and strains from GCF objects.\n\n Args:\n gcfs: A list of GCF objects.\n\n Returns:\n A tuple of two objects,\n\n - the first is a list of MIBiG BGC objects used in the GCFs;\n - the second is a StrainCollection object that contains all Strain objects used in the\n GCFs.\n \"\"\"\n mibig_bgcs_in_use = []\n mibig_strains_in_use = StrainCollection()\n for gcf in gcfs:\n for bgc in gcf.bgcs:\n if bgc.is_mibig():\n mibig_bgcs_in_use.append(bgc)\n if bgc.strain is not None:\n mibig_strains_in_use.add(bgc.strain)\n return mibig_bgcs_in_use, mibig_strains_in_use\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.extract_mappings_strain_id_original_genome_id","title":"extract_mappings_strain_id_original_genome_id","text":"extract_mappings_strain_id_original_genome_id(\n podp_project_json_file: str | PathLike,\n) -> dict[str, set[str]]\n
Extract mappings \"strain_id <-> original_genome_id\".
Tip
The podp_project_json_file
is the JSON file downloaded from PODP platform.
For example, for PODP project MSV000079284, its JSON file is https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.
Parameters:
podp_project_json_file
(str | PathLike
) \u2013 The path to the PODP project JSON file.
Returns:
dict[str, set[str]]
\u2013 Key is strain id and value is a set of original genome ids.
src/nplinker/genomics/utils.py
def extract_mappings_strain_id_original_genome_id(\n podp_project_json_file: str | PathLike,\n) -> dict[str, set[str]]:\n \"\"\"Extract mappings \"strain_id <-> original_genome_id\".\n\n !!! tip\n The `podp_project_json_file` is the JSON file downloaded from PODP platform.\n\n For example, for PODP project MSV000079284, its JSON file is\n https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.\n\n Args:\n podp_project_json_file: The path to the PODP project\n JSON file.\n\n Returns:\n Key is strain id and value is a set of original genome ids.\n\n See Also:\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n mappings_dict: dict[str, set[str]] = {}\n with open(podp_project_json_file, \"r\") as f:\n json_data = json.load(f)\n\n validate_podp_json(json_data)\n\n for record in json_data[\"genomes\"]:\n strain_id = record[\"genome_label\"]\n genome_id = get_best_available_genome_id(record[\"genome_ID\"])\n if genome_id is None:\n logger.warning(\"Failed to extract genome ID from genome with label %s\", strain_id)\n continue\n if strain_id in mappings_dict:\n mappings_dict[strain_id].add(genome_id)\n else:\n mappings_dict[strain_id] = {genome_id}\n return mappings_dict\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.extract_mappings_original_genome_id_resolved_genome_id","title":"extract_mappings_original_genome_id_resolved_genome_id","text":"extract_mappings_original_genome_id_resolved_genome_id(\n genome_status_json_file: str | PathLike,\n) -> dict[str, str]\n
Extract mappings \"original_genome_id <-> resolved_genome_id\".
Tip
The genome_status_json_file
is generated by the podp_download_and_extract_antismash_data function with a default file name GENOME_STATUS_FILENAME.
Parameters:
genome_status_json_file
(str | PathLike
) \u2013 The path to the genome status JSON file.
Returns:
dict[str, str]
\u2013 Key is original genome id and value is resolved genome id.
src/nplinker/genomics/utils.py
def extract_mappings_original_genome_id_resolved_genome_id(\n genome_status_json_file: str | PathLike,\n) -> dict[str, str]:\n \"\"\"Extract mappings \"original_genome_id <-> resolved_genome_id\".\n\n !!! tip\n The `genome_status_json_file` is generated by the [podp_download_and_extract_antismash_data]\n [nplinker.genomics.antismash.podp_antismash_downloader.podp_download_and_extract_antismash_data]\n function with a default file name [GENOME_STATUS_FILENAME][nplinker.defaults.GENOME_STATUS_FILENAME].\n\n Args:\n genome_status_json_file: The path to the genome status JSON file.\n\n\n Returns:\n Key is original genome id and value is resolved genome id.\n\n See Also:\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n gs_mappings_dict = GenomeStatus.read_json(genome_status_json_file)\n return {gs.original_id: gs.resolved_refseq_id for gs in gs_mappings_dict.values()}\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.extract_mappings_resolved_genome_id_bgc_id","title":"extract_mappings_resolved_genome_id_bgc_id","text":"extract_mappings_resolved_genome_id_bgc_id(\n genome_bgc_mappings_file: str | PathLike,\n) -> dict[str, set[str]]\n
Extract mappings \"resolved_genome_id <-> bgc_id\".
Tip
The genome_bgc_mappings_file
is usually generated by the generate_mappings_genome_id_bgc_id function with a default file name GENOME_BGC_MAPPINGS_FILENAME.
Parameters:
genome_bgc_mappings_file
(str | PathLike
) \u2013 The path to the genome BGC mappings JSON file.
Returns:
dict[str, set[str]]
\u2013 Key is resolved genome id and value is a set of BGC ids.
src/nplinker/genomics/utils.py
def extract_mappings_resolved_genome_id_bgc_id(\n genome_bgc_mappings_file: str | PathLike,\n) -> dict[str, set[str]]:\n \"\"\"Extract mappings \"resolved_genome_id <-> bgc_id\".\n\n !!! tip\n The `genome_bgc_mappings_file` is usually generated by the\n [generate_mappings_genome_id_bgc_id][nplinker.genomics.utils.generate_mappings_genome_id_bgc_id]\n function with a default file name [GENOME_BGC_MAPPINGS_FILENAME][nplinker.defaults.GENOME_BGC_MAPPINGS_FILENAME].\n\n Args:\n genome_bgc_mappings_file: The path to the genome BGC\n mappings JSON file.\n\n Returns:\n Key is resolved genome id and value is a set of BGC ids.\n\n See Also:\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n with open(genome_bgc_mappings_file, \"r\") as f:\n json_data = json.load(f)\n\n # validate the JSON data\n validate(json_data, GENOME_BGC_MAPPINGS_SCHEMA)\n\n return {mapping[\"genome_ID\"]: set(mapping[\"BGC_ID\"]) for mapping in json_data[\"mappings\"]}\n
"},{"location":"api/genomics_utils/#nplinker.genomics.utils.get_mappings_strain_id_bgc_id","title":"get_mappings_strain_id_bgc_id","text":"get_mappings_strain_id_bgc_id(\n mappings_strain_id_original_genome_id: Mapping[\n str, set[str]\n ],\n mappings_original_genome_id_resolved_genome_id: Mapping[\n str, str\n ],\n mappings_resolved_genome_id_bgc_id: Mapping[\n str, set[str]\n ],\n) -> dict[str, set[str]]\n
Get mappings \"strain_id <-> bgc_id\".
Parameters:
mappings_strain_id_original_genome_id
(Mapping[str, set[str]]
) \u2013 Mappings \"strain_id <-> original_genome_id\".
mappings_original_genome_id_resolved_genome_id
(Mapping[str, str]
) \u2013 Mappings \"original_genome_id <-> resolved_genome_id\".
mappings_resolved_genome_id_bgc_id
(Mapping[str, set[str]]
) \u2013 Mappings \"resolved_genome_id <-> bgc_id\".
Returns:
dict[str, set[str]]
\u2013 Key is strain id and value is a set of BGC ids.
extract_mappings_strain_id_original_genome_id
: Extract mappings \"strain_id <-> original_genome_id\".extract_mappings_original_genome_id_resolved_genome_id
: Extract mappings \"original_genome_id <-> resolved_genome_id\".extract_mappings_resolved_genome_id_bgc_id
: Extract mappings \"resolved_genome_id <-> bgc_id\".src/nplinker/genomics/utils.py
def get_mappings_strain_id_bgc_id(\n mappings_strain_id_original_genome_id: Mapping[str, set[str]],\n mappings_original_genome_id_resolved_genome_id: Mapping[str, str],\n mappings_resolved_genome_id_bgc_id: Mapping[str, set[str]],\n) -> dict[str, set[str]]:\n \"\"\"Get mappings \"strain_id <-> bgc_id\".\n\n Args:\n mappings_strain_id_original_genome_id: Mappings \"strain_id <-> original_genome_id\".\n mappings_original_genome_id_resolved_genome_id: Mappings \"original_genome_id <-> resolved_genome_id\".\n mappings_resolved_genome_id_bgc_id: Mappings \"resolved_genome_id <-> bgc_id\".\n\n Returns:\n Key is strain id and value is a set of BGC ids.\n\n See Also:\n - `extract_mappings_strain_id_original_genome_id`: Extract mappings\n \"strain_id <-> original_genome_id\".\n - `extract_mappings_original_genome_id_resolved_genome_id`: Extract mappings\n \"original_genome_id <-> resolved_genome_id\".\n - `extract_mappings_resolved_genome_id_bgc_id`: Extract mappings\n \"resolved_genome_id <-> bgc_id\".\n - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:\n Generate strain mappings JSON file for PODP pipeline.\n \"\"\"\n mappings_dict = {}\n for strain_id, original_genome_ids in mappings_strain_id_original_genome_id.items():\n bgc_ids = set()\n for original_genome_id in original_genome_ids:\n resolved_genome_id = mappings_original_genome_id_resolved_genome_id[original_genome_id]\n if (bgc_id := mappings_resolved_genome_id_bgc_id.get(resolved_genome_id)) is not None:\n bgc_ids.update(bgc_id)\n if bgc_ids:\n mappings_dict[strain_id] = bgc_ids\n return mappings_dict\n
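The three extract_* helpers above feed directly into get_mappings_strain_id_bgc_id; this is essentially the chain used by the PODP strain-mapping pipeline. A sketch with illustrative file paths:

>>> from nplinker.genomics.utils import (
...     extract_mappings_strain_id_original_genome_id,
...     extract_mappings_original_genome_id_resolved_genome_id,
...     extract_mappings_resolved_genome_id_bgc_id,
...     get_mappings_strain_id_bgc_id,
... )
>>> mappings = get_mappings_strain_id_bgc_id(
...     extract_mappings_strain_id_original_genome_id("podp_project.json"),
...     extract_mappings_original_genome_id_resolved_genome_id("genome_status.json"),
...     extract_mappings_resolved_genome_id_bgc_id("genome_bgc_mappings.json"),
... )  # dict: strain id -> set of BGC ids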
"},{"location":"api/gnps/","title":"GNPS","text":""},{"location":"api/gnps/#nplinker.metabolomics.gnps","title":"nplinker.metabolomics.gnps","text":""},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat","title":"GNPSFormat","text":" Bases: Enum
Enum class for GNPS formats or workflows.
ConceptGNPS data
The name of the enum is a short name for the workflow, and the value of the enum is the workflow name used on the GNPS website.
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat.SNETS","title":"SNETSclass-attribute
instance-attribute
","text":"SNETS = 'METABOLOMICS-SNETS'\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat.SNETSV2","title":"SNETSV2 class-attribute
instance-attribute
","text":"SNETSV2 = 'METABOLOMICS-SNETS-V2'\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat.FBMN","title":"FBMN class-attribute
instance-attribute
","text":"FBMN = 'FEATURE-BASED-MOLECULAR-NETWORKING'\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFormat.Unknown","title":"Unknown class-attribute
instance-attribute
","text":"Unknown = 'Unknown-GNPS-Workflow'\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader","title":"GNPSDownloader","text":"GNPSDownloader(task_id: str, download_root: str | PathLike)\n
Download GNPS zip archive for the given task id.
ConceptGNPS data
Note that only GNPS workflows listed in the GNPSFormat enum are supported.
Attributes:
GNPS_DATA_DOWNLOAD_URL
(str
) \u2013 URL template for downloading GNPS data.
GNPS_DATA_DOWNLOAD_URL_FBMN
(str
) \u2013 URL template for downloading GNPS data for FBMN.
gnps_format
(GNPSFormat
) \u2013 GNPS workflow type.
Parameters:
task_id
(str
) \u2013 GNPS task id, identifying the data to be downloaded.
download_root
(str | PathLike
) \u2013 Path where to store the downloaded archive.
Raises:
ValueError
\u2013 If the given task id does not correspond to a supported GNPS workflow.
Examples:
>>> GNPSDownloader(\"c22f44b14a3d450eb836d607cb9521bb\", \"~/downloads\")\n
Source code in src/nplinker/metabolomics/gnps/gnps_downloader.py
def __init__(self, task_id: str, download_root: str | PathLike):\n \"\"\"Initialize the GNPSDownloader.\n\n Args:\n task_id: GNPS task id, identifying the data to be downloaded.\n download_root: Path where to store the downloaded archive.\n\n Raises:\n ValueError: If the given task id does not correspond to a supported\n GNPS workflow.\n\n Examples:\n >>> GNPSDownloader(\"c22f44b14a3d450eb836d607cb9521bb\", \"~/downloads\")\n \"\"\"\n gnps_format = gnps_format_from_task_id(task_id)\n if gnps_format == GNPSFormat.Unknown:\n raise ValueError(\n f\"Unknown workflow type for GNPS task '{task_id}'.\"\n f\"Supported GNPS workflows are described in the GNPSFormat enum, \"\n f\"including such as 'METABOLOMICS-SNETS', 'METABOLOMICS-SNETS-V2' \"\n f\"and 'FEATURE-BASED-MOLECULAR-NETWORKING'.\"\n )\n\n self._task_id = task_id\n self._download_root: Path = Path(download_root)\n self._gnps_format = gnps_format\n self._file_name = gnps_format.value + \"-\" + self._task_id + \".zip\"\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.GNPS_DATA_DOWNLOAD_URL","title":"GNPS_DATA_DOWNLOAD_URL class-attribute
instance-attribute
","text":"GNPS_DATA_DOWNLOAD_URL: str = (\n \"https://gnps.ucsd.edu/ProteoSAFe/DownloadResult?task={}&view=download_clustered_spectra\"\n)\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.GNPS_DATA_DOWNLOAD_URL_FBMN","title":"GNPS_DATA_DOWNLOAD_URL_FBMN class-attribute
instance-attribute
","text":"GNPS_DATA_DOWNLOAD_URL_FBMN: str = (\n \"https://gnps.ucsd.edu/ProteoSAFe/DownloadResult?task={}&view=download_cytoscape_data\"\n)\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.gnps_format","title":"gnps_format property
","text":"gnps_format: GNPSFormat\n
Get the GNPS workflow type.
Returns:
GNPSFormat
\u2013 GNPS workflow type.
download

download() -> Self

Download GNPS data.

Note: GNPS data is downloaded using the POST method (empty payload is OK).

Source code in src/nplinker/metabolomics/gnps/gnps_downloader.py

def download(self) -> Self:
    """Download GNPS data.

    Note: GNPS data is downloaded using the POST method (empty payload is OK).
    """
    download_url(
        self.get_url(), self._download_root, filename=self._file_name, http_method="POST"
    )
    return self
Get the path to the downloaded file.
Returns:
str
\u2013 Download path as string
src/nplinker/metabolomics/gnps/gnps_downloader.py
def get_download_file(self) -> str:\n \"\"\"Get the path to the downloaded file.\n\n Returns:\n Download path as string\n \"\"\"\n return str(Path(self._download_root) / self._file_name)\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.get_task_id","title":"get_task_id","text":"get_task_id() -> str\n
Get the GNPS task id.
Returns:
str
\u2013 Task id as string.
src/nplinker/metabolomics/gnps/gnps_downloader.py
def get_task_id(self) -> str:\n \"\"\"Get the GNPS task id.\n\n Returns:\n Task id as string.\n \"\"\"\n return self._task_id\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSDownloader.get_url","title":"get_url","text":"get_url() -> str\n
Get the download URL.
Returns:
str
\u2013 URL pointing to the GNPS data to be downloaded.
src/nplinker/metabolomics/gnps/gnps_downloader.py
def get_url(self) -> str:\n \"\"\"Get the download URL.\n\n Returns:\n URL pointing to the GNPS data to be downloaded.\n \"\"\"\n if self.gnps_format == GNPSFormat.FBMN:\n return GNPSDownloader.GNPS_DATA_DOWNLOAD_URL_FBMN.format(self._task_id)\n return GNPSDownloader.GNPS_DATA_DOWNLOAD_URL.format(self._task_id)\n
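Since download() returns self, the calls can be chained; a sketch using the example task id from above (the download directory is illustrative):

>>> from nplinker.metabolomics.gnps import GNPSDownloader
>>> downloader = GNPSDownloader("c22f44b14a3d450eb836d607cb9521bb", "~/downloads")
>>> archive = downloader.download().get_download_file()  # path to the downloaded .zip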
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSExtractor","title":"GNPSExtractor","text":"GNPSExtractor(\n file: str | PathLike, extract_dir: str | PathLike\n)\n
Extract files from a GNPS molecular networking archive (.zip).
ConceptGNPS data
Four files are extracted and renamed to the following names:
The files to be extracted are selected based on the GNPS workflow type, as described below (in the order of the files above):
Attributes:
gnps_format
(GNPSFormat
) \u2013 The GNPS workflow type.
extract_dir
(str
) \u2013 The path where to extract the files to.
Parameters:
file
(str | PathLike
) \u2013 The path to the GNPS zip file.
extract_dir
(str | PathLike
) \u2013 path to the directory where to extract the files to.
Raises:
ValueError
\u2013 If the given file is an invalid GNPS archive.
Examples:
>>> gnps_extractor = GNPSExtractor(\"path/to/gnps_archive.zip\", \"path/to/extract_dir\")\n>>> gnps_extractor.gnps_format\n<GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>\n>>> gnps_extractor.extract_dir\n'path/to/extract_dir'\n
Source code in src/nplinker/metabolomics/gnps/gnps_extractor.py
def __init__(self, file: str | PathLike, extract_dir: str | PathLike):\n \"\"\"Initialize the GNPSExtractor.\n\n Args:\n file: The path to the GNPS zip file.\n extract_dir: path to the directory where to extract the files to.\n\n Raises:\n ValueError: If the given file is an invalid GNPS archive.\n\n Examples:\n >>> gnps_extractor = GNPSExtractor(\"path/to/gnps_archive.zip\", \"path/to/extract_dir\")\n >>> gnps_extractor.gnps_format\n <GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>\n >>> gnps_extractor.extract_dir\n 'path/to/extract_dir'\n \"\"\"\n gnps_format = gnps_format_from_archive(file)\n if gnps_format == GNPSFormat.Unknown:\n raise ValueError(\n f\"Unknown workflow type for GNPS archive '{file}'.\"\n f\"Supported GNPS workflows are described in the GNPSFormat enum, \"\n f\"including such as 'METABOLOMICS-SNETS', 'METABOLOMICS-SNETS-V2' \"\n f\"and 'FEATURE-BASED-MOLECULAR-NETWORKING'.\"\n )\n\n self._file = Path(file)\n self._extract_path = Path(extract_dir)\n self._gnps_format = gnps_format\n # the order of filenames matters\n self._target_files = [\n \"file_mappings\",\n \"spectra.mgf\",\n \"molecular_families.tsv\",\n \"annotations.tsv\",\n ]\n\n self._extract()\n
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSExtractor.gnps_format","title":"gnps_format property
","text":"gnps_format: GNPSFormat\n
Get the GNPS workflow type.
Returns:
GNPSFormat
\u2013 GNPS workflow type.
property
","text":"extract_dir: str\n
Get the path where to extract the files to.
Returns:
str
\u2013 Path where to extract files as string.
GNPSSpectrumLoader

GNPSSpectrumLoader(file: str | PathLike)

Bases: SpectrumLoaderBase

Load mass spectra from the given GNPS MGF file.

Concept: GNPS data

The MGF file comes from the GNPS output archive; which archive member it is depends on the GNPS workflow type.

Parameters:
    file (str | PathLike) – Path to the MGF file.

Raises:
    ValueError – Raises ValueError if the file is not valid.

Examples:
>>> loader = GNPSSpectrumLoader("gnps_spectra.mgf")
>>> print(loader.spectra[0])

Source code in src/nplinker/metabolomics/gnps/gnps_spectrum_loader.py

def __init__(self, file: str | PathLike) -> None:
    """Initialize the GNPSSpectrumLoader.

    Args:
        file: path to the MGF file.

    Raises:
        ValueError: Raises ValueError if the file is not valid.

    Examples:
        >>> loader = GNPSSpectrumLoader("gnps_spectra.mgf")
        >>> print(loader.spectra[0])
    """
    self._file = str(file)
    self._spectra: list[Spectrum] = []

    self._validate()
    self._load()
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSSpectrumLoader.spectra","title":"spectra property
","text":"spectra: list[Spectrum]\n
Get the list of Spectrum objects.
Returns:
list[Spectrum]
\u2013 list[Spectrum]: the loaded spectra as a list of Spectrum
objects.
GNPSMolecularFamilyLoader

GNPSMolecularFamilyLoader(file: str | PathLike)

Bases: MolecularFamilyLoaderBase

Load molecular families from GNPS data.

Concept: GNPS data

The molecular family file comes from the GNPS output archive; which archive member it is depends on the GNPS workflow type.

The ComponentIndex column in the GNPS molecular family file is treated as the family id. However, molecular families that have only one member (i.e. one spectrum), called singleton molecular families, all share the same value of -1 in the ComponentIndex column. To make the family id unique, the spectrum id plus a prefix singleton- is used as the family id of singleton molecular families.

Parameters:
    file (str | PathLike) – Path to the GNPS molecular family file.

Raises:
    ValueError – Raises ValueError if the file is not valid.

Examples:
>>> loader = GNPSMolecularFamilyLoader("gnps_molecular_families.tsv")
>>> print(loader.families)
[<MolecularFamily 1>, <MolecularFamily 2>, ...]
>>> print(loader.families[0].spectra_ids)
{'1', '3', '7', ...}

Source code in src/nplinker/metabolomics/gnps/gnps_molecular_family_loader.py

def __init__(self, file: str | PathLike) -> None:
    """Initialize the GNPSMolecularFamilyLoader.

    Args:
        file: Path to the GNPS molecular family file.

    Raises:
        ValueError: Raises ValueError if the file is not valid.

    Examples:
        >>> loader = GNPSMolecularFamilyLoader("gnps_molecular_families.tsv")
        >>> print(loader.families)
        [<MolecularFamily 1>, <MolecularFamily 2>, ...]
        >>> print(loader.families[0].spectra_ids)
        {'1', '3', '7', ...}
    """
    self._mfs: list[MolecularFamily] = []
    self._file = file

    self._validate()
    self._load()
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSMolecularFamilyLoader.get_mfs","title":"get_mfs","text":"get_mfs(\n keep_singleton: bool = False,\n) -> list[MolecularFamily]\n
Get MolecularFamily objects.
Parameters:
keep_singleton
(bool
, default: False
) \u2013 True to keep singleton molecular families. A singleton molecular family is a molecular family that contains only one spectrum.
Returns:
list[MolecularFamily]
\u2013 A list of MolecularFamily objects with their spectra ids.
src/nplinker/metabolomics/gnps/gnps_molecular_family_loader.py
def get_mfs(self, keep_singleton: bool = False) -> list[MolecularFamily]:\n \"\"\"Get MolecularFamily objects.\n\n Args:\n keep_singleton: True to keep singleton molecular families. A\n singleton molecular family is a molecular family that contains\n only one spectrum.\n\n Returns:\n A list of MolecularFamily objects with their spectra ids.\n \"\"\"\n mfs = self._mfs\n if not keep_singleton:\n mfs = [mf for mf in mfs if not mf.is_singleton()]\n return mfs\n
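A short usage sketch (the file path is illustrative); by default singleton molecular families are dropped, mirroring the get_gcfs filters described earlier:

>>> from nplinker.metabolomics.gnps import GNPSMolecularFamilyLoader
>>> loader = GNPSMolecularFamilyLoader("gnps_molecular_families.tsv")
>>> mfs = loader.get_mfs(keep_singleton=False)  # singleton families excluded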
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSAnnotationLoader","title":"GNPSAnnotationLoader","text":"GNPSAnnotationLoader(file: str | PathLike)\n
Bases: AnnotationLoaderBase
Load annotations from GNPS output file.
ConceptGNPS data
The annotation file is a .tsv
file from GNPS output archive, as described below for each GNPS workflow type:
Parameters:
file
(str | PathLike
) \u2013 The GNPS annotation file.
Examples:
>>> loader = GNPSAnnotationLoader(\"gnps_annotations.tsv\")\n>>> print(loader.annotations[\"100\"])\n{'#Scan#': '100',\n'Adduct': 'M+H',\n'CAS_Number': 'N/A',\n'Charge': '1',\n'Compound_Name': 'MLS002153841-01!Iobenguane sulfate',\n'Compound_Source': 'NIH Pharmacologically Active Library',\n'Data_Collector': 'VP/LMS',\n'ExactMass': '274.992',\n'INCHI': 'N/A',\n'INCHI_AUX': 'N/A',\n'Instrument': 'qTof',\n'IonMode': 'Positive',\n'Ion_Source': 'LC-ESI',\n'LibMZ': '276.003',\n'LibraryName': 'lib-00014.mgf',\n'LibraryQualityString': 'Gold',\n'Library_Class': '1',\n'MQScore': '0.704152',\n'MZErrorPPM': '405416',\n'MassDiff': '111.896',\n'Organism': 'GNPS-NIH-SMALLMOLECULEPHARMACOLOGICALLYACTIVE',\n'PI': 'Dorrestein',\n'Precursor_MZ': '276.003',\n'Pubmed_ID': 'N/A',\n'RT_Query': '795.979',\n'SharedPeaks': '7',\n'Smiles': 'NC(=N)NCc1cccc(I)c1.OS(=O)(=O)O',\n'SpecCharge': '1',\n'SpecMZ': '164.107',\n'SpectrumFile': 'spectra/specs_ms.pklbin',\n'SpectrumID': 'CCMSLIB00000086167',\n'TIC_Query': '986.997',\n'UpdateWorkflowName': 'UPDATE-SINGLE-ANNOTATED-GOLD',\n'tags': ' ',\n'png_url': 'https://metabolomics-usi.gnps2.org/png/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n'json_url': 'https://metabolomics-usi.gnps2.org/json/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n'svg_url': 'https://metabolomics-usi.gnps2.org/svg/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',\n'spectrum_url': 'https://metabolomics-usi.gnps2.org/spectrum/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167'}\n
Source code in src/nplinker/metabolomics/gnps/gnps_annotation_loader.py
def __init__(self, file: str | PathLike) -> None:
    """Initialize the GNPSAnnotationLoader.

    Args:
        file: The GNPS annotation file.

    Examples:
        >>> loader = GNPSAnnotationLoader("gnps_annotations.tsv")
        >>> print(loader.annotations["100"])
        {'#Scan#': '100',
        'Adduct': 'M+H',
        'CAS_Number': 'N/A',
        'Charge': '1',
        'Compound_Name': 'MLS002153841-01!Iobenguane sulfate',
        'Compound_Source': 'NIH Pharmacologically Active Library',
        'Data_Collector': 'VP/LMS',
        'ExactMass': '274.992',
        'INCHI': 'N/A',
        'INCHI_AUX': 'N/A',
        'Instrument': 'qTof',
        'IonMode': 'Positive',
        'Ion_Source': 'LC-ESI',
        'LibMZ': '276.003',
        'LibraryName': 'lib-00014.mgf',
        'LibraryQualityString': 'Gold',
        'Library_Class': '1',
        'MQScore': '0.704152',
        'MZErrorPPM': '405416',
        'MassDiff': '111.896',
        'Organism': 'GNPS-NIH-SMALLMOLECULEPHARMACOLOGICALLYACTIVE',
        'PI': 'Dorrestein',
        'Precursor_MZ': '276.003',
        'Pubmed_ID': 'N/A',
        'RT_Query': '795.979',
        'SharedPeaks': '7',
        'Smiles': 'NC(=N)NCc1cccc(I)c1.OS(=O)(=O)O',
        'SpecCharge': '1',
        'SpecMZ': '164.107',
        'SpectrumFile': 'spectra/specs_ms.pklbin',
        'SpectrumID': 'CCMSLIB00000086167',
        'TIC_Query': '986.997',
        'UpdateWorkflowName': 'UPDATE-SINGLE-ANNOTATED-GOLD',
        'tags': ' ',
        'png_url': 'https://metabolomics-usi.gnps2.org/png/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',
        'json_url': 'https://metabolomics-usi.gnps2.org/json/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',
        'svg_url': 'https://metabolomics-usi.gnps2.org/svg/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167',
        'spectrum_url': 'https://metabolomics-usi.gnps2.org/spectrum/?usi1=mzspec:GNPS:GNPS-LIBRARY:accession:CCMSLIB00000086167'}
    """
    self._file = Path(file)
    self._annotations: dict[str, dict] = {}

    self._validate()
    self._load()
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSAnnotationLoader.annotations","title":"annotations property
","text":"annotations: dict[str, dict]\n
Get annotations.
Returns:

- dict[str, dict] – Keys are spectrum ids ("#Scan#" in the annotation file) and values are the annotations dict for each spectrum.

GNPSFileMappingLoader

GNPSFileMappingLoader(file: str | PathLike)
Bases: FileMappingLoaderBase
Class to load file mappings from GNPS output file.
Concept: GNPS data

File mappings refer to the mapping from a spectrum id to the files in which that spectrum occurs.

The file mappings file comes from the GNPS output archive; its location for each GNPS workflow type is described under gnps_format_from_file_mapping below.
Parameters:

- file (str | PathLike) – Path to the GNPS file mappings file.

Raises:

- ValueError – Raises ValueError if the file is not valid.
Examples:
>>> loader = GNPSFileMappingLoader("gnps_file_mappings.tsv")
>>> print(loader.mappings["1"])
['26c.mzXML']
>>> print(loader.mapping_reversed["26c.mzXML"])
{'1', '3', '7', ...}
Source code in src/nplinker/metabolomics/gnps/gnps_file_mapping_loader.py
def __init__(self, file: str | PathLike) -> None:
    """Initialize the GNPSFileMappingLoader.

    Args:
        file: Path to the GNPS file mappings file.

    Raises:
        ValueError: Raises ValueError if the file is not valid.

    Examples:
        >>> loader = GNPSFileMappingLoader("gnps_file_mappings.tsv")
        >>> print(loader.mappings["1"])
        ['26c.mzXML']
        >>> print(loader.mapping_reversed["26c.mzXML"])
        {'1', '3', '7', ...}
    """
    self._gnps_format = gnps_format_from_file_mapping(file)
    if self._gnps_format is GNPSFormat.Unknown:
        raise ValueError("Unknown workflow type for GNPS file mappings file ")

    self._file = Path(file)
    self._mapping: dict[str, list[str]] = {}

    self._validate()
    self._load()
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.GNPSFileMappingLoader.mappings","title":"mappings property
","text":"mappings: dict[str, list[str]]\n
Return mapping from spectrum id to files in which this spectrum occurs.
Returns:

- dict[str, list[str]] – Mapping from spectrum id to names of all files in which this spectrum occurs.

mapping_reversed property

mapping_reversed: dict[str, set[str]]

Return mapping from file name to all spectra that occur in this file.

Returns:

- dict[str, set[str]] – Mapping from file name to all spectra ids that occur in this file.
gnps_format_from_archive

gnps_format_from_archive(zip_file: str | PathLike) -> GNPSFormat
Detect GNPS format from GNPS zip archive.
The detection is based on the filename of the zip file and the names of the files contained in the zip file.
Parameters:

- zip_file (str | PathLike) – Path to the GNPS zip file.

Returns:

- GNPSFormat – The format identified in the GNPS zip file.
Examples:
>>> gnps_format_from_archive("ProteoSAFe-METABOLOMICS-SNETS-c22f44b1-download_clustered_spectra.zip")
<GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>
>>> gnps_format_from_archive("ProteoSAFe-METABOLOMICS-SNETS-V2-189e8bf1-download_clustered_spectra.zip")
<GNPSFormat.SNETSV2: 'METABOLOMICS-SNETS-V2'>
>>> gnps_format_from_archive("ProteoSAFe-FEATURE-BASED-MOLECULAR-NETWORKING-672d0a53-download_cytoscape_data.zip")
<GNPSFormat.FBMN: 'FEATURE-BASED-MOLECULAR-NETWORKING'>
Source code in src/nplinker/metabolomics/gnps/gnps_format.py
def gnps_format_from_archive(zip_file: str | PathLike) -> GNPSFormat:
    """Detect GNPS format from GNPS zip archive.

    The detection is based on the filename of the zip file and the names of the
    files contained in the zip file.

    Args:
        zip_file: Path to the GNPS zip file.

    Returns:
        The format identified in the GNPS zip file.

    Examples:
        >>> gnps_format_from_archive("ProteoSAFe-METABOLOMICS-SNETS-c22f44b1-download_clustered_spectra.zip")
        <GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>
        >>> gnps_format_from_archive("ProteoSAFe-METABOLOMICS-SNETS-V2-189e8bf1-download_clustered_spectra.zip")
        <GNPSFormat.SNETSV2: 'METABOLOMICS-SNETS-V2'>
        >>> gnps_format_from_archive("ProteoSAFe-FEATURE-BASED-MOLECULAR-NETWORKING-672d0a53-download_cytoscape_data.zip")
        <GNPSFormat.FBMN: 'FEATURE-BASED-MOLECULAR-NETWORKING'>
    """
    file = Path(zip_file)
    # Guess the format from the filename of the zip file
    if GNPSFormat.FBMN.value in file.name:
        return GNPSFormat.FBMN
    # the order of the if statements matters for the following two
    if GNPSFormat.SNETSV2.value in file.name:
        return GNPSFormat.SNETSV2
    if GNPSFormat.SNETS.value in file.name:
        return GNPSFormat.SNETS

    # Guess the format from the names of the files in the zip file
    with zipfile.ZipFile(file) as archive:
        filenames = archive.namelist()
    if any(GNPSFormat.FBMN.value in x for x in filenames):
        return GNPSFormat.FBMN
    # the order of the if statements matters for the following two
    if any(GNPSFormat.SNETSV2.value in x for x in filenames):
        return GNPSFormat.SNETSV2
    if any(GNPSFormat.SNETS.value in x for x in filenames):
        return GNPSFormat.SNETS

    return GNPSFormat.Unknown
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.gnps_format_from_file_mapping","title":"gnps_format_from_file_mapping","text":"gnps_format_from_file_mapping(\n file: str | PathLike,\n) -> GNPSFormat\n
Detect GNPS format from the given file mapping file.
The GNPS file mapping file is located in different folders depending on the GNPS workflow. Here are the locations in corresponding GNPS zip archives:
- METABOLOMICS-SNETS workflow: the .tsv file in the folder clusterinfosummarygroup_attributes_withIDs_withcomponentID
- METABOLOMICS-SNETS-V2 workflow: the .clustersummary file (tsv) in the folder clusterinfosummarygroup_attributes_withIDs_withcomponentID
- FEATURE-BASED-MOLECULAR-NETWORKING workflow: the .csv file in the folder quantification_table
Parameters:

- file (str | PathLike) – Path to the file whose format should be detected.

Returns:

- GNPSFormat – GNPS format identified in the file.

Source code in src/nplinker/metabolomics/gnps/gnps_format.py
def gnps_format_from_file_mapping(file: str | PathLike) -> GNPSFormat:
    """Detect GNPS format from the given file mapping file.

    The GNPS file mapping file is located in different folders depending on the
    GNPS workflow. Here are the locations in corresponding GNPS zip archives:

    - `METABOLOMICS-SNETS` workflow: the `.tsv` file in the folder
      `clusterinfosummarygroup_attributes_withIDs_withcomponentID`
    - `METABOLOMICS-SNETS-V2` workflow: the `.clustersummary` file (tsv) in the folder
      `clusterinfosummarygroup_attributes_withIDs_withcomponentID`
    - `FEATURE-BASED-MOLECULAR-NETWORKING` workflow: the `.csv` file in the folder
      `quantification_table`

    Args:
        file: Path to the file to peek the format for.

    Returns:
        GNPS format identified in the file.
    """
    with open(file, "r") as f:
        header = f.readline().strip()

    if re.search(r"\bAllFiles\b", header):
        return GNPSFormat.SNETS
    if re.search(r"\bUniqueFileSources\b", header):
        return GNPSFormat.SNETSV2
    if re.search(r"\b{}\b".format(re.escape("row ID")), header):
        return GNPSFormat.FBMN
    return GNPSFormat.Unknown
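A short usage sketch (the file path is hypothetical); the detection only peeks at the header line of the file mapping file:

from nplinker.metabolomics.gnps import gnps_format_from_file_mapping

fmt = gnps_format_from_file_mapping("gnps_file_mappings.tsv")  # hypothetical file
print(fmt)  # e.g. <GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>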
"},{"location":"api/gnps/#nplinker.metabolomics.gnps.gnps_format_from_task_id","title":"gnps_format_from_task_id","text":"gnps_format_from_task_id(task_id: str) -> GNPSFormat\n
Detect GNPS format for the given task id.
Parameters:

- task_id (str) – GNPS task id.

Returns:

- GNPSFormat – The format identified in the GNPS task.
Examples:
>>> gnps_format_from_task_id("c22f44b14a3d450eb836d607cb9521bb")
<GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>
>>> gnps_format_from_task_id("189e8bf16af145758b0a900f1c44ff4a")
<GNPSFormat.SNETSV2: 'METABOLOMICS-SNETS-V2'>
>>> gnps_format_from_task_id("92036537c21b44c29e509291e53f6382")
<GNPSFormat.FBMN: 'FEATURE-BASED-MOLECULAR-NETWORKING'>
>>> gnps_format_from_task_id("0ad6535e34d449788f297e712f43068a")
<GNPSFormat.Unknown: 'Unknown-GNPS-Workflow'>
Source code in src/nplinker/metabolomics/gnps/gnps_format.py
def gnps_format_from_task_id(task_id: str) -> GNPSFormat:
    """Detect GNPS format for the given task id.

    Args:
        task_id: GNPS task id.

    Returns:
        The format identified in the GNPS task.

    Examples:
        >>> gnps_format_from_task_id("c22f44b14a3d450eb836d607cb9521bb")
        <GNPSFormat.SNETS: 'METABOLOMICS-SNETS'>
        >>> gnps_format_from_task_id("189e8bf16af145758b0a900f1c44ff4a")
        <GNPSFormat.SNETSV2: 'METABOLOMICS-SNETS-V2'>
        >>> gnps_format_from_task_id("92036537c21b44c29e509291e53f6382")
        <GNPSFormat.FBMN: 'FEATURE-BASED-MOLECULAR-NETWORKING'>
        >>> gnps_format_from_task_id("0ad6535e34d449788f297e712f43068a")
        <GNPSFormat.Unknown: 'Unknown-GNPS-Workflow'>
    """
    task_html = httpx.get(GNPS_TASK_URL.format(task_id))
    soup = BeautifulSoup(task_html.text, features="html.parser")
    try:
        # find the td tag that follows the th tag containing 'Workflow'
        workflow_tag = soup.find("th", string="Workflow").find_next_sibling("td")  # type: ignore
        workflow_format = workflow_tag.contents[0].strip()  # type: ignore
    except AttributeError:
        return GNPSFormat.Unknown

    if workflow_format == GNPSFormat.FBMN.value:
        return GNPSFormat.FBMN
    if workflow_format == GNPSFormat.SNETSV2.value:
        return GNPSFormat.SNETSV2
    if workflow_format == GNPSFormat.SNETS.value:
        return GNPSFormat.SNETS
    return GNPSFormat.Unknown
"},{"location":"api/loader/","title":"Dataset Loader","text":""},{"location":"api/loader/#nplinker.loader","title":"nplinker.loader","text":""},{"location":"api/loader/#nplinker.loader.DatasetLoader","title":"DatasetLoader","text":"DatasetLoader(config: Dynaconf)\n
Load datasets from the working directory with the given configuration.
Concept and Diagram: Working Directory Structure, Dataset Loading Pipeline

Loaded data are stored in the data containers (attributes), e.g. self.bgcs, self.gcfs, etc.
Attributes:

- config – A Dynaconf object that contains the configuration settings.
- bgcs (list[BGC]) – A list of BGC objects.
- gcfs (list[GCF]) – A list of GCF objects.
- spectra (list[Spectrum]) – A list of Spectrum objects.
- mfs (list[MolecularFamily]) – A list of MolecularFamily objects.
- mibig_bgcs (list[BGC]) – A list of MIBiG BGC objects.
- mibig_strains_in_use (StrainCollection) – A StrainCollection object that contains the strains in use from MIBiG.
- product_types (list) – A list of product types.
- strains (StrainCollection) – A StrainCollection object that contains all strains.
- class_matches – A ClassMatches object that contains class match info.
- chem_classes – A ChemClassPredictions object that contains chemical class predictions.

Parameters:

- config (Dynaconf) – A Dynaconf object that contains the configuration settings.
Examples:
>>> from nplinker.config import load_config
>>> from nplinker.loader import DatasetLoader
>>> config = load_config("nplinker.toml")
>>> loader = DatasetLoader(config)
>>> loader.load()
See Also: DatasetArranger – Download, generate and/or validate datasets to ensure they are ready for loading.

Source code in src/nplinker/loader.py
def __init__(self, config: Dynaconf) -> None:
    """Initialize the DatasetLoader.

    Args:
        config: A Dynaconf object that contains the configuration settings.

    Examples:
        >>> from nplinker.config import load_config
        >>> from nplinker.loader import DatasetLoader
        >>> config = load_config("nplinker.toml")
        >>> loader = DatasetLoader(config)
        >>> loader.load()

    See Also:
        [DatasetArranger][nplinker.arranger.DatasetArranger]: Download, generate and/or validate
        datasets to ensure they are ready for loading.
    """
    self.config = config

    self.bgcs: list[BGC] = []
    self.gcfs: list[GCF] = []
    self.spectra: list[Spectrum] = []
    self.mfs: list[MolecularFamily] = []
    self.mibig_bgcs: list[BGC] = []
    self.mibig_strains_in_use: StrainCollection = StrainCollection()
    self.product_types: list = []
    self.strains: StrainCollection = StrainCollection()

    self.class_matches = None
    self.chem_classes = None
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.RUN_CANOPUS_DEFAULT","title":"RUN_CANOPUS_DEFAULT class-attribute
instance-attribute
","text":"RUN_CANOPUS_DEFAULT = False\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.EXTRA_CANOPUS_PARAMS_DEFAULT","title":"EXTRA_CANOPUS_PARAMS_DEFAULT class-attribute
instance-attribute
","text":"EXTRA_CANOPUS_PARAMS_DEFAULT = (\n \"--maxmz 600 formula zodiac structure canopus\"\n)\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.OR_CANOPUS","title":"OR_CANOPUS class-attribute
instance-attribute
","text":"OR_CANOPUS = 'canopus_dir'\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.OR_MOLNETENHANCER","title":"OR_MOLNETENHANCER class-attribute
instance-attribute
","text":"OR_MOLNETENHANCER = 'molnetenhancer_dir'\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.config","title":"config instance-attribute
","text":"config = config\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.bgcs","title":"bgcs instance-attribute
","text":"bgcs: list[BGC] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.gcfs","title":"gcfs instance-attribute
","text":"gcfs: list[GCF] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.spectra","title":"spectra instance-attribute
","text":"spectra: list[Spectrum] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.mfs","title":"mfs instance-attribute
","text":"mfs: list[MolecularFamily] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.mibig_bgcs","title":"mibig_bgcs instance-attribute
","text":"mibig_bgcs: list[BGC] = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.mibig_strains_in_use","title":"mibig_strains_in_use instance-attribute
","text":"mibig_strains_in_use: StrainCollection = StrainCollection()\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.product_types","title":"product_types instance-attribute
","text":"product_types: list = []\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.strains","title":"strains instance-attribute
","text":"strains: StrainCollection = StrainCollection()\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.class_matches","title":"class_matches instance-attribute
","text":"class_matches = None\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.chem_classes","title":"chem_classes instance-attribute
","text":"chem_classes = None\n
"},{"location":"api/loader/#nplinker.loader.DatasetLoader.load","title":"load","text":"load() -> bool\n
Load all data from data files in the working directory.
See Dataset Loading Pipeline for the detailed steps.
Returns:

- bool – True if all data are loaded successfully.

Source code in src/nplinker/loader.py
def load(self) -> bool:
    """Load all data from data files in the working directory.

    See [Dataset Loading Pipeline][dataset-loading-pipeline] for the detailed steps.

    Returns:
        True if all data are loaded successfully.
    """
    if not self._load_strain_mappings():
        return False

    if not self._load_metabolomics():
        return False

    if not self._load_genomics():
        return False

    # set self.strains with all strains from input plus mibig strains in use
    self.strains = self.strains + self.mibig_strains_in_use

    if len(self.strains) == 0:
        raise Exception("Failed to find *ANY* strains.")

    return True
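A minimal sketch of how load is typically called (the config path is hypothetical); since load returns a bool, the result should be checked before the data containers are used:

from nplinker.config import load_config
from nplinker.loader import DatasetLoader

config = load_config("nplinker.toml")  # hypothetical config file
loader = DatasetLoader(config)

# `load` returns False when strain mappings, metabolomics or genomics
# data fail to load, so check the result before using the containers.
if not loader.load():
    raise RuntimeError("Dataset loading failed; check the logs for details.")
print(f"Loaded {len(loader.bgcs)} BGCs and {len(loader.spectra)} spectra")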
"},{"location":"api/metabolomics/","title":"Data Models","text":""},{"location":"api/metabolomics/#nplinker.metabolomics","title":"nplinker.metabolomics","text":""},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily","title":"MolecularFamily","text":"MolecularFamily(id: str)\n
Class to model molecular family.
Attributes:

- id (str) – Unique id for the molecular family.
- spectra_ids (set[str]) – Set of spectrum ids in the molecular family.
- spectra (set[Spectrum]) – Set of Spectrum objects in the molecular family.
- strains (StrainCollection) – StrainCollection object that contains strains in the molecular family.

Parameters:

- id (str) – Unique id for the molecular family.

Source code in src/nplinker/metabolomics/molecular_family.py
def __init__(self, id: str):
    """Initialize the MolecularFamily.

    Args:
        id: Unique id for the molecular family.
    """
    self.id: str = id
    self.spectra_ids: set[str] = set()
    self._spectra: set[Spectrum] = set()
    self._strains: StrainCollection = StrainCollection()
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.id","title":"id instance-attribute
","text":"id: str = id\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.spectra_ids","title":"spectra_ids instance-attribute
","text":"spectra_ids: set[str] = set()\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.spectra","title":"spectra property
","text":"spectra: set[Spectrum]\n
Get Spectrum objects in the molecular family.
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.strains","title":"strainsproperty
","text":"strains: StrainCollection\n
Get strains in the molecular family.
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/metabolomics/molecular_family.py
def __str__(self) -> str:\n return (\n f\"MolecularFamily(id={self.id}, #Spectrum_objects={len(self._spectra)}, \"\n f\"#spectrum_ids={len(self.spectra_ids)}, #strains={len(self._strains)})\"\n )\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/metabolomics/molecular_family.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/metabolomics/molecular_family.py
def __eq__(self, other) -> bool:\n if isinstance(other, MolecularFamily):\n return self.id == other.id\n return NotImplemented\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__hash__","title":"__hash__","text":"__hash__() -> int\n
Source code in src/nplinker/metabolomics/molecular_family.py
def __hash__(self) -> int:\n return hash(self.id)\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.__reduce__","title":"__reduce__","text":"__reduce__() -> tuple\n
Reduce function for pickling.
Source code insrc/nplinker/metabolomics/molecular_family.py
def __reduce__(self) -> tuple:\n \"\"\"Reduce function for pickling.\"\"\"\n return (self.__class__, (self.id,), self.__dict__)\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.add_spectrum","title":"add_spectrum","text":"add_spectrum(spectrum: Spectrum) -> None\n
Add a Spectrum object to the molecular family.
Parameters:

- spectrum (Spectrum) – Spectrum object to add to the molecular family.

Source code in src/nplinker/metabolomics/molecular_family.py
def add_spectrum(self, spectrum: Spectrum) -> None:
    """Add a Spectrum object to the molecular family.

    Args:
        spectrum: `Spectrum` object to add to the molecular family.
    """
    self._spectra.add(spectrum)
    self.spectra_ids.add(spectrum.id)
    self._strains = self._strains + spectrum.strains
    # add the molecular family to the spectrum
    spectrum.family = self
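A small sketch (all ids and values are made up) showing that add_spectrum wires both sides of the relationship:

from nplinker.metabolomics import MolecularFamily, Spectrum

# Hypothetical objects for illustration only.
mf = MolecularFamily("mf_1")
spec = Spectrum(id="1", mz=[100.0, 200.0], intensity=[1.0, 0.5], precursor_mz=250.0)

mf.add_spectrum(spec)
assert spec.family is mf          # back-reference set by add_spectrum
assert "1" in mf.spectra_ids      # id registered on the family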
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.detach_spectrum","title":"detach_spectrum","text":"detach_spectrum(spectrum: Spectrum) -> None\n
Remove a Spectrum object from the molecular family.
Parameters:

- spectrum (Spectrum) – Spectrum object to remove from the molecular family.

Source code in src/nplinker/metabolomics/molecular_family.py
def detach_spectrum(self, spectrum: Spectrum) -> None:
    """Remove a Spectrum object from the molecular family.

    Args:
        spectrum: `Spectrum` object to remove from the molecular family.
    """
    self._spectra.remove(spectrum)
    self.spectra_ids.remove(spectrum.id)
    self._strains = self._update_strains()
    # remove the molecular family from the spectrum
    spectrum.family = None
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.has_strain","title":"has_strain","text":"has_strain(strain: Strain) -> bool\n
Check if the given strain exists.
Parameters:

- strain (Strain) – Strain object.

Returns:

- bool – True when the given strain exists.

Source code in src/nplinker/metabolomics/molecular_family.py
def has_strain(self, strain: Strain) -> bool:
    """Check if the given strain exists.

    Args:
        strain: `Strain` object.

    Returns:
        True when the given strain exists.
    """
    return strain in self._strains
"},{"location":"api/metabolomics/#nplinker.metabolomics.MolecularFamily.is_singleton","title":"is_singleton","text":"is_singleton() -> bool\n
Check if the molecular family contains only one spectrum.
Returns:

- bool – True when the molecular family has only one spectrum.

Source code in src/nplinker/metabolomics/molecular_family.py
def is_singleton(self) -> bool:
    """Check if the molecular family contains only one spectrum.

    Returns:
        True when the molecular family has only one spectrum.
    """
    return len(self.spectra_ids) == 1
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum","title":"Spectrum","text":"Spectrum(\n id: str,\n mz: list[float],\n intensity: list[float],\n precursor_mz: float,\n rt: float = 0,\n metadata: dict | None = None,\n)\n
Class to model MS/MS Spectrum.
Attributes:

- id – the spectrum ID.
- mz – the list of m/z values.
- intensity – the list of intensity values.
- precursor_mz – the m/z value of the precursor.
- rt – the retention time in seconds.
- metadata – the metadata of the spectrum, i.e. the header information in the MGF file.
- gnps_annotations (dict) – the GNPS annotations of the spectrum.
- gnps_id (str | None) – the GNPS ID of the spectrum.
- strains (StrainCollection) – the strains that this spectrum belongs to.
- family (MolecularFamily | None) – the molecular family that this spectrum belongs to.
- peaks (ndarray) – 2D array of peaks, each row is a peak of (m/z, intensity) values.

Parameters:

- id (str) – the spectrum ID.
- mz (list[float]) – the list of m/z values.
- intensity (list[float]) – the list of intensity values.
- precursor_mz (float) – the precursor m/z.
- rt (float, default: 0) – the retention time in seconds. Defaults to 0.
- metadata (dict | None, default: None) – the metadata of the spectrum, i.e. the header information in the MGF file.

Source code in src/nplinker/metabolomics/spectrum.py
def __init__(
    self,
    id: str,
    mz: list[float],
    intensity: list[float],
    precursor_mz: float,
    rt: float = 0,
    metadata: dict | None = None,
) -> None:
    """Initialize the Spectrum.

    Args:
        id: the spectrum ID.
        mz: the list of m/z values.
        intensity: the list of intensity values.
        precursor_mz: the precursor m/z.
        rt: the retention time in seconds. Defaults to 0.
        metadata: the metadata of the spectrum, i.e. the header information
            in the MGF file.
    """
    self.id = id
    self.mz = mz
    self.intensity = intensity
    self.precursor_mz = precursor_mz
    self.rt = rt
    self.metadata = metadata or {}

    self.gnps_annotations: dict = {}
    self.gnps_id: str | None = None
    self.strains: StrainCollection = StrainCollection()
    self.family: MolecularFamily | None = None
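A minimal construction sketch (all values are made up), also showing the peaks view over mz and intensity documented below:

import numpy as np
from nplinker.metabolomics import Spectrum

spec = Spectrum(
    id="42",
    mz=[100.0, 150.5, 200.1],
    intensity=[5.0, 1.2, 3.3],
    precursor_mz=250.0,
    rt=12.5,
    metadata={"SCANS": "42"},  # hypothetical MGF header entry
)

# `peaks` pairs m/z with intensity row-wise, e.g. peaks[0] -> [100.0, 5.0]
assert isinstance(spec.peaks, np.ndarray)
print(spec.peaks.shape)  # (3, 2)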
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.id","title":"id instance-attribute
","text":"id = id\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.mz","title":"mz instance-attribute
","text":"mz = mz\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.intensity","title":"intensity instance-attribute
","text":"intensity = intensity\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.precursor_mz","title":"precursor_mz instance-attribute
","text":"precursor_mz = precursor_mz\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.rt","title":"rt instance-attribute
","text":"rt = rt\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.metadata","title":"metadata instance-attribute
","text":"metadata = metadata or {}\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.gnps_annotations","title":"gnps_annotations instance-attribute
","text":"gnps_annotations: dict = {}\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.gnps_id","title":"gnps_id instance-attribute
","text":"gnps_id: str | None = None\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.strains","title":"strains instance-attribute
","text":"strains: StrainCollection = StrainCollection()\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.family","title":"family instance-attribute
","text":"family: MolecularFamily | None = None\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.peaks","title":"peaks cached
property
","text":"peaks: ndarray\n
Get the peaks, a 2D array with each row containing the values of (m/z, intensity).
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/metabolomics/spectrum.py
def __str__(self) -> str:\n return f\"Spectrum(id={self.id}, #strains={len(self.strains)})\"\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/metabolomics/spectrum.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/metabolomics/spectrum.py
def __eq__(self, other) -> bool:\n if isinstance(other, Spectrum):\n return self.id == other.id and self.precursor_mz == other.precursor_mz\n return NotImplemented\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__hash__","title":"__hash__","text":"__hash__() -> int\n
Source code in src/nplinker/metabolomics/spectrum.py
def __hash__(self) -> int:\n return hash((self.id, self.precursor_mz))\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.__reduce__","title":"__reduce__","text":"__reduce__() -> tuple\n
Reduce function for pickling.
Source code insrc/nplinker/metabolomics/spectrum.py
def __reduce__(self) -> tuple:\n \"\"\"Reduce function for pickling.\"\"\"\n return (\n self.__class__,\n (self.id, self.mz, self.intensity, self.precursor_mz, self.rt, self.metadata),\n self.__dict__,\n )\n
"},{"location":"api/metabolomics/#nplinker.metabolomics.Spectrum.has_strain","title":"has_strain","text":"has_strain(strain: Strain) -> bool\n
Check if the given strain exists in the spectrum.
Parameters:

- strain (Strain) – Strain object.

Returns:

- bool – True when the given strain exists in the spectrum.

Source code in src/nplinker/metabolomics/spectrum.py
def has_strain(self, strain: Strain) -> bool:
    """Check if the given strain exists in the spectrum.

    Args:
        strain: `Strain` object.

    Returns:
        True when the given strain exist in the spectrum.
    """
    return strain in self.strains
"},{"location":"api/metabolomics_abc/","title":"Abstract Base Classes","text":""},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc","title":"nplinker.metabolomics.abc","text":""},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.SpectrumLoaderBase","title":"SpectrumLoaderBase","text":" Bases: ABC
Abstract base class for SpectrumLoader.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.SpectrumLoaderBase.spectra","title":"spectraabstractmethod
property
","text":"spectra: list[Spectrum]\n
Get Spectrum objects.
Returns:
list[Spectrum]
\u2013 A sequence of Spectrum objects.
Bases: ABC
Abstract base class for MolecularFamilyLoader.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.MolecularFamilyLoaderBase.get_mfs","title":"get_mfsabstractmethod
","text":"get_mfs(keep_singleton: bool) -> list[MolecularFamily]\n
Get MolecularFamily objects.
Parameters:
keep_singleton
(bool
) \u2013 True to keep singleton molecular families. A singleton molecular family is a molecular family that contains only one spectrum.
Returns:
list[MolecularFamily]
\u2013 A sequence of MolecularFamily objects.
src/nplinker/metabolomics/abc.py
@abstractmethod
def get_mfs(self, keep_singleton: bool) -> list[MolecularFamily]:
    """Get MolecularFamily objects.

    Args:
        keep_singleton: True to keep singleton molecular families. A
            singleton molecular family is a molecular family that contains
            only one spectrum.

    Returns:
        A sequence of MolecularFamily objects.
    """
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.FileMappingLoaderBase","title":"FileMappingLoaderBase","text":" Bases: ABC
Abstract base class for FileMappingLoader.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.FileMappingLoaderBase.mappings","title":"mappingsabstractmethod
property
","text":"mappings: dict[str, list[str]]\n
Get file mappings.
Returns:
dict[str, list[str]]
\u2013 A mapping from spectrum ID to the names of files where the spectrum occurs.
Bases: ABC
Abstract base class for AnnotationLoader.
"},{"location":"api/metabolomics_abc/#nplinker.metabolomics.abc.AnnotationLoaderBase.annotations","title":"annotationsabstractmethod
property
","text":"annotations: dict[str, dict]\n
Get annotations.
Returns:
dict[str, dict]
\u2013 A mapping from spectrum ID to its annotations.
add_annotation_to_spectrum(\n annotations: Mapping[str, dict],\n spectra: Sequence[Spectrum],\n) -> None\n
Add annotations to the Spectrum.gnps_annotations attribute for input spectra.

It is possible that some spectra don't have annotations.

Note: The input spectra list is changed in place.
Parameters:

- annotations (Mapping[str, dict]) – A dictionary of GNPS annotations, where the keys are spectrum ids and the values are GNPS annotations.
- spectra (Sequence[Spectrum]) – A list of Spectrum objects.

Source code in src/nplinker/metabolomics/utils.py
def add_annotation_to_spectrum(
    annotations: Mapping[str, dict], spectra: Sequence[Spectrum]
) -> None:
    """Add annotations to the `Spectrum.gnps_annotations` attribute for input spectra.

    It is possible that some spectra don't have annotations.

    !!! note
        The input `spectra` list is changed in place.

    Args:
        annotations: A dictionary of GNPS annotations, where the keys are
            spectrum ids and the values are GNPS annotations.
        spectra: A list of Spectrum objects.
    """
    for spec in spectra:
        if spec.id in annotations:
            spec.gnps_annotations = annotations[spec.id]
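A small, self-contained sketch; the spectrum and annotation values are made up, borrowed from the annotation example above:

from nplinker.metabolomics import Spectrum
from nplinker.metabolomics.utils import add_annotation_to_spectrum

# Hypothetical spectra and annotations for illustration.
spectra = [Spectrum(id="100", mz=[100.0], intensity=[1.0], precursor_mz=276.0)]
annotations = {"100": {"Compound_Name": "Iobenguane sulfate", "MQScore": "0.704152"}}

add_annotation_to_spectrum(annotations, spectra)  # updates `spectra` in place
print(spectra[0].gnps_annotations["Compound_Name"])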
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.add_strains_to_spectrum","title":"add_strains_to_spectrum","text":"add_strains_to_spectrum(\n strains: StrainCollection, spectra: Sequence[Spectrum]\n) -> tuple[list[Spectrum], list[Spectrum]]\n
Add Strain objects to the Spectrum.strains attribute for input spectra.

Note: The input spectra list is changed in place.

Parameters:

- strains (StrainCollection) – A collection of strain objects.
- spectra (Sequence[Spectrum]) – A list of Spectrum objects.

Returns:

- tuple[list[Spectrum], list[Spectrum]] – A tuple of two lists of Spectrum objects: the first contains spectra that were updated with Strain objects; the second contains spectra that were not updated because no Strain objects were found.

Source code in src/nplinker/metabolomics/utils.py
def add_strains_to_spectrum(
    strains: StrainCollection, spectra: Sequence[Spectrum]
) -> tuple[list[Spectrum], list[Spectrum]]:
    """Add `Strain` objects to the `Spectrum.strains` attribute for input spectra.

    !!! note
        The input `spectra` list is changed in place.

    Args:
        strains: A collection of strain objects.
        spectra: A list of Spectrum objects.

    Returns:
        A tuple of two lists of Spectrum objects,

        - the first list contains Spectrum objects that are updated with Strain objects;
        - the second list contains Spectrum objects that are not updated with Strain objects
          because no Strain objects are found.
    """
    spectra_with_strains = []
    spectra_without_strains = []
    for spec in spectra:
        try:
            strain_list = strains.lookup(spec.id)
        except ValueError:
            spectra_without_strains.append(spec)
            continue

        for strain in strain_list:
            spec.strains.add(strain)
        spectra_with_strains.append(spec)

    logger.info(
        f"{len(spectra_with_strains)} Spectrum objects updated with Strain objects.\n"
        f"{len(spectra_without_strains)} Spectrum objects not updated with Strain objects."
    )

    return spectra_with_strains, spectra_without_strains
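A hedged sketch of typical use; it assumes Strain and StrainCollection are importable from nplinker.strain and that a strain alias matching the spectrum id has been registered (this is what the strain mappings file provides):

from nplinker.metabolomics import Spectrum
from nplinker.metabolomics.utils import add_strains_to_spectrum
from nplinker.strain import Strain, StrainCollection  # assumed import path

# Hypothetical data: the strain alias must match the spectrum id for
# `strains.lookup(spec.id)` to succeed.
strain = Strain("Streptomyces sp. 26c")
strain.add_alias("1")  # assumed alias API
strains = StrainCollection()
strains.add(strain)

spectra = [Spectrum(id="1", mz=[100.0], intensity=[1.0], precursor_mz=250.0)]
with_strains, without_strains = add_strains_to_spectrum(strains, spectra)
print(len(with_strains), len(without_strains))  # 1 0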
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.add_spectrum_to_mf","title":"add_spectrum_to_mf","text":"add_spectrum_to_mf(\n spectra: Sequence[Spectrum],\n mfs: Sequence[MolecularFamily],\n) -> tuple[\n list[MolecularFamily],\n list[MolecularFamily],\n dict[MolecularFamily, set[str]],\n]\n
Add Spectrum objects to MolecularFamily objects.
The attribute MolecularFamily.spectra_ids contains the ids of Spectrum objects. These ids are used to find Spectrum objects from the input spectra list. The found Spectrum objects are added to the MolecularFamily.spectra attribute.

It is possible that some spectrum ids are not found in the input spectra list, and so their Spectrum objects are missing in the MolecularFamily object.

Note: The input mfs list is changed in place.
Parameters:

- spectra (Sequence[Spectrum]) – A list of Spectrum objects.
- mfs (Sequence[MolecularFamily]) – A list of MolecularFamily objects.

Returns:

- tuple[list[MolecularFamily], list[MolecularFamily], dict[MolecularFamily, set[str]]] – A tuple of three elements:
  - the first list contains MolecularFamily objects that are updated with Spectrum objects;
  - the second list contains MolecularFamily objects that are not updated with Spectrum objects (all Spectrum objects are missing);
  - the third is a dictionary containing MolecularFamily objects as keys and a set of ids of missing Spectrum objects as values.

Source code in src/nplinker/metabolomics/utils.py
def add_spectrum_to_mf(
    spectra: Sequence[Spectrum], mfs: Sequence[MolecularFamily]
) -> tuple[list[MolecularFamily], list[MolecularFamily], dict[MolecularFamily, set[str]]]:
    """Add Spectrum objects to MolecularFamily objects.

    The attribute `MolecularFamily.spectra_ids` contains the ids of `Spectrum` objects.
    These ids are used to find `Spectrum` objects from the input `spectra` list. The found `Spectrum`
    objects are added to the `MolecularFamily.spectra` attribute.

    It is possible that some spectrum ids are not found in the input `spectra` list, and so their
    `Spectrum` objects are missing in the `MolecularFamily` object.

    !!! note
        The input `mfs` list is changed in place.

    Args:
        spectra: A list of Spectrum objects.
        mfs: A list of MolecularFamily objects.

    Returns:
        A tuple of three elements,

        - the first list contains `MolecularFamily` objects that are updated with `Spectrum` objects
        - the second list contains `MolecularFamily` objects that are not updated with `Spectrum`
          objects (all `Spectrum` objects are missing).
        - the third is a dictionary containing `MolecularFamily` objects as keys and a set of ids
          of missing `Spectrum` objects as values.
    """
    spec_dict = {spec.id: spec for spec in spectra}
    mf_with_spec = []
    mf_without_spec = []
    mf_missing_spec: dict[MolecularFamily, set[str]] = {}
    for mf in mfs:
        for spec_id in mf.spectra_ids:
            try:
                spec = spec_dict[spec_id]
            except KeyError:
                if mf not in mf_missing_spec:
                    mf_missing_spec[mf] = {spec_id}
                else:
                    mf_missing_spec[mf].add(spec_id)
                continue
            mf.add_spectrum(spec)

        if mf.spectra:
            mf_with_spec.append(mf)
        else:
            mf_without_spec.append(mf)

    logger.info(
        f"{len(mf_with_spec)} MolecularFamily objects updated with Spectrum objects.\n"
        f"{len(mf_without_spec)} MolecularFamily objects not updated with Spectrum objects.\n"
        f"{len(mf_missing_spec)} MolecularFamily objects have missing Spectrum objects."
    )
    return mf_with_spec, mf_without_spec, mf_missing_spec
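A small sketch (ids are made up) showing how missing spectrum ids are reported:

from nplinker.metabolomics import MolecularFamily, Spectrum
from nplinker.metabolomics.utils import add_spectrum_to_mf

# Hypothetical family referencing two spectrum ids, only one of which
# is present in the input spectra list.
mf = MolecularFamily("mf_1")
mf.spectra_ids = {"1", "2"}
spectra = [Spectrum(id="1", mz=[100.0], intensity=[1.0], precursor_mz=250.0)]

mf_with_spec, mf_without_spec, mf_missing_spec = add_spectrum_to_mf(spectra, [mf])
print(mf_missing_spec[mf])  # {'2'}: the spectrum id that could not be resolved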
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.extract_mappings_strain_id_ms_filename","title":"extract_mappings_strain_id_ms_filename","text":"extract_mappings_strain_id_ms_filename(\n podp_project_json_file: str | PathLike,\n) -> dict[str, set[str]]\n
Extract mappings \"strain_id <-> MS_filename\".
Parameters:

- podp_project_json_file (str | PathLike) – The path to the PODP project JSON file.

Returns:

- dict[str, set[str]] – Key is strain id and value is a set of MS filenames.

Notes: The podp_project_json_file is the project JSON file downloaded from the PODP platform. For example, for project MSV000079284, its json file is https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.

Source code in src/nplinker/metabolomics/utils.py
def extract_mappings_strain_id_ms_filename(
    podp_project_json_file: str | PathLike,
) -> dict[str, set[str]]:
    """Extract mappings "strain_id <-> MS_filename".

    Args:
        podp_project_json_file: The path to the PODP project JSON file.

    Returns:
        Key is strain id and value is a set of MS filenames.

    Notes:
        The `podp_project_json_file` is the project JSON file downloaded from
        PODP platform. For example, for project MSV000079284, its json file is
        https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.

    See Also:
        - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:
          Generate strain mappings JSON file for PODP pipeline.
    """
    mappings_dict: dict[str, set[str]] = {}
    with open(podp_project_json_file, "r") as f:
        json_data = json.load(f)

    validate_podp_json(json_data)

    # Extract mappings strain id <-> metabolomics filename
    for record in json_data["genome_metabolome_links"]:
        strain_id = record["genome_label"]
        # get the actual filename of the mzXML URL
        filename = Path(record["metabolomics_file"]).name
        if strain_id in mappings_dict:
            mappings_dict[strain_id].add(filename)
        else:
            mappings_dict[strain_id] = {filename}
    return mappings_dict
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.extract_mappings_ms_filename_spectrum_id","title":"extract_mappings_ms_filename_spectrum_id","text":"extract_mappings_ms_filename_spectrum_id(\n gnps_file_mappings_file: str | PathLike,\n) -> dict[str, set[str]]\n
Extract mappings \"MS_filename <-> spectrum_id\".
Parameters:

- gnps_file_mappings_file (str | PathLike) – The path to the GNPS file mappings file (csv or tsv).

Returns:

- dict[str, set[str]] – Key is MS filename and value is a set of spectrum ids.

Notes: The gnps_file_mappings_file is downloaded from the GNPS website and named as GNPS_FILE_MAPPINGS_TSV or GNPS_FILE_MAPPINGS_CSV. For more details, see GNPS data.

Source code in src/nplinker/metabolomics/utils.py
def extract_mappings_ms_filename_spectrum_id(
    gnps_file_mappings_file: str | PathLike,
) -> dict[str, set[str]]:
    """Extract mappings "MS_filename <-> spectrum_id".

    Args:
        gnps_file_mappings_file: The path to the GNPS file mappings file (csv or tsv).

    Returns:
        Key is MS filename and value is a set of spectrum ids.

    Notes:
        The `gnps_file_mappings_file` is downloaded from GNPS website and named as
        [GNPS_FILE_MAPPINGS_TSV][nplinker.defaults.GNPS_FILE_MAPPINGS_TSV] or
        [GNPS_FILE_MAPPINGS_CSV][nplinker.defaults.GNPS_FILE_MAPPINGS_CSV].
        For more details, see [GNPS data][gnps-data].

    See Also:
        - [GNPSFileMappingLoader][nplinker.metabolomics.gnps.gnps_file_mapping_loader.GNPSFileMappingLoader]:
          Load GNPS file mappings file.
        - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:
          Generate strain mappings JSON file for PODP pipeline.
    """
    loader = GNPSFileMappingLoader(gnps_file_mappings_file)
    return loader.mapping_reversed
"},{"location":"api/metabolomics_utils/#nplinker.metabolomics.utils.get_mappings_strain_id_spectrum_id","title":"get_mappings_strain_id_spectrum_id","text":"get_mappings_strain_id_spectrum_id(\n mappings_strain_id_ms_filename: Mapping[str, set[str]],\n mappings_ms_filename_spectrum_id: Mapping[\n str, set[str]\n ],\n) -> dict[str, set[str]]\n
Get mappings \"strain_id <-> spectrum_id\".
Parameters:

- mappings_strain_id_ms_filename (Mapping[str, set[str]]) – Mappings "strain_id <-> MS_filename".
- mappings_ms_filename_spectrum_id (Mapping[str, set[str]]) – Mappings "MS_filename <-> spectrum_id".

Returns:

- dict[str, set[str]] – Key is strain id and value is a set of spectrum ids.

See Also:

- extract_mappings_strain_id_ms_filename: Extract mappings "strain_id <-> MS_filename".
- extract_mappings_ms_filename_spectrum_id: Extract mappings "MS_filename <-> spectrum_id".

Source code in src/nplinker/metabolomics/utils.py
def get_mappings_strain_id_spectrum_id(
    mappings_strain_id_ms_filename: Mapping[str, set[str]],
    mappings_ms_filename_spectrum_id: Mapping[str, set[str]],
) -> dict[str, set[str]]:
    """Get mappings "strain_id <-> spectrum_id".

    Args:
        mappings_strain_id_ms_filename: Mappings
            "strain_id <-> MS_filename".
        mappings_ms_filename_spectrum_id: Mappings
            "MS_filename <-> spectrum_id".

    Returns:
        Key is strain id and value is a set of spectrum ids.

    See Also:
        - `extract_mappings_strain_id_ms_filename`: Extract mappings "strain_id <-> MS_filename".
        - `extract_mappings_ms_filename_spectrum_id`: Extract mappings "MS_filename <-> spectrum_id".
        - [podp_generate_strain_mappings][nplinker.strain.utils.podp_generate_strain_mappings]:
          Generate strain mappings JSON file for PODP pipeline.
    """
    mappings_dict = {}
    for strain_id, ms_filenames in mappings_strain_id_ms_filename.items():
        spectrum_ids = set()
        for ms_filename in ms_filenames:
            if (sid := mappings_ms_filename_spectrum_id.get(ms_filename)) is not None:
                spectrum_ids.update(sid)
        if spectrum_ids:
            mappings_dict[strain_id] = spectrum_ids
    return mappings_dict
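Taken together, the three helpers form a small pipeline from a PODP project JSON file and a GNPS file mappings file to "strain_id <-> spectrum_id" mappings; a sketch with hypothetical input paths:

from nplinker.metabolomics.utils import (
    extract_mappings_ms_filename_spectrum_id,
    extract_mappings_strain_id_ms_filename,
    get_mappings_strain_id_spectrum_id,
)

# Hypothetical input files.
strain_to_files = extract_mappings_strain_id_ms_filename("podp_project.json")
file_to_spectra = extract_mappings_ms_filename_spectrum_id("gnps_file_mappings.tsv")

strain_to_spectra = get_mappings_strain_id_spectrum_id(strain_to_files, file_to_spectra)
for strain_id, spectrum_ids in strain_to_spectra.items():
    print(strain_id, sorted(spectrum_ids)[:5])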
"},{"location":"api/mibig/","title":"MiBIG","text":""},{"location":"api/mibig/#nplinker.genomics.mibig","title":"nplinker.genomics.mibig","text":""},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader","title":"MibigLoader","text":"MibigLoader(data_dir: str | PathLike)\n
Bases: BGCLoaderBase
Parse MIBiG metadata files and return BGC objects.
MIBiG metadata file (json) contains annotations/metadata information for each BGC. See https://mibig.secondarymetabolites.org/download.
The MiBIG accession is used as BGC id and strain name. The loaded BGC objects have a Strain object as their strain attribute (i.e. BGC.strain).

Parameters:

- data_dir (str | PathLike) – Path to the directory of MIBiG metadata json files.

Examples:

>>> loader = MibigLoader("path/to/mibig/data/dir")
>>> loader.data_dir
'path/to/mibig/data/dir'
>>> loader.get_bgcs()
[BGC('BGC000001', 'NRP'), BGC('BGC000002', 'Polyketide')]

Source code in src/nplinker/genomics/mibig/mibig_loader.py
def __init__(self, data_dir: str | PathLike):
    """Initialize the MIBiG metadata loader.

    Args:
        data_dir: Path to the directory of MIBiG metadata json files

    Examples:
        >>> loader = MibigLoader("path/to/mibig/data/dir")
        >>> loader.data_dir
        'path/to/mibig/data/dir'
        >>> loader.get_bgcs()
        [BGC('BGC000001', 'NRP'), BGC('BGC000002', 'Polyketide')]
    """
    self.data_dir = str(data_dir)
    self._file_dict = self.parse_data_dir(self.data_dir)
    self._metadata_dict = self._parse_metadata()
    self._bgcs = self._parse_bgcs()
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.data_dir","title":"data_dir instance-attribute
","text":"data_dir = str(data_dir)\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.get_files","title":"get_files","text":"get_files() -> dict[str, str]\n
Get the path of all MIBiG metadata json files.
Returns:

- dict[str, str] – The key is the metadata file name (BGC accession), and the value is the path to the metadata json file.

Source code in src/nplinker/genomics/mibig/mibig_loader.py
def get_files(self) -> dict[str, str]:
    """Get the path of all MIBiG metadata json files.

    Returns:
        The key is metadata file name (BGC accession), and the value is path to the metadata
        json file
    """
    return self._file_dict
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.parse_data_dir","title":"parse_data_dir staticmethod
","text":"parse_data_dir(data_dir: str | PathLike) -> dict[str, str]\n
Parse metadata directory and return paths to all metadata json files.
Parameters:

- data_dir (str | PathLike) – path to the directory of MIBiG metadata json files.

Returns:

- dict[str, str] – The key is the metadata file name (BGC accession), and the value is the path to the metadata json file.

Source code in src/nplinker/genomics/mibig/mibig_loader.py
@staticmethod
def parse_data_dir(data_dir: str | PathLike) -> dict[str, str]:
    """Parse metadata directory and return paths to all metadata json files.

    Args:
        data_dir: path to the directory of MIBiG metadata json files

    Returns:
        The key is metadata file name (BGC accession), and the value is path to the metadata
        json file
    """
    file_dict = {}
    json_files = list_files(data_dir, prefix="BGC", suffix=".json")
    for file in json_files:
        fname = Path(file).stem
        file_dict[fname] = file
    return file_dict
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.get_metadata","title":"get_metadata","text":"get_metadata() -> dict[str, MibigMetadata]\n
Get MibigMetadata objects.
Returns:

- dict[str, MibigMetadata] – The key is BGC accession (file name) and the value is MibigMetadata object.

Source code in src/nplinker/genomics/mibig/mibig_loader.py
def get_metadata(self) -> dict[str, MibigMetadata]:
    """Get MibigMetadata objects.

    Returns:
        The key is BGC accession (file name) and the value is MibigMetadata object
    """
    return self._metadata_dict
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigLoader.get_bgcs","title":"get_bgcs","text":"get_bgcs() -> list[BGC]\n
Get BGC objects.
The BGC objects use MiBIG accession as id and have a Strain object as their strain attribute (i.e. BGC.strain), where the name of the Strain object is also the MiBIG accession.

Returns:

- list[BGC] – A list of BGC objects.

Source code in src/nplinker/genomics/mibig/mibig_loader.py
def get_bgcs(self) -> list[BGC]:
    """Get BGC objects.

    The BGC objects use MiBIG accession as id and have Strain object as
    their strain attribute (i.e. `BGC.strain`), where the name of the Strain
    object is also MiBIG accession.

    Returns:
        A list of BGC objects
    """
    return self._bgcs
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata","title":"MibigMetadata","text":"MibigMetadata(file: str | PathLike)\n
Class to model the BGC metadata/annotations defined in MIBiG.
MIBiG is a specification of BGC metadata and uses a JSON schema to represent BGC metadata. For more details, see https://mibig.secondarymetabolites.org/download.

Parameters:

- file (str | PathLike) – Path to the json file of MIBiG BGC metadata.

Examples:

>>> metadata = MibigMetadata("/data/BGC0000001.json")
Source code in src/nplinker/genomics/mibig/mibig_metadata.py
def __init__(self, file: str | PathLike) -> None:
    """Initialize the MIBiG metadata object.

    Args:
        file: Path to the json file of MIBiG BGC metadata

    Examples:
        >>> metadata = MibigMetadata("/data/BGC0000001.json")
    """
    self.file = str(file)
    with open(self.file, "rb") as f:
        self.metadata = json.load(f)

    self._mibig_accession: str
    self._biosyn_class: tuple[str]
    self._parse_metadata()
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata.file","title":"file instance-attribute
","text":"file = str(file)\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata.metadata","title":"metadata instance-attribute
","text":"metadata = load(f)\n
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata.mibig_accession","title":"mibig_accession property
","text":"mibig_accession: str\n
Get the value of metadata item 'mibig_accession'.
"},{"location":"api/mibig/#nplinker.genomics.mibig.MibigMetadata.biosyn_class","title":"biosyn_classproperty
","text":"biosyn_class: tuple[str]\n
Get the value of metadata item 'biosyn_class'.
The 'biosyn_class' is the biosynthetic class(es), namely the type of natural product or secondary metabolite.

MIBiG defines 6 major biosynthetic classes for natural products: NRP, Polyketide, RiPP, Terpene, Saccharide and Alkaloid. Note that natural products created by other biosynthetic mechanisms fall under the category Other. For more details see the paper.
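A short usage sketch (the file path is hypothetical):

from nplinker.genomics.mibig import MibigMetadata

metadata = MibigMetadata("/data/BGC0000001.json")  # hypothetical path
print(metadata.mibig_accession)  # e.g. 'BGC0000001'
print(metadata.biosyn_class)     # e.g. ('NRP',)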
download_and_extract_mibig_metadata

download_and_extract_mibig_metadata(
    download_root: str | PathLike,
    extract_path: str | PathLike,
    version: str = "3.1",
)
Download and extract MIBiG metadata json files.
Note that it does not matter whether the metadata json files are in nested folders or not in the archive; all json files will be extracted to the same location, i.e. extract_path. The nested folders will be removed if they exist, so the extract_path will contain only json files.
Parameters:

- download_root (str | PathLike) – Path to the directory in which to place the downloaded archive.
- extract_path (str | PathLike) – Path to an empty directory where the json files will be extracted. The directory must be empty if it exists. If it doesn't exist, the directory will be created.
- version (str, default: '3.1') – The version of the MIBiG metadata to download. Defaults to "3.1".
Examples:
>>> download_and_extract_mibig_metadata("/data/download", "/data/mibig_metadata")
Source code in src/nplinker/genomics/mibig/mibig_downloader.py
def download_and_extract_mibig_metadata(
    download_root: str | os.PathLike,
    extract_path: str | os.PathLike,
    version: str = "3.1",
):
    """Download and extract MIBiG metadata json files.

    Note that it does not matter whether the metadata json files are in nested folders or not in the archive,
    all json files will be extracted to the same location, i.e. `extract_path`. The nested
    folders will be removed if they exist. So the `extract_path` will have only json files.

    Args:
        download_root: Path to the directory in which to place the downloaded archive.
        extract_path: Path to an empty directory where the json files will be extracted.
            The directory must be empty if it exists. If it doesn't exist, the directory will be created.
        version: _description_. Defaults to "3.1".

    Examples:
        >>> download_and_extract_mibig_metadata("/data/download", "/data/mibig_metadata")
    """
    download_root = Path(download_root)
    extract_path = Path(extract_path)

    if download_root == extract_path:
        raise ValueError("Identical path of download directory and extract directory")

    # check if extract_path is empty
    if not extract_path.exists():
        extract_path.mkdir(parents=True)
    else:
        if len(list(extract_path.iterdir())) != 0:
            raise ValueError(f'Nonempty directory: "{extract_path}"')

    # download and extract
    md5 = _MD5_MIBIG_METADATA[version]
    download_and_extract_archive(
        url=MIBIG_METADATA_URL.format(version=version),
        download_root=download_root,
        extract_root=extract_path,
        md5=md5,
    )

    # After extracting mibig archive, it's either one dir or many json files,
    # if it's a dir, then move all json files from it to extract_path
    subdirs = list_dirs(extract_path)
    if len(subdirs) > 1:
        raise ValueError(f"Expected one extracted directory, got {len(subdirs)}")

    if len(subdirs) == 1:
        subdir_path = subdirs[0]
        for fname in list_files(subdir_path, prefix="BGC", suffix=".json", keep_parent=False):
            shutil.move(os.path.join(subdir_path, fname), os.path.join(extract_path, fname))
        # delete subdir
        if subdir_path != extract_path:
            shutil.rmtree(subdir_path)
"},{"location":"api/mibig/#nplinker.genomics.mibig.parse_bgc_metadata_json","title":"parse_bgc_metadata_json","text":"parse_bgc_metadata_json(file: str | PathLike) -> BGC\n
Parse MIBiG metadata file and return BGC object.
Note that the MiBIG accession is used as the BGC id and strain name. The BGC object has a Strain object as its strain attribute.
Parameters:

- file (str | PathLike) – Path to the MIBiG metadata json file.

Returns:

- BGC – BGC object.

Source code in src/nplinker/genomics/mibig/mibig_loader.py
def parse_bgc_metadata_json(file: str | PathLike) -> BGC:
    """Parse MIBiG metadata file and return BGC object.

    Note that the MiBIG accession is used as the BGC id and strain name. The BGC
    object has Strain object as its strain attribute.

    Args:
        file: Path to the MIBiG metadata json file

    Returns:
        BGC object
    """
    metadata = MibigMetadata(str(file))
    mibig_bgc = BGC(metadata.mibig_accession, *metadata.biosyn_class)
    mibig_bgc.mibig_bgc_class = metadata.biosyn_class
    mibig_bgc.strain = Strain(metadata.mibig_accession)
    return mibig_bgc
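A short usage sketch (the file path is hypothetical):

from nplinker.genomics.mibig import parse_bgc_metadata_json

bgc = parse_bgc_metadata_json("/data/BGC0000001.json")  # hypothetical path
print(bgc.id, bgc.mibig_bgc_class)  # accession and biosynthetic class(es)
print(bgc.strain)                   # Strain named after the accession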
"},{"location":"api/nplinker/","title":"NPLinker","text":""},{"location":"api/nplinker/#nplinker","title":"nplinker","text":""},{"location":"api/nplinker/#nplinker.NPLinker","title":"NPLinker","text":"NPLinker(config_file: str | PathLike)\n
The central class of the NPLinker application.

Attributes:

- config (Dynaconf) – The configuration object for the current NPLinker application.
- root_dir (str) – The path to the root directory of the current NPLinker application.
- output_dir (str) – The path to the output directory of the current NPLinker application.
- bgcs (list[BGC]) – A list of all BGC objects.
- gcfs (list[GCF]) – A list of all GCF objects.
- spectra (list[Spectrum]) – A list of all Spectrum objects.
- mfs (list[MolecularFamily]) – A list of all MolecularFamily objects.
- mibig_bgcs (list[BGC]) – A list of all MiBIG BGC objects.
- strains (StrainCollection) – A StrainCollection object containing all Strain objects.
- product_types (list[str]) – A list of all BiGSCAPE product types.
- scoring_methods (list[str]) – A list of all valid scoring methods.

Parameters:

- config_file (str | PathLike) – Path to the configuration file to use.
Examples:
Starting the NPLinker application:
>>> from nplinker import NPLinker
>>> npl = NPLinker("path/to/config.toml")

Loading data from files to python objects:

>>> npl.load_data()

Checking the number of GCF objects:

>>> len(npl.gcfs)

Getting the links for all GCF objects using the Metcalf scoring method, and the result is stored in a LinkGraph object:

>>> lg = npl.get_links(npl.gcfs, "metcalf")

Getting the link data between two objects:

>>> link_data = lg.get_link_data(npl.gcfs[0], npl.spectra[0])
{"metcalf": Score("metcalf", 1.0, {"cutoff": 0, "standardised": False})}

Saving the data to a pickle file:

>>> npl.save_data("path/to/output.pkl", lg)
Source code in src/nplinker/nplinker.py
def __init__(self, config_file: str | PathLike):
    """Initialise an NPLinker instance.

    Args:
        config_file: Path to the configuration file to use.

    Examples:
        Starting the NPLinker application:
        >>> from nplinker import NPLinker
        >>> npl = NPLinker("path/to/config.toml")

        Loading data from files to python objects:
        >>> npl.load_data()

        Checking the number of GCF objects:
        >>> len(npl.gcfs)

        Getting the links for all GCF objects using the Metcalf scoring method, and the result
        is stored in a [LinkGraph][nplinker.scoring.LinkGraph] object:
        >>> lg = npl.get_links(npl.gcfs, "metcalf")

        Getting the link data between two objects:
        >>> link_data = lg.get_link_data(npl.gcfs[0], npl.spectra[0])
        {"metcalf": Score("metcalf", 1.0, {"cutoff": 0, "standardised": False})}

        Saving the data to a pickle file:
        >>> npl.save_data("path/to/output.pkl", lg)
    """
    # Load the configuration file
    self.config: Dynaconf = load_config(config_file)

    # Setup logging for the application
    setup_logging(
        level=self.config.log.level,
        file=self.config.log.get("file", ""),
        use_console=self.config.log.use_console,
    )
    logger.info(
        "Configuration:\n %s", pformat(self.config.as_dict(), width=20, sort_dicts=False)
    )

    # Setup the output directory
    self._output_dir = self.config.root_dir / OUTPUT_DIRNAME
    self._output_dir.mkdir(exist_ok=True)

    # Initialise data containers that will be populated by the `load_data` method
    self._bgc_dict: dict[str, BGC] = {}
    self._gcf_dict: dict[str, GCF] = {}
    self._spec_dict: dict[str, Spectrum] = {}
    self._mf_dict: dict[str, MolecularFamily] = {}
    self._mibig_bgcs: list[BGC] = []
    self._strains: StrainCollection = StrainCollection()
    self._product_types: list = []
    self._chem_classes = None  # TODO: to be refactored
    self._class_matches = None  # TODO: to be refactored

    # Flags to keep track of whether the scoring methods have been set up
    self._scoring_methods_setup_done = {name: False for name in self._valid_scoring_methods}
"},{"location":"api/nplinker/#nplinker.NPLinker.config","title":"config instance-attribute
","text":"config: Dynaconf = load_config(config_file)\n
"},{"location":"api/nplinker/#nplinker.NPLinker.root_dir","title":"root_dir property
","text":"root_dir: str\n
Get the path to the root directory of the current NPLinker instance.
"},{"location":"api/nplinker/#nplinker.NPLinker.output_dir","title":"output_dirproperty
","text":"output_dir: str\n
Get the path to the output directory of the current NPLinker instance.
"},{"location":"api/nplinker/#nplinker.NPLinker.bgcs","title":"bgcsproperty
","text":"bgcs: list[BGC]\n
Get all BGC objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.gcfs","title":"gcfsproperty
","text":"gcfs: list[GCF]\n
Get all GCF objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.spectra","title":"spectraproperty
","text":"spectra: list[Spectrum]\n
Get all Spectrum objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.mfs","title":"mfsproperty
","text":"mfs: list[MolecularFamily]\n
Get all MolecularFamily objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.mibig_bgcs","title":"mibig_bgcsproperty
","text":"mibig_bgcs: list[BGC]\n
Get all MiBIG BGC objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.strains","title":"strainsproperty
","text":"strains: StrainCollection\n
Get all Strain objects.
"},{"location":"api/nplinker/#nplinker.NPLinker.product_types","title":"product_typesproperty
","text":"product_types: list[str]\n
Get all BiGSCAPE product types.
"},{"location":"api/nplinker/#nplinker.NPLinker.chem_classes","title":"chem_classesproperty
","text":"chem_classes\n
Returns loaded ChemClassPredictions with the class predictions.
"},{"location":"api/nplinker/#nplinker.NPLinker.class_matches","title":"class_matchesproperty
","text":"class_matches\n
ClassMatches with the matched classes and scoring tables from MIBiG.
"},{"location":"api/nplinker/#nplinker.NPLinker.scoring_methods","title":"scoring_methodsproperty
","text":"scoring_methods: list[str]\n
Get names of all valid scoring methods.
"},{"location":"api/nplinker/#nplinker.NPLinker.load_data","title":"load_data","text":"load_data()\n
Load all data from files into memory.
This method is a convenience function that calls the DatasetArranger class to arrange data files (download, generate and/or validate data) in the correct directory structure, and then calls the DatasetLoader class to load all data from the files into memory.
The loaded data is stored in various data containers for easy access, e.g. self.bgcs for all BGC objects, self.strains for all Strain objects, etc.
Source code in src/nplinker/nplinker.py
def load_data(self):
    """Load all data from files into memory."""
    arranger = DatasetArranger(self.config)
    arranger.arrange()
    loader = DatasetLoader(self.config)
    loader.load()

    self._bgc_dict = {bgc.id: bgc for bgc in loader.bgcs}
    self._gcf_dict = {gcf.id: gcf for gcf in loader.gcfs}
    self._spec_dict = {spec.id: spec for spec in loader.spectra}
    self._mf_dict = {mf.id: mf for mf in loader.mfs}

    self._mibig_bgcs = loader.mibig_bgcs
    self._strains = loader.strains
    self._product_types = loader.product_types
    self._chem_classes = loader.chem_classes
    self._class_matches = loader.class_matches
"},{"location":"api/nplinker/#nplinker.NPLinker.get_links","title":"get_links","text":"get_links(\n objects: (\n Sequence[BGC]\n | Sequence[GCF]\n | Sequence[Spectrum]\n | Sequence[MolecularFamily]\n ),\n scoring_method: str,\n **scoring_params: Any\n) -> LinkGraph\n
Get links for the given objects using the specified scoring method and parameters.
Parameters:
objects (Sequence[BGC] | Sequence[GCF] | Sequence[Spectrum] | Sequence[MolecularFamily]) – A sequence of objects to get links for. The objects must be of the same type, i.e. BGC, GCF, Spectrum or MolecularFamily type.
Warning: For scoring method metcalf, the BGC objects are not supported.
scoring_method (str) – The scoring method to use. Must be one of the valid scoring methods self.scoring_methods, such as metcalf.
scoring_params (Any, default: {}) – Parameters to pass to the scoring method. If not given, the default parameters of the specified scoring method will be used. Check the get_links method of the scoring method class for the available parameters and their default values.

| Scoring Method | Scoring Parameters   |
| -------------- | -------------------- |
| metcalf        | cutoff, standardised |
Returns:
LinkGraph – A LinkGraph object containing the links for the given objects.
Raises:
ValueError – If input objects are empty or if the scoring method is invalid.
TypeError – If the input objects are not of the same type or if the object type is invalid.
Examples:
Using default scoring parameters:
>>> lg = npl.get_links(npl.gcfs, "metcalf")
Scoring parameters provided:
>>> lg = npl.get_links(npl.gcfs, "metcalf", cutoff=0.5, standardised=True)
Source code in src/nplinker/nplinker.py
def get_links(
    self,
    objects: Sequence[BGC] | Sequence[GCF] | Sequence[Spectrum] | Sequence[MolecularFamily],
    scoring_method: str,
    **scoring_params: Any,
) -> LinkGraph:
    """Get links for the given objects using the specified scoring method and parameters."""
    # Validate objects
    if len(objects) == 0:
        raise ValueError("No objects provided to get links for")
    # check if all objects are of the same type
    types = {type(i) for i in objects}
    if len(types) > 1:
        raise TypeError("Input objects must be of the same type.")
    # check if the object type is valid
    obj_type = next(iter(types))
    if obj_type not in (BGC, GCF, Spectrum, MolecularFamily):
        raise TypeError(
            f"Invalid type {obj_type}. Input objects must be BGC, GCF, Spectrum or MolecularFamily objects."
        )

    # Validate scoring method
    if scoring_method not in self._valid_scoring_methods:
        raise ValueError(f"Invalid scoring method {scoring_method}.")

    # Check if the scoring method has been set up
    if not self._scoring_methods_setup_done[scoring_method]:
        self._valid_scoring_methods[scoring_method].setup(self)
        self._scoring_methods_setup_done[scoring_method] = True

    # Initialise the scoring method
    scoring = self._valid_scoring_methods[scoring_method]()

    return scoring.get_links(*objects, **scoring_params)
"},{"location":"api/nplinker/#nplinker.NPLinker.lookup_bgc","title":"lookup_bgc","text":"lookup_bgc(id: str) -> BGC | None\n
Get the BGC object with the given ID.
Parameters:
id (str) – the ID of the BGC to look up.
Returns:
BGC | None – The BGC object with the given ID, or None if no such object exists.
Examples:
>>> bgc = npl.lookup_bgc("BGC000001")
>>> bgc
BGC(id="BGC000001", ...)
Source code in src/nplinker/nplinker.py
def lookup_bgc(self, id: str) -> BGC | None:
    """Get the BGC object with the given ID."""
    return self._bgc_dict.get(id, None)
"},{"location":"api/nplinker/#nplinker.NPLinker.lookup_gcf","title":"lookup_gcf","text":"lookup_gcf(id: str) -> GCF | None\n
Get the GCF object with the given ID.
Parameters:
id (str) – the ID of the GCF to look up.
Returns:
GCF | None – The GCF object with the given ID, or None if no such object exists.
Source code in src/nplinker/nplinker.py
def lookup_gcf(self, id: str) -> GCF | None:
    """Get the GCF object with the given ID."""
    return self._gcf_dict.get(id, None)
"},{"location":"api/nplinker/#nplinker.NPLinker.lookup_spectrum","title":"lookup_spectrum","text":"lookup_spectrum(id: str) -> Spectrum | None\n
Get the Spectrum object with the given ID.
Parameters:
id (str) – the ID of the Spectrum to look up.
Returns:
Spectrum | None – The Spectrum object with the given ID, or None if no such object exists.
Source code in src/nplinker/nplinker.py
def lookup_spectrum(self, id: str) -> Spectrum | None:
    """Get the Spectrum object with the given ID."""
    return self._spec_dict.get(id, None)
"},{"location":"api/nplinker/#nplinker.NPLinker.lookup_mf","title":"lookup_mf","text":"lookup_mf(id: str) -> MolecularFamily | None\n
Get the MolecularFamily object with the given ID.
Parameters:
id (str) – the ID of the MolecularFamily to look up.
Returns:
MolecularFamily | None – The MolecularFamily object with the given ID, or None if no such object exists.
Source code in src/nplinker/nplinker.py
def lookup_mf(self, id: str) -> MolecularFamily | None:
    """Get the MolecularFamily object with the given ID."""
    return self._mf_dict.get(id, None)
"},{"location":"api/nplinker/#nplinker.NPLinker.save_data","title":"save_data","text":"save_data(\n file: str | PathLike, links: LinkGraph | None = None\n) -> None\n
Pickle data to a file.
The pickled data is a tuple of BGCs, GCFs, Spectra, MolecularFamilies, StrainCollection and links, i.e. (bgcs, gcfs, spectra, mfs, strains, links).
Parameters:
file (str | PathLike) – The path to the pickle file to save the data to.
links (LinkGraph | None, default: None) – The LinkGraph object to save.
Examples:
Saving the data to a pickle file, links data is None:
>>> npl.save_data("path/to/output.pkl")
Also saving the links data:
>>> lg = npl.get_links(npl.gcfs, "metcalf")
>>> npl.save_data("path/to/output.pkl", lg)
Source code in src/nplinker/nplinker.py
def save_data(
    self,
    file: str | PathLike,
    links: LinkGraph | None = None,
) -> None:
    """Pickle data to a file."""
    data = (self.bgcs, self.gcfs, self.spectra, self.mfs, self.strains, links)
    with open(file, "wb") as f:
        pickle.dump(data, f)
"},{"location":"api/nplinker/#nplinker.setup_logging","title":"setup_logging","text":"setup_logging(\n level: str = \"INFO\",\n file: str = \"\",\n use_console: bool = True,\n) -> None\n
Setup logging configuration for the ancestor logger \"nplinker\".
Usage documentation: How to setup logging
Parameters:
level (str, default: 'INFO') – The log level, use the logging module's log level constants. Valid levels are: NOTSET, DEBUG, INFO, WARNING, ERROR, CRITICAL.
file (str, default: '') – The file to write the log to. If the file is an empty string (the default), the log will not be written to a file. If the file does not exist, it will be created. The log will be written to the file in append mode.
use_console (bool, default: True) – Whether to log to the console.
Source code in src/nplinker/logger.py
def setup_logging(level: str = "INFO", file: str = "", use_console: bool = True) -> None:
    """Setup logging configuration for the ancestor logger "nplinker"."""
    # Get the ancestor logger "nplinker"
    logger = logging.getLogger("nplinker")
    logger.setLevel(level)

    # File handler
    if file:
        logger.addHandler(
            RichHandler(
                console=Console(file=open(file, "a"), width=120),  # force the line width to 120
                omit_repeated_times=False,
                rich_tracebacks=True,
                tracebacks_show_locals=True,
                log_time_format="[%Y-%m-%d %X]",
            )
        )

    # Console handler
    if use_console:
        logger.addHandler(
            RichHandler(
                omit_repeated_times=False,
                rich_tracebacks=True,
                tracebacks_show_locals=True,
                log_time_format="[%Y-%m-%d %X]",
            )
        )
"},{"location":"api/nplinker/#nplinker.defaults","title":"nplinker.defaults","text":""},{"location":"api/nplinker/#nplinker.defaults.NPLINKER_APP_DATA_DIR","title":"NPLINKER_APP_DATA_DIR module-attribute
","text":"NPLINKER_APP_DATA_DIR: Final = parent / 'data'\n
"},{"location":"api/nplinker/#nplinker.defaults.STRAIN_MAPPINGS_FILENAME","title":"STRAIN_MAPPINGS_FILENAME module-attribute
","text":"STRAIN_MAPPINGS_FILENAME: Final = 'strain_mappings.json'\n
"},{"location":"api/nplinker/#nplinker.defaults.GENOME_BGC_MAPPINGS_FILENAME","title":"GENOME_BGC_MAPPINGS_FILENAME module-attribute
","text":"GENOME_BGC_MAPPINGS_FILENAME: Final = (\n \"genome_bgc_mappings.json\"\n)\n
"},{"location":"api/nplinker/#nplinker.defaults.GENOME_STATUS_FILENAME","title":"GENOME_STATUS_FILENAME module-attribute
","text":"GENOME_STATUS_FILENAME: Final = 'genome_status.json'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_SPECTRA_FILENAME","title":"GNPS_SPECTRA_FILENAME module-attribute
","text":"GNPS_SPECTRA_FILENAME: Final = 'spectra.mgf'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_MOLECULAR_FAMILY_FILENAME","title":"GNPS_MOLECULAR_FAMILY_FILENAME module-attribute
","text":"GNPS_MOLECULAR_FAMILY_FILENAME: Final = (\n \"molecular_families.tsv\"\n)\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_ANNOTATIONS_FILENAME","title":"GNPS_ANNOTATIONS_FILENAME module-attribute
","text":"GNPS_ANNOTATIONS_FILENAME: Final = 'annotations.tsv'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_FILE_MAPPINGS_TSV","title":"GNPS_FILE_MAPPINGS_TSV module-attribute
","text":"GNPS_FILE_MAPPINGS_TSV: Final = 'file_mappings.tsv'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_FILE_MAPPINGS_CSV","title":"GNPS_FILE_MAPPINGS_CSV module-attribute
","text":"GNPS_FILE_MAPPINGS_CSV: Final = 'file_mappings.csv'\n
"},{"location":"api/nplinker/#nplinker.defaults.STRAINS_SELECTED_FILENAME","title":"STRAINS_SELECTED_FILENAME module-attribute
","text":"STRAINS_SELECTED_FILENAME: Final = 'strains_selected.json'\n
"},{"location":"api/nplinker/#nplinker.defaults.DOWNLOADS_DIRNAME","title":"DOWNLOADS_DIRNAME module-attribute
","text":"DOWNLOADS_DIRNAME: Final = 'downloads'\n
"},{"location":"api/nplinker/#nplinker.defaults.MIBIG_DIRNAME","title":"MIBIG_DIRNAME module-attribute
","text":"MIBIG_DIRNAME: Final = 'mibig'\n
"},{"location":"api/nplinker/#nplinker.defaults.GNPS_DIRNAME","title":"GNPS_DIRNAME module-attribute
","text":"GNPS_DIRNAME: Final = 'gnps'\n
"},{"location":"api/nplinker/#nplinker.defaults.ANTISMASH_DIRNAME","title":"ANTISMASH_DIRNAME module-attribute
","text":"ANTISMASH_DIRNAME: Final = 'antismash'\n
"},{"location":"api/nplinker/#nplinker.defaults.BIGSCAPE_DIRNAME","title":"BIGSCAPE_DIRNAME module-attribute
","text":"BIGSCAPE_DIRNAME: Final = 'bigscape'\n
"},{"location":"api/nplinker/#nplinker.defaults.BIGSCAPE_RUNNING_OUTPUT_DIRNAME","title":"BIGSCAPE_RUNNING_OUTPUT_DIRNAME module-attribute
","text":"BIGSCAPE_RUNNING_OUTPUT_DIRNAME: Final = (\n \"bigscape_running_output\"\n)\n
"},{"location":"api/nplinker/#nplinker.defaults.OUTPUT_DIRNAME","title":"OUTPUT_DIRNAME module-attribute
","text":"OUTPUT_DIRNAME: Final = 'output'\n
"},{"location":"api/nplinker/#nplinker.config","title":"nplinker.config","text":""},{"location":"api/nplinker/#nplinker.config.CONFIG_VALIDATORS","title":"CONFIG_VALIDATORS module-attribute
","text":"CONFIG_VALIDATORS = [\n Validator(\n \"root_dir\",\n required=True,\n cast=transform_to_full_path,\n condition=lambda v: is_dir(),\n ),\n Validator(\n \"mode\",\n required=True,\n cast=lambda v: lower(),\n is_in=[\"local\", \"podp\"],\n ),\n Validator(\n \"podp_id\",\n required=True,\n when=Validator(\"mode\", eq=\"podp\"),\n ),\n Validator(\n \"podp_id\",\n required=False,\n when=Validator(\"mode\", eq=\"local\"),\n ),\n Validator(\n \"log.level\",\n is_type_of=str,\n cast=lambda v: upper(),\n is_in=[\n \"NOTSET\",\n \"DEBUG\",\n \"INFO\",\n \"WARNING\",\n \"ERROR\",\n \"CRITICAL\",\n ],\n ),\n Validator(\"log.file\", is_type_of=str),\n Validator(\"log.use_console\", is_type_of=bool),\n Validator(\n \"mibig.to_use\", required=True, is_type_of=bool\n ),\n Validator(\n \"mibig.version\",\n required=True,\n is_type_of=str,\n when=Validator(\"mibig.to_use\", eq=True),\n ),\n Validator(\n \"bigscape.parameters\", required=True, is_type_of=str\n ),\n Validator(\n \"bigscape.cutoff\", required=True, is_type_of=str\n ),\n Validator(\n \"bigscape.version\", required=True, is_type_of=int\n ),\n Validator(\n \"scoring.methods\",\n required=True,\n cast=lambda v: [lower() for i in v],\n is_type_of=list,\n len_min=1,\n condition=lambda v: issubset(\n {\"metcalf\", \"rosetta\"}\n ),\n ),\n]\n
"},{"location":"api/nplinker/#nplinker.config.load_config","title":"load_config","text":"load_config(config_file: str | PathLike) -> Dynaconf\n
Load and validate the configuration file.
Usage documentation: Config Loader
Parameters:
config_file (str | PathLike) – Path to the configuration file.
Returns:
Dynaconf – A Dynaconf object containing the configuration settings.
Raises:
FileNotFoundError – If the configuration file does not exist.
Source code in src/nplinker/config.py
def load_config(config_file: str | PathLike) -> Dynaconf:
    """Load and validate the configuration file."""
    config_file = transform_to_full_path(config_file)
    if not config_file.exists():
        raise FileNotFoundError(f"Config file '{config_file}' not found")

    # Locate the default config file
    default_config_file = Path(__file__).resolve().parent / "nplinker_default.toml"

    # Load config files
    config = Dynaconf(settings_files=[config_file], preload=[default_config_file])

    # Validate configs
    config.validators.register(*CONFIG_VALIDATORS)
    config.validators.validate()

    return config
"},{"location":"api/schema/","title":"Schemas","text":""},{"location":"api/schema/#nplinker.schemas","title":"nplinker.schemas","text":""},{"location":"api/schema/#nplinker.schemas.GENOME_STATUS_SCHEMA","title":"GENOME_STATUS_SCHEMA module-attribute
","text":"GENOME_STATUS_SCHEMA = load(f)\n
Schema for the genome status JSON file.
Schema content (genome_status_schema.json):
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/genome_status_schema.json",
  "title": "Status of genomes",
  "description": "A list of genome status objects, each of which contains information about a single genome",
  "type": "object",
  "required": ["genome_status", "version"],
  "properties": {
    "genome_status": {
      "type": "array",
      "title": "Genome status",
      "description": "A list of genome status objects",
      "items": {
        "type": "object",
        "required": ["original_id", "resolved_refseq_id", "resolve_attempted", "bgc_path"],
        "properties": {
          "original_id": {
            "type": "string",
            "title": "Original ID",
            "description": "The original ID of the genome",
            "minLength": 1
          },
          "resolved_refseq_id": {
            "type": "string",
            "title": "Resolved RefSeq ID",
            "description": "The RefSeq ID that was resolved for this genome"
          },
          "resolve_attempted": {
            "type": "boolean",
            "title": "Resolve Attempted",
            "description": "Whether or not an attempt was made to resolve this genome"
          },
          "bgc_path": {
            "type": "string",
            "title": "BGC Path",
            "description": "The path to the downloaded BGC file for this genome"
          }
        }
      },
      "minItems": 1,
      "uniqueItems": true
    },
    "version": {"type": "string", "enum": ["1.0"]}
  },
  "additionalProperties": false
}
"},{"location":"api/schema/#nplinker.schemas.GENOME_BGC_MAPPINGS_SCHEMA","title":"GENOME_BGC_MAPPINGS_SCHEMA module-attribute
","text":"GENOME_BGC_MAPPINGS_SCHEMA = load(f)\n
Schema for genome BGC mappings JSON file.
Schema content (genome_bgc_mappings_schema.json):
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/genome_bgc_mappings_schema.json",
  "title": "Mappings from genome ID to BGC IDs",
  "description": "A list of mappings from genome ID to BGC (biosynthetic gene cluster) IDs",
  "type": "object",
  "required": ["mappings", "version"],
  "properties": {
    "mappings": {
      "type": "array",
      "title": "Mappings from genome ID to BGC IDs",
      "description": "A list of mappings from genome ID to BGC IDs",
      "items": {
        "type": "object",
        "required": ["genome_ID", "BGC_ID"],
        "properties": {
          "genome_ID": {
            "type": "string",
            "title": "Genome ID",
            "description": "The genome ID used in BGC database such as antiSMASH",
            "minLength": 1
          },
          "BGC_ID": {
            "type": "array",
            "title": "BGC ID",
            "description": "A list of BGC IDs",
            "items": {"type": "string", "minLength": 1},
            "minItems": 1,
            "uniqueItems": true
          }
        }
      },
      "minItems": 1,
      "uniqueItems": true
    },
    "version": {"type": "string", "enum": ["1.0"]}
  },
  "additionalProperties": false
}
"},{"location":"api/schema/#nplinker.schemas.STRAIN_MAPPINGS_SCHEMA","title":"STRAIN_MAPPINGS_SCHEMA module-attribute
","text":"STRAIN_MAPPINGS_SCHEMA = load(f)\n
Schema for strain mappings JSON file.
Schema content (strain_mappings_schema.json):
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/strain_mappings_schema.json",
  "title": "Strain mappings",
  "description": "A list of mappings from strain ID to strain aliases",
  "type": "object",
  "required": ["strain_mappings", "version"],
  "properties": {
    "strain_mappings": {
      "type": "array",
      "title": "Strain mappings",
      "description": "A list of strain mappings",
      "items": {
        "type": "object",
        "required": ["strain_id", "strain_alias"],
        "properties": {
          "strain_id": {
            "type": "string",
            "title": "Strain ID",
            "description": "Strain ID, which could be any strain name or accession number",
            "minLength": 1
          },
          "strain_alias": {
            "type": "array",
            "title": "Strain aliases",
            "description": "A list of strain aliases, which could be any names that refer to the same strain",
            "items": {"type": "string", "minLength": 1},
            "minItems": 1,
            "uniqueItems": true
          }
        }
      },
      "minItems": 1,
      "uniqueItems": true
    },
    "version": {"type": "string", "enum": ["1.0"]}
  },
  "additionalProperties": false
}
"},{"location":"api/schema/#nplinker.schemas.USER_STRAINS_SCHEMA","title":"USER_STRAINS_SCHEMA module-attribute
","text":"USER_STRAINS_SCHEMA = load(f)\n
Schema for user strains JSON file.
Schema content (user_strains.json):
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/user_strains.json",
  "title": "User specified strains",
  "description": "A list of strain IDs specified by user",
  "type": "object",
  "required": ["strain_ids"],
  "properties": {
    "strain_ids": {
      "type": "array",
      "title": "Strain IDs",
      "description": "A list of strain IDs specified by user. The strain IDs must be the same as the ones in the strain mappings file.",
      "items": {"type": "string", "minLength": 1},
      "minItems": 1,
      "uniqueItems": true
    },
    "version": {"type": "string", "enum": ["1.0"]}
  },
  "additionalProperties": false
}
"},{"location":"api/schema/#nplinker.schemas.PODP_ADAPTED_SCHEMA","title":"PODP_ADAPTED_SCHEMA module-attribute
","text":"PODP_ADAPTED_SCHEMA = load(f)\n
Schema for PODP JSON file.
The PODP JSON file is the project JSON file downloaded from the PODP platform. For example, for PODP project MSV000079284, the JSON file is https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.
Schema content (podp_adapted_schema.json; HTML markup in the original descriptions has been reduced to plain text and URLs):
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://raw.githubusercontent.com/NPLinker/nplinker/main/src/nplinker/schemas/podp_adapted_schema.json",
  "title": "Adapted Paired Omics Data Platform Schema for NPLinker",
  "description": "This schema is adapted from PODP schema (https://pairedomicsdata.bioinformatics.nl/schema.json) for NPLinker. It's used to validate the input data for NPLinker. Thus, only required fields for NPLinker are kept in this schema, and some fields are modified to fit NPLinker's requirements.",
  "type": "object",
  "required": ["version", "metabolomics", "genomes", "genome_metabolome_links"],
  "properties": {
    "version": {
      "type": "string",
      "readOnly": true,
      "default": "3",
      "enum": ["3"]
    },
    "metabolomics": {
      "type": "object",
      "title": "2. Metabolomics Information",
      "description": "Please provide basic information on the publicly available metabolomics project from which paired data is available. Currently, we allow for links to mass spectrometry data deposited in GNPS-MaSSIVE or MetaboLights.",
      "properties": {
        "project": {
          "type": "object",
          "required": ["molecular_network"],
          "title": "GNPS-MassIVE",
          "properties": {
            "GNPSMassIVE_ID": {
              "type": "string",
              "title": "GNPS-MassIVE identifier",
              "description": "Please provide the GNPS-MassIVE identifier of your metabolomics data set, e.g., MSV000078839.",
              "pattern": "^MSV[0-9]{9}$"
            },
            "MaSSIVE_URL": {
              "type": "string",
              "title": "Link to MassIVE upload",
              "description": "Please provide the link to the MassIVE upload, e.g., https://gnps.ucsd.edu/ProteoSAFe/result.jsp?task=a507232a787243a5afd69a6c6fa1e508&view=advanced_view. Warning, there cannot be spaces in the URI.",
              "format": "uri"
            },
            "molecular_network": {
              "type": "string",
              "pattern": "^[0-9a-z]{32}$",
              "title": "Molecular Network Task ID",
              "description": "If you have run a Molecular Network on GNPS, please provide the task ID of the Molecular Network job. It can be found in the URL of the Molecular Networking job, e.g., in https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c36f90ba29fe44c18e96db802de0c6b9 the task ID is c36f90ba29fe44c18e96db802de0c6b9."
            }
          }
        }
      },
      "required": ["project"],
      "additionalProperties": true
    },
    "genomes": {
      "type": "array",
      "title": "3. (Meta)genomics Information",
      "description": "Please add all genomes and/or metagenomes for which paired data is available as separate entries.",
      "items": {
        "type": "object",
        "required": ["genome_ID", "genome_label"],
        "properties": {
          "genome_ID": {
            "type": "object",
            "title": "Genome accession",
            "description": "At least one of the three identifiers is required.",
            "anyOf": [
              {"required": ["GenBank_accession"]},
              {"required": ["RefSeq_accession"]},
              {"required": ["JGI_Genome_ID"]}
            ],
            "properties": {
              "GenBank_accession": {
                "type": "string",
                "title": "GenBank accession number",
                "description": "If the publicly available genome got a GenBank accession number assigned, e.g., AL645882, please provide it here. The genome sequence must be submitted to GenBank/ENA/DDBJ (and an accession number must be received) before this form can be filled out. In case of a whole genome sequence, please use master records. At least one identifier must be entered.",
                "minLength": 1
              },
              "RefSeq_accession": {
                "type": "string",
                "title": "RefSeq accession number",
                "description": "For example: NC_003888.3",
                "minLength": 1
              },
              "JGI_Genome_ID": {
                "type": "string",
                "title": "JGI IMG genome ID",
                "description": "For example: 641228474",
                "minLength": 1
              }
            }
          },
          "genome_label": {
            "type": "string",
            "title": "Genome label",
            "description": "Please assign a unique Genome Label for this genome or metagenome to help you recall it during the linking step. For example 'Streptomyces sp. CNB091'",
            "minLength": 1
          }
        }
      },
      "minItems": 1
    },
    "genome_metabolome_links": {
      "type": "array",
      "title": "6. Genome - Proteome - Metabolome Links",
      "description": "Create a linked pair by selecting the Genome Label and optional Proteome label as provided earlier. Subsequently links to the metabolomics data file belonging to that genome/proteome with appropriate experimental methods.",
      "items": {
        "type": "object",
        "required": ["genome_label", "metabolomics_file"],
        "properties": {
          "genome_label": {
            "type": "string",
            "title": "Genome/Metagenome",
            "description": "Please select the Genome Label to be linked to a metabolomics data file."
          },
          "metabolomics_file": {
            "type": "string",
            "title": "Location of metabolomics data file",
            "description": "Please provide a direct link to the metabolomics data file location, e.g. ftp://massive.ucsd.edu/MSV000078839/spectrum/R5/CNB091_R5_M.mzXML found in the FTP download of a MassIVE dataset or https://www.ebi.ac.uk/metabolights/MTBLS307/files/Urine_44_fullscan1_pos.mzXML found in the Files section of a MetaboLights study. Warning, there cannot be spaces in the URI.",
            "format": "uri"
          }
        },
        "additionalProperties": true
      },
      "minItems": 1
    }
  },
  "additionalProperties": true
}
"},{"location":"api/schema/#nplinker.schemas.validate_podp_json","title":"validate_podp_json","text":"validate_podp_json(json_data: dict) -> None\n
Validate JSON data against the PODP JSON schema.
All validation error messages are collected and raised as a single ValueError.
Parameters:
json_data (dict) – The JSON data to validate.
Raises:
ValueError – If the JSON data does not match the schema.
Examples:
Download the PODP JSON file for project MSV000079284 from https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4 and save it as podp_project.json.
Validate it:
>>> with open("podp_project.json", "r") as f:
...     json_data = json.load(f)
>>> validate_podp_json(json_data)
Source code in src/nplinker/schemas/__init__.py
def validate_podp_json(json_data: dict) -> None:
    """Validate JSON data against the PODP JSON schema."""
    validator = Draft7Validator(PODP_ADAPTED_SCHEMA)
    errors = sorted(validator.iter_errors(json_data), key=lambda e: e.path)
    if errors:
        error_messages = [f"{e.json_path}: {e.message}" for e in errors]
        raise ValueError(
            "Not match PODP adapted schema, here are the detailed error:\n - "
            + "\n - ".join(error_messages)
        )
"},{"location":"api/scoring/","title":"Data Models","text":""},{"location":"api/scoring/#nplinker.scoring","title":"nplinker.scoring","text":""},{"location":"api/scoring/#nplinker.scoring.LinkGraph","title":"LinkGraph","text":"LinkGraph()\n
Class to represent the links between objects in NPLinker.
This class wraps the networkx.Graph class to provide a more user-friendly interface for working with the links. The links between objects are stored as edges in a graph, while the objects themselves are stored as nodes. The scoring data for each link (or link data) is stored as the key/value attributes of the edge.
Examples:
Create a LinkGraph object:
>>> lg = LinkGraph()
Add a link between a GCF and a Spectrum object:
>>> lg.add_link(gcf, spectrum, metcalf=Score("metcalf", 1.0, {"cutoff": 0.5}))
Get all links for a given object:
>>> lg[gcf]
{spectrum: {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}}
Get all links in the LinkGraph:
>>> lg.links
[(gcf, spectrum, {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})})]
Check if there is a link between two objects:
>>> lg.has_link(gcf, spectrum)
True
Get the link data between two objects:
>>> lg.get_link_data(gcf, spectrum)
{"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}
Source code in src/nplinker/scoring/link_graph.py
def __init__(self) -> None:
    """Initialize a LinkGraph object."""
    self._g: Graph = Graph()
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.links","title":"links property
","text":"links: list[LINK]\n
Get all links.
Returns:
list[LINK] – A list of tuples containing the links between objects.
Examples:
>>> lg.links
[(gcf, spectrum, {"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})})]
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.__str__","title":"__str__","text":"__str__() -> str\n
Get a short summary of the LinkGraph.
Source code in src/nplinker/scoring/link_graph.py
def __str__(self) -> str:
    """Get a short summary of the LinkGraph."""
    return f"{self.__class__.__name__}(#links={len(self.links)}, #objects={len(self)})"
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.__len__","title":"__len__","text":"__len__() -> int\n
Get the number of objects.
Source code in src/nplinker/scoring/link_graph.py
def __len__(self) -> int:
    """Get the number of objects."""
    return len(self._g)
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.__getitem__","title":"__getitem__","text":"__getitem__(u: Entity) -> dict[Entity, LINK_DATA]\n
Get all links for a given object.
Parameters:
u (Entity) – the given object
Returns:
dict[Entity, LINK_DATA] – A dictionary of links for the given object.
Raises:
KeyError – if the input object is not found in the link graph.
Source code in src/nplinker/scoring/link_graph.py
@validate_u
def __getitem__(self, u: Entity) -> dict[Entity, LINK_DATA]:
    """Get all links for a given object."""
    try:
        links = self._g[u]
    except KeyError:
        raise KeyError(f"{u} not found in the link graph.")

    return {**links}  # type: ignore
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.add_link","title":"add_link","text":"add_link(u: Entity, v: Entity, **data: Score) -> None\n
Add a link between two objects.
The objects u and v must be of different types, i.e. one must be a GCF and the other must be a Spectrum or MolecularFamily.
Parameters:
u (Entity) – the first object, either a GCF, Spectrum, or MolecularFamily
v (Entity) – the second object, either a GCF, Spectrum, or MolecularFamily
data (Score, default: {}) – keyword arguments. At least one scoring method and its data must be provided. The key must be the name of the scoring method defined in ScoringMethod, and the value is a Score object, e.g. metcalf=Score("metcalf", 1.0, {"cutoff": 0.5}).
Examples:
>>> lg.add_link(gcf, spectrum, metcalf=Score("metcalf", 1.0, {"cutoff": 0.5}))
Source code in src/nplinker/scoring/link_graph.py
@validate_uv
def add_link(
    self,
    u: Entity,
    v: Entity,
    **data: Score,
) -> None:
    """Add a link between two objects."""
    # validate the data
    if not data:
        raise ValueError("At least one scoring method and its data must be provided.")
    for key, value in data.items():
        if not ScoringMethod.has_value(key):
            raise ValueError(
                f"{key} is not a valid name of scoring method. See `ScoringMethod` for valid names."
            )
        if not isinstance(value, Score):
            raise TypeError(f"{value} is not a Score object.")

    self._g.add_edge(u, v, **data)
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.has_link","title":"has_link","text":"has_link(u: Entity, v: Entity) -> bool\n
Check if there is a link between two objects.
Parameters:
u (Entity) – the first object, either a GCF, Spectrum, or MolecularFamily
v (Entity) – the second object, either a GCF, Spectrum, or MolecularFamily
Returns:
bool – True if there is a link between the two objects, False otherwise
Examples:
>>> lg.has_link(gcf, spectrum)
True
Source code in src/nplinker/scoring/link_graph.py
@validate_uv
def has_link(self, u: Entity, v: Entity) -> bool:
    """Check if there is a link between two objects."""
    return self._g.has_edge(u, v)
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.get_link_data","title":"get_link_data","text":"get_link_data(u: Entity, v: Entity) -> LINK_DATA | None\n
Get the data for a link between two objects.
Parameters:
u (Entity) – the first object, either a GCF, Spectrum, or MolecularFamily
v (Entity) – the second object, either a GCF, Spectrum, or MolecularFamily
Returns:
LINK_DATA | None – A dictionary of scoring methods and their data for the link between the two objects, or None if there is no link between the two objects.
Examples:
>>> lg.get_link_data(gcf, spectrum)
{"metcalf": Score("metcalf", 1.0, {"cutoff": 0.5})}
Source code in src/nplinker/scoring/link_graph.py
@validate_uv
def get_link_data(
    self,
    u: Entity,
    v: Entity,
) -> LINK_DATA | None:
    """Get the data for a link between two objects."""
    return self._g.get_edge_data(u, v)  # type: ignore
"},{"location":"api/scoring/#nplinker.scoring.LinkGraph.filter","title":"filter","text":"filter(\n u_nodes: Sequence[Entity],\n v_nodes: Sequence[Entity] = [],\n) -> LinkGraph\n
Return a new LinkGraph object with the filtered links between the given objects.
The new LinkGraph object will only contain the links between u_nodes and v_nodes.
If u_nodes or v_nodes is empty, the new LinkGraph object will contain the links for the given objects in v_nodes or u_nodes, respectively. If both are empty, an empty LinkGraph object is returned.
Note that not all objects in u_nodes and v_nodes need to be present in the original LinkGraph.
Parameters:
u_nodes (Sequence[Entity]) – a sequence of objects used as the first object in the links
v_nodes (Sequence[Entity], default: []) – a sequence of objects used as the second object in the links
Returns:
LinkGraph – A new LinkGraph object with the filtered links between the given objects.
Examples:
Filter the links for gcf1 and gcf2:
>>> new_lg = lg.filter([gcf1, gcf2])
Filter the links for spectrum1 and spectrum2:
>>> new_lg = lg.filter([spectrum1, spectrum2])
Filter the links between two lists of objects:
>>> new_lg = lg.filter([gcf1, gcf2], [spectrum1, spectrum2])
Source code in src/nplinker/scoring/link_graph.py
def filter(self, u_nodes: Sequence[Entity], v_nodes: Sequence[Entity] = [], /) -> LinkGraph:
    """Return a new LinkGraph object with the filtered links between the given objects."""
    lg = LinkGraph()

    # exchange u_nodes and v_nodes if u_nodes is empty but v_nodes not
    if len(u_nodes) == 0 and len(v_nodes) != 0:
        u_nodes = v_nodes
        v_nodes = []

    if len(v_nodes) == 0:
        for u in u_nodes:
            self._filter_one_node(u, lg)

    for u in u_nodes:
        for v in v_nodes:
            self._filter_two_nodes(u, v, lg)

    return lg
"},{"location":"api/scoring/#nplinker.scoring.Score","title":"Score dataclass
","text":"Score(name: str, value: float, parameter: dict)\n
A data class to represent score data.
Attributes:
name (str) – the name of the scoring method. See ScoringMethod for valid values.
value (float) – the score value.
parameter (dict) – the parameters used for the scoring method.
name (instance attribute)
name: str

value (instance attribute)
value: float

parameter (instance attribute)
parameter: dict
"},{"location":"api/scoring/#nplinker.scoring.Score.__post_init__","title":"__post_init__","text":"__post_init__() -> None\n
Check if the value of name is valid.
Raises:
ValueError – if the value of name is not valid.
Source code in src/nplinker/scoring/score.py
def __post_init__(self) -> None:
    """Check if the value of `name` is valid."""
    if ScoringMethod.has_value(self.name) is False:
        raise ValueError(
            f"{self.name} is not a valid value. Valid values are: {[e.value for e in ScoringMethod]}"
        )
"},{"location":"api/scoring/#nplinker.scoring.Score.__getitem__","title":"__getitem__","text":"__getitem__(key)\n
Source code in src/nplinker/scoring/score.py
def __getitem__(self, key):\n if key in {field.name for field in fields(self)}:\n return getattr(self, key)\n else:\n raise KeyError(f\"{key} not found in {self.__class__.__name__}\")\n
"},{"location":"api/scoring/#nplinker.scoring.Score.__setitem__","title":"__setitem__","text":"__setitem__(key, value)\n
Source code in src/nplinker/scoring/score.py
def __setitem__(self, key, value):\n # validate the value of `name`\n if key == \"name\" and ScoringMethod.has_value(value) is False:\n raise ValueError(\n f\"{value} is not a valid value. Valid values are: {[e.value for e in ScoringMethod]}\"\n )\n\n if key in {field.name for field in fields(self)}:\n setattr(self, key, value)\n else:\n raise KeyError(f\"{key} not found in {self.__class__.__name__}\")\n
"},{"location":"api/scoring_abc/","title":"Abstract Base Classes","text":""},{"location":"api/scoring_abc/#nplinker.scoring.abc","title":"nplinker.scoring.abc","text":""},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase","title":"ScoringBase","text":" Bases: ABC
Abstract base class of scoring methods.
Attributes:
name (str) – The name of the scoring method.
npl (NPLinker | None) – The NPLinker object.
name (class attribute)
name: str = 'ScoringBase'

npl (class attribute)
npl: NPLinker | None = None
"},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase.setup","title":"setup abstractmethod
classmethod
","text":"setup(npl: NPLinker)\n
Setup class level attributes.
Source code in src/nplinker/scoring/abc.py
@classmethod
@abstractmethod
def setup(cls, npl: NPLinker):
    """Setup class level attributes."""
"},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase.get_links","title":"get_links abstractmethod
","text":"get_links(*objects, **parameters) -> LinkGraph\n
Get links information for the given objects.
Parameters:
objects – A list of objects to get links for.
parameters – The parameters used for scoring.
Returns:
LinkGraph – The LinkGraph object.
Source code in src/nplinker/scoring/abc.py
@abstractmethod
def get_links(
    self,
    *objects,
    **parameters,
) -> LinkGraph:
    """Get links information for the given objects."""
"},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase.format_data","title":"format_data abstractmethod
","text":"format_data(data) -> str\n
Format the scoring data to a string.
Source code in src/nplinker/scoring/abc.py
@abstractmethod
def format_data(self, data) -> str:
    """Format the scoring data to a string."""
"},{"location":"api/scoring_abc/#nplinker.scoring.abc.ScoringBase.sort","title":"sort abstractmethod
","text":"sort(objects, reverse=True) -> list\n
Sort the given objects based on the scoring data.
Source code in src/nplinker/scoring/abc.py
@abstractmethod
def sort(self, objects, reverse=True) -> list:
    """Sort the given objects based on the scoring data."""
"},{"location":"api/scoring_methods/","title":"Scoring Methods","text":""},{"location":"api/scoring_methods/#nplinker.scoring","title":"nplinker.scoring","text":""},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod","title":"ScoringMethod","text":" Bases: Enum
Enum class for scoring methods.
"},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod.METCALF","title":"METCALFclass-attribute
instance-attribute
","text":"METCALF = 'metcalf'\n
"},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod.ROSETTA","title":"ROSETTA class-attribute
instance-attribute
","text":"ROSETTA = 'rosetta'\n
"},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod.NPLCLASS","title":"NPLCLASS class-attribute
instance-attribute
","text":"NPLCLASS = 'nplclass'\n
"},{"location":"api/scoring_methods/#nplinker.scoring.ScoringMethod.has_value","title":"has_value classmethod
","text":"has_value(value: str) -> bool\n
Check if the enum has a value.
Source code in src/nplinker/scoring/scoring_method.py
@classmethod
def has_value(cls, value: str) -> bool:
    """Check if the enum has a value."""
    return any(value == item.value for item in cls)
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring","title":"MetcalfScoring","text":" Bases: ScoringBase
Metcalf scoring method.
Attributes:
name – The name of this scoring method, set to a fixed value metcalf.
npl (NPLinker | None) – The NPLinker object.
CACHE (str) – The name of the cache file to use for storing the MetcalfScoring.
presence_gcf_strain (DataFrame) – A DataFrame to store presence of gcfs with respect to strains. The index of the DataFrame are the GCF objects and the columns are Strain objects. The values are 1 where the gcf occurs in the strain, 0 otherwise.
presence_spec_strain (DataFrame) – A DataFrame to store presence of spectra with respect to strains. The index of the DataFrame are the Spectrum objects and the columns are Strain objects. The values are 1 where the spectrum occurs in the strain, 0 otherwise.
presence_mf_strain (DataFrame) – A DataFrame to store presence of molecular families with respect to strains. The index of the DataFrame are the MolecularFamily objects and the columns are Strain objects. The values are 1 where the molecular family occurs in the strain, 0 otherwise.
raw_score_spec_gcf (DataFrame) – A DataFrame to store the raw Metcalf scores for spectrum-gcf links. The columns are "spec", "gcf" and "score".
raw_score_mf_gcf (DataFrame) – A DataFrame to store the raw Metcalf scores for molecular family-gcf links. The columns are "mf", "gcf" and "score".
metcalf_mean (ndarray | None) – A numpy array to store the mean value used for standardising Metcalf scores. The array has shape (n_strains+1, n_strains+1), where n_strains is the number of strains.
metcalf_std (ndarray | None) – A numpy array to store the standard deviation value used for standardising Metcalf scores. The array has shape (n_strains+1, n_strains+1), where n_strains is the number of strains.
name (class attribute)
name = METCALF.value

npl (class attribute)
npl: NPLinker | None = None

CACHE (class attribute)
CACHE: str = 'cache_metcalf_scoring.pckl'

metcalf_weights (class attribute)
metcalf_weights: tuple[int, int, int, int] = (10, -10, 0, 1)

presence_gcf_strain (class attribute)
presence_gcf_strain: DataFrame = DataFrame()

presence_spec_strain (class attribute)
presence_spec_strain: DataFrame = DataFrame()

presence_mf_strain (class attribute)
presence_mf_strain: DataFrame = DataFrame()

raw_score_spec_gcf (class attribute)
raw_score_spec_gcf: DataFrame = DataFrame(columns=["spec", "gcf", "score"])

raw_score_mf_gcf (class attribute)
raw_score_mf_gcf: DataFrame = DataFrame(columns=["mf", "gcf", "score"])

metcalf_mean (class attribute)
metcalf_mean: ndarray | None = None

metcalf_std (class attribute)
metcalf_std: ndarray | None = None
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring.setup","title":"setup classmethod
","text":"setup(npl: NPLinker) -> None\n
Setup the MetcalfScoring object.
This method is only called once to setup the MetcalfScoring object.
Parameters:
npl (NPLinker) – The NPLinker object.
Source code in src/nplinker/scoring/metcalf_scoring.py
@classmethod
def setup(cls, npl: NPLinker) -> None:
    """Setup the MetcalfScoring object."""
    if cls.npl is not None:
        logger.info("MetcalfScoring.setup already called, skipping.")
        return

    logger.info(
        f"MetcalfScoring.setup starts: #bgcs={len(npl.bgcs)}, #gcfs={len(npl.gcfs)}, "
        f"#spectra={len(npl.spectra)}, #mfs={len(npl.mfs)}, #strains={npl.strains}"
    )
    cls.npl = npl

    # calculate presence of gcfs/spectra/mfs with respect to strains
    cls.presence_gcf_strain = get_presence_gcf_strain(npl.gcfs, npl.strains)
    cls.presence_spec_strain = get_presence_spec_strain(npl.spectra, npl.strains)
    cls.presence_mf_strain = get_presence_mf_strain(npl.mfs, npl.strains)

    # calculate raw Metcalf scores for spec-gcf links
    raw_score_spec_gcf = cls._calc_raw_score(
        cls.presence_spec_strain, cls.presence_gcf_strain, cls.metcalf_weights
    )
    cls.raw_score_spec_gcf = raw_score_spec_gcf.reset_index().melt(id_vars="index")
    cls.raw_score_spec_gcf.columns = ["spec", "gcf", "score"]  # type: ignore

    # calculate raw Metcalf scores for mf-gcf links
    raw_score_mf_gcf = cls._calc_raw_score(
        cls.presence_mf_strain, cls.presence_gcf_strain, cls.metcalf_weights
    )
    cls.raw_score_mf_gcf = raw_score_mf_gcf.reset_index().melt(id_vars="index")
    cls.raw_score_mf_gcf.columns = ["mf", "gcf", "score"]  # type: ignore

    # calculate mean and std for standardising Metcalf scores
    cls.metcalf_mean, cls.metcalf_std = cls._calc_mean_std(
        len(npl.strains), cls.metcalf_weights
    )

    logger.info("MetcalfScoring.setup completed")
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring.get_links","title":"get_links","text":"get_links(*objects, **parameters)\n
Get links for the given objects.
Parameters:
objects
\u2013 The objects to get links for. All objects must be of the same type, i.e. GCF
, Spectrum
or MolecularFamily
type. If no objects are provided, all detected objects (npl.gcfs
) will be used.
parameters
\u2013 The scoring parameters to use for the links. The parameters are:
cutoff
: The minimum score to consider a link (\u2265cutoff). Default is 0.
standardised
: Whether to use standardised scores. Default is False.
Returns:
The LinkGraph
object containing the links involving the input objects with the Metcalf scores.
Raises:
TypeError
\u2013 If the input objects are not of the same type or the object type is invalid.
src/nplinker/scoring/metcalf_scoring.py
def get_links(self, *objects, **parameters):\n \"\"\"Get links for the given objects.\n\n Args:\n objects: The objects to get links for. All objects must be of the same type, i.e. `GCF`,\n `Spectrum` or `MolecularFamily` type.\n If no objects are provided, all detected objects (`npl.gcfs`) will be used.\n parameters: The scoring parameters to use for the links.\n The parameters are:\n\n - `cutoff`: The minimum score to consider a link (\u2265cutoff). Default is 0.\n - `standardised`: Whether to use standardised scores. Default is False.\n\n Returns:\n The [`LinkGraph`][nplinker.scoring.LinkGraph] object containing the links involving the\n input objects with the Metcalf scores.\n\n Raises:\n TypeError: If the input objects are not of the same type or the object type is invalid.\n \"\"\"\n # validate input objects\n if len(objects) == 0:\n objects = self.npl.gcfs\n # check if all objects are of the same type\n types = {type(i) for i in objects}\n if len(types) > 1:\n raise TypeError(\"Input objects must be of the same type.\")\n # check if the object type is valid\n obj_type = next(iter(types))\n if obj_type not in (GCF, Spectrum, MolecularFamily):\n raise TypeError(\n f\"Invalid type {obj_type}. Input objects must be GCF, Spectrum or MolecularFamily objects.\"\n )\n\n # validate scoring parameters\n self._cutoff: float = parameters.get(\"cutoff\", 0)\n self._standardised: bool = parameters.get(\"standardised\", False)\n parameters.update({\"cutoff\": self._cutoff, \"standardised\": self._standardised})\n\n logger.info(\n f\"MetcalfScoring: #objects={len(objects)}, type={obj_type}, cutoff={self._cutoff}, \"\n f\"standardised={self._standardised}\"\n )\n if not self._standardised:\n scores_list = self._get_links(*objects, obj_type=obj_type, score_cutoff=self._cutoff)\n else:\n if self.metcalf_mean is None or self.metcalf_std is None:\n raise ValueError(\n \"MetcalfScoring.metcalf_mean and metcalf_std are not set. Run MetcalfScoring.setup first.\"\n )\n # use negative infinity as the score cutoff to ensure we get all links\n scores_list = self._get_links(*objects, obj_type=obj_type, score_cutoff=-np.inf)\n scores_list = self._calc_standardised_score(scores_list)\n\n links = LinkGraph()\n for score_df in scores_list:\n for row in score_df.itertuples(index=False): # row has attributes: spec/mf, gcf, score\n met = row.spec if score_df.name == LinkType.SPEC_GCF else row.mf\n links.add_link(\n row.gcf,\n met,\n metcalf=Score(self.name, row.score, parameters),\n )\n\n logger.info(f\"MetcalfScoring: completed! Found {len(links.links)} links in total.\")\n return links\n
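For illustration, a minimal sketch of driving the two methods above directly. It assumes an NPLinker object whose dataset has already been loaded (see the Quickstart); the config path and the no-argument construction of MetcalfScoring are assumptions, not documented API guarantees.
from nplinker import NPLinker\nfrom nplinker.scoring import MetcalfScoring\n\nnpl = NPLinker(\"path/to/nplinker.toml\")  # hypothetical config path\n# ... load the dataset here, as described in the Quickstart ...\n\nMetcalfScoring.setup(npl)  # one-time, class-level setup\nscoring = MetcalfScoring()  # assumption: no-argument construction\nlinks = scoring.get_links(*npl.gcfs, cutoff=0, standardised=False)\nprint(f\"Found {len(links.links)} links\")\n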
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring.format_data","title":"format_data","text":"format_data(data)\n
Format the data for display.
Source code in src/nplinker/scoring/metcalf_scoring.py
def format_data(self, data):\n \"\"\"Format the data for display.\"\"\"\n # for metcalf the data will just be a floating point value (i.e. the score)\n return f\"{data:.4f}\"\n
"},{"location":"api/scoring_methods/#nplinker.scoring.MetcalfScoring.sort","title":"sort","text":"sort(objects, reverse=True)\n
Sort the objects based on the score.
Source code in src/nplinker/scoring/metcalf_scoring.py
def sort(self, objects, reverse=True):\n \"\"\"Sort the objects based on the score.\"\"\"\n # sort based on score\n return sorted(objects, key=lambda objlink: objlink[self], reverse=reverse)\n
"},{"location":"api/scoring_utils/","title":"Utilities","text":""},{"location":"api/scoring_utils/#nplinker.scoring.utils","title":"nplinker.scoring.utils","text":""},{"location":"api/scoring_utils/#nplinker.scoring.utils.get_presence_gcf_strain","title":"get_presence_gcf_strain","text":"get_presence_gcf_strain(\n gcfs: Sequence[GCF], strains: StrainCollection\n) -> DataFrame\n
Get the occurrence of strains in gcfs.
The occurrence is a DataFrame with GCF objects as index and Strain objects as columns, and the values are 1 if the gcf occurs in the strain, 0 otherwise.
Source code in src/nplinker/scoring/utils.py
def get_presence_gcf_strain(gcfs: Sequence[GCF], strains: StrainCollection) -> pd.DataFrame:\n \"\"\"Get the occurrence of strains in gcfs.\n\n The occurrence is a DataFrame with GCF objects as index and Strain objects as columns, and the\n values are 1 if the gcf occurs in the strain, 0 otherwise.\n \"\"\"\n df_gcf_strain = pd.DataFrame(\n 0,\n index=gcfs,\n columns=list(strains),\n dtype=int,\n ) # type: ignore\n for gcf in gcfs:\n for strain in strains:\n if gcf.has_strain(strain):\n df_gcf_strain.loc[gcf, strain] = 1\n return df_gcf_strain # type: ignore\n
"},{"location":"api/scoring_utils/#nplinker.scoring.utils.get_presence_spec_strain","title":"get_presence_spec_strain","text":"get_presence_spec_strain(\n spectra: Sequence[Spectrum], strains: StrainCollection\n) -> DataFrame\n
Get the occurrence of strains in spectra.
The occurrence is a DataFrame with Spectrum objects as index and Strain objects as columns, and the values are 1 if the spectrum occurs in the strain, 0 otherwise.
Source code in src/nplinker/scoring/utils.py
def get_presence_spec_strain(\n spectra: Sequence[Spectrum], strains: StrainCollection\n) -> pd.DataFrame:\n \"\"\"Get the occurrence of strains in spectra.\n\n The occurrence is a DataFrame with Spectrum objects as index and Strain objects as columns, and\n the values are 1 if the spectrum occurs in the strain, 0 otherwise.\n \"\"\"\n df_spec_strain = pd.DataFrame(\n 0,\n index=spectra,\n columns=list(strains),\n dtype=int,\n ) # type: ignore\n for spectrum in spectra:\n for strain in strains:\n if spectrum.has_strain(strain):\n df_spec_strain.loc[spectrum, strain] = 1\n return df_spec_strain # type: ignore\n
"},{"location":"api/scoring_utils/#nplinker.scoring.utils.get_presence_mf_strain","title":"get_presence_mf_strain","text":"get_presence_mf_strain(\n mfs: Sequence[MolecularFamily],\n strains: StrainCollection,\n) -> DataFrame\n
Get the occurrence of strains in molecular families.
The occurrence is a DataFrame with MolecularFamily objects as index and Strain objects as columns, and the values are 1 if the molecular family occurs in the strain, 0 otherwise.
Source code in src/nplinker/scoring/utils.py
def get_presence_mf_strain(\n mfs: Sequence[MolecularFamily], strains: StrainCollection\n) -> pd.DataFrame:\n \"\"\"Get the occurrence of strains in molecular families.\n\n The occurrence is a DataFrame with MolecularFamily objects as index and Strain objects as\n columns, and the values are 1 if the molecular family occurs in the strain, 0 otherwise.\n \"\"\"\n df_mf_strain = pd.DataFrame(\n 0,\n index=mfs,\n columns=list(strains),\n dtype=int,\n ) # type: ignore\n for mf in mfs:\n for strain in strains:\n if mf.has_strain(strain):\n df_mf_strain.loc[mf, strain] = 1\n return df_mf_strain # type: ignore\n
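A short sketch of how the three helpers above fit together; npl is assumed to be a fully loaded NPLinker object, as in the scoring docs.
from nplinker.scoring.utils import (\n    get_presence_gcf_strain,\n    get_presence_mf_strain,\n    get_presence_spec_strain,\n)\n\n# npl is assumed to be a loaded NPLinker object\npresence_gcf = get_presence_gcf_strain(npl.gcfs, npl.strains)\npresence_spec = get_presence_spec_strain(npl.spectra, npl.strains)\npresence_mf = get_presence_mf_strain(npl.mfs, npl.strains)\nprint(presence_gcf.shape)  # (#GCFs, #strains); values are 0 or 1\n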
"},{"location":"api/strain/","title":"Data Models","text":""},{"location":"api/strain/#nplinker.strain","title":"nplinker.strain","text":""},{"location":"api/strain/#nplinker.strain.Strain","title":"Strain","text":"Strain(id: str)\n
Class to model the mapping between strain id and its aliases.
It's recommended to use NCBI taxonomy strain id or name as the primary id.
Attributes:
id
(str
) \u2013 The representative id of the strain.
names
(set[str]
) \u2013 A set of names associated with the strain.
aliases
(set[str]
) \u2013 A set of aliases associated with the strain.
Parameters:
id
(str
) \u2013 the representative id of the strain.
src/nplinker/strain/strain.py
def __init__(self, id: str) -> None:\n \"\"\"To model the mapping between strain id and its aliases.\n\n Args:\n id: the representative id of the strain.\n \"\"\"\n self.id: str = id\n self._aliases: set[str] = set()\n
"},{"location":"api/strain/#nplinker.strain.Strain.id","title":"id instance-attribute
","text":"id: str = id\n
"},{"location":"api/strain/#nplinker.strain.Strain.names","title":"names property
","text":"names: set[str]\n
Get the set of strain names including id and aliases.
Returns:
set[str]
\u2013 A set of names associated with the strain.
"},{"location":"api/strain/#nplinker.strain.Strain.aliases","title":"aliases property
","text":"aliases: set[str]\n
Get the set of known aliases.
Returns:
set[str]
\u2013 A set of aliases associated with the strain.
"},{"location":"api/strain/#nplinker.strain.Strain.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/strain/strain.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/strain/#nplinker.strain.Strain.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/strain/strain.py
def __str__(self) -> str:\n return f\"Strain({self.id}) [{len(self._aliases)} aliases]\"\n
"},{"location":"api/strain/#nplinker.strain.Strain.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/strain/strain.py
def __eq__(self, other) -> bool:\n if isinstance(other, Strain):\n return self.id == other.id\n return NotImplemented\n
"},{"location":"api/strain/#nplinker.strain.Strain.__hash__","title":"__hash__","text":"__hash__() -> int\n
Hash function for Strain.
Note that Strain is a mutable container, so we hash only on the id so that the hash value does not change when self._aliases
is updated.
src/nplinker/strain/strain.py
def __hash__(self) -> int:\n    \"\"\"Hash function for Strain.\n\n    Note that Strain is a mutable container, so here we hash on only the id\n    so that the hash value does not change when `self._aliases` is updated.\n    \"\"\"\n    return hash(self.id)\n
"},{"location":"api/strain/#nplinker.strain.Strain.__contains__","title":"__contains__","text":"__contains__(alias: str) -> bool\n
Source code in src/nplinker/strain/strain.py
def __contains__(self, alias: str) -> bool:\n if not isinstance(alias, str):\n raise TypeError(f\"Expected str, got {type(alias)}\")\n return alias in self._aliases\n
"},{"location":"api/strain/#nplinker.strain.Strain.add_alias","title":"add_alias","text":"add_alias(alias: str) -> None\n
Add an alias for the strain.
Parameters:
alias
(str
) \u2013 The alias to add for the strain.
src/nplinker/strain/strain.py
def add_alias(self, alias: str) -> None:\n \"\"\"Add an alias for the strain.\n\n Args:\n alias: The alias to add for the strain.\n \"\"\"\n if not isinstance(alias, str):\n raise TypeError(f\"Expected str, got {type(alias)}\")\n if len(alias) == 0:\n logger.warning(\"Refusing to add an empty-string alias to strain {%s}\", self)\n else:\n self._aliases.add(alias)\n
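A minimal sketch of the Strain API documented above; the ids are hypothetical.
from nplinker.strain import Strain\n\nstrain = Strain(\"NCBI_12345\")  # hypothetical representative id\nstrain.add_alias(\"my_isolate_1\")\nprint(\"my_isolate_1\" in strain)  # True: __contains__ checks the aliases\nprint(strain.names)  # the id plus all aliases\nprint(strain)  # Strain(NCBI_12345) [1 aliases]\n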
"},{"location":"api/strain/#nplinker.strain.StrainCollection","title":"StrainCollection","text":"StrainCollection()\n
A collection of Strain
objects.
src/nplinker/strain/strain_collection.py
def __init__(self) -> None:\n # the order of strains is needed for scoring part, so use a list\n self._strains: list[Strain] = []\n self._strain_dict_name: dict[str, list[Strain]] = {}\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__repr__","title":"__repr__","text":"__repr__() -> str\n
Source code in src/nplinker/strain/strain_collection.py
def __repr__(self) -> str:\n return str(self)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__str__","title":"__str__","text":"__str__() -> str\n
Source code in src/nplinker/strain/strain_collection.py
def __str__(self) -> str:\n if len(self) > 20:\n return f\"StrainCollection(n={len(self)})\"\n\n return f\"StrainCollection(n={len(self)}) [\" + \",\".join(s.id for s in self._strains) + \"]\"\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__len__","title":"__len__","text":"__len__() -> int\n
Source code in src/nplinker/strain/strain_collection.py
def __len__(self) -> int:\n return len(self._strains)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__eq__","title":"__eq__","text":"__eq__(other) -> bool\n
Source code in src/nplinker/strain/strain_collection.py
def __eq__(self, other) -> bool:\n if isinstance(other, StrainCollection):\n return (\n self._strains == other._strains\n and self._strain_dict_name == other._strain_dict_name\n )\n return NotImplemented\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__add__","title":"__add__","text":"__add__(other) -> StrainCollection\n
Source code in src/nplinker/strain/strain_collection.py
def __add__(self, other) -> StrainCollection:\n if isinstance(other, StrainCollection):\n sc = StrainCollection()\n for strain in self._strains:\n sc.add(strain)\n for strain in other._strains:\n sc.add(strain)\n return sc\n return NotImplemented\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__contains__","title":"__contains__","text":"__contains__(item: Strain) -> bool\n
Check if the strain collection contains the given Strain object.
Source code in src/nplinker/strain/strain_collection.py
def __contains__(self, item: Strain) -> bool:\n \"\"\"Check if the strain collection contains the given Strain object.\"\"\"\n if isinstance(item, Strain):\n return item.id in self._strain_dict_name\n raise TypeError(f\"Expected Strain, got {type(item)}\")\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.__iter__","title":"__iter__","text":"__iter__() -> Iterator[Strain]\n
Source code in src/nplinker/strain/strain_collection.py
def __iter__(self) -> Iterator[Strain]:\n return iter(self._strains)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.add","title":"add","text":"add(strain: Strain) -> None\n
Add strain to the collection.
If the strain already exists, merge the aliases.
Parameters:
strain
(Strain
) \u2013 The strain to add.
src/nplinker/strain/strain_collection.py
def add(self, strain: Strain) -> None:\n \"\"\"Add strain to the collection.\n\n If the strain already exists, merge the aliases.\n\n Args:\n strain: The strain to add.\n \"\"\"\n if strain in self._strains:\n # only one strain object per id\n strain_ref = self._strain_dict_name[strain.id][0]\n new_aliases = [alias for alias in strain.aliases if alias not in strain_ref.aliases]\n for alias in new_aliases:\n strain_ref.add_alias(alias)\n if alias not in self._strain_dict_name:\n self._strain_dict_name[alias] = [strain_ref]\n else:\n self._strain_dict_name[alias].append(strain_ref)\n else:\n self._strains.append(strain)\n for name in strain.names:\n if name not in self._strain_dict_name:\n self._strain_dict_name[name] = [strain]\n else:\n self._strain_dict_name[name].append(strain)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.remove","title":"remove","text":"remove(strain: Strain) -> None\n
Remove a strain from the collection.
It removes the given strain object from the collection by strain id. If the strain id is not found, raise ValueError
.
Parameters:
strain
(Strain
) \u2013 The strain to remove.
Raises:
ValueError
\u2013 If the strain is not found in the collection.
src/nplinker/strain/strain_collection.py
def remove(self, strain: Strain) -> None:\n \"\"\"Remove a strain from the collection.\n\n It removes the given strain object from the collection by strain id.\n If the strain id is not found, raise `ValueError`.\n\n Args:\n strain: The strain to remove.\n\n Raises:\n ValueError: If the strain is not found in the collection.\n \"\"\"\n if strain in self._strains:\n self._strains.remove(strain)\n # only one strain object per id\n strain_ref = self._strain_dict_name[strain.id][0]\n for name in strain_ref.names:\n if name in self._strain_dict_name:\n new_strain_list = [s for s in self._strain_dict_name[name] if s.id != strain.id]\n if not new_strain_list:\n del self._strain_dict_name[name]\n else:\n self._strain_dict_name[name] = new_strain_list\n else:\n raise ValueError(f\"Strain {strain} not found in the strain collection.\")\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.filter","title":"filter","text":"filter(strain_set: set[Strain])\n
Remove all strains that are not in strain_set
from the strain collection.
Parameters:
strain_set
(set[Strain]
) \u2013 Set of strains to keep.
src/nplinker/strain/strain_collection.py
def filter(self, strain_set: set[Strain]):\n \"\"\"Remove all strains that are not in `strain_set` from the strain collection.\n\n Args:\n strain_set: Set of strains to keep.\n \"\"\"\n # note that we need to copy the list of strains, as we are modifying it\n for strain in self._strains.copy():\n if strain not in strain_set:\n self.remove(strain)\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.intersection","title":"intersection","text":"intersection(other: StrainCollection) -> StrainCollection\n
Get the intersection of two strain collections.
Parameters:
other
(StrainCollection
) \u2013 The other strain collection to compare.
Returns:
StrainCollection
\u2013 StrainCollection object containing the strains that are in both collections.
src/nplinker/strain/strain_collection.py
def intersection(self, other: StrainCollection) -> StrainCollection:\n \"\"\"Get the intersection of two strain collections.\n\n Args:\n other: The other strain collection to compare.\n\n Returns:\n StrainCollection object containing the strains that are in both collections.\n \"\"\"\n intersection = StrainCollection()\n for strain in self:\n if strain in other:\n intersection.add(strain)\n return intersection\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.has_name","title":"has_name","text":"has_name(name: str) -> bool\n
Check if the strain collection contains the given strain name (id or alias).
Parameters:
name
(str
) \u2013 Strain name (id or alias) to check.
Returns:
bool
\u2013 True if the strain name is in the collection, False otherwise.
src/nplinker/strain/strain_collection.py
def has_name(self, name: str) -> bool:\n \"\"\"Check if the strain collection contains the given strain name (id or alias).\n\n Args:\n name: Strain name (id or alias) to check.\n\n Returns:\n True if the strain name is in the collection, False otherwise.\n \"\"\"\n return name in self._strain_dict_name\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.lookup","title":"lookup","text":"lookup(name: str) -> list[Strain]\n
Lookup a strain by name (id or alias).
Parameters:
name
(str
) \u2013 Strain name (id or alias) to lookup.
Returns:
list[Strain]
\u2013 List of Strain objects with the given name.
Raises:
ValueError
\u2013 If the strain name is not found.
src/nplinker/strain/strain_collection.py
def lookup(self, name: str) -> list[Strain]:\n \"\"\"Lookup a strain by name (id or alias).\n\n Args:\n name: Strain name (id or alias) to lookup.\n\n Returns:\n List of Strain objects with the given name.\n\n Raises:\n ValueError: If the strain name is not found.\n \"\"\"\n if name in self._strain_dict_name:\n return self._strain_dict_name[name]\n raise ValueError(f\"Strain {name} not found in the strain collection.\")\n
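A minimal sketch of the collection API documented above; the ids are hypothetical.
from nplinker.strain import Strain, StrainCollection\n\nsc = StrainCollection()\nstrain = Strain(\"strain1\")\nstrain.add_alias(\"s1\")\nsc.add(strain)\n\nprint(len(sc))  # 1\nprint(sc.has_name(\"s1\"))  # True: aliases are indexed as names too\nprint(sc.lookup(\"s1\"))  # [Strain(strain1) [1 aliases]]\n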
"},{"location":"api/strain/#nplinker.strain.StrainCollection.read_json","title":"read_json staticmethod
","text":"read_json(file: str | PathLike) -> StrainCollection\n
Read a strain mappings JSON file and return a StrainCollection
object.
Parameters:
file
(str | PathLike
) \u2013 Path to the strain mappings JSON file.
Returns:
StrainCollection
\u2013 StrainCollection
object.
src/nplinker/strain/strain_collection.py
@staticmethod\ndef read_json(file: str | PathLike) -> StrainCollection:\n \"\"\"Read a strain mappings JSON file and return a `StrainCollection` object.\n\n Args:\n file: Path to the strain mappings JSON file.\n\n Returns:\n `StrainCollection` object.\n \"\"\"\n with open(file, \"r\") as f:\n json_data = json.load(f)\n\n # validate json data\n validate(instance=json_data, schema=STRAIN_MAPPINGS_SCHEMA)\n\n strain_collection = StrainCollection()\n for data in json_data[\"strain_mappings\"]:\n strain = Strain(data[\"strain_id\"])\n for alias in data[\"strain_alias\"]:\n strain.add_alias(alias)\n strain_collection.add(strain)\n return strain_collection\n
"},{"location":"api/strain/#nplinker.strain.StrainCollection.to_json","title":"to_json","text":"to_json(file: str | PathLike | None = None) -> str | None\n
Convert the StrainCollection
object to a JSON string.
Parameters:
file
(str | PathLike | None
, default: None
) \u2013 Path to output JSON file. If None, return the JSON string instead.
Returns:
str | None
\u2013 If input file
is None, return the JSON string. Otherwise, write the JSON string to the given file.
src/nplinker/strain/strain_collection.py
def to_json(self, file: str | PathLike | None = None) -> str | None:\n \"\"\"Convert the `StrainCollection` object to a JSON string.\n\n Args:\n file: Path to output JSON file. If None, return the JSON string instead.\n\n Returns:\n If input `file` is None, return the JSON string. Otherwise, write the JSON string to the given\n file.\n \"\"\"\n data_list = [\n {\"strain_id\": strain.id, \"strain_alias\": list(strain.aliases)} for strain in self\n ]\n json_data = {\"strain_mappings\": data_list, \"version\": \"1.0\"}\n\n # validate json data\n validate(instance=json_data, schema=STRAIN_MAPPINGS_SCHEMA)\n\n if file is not None:\n with open(file, \"w\") as f:\n json.dump(json_data, f)\n return None\n return json.dumps(json_data)\n
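A sketch of a JSON round trip with the two methods above; the file name follows the working directory conventions.
from nplinker.strain import Strain, StrainCollection\n\nsc = StrainCollection()\nsc.add(Strain(\"strain1\"))\nsc.to_json(\"strain_mappings.json\")  # writes the file and returns None\nsc2 = StrainCollection.read_json(\"strain_mappings.json\")\nassert sc == sc2  # the round trip is expected to preserve the collection\n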
"},{"location":"api/strain_utils/","title":"Utilities","text":""},{"location":"api/strain_utils/#nplinker.strain.utils","title":"nplinker.strain.utils","text":""},{"location":"api/strain_utils/#nplinker.strain.utils.load_user_strains","title":"load_user_strains","text":"load_user_strains(json_file: str | PathLike) -> set[Strain]\n
Load user specified strains from a JSON file.
The JSON file will be validated against the schema USER_STRAINS_SCHEMA.
The content of the JSON file could be, for example:
{\"strain_ids\": [\"strain1\", \"strain2\"]}\n
Parameters:
json_file
(str | PathLike
) \u2013 Path to the JSON file containing user specified strains.
Returns:
set[Strain]
\u2013 A set of user specified strains.
src/nplinker/strain/utils.py
def load_user_strains(json_file: str | PathLike) -> set[Strain]:\n \"\"\"Load user specified strains from a JSON file.\n\n The JSON file will be validated against the schema\n [USER_STRAINS_SCHEMA][nplinker.schemas.USER_STRAINS_SCHEMA]\n\n The content of the JSON file could be, for example:\n ```\n {\"strain_ids\": [\"strain1\", \"strain2\"]}\n ```\n\n Args:\n json_file: Path to the JSON file containing user specified strains.\n\n Returns:\n A set of user specified strains.\n \"\"\"\n with open(json_file, \"r\") as f:\n json_data = json.load(f)\n\n # validate json data\n validate(instance=json_data, schema=USER_STRAINS_SCHEMA)\n\n strains = set()\n for strain_id in json_data[\"strain_ids\"]:\n strains.add(Strain(strain_id))\n\n return strains\n
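A sketch of loading a strain selection file named as in the working directory docs (strains_selected.json).
from nplinker.strain.utils import load_user_strains\n\nstrains = load_user_strains(\"strains_selected.json\")\nprint(len(strains))  # a set of Strain objects\n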
"},{"location":"api/strain_utils/#nplinker.strain.utils.podp_generate_strain_mappings","title":"podp_generate_strain_mappings","text":"podp_generate_strain_mappings(\n podp_project_json_file: str | PathLike,\n genome_status_json_file: str | PathLike,\n genome_bgc_mappings_file: str | PathLike,\n gnps_file_mappings_file: str | PathLike,\n output_json_file: str | PathLike,\n) -> StrainCollection\n
Generate strain mappings JSON file for PODP pipeline.
To get the strain mappings, we need to combine the following mappings:
strain_id <-> original_genome_id <-> resolved_genome_id <-> bgc_id
strain_id <-> MS_filename <-> spectrum_id
These mappings are extracted from the following files:
\"strain_id <-> original_genome_id\" is extracted from podp_project_json_file.
\"original_genome_id <-> resolved_genome_id\" is extracted from genome_status_json_file.
\"resolved_genome_id <-> bgc_id\" is extracted from genome_bgc_mappings_file.
\"strain_id <-> MS_filename\" is extracted from podp_project_json_file.
\"MS_filename <-> spectrum_id\" is extracted from gnps_file_mappings_file.
Parameters:
podp_project_json_file
(str | PathLike
) \u2013 The path to the PODP project JSON file.
genome_status_json_file
(str | PathLike
) \u2013 The path to the genome status JSON file.
genome_bgc_mappings_file
(str | PathLike
) \u2013 The path to the genome BGC mappings JSON file.
gnps_file_mappings_file
(str | PathLike
) \u2013 The path to the GNPS file mappings file (csv or tsv).
output_json_file
(str | PathLike
) \u2013 The path to the output JSON file.
Returns:
StrainCollection
\u2013 The strain mappings stored in a StrainCollection object.
See Also:
extract_mappings_strain_id_original_genome_id: Extract mappings \"strain_id <-> original_genome_id\".
extract_mappings_original_genome_id_resolved_genome_id: Extract mappings \"original_genome_id <-> resolved_genome_id\".
extract_mappings_resolved_genome_id_bgc_id: Extract mappings \"resolved_genome_id <-> bgc_id\".
get_mappings_strain_id_bgc_id: Get mappings \"strain_id <-> bgc_id\".
extract_mappings_strain_id_ms_filename: Extract mappings \"strain_id <-> MS_filename\".
extract_mappings_ms_filename_spectrum_id: Extract mappings \"MS_filename <-> spectrum_id\".
get_mappings_strain_id_spectrum_id: Get mappings \"strain_id <-> spectrum_id\".
src/nplinker/strain/utils.py
def podp_generate_strain_mappings(\n podp_project_json_file: str | PathLike,\n genome_status_json_file: str | PathLike,\n genome_bgc_mappings_file: str | PathLike,\n gnps_file_mappings_file: str | PathLike,\n output_json_file: str | PathLike,\n) -> StrainCollection:\n \"\"\"Generate strain mappings JSON file for PODP pipeline.\n\n To get the strain mappings, we need to combine the following mappings:\n\n - strain_id <-> original_genome_id <-> resolved_genome_id <-> bgc_id\n - strain_id <-> MS_filename <-> spectrum_id\n\n These mappings are extracted from the following files:\n\n - \"strain_id <-> original_genome_id\" is extracted from `podp_project_json_file`.\n - \"original_genome_id <-> resolved_genome_id\" is extracted from `genome_status_json_file`.\n - \"resolved_genome_id <-> bgc_id\" is extracted from `genome_bgc_mappings_file`.\n - \"strain_id <-> MS_filename\" is extracted from `podp_project_json_file`.\n - \"MS_filename <-> spectrum_id\" is extracted from `gnps_file_mappings_file`.\n\n Args:\n podp_project_json_file: The path to the PODP project\n JSON file.\n genome_status_json_file: The path to the genome status\n JSON file.\n genome_bgc_mappings_file: The path to the genome BGC\n mappings JSON file.\n gnps_file_mappings_file: The path to the GNPS file\n mappings file (csv or tsv).\n output_json_file: The path to the output JSON file.\n\n Returns:\n The strain mappings stored in a StrainCollection object.\n\n See Also:\n - `extract_mappings_strain_id_original_genome_id`: Extract mappings\n \"strain_id <-> original_genome_id\".\n - `extract_mappings_original_genome_id_resolved_genome_id`: Extract mappings\n \"original_genome_id <-> resolved_genome_id\".\n - `extract_mappings_resolved_genome_id_bgc_id`: Extract mappings\n \"resolved_genome_id <-> bgc_id\".\n - `get_mappings_strain_id_bgc_id`: Get mappings \"strain_id <-> bgc_id\".\n - `extract_mappings_strain_id_ms_filename`: Extract mappings\n \"strain_id <-> MS_filename\".\n - `extract_mappings_ms_filename_spectrum_id`: Extract mappings\n \"MS_filename <-> spectrum_id\".\n - `get_mappings_strain_id_spectrum_id`: Get mappings \"strain_id <-> spectrum_id\".\n \"\"\"\n # Get mappings strain_id <-> original_genome_id <-> resolved_genome_id <-> bgc_id\n mappings_strain_id_bgc_id = get_mappings_strain_id_bgc_id(\n extract_mappings_strain_id_original_genome_id(podp_project_json_file),\n extract_mappings_original_genome_id_resolved_genome_id(genome_status_json_file),\n extract_mappings_resolved_genome_id_bgc_id(genome_bgc_mappings_file),\n )\n\n # Get mappings strain_id <-> MS_filename <-> spectrum_id\n mappings_strain_id_spectrum_id = get_mappings_strain_id_spectrum_id(\n extract_mappings_strain_id_ms_filename(podp_project_json_file),\n extract_mappings_ms_filename_spectrum_id(gnps_file_mappings_file),\n )\n\n # Get mappings strain_id <-> bgc_id / spectrum_id\n mappings = mappings_strain_id_bgc_id.copy()\n for strain_id, spectrum_ids in mappings_strain_id_spectrum_id.items():\n if strain_id in mappings:\n mappings[strain_id].update(spectrum_ids)\n else:\n mappings[strain_id] = spectrum_ids.copy()\n\n # Create StrainCollection\n sc = StrainCollection()\n for strain_id, bgc_ids in mappings.items():\n if not sc.has_name(strain_id):\n strain = Strain(strain_id)\n for bgc_id in bgc_ids:\n strain.add_alias(bgc_id)\n sc.add(strain)\n else:\n # strain_list has only one element\n strain_list = sc.lookup(strain_id)\n for bgc_id in bgc_ids:\n strain_list[0].add_alias(bgc_id)\n\n # Write strain mappings JSON file\n sc.to_json(output_json_file)\n 
logger.info(\"Generated strain mappings JSON file: %s\", output_json_file)\n\n return sc\n
"},{"location":"api/utils/","title":"Utilities","text":""},{"location":"api/utils/#nplinker.utils","title":"nplinker.utils","text":""},{"location":"api/utils/#nplinker.utils.calculate_md5","title":"calculate_md5","text":"calculate_md5(\n fpath: str | PathLike, chunk_size: int = 1024 * 1024\n) -> str\n
Calculate the MD5 checksum of a file.
Parameters:
fpath
(str | PathLike
) \u2013 Path to the file.
chunk_size
(int
, default: 1024 * 1024
) \u2013 Chunk size for reading the file. Defaults to 1024*1024.
Returns:
str
\u2013 MD5 checksum of the file.
src/nplinker/utils.py
def calculate_md5(fpath: str | PathLike, chunk_size: int = 1024 * 1024) -> str:\n \"\"\"Calculate the MD5 checksum of a file.\n\n Args:\n fpath: Path to the file.\n chunk_size: Chunk size for reading the file. Defaults to 1024*1024.\n\n Returns:\n MD5 checksum of the file.\n \"\"\"\n if sys.version_info >= (3, 9):\n md5 = hashlib.md5(usedforsecurity=False)\n else:\n md5 = hashlib.md5()\n with open(fpath, \"rb\") as f:\n for chunk in iter(lambda: f.read(chunk_size), b\"\"):\n md5.update(chunk)\n return md5.hexdigest()\n
"},{"location":"api/utils/#nplinker.utils.check_disk_space","title":"check_disk_space","text":"check_disk_space(func)\n
A decorator to check available disk space.
If the available disk space is less than 50GB, a warning is logged and raised.
Warns:
UserWarning
\u2013 If the available disk space is less than 50GB.
src/nplinker/utils.py
def check_disk_space(func):\n    \"\"\"A decorator to check available disk space.\n\n    If the available disk space is less than 50GB, log and raise a warning.\n\n    Warnings:\n        UserWarning: If the available disk space is less than 50GB.\n    \"\"\"\n\n    @functools.wraps(func)\n    def wrapper_check_disk_space(*args, **kwargs):\n        _, _, free = shutil.disk_usage(\"/\")\n        free_gb = free // (2**30)\n        if free_gb < 50:\n            warning_message = f\"Available disk space is {free_gb}GB. Is it enough for your project?\"\n            logger.warning(warning_message)\n            warnings.warn(warning_message, UserWarning)\n        return func(*args, **kwargs)\n\n    return wrapper_check_disk_space\n
"},{"location":"api/utils/#nplinker.utils.check_md5","title":"check_md5","text":"check_md5(fpath: str | PathLike, md5: str) -> bool\n
Verify the MD5 checksum of a file.
Parameters:
fpath
(str | PathLike
) \u2013 Path to the file.
md5
(str
) \u2013 MD5 checksum to verify.
Returns:
bool
\u2013 True if the MD5 checksum matches, False otherwise.
src/nplinker/utils.py
def check_md5(fpath: str | PathLike, md5: str) -> bool:\n \"\"\"Verify the MD5 checksum of a file.\n\n Args:\n fpath: Path to the file.\n md5: MD5 checksum to verify.\n\n Returns:\n True if the MD5 checksum matches, False otherwise.\n \"\"\"\n return md5 == calculate_md5(fpath)\n
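A sketch pairing the two helpers above; the archive name is borrowed from the working directory example and is otherwise hypothetical.
from nplinker.utils import calculate_md5, check_md5\n\nfpath = \"downloads/GCF_000016425.1.zip\"  # hypothetical local file\nmd5 = calculate_md5(fpath)\nassert check_md5(fpath, md5)\n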
"},{"location":"api/utils/#nplinker.utils.download_and_extract_archive","title":"download_and_extract_archive","text":"download_and_extract_archive(\n url: str,\n download_root: str | PathLike,\n extract_root: str | Path | None = None,\n filename: str | None = None,\n md5: str | None = None,\n remove_finished: bool = False,\n) -> None\n
Download an archive file and then extract it.
This method is a wrapper of download_url
and extract_archive
functions.
Parameters:
url
(str
) \u2013 URL to download file from
download_root
(str | PathLike
) \u2013 Path to the directory to place downloaded file in. If it doesn't exist, it will be created.
extract_root
(str | Path | None
, default: None
) \u2013 Path to the directory the file will be extracted to. The given directory will be created if not exist. If omitted, the download_root
is used.
filename
(str | None
, default: None
) \u2013 Name to save the downloaded file under. If None, use the basename of the URL
md5
(str | None
, default: None
) \u2013 MD5 checksum of the download. If None, do not check
remove_finished
(bool
, default: False
) \u2013 If True
, remove the downloaded file after the extraction. Defaults to False.
src/nplinker/utils.py
def download_and_extract_archive(\n url: str,\n download_root: str | PathLike,\n extract_root: str | Path | None = None,\n filename: str | None = None,\n md5: str | None = None,\n remove_finished: bool = False,\n) -> None:\n \"\"\"Download an archive file and then extract it.\n\n This method is a wrapper of [`download_url`][nplinker.utils.download_url] and\n [`extract_archive`][nplinker.utils.extract_archive] functions.\n\n Args:\n url: URL to download file from\n download_root: Path to the directory to place downloaded\n file in. If it doesn't exist, it will be created.\n extract_root: Path to the directory the file\n will be extracted to. The given directory will be created if not exist.\n If omitted, the `download_root` is used.\n filename: Name to save the downloaded file under.\n If None, use the basename of the URL\n md5: MD5 checksum of the download. If None, do not check\n remove_finished: If `True`, remove the downloaded file\n after the extraction. Defaults to False.\n \"\"\"\n download_root = Path(download_root)\n if extract_root is None:\n extract_root = download_root\n else:\n extract_root = Path(extract_root)\n if not filename:\n filename = Path(url).name\n\n download_url(url, download_root, filename, md5)\n\n archive = download_root / filename\n extract_archive(archive, extract_root, remove_finished=remove_finished)\n
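A sketch with a hypothetical URL; the directory names follow the working directory conventions used elsewhere in these docs.
from nplinker.utils import download_and_extract_archive\n\ndownload_and_extract_archive(\n    url=\"https://example.com/GCF_000016425.1.zip\",  # hypothetical URL\n    download_root=\"downloads\",\n    extract_root=\"antismash\",\n    md5=None,  # skip checksum verification\n    remove_finished=False,  # keep the downloaded archive\n)\n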
"},{"location":"api/utils/#nplinker.utils.download_url","title":"download_url","text":"download_url(\n url: str,\n root: str | PathLike,\n filename: str | None = None,\n md5: str | None = None,\n http_method: str = \"GET\",\n allow_http_redirect: bool = True,\n) -> None\n
Download a file from a url and place it in root.
Parameters:
url
(str
) \u2013 URL to download file from
root
(str | PathLike
) \u2013 Directory to place downloaded file in. If it doesn't exist, it will be created.
filename
(str | None
, default: None
) \u2013 Name to save the file under. If None, use the basename of the URL.
md5
(str | None
, default: None
) \u2013 MD5 checksum of the download. If None, do not check.
http_method
(str
, default: 'GET'
) \u2013 HTTP request method, e.g. \"GET\", \"POST\". Defaults to \"GET\".
allow_http_redirect
(bool
, default: True
) \u2013 If true, enable following redirects for all HTTP (\"http:\") methods.
src/nplinker/utils.py
@check_disk_space\ndef download_url(\n url: str,\n root: str | PathLike,\n filename: str | None = None,\n md5: str | None = None,\n http_method: str = \"GET\",\n allow_http_redirect: bool = True,\n) -> None:\n \"\"\"Download a file from a url and place it in root.\n\n Args:\n url: URL to download file from\n root: Directory to place downloaded file in. If it doesn't exist, it will be created.\n filename: Name to save the file under. If None, use the\n basename of the URL.\n md5: MD5 checksum of the download. If None, do not check.\n http_method: HTTP request method, e.g. \"GET\", \"POST\".\n Defaults to \"GET\".\n allow_http_redirect: If true, enable following redirects for all HTTP (\"http:\") methods.\n \"\"\"\n root = transform_to_full_path(root)\n # create the download directory if not exist\n root.mkdir(exist_ok=True)\n if not filename:\n filename = Path(url).name\n fpath = root / filename\n\n # check if file is already present locally\n if fpath.is_file() and md5 is not None and check_md5(fpath, md5):\n logger.info(\"Using downloaded and verified file: \" + str(fpath))\n return\n\n # download the file\n logger.info(f\"Downloading {filename} to {root}\")\n with open(fpath, \"wb\") as fh:\n with httpx.stream(http_method, url, follow_redirects=allow_http_redirect) as response:\n if not response.is_success:\n fpath.unlink(missing_ok=True)\n raise RuntimeError(\n f\"Failed to download url {url} with status code {response.status_code}\"\n )\n total = int(response.headers.get(\"Content-Length\", 0))\n\n with Progress(\n TextColumn(\"[progress.description]{task.description}\"),\n BarColumn(bar_width=None),\n \"[progress.percentage]{task.percentage:>3.1f}%\",\n \"\u2022\",\n DownloadColumn(),\n \"\u2022\",\n TransferSpeedColumn(),\n \"\u2022\",\n TimeRemainingColumn(),\n \"\u2022\",\n TimeElapsedColumn(),\n ) as progress:\n task = progress.add_task(f\"[hot_pink]Downloading {fpath.name}\", total=total)\n for chunk in response.iter_bytes():\n fh.write(chunk)\n progress.update(task, advance=len(chunk))\n\n # check integrity of downloaded file\n if md5 is not None and not check_md5(fpath, md5):\n raise RuntimeError(\"MD5 validation failed.\")\n
"},{"location":"api/utils/#nplinker.utils.extract_archive","title":"extract_archive","text":"extract_archive(\n from_path: str | PathLike,\n extract_root: str | PathLike | None = None,\n members: list | None = None,\n remove_finished: bool = False,\n) -> str\n
Extract an archive.
The archive type and a possible compression is automatically detected from the file name.
If the file is compressed but not an archive, the call is dispatched to _decompress
function.
Parameters:
from_path
(str | PathLike
) \u2013 Path to the file to be extracted.
extract_root
(str | PathLike | None
, default: None
) \u2013 Path to the directory the file will be extracted to. The given directory will be created if not exist. If omitted, the directory of the archive file is used.
members
(list | None
, default: None
) \u2013 Optional selection of members to extract. If not specified, all members are extracted. Members must be a subset of the list returned by zipfile.ZipFile.namelist()
(or a list of strings) for a zip file, or by tarfile.TarFile.getmembers()
for a tar file.
remove_finished
(bool
, default: False
) \u2013 If True
, remove the file after the extraction.
Returns:
str
\u2013 Path to the directory the file was extracted to.
src/nplinker/utils.py
def extract_archive(\n from_path: str | PathLike,\n extract_root: str | PathLike | None = None,\n members: list | None = None,\n remove_finished: bool = False,\n) -> str:\n \"\"\"Extract an archive.\n\n The archive type and a possible compression is automatically detected from\n the file name.\n\n If the file is compressed but not an archive, the call is dispatched to `_decompress` function.\n\n Args:\n from_path: Path to the file to be extracted.\n extract_root: Path to the directory the file will be extracted to.\n The given directory will be created if not exist.\n If omitted, the directory of the archive file is used.\n members: Optional selection of members to extract. If not specified,\n all members are extracted.\n Members must be a subset of the list returned by\n - `zipfile.ZipFile.namelist()` or a list of strings for zip file\n - `tarfile.TarFile.getmembers()` for tar file\n remove_finished: If `True`, remove the file after the extraction.\n\n Returns:\n Path to the directory the file was extracted to.\n \"\"\"\n from_path = Path(from_path)\n\n if extract_root is None:\n extract_root = from_path.parent\n else:\n extract_root = Path(extract_root)\n\n # create the extract directory if not exist\n extract_root.mkdir(exist_ok=True)\n\n logger.info(f\"Extracting {from_path} to {extract_root}\")\n suffix, archive_type, compression = _detect_file_type(from_path)\n if not archive_type:\n return _decompress(\n from_path,\n extract_root / from_path.name.replace(suffix, \"\"),\n remove_finished=remove_finished,\n )\n\n extractor = _ARCHIVE_EXTRACTORS[archive_type]\n\n extractor(str(from_path), str(extract_root), members, compression)\n if remove_finished:\n from_path.unlink()\n\n return str(extract_root)\n
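A sketch extracting a previously downloaded archive; the paths follow the working directory conventions and are otherwise hypothetical.
from nplinker.utils import extract_archive\n\nextracted_dir = extract_archive(\n    \"downloads/GCF_000016425.1.zip\",\n    extract_root=\"antismash/GCF_000016425.1\",\n    remove_finished=False,\n)\nprint(extracted_dir)  # the directory the archive was extracted to\n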
"},{"location":"api/utils/#nplinker.utils.is_file_format","title":"is_file_format","text":"is_file_format(\n file: str | PathLike, format: str = \"tsv\"\n) -> bool\n
Check if the file is in the given format.
Parameters:
file
(str | PathLike
) \u2013 Path to the file to check.
format
(str
, default: 'tsv'
) \u2013 The format to check for, either \"tsv\" or \"csv\".
Returns:
bool
\u2013 True if the file is in the given format, False otherwise.
src/nplinker/utils.py
def is_file_format(file: str | PathLike, format: str = \"tsv\") -> bool:\n \"\"\"Check if the file is in the given format.\n\n Args:\n file: Path to the file to check.\n format: The format to check for, either \"tsv\" or \"csv\".\n\n Returns:\n True if the file is in the given format, False otherwise.\n \"\"\"\n try:\n with open(file, \"rt\") as f:\n if format == \"tsv\":\n reader = csv.reader(f, delimiter=\"\\t\")\n elif format == \"csv\":\n reader = csv.reader(f, delimiter=\",\")\n else:\n raise ValueError(f\"Unknown format '{format}'.\")\n for _ in reader:\n pass\n return True\n except csv.Error:\n return False\n
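A sketch distinguishing the CSV file_mappings file produced by the FEATURE-BASED-MOLECULAR-NETWORKING workflow from the TSV one used by the other workflows; the path is hypothetical.
from nplinker.utils import is_file_format\n\nif is_file_format(\"gnps/file_mappings.csv\", format=\"csv\"):\n    print(\"comma-separated file mappings\")\n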
"},{"location":"api/utils/#nplinker.utils.list_dirs","title":"list_dirs","text":"list_dirs(\n root: str | PathLike, keep_parent: bool = True\n) -> list[str]\n
List all directories at a given root.
Parameters:
root
(str | PathLike
) \u2013 Path to directory whose folders need to be listed
keep_parent
(bool
, default: True
) \u2013 If true, prepends the path to each result, otherwise only returns the name of the directories found
src/nplinker/utils.py
def list_dirs(root: str | PathLike, keep_parent: bool = True) -> list[str]:\n \"\"\"List all directories at a given root.\n\n Args:\n root: Path to directory whose folders need to be listed\n keep_parent: If true, prepends the path to each result, otherwise\n only returns the name of the directories found\n \"\"\"\n root = transform_to_full_path(root)\n directories = [str(p) for p in root.iterdir() if p.is_dir()]\n if not keep_parent:\n directories = [os.path.basename(d) for d in directories]\n return directories\n
"},{"location":"api/utils/#nplinker.utils.list_files","title":"list_files","text":"list_files(\n root: str | PathLike,\n prefix: str | tuple[str, ...] = \"\",\n suffix: str | tuple[str, ...] = \"\",\n keep_parent: bool = True,\n) -> list[str]\n
List all files at a given root.
Parameters:
root
(str | PathLike
) \u2013 Path to directory whose files need to be listed
prefix
(str | tuple[str, ...]
, default: ''
) \u2013 Prefix of the file names to match. Defaults to the empty string \"\".
suffix
(str | tuple[str, ...]
, default: ''
) \u2013 Suffix of the files to match, e.g. \".png\" or (\".jpg\", \".png\"). Defaults to the empty string \"\".
keep_parent
(bool
, default: True
) \u2013 If true, prepends the parent path to each result, otherwise only returns the name of the files found. Defaults to True.
src/nplinker/utils.py
def list_files(\n    root: str | PathLike,\n    prefix: str | tuple[str, ...] = \"\",\n    suffix: str | tuple[str, ...] = \"\",\n    keep_parent: bool = True,\n) -> list[str]:\n    \"\"\"List all files at a given root.\n\n    Args:\n        root: Path to directory whose files need to be listed\n        prefix: Prefix of the file names to match,\n            Defaults to empty string '\"\"'.\n        suffix: Suffix of the files to match, e.g. \".png\" or\n            (\".jpg\", \".png\").\n            Defaults to empty string '\"\"'.\n        keep_parent: If true, prepends the parent path to each\n            result, otherwise only returns the name of the files found.\n            Defaults to True.\n    \"\"\"\n    root = Path(root)\n    files = [\n        str(p)\n        for p in root.iterdir()\n        if p.is_file() and p.name.startswith(prefix) and p.name.endswith(suffix)\n    ]\n\n    if not keep_parent:\n        files = [os.path.basename(f) for f in files]\n\n    return files\n
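A sketch combining list_dirs and list_files to walk AntiSMASH output; the directory layout follows the working directory docs.
from nplinker.utils import list_dirs, list_files\n\nfor genome_dir in list_dirs(\"antismash\", keep_parent=True):\n    gbk_files = list_files(genome_dir, suffix=\".gbk\", keep_parent=True)\n    print(genome_dir, len(gbk_files))\n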
"},{"location":"api/utils/#nplinker.utils.transform_to_full_path","title":"transform_to_full_path","text":"transform_to_full_path(p: str | PathLike) -> Path\n
Transform a path to a full path.
The path is expanded (i.e. the ~
will be replaced with actual path) and converted to an absolute path (i.e. .
or ..
will be replaced with actual path).
Parameters:
p
(str | PathLike
) \u2013 The path to transform.
Returns:
Path
\u2013 The transformed full path.
src/nplinker/utils.py
def transform_to_full_path(p: str | PathLike) -> Path:\n \"\"\"Transform a path to a full path.\n\n The path is expanded (i.e. the `~` will be replaced with actual path) and converted to an\n absolute path (i.e. `.` or `..` will be replaced with actual path).\n\n Args:\n p: The path to transform.\n\n Returns:\n The transformed full path.\n \"\"\"\n # Multiple calls to `Path` are used to ensure static typing compatibility.\n p = Path(p).expanduser()\n p = Path(p).resolve()\n return Path(p)\n
"},{"location":"concepts/bigscape/","title":"BigScape","text":"NPLinker can run BigScape automatically if the bigscape
directory does not exist in the working directory. Both version 1 and version 2 of BigScape are supported.
See the configuration template for how to set parameters for running BigScape.
See the default configurations for the default parameters used in NPLinker.
"},{"location":"concepts/config_file/","title":"Config File","text":""},{"location":"concepts/config_file/#configuration-template","title":"Configuration Template","text":"#############################\n# NPLinker configuration file\n#############################\n\n# The root directory of the NPLinker project. You need to create it first.\n# The value is required and must be a full path.\nroot_dir = \"<NPLinker root directory>\"\n# The mode for preparing dataset.\n# The available modes are \"podp\" and \"local\".\n# \"podp\" mode is for using the PODP platform (https://pairedomicsdata.bioinformatics.nl/) to prepare the dataset.\n# \"local\" mode is for preparing the dataset locally. So uers do not need to upload their data to the PODP platform.\n# The value is required.\nmode = \"podp\"\n# The PODP project identifier.\n# The value is required if the mode is \"podp\".\npodp_id = \"\"\n\n\n[log]\n# Log level. The available levels are same as the levels in python package `logging`:\n# \"DEBUG\", \"INFO\", \"WARNING\", \"ERROR\", \"CRITICAL\".\n# The default value is \"INFO\".\nlevel = \"INFO\"\n# The log file to append log messages.\n# The value is optional.\n# If not set or use empty string, log messages will not be written to a file.\n# The file will be created if it does not exist. Log messages will be appended to the file if it exists.\nfile = \"path/to/logfile\"\n# Whether to write log meesages to console.\n# The default value is true.\nuse_console = true\n\n\n[mibig]\n# Whether to use mibig metadta (json).\n# The default value is true.\nto_use = true\n# The version of mibig metadata.\n# Make sure using the same version of mibig in bigscape.\n# The default value is \"3.1\"\nversion = \"3.1\"\n\n\n[bigscape]\n# The parameters to use for running BiG-SCAPE.\n# Version of BiG-SCAPE to run. Make sure to change the parameters property below as well\n# when changing versions.\nversion = 1\n# Required BiG-SCAPE parameters.\n# --------------\n# For version 1:\n# -------------\n# Required parameters are: `--mix`, `--include_singletons` and `--cutoffs`. NPLinker needs them to run the analysis properly.\n# Do NOT set these parameters: `--inputdir`, `--outputdir`, `--pfam_dir`. NPLinker will automatically configure them.\n# If parameter `--mibig` is set, make sure to set the config `mibig.to_use` to true and `mibig.version` to the version of mibig in BiG-SCAPE.\n# The default value is \"--mibig --clans-off --mix --include_singletons --cutoffs 0.30\".\n# --------------\n# For version 2:\n# --------------\n# Note that BiG-SCAPE v2 has subcommands. NPLinker requires the `cluster` subcommand and its parameters.\n# Required parameters of `cluster` subcommand are: `--mibig_version`, `--include_singletons` and `--gcf_cutoffs`.\n# DO NOT set these parameters: `--pfam_path`, `--inputdir`, `--outputdir`. NPLinker will automatically configure them.\n# BiG-SCPAPE v2 also runs a `--mix` analysis by default, so you don't need to set this parameter here.\n# Example parameters for BiG-SCAPE v2: \"--mibig_version 3.1 --include_singletons --gcf_cutoffs 0.30\"\nparameters = \"--mibig --clans-off --mix --include_singletons --cutoffs 0.30\"\n# Which bigscape cutoff to use for NPLinker analysis.\n# There might be multiple cutoffs in bigscape output.\n# Note that this value must be a string.\n# The default value is \"0.30\".\ncutoff = \"0.30\"\n\n\n[scoring]\n# Scoring methods.\n# Valid values are \"metcalf\" and \"rosetta\".\n# The default value is \"metcalf\".\nmethods = [\"metcalf\"]\n
"},{"location":"concepts/config_file/#default-configurations","title":"Default Configurations","text":"The default configurations are automatically used by NPLinker if you don't set them in your config file.
# NPLinker default configurations\n\n[log]\nlevel = \"INFO\"\nuse_console = true\n\n[mibig]\nto_use = true\nversion = \"3.1\"\n\n[bigscape]\nversion = 1\nparameters = \"--mibig --clans-off --mix --include_singletons --cutoffs 0.30\"\ncutoff = \"0.30\"\n\n[scoring]\nmethods = [\"metcalf\"]\n
"},{"location":"concepts/config_file/#config-loader","title":"Config loader","text":"You can load the configuration file using the load_config function.
from nplinker.config import load_config\nconfig = load_config('path/to/nplinker.toml')\n
When you use NPLinker as an application, you can get access to the configuration object directly:
from nplinker import NPLinker\nnpl = NPLinker('path/to/nplinker.toml')\nprint(npl.config)\n
"},{"location":"concepts/gnps_data/","title":"GNPS data","text":"NPLinker requires GNPS molecular networking data as input. It currently accepts data from the following GNPS workflows:
METABOLOMICS-SNETS (data should be downloaded from the option Download Clustered Spectra as MGF)
METABOLOMICS-SNETS-V2 (Download Clustered Spectra as MGF)
FEATURE-BASED-MOLECULAR-NETWORKING (Download Cytoscape Data)
METABOLOMICS-SNETS workflow
METABOLOMICS-SNETS-V2
FEATURE-BASED-MOLECULAR-NETWORKING
| NPLinker input | GNPS file in the archive of Download Clustered Spectra as MGF |
| --- | --- |
| spectra.mgf | METABOLOMICS-SNETS*.mgf |
| molecular_families.tsv | networkedges_selfloop/*.pairsinfo |
| annotations.tsv | result_specnets_DB/*.tsv |
| file_mappings.tsv | clusterinfosummarygroup_attributes_withIDs_withcomponentID/*.tsv |
For example, the file METABOLOMICS-SNETS*.mgf
from the downloaded zip archive is used as the spectra.mgf
input file of NPLinker.
When manually preparing GNPS data for NPLinker, the METABOLOMICS-SNETS*.mgf
must be renamed to spectra.mgf
and placed in the gnps
sub-directory of the NPLinker working directory.
Download Clustered Spectra as MGF
spectra.mgf METABOLOMICS-SNETS-V2*.mgf molecular_families.tsv networkedges_selfloop/*.selfloop annotations.tsv result_specnets_DB/*.tsv file_mappings.tsv clusterinfosummarygroup_attributes_withIDs_withcomponentID/*.clustersummary NPLinker input GNPS file in the archive of Download Cytoscape Data
spectra.mgf spectra/*.mgf molecular_families.tsv networkedges_selfloop/*.selfloop annotations.tsv DB_result/*.tsv file_mappings.csv quantification_table/*.csv Note that file_mappings.csv
is a CSV file, not a TSV file, unlike in the other workflows.
NPLinker requires a fixed structure of working directory with fixed names for the input and output data.
root_dir # (1)!\n \u2502\n \u251c\u2500\u2500 nplinker.toml # (2)!\n \u251c\u2500\u2500 strain_mappings.json [F] # (3)!\n \u251c\u2500\u2500 strains_selected.json [F][O] # (4)!\n \u2502\n \u251c\u2500\u2500 gnps [F] # (5)!\n \u2502 \u251c\u2500\u2500 spectra.mgf [F]\n \u2502 \u251c\u2500\u2500 molecular_families.tsv [F]\n \u2502 \u251c\u2500\u2500 annotations.tsv [F]\n \u2502 \u2514\u2500\u2500 file_mappings.tsv (.csv) [F] # (6)!\n \u2502\n \u251c\u2500\u2500 antismash [F] # (7)!\n \u2502 \u251c\u2500\u2500 GCF_000514975.1\n \u2502 \u2502 \u251c\u2500\u2500 xxx.region001.gbk\n \u2502 \u2502 \u2514\u2500\u2500 ...\n \u2502 \u251c\u2500\u2500 GCF_000016425.1\n \u2502 \u2502 \u251c\u2500\u2500 xxxx.region001.gbk\n \u2502 \u2502 \u2514\u2500\u2500 ...\n \u2502 \u2514\u2500\u2500 ...\n \u2502\n \u251c\u2500\u2500 bigscape [F][O] # (8)!\n \u2502 \u251c\u2500\u2500 mix_clustering_c0.30.tsv [F] # (9)!\n \u2502 \u2514\u2500\u2500 bigscape_running_output\n \u2502 \u2514\u2500\u2500 ...\n \u2502\n \u251c\u2500\u2500 downloads [F][A] # (10)!\n \u2502 \u251c\u2500\u2500 paired_datarecord_4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.json # (11)!\n \u2502 \u251c\u2500\u2500 GCF_000016425.1.zip\n \u2502 \u251c\u2500\u2500 GCF_0000514975.1.zip\n \u2502 \u251c\u2500\u2500 c22f44b14a3d450eb836d607cb9521bb.zip\n \u2502 \u251c\u2500\u2500 genome_status.json\n \u2502 \u2514\u2500\u2500 mibig_json_3.1.tar.gz\n \u2502\n \u251c\u2500\u2500 mibig [F][A] # (12)!\n \u2502 \u251c\u2500\u2500 BGC0000001.json\n \u2502 \u251c\u2500\u2500 BGC0000002.json\n \u2502 \u2514\u2500\u2500 ...\n \u2502\n \u251c\u2500\u2500 output [F][A] # (13)!\n \u2502 \u2514\u2500\u2500 ...\n \u2502\n \u2514\u2500\u2500 ... # (14)!\n
root_dir
is the working directory you created, used as the root directory for NPLinker.nplinker.toml
is the configuration file (toml format) provided by the user for running NPLinker. strain_mappings.json
contains the mappings from strain to genomics and metabolomics data. It is generated by NPLinker for podp
mode; for local
mode, users need to create it manually. [F]
means the file name nplinker.toml
is a fixed name (including the extension) and must be named as shown.strains_selected.json
is an optional file containing the list of strains to be used in the analysis. If it is not provided, NPLinker will use all strains detected from the input data. [O]
means the file strains_selected.json
is optional for users to provide.gnps
directory contains the GNPS data. The files in this directory must be named as shown. See XXX for more information about the GNPS data..tsv
or .csv
format.antismash
directory contains a collection of AntiSMASH BGC data. The BGC data (*.region*.gbk
files) must be stored in subdirectories named after the NCBI accession number (e.g. GCF_000514975.1
).bigscape
directory is optional and contains the output of BigScape. If the directory is not provided, NPLinker will run BigScape automatically to generate the data using the AntiSMASH BGC data.mix_clustering_c0.30.tsv
is an example output of BigScape. The file name must follow the pattern mix_clustering_c{cutoff}.tsv
, where {cutoff}
is the cutoff value used in the BigScape run.downloads
directory is automatically created and managed by NPLinker. It stores the downloaded data from the internet. Users can also use it to store their own downloaded data. [A]
means the directory is automatically created and/or managed by NPLinker.
downloads
directory.mibig
directory contains the MIBiG metadata, which is automatically created and downloaded by NPLinker. Users should not interfere with this directory and its content.output
directory is automatically created by NPLinker. It stores the output data of NPLinker.Tip
[F]
means the file or directory name is fixed and must be named as shown. The names are defined in the defaults module.[O]
means the file or directory is optional for users to provide. It does not mean the file or directory is optional for NPLinker to use. If it's not provided by the user, NPLinker may generate it.[A]
means the directory is automatically created and/or managed by NPLinker.The DatasetArranger is implemented according to the following flowcharts.
"},{"location":"diagrams/arranger/#strain-mappings-file","title":"Strain mappings file","text":"flowchart TD\n StrainMappings[`strain_mappings.json`] --> SM{Is the mode PODP?}\n SM --> |No |SM0[Validate the file]\n SM --> |Yes|SM1[Generate the file] --> SM0
"},{"location":"diagrams/arranger/#strain-selection-file","title":"Strain selection file","text":"flowchart TD\n StrainsSelected[`strains_selected.json`] --> S{Does the file exist?}\n S --> |No | S0[Nothing to do]\n S --> |Yes| S1[Validate the file]
"},{"location":"diagrams/arranger/#podp-project-metadata-json-file","title":"PODP project metadata json file","text":"flowchart TD\n podp[PODP project metadata json file] --> A{Is the mode PODP?}\n A --> |No | A0[Nothing to do]\n A --> |Yes| P{Does the file exist?}\n P --> |No | P0[Download the file] --> P1\n P --> |Yes| P1[Validate the file]
"},{"location":"diagrams/arranger/#gnps-antismash-and-bigscape","title":"GNPS, AntiSMASH and BigScape","text":"flowchart TD\n ConfigError[Dynaconf config validation error]\n DataError[Data validation error]\n UseIt[Use the data]\n Download[First remove existing data if relevent, then download or generate data]\n\n A[GNPS, antiSMASH and BigSCape] --> B{Pass Dynaconf config validation?}\n B -->|No | ConfigError\n B -->|Yes| G{Is the mode PODP?}\n\n G -->|No, local mode| G1{Does data dir exist?}\n G1 -->|No | DataError\n G1 -->|Yes| H{Pass data validation?}\n H --> |No | DataError\n H --> |Yes| UseIt \n\n G -->|Yes, podp mode| G2{Does data dir exist?}\n G2 --> |No | Download\n G2 --> |Yes | J{Pass data validation?}\n J -->|No | Download --> |try max 2 times| J\n J -->|Yes| UseIt
"},{"location":"diagrams/arranger/#mibig-data","title":"MIBiG Data","text":"MIBiG data is always downloaded automatically. Users cannot provide their own MIBiG data.
flowchart TD\n Mibig[MIBiG] --> M0{Pass Dynaconf config validation?}\n M0 -->|No | M01[Dynaconf config validation error]\n M0 -->|Yes | MibigDownload[First remove existing data if relevant and then download data]
"},{"location":"diagrams/loader/","title":"Dataset Loading Pipeline","text":"The DatasetLoader is implemented according to the following pipeline.
"}]} \ No newline at end of file