IsopacketModeler is a command line tool written in Python for identifying peptides that have incorporated stable isotopes and modeling isotope uptake levels. It is recommended that this tool be run using the protein_SIP_pipeline, as this pipeline does a better job of taking advantage of high performance computing resources.
- Make sure you have a working conda installation. If conda is not installed please follow the instructions at this page: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
- Download the repository as a .zip file from the green "<> Code" drop-down menu in the upper right of the GitHub page and extract the contents. Alternatively, if you have git installed run
git clone https://github.com/stavis1/isopacketModeler/. - On the command line navigate to the root of the downloaded directory and run
conda env create -n isopacketModeler -f env/run.yml. This will create a conda environment and install the necessary dependencies for this tool. - Run
conda activate isopacketModelerto activate this environment.
IsopacketModeler requires four text files to run:
- A TOML formatted options file.
- A tab-separated file specifying the elemental composition of each amino acid.
- A tab-separated file describing the isotopic label added to each raw file.
- A tab-separated file describing the peptide-spectrum matches to use when searching for isotopically enriched peptides.
Additionally, the raw data must be provided as centroid mode mzML files.
Default versions of files 1 and 2, and templates for files 3 and 4 can be found in the templates folder.
The parameters are as follows:
- working_directory: This is the base directory from which all paths in the options file must be relative to.
- output_directory: A directory that will hold the results files. This will be created if not extant.
- overwrite: Whether to overwrite previous outputs. Must be either
trueorfalsein lowercase without quotes. - log_level: Controls the verbosity of console logging. 50 silences output while 10 displays all information.
- log_file: The name of the log file. All logs will be written to log_file regardless of log_level.
- design_file: The name of file 3.
- mzml_dir: The directory that holds mzML files. The path names in the design file must be relative to this directory.
- psms: A list of tab separated files containing PSMs to be fit. The list should be formatted as
['psm_file_1.tsv', 'psm_file_2.tsv'] - psm_headers: A list of column headers to be found in the PSM files. These should contain the peptide sequence, mzml file, scan number, charge, and protein names in that order, regardless of the order these columns are found in the PSM files. The scan number and charge columns should be integers only, i.e.
+2doesn't work. IsopacketModeler only works with positive ion mode data. Any columns listed after the requisite first five will be tracked as additional metadata and will be included in the final output. The formatting of this list is the same as that of thepsmsparameter. - aa_formulae: The name of file 2.
- cores: The maximum number of CPU cores the tool is allowed to use.
- classifier_fdr: The target false discovery rate on the PSM level that the classifier will attempt to control. Only used if
do_psm_classificationistrue - data_generating_processes: A list of data generating processes to fit to each peptide. Options are
BetabinomQuiescentMix,Betabinom,BinomQuiescentMix, andBinom. Any combination of these can be fit in a single run. This list should be formatted the same as thepsmsparameter. See footnote 1 for details on each option. - max_peptide_err: Peptidoforms whose best-fit model has a final loss higher than this value will be excluded from the output. In practice even exceptionally poor fits will rarely have a loss above 0.03 and qualitatively well fitting models often have a loss ~0.01. See footnote 2 for a discussion of what the loss means.
- do_psm_classification: Whether to train a neural network classifier model to do false discovery rate control on the PSM level. Must be either
trueorfalsein lowercase without quotes. - checkpoint_files: If you have a previously terminated run it can be resumed from the checkpoint files. All checkpoint files must be on the same step.
- stopping_point: If you wish to stop early put a checkpoint step here, otherwise this should be
false. Step1is the end of file parsing and step2is after PSM classification and peptide construction.
The TSV file which maps single letter amino acid codes to molecular formulae. If you have modified amino acids please represent them as either single characters (any non whitespace character can be used) or as the base amino acid followed by a bracketed modifier, e.g. S[79.9663]. Each modification should get its own row with the full molecular formula (not just the difference from the unmodified amino acid). If you need elements not listed in the default file add a column with their elemental symbol as the header.
This is the experimental design file. It must have at least two columns 'file' and 'label' which list the mzML files and their heavy label isotopes, respectively. The allowable isotopes are 'H[2]', 'C[13]', 'N[15]', 'O[17]', 'O[18]', 'S[33]', 'S[34]', and 'S[36]'. Deuterium labeling is supported but not recommended, see footnote 3 for details. Control samples should leave this field blank. Only one label is allowed per file. Other columns can be included. The information in these columns will be kept associated with the peptide data and will be included in the output but will otherwise be ignored. This can be used to, for example, keep track of time points or conditions in an experiment.
This is the PSM file, must include columns for peptide sequence, mzml file, scan #, charge, and protein names. These columns do not need to have specific names, rather, the names of the columns corresponding to these pieces of information will be specified in file 1, the options file. This means that tab-separated PSM files from multiple tools can be used with minimal processing. Scan # can be either the MS2 or the parent MS1 scan for the PSM. The charge column must have only integer values, e.g. 2 not +2 and cannot be blank or unknown. The mzML files must be either in the same directory that the command is run from or must be specified as paths relative to that directory. Flanking residues in peptide sequences will be ignored if they are separated by '.', e.g. 'P.EPTID.E' will be interpreted as 'EPTID'. Post-translational modifications can be specified as any ASCII character that is neither a whitespace nor an open bracket or they can be specified as bracket bounded strings of characters following a character. For example either 'm' or 'M[15.9949]' could represent a modified residue. Either of these options must be added to File 2, the amino acid formula file. PTMs that use the bracket notation must include the preceding character in File 2, so an example entry line would look like M[15.9949] 5 9 1 2 1 0.
To run the tool ensure that the isopacketModeler conda environment is active then run python -m isopacketModeler --options options.toml.
The primary output of this tool is a tab separated table called peptides.tsv. Each row of this file is a single peptidoform (a unique peptide sequence with a unique combination of modifications) found in a single file.
- peptide: The peptidoform sequence.
- proteins: The proteins a peptidoform matches to.
- npeaks: The number of isotopic peaks used for fitting the models.
- canonical_fit: The final loss from the fitting procedure for the best fit model. See footnote 2.
- file: The mzML file the peptidoform was found in.
- label: The label isotope.
- PSM_count: The number of peptide-spectrum matches.
- canonical_DGP: The name of the best fit data generating process
- canonical_mean_label_probability: The mean atom percent isotope incorporation level. This is calculated from the best fit distribution.
- canonical_variance: The variance of atom percent isotope incorporation levels, also calculated from the best fit distribution.
- canonical_param0: parameter 0 of the best fit distribution, see below for explanations.
- canonical_param1: parameter 1 of the best fit distribution, see below for explanations.
- canonical_param2: parameter 2 of the best fit distribution, see below for explanations.
Each fitted data generating process will also generate a set of columns that match the
canonical_*columns.
| data generating process | param0 | param1 | param2 |
|---|---|---|---|
BetabinomQuiescentMix |
betabinomial alpha | betabinomial beta | active fraction |
Betabinom |
betabinomial alpha | betabinomial beta | |
BinomQuiescentMix |
binomial p | active fraction | |
Binom |
binomial p |
IsopacketModeler can fit up to four different models of isotope incorporation to each peptide. These models differ in whether they include a background distribution and whether they use a binomial or beta-binomial distribution for the enriched component.
In physically structured matrices, e.g. soil, we have found that there is often a substantial metabolically inactive biomass fraction that does not take up isotopic labels. To handle these situations the BetabinomQuiescentMix and BinomQuiescentMix models are mixture models that include an isotopically enriched component and a component that has only the background isotopic distribution. These models fit an extra parameter that controls the fraction of total biomass that is contained within the metabolically active, isotopically enriched fraction. At low atom percent label incorporation levels it is challenging to distinguish between the background distribution and the enriched distribution which results in pathological fitting behavior. If you expect that a metabolically inactive biomass component will be present, then experimentally ensure that the isotope incorporation levels will be above 5%, and preferably above 10%.
If all organisms that produce a specific peptide are incorporating isotopes at the same rate, then the distribution of isotopes within that peptide will be close to an ideal binomial distribution. If, however, there is substantial metabolic heterogeneity between organisms, then the isotope distribution will be overdispersed. We have found that a beta-binomial model works well for handling these data in the experiments we've tested. This model implies that there is a continuum of label uptake rates that can be modeled with a beta distribution. The shape of this beta distribution is then a direct estimate of the cellular heterogeneity in the system. The BetabinomQuiescentMix and Betabinom data generating processes include this feature. We expect to see this behavior in most environmental samples due to the genetic heterogeneity of real communities and in structured substrates due to variable nutrient mass transport rates.
There are other effects that can produce deviations from the ideal binomial distribution. If multiple labeled atoms from the labeled nutrient are expected to be incorporated into an amino acid without first being broken apart during catabolism, then the resulting distribution is likely to be 'lumpy', e.g. if two carbon units are expected to be incorporated directly then even isotopologues will be more abundant than odd ones. The extreme example of this would be SILAC labeling. This is not a situation that IsopacketModeler can currently support, but it is easy to include new data generating processes. So, if you run into a situation like this please reach out to me either on the issues page or by email (stavis@vols.utk.edu).
| data generating process | inactive fraction | heterogeneous uptake | parameter count |
|---|---|---|---|
BetabinomQuiescentMix |
Yes | Yes | 3 |
Betabinom |
No | Yes | 2 |
BinomQuiescentMix |
Yes | No | 2 |
Binom |
No | No | 1 |
Observed intensity values are first scaled to sum to 1, per PSM. The residuals are winsorized at the 90^th^ percentile and the loss function is the mean absolute error of these values. For the second round of optimization the residuals are weighted by 1 minus the residual. It is this final loss that is used for filtering with the max_peptide_err parameter.
The problem is that deuterium induces a retention time shift. This means that heavily labeled isotopologues will not co-elute with their light counterparts. IsopacketModeler assumes that isotopologues co-elute so this effect will distort the observed isotopic packet. The effect will be most severe in the context of the background + enriched models. This may not be a serious concern if you have good reason to believe that the isotope incorporation will be unimodal and reasonably uniform, i.e. the binomial model. This would be the case for labeling of species isolates in liquid culture. I, however, do not have the data necessary to determine the severity of this problem.