Skip to content

A tool for identifiying and modeling stable isotope label incorporation in peptides

License

Notifications You must be signed in to change notification settings

stavis1/isopacketModeler

Repository files navigation

IsopacketModeler

A tool for identifying and modeling stable isotope label incorporation in peptides.

IsopacketModeler is a command line tool written in Python for identifying peptides that have incorporated stable isotopes and modeling isotope uptake levels. It is recommended that this tool be run using the protein_SIP_pipeline, as this pipeline does a better job of taking advantage of high performance computing resources.

Installation

  1. Make sure you have a working conda installation. If conda is not installed please follow the instructions at this page: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html
  2. Download the repository as a .zip file from the green "<> Code" drop-down menu in the upper right of the GitHub page and extract the contents. Alternatively, if you have git installed run git clone https://github.com/stavis1/isopacketModeler/.
  3. On the command line navigate to the root of the downloaded directory and run conda env create -n isopacketModeler -f env/run.yml. This will create a conda environment and install the necessary dependencies for this tool.
  4. Run conda activate isopacketModeler to activate this environment.

Usage

IsopacketModeler requires four text files to run:

  1. A TOML formatted options file.
  2. A tab-separated file specifying the elemental composition of each amino acid.
  3. A tab-separated file describing the isotopic label added to each raw file.
  4. A tab-separated file describing the peptide-spectrum matches to use when searching for isotopically enriched peptides.

Additionally, the raw data must be provided as centroid mode mzML files.

Default versions of files 1 and 2, and templates for files 3 and 4 can be found in the templates folder.

File 1

The parameters are as follows:

  1. working_directory: This is the base directory from which all paths in the options file must be relative to.
  2. output_directory: A directory that will hold the results files. This will be created if not extant.
  3. overwrite: Whether to overwrite previous outputs. Must be either true or false in lowercase without quotes.
  4. log_level: Controls the verbosity of console logging. 50 silences output while 10 displays all information.
  5. log_file: The name of the log file. All logs will be written to log_file regardless of log_level.
  6. design_file: The name of file 3.
  7. mzml_dir: The directory that holds mzML files. The path names in the design file must be relative to this directory.
  8. psms: A list of tab separated files containing PSMs to be fit. The list should be formatted as ['psm_file_1.tsv', 'psm_file_2.tsv']
  9. psm_headers: A list of column headers to be found in the PSM files. These should contain the peptide sequence, mzml file, scan number, charge, and protein names in that order, regardless of the order these columns are found in the PSM files. The scan number and charge columns should be integers only, i.e. +2 doesn't work. IsopacketModeler only works with positive ion mode data. Any columns listed after the requisite first five will be tracked as additional metadata and will be included in the final output. The formatting of this list is the same as that of the psms parameter.
  10. aa_formulae: The name of file 2.
  11. cores: The maximum number of CPU cores the tool is allowed to use.
  12. classifier_fdr: The target false discovery rate on the PSM level that the classifier will attempt to control. Only used if do_psm_classification is true
  13. data_generating_processes: A list of data generating processes to fit to each peptide. Options are BetabinomQuiescentMix, Betabinom, BinomQuiescentMix, and Binom. Any combination of these can be fit in a single run. This list should be formatted the same as the psms parameter. See footnote 1 for details on each option.
  14. max_peptide_err: Peptidoforms whose best-fit model has a final loss higher than this value will be excluded from the output. In practice even exceptionally poor fits will rarely have a loss above 0.03 and qualitatively well fitting models often have a loss ~0.01. See footnote 2 for a discussion of what the loss means.
  15. do_psm_classification: Whether to train a neural network classifier model to do false discovery rate control on the PSM level. Must be either true or false in lowercase without quotes.
  16. checkpoint_files: If you have a previously terminated run it can be resumed from the checkpoint files. All checkpoint files must be on the same step.
  17. stopping_point: If you wish to stop early put a checkpoint step here, otherwise this should be false. Step 1 is the end of file parsing and step 2 is after PSM classification and peptide construction.

File 2

The TSV file which maps single letter amino acid codes to molecular formulae. If you have modified amino acids please represent them as either single characters (any non whitespace character can be used) or as the base amino acid followed by a bracketed modifier, e.g. S[79.9663]. Each modification should get its own row with the full molecular formula (not just the difference from the unmodified amino acid). If you need elements not listed in the default file add a column with their elemental symbol as the header.

File 3

This is the experimental design file. It must have at least two columns 'file' and 'label' which list the mzML files and their heavy label isotopes, respectively. The allowable isotopes are 'H[2]', 'C[13]', 'N[15]', 'O[17]', 'O[18]', 'S[33]', 'S[34]', and 'S[36]'. Deuterium labeling is supported but not recommended, see footnote 3 for details. Control samples should leave this field blank. Only one label is allowed per file. Other columns can be included. The information in these columns will be kept associated with the peptide data and will be included in the output but will otherwise be ignored. This can be used to, for example, keep track of time points or conditions in an experiment.

File 4

This is the PSM file, must include columns for peptide sequence, mzml file, scan #, charge, and protein names. These columns do not need to have specific names, rather, the names of the columns corresponding to these pieces of information will be specified in file 1, the options file. This means that tab-separated PSM files from multiple tools can be used with minimal processing. Scan # can be either the MS2 or the parent MS1 scan for the PSM. The charge column must have only integer values, e.g. 2 not +2 and cannot be blank or unknown. The mzML files must be either in the same directory that the command is run from or must be specified as paths relative to that directory. Flanking residues in peptide sequences will be ignored if they are separated by '.', e.g. 'P.EPTID.E' will be interpreted as 'EPTID'. Post-translational modifications can be specified as any ASCII character that is neither a whitespace nor an open bracket or they can be specified as bracket bounded strings of characters following a character. For example either 'm' or 'M[15.9949]' could represent a modified residue. Either of these options must be added to File 2, the amino acid formula file. PTMs that use the bracket notation must include the preceding character in File 2, so an example entry line would look like M[15.9949] 5 9 1 2 1 0.

Running the tool

To run the tool ensure that the isopacketModeler conda environment is active then run python -m isopacketModeler --options options.toml.

Interpreting results

The primary output of this tool is a tab separated table called peptides.tsv. Each row of this file is a single peptidoform (a unique peptide sequence with a unique combination of modifications) found in a single file.

  1. peptide: The peptidoform sequence.
  2. proteins: The proteins a peptidoform matches to.
  3. npeaks: The number of isotopic peaks used for fitting the models.
  4. canonical_fit: The final loss from the fitting procedure for the best fit model. See footnote 2.
  5. file: The mzML file the peptidoform was found in.
  6. label: The label isotope.
  7. PSM_count: The number of peptide-spectrum matches.
  8. canonical_DGP: The name of the best fit data generating process
  9. canonical_mean_label_probability: The mean atom percent isotope incorporation level. This is calculated from the best fit distribution.
  10. canonical_variance: The variance of atom percent isotope incorporation levels, also calculated from the best fit distribution.
  11. canonical_param0: parameter 0 of the best fit distribution, see below for explanations.
  12. canonical_param1: parameter 1 of the best fit distribution, see below for explanations.
  13. canonical_param2: parameter 2 of the best fit distribution, see below for explanations. Each fitted data generating process will also generate a set of columns that match the canonical_* columns.

The parameters for each data generating process:

data generating process param0 param1 param2
BetabinomQuiescentMix betabinomial alpha betabinomial beta active fraction
Betabinom betabinomial alpha betabinomial beta
BinomQuiescentMix binomial p active fraction
Binom binomial p

Footnotes

Data generating processes

IsopacketModeler can fit up to four different models of isotope incorporation to each peptide. These models differ in whether they include a background distribution and whether they use a binomial or beta-binomial distribution for the enriched component.

In physically structured matrices, e.g. soil, we have found that there is often a substantial metabolically inactive biomass fraction that does not take up isotopic labels. To handle these situations the BetabinomQuiescentMix and BinomQuiescentMix models are mixture models that include an isotopically enriched component and a component that has only the background isotopic distribution. These models fit an extra parameter that controls the fraction of total biomass that is contained within the metabolically active, isotopically enriched fraction. At low atom percent label incorporation levels it is challenging to distinguish between the background distribution and the enriched distribution which results in pathological fitting behavior. If you expect that a metabolically inactive biomass component will be present, then experimentally ensure that the isotope incorporation levels will be above 5%, and preferably above 10%.

If all organisms that produce a specific peptide are incorporating isotopes at the same rate, then the distribution of isotopes within that peptide will be close to an ideal binomial distribution. If, however, there is substantial metabolic heterogeneity between organisms, then the isotope distribution will be overdispersed. We have found that a beta-binomial model works well for handling these data in the experiments we've tested. This model implies that there is a continuum of label uptake rates that can be modeled with a beta distribution. The shape of this beta distribution is then a direct estimate of the cellular heterogeneity in the system. The BetabinomQuiescentMix and Betabinom data generating processes include this feature. We expect to see this behavior in most environmental samples due to the genetic heterogeneity of real communities and in structured substrates due to variable nutrient mass transport rates.

There are other effects that can produce deviations from the ideal binomial distribution. If multiple labeled atoms from the labeled nutrient are expected to be incorporated into an amino acid without first being broken apart during catabolism, then the resulting distribution is likely to be 'lumpy', e.g. if two carbon units are expected to be incorporated directly then even isotopologues will be more abundant than odd ones. The extreme example of this would be SILAC labeling. This is not a situation that IsopacketModeler can currently support, but it is easy to include new data generating processes. So, if you run into a situation like this please reach out to me either on the issues page or by email (stavis@vols.utk.edu).

Summary of the available data generating processes:

data generating process inactive fraction heterogeneous uptake parameter count
BetabinomQuiescentMix Yes Yes 3
Betabinom No Yes 2
BinomQuiescentMix Yes No 2
Binom No No 1

Loss function

Observed intensity values are first scaled to sum to 1, per PSM. The residuals are winsorized at the 90^th^ percentile and the loss function is the mean absolute error of these values. For the second round of optimization the residuals are weighted by 1 minus the residual. It is this final loss that is used for filtering with the max_peptide_err parameter.

Deuterium labeling

The problem is that deuterium induces a retention time shift. This means that heavily labeled isotopologues will not co-elute with their light counterparts. IsopacketModeler assumes that isotopologues co-elute so this effect will distort the observed isotopic packet. The effect will be most severe in the context of the background + enriched models. This may not be a serious concern if you have good reason to believe that the isotope incorporation will be unimodal and reasonably uniform, i.e. the binomial model. This would be the case for labeling of species isolates in liquid culture. I, however, do not have the data necessary to determine the severity of this problem.

About

A tool for identifiying and modeling stable isotope label incorporation in peptides

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages