This repository contains the data and code used in the preparation of the manuscript *Combining isobaric tags and peptidomics enables the detection of single amino acids and small peptides in human cerebrospinal fluid*.
It contains a small proof-of-principle experiment, where isobaric tags are combined with a sample of cerebrospinal fluid to identify singly charged molecules obtained by tandem mass spectrometry. Identification is done with a simple mass-based strategy with 10 ppm tolerance.
The goal of the project is to show that it is possible to identify a reasonable number of single amino acids, very small peptides, and metabolites using this strategy. There is still considerable room for improvement. Nevertheless, in a peptidomics experiment that uses isobaric tags, collecting singly charged molecules and analysing their spectra adds only a small amount of extra work and can potentially yield a wealth of additional information.
The precursor mass of each MS2 spectrum containing at least three of the six TMT tags is matched against a purpose-built database of masses. Only the precursor mass is used for identification, with a tolerance window of 10 ppm (±5 ppm relative to the theoretical mass in the database).
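The matching rule described above can be sketched in a few lines of R. This is a minimal illustration, not the code in `getIdentifications.R`: the function names `ppm_error` and `match_mass` and the three database masses are hypothetical, and the half-window of 5 ppm is taken from the description above.

```r
# Relative mass error in parts per million (ppm).
ppm_error <- function(observed, theoretical) {
  (observed - theoretical) / theoretical * 1e6
}

# Indices of the database masses that lie within +/- tol_ppm of one observed mass.
match_mass <- function(observed, db_masses, tol_ppm = 5) {
  which(abs(ppm_error(observed, db_masses)) <= tol_ppm)
}

# Hypothetical database masses, for illustration only.
db_masses <- c(a = 300.0000, b = 300.0020, c = 300.0100)

# Entries a and b are about 3.3 ppm away, entry c is about 30 ppm away.
hits <- match_mass(300.0010, db_masses)
names(hits)
```

Because the tolerance is relative to the theoretical mass, the absolute width of the window grows with the mass of the database entry.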
Single amino acids, di- and tripeptides, and a list of metabolites are used for the identifications. The single amino acids, di- and tripeptides can also have up to one post-translational modification.
The masses of amino acids, elements (hydrogen, oxygen, and charge), and modifications are taken from Unimod. From www.unimod.org/downloads.html, we downloaded the XML file reflecting the logical structure of the database (this file is called `unimod.xml`).
Given the ubiquitous nature of oxidised methionine and carbamidomethylated cysteine, these two modified amino acids are treated as standard single amino acids. In addition, as identification is done solely on the basis of mass, we cannot distinguish leucine from isoleucine. Isoleucine is therefore removed from the database.
Dipeptides are generated by making all possible combinations of two amino acids. As only the precursor mass is used for identification, we will not distinguish between, for example, alanine + leucine vs leucine + alanine. Hence, only one variant of each combination of amino acids is included.
Tripeptides are produced in the same way as the dipeptides.
This leads to 2023 entries:
- 21 'single amino acids' (20 amino acids, minus isoleucine, plus oxidised methionine and carbamidomethylated cysteine)
- 231 dipeptides ($\frac{(r+n-1)!}{r!\,(n-1)!}$ with $r=2$, $n=21$)
- 1771 tripeptides (as above, but with $r=3$)
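The counts above follow from choosing amino acids with repetition and without regard to order. The sketch below enumerates these combinations and checks the numbers. It is an illustration only, not the code in `makeAAdb.R`: the helper names `n_multisets` and `multisets` are hypothetical, and the lower-case labels `m` and `c` are placeholders for oxidised methionine and carbamidomethylated cysteine.

```r
# Number of combinations with repetition: (r + n - 1)! / (r! (n - 1)!)
n_multisets <- function(n, r) choose(n + r - 1, r)

# Enumerate the combinations; each one appears exactly once.
multisets <- function(items, r) {
  idx <- expand.grid(rep(list(seq_along(items)), r))
  # Keep only non-decreasing index tuples, so "AL" is kept and "LA" is dropped.
  keep <- rep(TRUE, nrow(idx))
  if (r > 1) for (j in 2:r) keep <- keep & idx[[j - 1]] <= idx[[j]]
  idx <- idx[keep, , drop = FALSE]
  unname(apply(idx, 1, function(i) paste(items[i], collapse = "")))
}

# 19 standard residues (isoleucine removed) plus the two placeholder labels.
residues <- c(strsplit("ACDEFGHKLMNPQRSTVWY", "")[[1]], "m", "c")

di  <- multisets(residues, 2)
tri <- multisets(residues, 3)

c(single = length(residues), di = length(di), tri = length(tri),
  total = length(residues) + length(di) + length(tri))
```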
We also added common post-translational modifications that are not on the N-term or the protein C-term.
This leads to the following 11 modifications:
Name | Abbreviation | Amino Acid |
---|---|---|
Biotinylation | Biotin | K |
Phosphorylation | Phospho | Y |
Phosphorylation | Phospho | T |
Phosphorylation | Phospho | S |
Methylation | Methyl | E |
Methylation | Methyl | D |
O-Sulfonation | Sulfo | S |
O-Sulfonation | Sulfo | T |
O-Sulfonation | Sulfo | Y |
Dihydroxy | Dioxidation | M |
Crotonylation | Crotonyl | K |
This results in 2783 extra molecules:
- 11 modified single amino acids
- 231 modified dipeptides
- 2541 modified tripeptides
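These counts are consistent with taking one of the 11 modified amino acids from the table and combining it with zero, one, or two of the 21 database amino acids, again ignoring order. This reading is inferred from the numbers and is not necessarily how `makeAAdb.R` builds the entries. The arithmetic can be checked as follows:

```r
n_residues <- 21  # entries treated as single amino acids
n_mods     <- 11  # rows in the modification table

mod_single <- n_mods                                  # modified amino acid on its own
mod_di     <- n_mods * n_residues                     # plus one other amino acid
mod_tri    <- n_mods * choose(n_residues + 2 - 1, 2)  # plus an unordered pair
mod_total  <- mod_single + mod_di + mod_tri

c(single = mod_single, di = mod_di, tri = mod_tri, total = mod_total)
```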
We do not add more than one post-translational modification (in addition to the TMT-tag) to any molecule.
The metabolites were taken from the Human Metabolome Database (HMDB). From the downloads site, we took the Metabolite and Protein Data in XML format for CSF metabolites (version 3.6, the most recent version at the time). We only included the subclasses "Amines" and "Amino acids, peptides, and analogues", and removed different versions of single amino acids.
The masses of the amino acids taken from Unimod are residue masses. These masses are also used when combining single amino acids into di- and tripeptides. To get to the masses we expect to see in the experiment, we add two hydrogens and one oxygen (one water molecule) from the elemental masses part of Unimod. The molecules in HMDB already have the expected mass.
As we expect a single charge and a TMT tag, we also add one hydrogen minus an electron (the mass of a charge) from the elemental masses part of Unimod, and a TMT6 tag from the modifications part of Unimod, to each of the molecules in our database.
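As a worked example of this mass calculation, the sketch below computes the expected m/z for TMT-labelled glycine and for the dipeptide glycine + alanine. The numeric values are approximate monoisotopic masses quoted for illustration only, whereas the pipeline reads them from `unimod.xml`. The function name `expected_mz` is hypothetical.

```r
# Approximate monoisotopic masses in Da, for illustration only;
# the pipeline itself takes these values from unimod.xml.
H      <- 1.007825
O      <- 15.994915
proton <- 1.007276    # one hydrogen minus one electron
tmt6   <- 229.162932  # TMT6plex modification
residue <- c(G = 57.021464, A = 71.037114)

expected_mz <- function(residues) {
  neutral <- sum(residue[residues]) + 2 * H + O  # residue masses plus water
  neutral + tmt6 + proton                        # tagged and singly charged
}

expected_mz("G")          # TMT-labelled glycine
expected_mz(c("G", "A"))  # TMT-labelled glycine + alanine
```

For the HMDB metabolites, which already have the expected neutral mass, only the tag and the charge would be added.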
This project uses the following R packages:
- stringr
- XML
- data.table
- Rcpp (you'll probably also need Rtools to compile the cpp code)
For the graphics:
- ggplot2
- gridExtra
- lattice
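To check which of these packages are still missing from a local R installation, something along the following lines can be used. It is a convenience sketch and not part of the repository; the installation call is left commented out.

```r
pkgs <- c("stringr", "XML", "data.table", "Rcpp",
          "ggplot2", "gridExtra", "lattice")

# requireNamespace() returns FALSE for packages that cannot be loaded.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```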
In the top level of the repository, the file `20171024_EndoCSF_TMT_Rest_Charge1.mgf` contains the MS2 spectra for the singly charged features. This file was generated using ProteoWizard MSConvert. To run the R files from the `Rscript` folder, you first need to download the XML files mentioned above from HMDB and Unimod.
To run the whole identification pipeline, run the file `main.R`. This first runs `getMGF.R` to read the mgf file and build a data table of the spectra, then `makeAAdb.R`, which constructs the database of theoretical masses described above, and finally `getIdentifications.R`, which maps the theoretical masses to the experimental masses.
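For readers unfamiliar with the mgf format, the first step can be pictured with the minimal reader below. It is a sketch under simplifying assumptions (every spectrum is delimited by `BEGIN IONS` and `END IONS` and peak lines start with a digit) and is not the implementation in `getMGF.R`. The function name `read_mgf` and the example spectrum are hypothetical.

```r
# Minimal mgf reader: one row per spectrum with title, precursor m/z,
# charge, and number of peaks.
read_mgf <- function(lines) {
  lines  <- trimws(lines)
  starts <- which(lines == "BEGIN IONS")
  ends   <- which(lines == "END IONS")
  rows <- lapply(seq_along(starts), function(k) {
    block <- lines[(starts[k] + 1):(ends[k] - 1)]
    # Value of the first "KEY=value" line in the block, or NA if absent.
    field <- function(key) {
      hit <- grep(paste0("^", key, "="), block, value = TRUE)
      if (length(hit) == 0) NA_character_ else sub("^[^=]*=", "", hit[1])
    }
    data.frame(
      title        = field("TITLE"),
      # PEPMASS may hold "m/z intensity"; keep the m/z only.
      precursor_mz = as.numeric(strsplit(field("PEPMASS"), "[ \t]+")[[1]][1]),
      charge       = field("CHARGE"),
      n_peaks      = sum(grepl("^[0-9]", block)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Hypothetical single-spectrum example.
example <- c(
  "BEGIN IONS",
  "TITLE=hypothetical spectrum 1",
  "PEPMASS=305.2022 1000",
  "CHARGE=1+",
  "126.1277 500",
  "127.1248 450",
  "END IONS"
)
spectra <- read_mgf(example)
spectra
```

For a file on disk, the same function could be applied to the output of `readLines()`.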
The files `barplots.R`, `heatplots.R`, and `scatterplot.R` contain the code to build the graphics used in the publication.
To reproduce these graphics, first run `main.R`, then run the appropriate graphics file.
For questions please use the issue tracker.