The page you requested cannot be found (perhaps it was moved or renamed).
-You may want to try searching to find the page's new location, or use -the table of contents to find the page you are looking for.
--
-Page built: -2023-09-06 - using -R version 4.3.1 Patched (2023-07-10 r84676) -
-Let’s use the identification from from msdata
:
<- msdata::ident(full.names = TRUE)
- idf basename(idf)
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
-The easiest way to read identification data in mzIdentML
(often
-abbreviated with mzid
) into R is to read it with readPSMs()
-function from the PSM
-package. The function will parse the file and return a DataFrame
.
library(PSM)
-<- readPSMs(idf)
- id dim(id)
## [1] 5802 35
-names(id)
## [1] "sequence" "spectrumID"
-## [3] "chargeState" "rank"
-## [5] "passThreshold" "experimentalMassToCharge"
-## [7] "calculatedMassToCharge" "peptideRef"
-## [9] "modNum" "isDecoy"
-## [11] "post" "pre"
-## [13] "start" "end"
-## [15] "DatabaseAccess" "DBseqLength"
-## [17] "DatabaseSeq" "DatabaseDescription"
-## [19] "scan.number.s." "acquisitionNum"
-## [21] "spectrumFile" "idFile"
-## [23] "MS.GF.RawScore" "MS.GF.DeNovoScore"
-## [25] "MS.GF.SpecEValue" "MS.GF.EValue"
-## [27] "MS.GF.QValue" "MS.GF.PepQValue"
-## [29] "modPeptideRef" "modName"
-## [31] "modMass" "modLocation"
-## [33] "subOriginalResidue" "subReplacementResidue"
-## [35] "subLocation"
--► Question -
-Verify that this table contains 5802 matches for 5343 -scans and 4938 peptides sequences.
-- -
--► Solution -
- -The PSM data are read as is, without and filtering. As we can see -below, we still have all the hits from the forward and reverse (decoy) -databases.
-table(id$isDecoy)
##
-## FALSE TRUE
-## 2906 2896
-The data contains also contains multiple matches for several -spectra. The table below shows the number of number of spectra that -have 1, 2, … up to 5 matches.
-table(table(id$spectrumID))
##
-## 1 2 3 4 5
-## 4936 369 26 10 2
-Below, we can see how scan 1774 has 4 matches, all to sequence
-RTRYQAEVR
, which itself matches to 4 different proteins:
<- which(id$spectrumID == "controllerType=0 controllerNumber=1 scan=1774")
- i 1:5] id[i,
## PSM with 4 rows and 5 columns.
-## names(5): sequence spectrumID ... rank passThreshold
-If the goal is to keep all the matches, but arranged by scan/spectrum,
-one can reduce the DataFrame
object by the spectrumID
variable,
-so that each scan correponds to a single row that still stores all
-values5 The rownames aren’t needed here are are removed to reduce
-to output in the the next code chunk display parts of id2
.:
<- QFeatures::reduceDataFrame(id, id$spectrumID) id2
## Warning: The dim() method for DataFrameList objects is deprecated. Please use
-## dims() on these objects instead.
-## Warning: The nrow() method for DataFrameList objects is deprecated. Please use
-## nrows() on these objects instead.
-## Warning: The ncol() method for CompressedSplitDataFrameList objects is
-## deprecated. Please use ncols() on these objects instead.
-rownames(id2) <- NULL ## rownames not needed here
-dim(id2)
## [1] 5343 35
-The resulting object contains a single entrie for scan 1774 with -information for the multiple matches stored as lists within the cells.
-<- which(id2$spectrumID == "controllerType=0 controllerNumber=1 scan=1774")
- j id2[j, ]
## DataFrame with 1 row and 35 columns
-## sequence spectrumID chargeState rank
-## <CharacterList> <character> <integer> <IntegerList>
-## 1 RTRYQAEVR,RTRYQAEVR,RTRYQAEVR,... controller... 2 1,1,1,...
-## passThreshold experimentalMassToCharge calculatedMassToCharge
-## <logical> <numeric> <NumericList>
-## 1 TRUE 589.821 589.823,589.823,589.823,...
-## peptideRef modNum isDecoy
-## <CharacterList> <IntegerList> <LogicalList>
-## 1 Pep1890,Pep1890,Pep1890,... 0,0,0,... FALSE,FALSE,FALSE,...
-## post pre start end
-## <CharacterList> <CharacterList> <IntegerList> <IntegerList>
-## 1 P,P,P,... R,R,R,... 89,99,89,... 97,107,97,...
-## DatabaseAccess DBseqLength DatabaseSeq
-## <CharacterList> <IntegerList> <character>
-## 1 ECA2104,ECA2867,ECA3427,... 675,619,678,...
-## DatabaseDescription scan.number.s. acquisitionNum
-## <CharacterList> <numeric> <numeric>
-## 1 ECA2104 Vg...,ECA2867 pu...,ECA3427 co...,... 1774 1774
-## spectrumFile idFile MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue
-## <character> <character> <numeric> <numeric> <numeric>
-## 1 TMT_Erwini... TMT_Erwini... 0 96 3.69254e-06
-## MS.GF.EValue MS.GF.QValue MS.GF.PepQValue
-## <NumericList> <numeric> <NumericList>
-## 1 10.5388,10.5388,10.5388,... 1 0.990816,0.990816,0.990816,...
-## modPeptideRef modName modMass modLocation
-## <CharacterList> <CharacterList> <NumericList> <IntegerList>
-## 1 NA,NA,NA,... NA,NA,NA,... NA,NA,NA,... NA,NA,NA,...
-## subOriginalResidue subReplacementResidue subLocation
-## <character> <character> <integer>
-## 1 NA NA NA
-"DatabaseAccess"] id2[j,
## CharacterList of length 1
-## [["controllerType=0 controllerNumber=1 scan=1774"]] ECA2104 ECA2867 ECA3427 ECA4142
-The is the type of complete identification table that could be used to
-annotate an raw mass spectrometry Spectra
object, as shown below.
Often, the PSM data is filtered to only retain reliable matches. The
-MSnID
package can be used to set thresholds to attain user-defined
-PSM, peptide or protein-level FDRs. Here, we will simply filter out
-wrong identification manually.
Here, the filter()
from the dplyr
package comes very handy. We
-will thus start by convering the DataFrame
to a tibble
.
library("dplyr")
-<- tidyr::as_tibble(id)
- id_tbl id_tbl
## # A tibble: 5,802 × 35
-## sequence spectrumID chargeState rank passThreshold experimentalMass…
-## <chr> <chr> <int> <int> <lgl> <dbl>
-## 1 RQCRTDFLNYLR controllerTyp… 3 1 TRUE 548.
-## 2 ESVALADQVTC… controllerTyp… 2 1 TRUE 1288.
-## 3 KELLCLAMQIIR controllerTyp… 2 1 TRUE 744.
-## 4 QRMARTSDKQQ… controllerTyp… 3 1 TRUE 913.
-## 5 KDEGSTEPLKV… controllerTyp… 3 1 TRUE 927.
-## 6 DGGPAIYGHER… controllerTyp… 3 1 TRUE 969.
-## 7 QRMARTSDKQQ… controllerTyp… 2 1 TRUE 1369.
-## 8 CIDRARHVEVQ… controllerTyp… 3 1 TRUE 1285.
-## 9 CIDRARHVEVQ… controllerTyp… 3 1 TRUE 1285.
-## 10 VGRCRPIINYL… controllerTyp… 2 1 TRUE 1102.
-## # … with 5,792 more rows, and 29 more variables: calculatedMassToCharge <dbl>,
-## # peptideRef <chr>, modNum <int>, isDecoy <lgl>, post <chr>, pre <chr>,
-## # start <int>, end <int>, DatabaseAccess <chr>, DBseqLength <int>,
-## # DatabaseSeq <chr>, DatabaseDescription <chr>, scan.number.s. <dbl>,
-## # acquisitionNum <dbl>, spectrumFile <chr>, idFile <chr>,
-## # MS.GF.RawScore <dbl>, MS.GF.DeNovoScore <dbl>, MS.GF.SpecEValue <dbl>,
-## # MS.GF.EValue <dbl>, MS.GF.QValue <dbl>, MS.GF.PepQValue <dbl>, …
--► Question -
-- -
--► Solution -
- --► Question -
-- -
--► Solution -
- --► Question -
-XXX_ECA3406
and ECA3415
. Scan 4099 match XXX_ECA4416_1
,
-XXX_ECA4416_2
and XXX_ECA4416_3
. Then remove the scans that
-match any of these proteins.- -
--► Solution -
- -Which leaves us with 2666 PSMs.
-This can also be achieved with the filterPSMs()
function:
<- filterPSMs(id) id_filtered
## Starting with 5802 PSMs:
-## removed 2896 decoy hits
-## removed 155 PSMs with rank > 1
-## removed 85 non-proteotypic peptides
-## 2666 PSMs left.
--► Question -
-Compare the distribution of raw idenfication scores of the decoy and -non-decoy hits. Interpret the figure.
-- -
--► Solution -
- --► Question -
-The tidyverse
-tools are fit for data wrangling with identification data. Using the
-above identification dataframe, calculate the length of each peptide
-(you can use nchar
with the peptide sequence sequence
) and the
-number of peptides for each protein (defined as
-DatabaseDescription
). Plot the length of the proteins against their
-respective number of peptides.
- -
--► Solution -
- -There are two packages that can be used to parse mzIdentML
files,
-namely mzR
(that we have already used for raw data) and mzID
. The
-major difference is that the former leverages C++ code from
-proteowizard
and is hence faster than the latter (which uses the
-XML
R package). They both work in similar ways.
|Data type |File format |Data structure |Package |
-|:--------------|:-----------|:--------------|:-------|
-|Identification |mzIdentML |mzRident |mzR |
-|Identification |mzIdentML |mzID |mzID |
-Which of these packages is used by readPSMs()
can be defined by the
-parser
argument.
mzID
-The main functions are mzID
to read the data into a dedicated data
-class and flatten
to transform it into a data.frame
.
idf
## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/ident/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
-library("mzID")
##
-## Attaching package: 'mzID'
-## The following object is masked from 'package:dplyr':
-##
-## id
-<- mzID(idf) id
## reading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid...
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-
-## Warning in type.convert.default(...): 'as.is' should be specified by the caller;
-## using TRUE
-## DONE!
- id
## An mzID object
-##
-## Software used: MS-GF+ (version: Beta (v10072))
-##
-## Rawfile: /home/lg390/dev/01_svn/workflows/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-##
-## Database: /home/lg390/dev/01_svn/workflows/proteomics/erwinia_carotovora.fasta
-##
-## Number of scans: 5343
-## Number of PSM's: 5656
-Various data can be extracted from the mzID
object, using one the
-accessor functions such as database
, software
, scans
, peptides
,
-… The object can also be converted into a data.frame
using the
-flatten
function.
head(flatten(id))
## spectrumid scan number(s) acquisitionnum
-## 1 controllerType=0 controllerNumber=1 scan=5782 5782 5782
-## 2 controllerType=0 controllerNumber=1 scan=6037 6037 6037
-## 3 controllerType=0 controllerNumber=1 scan=5235 5235 5235
-## passthreshold rank calculatedmasstocharge experimentalmasstocharge
-## 1 TRUE 1 1080.232 1080.233
-## 2 TRUE 1 1002.212 1002.209
-## 3 TRUE 1 1189.280 1189.284
-## chargestate ms-gf:denovoscore ms-gf:evalue ms-gf:pepqvalue ms-gf:qvalue
-## 1 3 174 1.086033e-20 0 0
-## 2 3 245 1.988774e-19 0 0
-## 3 3 264 5.129649e-19 0 0
-## ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod isotopeerror
-## 1 147 3.764831e-27 HCD 0
-## 2 214 6.902626e-26 HCD 0
-## 3 211 1.778789e-25 HCD 0
-## isdecoy post pre end start accession length
-## 1 FALSE S R 84 50 ECA1932 155
-## 2 FALSE R K 315 288 ECA1147 434
-## 3 FALSE A R 224 192 ECA0013 295
-## description pepseq
-## 1 outer membrane lipoprotein PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR
-## 2 trigger factor TQVLDGLINANDIEVPVALIDGEIDVLR
-## 3 ribose-binding periplasmic protein TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR
-## modified modification
-## 1 FALSE <NA>
-## 2 FALSE <NA>
-## 3 FALSE <NA>
-## idFile
-## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## spectrumFile
-## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## databaseFile
-## 1 erwinia_carotovora.fasta
-## 2 erwinia_carotovora.fasta
-## 3 erwinia_carotovora.fasta
-## [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
-mzR
-The mzR
interface provides a similar interface. It is however much
-faster as it does not read all the data into memory and only extracts
-relevant data on demand. It has also accessor functions such as
-softwareInfo
, mzidInfo
, … (use showMethods(classes = "mzRident", where = "package:mzR")
)
-to see all available methods.
library("mzR")
-<- openIDfile(idf)
- id2 id2
## Identification file handle.
-## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## Number of psms: 5759
-softwareInfo(id2)
## [1] "MS-GF+ Beta (v10072) "
-## [2] "ProteoWizard MzIdentML 3.0.501 ProteoWizard"
-The identification data can be accessed as a data.frame
with the
-psms
accessor.
head(psms(id2))
## spectrumID chargeState rank passThreshold
-## 1 controllerType=0 controllerNumber=1 scan=5782 3 1 TRUE
-## 2 controllerType=0 controllerNumber=1 scan=6037 3 1 TRUE
-## 3 controllerType=0 controllerNumber=1 scan=5235 3 1 TRUE
-## 4 controllerType=0 controllerNumber=1 scan=5397 3 1 TRUE
-## 5 controllerType=0 controllerNumber=1 scan=6075 3 1 TRUE
-## experimentalMassToCharge calculatedMassToCharge
-## 1 1080.2325 1080.2321
-## 2 1002.2089 1002.2115
-## 3 1189.2836 1189.2800
-## 4 960.5365 960.5365
-## 5 1264.3409 1264.3419
-## sequence peptideRef modNum isDecoy post pre start
-## 1 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR Pep1 0 FALSE S R 50
-## 2 TQVLDGLINANDIEVPVALIDGEIDVLR Pep2 0 FALSE R K 288
-## 3 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR Pep3 0 FALSE A R 192
-## 4 SQILQQAGTSVLSQANQVPQTVLSLLR Pep4 0 FALSE - R 264
-## 5 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR Pep5 0 FALSE F R 119
-## end DatabaseAccess DBseqLength DatabaseSeq
-## 1 84 ECA1932 155
-## 2 315 ECA1147 434
-## 3 224 ECA0013 295
-## 4 290 ECA1731 290
-## 5 153 ECA1443 298
-## DatabaseDescription scan.number.s.
-## 1 ECA1932 outer membrane lipoprotein 5782
-## 2 ECA1147 trigger factor 6037
-## 3 ECA0013 ribose-binding periplasmic protein 5235
-## 4 ECA1731 flagellin 5397
-## 5 ECA1443 UTP--glucose-1-phosphate uridylyltransferase 6075
-## acquisitionNum
-## 1 5782
-## 2 6037
-## 3 5235
-## 4 5397
-## 5 6075
-## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
-Searches are generally performed using third-party software
-independently of R or can be started from R using a system
call.
-The MSGFplus package can be used to perform a search
-using the MSGF+ engine directly from R and MSGFgui can
-be used to explore the identification results.
We are goind to use the sp
object created in the previous chapter
-and the id_filtered
variable generated above.
Identification data (as a DataFrame
) can be merged into raw data (as
-a Spectra
object) by adding new spectra variables to the appropriate
-MS2 spectra. Scans and peptide-spectrum matches can be matched by
-their spectrum identifers.
-► Question -
-Identify the spectum identifier columns in the sp
the id_filtered
-variables.
- -
--► Solution -
- -These two data can thus simply be joined using:
-<- joinSpectraData(sp, id_filtered,
- sp by.x = "spectrumId",
- by.y = "spectrumID")
- spectraVariables(sp)
## [1] "msLevel" "rtime"
-## [3] "acquisitionNum" "scanIndex"
-## [5] "dataStorage" "dataOrigin"
-## [7] "centroided" "smoothed"
-## [9] "polarity" "precScanNum"
-## [11] "precursorMz" "precursorIntensity"
-## [13] "precursorCharge" "collisionEnergy"
-## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
-## [17] "isolationWindowUpperMz" "peaksCount"
-## [19] "totIonCurrent" "basePeakMZ"
-## [21] "basePeakIntensity" "ionisationEnergy"
-## [23] "lowMZ" "highMZ"
-## [25] "mergedScan" "mergedResultScanNum"
-## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"
-## [29] "injectionTime" "filterString"
-## [31] "spectrumId" "ionMobilityDriftTime"
-## [33] "scanWindowLowerLimit" "scanWindowUpperLimit"
-## [35] "sequence" "chargeState"
-## [37] "rank" "passThreshold"
-## [39] "experimentalMassToCharge" "calculatedMassToCharge"
-## [41] "peptideRef" "modNum"
-## [43] "isDecoy" "post"
-## [45] "pre" "start"
-## [47] "end" "DatabaseAccess"
-## [49] "DBseqLength" "DatabaseSeq"
-## [51] "DatabaseDescription" "scan.number.s."
-## [53] "acquisitionNum.y" "spectrumFile"
-## [55] "idFile" "MS.GF.RawScore"
-## [57] "MS.GF.DeNovoScore" "MS.GF.SpecEValue"
-## [59] "MS.GF.EValue" "MS.GF.QValue"
-## [61] "MS.GF.PepQValue" "modPeptideRef"
-## [63] "modName" "modMass"
-## [65] "modLocation" "subOriginalResidue"
-## [67] "subReplacementResidue" "subLocation"
--► Question -
-Verify that the identification data has been added to the correct -spectra.
-- -
--► Solution -
- -Let’s choose a MS2 spectrum with a high identication score and plot -it.
-<- which(sp$MS.GF.RawScore > 100)[1]
- i plotSpectra(sp[i])
We have seen above that we can add labels to each peak using the
-labels
argument in plotSpectra()
. The addFragments()
function
-takes a spectrum as input (that is a Spectra
object of length 1) and
-annotates its peaks.
addFragments(sp[i])
## [1] NA NA NA "b1" NA NA NA NA NA NA NA NA
-## [13] NA NA NA NA NA NA NA NA NA NA NA NA
-## [25] NA NA NA NA NA NA NA NA NA NA NA NA
-## [37] NA NA NA NA NA NA NA "y1_" NA NA NA NA
-## [49] NA "y1" NA NA NA NA NA NA NA NA NA NA
-## [61] NA NA NA NA NA NA NA NA NA NA NA NA
-## [73] NA NA NA NA NA NA NA NA NA NA NA NA
-## [85] NA NA "b2" NA NA NA NA NA NA NA NA NA
-## [97] NA NA NA NA
-## [ reached getOption("max.print") -- omitted 227 entries ]
-It can be directly used with plotSpectra()
:
plotSpectra(sp[i], labels = addFragments,
-labelPos = 3, labelCol = "steelblue")
When a precursor peptide ion is fragmented in a CID cell, it breaks at -specific bonds, producing sets of peaks (a, b, c and x, y, -z) that can be predicted.
-The annotation of spectra is obtained by simulating fragmentation of a -peptide and matching observed peaks to fragments:
-$sequence sp[i]
## [1] "THSQEEMQHMQR"
-calculateFragments(sp[i]$sequence)
## Modifications used: C=57.02146
-## mz ion type pos z seq
-## 1 102.0550 b1 b 1 1 T
-## 2 239.1139 b2 b 2 1 TH
-## 3 326.1459 b3 b 3 1 THS
-## 4 454.2045 b4 b 4 1 THSQ
-## 5 583.2471 b5 b 5 1 THSQE
-## 6 712.2897 b6 b 6 1 THSQEE
-## 7 843.3301 b7 b 7 1 THSQEEM
-## 8 971.3887 b8 b 8 1 THSQEEMQ
-## 9 1108.4476 b9 b 9 1 THSQEEMQH
-## 10 1239.4881 b10 b 10 1 THSQEEMQHM
-## 11 1367.5467 b11 b 11 1 THSQEEMQHMQ
-## 12 175.1190 y1 y 1 1 R
-## 13 303.1775 y2 y 2 1 QR
-## 14 434.2180 y3 y 3 1 MQR
-## 15 571.2769 y4 y 4 1 HMQR
-## 16 699.3355 y5 y 5 1 QHMQR
-## [ reached 'max' / getOption("max.print") -- omitted 42 rows ]
-The compareSpectra()
can be used to compare spectra (by default,
-computing the normalised dot product).
-► Question -
-Spectra
object containing the MS2 spectra with
-sequences "SQILQQAGTSVLSQANQVPQTVLSLLR"
and
-"TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR"
.- -
--► Solution -
- --► Question -
-compareSpectra
. See the ?Spectra
man
-page for details. Draw a heatmap of that distance matrix- -
--► Solution -
- --► Question -
-- -
--► Solution -
- --► Question -
-Download the 3 first mzML and mzID files from the -PXD022816 -project (Morgenstern, Barzilay, and Levin 2021).
-- -
--► Solution -
- --► Question -
-Generate a Spectra
object and a table of filtered PSMs. Visualise
-the total ion chromatograms and check the quality of the
-identification data by comparing the density of the decoy and target
-PSMs id scores for each file.
- -
--► Solution -
- --► Question -
-Join the raw and identification data. Beware though that the joining -must now be performed by spectrum ids and by files.
-- -
--► Solution -
- --► Question -
-Extract the PSMs that have been matched to peptides from protein
-O43175
and compare and cluster the scans. Hint: once you have
-created the smaller Spectra
object with the scans of interest,
-switch to an in-memory backend to seed up the calculations.
- -
--► Solution -
- -MSnID
-
-The MSnID
package extracts MS/MS ID data from mzIdentML (leveraging
-the mzID
package) or text files. After collating the search results
-from multiple datasets it assesses their identification quality and
-optimises filtering criteria to achieve the maximum number of
-identifications while not exceeding a specified false discovery
-rate. It also contains a number of utilities to explore the MS/MS
-results and assess missed and irregular enzymatic cleavages, mass
-measurement accuracy, etc.
Let’s reproduce parts of the analysis described the MSnID
-vignette. You can explore more with
vignette("msnid_vignette", package = "MSnID")
The MSnID package can be used for post-search filtering
-of MS/MS identifications. One starts with the construction of an
-MSnID
object that is populated with identification results that can
-be imported from a data.frame
or from mzIdenML
files. Here, we
-will use the example identification data provided with the package.
<- system.file("extdata", "c_elegans.mzid.gz", package="MSnID")
- mzids basename(mzids)
## [1] "c_elegans.mzid.gz"
-We start by loading the package, initialising the MSnID
object, and
-add the identification result from our mzid
file (there could of
-course be more that one).
library("MSnID")
##
-## Attaching package: 'MSnID'
-## The following object is masked from 'package:ProtGenerics':
-##
-## peptides
-<- MSnID(".") msnid
## Note, the anticipated/suggested columns in the
-## peptide-to-spectrum matching results are:
-## -----------------------------------------------
-## accession
-## calculatedMassToCharge
-## chargeState
-## experimentalMassToCharge
-## isDecoy
-## peptide
-## spectrumFile
-## spectrumID
-<- read_mzIDs(msnid, mzids) msnid
## Loaded cached data
-show(msnid)
## MSnID object
-## Working directory: "."
-## #Spectrum Files: 1
-## #PSMs: 12263 at 36 % FDR
-## #peptides: 9489 at 44 % FDR
-## #accessions: 7414 at 76 % FDR
-Printing the MSnID
object returns some basic information such as
The package then enables to define, optimise and apply filtering based -for example on missed cleavages, identification scores, precursor mass -errors, etc. and assess PSM, peptide and protein FDR levels. To -properly function, it expects to have access to the following data
-## [1] "accession" "calculatedMassToCharge"
-## [3] "chargeState" "experimentalMassToCharge"
-## [5] "isDecoy" "peptide"
-## [7] "spectrumFile" "spectrumID"
-which are indeed present in our data:
-names(msnid)
## [1] "spectrumID" "scan number(s)"
-## [3] "acquisitionNum" "passThreshold"
-## [5] "rank" "calculatedMassToCharge"
-## [7] "experimentalMassToCharge" "chargeState"
-## [9] "MS-GF:DeNovoScore" "MS-GF:EValue"
-## [11] "MS-GF:PepQValue" "MS-GF:QValue"
-## [13] "MS-GF:RawScore" "MS-GF:SpecEValue"
-## [15] "AssumedDissociationMethod" "IsotopeError"
-## [17] "isDecoy" "post"
-## [19] "pre" "end"
-## [21] "start" "accession"
-## [23] "length" "description"
-## [25] "pepSeq" "modified"
-## [27] "modification" "idFile"
-## [29] "spectrumFile" "databaseFile"
-## [31] "peptide"
-Here, we summarise a few steps and redirect the reader to the -package’s vignette for more details:
-Cleaning irregular cleavages at the termini of the peptides and
-missing cleavage site within the peptide sequences. The following two
-function call create the new numMisCleavages
and numIrregCleavages
-columns in the MSnID
object
<- assess_termini(msnid, validCleavagePattern="[KR]\\.[^P]")
- msnid <- assess_missed_cleavages(msnid, missedCleavagePattern="[KR](?=[^P$])") msnid
Now, we can use the apply_filter
function to effectively apply
-filters. The strings passed to the function represent expressions that
-will be evaluated, thus keeping only PSMs that have 0 irregular
-cleavages and 2 or less missed cleavages.
<- apply_filter(msnid, "numIrregCleavages == 0")
- msnid <- apply_filter(msnid, "numMissCleavages <= 2")
- msnid show(msnid)
## MSnID object
-## Working directory: "."
-## #Spectrum Files: 1
-## #PSMs: 7838 at 17 % FDR
-## #peptides: 5598 at 23 % FDR
-## #accessions: 3759 at 53 % FDR
-Using "calculatedMassToCharge"
and "experimentalMassToCharge"
, the
-mass_measurement_error
function calculates the parent ion mass
-measurement error in parts per million.
summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
-## -2184.0640 -0.6992 0.0000 17.6146 0.7512 2012.5178
-We then filter any matches that do not fit the +/- 20 ppm tolerance
-<- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
- msnid summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
-## -19.7797 -0.5866 0.0000 -0.2970 0.5713 19.6758
-Filtering of the identification data will rely on
-$msmsScore <- -log10(msnid$`MS-GF:SpecEValue`) msnid
$absParentMassErrorPPM <- abs(mass_measurement_error(msnid)) msnid
MS2 filters are handled by a special MSnIDFilter
class objects, where
-individual filters are set by name (that is present in names(msnid)
)
-and comparison operator (>, <, = , …) defining if we should retain
-hits with higher or lower given the threshold and finally the
-threshold value itself.
<- MSnIDFilter(msnid)
- filtObj $absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
- filtObj$msmsScore <- list(comparison=">", threshold=10.0)
- filtObjshow(filtObj)
## MSnIDFilter object
-## (absParentMassErrorPPM < 10) & (msmsScore > 10)
-We can then evaluate the filter on the identification data object, -which return the false discovery rate and number of retained -identifications for the filtering criteria at hand.
-evaluate_filter(msnid, filtObj)
## fdr n
-## PSM 0 3807
-## peptide 0 2455
-## accession 0 1009
-Rather than setting filtering values by hand, as shown above, these -can be set automatically to meet a specific false discovery rate.
-<- optimize_filter(filtObj, msnid, fdr.max=0.01,
- filtObj.grid method="Grid", level="peptide",
- n.iter=500)
- show(filtObj.grid)
## MSnIDFilter object
-## (absParentMassErrorPPM < 3) & (msmsScore > 7.4)
-evaluate_filter(msnid, filtObj.grid)
## fdr n
-## PSM 0.004097561 5146
-## peptide 0.006447651 3278
-## accession 0.021996616 1208
-Filters can eventually be applied (rather than just evaluated) using
-the apply_filter
function.
<- apply_filter(msnid, filtObj.grid)
- msnid show(msnid)
## MSnID object
-## Working directory: "."
-## #Spectrum Files: 1
-## #PSMs: 5146 at 0.41 % FDR
-## #peptides: 3278 at 0.64 % FDR
-## #accessions: 1208 at 2.2 % FDR
-And finally, identifications that matched decoy and contaminant -protein sequences are removed
-<- apply_filter(msnid, "isDecoy == FALSE")
- msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
- msnid show(msnid)
## MSnID object
-## Working directory: "."
-## #Spectrum Files: 1
-## #PSMs: 5117 at 0 % FDR
-## #peptides: 3251 at 0 % FDR
-## #accessions: 1179 at 0 % FDR
-MSnID
data
-The resulting filtered identification data can be exported to a
-data.frame
(or to a dedicated MSnSet
data structure from the
-MSnbase
package) for quantitative MS data, described below, and
-further processed and analyses using appropriate statistical tests.
head(psms(msnid))
## spectrumID scan number(s) acquisitionNum passThreshold rank
-## 1 index=7151 8819 7151 TRUE 1
-## 2 index=8520 10419 8520 TRUE 1
-## calculatedMassToCharge experimentalMassToCharge chargeState MS-GF:DeNovoScore
-## 1 1270.318 1270.318 3 287
-## 2 1426.737 1426.739 3 270
-## MS-GF:EValue MS-GF:PepQValue MS-GF:QValue MS-GF:RawScore MS-GF:SpecEValue
-## 1 1.709082e-24 0 0 239 1.007452e-31
-## 2 3.780745e-24 0 0 230 2.217275e-31
-## AssumedDissociationMethod IsotopeError isDecoy post pre end start accession
-## 1 CID 0 FALSE A K 283 249 CE02347
-## 2 CID 0 FALSE A K 182 142 CE07055
-## length
-## 1 393
-## 2 206
-## description
-## 1 WBGene00001993; locus:hpd-1; 4-hydroxyphenylpyruvate dioxygenase; status:Confirmed; UniProt:Q22633; protein_id:CAA90315.1; T21C12.2
-## 2 WBGene00001755; locus:gst-7; glutathione S-transferase; status:Confirmed; UniProt:P91253; protein_id:AAB37846.1; F11G11.2
-## pepSeq modified modification
-## 1 AISQIQEYVDYYGGSGVQHIALNTSDIITAIEALR FALSE <NA>
-## 2 SAGSGYLVGDSLTFVDLLVAQHTADLLAANAALLDEFPQFK FALSE <NA>
-## idFile spectrumFile
-## 1 c_elegans.mzid.gz c_elegans_A_3_1_21Apr10_Draco_10-03-04_dta.txt
-## 2 c_elegans.mzid.gz c_elegans_A_3_1_21Apr10_Draco_10-03-04_dta.txt
-## databaseFile peptide
-## 1 ID_004174_E48C5B52.fasta K.AISQIQEYVDYYGGSGVQHIALNTSDIITAIEALR.A
-## 2 ID_004174_E48C5B52.fasta K.SAGSGYLVGDSLTFVDLLVAQHTADLLAANAALLDEFPQFK.A
-## numIrregCleavages numMissCleavages msmsScore absParentMassErrorPPM
-## 1 0 0 30.99678 0.3843772
-## 2 0 0 30.65418 1.3689451
-## [ reached 'max' / getOption("max.print") -- omitted 4 rows ]
-
-Page built: -2021-08-31 - using -R version 4.1.0 (2021-05-18) -
-The aim of the R for Mass -Spectrometry initiative is to -provide efficient, thoroughly documented, tested and flexible R -software for the analysis and interpretation of high throughput mass -spectrometry assays, including proteomics and metabolomics -experiments. The project formalises the longtime collaborative -development efforts of its core members under the RforMassSpectrometry -organisation to facilitate dissemination and accessibility of their -work.
-- - -Figure 1.1: The R for Mass Spectrometry intiative sticker, designed by Johannes Rainer. - -
-This material introduces participants to the analysis and exploration -of mass spectrometry (MS) based proteomics data using R and -Bioconductor. The course will cover all levels of MS data, from raw -data to identification and quantitation data, up to the statistical -interpretation of a typical shotgun MS experiment and will focus on -hands-on tutorials. At the end of this course, the participants will -be able to manipulate MS data in R and use existing packages for their -exploratory and statistical proteomics data analysis.
-The course material is targeted to either proteomics practitioners or -data analysts/bioinformaticians that would like to learn how to use R -and Bioconductor to analyse proteomics data. Familiarity with MS or -proteomics in general is desirable, but not essential as we will walk -through and describe a typical MS data as part of learning about the -tools. For approachable introductions to sample preparation, mass -spectrometry, data interpretation and analysis, readers are redirected -to:
-A working knowledge of R (R syntax, commonly used functions, basic -data structures such as data frames, vectors, matrices, … and their -manipulation) is required. Familiarity with other Bioconductor omics -data classes and the tidyverse syntax is useful, but not necessary.
-This material uses the latest version of the R for Mass Spectrometry -package and their dependencies. It might thus be possible that even -the latest Bioconductor stable version isn’t recent enough.
-To install all the necessary package, please use the latest release of -R and execute:
-if (!requireNamespace("BiocManager", quietly = TRUE))
- install.packages("BiocManager")
-
-BiocManager::install("tidyverse")
-BiocManager::install("factoextra")
-BiocManager::install("msdata")
-BiocManager::install("mzR")
-BiocManager::install("rhdf5")
-BiocManager::install("rpx")
-BiocManager::install("MsCoreUtils")
-BiocManager::install("QFeatures")
-BiocManager::install("Spectra")
-BiocManager::install("ProtGenerics")
-BiocManager::install("PSMatch")
-BiocManager::install("pheatmap")
-BiocManager::install("limma")
-BiocManager::install("MSnID")
-BiocManager::install("RforMassSpectrometry/SpectraVis")
Follow the instructions in this -script -to install the packages and download some of the data used in the -following chapters. All software versions used to generate this -document are recoded at the end of the book in 7.
-To compile and render the teaching material, you will also need -the BiocStyle package and the (slighly -modified) Modern Statistics for Model Biology (msmb) HTML Book -Style by Mike -Smith:
- -Run the installation -script -by executing the line below to install all requirements to compile the -book:
- -Thank you to Charlotte Soneson for -fixing many typos in a previous version of this book.
-
This material is licensed under a Creative Commons
-Attribution-ShareAlike 4.0 International License. You are free to
-share (copy and redistribute the material in any medium or format)
-and adapt (remix, transform, and build upon the material) for any
-purpose, even commercially, as long as you give appropriate credit and
-distribute your contributions under the same license as the original.
Page built: -2023-09-06 - using -R version 4.3.1 Patched (2023-07-10 r84676) -
-In this section, we will learn how to read raw data from in one of the
-commonly used open formats (mzML
, mzXML
and netCDF
) into R.
|Data type |File format |Data structure |Package |
-|:----------|:-------------|:----------------------------|:-----------------|
-|Raw |mzXML or mzML |mzRpwiz or mzRramp |mzR |
-|Raw |mzXML or mzML |list of MassSpectrum objects |MALDIquantForeign |
-|Raw |mzXML or mzML |MSnExp |MSnbase |
-|Peak lists |mgf |MSnExp |MSnbase |
-|Raw |several |Spectra |Spectra |
-When we manipulate complex data, we need a way to abstract it.
-The need for and abstraction saves us from having to know about all -the details of that data and its associated metadata. This allows -to rely on a few easy-to-remember conventions to make mundane and -repetitive tasks trivial and be able to complete more complex things -easily. Abstractions provide a smoother approaches to handle complex -data using common patterns.
-Spectra
class
-We are going to use the
-Spectra
package
-as an abstraction to raw mass spectrometry data.
library(Spectra)
Spectra
is part of the R for Mass Spectrometry
-initiative initiative. It
-defines the Spectra
class that is used as a raw data abstration, to
-maniputate MS data and metadata. The best way to learn about a data
-structure is to create one by hand.
Let’s create a DataFrame
4 As defined in the Bioconductor S4Vectors
-package. containing MS levels, retention time, m/z and intensities
-for 2 spectra:
<- DataFrame(msLevel = c(1L, 2L), rtime = c(1.1, 1.2))
- spd $mz <- list(c(100, 103.2, 104.3, 106.5), c(45.6, 120.4, 190.2))
- spd$intensity <- list(c(200, 400, 34.2, 17), c(12.3, 15.2, 6.8))
- spd spd
## DataFrame with 2 rows and 4 columns
-## msLevel rtime mz intensity
-## <integer> <numeric> <list> <list>
-## 1 1 1.1 100.0,103.2,104.3,... 200.0,400.0, 34.2,...
-## 2 2 1.2 45.6,120.4,190.2 12.3,15.2, 6.8
-And now convert this DataFrame
into a Spectra
object:
<- Spectra(spd)
- sp0 sp0
## MSn data (Spectra) with 2 spectra in a MsBackendDataFrame backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 1.1 NA
-## 2 2 1.2 NA
-## ... 16 more variables/columns.
-Explore the newly created object using
-spectraVariables
to extract all the metadata variables.spectraData
to extract all the metadata.peaksData
to extract a list containing thet raw data.[
to create subsets.Spectra
from mzML files
-Let’s now create a new object using the mzML data previously
-downloaded and available in the mzf
file.
mzf
## [1] "/home/lgatto/.cache/R/rpx/b87c573dec94f_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
-<- Spectra(mzf)
- sp sp
## MSn data (Spectra) with 7534 spectra in a MsBackendMzR backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.4584 1
-## 2 1 0.9725 2
-## 3 1 1.8524 3
-## 4 1 2.7424 4
-## 5 1 3.6124 5
-## ... ... ... ...
-## 7530 2 3600.47 7530
-## 7531 2 3600.83 7531
-## 7532 2 3601.18 7532
-## 7533 2 3601.57 7533
-## 7534 2 3601.98 7534
-## ... 33 more variables/columns.
-##
-## file(s):
-## b87c573dec94f_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-length()
.Mass spectrometry data in Spectra
objects can be thought of as a
-list of individual spectra, with each spectrum having a set of
-variables associated with it. Besides core spectra variables (such
-as MS level or retention time) an arbitrary number of optional
-variables can be assigned to a spectrum. The core spectra variables
-all have their own accessor method and it is guaranteed that a value
-is returned by it (or NA
if the information is not available). The
-core variables and their data type are (alphabetically ordered):
integer(1)
: the index of acquisition of a
-spectrum during a MS run.logical(1)
: whether the spectrum is in profile or
-centroid mode.numeric(1)
: collision energy used to create an
-MSn spectrum.character(1)
: the origin of the spectrum’s data,
-e.g. the mzML file from which it was read.character(1)
: the (current) storage location of the
-spectrum data. This value depends on the backend used to handle and
-provide the data. For an in-memory backend like the
-MsBackendDataFrame
this will be "<memory>"
, for an on-disk
-backend such as the MsBackendHdf5Peaks
it will be the name of the
-HDF5 file where the spectrum’s peak data is stored.numeric
: intensity values for the spectrum’s peaks.numeric(1)
: lower m/z for the isolation
-window in which the (MSn) spectrum was measured.numeric(1)
: the target m/z for the
-isolation window in which the (MSn) spectrum was measured.numeric(1)
: upper m/z for the isolation
-window in which the (MSn) spectrum was measured.integer(1)
: the MS level of the spectrum.numeric
: the m/z values for the spectrum’s peaks.integer(1)
: the polarity of the spectrum (0
and 1
-representing negative and positive polarity, respectively).integer(1)
: the scan (acquisition) number of the
-precursor for an MSn spectrum.integer(1)
: the charge of the precursor of an
-MSn spectrum.numeric(1)
: the intensity of the precursor of
-an MSn spectrum.numeric(1)
: the m/z of the precursor of an MSn
-spectrum.numeric(1)
: the retention time of a spectrum.integer(1)
: the index of a spectrum within a (raw)
-file.logical(1)
: whether the spectrum was smoothed.For details on the individual variables and their getter/setter
-function see the help for Spectra
(?Spectra
). Also note that these
-variables are suggested, but not required to characterize a
-spectrum. Also, some only make sense for MSn, but not for MS1 spectra.
msLevel(.)
) or using the $
notation (for example .$msLevel
).plotSpectra()
-function to visualise peaks and set the m/z range with the xlim
-arguments.Using the first raw data file starting with MS3TMT10
, answer the
-following questions:
These objects and their manipulations are not limited to single files:
-<- dir(system.file("sciex", package = "msdata"), full.names = TRUE)) (fls
## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/sciex/20171016_POOL_POS_1_105-134.mzML"
-## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/sciex/20171016_POOL_POS_3_105-134.mzML"
-<- Spectra(fls)
- sp_sciex table(dataOrigin(sp_sciex))
##
-## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/sciex/20171016_POOL_POS_1_105-134.mzML
-## 931
-## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/sciex/20171016_POOL_POS_3_105-134.mzML
-## 931
-Backends allow to use different backends to store mass spectrometry data while
-providing via the Spectra
class a unified interface to use that data. The
-Spectra
package defines a set of example backends but any object extending the
-base MsBackend
class could be used instead. The default backends are:
MsBackendMzR
: this backend keeps only general spectra variables in memory
-and relies on the mzR package to read mass peaks (m/z and
-intensity values) from the original MS files on-demand. sp_sciex
## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.280 1
-## 2 1 0.559 2
-## 3 1 0.838 3
-## 4 1 1.117 4
-## 5 1 1.396 5
-## ... ... ... ...
-## 1858 1 258.636 927
-## 1859 1 258.915 928
-## 1860 1 259.194 929
-## 1861 1 259.473 930
-## 1862 1 259.752 931
-## ... 33 more variables/columns.
-##
-## file(s):
-## 20171016_POOL_POS_1_105-134.mzML
-## 20171016_POOL_POS_3_105-134.mzML
-MsBackendDataFrame
: the mass spectrometry data is stored (in-memory) in a
-DataFrame
. Keeping the data in memory guarantees high performance but has
-also, depending on the number of mass peaks in each spectrum, a much higher
-memory footprint.setBackend(sp_sciex, MsBackendDataFrame())
## MSn data (Spectra) with 1862 spectra in a MsBackendDataFrame backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.280 1
-## 2 1 0.559 2
-## 3 1 0.838 3
-## 4 1 1.117 4
-## 5 1 1.396 5
-## ... ... ... ...
-## 1858 1 258.636 927
-## 1859 1 258.915 928
-## 1860 1 259.194 929
-## 1861 1 259.473 930
-## 1862 1 259.752 931
-## ... 33 more variables/columns.
-## Processing:
-## Switch backend from MsBackendMzR to MsBackendDataFrame [Tue Aug 31 11:35:38 2021]
-MsBackendHdf5Peaks
: similar to MsBackendMzR
this backend reads peak data
-only on-demand from disk while all other spectra variables are kept in
-memory. The peak data are stored in Hdf5 files which guarantees scalability.<- setBackend(sp_sciex, MsBackendHdf5Peaks(), hdf5path = tempdir())
- sp_hdf5 sp_hdf5
## MSn data (Spectra) with 1862 spectra in a MsBackendHdf5Peaks backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.280 1
-## 2 1 0.559 2
-## 3 1 0.838 3
-## 4 1 1.117 4
-## 5 1 1.396 5
-## ... ... ... ...
-## 1858 1 258.636 927
-## 1859 1 258.915 928
-## 1860 1 259.194 929
-## 1861 1 259.473 930
-## 1862 1 259.752 931
-## ... 33 more variables/columns.
-##
-## file(s):
-## 20171016_POOL_POS_1_105-134.h5
-## 20171016_POOL_POS_3_105-134.h5
-## Processing:
-## Switch backend from MsBackendMzR to MsBackendHdf5Peaks [Tue Aug 31 11:35:44 2021]
-table(sp_hdf5$dataOrigin)
##
-## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/sciex/20171016_POOL_POS_1_105-134.mzML
-## 931
-## /home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/sciex/20171016_POOL_POS_3_105-134.mzML
-## 931
-table(sp_hdf5$dataStorage)
##
-## /tmp/RtmpFxKO1C/20171016_POOL_POS_1_105-134.h5
-## 931
-## /tmp/RtmpFxKO1C/20171016_POOL_POS_3_105-134.h5
-## 931
-All of the above mentioned backends support changing all of their their spectra
-variables, except the MsBackendMzR
that does not support changing m/z or
-intensity values for the mass peaks.
With the example below we load the data from a single mzML file and use a
-MsBackendHdf5Peaks
backend for data storage. The hdf5path
parameter allows
-us to specify the storage location of the HDF5 file.
There is also an (under development) SQLite-based backend called
-MsBackendSqlDb
-that will store all data, i.e. raw and metadata, on disk.
mzR
(optional)
-The mzR
package in a direct interface to the
-proteowizard code base. It
-includes a substantial proportion of pwiz’s C/C++ code for fast and
-efficient parsing of these large raw data files.
Let’s start by using some raw data files from the msdata
-package. After loading it, we use the proteomics()
function to
-return the full file names for two raw data files. We will start by
-focusing on the second one.
<- msdata::proteomics(full.names = TRUE)
- f f
## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/proteomics/MRM-standmix-5.mzML.gz"
-## [2] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/proteomics/MS3TMT10_01022016_32917-33481.mzML.gz"
-## [3] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/proteomics/MS3TMT11.mzML"
-## [4] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
-## [5] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
-<- grep("20141210", f, value = TRUE)) (f2
## [1] "/home/lgatto/R/x86_64-pc-linux-gnu-library/4.1/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
-The three main functions of mzR
are
openMSfile
to create a file handle to a raw data fileheader
to extract metadata about the spectra contained in the filepeaks
to extract one or multiple spectra of interest.Other functions such as instrumentInfo
, or runInfo
can be used to
-gather general information about a run.
library("mzR")
-<- openMSfile(f2)
- ms ms
## Mass Spectrometry file handle.
-## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz
-## Number of scans: 7534
-<- header(ms)
- hd dim(hd)
## [1] 7534 31
-names(hd)
## [1] "seqNum" "acquisitionNum"
-## [3] "msLevel" "polarity"
-## [5] "peaksCount" "totIonCurrent"
-## [7] "retentionTime" "basePeakMZ"
-## [9] "basePeakIntensity" "collisionEnergy"
-## [11] "ionisationEnergy" "lowMZ"
-## [13] "highMZ" "precursorScanNum"
-## [15] "precursorMZ" "precursorCharge"
-## [17] "precursorIntensity" "mergedScan"
-## [19] "mergedResultScanNum" "mergedResultStartScanNum"
-## [21] "mergedResultEndScanNum" "injectionTime"
-## [23] "filterString" "spectrumId"
-## [25] "centroided" "ionMobilityDriftTime"
-## [27] "isolationWindowTargetMZ" "isolationWindowLowerOffset"
-## [29] "isolationWindowUpperOffset" "scanWindowLowerLimit"
-## [31] "scanWindowUpperLimit"
-head(peaks(ms, 117))
## [,1] [,2]
-## [1,] 399.9976 0
-## [2,] 399.9991 0
-## [3,] 400.0006 0
-## [4,] 400.0021 0
-## [5,] 400.2955 0
-## [6,] 400.2970 0
-str(peaks(ms, 1:5))
## List of 5
-## $ : num [1:25800, 1:2] 400 400 400 400 400 ...
-## $ : num [1:25934, 1:2] 400 400 400 400 400 ...
-## $ : num [1:26148, 1:2] 400 400 400 400 400 ...
-## $ : num [1:26330, 1:2] 400 400 400 400 400 ...
-## $ : num [1:26463, 1:2] 400 400 400 400 400 ...
--► Question -
-Let’s extract the index of the MS2 spectrum with the highest base peak -intensity and plot its spectrum. Is the data centroided or in profile -mode?
-- -
--► Solution -
- --► Question -
-Pick an MS1 spectrum and visually check whether it is centroided or in -profile mode.
-- -
--► Solution -
- -The importance of flexible access to specialised data becomes visible
-in the figure below (taken from the RforProteomics
visualisation
-vignette).
-Not only can we access specific data and understand/visualise them,
-but we can transverse all the data and extracted/visualise/understand
-structured slices of data.
The figure below show is an illustration of how mass spectrometry -works:
-The chromatogram at the top display to total ion current along the -retention time. The vertical line identifies one scan in particular -at retention time 1800.68 seconds (the 2807th scan).
The spectra on the second line represent the full MS1 spectrum -marked by the red line. The vertical lines identify the 10 -precursor ions that where selected for MS2 analysis. The zoomed in -on the right shows one specific precursor peak.
The MS2 spectra displayed along the two rows at the bottom are -those resulting from the fragmentation of the 10 precursor peaks -identified by the vertical bars above.
We are going to reproduce the figure above trought a set of exercices.
--► Question -
-totIonCurrent
-and rtime
variables for all MS1 spectra. Annotate the spectrum of
-interest.- -
--► Solution -
- --► Question -
-filterPrecursorScan()
function can be used to retains parent
-(MS1) and children scans (MS2) of a scan, as defined by its
-acquisition number. Use it to extract the MS1 scan of interest and
-all its MS2 children.- -
--► Solution -
- --► Question -
-- -
--► Solution -
- --► Question -
-- -
--► Solution -
- --► Question -
-plotSpectra()
function is used to plot all 10 MS2 spectra in
-one call.- -
--► Solution -
- -It is possible to label the peaks with the plotSpectra()
-function. The labels
argument is either a character
of appropriate
-length (i.e. with a label for each peak) or, as illustrated below, a
-function that computes the labels.
<- function(z) {
- mzLabel <- peaksData(z)[[1L]]
- z <- format(z[, "mz"], digits = 4)
- lbls "intensity"] < 1e5] <- ""
- lbls[z[,
- lbls
- }
-plotSpectra(ms_2[7],
-xlim = c(126, 132),
- labels = mzLabel,
- labelSrt = -30, labelPos = 2,
- labelOffset = 0.1)
Spectra can also be compared either by overlay or mirror plotting
-using the plotSpectraOverlay()
and plotSpectraMirror()
functions.
-► Question -
-Filter MS2 level spectra and find any 2 MS2 spectra that have matching -precursor peaks based on the precursor m/z values.
-- -
--► Solution -
- --► Question -
-Visualise the matching pair using the plotSpectraOverlay()
and
-plotSpectraMirror()
functions.
- -
--► Solution -
- -Apart from classical subsetting operations such as [
and split
,
-a set of filter functions are defined for Spectra
objects (for
-detailed help please see the ?Spectra
help):
filterAcquisitionNum
: retain spectra with certain acquisition numbers.filterDataOrigin
: subset to spectra from specific origins.filterDataStorage
: subset to spectra from certain data storage files.filterEmptySpectra
: remove spectra without mass peaks.filterMzRange
: subset spectra keeping only peaks with an m/z within the
-provided m/z range.filterMzValues
: subset spectra keeping only peaks matching provided m/z
-value(s).filterIsolationWindow
: keep spectra with the provided mz
in their
-isolation window (m/z range).filterMsLevel
: filter by MS level.filterPolarity
: filter by polarity.filterPrecursorMz
: retain (MSn) spectra with a precursor m/z within the
-provided m/z range.filterPrecursorScan
: retain (parent and children) scans of an acquisition
-number.filterRt
: filter based on retention time ranges.-► Question -
-Using the sp_sciex
data, select all spectra measured in the second
-mzML file and subsequently filter them to retain spectra measured
-between 175 and 189 seconds in the measurement run.
- -
--► Solution -
- -As an example of data processing, below we use the pickPeaks()
-function to pick peaks:
plotSpectra(sp[2807], xlim = c(521.2, 522.5))
library("magrittr")
-pickPeaks(sp[2807]) %>%
-filterIntensity(1e7) %>%
- plotSpectra(xlim = c(521.2, 522.5))
Most functions on Spectra
support (and use) parallel processing out
-of the box. Peak data access and manipulation methods perform by
-default parallel processing on a per-file basis (i.e. using the
-dataStorage variable as splitting factor). Spectra uses
-BiocParallel
for
-parallel processing and all functions use the default registered
-parallel processing setup of that package.
Data manipulations on Spectra objects are not immediately applied to
-the peak data. They are added to a so called processing queue which is
-applied each time peak data is accessed (with the peaksData
, mz
or
-intensity
functions). Thanks to this processing queue data
-manipulation operations are also possible for read-only backends
-(e.g. mzML-file based backends or database-based backends). The
-information about the number of such processing steps can be seen
-below (next to Lazy evaluation queue).
min(intensity(sp_sciex[1]))
## [1] 0
-<- filterIntensity(sp_sciex, intensity = c(10, Inf))
- sp_sciex ## Note the lazy evaluation queue sp_sciex
## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.280 1
-## 2 1 0.559 2
-## 3 1 0.838 3
-## 4 1 1.117 4
-## 5 1 1.396 5
-## ... ... ... ...
-## 1858 1 258.636 927
-## 1859 1 258.915 928
-## 1860 1 259.194 929
-## 1861 1 259.473 930
-## 1862 1 259.752 931
-## ... 33 more variables/columns.
-##
-## file(s):
-## 20171016_POOL_POS_1_105-134.mzML
-## 20171016_POOL_POS_3_105-134.mzML
-## Lazy evaluation queue: 1 processing step(s)
-## Processing:
-## Remove peaks with intensities outside [10, Inf] in spectra of MS level(s) 1. [Tue Aug 31 11:35:51 2021]
-min(intensity(sp_sciex[1]))
## [1] 412
-@processingQueue sp_sciex
## [[1]]
-## Object of class "ProcessingStep"
-## Function: user-provided function
-## Arguments:
-## o intensity = 10Inf
-## o msLevel = 1
-<- reset(sp_sciex)
- sp_sciex @processingQueue sp_sciex
## list()
-min(intensity(sp_sciex[1]))
## [1] 0
-
-Page built: -2021-08-31 - using -R version 4.1.0 (2021-05-18) -
-mzR
package
-The mzR
package is a direct interface to the
-proteowizard code base. It
-includes a substantial proportion of pwiz’s C/C++ code for fast and
-efficient parsing of these large raw data files.
Let’s start by using some raw data files from the msdata
-package. After loading it, we use the proteomics()
function to
-return the full file names for two raw data files. We will start by
-focusing on the second one.
## [1] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/proteomics/MRM-standmix-5.mzML.gz"
-## [2] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/proteomics/MS3TMT10_01022016_32917-33481.mzML.gz"
-## [3] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/proteomics/MS3TMT11.mzML"
-## [4] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
-## [5] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
-
-## [1] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
-The three main functions of mzR
are
openMSfile
to create a file handle to a raw data fileheader
to extract metadata about the spectra contained in the filepeaks
to extract one or multiple spectra of interest.Other functions such as instrumentInfo
, or runInfo
can be used to
-gather general information about a run.
## Mass Spectrometry file handle.
-## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz
-## Number of scans: 7534
-
-## [1] 7534 31
-
-## [1] "seqNum" "acquisitionNum"
-## [3] "msLevel" "polarity"
-## [5] "peaksCount" "totIonCurrent"
-## [7] "retentionTime" "basePeakMZ"
-## [9] "basePeakIntensity" "collisionEnergy"
-## [11] "ionisationEnergy" "lowMZ"
-## [13] "highMZ" "precursorScanNum"
-## [15] "precursorMZ" "precursorCharge"
-## [17] "precursorIntensity" "mergedScan"
-## [19] "mergedResultScanNum" "mergedResultStartScanNum"
-## [21] "mergedResultEndScanNum" "injectionTime"
-## [23] "filterString" "spectrumId"
-## [25] "centroided" "ionMobilityDriftTime"
-## [27] "isolationWindowTargetMZ" "isolationWindowLowerOffset"
-## [29] "isolationWindowUpperOffset" "scanWindowLowerLimit"
-## [31] "scanWindowUpperLimit"
-
-## mz intensity
-## [1,] 399.9976 0
-## [2,] 399.9991 0
-## [3,] 400.0006 0
-## [4,] 400.0021 0
-## [5,] 400.2955 0
-## [6,] 400.2970 0
-
-## List of 5
-## $ : num [1:25800, 1:2] 400 400 400 400 400 ...
-## ..- attr(*, "dimnames")=List of 2
-## .. ..$ : NULL
-## .. ..$ : chr [1:2] "mz" "intensity"
-## $ : num [1:25934, 1:2] 400 400 400 400 400 ...
-## ..- attr(*, "dimnames")=List of 2
-## .. ..$ : NULL
-## .. ..$ : chr [1:2] "mz" "intensity"
-## $ : num [1:26148, 1:2] 400 400 400 400 400 ...
-## ..- attr(*, "dimnames")=List of 2
-## .. ..$ : NULL
-## .. ..$ : chr [1:2] "mz" "intensity"
-## $ : num [1:26330, 1:2] 400 400 400 400 400 ...
-## ..- attr(*, "dimnames")=List of 2
-## .. ..$ : NULL
-## .. ..$ : chr [1:2] "mz" "intensity"
-## $ : num [1:26463, 1:2] 400 400 400 400 400 ...
-## ..- attr(*, "dimnames")=List of 2
-## .. ..$ : NULL
-## .. ..$ : chr [1:2] "mz" "intensity"
--► Question -
-Let’s extract the index of the MS2 spectrum with the highest base peak -intensity and plot its spectrum. Is the data centroided or in profile -mode?
-- -
--► Solution -
- --► Question -
-Pick an MS1 spectrum and visually check whether it is centroided or in -profile mode.
-- -
--► Solution -
- -There are two packages that can be used to parse mzIdentML
files,
-namely mzR
(that we have already used for raw data) and mzID
. The
-major difference is that the former leverages C++ code from
-proteowizard
and is hence faster than the latter (which uses the
-XML
R package). They both work in similar ways.
|Data type |File format |Data structure |Package |
-|:--------------|:-----------|:--------------|:-------|
-|Identification |mzIdentML |mzRident |mzR |
-|Identification |mzIdentML |mzID |mzID |
-Which of these packages is used by PSM()
can be defined by the
-parser
argument, as documented in ?PSM
.
mzID
-The main functions are mzID
to read the data into a dedicated data
-class and flatten
to transform it into a data.frame
.
## [1] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/ident/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
-
-##
-## Attaching package: 'mzID'
-## The following object is masked from 'package:purrr':
-##
-## flatten
-## The following object is masked from 'package:dplyr':
-##
-## id
-
-## reading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid... DONE!
-
-## An mzID object
-##
-## Software used: MS-GF+ (version: Beta (v10072))
-##
-## Rawfile: /home/lg390/dev/01_svn/workflows/proteomics/TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-##
-## Database: /home/lg390/dev/01_svn/workflows/proteomics/erwinia_carotovora.fasta
-##
-## Number of scans: 5343
-## Number of PSM's: 5656
-Various data can be extracted from the mzID
object, using one of the
-accessor functions such as database
, software
, scans
, peptides
,
-… The object can also be converted into a data.frame
using the
-flatten
function.
## spectrumid scan number(s) acquisitionnum
-## 1 controllerType=0 controllerNumber=1 scan=5782 5782 5782
-## 2 controllerType=0 controllerNumber=1 scan=6037 6037 6037
-## 3 controllerType=0 controllerNumber=1 scan=5235 5235 5235
-## passthreshold rank calculatedmasstocharge experimentalmasstocharge
-## 1 TRUE 1 1080.232 1080.233
-## 2 TRUE 1 1002.212 1002.209
-## 3 TRUE 1 1189.280 1189.284
-## chargestate ms-gf:denovoscore ms-gf:evalue ms-gf:pepqvalue ms-gf:qvalue
-## 1 3 174 1.086033e-20 0 0
-## 2 3 245 1.988774e-19 0 0
-## 3 3 264 5.129649e-19 0 0
-## ms-gf:rawscore ms-gf:specevalue assumeddissociationmethod isotopeerror
-## 1 147 3.764831e-27 HCD 0
-## 2 214 6.902626e-26 HCD 0
-## 3 211 1.778789e-25 HCD 0
-## isdecoy post pre end start accession length
-## 1 FALSE S R 84 50 ECA1932 155
-## 2 FALSE R K 315 288 ECA1147 434
-## 3 FALSE A R 224 192 ECA0013 295
-## description pepseq
-## 1 outer membrane lipoprotein PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR
-## 2 trigger factor TQVLDGLINANDIEVPVALIDGEIDVLR
-## 3 ribose-binding periplasmic protein TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR
-## modified modification
-## 1 FALSE <NA>
-## 2 FALSE <NA>
-## 3 FALSE <NA>
-## idFile
-## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## spectrumFile
-## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## 3 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## databaseFile
-## 1 erwinia_carotovora.fasta
-## 2 erwinia_carotovora.fasta
-## 3 erwinia_carotovora.fasta
-## [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
-mzR
-The mzR
interface provides a similar interface. It is however much
-faster as it does not read all the data into memory and only extracts
-relevant data on demand. It has also accessor functions such as
-softwareInfo
, mzidInfo
, … (use showMethods(classes = "mzRident", where = "package:mzR")
)
-to see all available methods.
## Identification file handle.
-## Filename: TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## Number of psms: 5759
-
-## [1] "MS-GF+ Beta (v10072) "
-## [2] "ProteoWizard MzIdentML 3.0.21263 ProteoWizard"
-The identification data can be accessed as a data.frame
with the
-psms
accessor.
## spectrumID chargeState rank passThreshold
-## 1 controllerType=0 controllerNumber=1 scan=5782 3 1 TRUE
-## 2 controllerType=0 controllerNumber=1 scan=6037 3 1 TRUE
-## 3 controllerType=0 controllerNumber=1 scan=5235 3 1 TRUE
-## 4 controllerType=0 controllerNumber=1 scan=5397 3 1 TRUE
-## 5 controllerType=0 controllerNumber=1 scan=6075 3 1 TRUE
-## experimentalMassToCharge calculatedMassToCharge
-## 1 1080.2325 1080.2321
-## 2 1002.2089 1002.2115
-## 3 1189.2836 1189.2800
-## 4 960.5365 960.5365
-## 5 1264.3409 1264.3419
-## sequence peptideRef modNum isDecoy post pre start
-## 1 PVQIQAGEDSNVIGALGGAVLGGFLGNTIGGGSGR Pep1 0 FALSE S R 50
-## 2 TQVLDGLINANDIEVPVALIDGEIDVLR Pep2 0 FALSE R K 288
-## 3 TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR Pep3 0 FALSE A R 192
-## 4 SQILQQAGTSVLSQANQVPQTVLSLLR Pep4 0 FALSE - R 264
-## 5 PIIGDNPFVVVLPDVVLDESTADQTQENLALLISR Pep5 0 FALSE F R 119
-## end DatabaseAccess DBseqLength DatabaseSeq
-## 1 84 ECA1932 155
-## 2 315 ECA1147 434
-## 3 224 ECA0013 295
-## 4 290 ECA1731 290
-## 5 153 ECA1443 298
-## DatabaseDescription scan.number.s.
-## 1 ECA1932 outer membrane lipoprotein 5782
-## 2 ECA1147 trigger factor 6037
-## 3 ECA0013 ribose-binding periplasmic protein 5235
-## 4 ECA1731 flagellin 5397
-## 5 ECA1443 UTP--glucose-1-phosphate uridylyltransferase 6075
-## acquisitionNum
-## 1 5782
-## 2 6037
-## 3 5235
-## 4 5397
-## 5 6075
-## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
-
-Page built: -2023-09-06 - using -R version 4.3.1 Patched (2023-07-10 r84676) -
-Peptide identification is performed using third-party software - there
-is no package to run these searches directly in R. When using line
-search engines it possible to hard-code or automatically generate the
-search command lines and run them from R using a system()
call. This
-allows to generate these reproducibly (especially useful if many
-command lines need to be run) and to keep a record in the R script of
-the exact command.
The example below illustrates this for 3 mzML files to be searched
-using MSGFplus
:
## [1] "file_1.mzML" "file_2.mzML" "file_3.mzML"
-
-## [1] "file_1.mzid" "file_2.mzid" "file_3.mzid"
-paste0("java -jar /path/to/MSGFPlus.jar",
- " -s ", mzmls,
- " -o ", mzids,
- " -d uniprot.fas",
- " -t 20ppm",
- " -m 0",
- " int 1")
## [1] "java -jar /path/to/MSGFPlus.jar -s file_1.mzML -o file_1.mzid -d uniprot.fas -t 20ppm -m 0 int 1"
-## [2] "java -jar /path/to/MSGFPlus.jar -s file_2.mzML -o file_2.mzid -d uniprot.fas -t 20ppm -m 0 int 1"
-## [3] "java -jar /path/to/MSGFPlus.jar -s file_3.mzML -o file_3.mzid -d uniprot.fas -t 20ppm -m 0 int 1"
-Let’s use the identification from msdata
:
## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
-The easiest way to read identification data in mzIdentML
(often
-abbreviated with mzid
) into R is to read it with the readPSMs()
-function from the
-PSMatch
-package5. The function will parse the file and return a
-DataFrame
.
## [1] 5802 35
-
-## [1] "sequence" "spectrumID"
-## [3] "chargeState" "rank"
-## [5] "passThreshold" "experimentalMassToCharge"
-## [7] "calculatedMassToCharge" "peptideRef"
-## [9] "modNum" "isDecoy"
-## [11] "post" "pre"
-## [13] "start" "end"
-## [15] "DatabaseAccess" "DBseqLength"
-## [17] "DatabaseSeq" "DatabaseDescription"
-## [19] "scan.number.s." "acquisitionNum"
-## [21] "spectrumFile" "idFile"
-## [23] "MS.GF.RawScore" "MS.GF.DeNovoScore"
-## [25] "MS.GF.SpecEValue" "MS.GF.EValue"
-## [27] "MS.GF.QValue" "MS.GF.PepQValue"
-## [29] "modPeptideRef" "modName"
-## [31] "modMass" "modLocation"
-## [33] "subOriginalResidue" "subReplacementResidue"
-## [35] "subLocation"
--► Question -
-Verify that this table contains 5802 matches for 5343 -scans and 4938 peptides sequences.
-- -
--► Solution -
- -The PSM data are read as is, without any filtering. As we can see -below, we still have all the hits from the forward and reverse (decoy) -databases.
- -##
-## FALSE TRUE
-## 2906 2896
-The data contains also contains multiple matches for several -spectra. The table below shows the number of number of spectra that -have 1, 2, … up to 5 matches.
- -##
-## 1 2 3 4 5
-## 4936 369 26 10 2
-Below, we can see how scan 1774 has 4 matches, all to sequence
-RTRYQAEVR
, which itself matches to 4 different proteins:
i <- which(id$spectrumID == "controllerType=0 controllerNumber=1 scan=1774")
-data.frame(id[i, ])[1:5]
## sequence spectrumID chargeState rank
-## 1 RTRYQAEVR controllerType=0 controllerNumber=1 scan=1774 2 1
-## 2 RTRYQAEVR controllerType=0 controllerNumber=1 scan=1774 2 1
-## 3 RTRYQAEVR controllerType=0 controllerNumber=1 scan=1774 2 1
-## 4 RTRYQAEVR controllerType=0 controllerNumber=1 scan=1774 2 1
-## passThreshold
-## 1 TRUE
-## 2 TRUE
-## 3 TRUE
-## 4 TRUE
-If the goal is to keep all the matches, but arranged by scan/spectrum,
-one can reduce the PSM
object by the spectrumID
variable, so
-that each scan correponds to a single row that still stores all
-values6:
## Reduced PSM with 5343 rows and 35 columns.
-## names(35): sequence spectrumID ... subReplacementResidue subLocation
-The resulting object contains a single entry for scan 1774 with -information for the multiple matches stored as lists within the cells.
- -## Reduced PSM with 1 rows and 35 columns.
-## names(35): sequence spectrumID ... subReplacementResidue subLocation
-
-## CharacterList of length 1
-## [["controllerType=0 controllerNumber=1 scan=1774"]] ECA2104 ECA2867 ECA3427 ECA4142
-The is the type of complete identification table that could be used to
-annotate an raw mass spectrometry Spectra
object, as shown below.
Often, the PSM data is filtered to only retain reliable matches. The
-MSnID
package can be used to set thresholds to attain user-defined
-PSM, peptide or protein-level FDRs. Here, we will simply filter out
-wrong identification manually.
Here, the filter()
from the dplyr
package comes very handy. We
-will thus start by converting the DataFrame
to a tibble
.
## # A tibble: 5,802 × 35
-## sequence spectrumID chargeState rank passThreshold experimentalMassToCh…¹
-## <chr> <chr> <int> <int> <lgl> <dbl>
-## 1 RQCRTDFLNY… controlle… 3 1 TRUE 548.
-## 2 ESVALADQVT… controlle… 2 1 TRUE 1288.
-## 3 KELLCLAMQI… controlle… 2 1 TRUE 744.
-## 4 QRMARTSDKQ… controlle… 3 1 TRUE 913.
-## 5 KDEGSTEPLK… controlle… 3 1 TRUE 927.
-## 6 DGGPAIYGHE… controlle… 3 1 TRUE 969.
-## 7 QRMARTSDKQ… controlle… 2 1 TRUE 1369.
-## 8 CIDRARHVEV… controlle… 3 1 TRUE 1285.
-## 9 CIDRARHVEV… controlle… 3 1 TRUE 1285.
-## 10 VGRCRPIINY… controlle… 2 1 TRUE 1102.
-## # ℹ 5,792 more rows
-## # ℹ abbreviated name: ¹experimentalMassToCharge
-## # ℹ 29 more variables: calculatedMassToCharge <dbl>, peptideRef <chr>,
-## # modNum <int>, isDecoy <lgl>, post <chr>, pre <chr>, start <int>, end <int>,
-## # DatabaseAccess <chr>, DBseqLength <int>, DatabaseSeq <chr>,
-## # DatabaseDescription <chr>, scan.number.s. <dbl>, acquisitionNum <dbl>,
-## # spectrumFile <chr>, idFile <chr>, MS.GF.RawScore <dbl>, …
--► Question -
-- -
--► Solution -
- --► Question -
-- -
--► Solution -
- --► Question -
-XXX_ECA3406
and ECA3415
. Scan 4099 match XXX_ECA4416_1
,
-XXX_ECA4416_2
and XXX_ECA4416_3
. Then remove the scans that
-match any of these proteins.- -
--► Solution -
- -Which leaves us with 2666 PSMs.
-This can also be achieved with the filterPSMs()
function, or the
-individual filterPsmRank()
, filterPsmDecoy
and filterPsmShared()
-functions:
## Starting with 5802 PSMs:
-## Removed 2896 decoy hits.
-## Removed 155 PSMs with rank > 1.
-## Removed 85 shared peptides.
-## 2666 PSMs left.
-The describePeptides()
and describeProteins()
functions from the
-PSMatch
package provide useful summaries of preptides and proteins
-in a PSM search result.
describePeptides()
gives the number of unique and shared peptides
-and for the latter, the size of their protein groups:## 2324 peptides composed of
-## unique peptides: 2324
-## shared peptides (among protein):
-## ()
-describeProteins()
gives the number of proteins defined by only
-unique, only shared, or a mixture of unique/shared peptides:## 1466 proteins composed of
-## only unique peptides: 1466
-## only shared peptides: 0
-## unique and shared peptides: 0
-The Understanding protein groups with adjacency
-matrices
-PSMatch
vignette provides additional tools to explore how proteins
-were inferred from peptides.
-► Question -
-Compare the distribution of raw identification scores of the decoy and -non-decoy hits. Interpret the figure.
-- -
--► Solution -
- --► Question -
-The tidyverse
-tools are fit for data wrangling with identification data. Using the
-above identification dataframe, calculate the length of each peptide
-(you can use nchar
with the peptide sequence sequence
) and the
-number of peptides for each protein (defined as
-DatabaseDescription
). Plot the length of the proteins against their
-respective number of peptides.
- -
--► Solution -
- -If you would like to learn more about how the mzid data are handled by
-PSMatch
via the mzR and mzID
-packages, check out the 6.2 section in the annex.
We are goind to use the sp
object created in the previous chapter
-and the id_filtered
variable generated above.
Identification data (as a DataFrame
) can be merged into raw data (as
-a Spectra
object) by adding new spectra variables to the appropriate
-MS2 spectra. Scans and peptide-spectrum matches can be matched by
-their spectrum identifers.
-► Question -
-Identify the spectum identifier columns in the sp
the id_filtered
-variables.
- -
--► Solution -
- -We still have several PTMs that are matched to a single spectrum -identifier:
- -##
-## 1 2 3 4
-## 2630 13 2 1
-Let’s look at "controllerType=0 controllerNumber=1 scan=5490"
, the
-has 4 matching PSMs in detail.
## controllerType=0 controllerNumber=1 scan=5490
-## 1903
-id_4 <- id_filtered[id_filtered$spectrumID == "controllerType=0 controllerNumber=1 scan=5490", ] %>%
- as.data.frame()
-id_4
## sequence spectrumID chargeState
-## 1 KCNQCLKVACTLFYCK controllerType=0 controllerNumber=1 scan=5490 3
-## 2 KCNQCLKVACTLFYCK controllerType=0 controllerNumber=1 scan=5490 3
-## rank passThreshold experimentalMassToCharge calculatedMassToCharge peptideRef
-## 1 1 TRUE 698.6633 698.3315 Pep453
-## 2 1 TRUE 698.6633 698.3315 Pep453
-## modNum isDecoy post pre start end DatabaseAccess DBseqLength DatabaseSeq
-## 1 4 FALSE C K 127 142 ECA0668 302
-## 2 4 FALSE C K 127 142 ECA0668 302
-## DatabaseDescription scan.number.s. acquisitionNum
-## 1 ECA0668 hypothetical protein 5490 5490
-## 2 ECA0668 hypothetical protein 5490 5490
-## spectrumFile
-## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## idFile
-## 1 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## 2 TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid
-## MS.GF.RawScore MS.GF.DeNovoScore MS.GF.SpecEValue MS.GF.EValue MS.GF.QValue
-## 1 -22 79 4.555588e-07 1.307689 0.9006211
-## 2 -22 79 4.555588e-07 1.307689 0.9006211
-## MS.GF.PepQValue modPeptideRef modName modMass modLocation
-## 1 0.8901099 Pep453 Carbamidomethyl 57.02146 2
-## 2 0.8901099 Pep453 Carbamidomethyl 57.02146 5
-## subOriginalResidue subReplacementResidue subLocation
-## 1 <NA> <NA> NA
-## 2 <NA> <NA> NA
-## [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
-We can see that these 4 PSMs differ by the location of the -Carbamidomethyl modification.
- -## modName modLocation
-## 1 Carbamidomethyl 2
-## 2 Carbamidomethyl 5
-## 3 Carbamidomethyl 10
-## 4 Carbamidomethyl 15
-Let’s reduce that PSM table before joining it to the Spectra
object,
-to make sure we have unique one-to-one matches between the raw spectra
-and the PSMs.
## Reduced PSM with 2646 rows and 35 columns.
-## names(35): sequence spectrumID ... subReplacementResidue subLocation
-These two data can thus simply be joined using:
-sp <- joinSpectraData(sp, id_filtered,
- by.x = "spectrumId",
- by.y = "spectrumID")
-spectraVariables(sp)
## [1] "msLevel" "rtime"
-## [3] "acquisitionNum" "scanIndex"
-## [5] "dataStorage" "dataOrigin"
-## [7] "centroided" "smoothed"
-## [9] "polarity" "precScanNum"
-## [11] "precursorMz" "precursorIntensity"
-## [13] "precursorCharge" "collisionEnergy"
-## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
-## [17] "isolationWindowUpperMz" "peaksCount"
-## [19] "totIonCurrent" "basePeakMZ"
-## [21] "basePeakIntensity" "ionisationEnergy"
-## [23] "lowMZ" "highMZ"
-## [25] "mergedScan" "mergedResultScanNum"
-## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"
-## [29] "injectionTime" "filterString"
-## [31] "spectrumId" "ionMobilityDriftTime"
-## [33] "scanWindowLowerLimit" "scanWindowUpperLimit"
-## [35] "rtime_minute" "sequence"
-## [37] "chargeState" "rank"
-## [39] "passThreshold" "experimentalMassToCharge"
-## [41] "calculatedMassToCharge" "peptideRef"
-## [43] "modNum" "isDecoy"
-## [45] "post" "pre"
-## [47] "start" "end"
-## [49] "DatabaseAccess" "DBseqLength"
-## [51] "DatabaseSeq" "DatabaseDescription"
-## [53] "scan.number.s." "acquisitionNum.y"
-## [55] "spectrumFile" "idFile"
-## [57] "MS.GF.RawScore" "MS.GF.DeNovoScore"
-## [59] "MS.GF.SpecEValue" "MS.GF.EValue"
-## [61] "MS.GF.QValue" "MS.GF.PepQValue"
-## [63] "modPeptideRef" "modName"
-## [65] "modMass" "modLocation"
-## [67] "subOriginalResidue" "subReplacementResidue"
-## [69] "subLocation"
--► Question -
-Verify that the identification data has been added to the correct -spectra.
-- -
--► Solution -
- -Now that we have combined raw data and their associated -peptide-spectrum matches, we can produce an improved total ion -chromatogram, identifying MS1 scans that lead to successful -identifications.
-The countIdentifications()
function is going to tally the number of
-identifications (i.e non-missing characters in the sequence
spectra
-variable) for each scan. In the case of MS2 scans, these will be
-either 1 or 0, depending on the presence of a sequence. For MS1 scans,
-the function will count the number of sequences for the descendant MS2
-scans, i.e. those produced from precursor ions from each MS1 scan.
Below, we see on the second line that 3457 MS2 scans lead to no PSM, -while 2546 lead to an identification. Among all MS1 scans, 833 lead to -no MS2 scans with PSMs. 30 MS1 scans generated one MS2 scan that lead -to a PSM, 45 lead to two PSMs, …
- -##
-## 0 1 2 3 4 5 6 7 8 9 10
-## 1 833 30 45 97 139 132 92 42 17 3 1
-## 2 3457 2646 0 0 0 0 0 0 0 0 0
-These data can also be visualised on the total ion chromatogram:
-sp |>
-filterMsLevel(1) |>
-spectraData() |>
-as_tibble() |>
-ggplot(aes(x = rtime,
- y = totIonCurrent)) +
- geom_line(alpha = 0.25) +
- geom_point(aes(colour = ifelse(countIdentifications == 0,
- NA, countIdentifications)),
- size = 0.75,
- alpha = 0.5) +
- labs(colour = "Number of ids")
Let’s choose a MS2 spectrum with a high identification score and plot -it.
- - -We have seen above that we can add labels to each peak using the
-labels
argument in plotSpectra()
. The addFragments()
function
-takes a spectrum as input (that is a Spectra
object of length 1) and
-annotates its peaks.
## [1] NA NA NA "b1" NA NA NA NA NA NA NA NA
-## [13] NA NA NA NA NA NA NA NA NA NA NA NA
-## [25] NA NA NA NA NA NA NA NA NA NA NA NA
-## [37] NA NA NA NA NA NA NA "y1_" NA NA NA NA
-## [49] NA "y1" NA NA NA NA NA NA NA NA NA NA
-## [61] NA NA NA NA NA NA NA NA NA NA NA NA
-## [73] NA NA NA NA NA NA NA NA NA NA NA NA
-## [85] NA NA "b2" NA NA NA NA NA NA NA NA NA
-## [97] NA NA NA NA
-## [ reached getOption("max.print") -- omitted 227 entries ]
-It can be directly used with plotSpectra()
:
When a precursor peptide ion is fragmented in a CID cell, it breaks at -specific bonds, producing sets of peaks (a, b, c and x, y, -z) that can be predicted.
-The annotation of spectra is obtained by simulating fragmentation of a -peptide and matching observed peaks to fragments:
- -## [1] "THSQEEMQHMQR"
-
-## Modifications used: C=57.02146
-## mz ion type pos z seq
-## 1 102.0550 b1 b 1 1 T
-## 2 239.1139 b2 b 2 1 TH
-## 3 326.1459 b3 b 3 1 THS
-## 4 454.2045 b4 b 4 1 THSQ
-## 5 583.2471 b5 b 5 1 THSQE
-## 6 712.2897 b6 b 6 1 THSQEE
-## 7 843.3301 b7 b 7 1 THSQEEM
-## 8 971.3887 b8 b 8 1 THSQEEMQ
-## 9 1108.4476 b9 b 9 1 THSQEEMQH
-## 10 1239.4881 b10 b 10 1 THSQEEMQHM
-## 11 1367.5467 b11 b 11 1 THSQEEMQHMQ
-## 12 175.1190 y1 y 1 1 R
-## 13 303.1775 y2 y 2 1 QR
-## 14 434.2180 y3 y 3 1 MQR
-## 15 571.2769 y4 y 4 1 HMQR
-## 16 699.3355 y5 y 5 1 QHMQR
-## [ reached 'max' / getOption("max.print") -- omitted 42 rows ]
-The compareSpectra()
function can be used to compare spectra (by default,
-computing the normalised dot product).
-► Question -
-Spectra
object containing the MS2 spectra with
-sequences "SQILQQAGTSVLSQANQVPQTVLSLLR"
and
-"TKGLNVMQNLLTAHPDVQAVFAQNDEMALGALR"
.- -
--► Solution -
- --► Question -
-compareSpectra
. See the
-?Spectra
man page for details. Draw a heatmap of that matrix.- -
--► Solution -
- --► Question -
-- -
--► Solution -
- --► Question -
-Download the 3 first mzML and mzID files from the -PXD022816 -project (Morgenstern, Barzilay, and Levin 2021Morgenstern, David, Rotem Barzilay, and Yishai Levin. 2021. “RawBeans: A Simple, Vendor-Independent, Raw-Data Quality-Control Tool.” Journal of Proteome Research. https://doi.org/10.1021/acs.jproteome.0c00956.).
-- -
--► Solution -
- --► Question -
-Generate a Spectra
object and a table of filtered PSMs. Visualise
-the total ion chromatograms and check the quality of the
-identification data by comparing the density of the decoy and target
-PSMs id scores for each file.
- -
--► Solution -
- --► Question -
-Join the raw and identification data. Beware though that the joining -must now be performed by spectrum ids and by files.
-- -
--► Solution -
- --► Question -
-Extract the PSMs that have been matched to peptides from protein
-O43175
and compare and cluster the scans. Hint: once you have
-created the smaller Spectra
object with the scans of interest,
-switch to an in-memory backend to seed up the calculations.
- -
--► Solution -
- --► Question -
-Generate total ion chromatograms for each acquisition and annotate the
-MS1 scans with the number of PSMs using the countIdentifications()
-function, as shown above. The function will automatically perform the
-counts in parallel for each acquisition.
- -
--► Solution -
- -MSnID
-
-The MSnID
package extracts MS/MS ID data from mzIdentML (leveraging
-the mzID
package) or text files. After collating the search results
-from multiple datasets it assesses their identification quality and
-optimises filtering criteria to achieve the maximum number of
-identifications while not exceeding a specified false discovery
-rate. It also contains a number of utilities to explore the MS/MS
-results and assess missed and irregular enzymatic cleavages, mass
-measurement accuracy, etc.
Let’s reproduce parts of the analysis described the MSnID
-vignette. You can explore more with
The MSnID package can be used for post-search filtering
-of MS/MS identifications. One starts with the construction of an
-MSnID
object that is populated with identification results that can
-be imported from a data.frame
or from mzIdenML
files. Here, we
-will use the example identification data provided with the package.
## [1] "c_elegans.mzid.gz"
-We start by loading the package, initialising the MSnID
object, and
-add the identification result from our mzid
file (there could of
-course be more than one).
##
-## Attaching package: 'MSnID'
-## The following object is masked from 'package:ProtGenerics':
-##
-## peptides
-
-## Note, the anticipated/suggested columns in the
-## peptide-to-spectrum matching results are:
-## -----------------------------------------------
-## accession
-## calculatedMassToCharge
-## chargeState
-## experimentalMassToCharge
-## isDecoy
-## peptide
-## spectrumFile
-## spectrumID
-
-## Loaded cached data
-
-## MSnID object
-## Working directory: "."
-## #Spectrum Files: 1
-## #PSMs: 12263 at 36 % FDR
-## #peptides: 9489 at 44 % FDR
-## #accessions: 7414 at 76 % FDR
-Printing the MSnID
object returns some basic information such as
The package then enables to define, optimise and apply filtering based -for example on missed cleavages, identification scores, precursor mass -errors, etc. and assess PSM, peptide and protein FDR levels. To -properly function, it expects to have access to the following data
-## [1] "accession" "calculatedMassToCharge"
-## [3] "chargeState" "experimentalMassToCharge"
-## [5] "isDecoy" "peptide"
-## [7] "spectrumFile" "spectrumID"
-which are indeed present in our data:
- -## [1] "spectrumID" "scan number(s)"
-## [3] "acquisitionNum" "passThreshold"
-## [5] "rank" "calculatedMassToCharge"
-## [7] "experimentalMassToCharge" "chargeState"
-## [9] "MS-GF:DeNovoScore" "MS-GF:EValue"
-## [11] "MS-GF:PepQValue" "MS-GF:QValue"
-## [13] "MS-GF:RawScore" "MS-GF:SpecEValue"
-## [15] "AssumedDissociationMethod" "IsotopeError"
-## [17] "isDecoy" "post"
-## [19] "pre" "end"
-## [21] "start" "accession"
-## [23] "length" "description"
-## [25] "pepSeq" "modified"
-## [27] "modification" "idFile"
-## [29] "spectrumFile" "databaseFile"
-## [31] "peptide"
-Here, we summarise a few steps and redirect the reader to the -package’s vignette for more details:
-Cleaning irregular cleavages at the termini of the peptides and
-missing cleavage site within the peptide sequences. The following two
-function calls create the new numMisCleavages
and numIrregCleavages
-columns in the MSnID
object
Now, we can use the apply_filter
function to effectively apply
-filters. The strings passed to the function represent expressions that
-will be evaluated, thus keeping only PSMs that have 0 irregular
-cleavages and 2 or less missed cleavages.
msnid <- apply_filter(msnid, "numIrregCleavages == 0")
-msnid <- apply_filter(msnid, "numMissCleavages <= 2")
-show(msnid)
## MSnID object
-## Working directory: "."
-## #Spectrum Files: 1
-## #PSMs: 7838 at 17 % FDR
-## #peptides: 5598 at 23 % FDR
-## #accessions: 3759 at 53 % FDR
-Using "calculatedMassToCharge"
and "experimentalMassToCharge"
, the
-mass_measurement_error
function calculates the parent ion mass
-measurement error in parts per million.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
-## -2184.0640 -0.6992 0.0000 17.6146 0.7512 2012.5178
-We then filter any matches that do not fit the +/- 20 ppm tolerance
-msnid <- apply_filter(msnid, "abs(mass_measurement_error(msnid)) < 20")
-summary(mass_measurement_error(msnid))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
-## -19.7797 -0.5866 0.0000 -0.2970 0.5713 19.6758
-Filtering of the identification data will rely on
-MS2 filters are handled by a special MSnIDFilter
class objects, where
-individual filters are set by name (that is present in names(msnid)
)
-and comparison operator (>, <, = , …) defining if we should retain
-hits with higher or lower given the threshold and finally the
-threshold value itself.
filtObj <- MSnIDFilter(msnid)
-filtObj$absParentMassErrorPPM <- list(comparison="<", threshold=10.0)
-filtObj$msmsScore <- list(comparison=">", threshold=10.0)
-show(filtObj)
## MSnIDFilter object
-## (absParentMassErrorPPM < 10) & (msmsScore > 10)
-We can then evaluate the filter on the identification data object, -which returns the false discovery rate and number of retained -identifications for the filtering criteria at hand.
- -## fdr n
-## PSM 0 3807
-## peptide 0 2455
-## accession 0 1009
-Rather than setting filtering values by hand, as shown above, these -can be set automatically to meet a specific false discovery rate.
-filtObj.grid <- optimize_filter(filtObj, msnid, fdr.max=0.01,
- method="Grid", level="peptide",
- n.iter=500)
-show(filtObj.grid)
## MSnIDFilter object
-## (absParentMassErrorPPM < 3) & (msmsScore > 7.4)
-
-## fdr n
-## PSM 0.004097561 5146
-## peptide 0.006447651 3278
-## accession 0.021996616 1208
-Filters can eventually be applied (rather than just evaluated) using
-the apply_filter
function.
## MSnID object
-## Working directory: "."
-## #Spectrum Files: 1
-## #PSMs: 5146 at 0.41 % FDR
-## #peptides: 3278 at 0.64 % FDR
-## #accessions: 1208 at 2.2 % FDR
-And finally, identifications that matched decoy and contaminant -protein sequences are removed
-msnid <- apply_filter(msnid, "isDecoy == FALSE")
-msnid <- apply_filter(msnid, "!grepl('Contaminant',accession)")
-show(msnid)
## MSnID object
-## Working directory: "."
-## #Spectrum Files: 1
-## #PSMs: 5117 at 0 % FDR
-## #peptides: 3251 at 0 % FDR
-## #accessions: 1179 at 0 % FDR
-MSnID
data
-The resulting filtered identification data can be exported to a
-data.frame
(or to a dedicated MSnSet
data structure from the
-MSnbase
package) for quantitative MS data, described below, and
-further processed and analysed using appropriate statistical tests.
## spectrumID scan number(s) acquisitionNum passThreshold rank
-## 1 index=7151 8819 7151 TRUE 1
-## 2 index=8520 10419 8520 TRUE 1
-## calculatedMassToCharge experimentalMassToCharge chargeState MS-GF:DeNovoScore
-## 1 1270.318 1270.318 3 287
-## 2 1426.737 1426.739 3 270
-## MS-GF:EValue MS-GF:PepQValue MS-GF:QValue MS-GF:RawScore MS-GF:SpecEValue
-## 1 1.709082e-24 0 0 239 1.007452e-31
-## 2 3.780745e-24 0 0 230 2.217275e-31
-## AssumedDissociationMethod IsotopeError isDecoy post pre end start accession
-## 1 CID 0 FALSE A K 283 249 CE02347
-## 2 CID 0 FALSE A K 182 142 CE07055
-## length
-## 1 393
-## 2 206
-## description
-## 1 WBGene00001993; locus:hpd-1; 4-hydroxyphenylpyruvate dioxygenase; status:Confirmed; UniProt:Q22633; protein_id:CAA90315.1; T21C12.2
-## 2 WBGene00001755; locus:gst-7; glutathione S-transferase; status:Confirmed; UniProt:P91253; protein_id:AAB37846.1; F11G11.2
-## pepSeq modified modification
-## 1 AISQIQEYVDYYGGSGVQHIALNTSDIITAIEALR FALSE <NA>
-## 2 SAGSGYLVGDSLTFVDLLVAQHTADLLAANAALLDEFPQFK FALSE <NA>
-## idFile spectrumFile
-## 1 c_elegans.mzid.gz c_elegans_A_3_1_21Apr10_Draco_10-03-04_dta.txt
-## 2 c_elegans.mzid.gz c_elegans_A_3_1_21Apr10_Draco_10-03-04_dta.txt
-## databaseFile peptide
-## 1 ID_004174_E48C5B52.fasta K.AISQIQEYVDYYGGSGVQHIALNTSDIITAIEALR.A
-## 2 ID_004174_E48C5B52.fasta K.SAGSGYLVGDSLTFVDLLVAQHTADLLAANAALLDEFPQFK.A
-## numIrregCleavages numMissCleavages msmsScore absParentMassErrorPPM
-## 1 0 0 30.99678 0.3843772
-## 2 0 0 30.65418 1.3689451
-## [ reached 'max' / getOption("max.print") -- omitted 4 rows ]
-
-Page built: -2023-09-06 - using -R version 4.3.1 Patched (2023-07-10 r84676) -
-Mass spectrometry (MS) is a technology that separates charged -molecules (ions) based on their mass to charge ratio (M/Z). It is -often coupled to chromatography (liquid LC, but can also be gas-based -GC). The time an analyte takes to elute from the chromatography -column is the retention time.
-An mass spectrometer is composed of three components:
-When using mass spectrometry for proteomics, the proteins are first -digested with a protease such as trypsin. In mass shotgun proteomics, -the analytes assayed in the mass spectrometer are peptides.
-Often, ions are subjected to more than a single MS round. After a -first round of separation, the peaks in the spectra, called MS1 -spectra, represent peptides. At this stage, the only information we -possess about these peptides are their retention time and their -mass-to-charge (we can also infer their charge by inspecting their -isotopic envelope, i.e the peaks of the individual isotopes, see -below), which is not enough to infer their identify (i.e. their -sequence).
-In MSMS (or MS2), the settings of the mass spectrometer are set -automatically to select a certain number of MS1 peaks (for example -20)1. Once a narrow M/Z range has been -selected (corresponding to one high-intensity peak, a peptide, and -some background noise), it is fragmented (using for example -collision-induced dissociation (CID), higher energy collisional -dissociation (HCD) or electron-transfer dissociation (ETD)). The -fragment ions are then themselves separated in the analyser to produce -a MS2 spectrum. The unique fragment ion pattern can then be used to -infer the peptide sequence using de novo sequencing (when the spectrum -is of high enough quality) or using a search engine such as, for -example Mascot, MSGF+, …, that will match the observed, experimental -spectrum to theoretical spectra (see details below).
-The animation below show how 25 ions different ions (i.e. having -different M/Z values) are separated throughout the MS analysis and are -eventually detected (i.e. quantified). The final frame shows the -hypothetical spectrum.
-The figures below illustrate the two rounds of MS. The spectrum on the -left is an MS1 spectrum acquired after 21 minutes and 3 seconds of -elution. 10 peaks, highlited by dotted vertical lines, were selected -for MS2 analysis. The peak at M/Z 460.79 (488.8) is highlighted by a -red (orange) vertical line on the MS1 spectrum and the fragment -spectra are shown on the MS2 spectrum on the top (bottom) right -figure.
-The figures below represent the 3 dimensions of MS data: a set of -spectra (M/Z and intensity) of retention time, as well as the -interleaved nature of MS1 and MS2 (and there could be more levels) -data.
-MS-based proteomics data is disseminated through the -ProteomeXchange infrastructure, -which centrally coordinates submission, storage and dissemination -through multiple data repositories, such as the -PRoteomics IDEntifications (PRIDE) -database at the EBI for mass spectrometry-based experiments (including -quantitative data, as opposed as the name suggests), -PASSEL at the ISB for Selected -Reaction Monitoring (SRM, i.e. targeted) data and the -MassIVE -resource. These data can be downloaded within R using the -rpx package.
- -Using the unique PXD000001
identifier, we can retrieve the relevant
-metadata that will be stored in a PXDataset
object. The names of the
-files available in this data can be retrieved with the pxfiles
-accessor function.
## Loading PXD000001 from cache.
-
-## Project PXD000001 with 11 files
-##
-## Resource ID BFC225 in cache in /home/lgatto/.cache/R/rpx.
-## [1] 'F063721.dat' ... [11] 'erwinia_carotovora.fasta'
-## Use 'pxfiles(.)' to see all files.
-
-## Project PXD000001 files (11):
-## [remote] F063721.dat
-## [local] F063721.dat-mztab.txt
-## [remote] PRIDE_Exp_Complete_Ac_22134.xml.gz
-## [remote] PRIDE_Exp_mzData_Ac_22134.xml.gz
-## [remote] PXD000001_mztab.txt
-## [remote] README.txt
-## [local] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-## [remote] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzXML
-## [local] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML
-## [remote] TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.raw
-## ...
-Other metadata for the px
data set:
## [1] "Erwinia carotovora"
-
-## [1] "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2012/03/PXD000001"
-
-## [1] "Gatto L, Christoforou A; Using R and Bioconductor for proteomics data analysis., Biochim Biophys Acta, 2013 May 18, doi:10.1016/j.bbapap.2013.04.032 PMID:23692960"
-Data files can then be downloaded with the pxget
function. Below, we
-retrieve the raw data file. The file is
-downloaded2
-in the working directory and the name of the file is return by the
-function and stored in the mzf
variable for later use 3.
## Loading TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML from cache.
-
-## [1] "/home/lgatto/.cache/R/rpx/8ee512042c5ff_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
-Some data are also distributed through dedicated packages. The -msdata, for example, provides some -general raw data files relevant for both proteomics and -metabolomics.
- -## [1] "MRM-standmix-5.mzML.gz"
-## [2] "MS3TMT10_01022016_32917-33481.mzML.gz"
-## [3] "MS3TMT11.mzML"
-## [4] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML.gz"
-## [5] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML.gz"
-
-## [1] "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzid"
-
-## [1] "cptac_a_b_peptides.txt"
-More often, such experiment packages distribute processed data; an -example of such is the pRolocdata -package, that offers quantitative proteomics data.
- -Here, we will focus on data dependent acquisition (DDA), where -MS1 peaks are selected. In data independent acquisition (DIA), all peaks -in the MS1 spectrum are fragmented.↩︎
If the file is already available, it is not downloaded a second time.↩︎
This and other files are also availabel in the msdata
package, described below↩︎
Page built: -2023-09-06 - using -R version 4.3.1 Patched (2023-07-10 r84676) -
-There are a wide range of proteomics quantitation techniques that can -broadly be classified as labelled vs. label-free, depending on whether -the features are labelled prior the MS acquisition and the MS level at -which quantitation is inferred, namely MS1 or MS2.
-- | Label-free | -Labelled | -
---|---|---|
MS1 | -XIC | -SILAC, 15N | -
MS2 | -Counting | -iTRAQ, TMT | -
In spectral counting, one simply counts the number of quantified -peptides that are assigned to a protein.
-Isobaric tagging refers to the labelling using isobaric tags, -i.e. chemical tags that have the same mass and hence can’t be -distinguished by the spectrometer. The peptides of different samples (4, -6, 10, 11 or 16) are labelled with different tags and combined prior -to mass spectrometry acquisition. Given that they are isobaric, all -identical peptides, irrespective of the tag and this the sample of -origin, are co-analysed, up to fragmentation prior to MS2 -analysis. During fragmentation, the isobaric tags fall of, fragment -themselves, and result in a set of sample specific peaks. These -specific peaks can be used to infer sample-specific quantitation, -while the rest of the MS2 spectrum is used for identification.
-In label-free quantitation, the precursor peaks that match an -identified peptide are integrated over retention time and the area under -that extracted ion chromatogram is used to quantify that peptide in -that sample.
-In SILAC quantitation, sample are grown in a medium that contains -heavy amino acids (typically arginine and lysine). All proteins grown -in this heavy growth medium contain the heavy form of these amino -acids. Two samples, one grown in heavy medium, and one grown in normal -(light) medium are then combined and analysed together. The heavy -peptides precursor peaks are systematically shifted compared to the -light ones, and the ratio between the height of a heavy and light -peaks can be used to calculate peptide and protein fold-changes.
-These different quantitation techniques come with their respective -benefits and distinct challenges, such as large quantities of raw data -processing, data transformation and normalisation, missing values, and -different underlying statistical models for the quantitative data -(count data for spectral counting, continuous data for the others).
-In terms of raw data quantitation in R/Bioconductor, most efforts have -been devoted to MS2-level quantitation. Label-free XIC quantitation -has been addressed in the frame of metabolomics data processing by the -xcms infrastructure.
- - - - - - - -Mass spectrometry-based quantitative proteomics data can be
-represented as a matrix of quantitative values for features (PSMs,
-peptides, proteins) arranged along the rows, measured for a set of
-samples, arranged along the columns. There is a common representation
-for such quantitative data set, namely the SummarizedExperiment
-(Morgan et al. 2020Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2020. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.) class:
colData()
-function.rowData()
-column.rowRanges()
-metadata()
.assay()
.assays()
returns a list of matrix-like assays.While mass spectrometers acquire data for spectra/peptides, the -biological entity of interest are the protein. As part of the data -processing, we are thus required to aggregate low-level -quantitative features into higher level data.
-We are going to start to familiarise ourselves with the QFeatures
-class implemented in the
-QFeatures
-package. The class is derived from the Bioconductor
-MultiAssayExperiment
(Ramos et al. 2017Ramos, Marcel, Lucas Schiffer, Angela Re, Rimsha Azhar, Azfar Basunia, Carmen Rodriguez Cabrera, Tiffany Chan, et al. 2017. “Software for the Integration of Multi-Omics Experiments in Bioconductor.” Cancer Research 77(21); e39-42.) (MAE) class. Let’s start by loading the
-QFeatures
package.
Next, we load the feat1
test data, which is composed of single
-assay of class SummarizedExperiment
composed of 10 rows and 2
-columns.
## An instance of class QFeatures containing 1 assays:
-## [1] psms: SummarizedExperiment with 10 rows and 2 columns
-Let’s perform some simple operations to familiarise ourselves with the
-QFeatures
class:
colData()
accessor (like you
-have previously done with SummarizedExperiment
objects).## DataFrame with 2 rows and 1 column
-## Group
-## <integer>
-## S1 1
-## S2 2
-We can also further annotate the experiment by adding columns to the colData
slot:
## DataFrame with 2 rows and 3 columns
-## Group X Y
-## <integer> <character> <character>
-## S1 1 X1 Y1
-## S2 2 X2 Y2
-QFeatures
data
-using the [[
operator (as you have done to extract elements of a
-list) by using the assay’s index or name.## class: SummarizedExperiment
-## dim: 10 2
-## metadata(0):
-## assays(1): ''
-## rownames(10): PSM1 PSM2 ... PSM9 PSM10
-## rowData names(5): Sequence Protein Var location pval
-## colnames(2): S1 S2
-## colData names(0):
-
-## class: SummarizedExperiment
-## dim: 10 2
-## metadata(0):
-## assays(1): ''
-## rownames(10): PSM1 PSM2 ... PSM9 PSM10
-## rowData names(5): Sequence Protein Var location pval
-## colnames(2): S1 S2
-## colData names(0):
-psms
assay’s row data and quantitative values.## S1 S2
-## PSM1 1 11
-## PSM2 2 12
-## PSM3 3 13
-## PSM4 4 14
-## PSM5 5 15
-## PSM6 6 16
-## PSM7 7 17
-## PSM8 8 18
-## PSM9 9 19
-## PSM10 10 20
-
-## DataFrame with 10 rows and 5 columns
-## Sequence Protein Var location pval
-## <character> <character> <integer> <character> <numeric>
-## PSM1 SYGFNAAR ProtA 1 Mitochondr... 0.084
-## PSM2 SYGFNAAR ProtA 2 Mitochondr... 0.077
-## PSM3 SYGFNAAR ProtA 3 Mitochondr... 0.063
-## PSM4 ELGNDAYK ProtA 4 Mitochondr... 0.073
-## PSM5 ELGNDAYK ProtA 5 Mitochondr... 0.012
-## PSM6 ELGNDAYK ProtA 6 Mitochondr... 0.011
-## PSM7 IAEESNFPFI... ProtB 7 unknown 0.075
-## PSM8 IAEESNFPFI... ProtB 8 unknown 0.038
-## PSM9 IAEESNFPFI... ProtB 9 unknown 0.028
-## PSM10 IAEESNFPFI... ProtB 10 unknown 0.097
-The central functionality of the QFeatures
infrastructure is the
-aggregation of features into higher-level features while retaining the
-link between the different levels. This can be done with the
-aggregateFeatures()
function.
The call below will
-psms
assay of the feat1
objects;peptides
row data variables;colMeans()
function;peptides
and add it to the feat1
-object.feat1 <- aggregateFeatures(feat1, i = "psms",
- fcol = "Sequence",
- name = "peptides",
- fun = colMeans)
-feat1
## An instance of class QFeatures containing 2 assays:
-## [1] psms: SummarizedExperiment with 10 rows and 2 columns
-## [2] peptides: SummarizedExperiment with 3 rows and 2 columns
-## S1 S2
-## 2 12
-
-## S1 S2
-## 2 12
-
-## S1 S2
-## 5 15
-
-## S1 S2
-## 5 15
-
-## S1 S2
-## 8.5 18.5
-
-## S1 S2
-## 8.5 18.5
-
-## DataFrame with 3 rows and 4 columns
-## Sequence Protein location .n
-## <character> <character> <character> <integer>
-## ELGNDAYK ELGNDAYK ProtA Mitochondr... 3
-## IAEESNFPFIK IAEESNFPFI... ProtB unknown 4
-## SYGFNAAR SYGFNAAR ProtA Mitochondr... 3
-We can now aggregate the peptide-level data into a new protein-level
-assay using the colMedians()
aggregation function.
feat1 <- aggregateFeatures(feat1, i = "peptides",
- fcol = "Protein",
- name = "proteins",
- fun = colMedians)
-feat1
## An instance of class QFeatures containing 3 assays:
-## [1] psms: SummarizedExperiment with 10 rows and 2 columns
-## [2] peptides: SummarizedExperiment with 3 rows and 2 columns
-## [3] proteins: SummarizedExperiment with 2 rows and 2 columns
-
-## S1 S2
-## ProtA 3.5 13.5
-## ProtB 8.5 18.5
-The link between the assays becomes apparent when we now subset the
-assays for protein A as shown below or using the subsetByFeature()
-function. This creates a new instance of class QFeatures
containing
-assays with the expression data for protein, its peptides and their
-PSMs.
## An instance of class QFeatures containing 3 assays:
-## [1] psms: SummarizedExperiment with 6 rows and 2 columns
-## [2] peptides: SummarizedExperiment with 2 rows and 2 columns
-## [3] proteins: SummarizedExperiment with 1 rows and 2 columns
-The filterFeatures()
function can be used to filter rows the assays
-composing a QFeatures
object using the row data variables. We can
-for example retain rows that have a pval
< 0.05, which would only
-keep rows in the psms
assay because the pval
is only relevant for
-that assay.
## 'pval' found in 1 out of 3 assay(s)
-## No filter applied to the following assay(s) because one or more filtering variables are missing in the rowData: peptides, proteins.
-## You can control whether to remove or keep the features using the 'keep' argument (see '?filterFeature').
-## An instance of class QFeatures containing 3 assays:
-## [1] psms: SummarizedExperiment with 4 rows and 2 columns
-## [2] peptides: SummarizedExperiment with 0 rows and 2 columns
-## [3] proteins: SummarizedExperiment with 0 rows and 2 columns
--► Question -
-As the message above implies, it is also possible to apply a filter to
-only the assays that have a filtering variables by setting the keep
-variables.
- -
--► Solution -
- -On the other hand, if we filter assay rows for those that localise to -the mitochondrion, we retain the relevant protein, peptides and PSMs.
- -## 'location' found in 3 out of 3 assay(s)
-## An instance of class QFeatures containing 3 assays:
-## [1] psms: SummarizedExperiment with 6 rows and 2 columns
-## [2] peptides: SummarizedExperiment with 2 rows and 2 columns
-## [3] proteins: SummarizedExperiment with 1 rows and 2 columns
--► Question -
-As an exercise, let’s filter rows that do not localise to the -mitochondrion.
-- -
--► Solution -
- -You can refer to the Quantitative features for mass spectrometry
-data
-vignette and the QFeatures
manual
-page
-for more details about the class.
QFeatures
object
-While QFeatures
objects can be created manually (see ?QFeatures
-for details), most users have a quantitative data in a spreadsheet or
-a data.frame. In such cases, the easiest is to use the readQFeatures
-function to extract the quantitative data and metadata columns. Below,
-we load the hlpsms
dataframe that contains data for 28
-PSMs from the TMT-10plex hyperLOPIT spatial proteomics experiment
-from (Christoforou et al. 2016Christoforou, Andy, Claire M Mulvey, Lisa M Breckels, Aikaterini Geladaki, Tracey Hurrell, Penelope C Hayward, Thomas Naake, et al. 2016. “A Draft Map of the Mouse Pluripotent Stem Cell Spatial Proteome.” Nat Commun 7: 8992. https://doi.org/10.1038/ncomms9992.). The ecol
argument specifies that columns
-1 to 10 contain quantitation data, and that the assay should be named
-psms
in the returned QFeatures
object, to reflect the nature of
-the data.
## An instance of class QFeatures containing 1 assays:
-## [1] psms: SummarizedExperiment with 3010 rows and 10 columns
-Below, we see that we can extract an assay using its index or its
-name. The individual assays are stored as SummarizedExperiment
-object and further access its quantitative data and metadata using
-the assay
and rowData
functions.
## class: SummarizedExperiment
-## dim: 3010 10
-## metadata(0):
-## assays(1): ''
-## rownames(3010): 1 2 ... 3009 3010
-## rowData names(18): Sequence ProteinDescriptions ... RTmin markers
-## colnames(10): X126 X127C ... X130N X131
-## colData names(0):
-
-## class: SummarizedExperiment
-## dim: 3010 10
-## metadata(0):
-## assays(1): ''
-## rownames(3010): 1 2 ... 3009 3010
-## rowData names(18): Sequence ProteinDescriptions ... RTmin markers
-## colnames(10): X126 X127C ... X130N X131
-## colData names(0):
-
-## X126 X127C X127N X128C X128N X129C
-## 1 0.12283431 0.08045915 0.070804055 0.09386901 0.051815695 0.13034383
-## 2 0.35268185 0.14162381 0.167523880 0.07843497 0.071087436 0.03214548
-## 3 0.01546089 0.16142297 0.086938133 0.23120844 0.114664348 0.09610188
-## 4 0.04702854 0.09288723 0.102012167 0.11125409 0.067969116 0.14155358
-## 5 0.01044693 0.15866147 0.167315736 0.21017494 0.147946673 0.07088253
-## 6 0.04955362 0.01215244 0.002477681 0.01297833 0.002988949 0.06253195
-## X129N X130C X130N X131
-## 1 0.17540095 0.040068658 0.11478839 0.11961594
-## 2 0.06686260 0.031961793 0.02810434 0.02957384
-## 3 0.15977819 0.010127118 0.08059400 0.04370403
-## 4 0.18015910 0.035329902 0.12166589 0.10014038
-## 5 0.17555789 0.007088253 0.02884754 0.02307803
-## 6 0.01726511 0.172651119 0.37007905 0.29732174
-
-## DataFrame with 6 rows and 18 columns
-## Sequence ProteinDescriptions NbProteins ProteinGroupAccessions
-## <character> <character> <integer> <character>
-## 1 SQGEIDk Tetratrico... 1 Q8BYY4
-## 2 YEAQGDk Vacuolar p... 1 P46467
-## 3 TTScDTk C-type man... 1 Q64449
-## 4 aEELESR Liprin-alp... 1 P60469
-## Modifications qValue PEP IonScore NbMissedCleavages
-## <character> <numeric> <numeric> <integer> <integer>
-## 1 K7(TMT6ple... 0.008 0.11800 27 0
-## 2 K7(TMT6ple... 0.001 0.01070 27 0
-## 3 C4(Carbami... 0.008 0.11800 11 0
-## 4 N-Term(TMT... 0.002 0.04450 24 0
-## IsolationInterference IonInjectTimems Intensity Charge mzDa MHDa
-## <integer> <integer> <numeric> <integer> <numeric> <numeric>
-## 1 0 70 335000 2 503.274 1005.54
-## 2 0 70 926000 2 520.267 1039.53
-## 3 0 70 159000 2 521.258 1041.51
-## 4 0 70 232000 2 531.785 1062.56
-## DeltaMassPPM RTmin markers
-## <numeric> <numeric> <character>
-## 1 -0.38 24.02 unknown
-## 2 0.61 18.85 unknown
-## 3 1.11 10.17 unknown
-## 4 0.35 29.18 unknown
-## [ reached getOption("max.print") -- omitted 2 rows ]
-For further details on how to manipulate such objects, refer to the -MultiAssayExperiment (Ramos et al. 2017Ramos, Marcel, Lucas Schiffer, Angela Re, Rimsha Azhar, Azfar Basunia, Carmen Rodriguez Cabrera, Tiffany Chan, et al. 2017. “Software for the Integration of Multi-Omics Experiments in Bioconductor.” Cancer Research 77(21); e39-42.) and -SummarizedExperiment (Morgan et al. 2020Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2020. SummarizedExperiment: SummarizedExperiment Container. https://bioconductor.org/packages/SummarizedExperiment.) packages.
-It is also possible to first create a SummarizedExperiment
, and then
-only include it into a QFeatures
object.
## class: SummarizedExperiment
-## dim: 3010 10
-## metadata(0):
-## assays(1): ''
-## rownames(3010): 1 2 ... 3009 3010
-## rowData names(18): Sequence ProteinDescriptions ... RTmin markers
-## colnames(10): X126 X127C ... X130N X131
-## colData names(0):
-
-## An instance of class QFeatures containing 1 assays:
-## [1] psm: SummarizedExperiment with 3010 rows and 10 columns
-At this stage, i.e. at the beginning of the analysis, whether you have
-a SummarizedExperiment
or a QFeatures
object, it is a good time to
-define the experimental design in the colData
slot.
The CPTAC spike-in study 6 (Paulovich et al. 2010Paulovich, Amanda G, Dean Billheimer, Amy-Joan L Ham, Lorenzo Vega-Montoto, Paul A Rudnick, David L Tabb, Pei Wang, et al. 2010. “Interlaboratory Study Characterizing a Yeast Performance Standard for Benchmarking LC-MS Platform Performance.” Mol. Cell. Proteomics 9 (2): 242–54.) combines the Sigma UPS1 -standard containing 48 different human proteins that are spiked in at -5 different concentrations (conditions A to E) into a constant yeast -protein background. The sample were acquired in triplicate on -different instruments in different labs. We are going to start with a -subset of the CPTAC study 6 containing conditions A and B for a single -lab.
-The peptide-level data, as processed by MaxQuant (Cox and Mann 2008Cox, J, and M Mann. 2008. “MaxQuant Enables High Peptide Identification Rates, Individualized p.p.b.-Range Mass Accuracies and Proteome-Wide Protein Quantification.” Nat Biotechnol 26 (12): 1367–72. https://doi.org/10.1038/nbt.1511.) is
-available in the msdata
package:
## [1] "cptac_a_b_peptides.txt"
--► Question -
-Read these data in as either a SummarizedExperiment
or a QFeatures
-object.
- -
--► Solution -
- --► Question -
-Before proceeding, we are going to clean up the sample names by
-removing the unnecessary Intensity prefix and annotate the
-experiment in the object’s colData
.
- -
--► Solution -
- --► Question -
-There are many row variables that aren’t useful here. Get rid or all
-of them but Sequence
, Proteins
, Leading.razor.protein
, PEP
,
-Score
, Reverse
, and Potential.contaminant
.
- -
--► Solution -
- -A typical quantitative proteomics data processing is composed of the -following steps, which we are going to apply to the cptac data created -above.
-Missing values can be highly frequent in proteomics. There are two -reasons supporting the existence of missing values, namely biological -or technical.
-Values that are missing due to the absence (or extremely low -concentration) of a protein are observed for biological reasons, -and their pattern aren’t random (MNAR). A protein missing -due to the suppression of its expression will not be missing at -random: it will be missing in the condition in which it was -suppressed, and be present in the condition where it is expressed.
Due to its data-dependent acquisition, mass spectrometry isn’t -capable of assaying all peptides in a sample. Peptides that are -less abundant than some of their co-eluting ions, peptides that do -not ionise well or peptides that do not get identified might be -sporadically missing in the final quantitation table, despite their -presence in the biological samples. Their absence patterns are -(completely) random (MAR or MCAR) in such cases.
Often, third party software that produce quantitative data use zeros
-instead of properly reporting missing values. We can use the
-zeroIsNA()
function to replace the 0
by NA
values in our
-cptac_se
object and then explore the missing data patterns across
-columns and rows.
## $nNA
-## DataFrame with 1 row and 2 columns
-## nNA pNA
-## <integer> <numeric>
-## 1 31130 0.452497
-##
-## $nNArows
-## DataFrame with 11466 rows and 3 columns
-## name nNA pNA
-## <character> <integer> <numeric>
-## 1 AAAAGAGGAG... 4 0.666667
-## 2 AAAALAGGK 0 0.000000
-## 3 AAAALAGGKK 0 0.000000
-## 4 AAADALSDLE... 0 0.000000
-## 5 AAADALSDLE... 0 0.000000
-## ... ... ... ...
-## 11462 YYSIYDLGNN... 6 1.000000
-## 11463 YYTFNGPNYN... 3 0.500000
-## 11464 YYTITEVATR 4 0.666667
-## 11465 YYTVFDRDNN... 6 1.000000
-## 11466 YYTVFDRDNN... 6 1.000000
-##
-## $nNAcols
-## DataFrame with 6 rows and 3 columns
-## name nNA pNA
-## <character> <integer> <numeric>
-## 1 6A_7 4743 0.413658
-## 2 6A_8 5483 0.478196
-## 3 6A_9 5320 0.463980
-## 4 6B_7 4721 0.411739
-## 5 6B_8 5563 0.485174
-## 6 6B_9 5300 0.462236
-Let’s now explore these missing values:
-cptac_se
data.##
-## 0 1 2 3 4 5 6
-## 4059 990 884 717 934 807 3075
-filterNA()
function.Imputation is the technique of replacing missing data with probable
-values. This can be done with impute()
method. As we have discussed
-above, there are however two types of missing values in mass
-spectrometry-based proteomics, namely data missing at random (MAR),
-and data missing not at random (MNAR). These two types of missing
-data, those missing at random, and those missing not at random, need
-to be imputed with different types of imputation
-methods
-(Lazar et al. 2016Lazar, C, L Gatto, M Ferro, C Bruley, and T Burger. 2016. “Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.” J Proteome Res 15 (4): 1116–25. https://doi.org/10.1021/acs.jproteome.5b00981.).
Generally, it is recommended to use hot deck methods (nearest -neighbour (left), maximum likelihood, …) when data are missing -at random.Conversely, MNAR features should ideally be imputed with a -left-censor (minimum value (right), but not zero, …) method.
-There are various methods to perform data imputation, as described in
-?impute
. The imp4p package contains additional
-functionality, including some to estimate the randomness of missing
-data.
The general syntax for imputation is shown below, using the se_na2
-object as an example:
## Imputing along margin 1 (features/rows).
-## Warning in knnimp(x, k, maxmiss = rowmax, maxp = maxp): 12 rows with more than 50 % entries missing;
-## mean imputation used for these rows
-## class: SummarizedExperiment
-## dim: 689 16
-## metadata(3): MSnbaseFiles MSnbaseProcessing MSnbaseVersion
-## assays(1): ''
-## rownames(689): AT1G09210 AT1G21750 ... AT4G11150 AT4G39080
-## rowData names(2): nNA randna
-## colnames(16): M1F1A M1F4A ... M2F8B M2F11B
-## colData names(1): nNA
--► Question -
-Following the example above, apply a mixed imputation, using knn for
-data missing at random and the zero imputation for data missing not at
-random. Hint: the randna
variable defines which features are assumed
-to be missing at random.
- -
--► Solution -
- --► Question -
-When assessing missing data imputation methods, such as in Lazar et
-al. (2016),
-one often replaces values with missing data, imputes these with a
-method of choice, then quantifies the difference between original
-(expected) and observed (imputed) values. Here, using the se_na2
-data, use this strategy to assess the difference between knn and
-Bayesian PCA imputation.
- -
--► Solution -
- --► Question -
-When assessing the impact of missing value imputation on real data,
-one can’t use the strategy above. Another useful approach is to assess
-the impact of the imputation method on the distribution of the
-quantitative data. For instance, here is the intensity distribution of
-the se_na2
data. Verify the effect of applying knn
, zero
,
-MinDet
and bpca
on this distribution.
- -
--► Solution -
- -Tip: When downstream analyses permit, it might be safer not to -impute data and deal explicitly with missing values. Indeed missing -data imputation is not straightforward, and is likely to dramatically -fail when a high proportion of data is missing (10s of %). It is -possible to keep NAs when performing hypothesis tests7, but (generally) not to perform a principal component -analysis.
-As discussed in the previous chapter, PSMs are deemed relevant after
-comparison against hits from a decoy database. The origin of these
-hits is recorded with +
in the Reverse
variable:
##
-## +
-## 7572 12
-Similarly, a proteomics experiment is also searched against a database -of contaminants:
- -##
-## +
-## 7558 26
-Let’s visualise some of the cptac’s metadata using standard ggplot2
-code:
-► Question -
-Visualise the identification score and the posterior probability -probability (PEP) distributions from forward and reverse hits and -interpret the figure.
-- -
--► Solution -
- -Note: it is also possible to compute and visualise protein groups
-as connected components starting from a quantitative dataset such as a
-SummarizedExperiment
. See the Using quantitative
-data
-section in the Understanding protein groups with adjacency matrices
-vignette.
We can now create our QFeatures
object using the
-SummarizedExperiment
as shown below.
## An instance of class QFeatures containing 1 assays:
-## [1] peptides: SummarizedExperiment with 7584 rows and 6 columns
-We should also assign the QFeatures
column data with the
-SummarizedExperiment
slot.
Note that it is also possible to directly create a QFeatures
object
-with the readQFeatures()
function and the same arguments as the
-readSummarizedExperiment()
used above. In addition, most functions
-used above and below work on single SummarizedExperiment
objects or
-assays within a QFeatures
object.
-► Question -
-Using the filterFeatures()
function, filter out the reverse and
-contaminant hits, and also retain those that have a posterior error
-probability smaller than 0.05.
- -
--► Solution -
- -The two code chunks below log-transform and normalise using the assay
-i
as input and adding a new one names as defined by name
.
-► Question -
-Use the normalize()
method to normalise the data. The syntax is the
-same as logTransform()
. Use the center.median
method.
- -
--► Solution -
- --► Question -
-Visualise the result of the transformations above. The
-plotDensities()
function from the limma
package is very
-convenient, but feel free to use boxplots, violin plots, or any other
-visualisation that you deem useful to assess the tranformations.
- -
--► Solution -
- --► Question -
-Use median aggregation to aggregation peptides into protein -values. This is not necessarily the best choice, as we will see -later, but a good start.
-- -
--► Solution -
- -Looking at the .n
row variable computed during the aggregation, we
-see that most proteins result from the aggregation of 5 peptides or
-less, while very few proteins are accounted for by tens of peptides.
##
-## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
-## 327 234 167 132 84 73 62 49 49 29 29 24 20 13 15 12 4 6 11 5
-## 21 22 23 24 25 26 28 29 30 31 32 34 37 38 39 42 51 52 62
-## 7 4 7 2 2 3 1 3 1 2 2 1 1 1 1 2 1 1 1
-library("factoextra")
-
-pca_pep <-
- cptac[["lognorm_peptides"]] %>%
- filterNA() %>%
- assay() %>%
- t() %>%
- prcomp(scale = TRUE, center = TRUE) %>%
- fviz_pca_ind(habillage = cptac$condition, title = "Peptides")
-
-pca_prot <-
- cptac[["proteins_med"]] %>%
- filterNA() %>%
- assay() %>%
- t() %>%
- prcomp() %>%
- fviz_pca_ind(habillage = cptac$condition,
- title = "Proteins (median aggregation)")
Below, we use the longFormat()
function to extract the quantitative
-and row data in a long format, that can be directly reused by the
-tidyverse tools.
longFormat(cptac["P02787ups|TRFE_HUMAN_UPS", ,
- c("lognorm_peptides", "proteins_med")]) %>%
- as_tibble() %>%
- mutate(condition = ifelse(grepl("A", colname), "A", "B")) %>%
- ggplot(aes(x = colname, y = value, colour = rowname, shape = condition)) +
- geom_point(size = 3) +
- geom_line(aes(group = rowname)) +
- facet_grid(~ assay) +
- ggtitle("P02787ups|TRFE_HUMAN_UPS")
We can also visualise the assays withing a QFeatures
object and
-their relation.
-► Question -
-The example above shows a simple linear relationship between
-assays. Create a more interesting one by applying a different
-normalisation method on the log_peptides assay and aggreate that new
-normalised peptide assay. Visualise the relationship with plot()
, as
-above.
- -
--► Solution -
- -R in general and Bioconductor in particular are well suited for the -statistical analysis of quantitative proteomics data. Several -packages provide dedicated resources for proteomics data:
-MSstats and MSstatsTMT: A set of tools -for statistical relative protein significance analysis in Data -dependent (DDA), SRM, Data independent acquisition (DIA) and TMT -experiments.
msmsTests: Statistical tests for label-free LC-MS/MS
-data by spectral counts, to discover differentially expressed
-proteins between two biological conditions. Three tests are
-available: Poisson GLM regression, quasi-likelihood GLM regression,
-and the negative binomial of the edgeR
-package. All can be readily applied on MSnSet
instances produced,
-for example by MSnID
.
DEP provides an integrated analysis workflow for the -analysis of mass spectrometry proteomics data for differential -protein expression or differential enrichment.
MSqRob: The MSqRob
package
-allows a user to do quantitative protein-level statistical inference
-on LC-MS proteomics data. More specifically, our package makes use
-of peptide-level input data, thus correcting for unbalancedness and
-peptide-specific biases. As previously shown (Goeminne et
-al. (2015)), this
-approach is both more sensitive and specific than summarizing
-peptide-level input to protein-level values. Model estimates are
-stabilized by ridge regression, empirical Bayes variance estimation
-and downweighing of outliers. Currently, only label-free proteomics
-data types are
-supported. msqrob2
is now
-available and makes use of the QFeatures
infrastructure.
proDA accounts for missing -values in label-free mass spectrometry data without imputation. The -package implements a probabilistic dropout model that ensures that -the information from observed and missing values are properly -combined. It adds empirical Bayesian priors to increase power to -detect differentially abundant proteins.
Others, while not specfic to proteomics, are also recommended, such as
-the limma package. When analysing spectral counting
-data, methods for high throughput sequencing data are
-applicable. Below, we illustrate how to apply a typical edgeR
test
-to count data using the msms.edgeR
function from the msmsTests
-package.
Below, we are going to perform our statistical analysis on the protein
-data using limma
.
## Warning: 'experiments' dropped; see 'metadata'
-## Warning: Ignoring redundant column names in 'colData(x)':
-The limma package is the precursor package -that enables the consistent application of linear models to normalliy -distributed omics data in general, and microarrays in -particular.
-The limma
package implements an empirical Bayes method that
-borrows information across features to estimate the standard error and
-calculate (so called moderated) t statistics. This approach is
-demonstrably more powerful that a standard t-tests when the number of
-samples is low.
The code chunk below illustrates how to set up the model, fit it, and -apply the empirical Bayes moderation.
- -## Warning: Partial NA coefficients for 25 probe(s)
-
-Finally, the topTable()
function is used the extract the results for
-the coefficient of interest.
res <-
- topTable(fit, coef = "prots$condition6B", number = Inf) %>%
- rownames_to_column("protein") %>%
- as_tibble() %>%
- mutate(TP = grepl("ups", protein))
Note the warning about partial NA
coefficients for 23 probes:
## 6A_7 6A_8 6A_9 6B_7 6B_8
-## P00167ups|CYB5_HUMAN_UPS NaN NaN NaN -0.7840558 -2.0282987
-## P01112ups|RASH_HUMAN_UPS NaN NaN NaN -1.5564896 NaN
-## P05413ups|FABPH_HUMAN_UPS NaN NaN NaN -3.3419480 NaN
-## P08758ups|ANXA5_HUMAN_UPS NaN NaN NaN -2.7973872 -2.0137585
-## sp|P06704|CDC31_YEAST NaN NaN NaN -1.2032046 -2.1252371
-## sp|P25574|EMC1_YEAST -1.506177 -1.983737 -0.7795009 NaN NaN
-## sp|P32608|RTG2_YEAST NaN NaN NaN NaN -4.4424189
-## sp|P32769|HBS1_YEAST NaN -1.384031 -0.7285780 NaN NaN
-## sp|P34217|PIN4_YEAST NaN NaN NaN -0.8378614 -0.1316397
-## sp|P34237|CASP_YEAST NaN NaN NaN -1.5645172 -1.6600291
-## sp|P38166|SFT2_YEAST -1.585685 -1.076707 NaN NaN NaN
-## sp|P40056|GET2_YEAST NaN -1.091696 -1.4014211 NaN NaN
-## sp|P40533|TED1_YEAST NaN NaN NaN -2.0491876 NaN
-## sp|P43582|WWM1_YEAST NaN NaN NaN -0.5538711 -0.7360990
-## sp|P46965|SPC1_YEAST NaN -3.428771 -3.6321984 NaN NaN
-## sp|P48363|PFD3_YEAST NaN NaN NaN -0.1904905 NaN
-## 6B_9
-## P00167ups|CYB5_HUMAN_UPS -1.1230809
-## P01112ups|RASH_HUMAN_UPS -1.5618192
-## P05413ups|FABPH_HUMAN_UPS -3.8907081
-## P08758ups|ANXA5_HUMAN_UPS -2.0894752
-## sp|P06704|CDC31_YEAST -1.5844104
-## sp|P25574|EMC1_YEAST NaN
-## sp|P32608|RTG2_YEAST -2.7873186
-## sp|P32769|HBS1_YEAST NaN
-## sp|P34217|PIN4_YEAST -0.1989392
-## sp|P34237|CASP_YEAST -1.6877463
-## sp|P38166|SFT2_YEAST NaN
-## sp|P40056|GET2_YEAST NaN
-## sp|P40533|TED1_YEAST -1.7474812
-## sp|P43582|WWM1_YEAST -0.7207043
-## sp|P46965|SPC1_YEAST NaN
-## sp|P48363|PFD3_YEAST -0.5087747
-## [ reached getOption("max.print") -- omitted 9 rows ]
-We can now visualise the results using a volcano plot:
-res %>%
- ggplot(aes(x = logFC, y = -log10(adj.P.Val))) +
- geom_point(aes(colour = TP)) +
- geom_vline(xintercept = c(-1, 1)) +
- geom_hline(yintercept = -log10(0.05)) +
- scale_color_manual(values = c("black","red"))
## Warning: Removed 25 rows containing missing values (`geom_point()`).
-Using the pipeline described above, we would would identify a single -differentially expressed protein at an 5 percent FDR but miss out the -other 36 expected spike-in proteins.
-We can assess our results in terms of true/false postitves/negatives:
-As shown below, it is possible to substantially improve these results
-by aggregating features using a robust summarisation (available as
-MsCoreUtils::robustSummary()
), i.e robust regression with
-M-estimation using Huber weights, as described in section 2.7 in
-(Sticker et al. 2019Sticker, Adriaan, Ludger Goeminne, Lennart Martens, and Lieven Clement. 2019. “Robust Summarization and Inference in Proteome-Wide Label-Free Quantification.” bioRxiv. https://doi.org/10.1101/668863.).
Repeat and adapt what we have seen here using, for example, the
-robustSummary()
function.
Still, it is -recommended to explore missingness as part of the exploratory data -analysis.↩︎
Page built: -2023-09-06 - using -R version 4.3.1 Patched (2023-07-10 r84676) -
-In this section, we will learn how to read raw data in one of the
-commonly used open formats (mzML
, mzXML
, netCDF
or mgf
) into
-R.
When we manipulate complex data, we need a way to abstract it.
-The abstraction saves us from having to know about all -the details of that data and its associated metadata. In R, we -think of MS data as illustrated on the figure below (taken from -(Gatto, Gibb, and Rainer 2020Gatto, Laurent, Sebastian Gibb, and Johannes Rainer. 2020. “MSnbase, Efficient and Elegant r-Based Processing and Visualisation of Raw Mass Spectrometry Data.” J. Proteome Res., September.)): a metadata table and a set of raw spectra. This allows -to rely on a few easy-to-remember conventions to make mundane and -repetitive tasks trivial and be able to complete more complex things -easily. Abstractions provide a smoother approach to handle complex -data using common patterns.
-Spectra
class
-We are going to use the
-Spectra
package
-as an abstraction to raw mass spectrometry data.
Spectra
is part of the R for Mass Spectrometry
-initiative. It
-defines the Spectra
class that is used as a raw data abstraction, to
-manipulate MS data and metadata. The best way to learn about a data
-structure is to create one by hand.
Let’s create a DataFrame
4 containing MS levels, retention time, m/z and intensities
-for 2 spectra:
spd <- DataFrame(msLevel = c(1L, 2L), rtime = c(1.1, 1.2))
-spd$mz <- list(c(100, 103.2, 104.3, 106.5), c(45.6, 120.4, 190.2))
-spd$intensity <- list(c(200, 400, 34.2, 17), c(12.3, 15.2, 6.8))
-spd
## DataFrame with 2 rows and 4 columns
-## msLevel rtime mz intensity
-## <integer> <numeric> <list> <list>
-## 1 1 1.1 100.0,103.2,104.3,... 200.0,400.0, 34.2,...
-## 2 2 1.2 45.6,120.4,190.2 12.3,15.2, 6.8
-And now convert this DataFrame
into a Spectra
object:
## MSn data (Spectra) with 2 spectra in a MsBackendMemory backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 1.1 NA
-## 2 2 1.2 NA
-## ... 16 more variables/columns.
-Explore the newly created object using
-spectraVariables
to extract all the metadata variables. Compare these to the
-spectra variables available from the previous example.spectraData
to extract all the metadata.peaksData
to extract a list containing the raw data.[
to create subsets.Spectra
from mzML files
-Let’s now create a new object using the mzML data previously
-downloaded and available in the mzf
file.
## [1] "/home/lgatto/.cache/R/rpx/8ee512042c5ff_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML"
-
-## MSn data (Spectra) with 7534 spectra in a MsBackendMzR backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.4584 1
-## 2 1 0.9725 2
-## 3 1 1.8524 3
-## 4 1 2.7424 4
-## 5 1 3.6124 5
-## ... ... ... ...
-## 7530 2 3600.47 7530
-## 7531 2 3600.83 7531
-## 7532 2 3601.18 7532
-## 7533 2 3601.57 7533
-## 7534 2 3601.98 7534
-## ... 33 more variables/columns.
-##
-## file(s):
-## 8ee512042c5ff_TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01-20141210.mzML
-length()
.Mass spectrometry data in Spectra
objects can be thought of as a
-list of individual spectra, with each spectrum having a set of
-variables associated with it. Besides core spectra variables (such
-as MS level or retention time) an arbitrary number of optional
-variables can be assigned to a spectrum. The core spectra variables
-all have their own accessor method and it is guaranteed that a value
-is returned by it (or NA
if the information is not available). The
-core variables and their data type are (alphabetically ordered):
integer(1)
: the index of acquisition of a
-spectrum during a MS run.logical(1)
: whether the spectrum is in profile or
-centroid mode.numeric(1)
: collision energy used to create an
-MSn spectrum.character(1)
: the origin of the spectrum’s data,
-e.g. the mzML file from which it was read.character(1)
: the (current) storage location of the
-spectrum data. This value depends on the backend used to handle and
-provide the data. For an in-memory backend like the
-MsBackendMemory
this will be "<memory>"
, for an on-disk
-backend such as the MsBackendHdf5Peaks
it will be the name of the
-HDF5 file where the spectrum’s peak data is stored.numeric
: intensity values for the spectrum’s peaks.numeric(1)
: lower m/z for the isolation
-window in which the (MSn) spectrum was measured.numeric(1)
: the target m/z for the
-isolation window in which the (MSn) spectrum was measured.numeric(1)
: upper m/z for the isolation
-window in which the (MSn) spectrum was measured.integer(1)
: the MS level of the spectrum.numeric
: the m/z values for the spectrum’s peaks.integer(1)
: the polarity of the spectrum (0
and 1
-representing negative and positive polarity, respectively).integer(1)
: the scan (acquisition) number of the
-precursor for an MSn spectrum.integer(1)
: the charge of the precursor of an
-MSn spectrum.numeric(1)
: the intensity of the precursor of
-an MSn spectrum.numeric(1)
: the m/z of the precursor of an MSn
-spectrum.numeric(1)
: the retention time of a spectrum.integer(1)
: the index of a spectrum within a (raw)
-file.logical(1)
: whether the spectrum was smoothed.For details on the individual variables and their getter/setter
-function see the help for Spectra
(?Spectra
). Also note that these
-variables are suggested, but not required to characterize a
-spectrum. Also, some only make sense for MSn, but not for MS1 spectra.
In addition to the core spectra variables it is also possible to add additional
-spectra variables to a Spectra
object. As an example we add below a spectra
-variable representing the retention times in minutes to the object. This
-information can then be extracted again using the $
notation (similar to
-accessing a column in a data.frame
, i.e., $
and the name of the spectra
-variable).
## [1] 0.00764000 0.01620833 0.03087333 0.04570667 0.06020667 0.07487500
-msLevel(.)
) or using the $
notation (for example .$msLevel
).plotSpectra()
-function to visualise peaks and set the m/z range with the xlim
-arguments.Using the first raw data file starting with MS3TMT10
, answer the
-following questions:
These objects and their manipulations are not limited to single files or
-samples. Below we load data from two mzML files. The MS data from both files in
-the Spectra
is organized linearly (first all spectra from the first file
-and then from the second). The dataOrigin
function can be used to identify
-spectra from the different data files.
## [1] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/sciex/20171016_POOL_POS_1_105-134.mzML"
-## [2] "/home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/sciex/20171016_POOL_POS_3_105-134.mzML"
-
-##
-## /home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/sciex/20171016_POOL_POS_1_105-134.mzML
-## 931
-## /home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/sciex/20171016_POOL_POS_3_105-134.mzML
-## 931
-Backends allow to use different backends to store mass spectrometry data while
-providing via the Spectra
class a unified interface to use that data. With
-the setBackend
function it is possible to change between different backends
-and hence different data representations. The Spectra
package defines a set of
-example backends but any object extending the base MsBackend
class could be
-used instead. The default backends are:
MsBackendMzR
: this backend keeps only general spectra variables in memory
-and relies on the mzR package to read mass peaks (m/z and
-intensity values) from the original MS files on-demand.## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.280 1
-## 2 1 0.559 2
-## 3 1 0.838 3
-## 4 1 1.117 4
-## 5 1 1.396 5
-## ... ... ... ...
-## 1858 1 258.636 927
-## 1859 1 258.915 928
-## 1860 1 259.194 929
-## 1861 1 259.473 930
-## 1862 1 259.752 931
-## ... 33 more variables/columns.
-##
-## file(s):
-## 20171016_POOL_POS_1_105-134.mzML
-## 20171016_POOL_POS_3_105-134.mzML
-MsBackendMemory
and MsBackendDataFrame
: the full mass spectrometry data is
-stored (in-memory) within the object. Keeping the data in memory guarantees
-high performance but has also, depending on the number of mass peaks in each
-spectrum, a much higher memory footprint.## MSn data (Spectra) with 1862 spectra in a MsBackendMemory backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.280 1
-## 2 1 0.559 2
-## 3 1 0.838 3
-## 4 1 1.117 4
-## 5 1 1.396 5
-## ... ... ... ...
-## 1858 1 258.636 927
-## 1859 1 258.915 928
-## 1860 1 259.194 929
-## 1861 1 259.473 930
-## 1862 1 259.752 931
-## ... 33 more variables/columns.
-## Processing:
-## Switch backend from MsBackendMzR to MsBackendMemory [Wed Sep 6 11:50:11 2023]
-MsBackendHdf5Peaks
: similar to MsBackendMzR
this backend reads peak data
-only on-demand from disk while all other spectra variables are kept in
-memory. The peak data are stored in Hdf5 files which guarantees scalability.With the example below we load the data from a single mzML file and use a
-MsBackendHdf5Peaks
backend for data storage. The hdf5path
parameter allows
-us to specify the storage location of the HDF5 file.
## MSn data (Spectra) with 1862 spectra in a MsBackendHdf5Peaks backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.280 1
-## 2 1 0.559 2
-## 3 1 0.838 3
-## 4 1 1.117 4
-## 5 1 1.396 5
-## ... ... ... ...
-## 1858 1 258.636 927
-## 1859 1 258.915 928
-## 1860 1 259.194 929
-## 1861 1 259.473 930
-## 1862 1 259.752 931
-## ... 33 more variables/columns.
-##
-## file(s):
-## 20171016_POOL_POS_1_105-134.h5
-## 20171016_POOL_POS_3_105-134.h5
-## Processing:
-## Switch backend from MsBackendMzR to MsBackendHdf5Peaks [Wed Sep 6 11:50:18 2023]
-
-##
-## /home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/sciex/20171016_POOL_POS_1_105-134.mzML
-## 931
-## /home/lgatto/disk/R/x86_64-pc-linux-gnu-library/4.3/msdata/sciex/20171016_POOL_POS_3_105-134.mzML
-## 931
-
-##
-## /tmp/RtmphN3t3B/20171016_POOL_POS_1_105-134.h5
-## 931
-## /tmp/RtmphN3t3B/20171016_POOL_POS_3_105-134.h5
-## 931
-All of the above mentioned backends support changing all of their their spectra
-variables, except the MsBackendMzR
that does not support changing m/z or
-intensity values for the mass peaks.
Next to these default backends there are a set of other backend implementations
-provided by additional R packages. The
-MsBackendSql
for
-example allows to store (and retrieve) all MS data in (from) an SQL database
-guaranteeing thus a minimal memory footprint.
Other backends focus on specific file formats such as
-MsBackendMgf
for files
-in mgf
file format or on specific acquisitions such as
-MsBackendTimsTof
-or provide access to certain MS data resources such as the
-MsBackendMassbank
.
-Additional backends are being developed to address specific needs or
-technologies, while remaining compliant with the Spectra
interface.
If you would like to learn more about how the raw MS formats are
-handled by Spectra
via the mzR package,
-check out the 6.1 section in the annex.
See also Spectra -backends -for more information on different backends, their properties and -advantages/disadvantages.
-The importance of flexible access to specialised data becomes visible
-in the figure below (taken from the RforProteomics
visualisation
-vignette).
-Not only can we access specific data and understand/visualise them,
-but we can transverse all the data and extract/visualise/understand
-structured slices of data.
The figure below shows an illustration of how mass spectrometry -works:
-The chromatogram at the top displays the total ion current along the -retention time. The vertical line identifies one scan in particular -at retention time 1800.68 seconds (the 2807th scan).
The spectra on the second line represent the full MS1 spectrum -marked by the red line. The vertical lines identify the 10 -precursor ions that where selected for MS2 analysis. The zoomed in -on the right shows one specific precursor peak.
The MS2 spectra displayed along the two rows at the bottom are -those resulting from the fragmentation of the 10 precursor peaks -identified by the vertical bars above.
We are going to reproduce the figure above through a set of exercices.
--► Question -
-totIonCurrent
-and rtime
variables for all MS1 spectra. Annotate the spectrum of
-interest.- -
--► Solution -
- --► Question -
-filterPrecursorScan()
function can be used to retain a set
-parent (MS1) and children scans (MS2), as defined by an acquisition
-number. Use it to extract the MS1 scan of interest and all its MS2
-children.- -
--► Solution -
- --► Question -
-- -
--► Solution -
- --► Question -
-- -
--► Solution -
- --► Question -
-plotSpectra()
function is used to plot all 10 MS2 spectra in
-one call.- -
--► Solution -
- -It is possible to label the peaks with the plotSpectra()
-function. The labels
argument is either a character
of appropriate
-length (i.e. with a label for each peak) or, as illustrated below, a
-function that computes the labels.
mzLabel <- function(z) {
- z <- peaksData(z)[[1L]]
- lbls <- format(z[, "mz"], digits = 4)
- lbls[z[, "intensity"] < 1e5] <- ""
- lbls
-}
-
-plotSpectra(ms_2[7],
- xlim = c(126, 132),
- labels = mzLabel,
- labelSrt = -30, labelPos = 2,
- labelOffset = 0.1)
Spectra can also be compared either by overlay or mirror plotting
-using the plotSpectraOverlay()
and plotSpectraMirror()
functions.
-► Question -
-Filter MS2 level spectra and find any 2 MS2 spectra that have matching -precursor peaks based on the precursor m/z values.
-- -
--► Solution -
- --► Question -
-Visualise the matching pair using the plotSpectraOverlay()
and
-plotSpectraMirror()
functions.
- -
--► Solution -
- -It is also possible to explore raw data interactively with the
-SpectraVis
-package:
The
-browseSpectra()
-function opens a simple shiny application that allows to browse
-through the individual scans of a Spectra object.
The
-plotlySpectra()
-function displays a single spectrum using
-plotly
allowing to explore (zooming,
-panning) the spectrum interactively.
-► Question -
-Test the SpectraVis
function on some the Spectra
objects produce
-above.
- -
-Apart from classical subsetting operations such as [
and split
, a set of
-filter functions are defined for Spectra
objects that filter/reduce the number
-of spectra within the object (for detailed help please see the ?Spectra
help):
filterAcquisitionNum
: retains spectra with certain acquisition numbers.filterDataOrigin
: subsets to spectra from specific origins.filterDataStorage
: subsets to spectra from certain data storage files.filterEmptySpectra
: removes spectra without mass peaks.filterMzRange
: subsets spectra keeping only peaks with an m/z within the
-provided m/z range.filterIsolationWindow
: keeps spectra with the provided mz
in their
-isolation window (m/z range).filterMsLevel
: filters by MS level.filterPolarity
: filters by polarity.filterPrecursorIsotopes
: identifies precursor ions (from fragment spectra)
-that could represent isotopes of the same molecule. For each of these spectra
-groups only the spectrum of the monoisotopic precursor ion is returned. MS1
-spectra are returned without filtering.filterPrecursorMaxIntensity
: filters spectra keeping, for groups of spectra
-with similar precursor m/z, the one spectrum with the highest precursor
-intensity. All MS1 spectra are returned without filtering.filterPrecursorMzRange
: retains (MSn) spectra with a precursor m/z within
-the provided m/z range.filterPrecursorMzValues
: retains (MSn) spectra with precursor m/z value
-matching the provided value(s) considering also a tolerance
and ppm
.filterPrecursorCharge
: retains (MSn) spectra with speified
-precursor charge(s).filterPrecursorScan
: retains (parent and children) scans of an acquisition
-number.filterRt
: filters based on retention time range.In addition to these, there is also a set of filter functions that operate on
-the peak data, filtering and modifying the number of peaks of each spectrum
-within a Spectra
:
combinePeaks
: groups peaks within each spectrum based on similarity of their
-m/z values and combines these into a single peak per peak group.deisotopeSpectra
: deisotopes each individual spectrum keeping only the
-monoisotopic peak for peaks groups of potential isotopologues.filterIntensity
: filter each spectrum keeping only peaks with intensities
-meeting certain criteria.filterMzRange
: subsets peaks data within each spectrum keeping only peaks
-with their m/z values within the specified m/z range.filterPrecursorPeaks
: removes peaks with either an m/z value matching the
-precursor m/z of the respective spectrum (with parameter mz = "=="
) or peaks
-with an m/z value larger or equal to the precursor m/z (with parameter
-mz = ">="
).filterMzValues
: subsets peaks within each spectrum keeping or removing (all)
-peaks matching provided m/z value(s) (given parameters ppm
and tolerance
).reduceSpectra
: filters individual spectra keeping only the largest peak for
-groups of peaks with similar m/z values.-► Question -
-Using the sp_sciex
data, select all spectra measured in the second
-mzML file and subsequently filter them to retain spectra measured
-between 175 and 189 seconds in the measurement run.
- -
--► Solution -
- -As an example of data processing, we use below the pickPeaks()
-function. This function allows to convert profile mode MS data to centroid
-mode data (a process also referred to as centroiding).
Centroiding reduces the profile mode MS data to a representative single mass -peak per ion.
- - -The figure below (taken from (Gatto, Gibb, and Rainer 2020Gatto, Laurent, Sebastian Gibb, and Johannes Rainer. 2020. “MSnbase, Efficient and Elegant r-Based Processing and Visualisation of Raw Mass Spectrometry Data.” J. Proteome Res., September.)) illustrates the respective
-advantages of storing data in memory or on disk. The benchmarking was
-done for the MSnbase
package but also applies to the Spectra
backends.
Most functions on Spectra
support (and use) parallel processing out
-of the box. Peak data access and manipulation methods perform by
-default parallel processing on a per-file basis (i.e. using the
-dataStorage variable as splitting factor). Spectra uses
-BiocParallel
for
-parallel processing and all functions use the default registered
-parallel processing setup of that package.
Data manipulations on Spectra objects are not immediately applied to
-the peak data. They are added to a so called processing queue which is
-applied each time peak data is accessed (with the peaksData
, mz
or
-intensity
functions). Thanks to this processing queue data
-manipulation operations are also possible for read-only backends
-(e.g. mzML-file based backends or database-based backends). The
-information about the number of such processing steps can be seen
-below (next to Lazy evaluation queue).
## [1] 0
-sp_sciex <- filterIntensity(sp_sciex, intensity = c(10, Inf))
-sp_sciex ## Note the lazy evaluation queue
## MSn data (Spectra) with 1862 spectra in a MsBackendMzR backend:
-## msLevel rtime scanIndex
-## <integer> <numeric> <integer>
-## 1 1 0.280 1
-## 2 1 0.559 2
-## 3 1 0.838 3
-## 4 1 1.117 4
-## 5 1 1.396 5
-## ... ... ... ...
-## 1858 1 258.636 927
-## 1859 1 258.915 928
-## 1860 1 259.194 929
-## 1861 1 259.473 930
-## 1862 1 259.752 931
-## ... 33 more variables/columns.
-##
-## file(s):
-## 20171016_POOL_POS_1_105-134.mzML
-## 20171016_POOL_POS_3_105-134.mzML
-## Lazy evaluation queue: 1 processing step(s)
-## Processing:
-## Remove peaks with intensities outside [10, Inf] in spectra of MS level(s) 1. [Wed Sep 6 11:50:20 2023]
-
-## [1] 412
-
-## [[1]]
-## Object of class "ProcessingStep"
-## Function: user-provided function
-## Arguments:
-## o intensity = 10Inf
-## o msLevel = 1
-Through this lazy evaluation system it is also possible to undo data -manipulations:
- -## list()
-
-## [1] 0
-More information on this lazy evaluation concept implemented in Spectra
is
-provided in the Spectra
-backends
-vignette.
As defined in the Bioconductor S4Vectors
-package.↩︎
Page built: -2023-09-06 - using -R version 4.3.1 Patched (2023-07-10 r84676) -
-The Single-cell proteomics data analysis using QFeatures
and
-scp
workshop
-is provided as two vignettes. The first one provides a general
-introduction to the QFeatures
class in the general context of mass
-spectrometry-based proteomics data manipulation. The second vignette
-focuses on single-cell application and introduces the scp
package
-(Vanderaa and Gatto 2021Vanderaa, Christophe, and Laurent Gatto. 2021. “Replication of Single-Cell Proteomics Data Reveals Important Computational Challenges.” Expert Rev. Proteomics, October.) as an extension of QFeatures
. This second
-vignette also provides exercises that give the attendee the
-opportunity to apply the learned concepts to reproduce a published
-analysis on a subset of a real data set.
The SpectraTutorials package -provides three different vignettes:
-r BiocStyle::Biocpkg("MetaboAnnotation")
package in LC-MS/MS annotation
-workflows for untargeted metabolomics data.A tutorial presenting Use Cases and Examples for Annotation of
-Untargeted Metabolomics
-Data using
-the MetaboAnnotation
and MetaboCoreUtils
packages
-(Rainer et al. 2022Rainer, Johannes, Andrea Vicini, Liesa Salzer, Jan Stanstrup, Josep M Badia, Steffen Neumann, Michael A Stravs, et al. 2022. “A Modular and Expandable Ecosystem for Metabolomics Data Annotation in R.” Metabolites 12 (2): 173.).
Exploring and analyzing LC-MS data with Spectra and -xcms provides an -overview of recent developments in Bioconductor to work with mass -spectrometry -(MsExperiment, -Spectra) and -specifically LC-MS data (xcms) -and walks through the preprocessing of a small data set emphasizing -on selection of data-dependent settings for the individual -pre-processing steps.
For questions about specific software or their usage, please refer to -the software’s github issue page, or use the Bioconductor support -site.
-The following packages have been used to generate this document.
- -## R version 4.3.1 Patched (2023-07-10 r84676)
-## Platform: x86_64-pc-linux-gnu (64-bit)
-## Running under: Manjaro Linux
-##
-## Matrix products: default
-## BLAS: /usr/lib/libblas.so.3.11.0
-## LAPACK: /usr/lib/liblapack.so.3.11.0
-##
-## locale:
-## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
-## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
-## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
-## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
-## [9] LC_ADDRESS=C LC_TELEPHONE=C
-## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
-##
-## time zone: Europe/Brussels
-## tzcode source: system (glibc)
-##
-## attached base packages:
-## [1] stats4 stats graphics grDevices utils datasets methods
-## [8] base
-##
-## other attached packages:
-## [1] mzID_1.38.0 patchwork_1.1.3
-## [3] factoextra_1.0.7 gplots_3.1.3
-## [5] limma_3.56.2 lubridate_1.9.2
-## [7] forcats_1.0.0 stringr_1.5.0
-## [9] purrr_1.0.2 readr_2.1.4
-## [11] tibble_3.2.1 tidyverse_2.0.0
-## [13] MSnID_1.34.0 magrittr_2.0.3
-## [15] tidyr_1.3.0 ggplot2_3.4.3
-## [17] dplyr_1.1.2 msdata_0.40.0
-## [19] rpx_2.8.0 MsCoreUtils_1.12.0
-## [21] QFeatures_1.11.1 MultiAssayExperiment_1.26.0
-## [23] SummarizedExperiment_1.30.2 Biobase_2.60.0
-## [25] GenomicRanges_1.52.0 GenomeInfoDb_1.36.1
-## [27] IRanges_2.34.1 MatrixGenerics_1.12.3
-## [29] matrixStats_1.0.0 Spectra_1.10.2
-## [31] ProtGenerics_1.32.0 BiocParallel_1.34.2
-## [33] S4Vectors_0.38.1 BiocGenerics_0.46.0
-## [35] mzR_2.34.1 Rcpp_1.0.11
-## [37] BiocStyle_2.28.0
-##
-## loaded via a namespace (and not attached):
-## [1] later_1.3.1 bitops_1.0-7 filelock_1.0.2
-## [4] R.oo_1.25.0 preprocessCore_1.62.1 XML_3.99-0.14
-## [7] lifecycle_1.0.3 rstatix_0.7.2 doParallel_1.0.17
-## [10] lattice_0.21-8 MASS_7.3-60 backports_1.4.1
-## [13] sass_0.4.7 rmarkdown_2.24 jquerylib_0.1.4
-## [16] yaml_2.3.7 httpuv_1.6.11 DBI_1.1.3
-## [19] RColorBrewer_1.1-3 abind_1.4-5 zlibbioc_1.46.0
-## [22] R.cache_0.16.0 R.utils_2.12.2 AnnotationFilter_1.24.0
-## [25] RCurl_1.98-1.12 rappdirs_0.3.3 GenomeInfoDbData_1.2.10
-## [28] ggrepel_0.9.3 pheatmap_1.0.12 MSnbase_2.27.1
-## [31] ncdf4_1.21 codetools_0.2-19 DelayedArray_0.26.7
-## [34] xml2_1.3.5 tidyselect_1.2.0 farver_2.1.1
-## [37] BiocFileCache_2.8.0 jsonlite_1.8.7 ellipsis_0.3.2
-## [40] iterators_1.0.14 foreach_1.5.2 tools_4.3.1
-## [43] glue_1.6.2 BiocBaseUtils_1.2.0 xfun_0.40
-## [46] withr_2.5.0 BiocManager_1.30.22 fastmap_1.1.1
-## [49] rhdf5filters_1.12.1 fansi_1.0.4 caTools_1.18.2
-## [52] digest_0.6.33 timechange_0.2.0 R6_2.5.1
-## [55] mime_0.12 colorspace_2.1-0 gtools_3.9.4
-## [58] RSQLite_2.3.1 R.methodsS3_1.8.2 utf8_1.2.3
-## [61] generics_0.1.3 data.table_1.14.8 httr_1.4.7
-## [64] S4Arrays_1.0.5 pkgconfig_2.0.3 gtable_0.3.4
-## [67] blob_1.2.4 impute_1.74.1 XVector_0.40.0
-## [70] htmltools_0.5.6 carData_3.0-5 bookdown_0.34.2
-## [73] MALDIquant_1.22.1 clue_0.3-64 scales_1.2.1
-## [76] png_0.1-8 knitr_1.43 rstudioapi_0.15.0
-## [79] tzdb_0.4.0 reshape2_1.4.4 curl_5.0.2
-## [82] cachem_1.0.8 rhdf5_2.44.0 BiocVersion_3.17.1
-## [85] KernSmooth_2.23-22 parallel_4.3.1 AnnotationDbi_1.62.2
-## [88] vsn_3.68.0 msmbstyle_0.0.19 pillar_1.9.0
-## [91] grid_4.3.1 vctrs_0.6.3 pcaMethods_1.92.0
-## [94] promises_1.2.1 ggpubr_0.6.0 car_3.1-2
-## [97] dbplyr_2.3.3 xtable_1.8-4 cluster_2.1.4
-## [100] evaluate_0.21
-## [ reached getOption("max.print") -- omitted 28 entries ]
-
-Page built: -2023-09-06 - using -R version 4.3.1 Patched (2023-07-10 r84676) -
-