-
Notifications
You must be signed in to change notification settings - Fork 0
/
tut_maldiquant_msi.Rmd
289 lines (190 loc) · 7.92 KB
/
tut_maldiquant_msi.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
---
title: "Preprocessing MSI data using MALDIquant"
author: "Paolo Inglese"
date: "15/10/2021"
output:
html_document:
theme: readable
bibliography: ['refs.bib']
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## About
In this short tutorial, I will show how to preprocess mass spectrometry imaging (MSI) data
using [```MALDIquant```](https://cran.r-project.org/package=MALDIquant) [@MALDIquant], [```MALDIquantForeign```](https://cran.r-project.org/package=MALDIquantForeign)
First load the required packages:
```{r packages}
library(MALDIquant)
library(MALDIquantForeign)
library(irlba)
library(viridis)
```
## Importing ImzML
For this tutorial, I will use a MALDI-MSI dataset `EY_210930pm_EDI-27jul21_p56-Male-642-12-6-T_50um_2SE` collected from a mouse brain in positive ion mode.
The dataset is publicly available on [METASPACE2020](https://metaspace2020.eu/) (*thanks to Eylan Yutuc for sharing it online*).
We call the function ```importImzML``` from ```MALDIquant``` to load the MSI dataset ```imzML``` file.
The dataset is in centroided mode, so we will set the option (```centroided = TRUE```). This operation usually takes few minutes,
depending on the dataset size.
```{r load}
peaks <- importImzMl('EY_210930pm_EDI-27jul21_p56-Male-642-12-6-T_50um_2SE.imzML', centroided = TRUE, verbose = FALSE)
```
Some details about the dataset
```{r details}
# Total number of pixels
print(length(peaks))
# Spatial dimensions (same metadata for all pixels)
print(peaks[[1]]@metaData$imaging$size)
```
***NOTE***: The following code expects the spectra to be acquired ***row-wise*** ***left-to-right***.
This small function can reorder the `peaks` list elements in case they were acquired with a
different pattern:
```{r px order}
orderPixels <- function(peaks) {
require(MALDIquant)
coords <- coordinates(peaks)
ord <- c()
for (y in sort(unique(coords[, 2]))) {
curr.y <- which(coords[, 2] == y)
ord <- c(ord, curr.y[order(coords[curr.y, 1])])
}
return(ord)
}
px.ord <- orderPixels(peaks)
```
In this case, the order remains unchanged, since the spectra were already acquired in the expected order:
```{r check order}
px.ord[1:10]
max(diff(px.ord)) # The max difference between two consecutive pixels order is 1, because it's identical to the original order
# To reorder the peaks
peaks <- peaks[px.ord]
```
Let's have a look at the distribution of the number of detected peaks and pixels mean intensities (in the log-space):
```{r}
n.peaks <- unlist(lapply(peaks, function(x) length(mass(x))))
hist(n.peaks)
mu.peaks <- unlist(lapply(peaks, function(x) mean(intensity(x))))
hist(log1p(mu.peaks))
```
We can check that we have a set of MS peaks (class ```MassPeaks```):
```{r class}
print(class(peaks[[1]]))
```
and plot the peaks from one pixel:
```{r plot pixel}
plot(peaks[[3000]])
```
# Total-ion-count (TIC) image
A quick way to check the spatial properties of an MSI dataset is to plot the TIC image. In this image,
each pixel represents its total peaks intensity:
```{r tic}
tic <- unlist(lapply(peaks, function(x) sum(intensity(x))))
tic <- matrix(tic, peaks[[1]]@metaData$imaging$size)
image(tic, col=viridis(64))
title('TIC')
```
As you may notice, it is not very informative, although the section boundaries are barely visible.
Things are a bit clearer if we calculate the TIC of the log-transformed intensities:
```{r tic.log}
tic.log <- unlist(lapply(peaks, function(x) sum(log1p(intensity(x)))))
tic.log <- matrix(tic.log, peaks[[1]]@metaData$imaging$size)
image(tic.log, col=viridis(64))
title('TIC (log)')
```
## Peak binning (or peak matching)
After quickly checking some details of the dataset, we can start processing it. The first step is
matching the peaks across all pixels. In this way, we assign the peaks intensities to a set of
common masses, based on the similarity of their original values.
This can be done by calling the function ```binPeaks``` from ```MALDIquant```. We save the new
masses in the same list of ```MassPeaks``` (NOTE: this procedure will overwrite the original masses,
so if you want to keep them, you have to export the results in a different list).
We will use the 'strict' method, and a tolerance of 20 ppm:
```{r bin}
# binPeaks tolerance is expressed in delta mass / mass, we want it in ppm:
tol.ppm <- 20
tol.maldiquant <- tol.ppm / 1e6
# The new common masses will overwrite the original ones. The intensities remain unmodified
peaks <- binPeaks(peaks, method = 'strict', tolerance = tol.maldiquant)
```
### Intensity matrix (feature)
To perform downstream analysis we need the so-called intensity matrix (or feature matrix),
representing the intensities of all peaks assigned to the common masses. This matrix has a shape
_npixels_ x _nfeatures_ and it's generated using the function ```intensityMatrix``` from ```MALDIquant```.
*NOTE*: depending on the dataset, this matrix can be very big, so large amount of RAM may be necessary
```{r features}
# This can occupy a lot of RAM
X <- intensityMatrix(peaks)
```
*NOTE*: ```MALDIquant``` sets the unassigned values to ```NA```. You may need to convert them to 0
for downstream analysis.
In this case, we have a whooping number of common masses equal to:
```{r}
# most of the common masses represent noise!
ncol(X)
```
Most of these features are noise, as we can see from their assignment frequency:
```{r}
# Unfortunately my computer doesn't have enough RAM to run apply
nz <- array(0, ncol(X))
for (i in 1:ncol(X)) nz[i] <- sum(!is.na(X[, i]))
```
```{r}
summary(nz / nrow(X))
plot(seq(0, 1, 0.01), sapply(seq(0, 1, 0.01), function(x) sum(nz / nrow(X) > x) / length(nz)),
xlab = 'Freq. threshold',
ylab = 'Fraction of kept masses', type = 'b')
```
As we can see, less than 10% of the common masses are detected in more than 1% of the pixels.
We can set 5% as a threshold to remove the _rare_ features (this is quite arbitrary and now based on memory constraints)
```{r}
keep.mass <- which(nz > nrow(X) * 0.05)
X <- X[, keep.mass]
dim(X)
```
We now have 4,871 features which looks more reasonable than > 100,000.
_Where are my masses?_
The common masses are saved as column names of the intensity matrix
```{r}
common.masses <- as.numeric(colnames(X))
print(common.masses[1:20])
```
*NOTE*: let's covert the ```NA``` to 0.
```{r}
X[is.na(X)] <- 0
```
## Normalization and log-transformation
We can now normalize the intensities to take into account of the pixel-to-pixel variations.
To do so, we use TIC scaling. Also we apply a log-transformation to reduce the skeweness of the intensity distributions.
```{r}
scaling.factor <- apply(X, 1, sum)
X <- X / scaling.factor
# Be sure that the empty pixels are set to 0
X[scaling.factor == 0, ] <- 0
X <- log1p(X)
```
## How the data looks like?
A quick way to check the global molecular heterogeneity of the sample is based on plotting
the first 3 principal components (PC) as RGB channels. We use [```irlba```](https://cran.r-project.org/package=irlba) to quickly extract the
first 3 PC:
```{r}
results <- irlba::prcomp_irlba(X, center = TRUE, scale. = TRUE, n = 3)
```
```{r}
# First rescale the scores in [0, 1]
pc <- apply(results$x, 2, function(x) (x - min(x)) / (max(x) - min(x)))
# Create the RGB
im <- rgb(pc[, 1], pc[, 2], pc[, 3])
im <- matrix(im, peaks[[1]]@metaData$imaging$size)
# Plot the image as a raster (transpose before)
plot(as.raster(t(im)), interpolate = FALSE)
```
The image shows some degree of heterogeneity of the tissue, and its difference from the off-tissue region.
## Ready for downstream analysis
Now that we have the preprocessed intensity matrix, we can perform downstream analysis,
such as supervised classification, clustering, etc.
In this tutorial, we have used a basic filtering approach to reduce the number of common
masses. More sophisticated methods based on the spatial information are available, such as
[```SPUTNIK```](https://cran.r-project.org/package=SPUTNIK) [@10.1093/bioinformatics/bty622]
```{r session info}
sessionInfo()
```