---
title: "Practical: Alevin single sample"
subtitle: "Transcriptome Analysis Workshop"
author: "Adam Cribbs"
date: "24/01/2022"
output:
  html_document:
    theme: cosmo
    toc: yes
---
```{r, out.width = "40%", echo=FALSE}
htmltools::img(src = knitr::image_uri(file.path("logo.png")),
alt = 'logo',
style = 'position:absolute; top:0; right:0; padding:10px;',
width='300')
```
# Overview
This is an R Markdown document showing an implementation of Seurat that was adapted from the vignette found here: [Seurat tutorial](https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html).
# Experiment plan
This experiment was designed to compare control and drug-treated cells.
Cells were harvested and a single-cell experiment was performed using the Drop-seq method.
```{r, message=FALSE}
knitr::opts_chunk$set(cache=FALSE, warning = FALSE, message = FALSE)
Sys.setenv(KMP_DUPLICATE_LIB_OK=TRUE)
library(Seurat)
library(tximport)
library(cowplot)
library(org.Hs.eg.db)
library(dplyr)
library("AnnotationDbi")
library(DT)
library(biomaRt)
library(EnsDb.Hsapiens.v75)
```
# Import the data
The experiment was analysed using the [cgat-developers](https://github.com/cgat-developers/cgat-core) pipeline framework. The pipeline used is [scflow](https://github.com/Acribbs/scflow).
scflow can be installed using pip or through conda:
`pip install scflow`
or
`conda install -c cgat scflow`
Read quality looked good, and expression was then quantified using the [Salmon Alevin method](https://salmon.readthedocs.io/en/latest/alevin.html).
```{r}
txi.control <- tximport("salmon.dir/pbmc_1k_v3_S1/alevin/quants_mat.gz", type="alevin")
d <- txi.control$counts
# Import the QC metrics object from the previous script and keep only the cells that passed filtering
metrics_clean <- readRDS("metrics_clean.rds")
d <- d[, metrics_clean$cells]
# Map Ensembl gene IDs to gene symbols, keeping the Ensembl ID where no symbol is found
SYMBOL <- mapIds(EnsDb.Hsapiens.v75, keys = rownames(d), column = "SYMBOL",
                 keytype = "GENEID", multiVals = "first")
rownames(d) <- make.unique(ifelse(is.na(SYMBOL), rownames(d), SYMBOL))
control <- CreateSeuratObject(counts = d, min.cells = 3, min.features = 200, project = "CTRL")
control$stim <- "CTRL"
```
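As an optional sanity check, we can confirm the dimensions of the imported matrix and inspect the resulting Seurat object before moving on:
```{r}
# Genes x cells after restricting to the QC-passing barcodes
dim(d)
# Row names are now gene symbols (Ensembl IDs are kept where no symbol was found)
head(rownames(d))
# Summary of the Seurat object
control
```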
# Assess QC metrics {.tabset .tabset-fade}
## Control features
We have previously shown that filtering is a very important step in single-cell data analysis. We previously implemented our own filtering strategy and also used scater. Seurat also provides basic quality-control steps; however, I would not recommend relying on these alone, because they are not as robust as the filtering we performed in the previous R Markdown scripts.
In the example below, we visualise QC metrics and use them to filter cells:

- We filter cells that have fewer than 1,000 or more than 5,000 unique feature counts
- We filter cells that have more than 20% mitochondrial counts
```{r}
control[["percent.mt"]] <- PercentageFeatureSet(control, pattern = "^MT-")
VlnPlot(control, c("nFeature_RNA", "nCount_RNA", "percent.mt"))
```
```{r}
plot1 <- FeatureScatter(control, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(control, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))
```
# Filter the cells
Filtering is performed using the Seurat::subset function.
```{r}
control <- subset(control, subset = nFeature_RNA > 1000 & nFeature_RNA < 5000 & percent.mt < 20)
```
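It can be useful to check how many cells survive these thresholds before continuing:
```{r}
# Number of cells retained after filtering
ncol(control)
# Range of detected features per cell in the filtered object
range(control$nFeature_RNA)
```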
# Normalise the data
The single-cell count data should be normalised before performing any further downstream analyses, so that cell-to-cell comparisons can be made. There are a number of normalisation methods available; you can see the options with ?NormalizeData.
```{r}
control <- NormalizeData(control)
```
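The call above uses Seurat's defaults. For reference, the equivalent call with the default parameters written out explicitly is shown below (not evaluated, since it repeats the step above):
```{r, eval=FALSE}
# Log-normalisation: counts are scaled to 10,000 per cell and log1p-transformed
control <- NormalizeData(control,
                         normalization.method = "LogNormalize",
                         scale.factor = 10000)
```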
# Identify highly variable genes
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e. they are highly expressed in some cells and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
The procedure in Seurat (v3 onwards) improves on previous versions by directly modelling the mean-variance relationship inherent in single-cell data, and is implemented in the FindVariableFeatures function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.
```{r}
control <- FindVariableFeatures(control, selection.method = "disp", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(control), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(control)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1
plot2
```
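Note that the mean-variance modelling described above corresponds to Seurat's default "vst" selection method, whereas the chunk above uses the dispersion-based method. If you would rather use the vst approach, the call would be (not evaluated here):
```{r, eval=FALSE}
# Variance-stabilising-transformation-based selection of variable features (Seurat default)
control <- FindVariableFeatures(control, selection.method = "vst", nfeatures = 2000)
```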
# Evaluate the clustering
Next we will scale the data, run PCA and then use UMAP to visualise the clusters.
First, we apply a linear transformation (‘scaling’), a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData function:

- Shifts the expression of each gene, so that the mean expression across cells is 0
- Scales the expression of each gene, so that the variance across cells is 1

This step gives each gene equal weight in downstream analyses, so that highly expressed genes do not dominate.
The results are stored in control[["RNA"]]@scale.data.
```{r}
control <- ScaleData(control)
```
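By default ScaleData only scales the variable features identified above, which is sufficient for PCA. If you later want scaled values for all genes (for example, for heatmaps of arbitrary markers), you can pass every feature explicitly; a sketch (not evaluated here, as it is slower):
```{r, eval=FALSE}
# Scale every gene rather than just the 2,000 variable features
control <- ScaleData(control, features = rownames(control))
```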
# Determining the dimensionality of the data
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. However, how many components should we choose to include? 10? 20? 100?
In Macosko et al, we implemented a resampling test inspired by the JackStraw procedure. We randomly permute a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of feature scores, and repeat this procedure. We identify ‘significant’ PCs as those that have a strong enrichment of low p-value features.
The JackStrawPlot function provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line). In this case it appears that there is a sharp drop-off in significance after the first 12-13 PCs.
```{r, message=FALSE}
control <- RunPCA(control, npcs = 20)
control <- JackStraw(control, num.replicate = 100)
control <- ScoreJackStraw(control, dims = 1:20)
JackStrawPlot(control, dims = 1:20)
```
An alternative heuristic method generates an ‘Elbow plot’: a ranking of principal components based on the percentage of variance explained by each one (ElbowPlot function). In this example, we can observe an ‘elbow’ around PC5-6, suggesting that the majority of true signal is captured in the first 8 PCs.
```{r}
ElbowPlot(control)
```
Identifying the true dimensionality of a dataset can be challenging and uncertain for the user. We therefore suggest three approaches to consider. The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA, for example. The second implements a statistical test based on a random null model, but is time-consuming for large datasets and may not return a clear PC cutoff. The third is a heuristic that is commonly used and can be calculated instantly. In this example, all three approaches yielded similar results, but we might have been justified in choosing anything between PC 7-12 as a cutoff.
We chose 10 here, but encourage users to consider the following:

- Dendritic cell and NK aficionados may recognize that genes strongly associated with PCs 12 and 13 define rare immune subsets (i.e. MZB1 is a marker for plasmacytoid DCs). However, these groups are so rare that they are difficult to distinguish from background noise for a dataset of this size without prior knowledge.
- We encourage users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As you will observe, the results often do not differ dramatically.
- We advise users to err on the higher side when choosing this parameter. For example, performing downstream analyses with only 5 PCs does significantly and adversely affect the results.
# PCA
PCA plots can be important for spotting batch effects or irregularities within the data.
```{r}
control <- RunPCA(control, npcs = 10)
DimPlot(control, reduction = "pca", split.by = "stim")
```
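For the more supervised approach mentioned above, exploring which genes drive each component, Seurat provides loading plots and PC heatmaps; a minimal sketch:
```{r}
# Genes with the strongest loadings on the first two principal components
VizDimLoadings(control, dims = 1:2, reduction = "pca")
# Heatmap of cells and genes ordered by their PCA scores for the first six PCs
DimHeatmap(control, dims = 1:6, balanced = TRUE)
```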
# UMAP
UMAP is the recommended method for visualising high-dimensional clustering structure within the data.
```{r}
control <- RunUMAP(control, reduction = "pca", dims = 1:10)
control <- FindNeighbors(control, reduction = "pca", dims = 1:10)
# Identify clusters of cells; the resolution parameter controls the granularity (higher values yield more clusters)
control <- FindClusters(control, resolution = 0.4)
p1 <- DimPlot(control, reduction = "umap", group.by = "stim")
p2 <- DimPlot(control, reduction = "umap", label = TRUE)
plot_grid(p1,p2)
```
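A quick way to see how many cells fall into each of the clusters identified above:
```{r}
# Number of cells assigned to each cluster
table(Idents(control))
```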
# Clustering facet
```{r}
DimPlot(control, reduction = "umap", split.by = "stim")
```
# Feature plot
```{r}
FeaturePlot(control, features = c("CD4", "CD8A"), split.by = "stim",
order = TRUE)
```
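The same markers can also be viewed per cluster as violin plots, which can make differences in expression level between clusters easier to judge:
```{r}
# Per-cluster expression distributions for the same T-cell markers
VlnPlot(control, features = c("CD4", "CD8A"))
```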
# Finding differentially expressed features
Seurat can help you find markers that define clusters via differential expression. By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells. FindAllMarkers automates this process for all clusters, but you can also test groups of clusters vs. each other, or against all cells.
The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells, and the logfc.threshold argument requires a feature to be differentially expressed (on average) by some amount between the two groups. You can set both of these to 0, but with a dramatic increase in time, since this will test a large number of features that are unlikely to be highly discriminatory. As another option to speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than this value. While there is generally going to be a loss in power, the speed increases can be significant, and the most highly differentially expressed features will likely still rise to the top.
```{r}
# find all markers of cluster 1
cluster1.markers <- FindMarkers(control, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
```
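As noted above, markers can also be computed for one cluster against a specific set of other clusters rather than against all remaining cells. A sketch is shown below (not evaluated; the cluster numbers are chosen purely for illustration and depend on your clustering):
```{r, eval=FALSE}
# Markers distinguishing cluster 1 from clusters 0 and 2 only
cluster1.vs.others <- FindMarkers(control, ident.1 = 1, ident.2 = c(0, 2), min.pct = 0.25)
head(cluster1.vs.others, n = 5)
```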
# Find all markers
```{r}
# Find markers for every cluster compared to all remaining cells, reporting only the positive ones
control.markers <- FindAllMarkers(control, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
# Show the top 2 markers per cluster, ranked by average log fold change
# (the column is named avg_log2FC in Seurat v4 and avg_logFC in v3)
control.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
```
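The per-cluster markers can be summarised visually with a heatmap of the top markers for each cluster. A sketch, assuming the Seurat v4 column name avg_log2FC (it is avg_logFC in v3); note that markers that are not among the scaled variable features will be dropped from the heatmap with a warning:
```{r}
# Top 10 markers per cluster ranked by average log fold change
top10.markers <- control.markers %>%
  group_by(cluster) %>%
  top_n(n = 10, wt = avg_log2FC)
DoHeatmap(control, features = top10.markers$gene) + NoLegend()
```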
# Save the Seurat object
```{r}
saveRDS(control, file="seurat_object.rds")
```
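The saved object can be reloaded in later scripts without repeating the steps above:
```{r, eval=FALSE}
# Reload the processed Seurat object
control <- readRDS("seurat_object.rds")
```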