-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
181 lines (137 loc) · 5.03 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
title: "plomics"
output:
github_document:
toc: true
toc_depth: 3
editor_options:
chunk_output_type: console
---
A collection of functions for placental DNA methylation analysis.
## Install
```{r message = F, eval = F}
remotes::install_github('wvictor14/plomics')
```
```{r, message = F, warning = F, include = F}
library(plyr)
library(tidyr)
library(dplyr)
library(ggplot2)
```
## Functions
### lmmatrix
Computes pairwise linear models between several variables.
```{r message = F}
library(minfiData)
library(plomics)
# load example data
data(RGsetEx)
# calculate pcs on the data
betas <- getBeta(RGsetEx)
pc_obj <- prcomp(t(na.omit(betas)), center = T, scale = T)
# get pc scores for each sample
rotated <- pc_obj$x
#rsquared
rsq <- lmmatrix(dep = rotated,
ind = as.data.frame(pData(RGsetEx)[,c('Sample_Group', 'age', 'sex', 'status')]))
#pvalue
pva <- lmmatrix(dep = rotated,
ind = as.data.frame(pData(RGsetEx)[,c('Sample_Group', 'age', 'sex', 'status')]),
metric = 'Pvalue')
##### plot
# reshape first
rsq_plot <- rsq %>% as.data.frame() %>%
# add dep variables
mutate(dep = rownames(rsq)) %>%
# reshape
gather(PC, rsquared, -dep)
pva_plot <- pva %>% as.data.frame() %>%
# add dep variables
mutate(dep = rownames(rsq)) %>%
# reshape
gather(PC, pval, -dep) %>%
# pvalue categories
mutate(pval_cat = case_when(
pval > 0.05 ~ '> 0.05',
pval < 0.05 & pval > 0.01 ~ '< 0.05',
pval < 0.01 & pval > 0.001 ~ '< 0.01',
pval < 0.001 ~ '< 0.001'
))
ggplot(rsq_plot, aes(x = PC, y = dep, fill = rsquared)) +
geom_tile() + theme_bw() +
scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
scale_fill_gradientn(colours=c("white", "#ffffcc", "#41b6c4", "#2c7fb8", "#253494"),
breaks = c(0,0.5,1), limits = c(0,1),
guide = guide_colorbar(frame.colour = "black", ticks.colour = "black"))
ggplot(pva_plot, aes(x = PC, y = dep, fill = pval_cat)) +
geom_tile() + theme_bw() +
scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
scale_fill_manual(values = c('> 0.05' = 'white', '< 0.05' = '#fee8c8',
'< 0.01' = '#fdbb84', '< 0.001' = '#e34a33'))
```
### pairTest
To test if covariates are confounding each other, we need to pairwise tests of independence between
covariates. If at least one covariate is numeric, we can use linear regression. Otherwise if both
covariates are categorical, then a chi squared test must be used.
```{r, warning = F}
variables <- data.frame(
row = c(1, 1, 2, 2, 3, 3),
column = c(1, 1, 1, 2, 2, 2),
Sex = c('m', 'm', 'm', 'f', 'f', 'f'),
age = c(18, 19, 18, 27, 30, 16),
ethnicity = c('AF', 'AF', 'AF', 'EU', 'EU', 'AS')
)
tests <- pairtest(variables)
# make categories
tests <- tests %>%
mutate(pval_cat = if_else(p.value < 0.001, '< 0.001',
if_else(p.value < 0.01, '< 0.01',
if_else(p.value < 0.05, '< 0.05', '<1'))))
tests
# plot heatmap of associations
ggplot(tests, aes(x=Row, y = Column, fill = pval_cat)) +
geom_tile(col = 'grey') + theme(panel.background = element_blank()) +
scale_x_discrete(expand = c(0, 0)) +
scale_y_discrete(expand = c(0, 0)) +
scale_fill_manual(values = c('> 0.05' = 'white', '< 0.05' = '#fee8c8',
'< 0.01' = '#fdbb84', '< 0.001' = '#e34a33'))
```
### findsentrix
`findsentrix` takes a vector of sentrix IDs (chip identifiers) and searches a directory for IDAT
files that match. The returned data frame contains two columns: (1) the sentrix ID (2) unique file
paths for each idat that matches the sentrix ID.
Here is an example using the robinson lab master sample sheet:
```{r}
# read in master sample sheet
ss <- readxl::read_xlsx('Z:/ROBLAB6 Infinium450k John/Master_Sample_Sheet.xlsx')
## specify idat directory
idat_dir <- 'Z:/ROBLAB6 Infinium450k John/EPIC Raw data/'
ss <- ss %>%
# Take the first 6 EPIC samples
dplyr::arrange(desc(Platform)) %>%
dplyr::slice(1:6) %>%
dplyr::select(Sample_Name, Sentrix_ID, Sentrix_Position) %>%
# create sentrix column
dplyr::mutate(Sentrix = paste0(Sentrix_ID, '_', Sentrix_Position))
```
We created a sentrix ID by taking the chip serial number (confusingly named as "sentrix_ID") and
pasting this to the position identifier ("Sentrix_Position"):
**Sentrix:** `r ss$Sentrix[1]`
**Sentrix Chip Number:** `r as.character(ss$Sentrix_ID[1])`
**Sentrix Position on chip:** `r ss$Sentrix_Position[1]`
Now we can use `findsentrix` to find the filepaths for idats that match each sentrix identifier:
```{r}
idatfiles <- findsentrix(sentrix = ss$Sentrix, directory = idat_dir)
idatfiles
# join all matches, retaining unmatched and multiple matched IDs
ss <- ss %>%
dplyr::full_join(idatfiles, by = 'Sentrix')
```
Finally, we can load idats using this dataframe:
```{r}
## Now you can load in these samples with minfi::read.metharray.exp
rgset <- minfi::read.metharray.exp(targets = as.data.frame(ss), verbose = T)
rgset
```