-
Notifications
You must be signed in to change notification settings - Fork 0
/
HumGen_Lab4_23andMe.qmd
335 lines (232 loc) · 12 KB
/
HumGen_Lab4_23andMe.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
---
title: 'Lab 4 : Data Transformation and graphing 23andMe data'
---
## Learning objectives
- Data Transformation using dplyr
## Load libaries
```{r}
library(tidyverse)
library(knitr)
library(DT)
library(plotly)
```
## Learning objectives
- Using R Markdown to
- Reading and manipulating tables with tidyverse
- Making nice tables for your lab reports
- Resize graphs in Quarto
- Print graphics to a file (e.g. jpeg, pdf)
- Loading images into a Quarto file
- Making interactive graphs and in Quarto
## Working with large files the Terminal Window (instead of Console)
Often in bioinformatics we are working on files that are to large to be opened in R or on your computer. In R the file size is capped at 5 MB. If we try to open a larger file the following message appears
![](images/file_too_large.png)
We will use the `Terminal` window in the bottom left corner to explore large files to determine their structure. In this window we can use Unix commands if there are available on your computer (they are available on Posit Cloud and Unity). **Note these are Unix and not R commands.** In terminal we can use the Unix command `head` to look at the first 50 lines in the file `23andMe_complete.txt` in the data directory.
```
head -25 data/23andMe_complete.tsv
```
Notice that the first 14 lines contain information about the file and have hash tags #. `rsid` numbers can be linked to data in NCBI's [SNP database](https://www.ncbi.nlm.nih.gov/snp/rs4477212) and other resources. To look at the last 25 lines use `tail`
```
tail -25 data/23andMe_complete.tsv
```
We can see that the mitochondrial DNA only has 1 allele since it is a haploid rather the diploid genome
## Reading files with tidyverse
It is important to understand data and objects types, particularly when we are importing data. R will try to try to make a best guess, but it is up to you to make sure the data read into R as you would like.
All the data files for a lab will be in your RCloud project `data` directory. To import a file you must include the correct location of the file relative to where your .Rmd file is located. Let's try to read in a SNP genotype file from 23andMe. To read this into R using `read_tsv` (tab separated values) and ignore the comment lines containing #.
```{r}
#| eval: false
SNPs_23andMe <- read_tsv("data/23andMe_complete.tsv", comment = '#')
```
Look in the `Environment` window at the SNPs_23andMe object. Notice that the columns names are `rs548049170`, `1`, `69869` and `TT`. This is because there is a hash tag on the line with the column names (as seen above in terminal). Bummer but we can work with that while importing the file
```{r}
SNPs_23andMe <- read_tsv("data/23andMe_complete.tsv", comment = '#', col_names = FALSE) |>
setNames(c("rsid","chromosome","position","genotype"))
```
Look again in the `Environment` window at the SNPs_23andMe object. Notice the column names are now correct, but we want `chromosome` and \`genotype to be treated as factors. Let's make the change now.
```{r}
SNPs_23andMe <- SNPs_23andMe |>
mutate(chromosome = as.factor(chromosome)) |>
mutate(genotype = as.factor(genotype))
```
## Data transformation
Data transformation for making graphs and statistical analyses is a big part of this course. Today we will touch upon a few of the basics using the above data. Check out [dplyr vignettes](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) for more examples. Manipulating data in the `tidyverse` uses pipes `|>` or %\>%`to chain together steps. So if we want to filter the`UMass_COVID_data`to cases greater than 50. In this case we are not changing what is in the`UMass_COVID_data\` data object
```{r}
SNPs_23andMe |>
filter(position < 10)
```
We can string them together
```{r}
SNPs_23andMe |>
filter(genotype == 'T') |>
filter(chromosome == 'MT') |>
filter(position < 100)
```
To the number of rsids for each chromosome (notice we are not counting rsids but the number of times each factor in chromosome appears)
```{r}
SNPs_23andMe |>
count(chromosome)
```
```{r}
SNPs_23andMe |>
filter(chromosome == 'MT') |>
count(chromosome)
```
We can arrange the order of the chromosomes by the number of rsids
```{r}
SNPs_23andMe |>
count(chromosome) |>
arrange(n)
```
To make a table of genotype counts for each chromosome use `group_by`
```{r}
SNPs_23andMe |>
group_by(chromosome, genotype) |>
count(genotype)
```
To select a subset of columns
```{r}
SNPs_23andMe |>
select(rsid, chromosome)
```
Note that none of the above creates a new data object. To do this and display
```{r}
SNPs_23andMe_counts <-SNPs_23andMe |>
count(chromosome)
# to display data frame / tibble
SNPs_23andMe_counts
```
Use `mutate` to add a new column
```{r}
SNPs_23andMe_summarize <-SNPs_23andMe |>
group_by(chromosome) |>
summarize(rsid_count = n()) |>
mutate(rsid_percent = rsid_count/sum(rsid_count))
# to display data frame / tibble
SNPs_23andMe_summarize
```
There are other functions you can use in summarize (<https://dplyr.tidyverse.org/reference/summarise.html>). For example, getting the minimum or maximum value in a range.
```{r}
SNPs_23andMe |>
group_by(chromosome) |>
summarize(min_position = min(position), max_position = max(position))
```
## Making tables in R Markdown reports
You can make tables using `knitr`
```{r}
library(knitr)
kable(SNPs_23andMe_counts, caption = "Number of rsids on each chromosome")
```
This works well for small tables, but for even moderate tables like above it takes up a lot of space. In the above example
A better option is using the DT package, but \*\*\* Don't do this with tables of hundreds of thousands of rows (as in your complete SNP table).
```{r}
library(DT)
datatable(SNPs_23andMe_counts)
```
We can click on the table to reorganize the data.
## Fine tuning ggplots
Another good reference for ggplots is <a href="http://www.cookbook-r.com/Graphs/">Cookbook for R by Winston Chang</a> Please take your time and go through the following web pages.
- <a href="http://www.cookbook-r.com/Graphs/Titles_(ggplot2)/">Cookbook for R - Title</a>
- <a href="http://www.cookbook-r.com/Graphs/Axes_(ggplot2)/">Cookbook for R - Axes</a>
- <a href="http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/">Cookbook for R - Legends</a>
- <a href="http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/">Cookbook for R - Colors</a>
Here are a couple of cheatsheets that can be useful
- <a href="http://www.rstudio.com/wp-content/uploads/2015/12/ggplot2-cheatsheet-2.0.pdf">R Studio ggplot2 cheatsheet</a>
- <a href="https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf">RMarkdown cheatsheet</a>
### Controlling graph size in Quarto and R Markdown
As of writing this lab I have not been able to control computed figure dimensions in a R chunk with [Quarto executions](https://quarto.org/docs/computations/execution-options.html), but the R Markdown format is working. The dimensions of an individual graph in a R Markdown document be adjusted by specifying the graph dimensions as below
```{r, fig.width = 8, fig.height = 2}
# to adjust figure size {r, fig.width = 8, fig.height = 2}
ggplot(data = SNPs_23andMe, mapping = aes(x = genotype)) +
geom_bar() +
ggtitle("Total SNPs for each genotype") +
ylab("Total number of SNPs") +
xlab("Genotype")
```
```{r, fig.width = 4, fig.height = 4}
# to adjust figure size {r, fig.width = 3, fig.height = 3}
ggplot(data = SNPs_23andMe, mapping = aes(x = genotype)) +
geom_bar() +
ggtitle("Total SNPs for each genotype") +
ylab("Total number of SNPs") +
xlab("Genotype")
```
### Graphic Output
You may have realized that you can export plots in R Studio by clicking on Export in the Plots window that appears after you make a graph. You can save as a pdf, svg, tiff, png, bmp, jpeg and eps. You can also write the output directly to a file. This is particularly useful for controling the final dimensions in a reproducible way and for manuscripts.
- [Cookbook for R - Output to a file - PDF, PNG, TIFF, SVG](http://www.cookbook-r.com/Graphs/Output_to_a_file/)
- [Figures in Quarto](https://quarto.org/docs/authoring/figures.html)
```{r}
# Plot graph to a pdf outputfile
pdf("images/SNP_example_plot.pdf", width=6, height=3)
ggplot(data = SNPs_23andMe) +
geom_bar(mapping = aes(x = chromosome, fill = genotype))
dev.off()
```
```{r}
# Plot graph to a png outputfile
ppi <- 300
png("images/SNP_example_plot.png", width=6*ppi, height=6*ppi, res=ppi)
ggplot(data = SNPs_23andMe) +
geom_bar(mapping = aes(x = chromosome, fill = genotype))
dev.off()
```
For more details on sizing output <a href="http://www.cookbook-r.com/Graphs/Output_to_a_file/">Cookbook for R - Output to a file - PDF, PNG, TIFF, SVG </a>
### Loading images into Quarto documents
Sometimes it is useful in controling the image layout for a report to file with the graph and then subsequently load it into the .Rmd file. This works with png files, but not pdfs. You can also upload images made with other bioinformatic tools into your RMarkdown report.
```{r}
#| eval: false
# This is the RMarkdown style for inserting images
# Your image must be in your working directory
# This command is put OUTSIDE the r code chunk
![Genotype counts per chromosome](images/SNP_example_plot.png)
```
![Genotype counts per chromosome](images/SNP_example_plot.png)
Another way to present a graph without the code is adding echo = FALSE within the r{} chunk - {r echo = FALSE}. This prevents code, but not the results from appearing in the knitr file.
### Interactive graphs in RMarkdown reports
With plotly/ggplotly (<https://plot.ly/ggplot2/>) you can make interactive graphs in your lab report.
```{r}
library(plotly)
```
```{r}
#| eval: false
#| message: false
# Version 1 1
p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
ggplotly(p)
```
```{r}
#| message: false
# Version 2
ggplotly(
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
)
```
## Exercises
Use the 23andMe complete data set for the exercises. Pay attention to how your graphs look in today's final knitted lab report. You will be docked points if the graphs do not look nice (e.g. overlapping column names, truncated legends, ets.)
### Exercise 1
Subset (select) the 23andMe complete data frame to just position and genotype
### Exercise 2
Filter to just the MT chromosome and remove genotypes A and T. (you can use != to remove).
### Exercise 3
Use `group_by()` and `summarize()` to find the first position, last position and number of positions (SNPs) measured for each chromosome. Note that each position represents a SNP, therefore the number of SNPs for a chromosome = the number of positions measured on each chromosome.
### Exercise 4
Building on ex3 create use `mutate` to create new column with the density of SNPs for each chromosome based the total number of SNPs divided by the last position - first position
### Exercise 5
Building on ex3 sort chromosomes based on SNP density.
### Exercise 6
Make a table for your report using DT to show SNP density
### Exercise 7
Using ggplot make a make a bar graph of the total SNP counts for each chromosome. Add title and labels for the x and y axis. Set the fill of the bars to yellow and the outline (color) to black.
### Exercise 8
Modify ex 7 to make a stacked bar graph with the contributions of each genotype to the total SNP count (Hint: use fill).
### Exercise 9
Turn ex 8 into an interactive graph using plotly. Note: Titles in plotly often need to be wrapped. I do this using `labs(title = str_wrap("Total number of SNPs on each chromosome", 40)`.
### Exercise 10
Use `facet_wrap` to show each genotype as a separate graph (with a bar plot of counts per chromosomes).
### Exercise 11
Revise ex10 using `scales="free_y"` in `facet_wrap` to allow better show the variation in each graph.
### Exercise 12
Using the graph you made in exercise 8 output a PNG file with the ppi = 300 and the appropriate width and height so the image looks nice.
### Exercise 13
Load the png file of the graph from your images folder into your Quarto document.