YOUR NAME
-
Reminder: When answering the questions, it is not sufficient to provide only code. Add explanatory text to interpret your results throughout.
-
Total: 30 POINTS
In this assignment, we will explore the complexity of Acute Myeloid Leukemia (AML), an aggressive type of blood cancer, through multiomics data integration. One of the biggest outstanding challenges in AML research is that a wide variety of cancer-causing mutations are frequently found in genes involved in hematopoietic stem cell (HSC) differentiation. Such mutations can dramatically alter developmental/differentiation trajectories, increase heterogeneity among cancer cells, and obfuscate downstream analysis.
We will use a comprehensive data set, “BeatAML2,” which contains the raw and processed gene expression data used in Bottomly et al. as well as other clinical, genetic, and drug response variables. We will investigate how multiple types of high-dimensional sequencing results can be combined in real-world settings. Here, we will focus on data wrangling, exploratory data analysis, and differential gene expression analysis, but many other scientific questions can be explored, especially linking genetic and drug response variables.
Note: only a few libraries are provided for the initial steps; add additional libraries as needed for the rest of the analyses.
library(tidyverse)
library(readxl)
library(edgeR)
library(limma)
We may use some handy functions over and over again.
## Avoid overwriting
if.needed <- function(.file, .code) {
if(!all(file.exists(unlist(.file)))){ .code }
stopifnot(all(file.exists(unlist(.file))))
}
## A quick function to concatenate two strings
`%&%` <- function(a,b) paste0(a,b)
## Feel free to define your own housekeeping functions
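The if.needed helper relies on R's lazy evaluation: the code passed as .code is only evaluated when at least one target file is missing. A minimal sketch of that behaviour, assuming a temporary directory and a made-up file name (both are illustrative and not part of the assignment):

```r
## Copies of the two helpers so that this block runs on its own
if.needed <- function(.file, .code) {
  if (!all(file.exists(unlist(.file)))) { .code }
  stopifnot(all(file.exists(unlist(.file))))
}
`%&%` <- function(a, b) paste0(a, b)

## Hypothetical scratch location, used only for this illustration
tmp.dir <- tempfile("ifneeded_")
dir.create(tmp.dir)
out.file <- tmp.dir %&% "/example.txt"

n.runs <- 0
for (i in 1:2) {
  if.needed(out.file, {
    n.runs <- n.runs + 1           # counts how often the code is evaluated
    writeLines("hello", out.file)  # creates the file on the first pass
  })
}
n.runs  # number of times the code block was actually evaluated
```

The second call finds the file already in place and skips the code, which is what keeps repeated knitting from downloading the data again.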
Some helper code to read the data in is provided.
url <- "https://github.com/biodev/beataml2.0_data/raw/main/beataml_waves1to4_counts_dbgap.txt"
data.file <- "data/beataml_raw_counts.txt"
dir.create("data", recursive=T, showWarnings=F)
## Download data if needed; avoid overwriting
if.needed(data.file, download.file(url, destfile = data.file))
## Download clinical information if needed
url <- "https://github.com/biodev/beataml2.0_data/raw/main/beataml_wv1to4_clinical.xlsx"
clin.file <- "data/beataml_clinical.xlsx"
if.needed(clin.file, download.file(url, destfile = clin.file))
You might build your own SummarizedExperiment object, but we do not need to make one for this project. We will preprocess the raw data files and feed them into a DGEList object, which will be used repeatedly. We can import the count matrix stored in the “data/beataml_raw_counts.txt” file using a general-purpose utility function, such as read.table or fread in the data.table package.
count.df <-
data.table::fread(data.file, header = T, sep = "\t") %>%
as.data.frame()
Note: read.table needed additional tuning since some fields contain rather unconventional characters. Instead of using read.table directly, here we read the data using fread and converted the data.table into a data.frame.
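The source does not say which characters cause trouble or what tuning was needed. As a hedged illustration only, assuming the culprits are apostrophes and hash signs inside text fields, the sketch below shows the two read.table arguments that usually matter in that situation. The two-row table is made up.

```r
## Made-up two-row table; the apostrophe and the hash are the troublemakers
txt <- "stable_id\tdescription\nENSG_A\t5' nucleotidase\nENSG_B\tlocus #2\n"

toy.df <- read.table(text = txt, header = TRUE, sep = "\t",
                     quote = "",         # do not treat quote marks as delimiters
                     comment.char = "",  # do not treat # as the start of a comment
                     stringsAsFactors = FALSE)
dim(toy.df)
toy.df$description
```

With the default settings, the apostrophe would open a quoted field and the hash would cut the line short, so the parsed table would no longer have one row per input line.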
This study also comes with clinical information in Microsoft Excel format. The read_xlsx function gives us a convenient, programmatic way to read the necessary information. If you want to practise reproducible data science, avoid arbitrary or manual operations in your analysis pipeline.
.cols <- c(
"dbgap_rnaseq_sample", # unique sample ID
"cohort", #
"currentStage", #
"consensus_sex", # demographic
"vitalStatus", # survival information
"overallSurvival", #
"plateletCount", # some lab tests
"wbcCount", #
"%.Basophils.in.PB", # cell type fractions
"%.Blasts.in.BM", #
"%.Blasts.in.PB", #
"%.Eosinophils.in.PB", #
"%.Immature.Granulocytes.in.PB", #
"%.Lymphocytes.in.PB", #
"%.Monocytes.in.PB", #
"%.Neutrophils.in.PB", #
"%.Nucleated.RBCs.in.PB")
## Read data from .xlsx file;
## this will be a Tibble object
meta.df <-
read_xlsx(clin.file, sheet="summary") %>%
dplyr::select(all_of(.cols))
colnames(meta.df) <- gsub("[%][.]", "", colnames(meta.df))
In practice, many columns that are supposed to be quantitative may come with corrupted, non-numeric values. Inspecting every entry of such columns by hand is laborious and may introduce further errors or inconsistent formats. Let’s handle these painful steps programmatically, either the tidyverse (dplyr) way or the data.table way.
First, we can clean up the annotation data by applying the take.numeric function with dplyr::across to all the columns except for the categorical ones.
take.numeric <- function(x) {
str_extract(x, pattern="\\d+(\\.\\d+)?") %>% as.numeric() # optional decimal part, so "12.5" is not truncated to 12
}
meta.df <- meta.df %>%
mutate(across(
-c(dbgap_rnaseq_sample, cohort, currentStage, consensus_sex, vitalStatus), # excluding these columns
take.numeric # apply str_extract and as.numeric to all the other columns; pass the function itself, since `. %>% f` builds a function instead of calling it
))
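The choice of regular expression decides what survives this cleaning step. A minimal base-R sketch on made-up values, comparing an integer-only pattern with one that also accepts a decimal part. The helper extract.first is hypothetical and only mimics what str_extract followed by as.numeric would do.

```r
## Made-up raw entries of the kind a lab-value column might contain
raw.values <- c("12.5", "<5", "unknown", "37", NA)

## Hypothetical helper: first match of `pattern` in each string, as a number
extract.first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m > 0
  out[hit] <- regmatches(x, m)  # regmatches returns the matched elements only
  as.numeric(out)
}

int.only <- extract.first(raw.values, "\\d+")            # digits only
with.dec <- extract.first(raw.values, "\\d+(\\.\\d+)?")  # optional decimal part
data.frame(raw = raw.values, int.only, with.dec)
```

Entries without any digit become NA under both patterns, while a value such as "12.5" keeps its decimal part only under the second one.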
Side note: Alternatively, you may lapply the take.numeric function over the columns within the following data.table expression.
.dt <- data.table::as.data.table(meta.df) # data.table is not attached above, so call it with its namespace
non.num.cols <- c("dbgap_rnaseq_sample",
"cohort",
"currentStage",
"consensus_sex",
"vitalStatus")
meta.clean.dt <-
cbind(.dt[, .SD, .SDcols = non.num.cols],
.dt[,
lapply(.SD, take.numeric),
.SDcols = ! non.num.cols])
How many rows and columns do you see in the data (count.df), and what do they correspond to (genes/features or samples/individuals)?
- Number of rows:
# YOUR CODE HERE
- Number of columns:
# YOUR CODE HERE
One of the columns, biotype, indicates the biological type of each row (feature). Can you count how many types exist and how many of each type we have? If possible, sort the rows in descending order of frequency.
- HINT: You can summarize these as a table.
# YOUR CODE HERE
Subset the rows in the metadata meta.df and the columns in the count data count.df to include only those samples for which we have both gene counts and clinical information for downstream analysis. HINT: Keep in mind that not every sample in the data is present in the metadata table, and vice versa. This is one of the most error-prone steps in data analysis.
- Create a new data matrix matched.count.df by retaining only samples with matched clinical information (0.5 pt). Also, report the size of the resulting data matrix.
# YOUR CODE HERE
- Also create the annotation table matched.meta.df, restricted to the matched samples (0.5 pt).
# YOUR CODE HERE
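The matching logic can be illustrated on a toy example before touching the real data. The sketch below uses made-up sample IDs and a single gene-annotation column; it is one possible approach under those assumptions, not the expected solution.

```r
## Made-up count table (one annotation column plus three samples)
toy.count <- data.frame(stable_id = c("g1", "g2"),
                        S1 = c(5, 0), S2 = c(3, 1), S3 = c(9, 2),
                        stringsAsFactors = FALSE)
## Made-up metadata: S2 is missing, S4 has no counts, and the order differs
toy.meta <- data.frame(dbgap_rnaseq_sample = c("S3", "S4", "S1"),
                       cohort = c("A", "B", "A"),
                       stringsAsFactors = FALSE)

## Samples present on both sides, in the column order of the count table
shared <- intersect(colnames(toy.count), toy.meta$dbgap_rnaseq_sample)

toy.count.matched <- toy.count[, c("stable_id", shared)]
toy.meta.matched <- toy.meta[match(shared, toy.meta$dbgap_rnaseq_sample), ]

## Sanity check: same samples, same order
identical(colnames(toy.count.matched)[-1], toy.meta.matched$dbgap_rnaseq_sample)
```

Using match rather than a logical filter also puts the metadata rows in the same order as the count columns, which is what an order-sensitive check such as identical verifies.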
The cellular composition of the tumour microenvironment helps us understand the mechanisms of disease development and progression. However, cell type fraction information is missing for many samples. This study provides another data file of inferred cell type scores to compensate for the missing cell type fractions.
## Download inferred cell type scores
url <- "https://github.com/biodev/beataml2.0_data/raw/main/beataml2_manuscript_vg_cts.xlsx"
cts.file <- "data/beataml_celltypes.xlsx"
if.needed(cts.file, download.file(url, destfile = cts.file))
cts.df <- read_xlsx(cts.file)
- Update the metadata (matched.meta.df) by adding cell type scores (variables ending with “-like”), matching by the dbgap_rnaseq_sample column. Show the top five rows after the update (0.5 pt).
# YOUR CODE HERE
- Show that there is a strong correspondence between the observed and estimated cell type fractions by displaying a correlation heatmap. Provide your own interpretation (0.5 pt). HINT: Deal with missing values with the use="pairwise.complete.obs" option when computing the cor matrix; use pheatmap for visualization.
# YOUR CODE HERE
ADD YOUR INTERPRETATION.
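To see what the use="pairwise.complete.obs" option does, here is a small base-R sketch on a made-up matrix in which every column has one missing value:

```r
## Made-up data: each column has exactly one NA, in a different row
toy <- cbind(x = c(1, 2, 3, NA, 5),
             y = c(2, 4, NA, 8, 10),
             z = c(5, 3, 4, 2, NA))

cor.default <- cor(toy)  # default use = "everything"
cor.pairwise <- cor(toy, use = "pairwise.complete.obs")

cor.default["x", "y"]   # NA, because x and y both contain missing values
cor.pairwise["x", "y"]  # computed from rows 1, 2 and 5 only
```

Each entry of the pairwise matrix is computed from a different subset of rows, so entries backed by few complete pairs deserve more caution when interpreting the heatmap.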
How can you check whether we have successfully matched dbgap_rnaseq_sample in matched.meta.df with the colnames of matched.count.df?
# YOUR CODE HERE
Add your comment.
Finally, build a DGEList that will be used across this project. Hint: use as.matrix and/or as.data.frame if needed. The first four columns in the count data matrix contain gene information. Think about why we matched the rows of matched.meta.df with the columns of the matched.count.df data matrix.
# YOUR CODE HERE
Should we keep all the genes for downstream analysis? It is often recommended to remove genes that contain very little information. But how do we know which genes can be safely removed or not?
A. Compute gene-level coefficient of variation (CV) values in CPM (1 pt).
- Assign “0” CV values to zero mean/variance genes.
- Hint: cpm(dge) is a matrix (gene by sample).
- Compute the mean and standard deviation for each row (0.5 pt):
# YOUR CODE HERE
- Compute the coefficient of variation of genes (0.5 pt).
# YOUR CODE HERE
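The coefficient of variation is the standard deviation divided by the mean, computed per gene (row). A toy sketch on a made-up 3 x 4 matrix standing in for cpm(dge), including the rule that zero-mean genes get a CV of 0:

```r
## Made-up stand-in for a gene-by-sample CPM matrix
toy.cpm <- rbind(g1 = c(0, 0, 0, 0),   # never expressed
                 g2 = c(1, 2, 3, 4),   # variable
                 g3 = c(5, 5, 5, 5))   # constant

gene.mean <- rowMeans(toy.cpm)
gene.sd <- apply(toy.cpm, 1, sd)

## sd / mean, with 0 assigned where the mean is zero (avoids 0 / 0 = NaN)
gene.cv <- ifelse(gene.mean > 0, gene.sd / gene.mean, 0)
gene.cv
```

Without the guard, g1 would produce NaN, which would then propagate into any later summary or plot.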
Considering that CPM values below one (CPM < 1) are technically missing observations, compute the frequency of missing observations for each gene and investigate its relationship with the log1p-transformed coefficient of variation (CV) and the log1p-transformed mean values graphically.
# YOUR CODE HERE
- Why is it preferable to compare the missing frequency against the gene-level CV rather than the variance (0.5 pt)?
ADD YOUR INTERPRETATION.
On each plot, draw a vertical line at a missing frequency of 75%, and discuss why retaining genes that are observed in at least 25% of the samples could make sense.
# YOUR CODE HERE
Can we come up with a similar decision rule in a data-driven way? Construct a simple feature matrix including only the missing fraction values (gene x 1). Perform kmeans clustering with three clusters and 100 random restarts, and show how the 0.75 cutoff aligns with the clustering results.
- Hint: Colour the points based on the cluster membership.
## Fix the random seed to make sure that we get the same results
set.seed(1)
- Perform kmeans clustering (0.5 pt):
# YOUR CODE HERE
- Demonstrate the clustering results in the same type of plots (0.5 pt).
# YOUR CODE HERE
ADD YOUR THOUGHTS.
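For reference, this is how a one-column kmeans call with random restarts looks on simulated values. The three well-separated groups below are made up purely for illustration; the real missing-fraction distribution will look different.

```r
set.seed(1)

## Simulated missing fractions in three well-separated groups
toy.missing <- c(runif(50, 0.00, 0.10),
                 runif(50, 0.45, 0.55),
                 runif(50, 0.90, 1.00))

## kmeans expects a matrix: one row per gene, one column per feature
km <- kmeans(matrix(toy.missing, ncol = 1), centers = 3, nstart = 100)

## Cross-tabulate cluster membership against the 0.75 cutoff
table(cluster = km$cluster, above.cutoff = toy.missing > 0.75)
sort(km$centers[, 1])
```

The nstart argument reruns the algorithm from different random initial centres and keeps the best solution, which makes the result less sensitive to an unlucky start.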
Remove lowly expressed genes by retaining genes that have a CPM of at least 1 in at least 25% of samples, and report how many genes are left after filtering.
- Apply the filtering step (0.5 pt):
# YOUR CODE HERE
- Report the remaining number of genes (0.5 pt):
# YOUR CODE HERE
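The filtering rule can be checked on a toy matrix first. The sketch below reads the rule as "CPM of at least 1 in at least 25% of samples", which is an assumption about the intended threshold. It also computes plain CPM by hand (counts divided by library size, times one million) as a simplified stand-in for edgeR's cpm function, ignoring normalization factors.

```r
## Made-up counts: 4 genes x 4 samples, with one gene dominating the library
toy.counts <- rbind(g.big  = c(2e6, 2e6, 2e6, 2e6),
                    g.zero = c(0, 0, 0, 0),
                    g.rare = c(1, 0, 0, 0),
                    g.mid  = c(10, 0, 0, 0))
colnames(toy.counts) <- paste0("S", 1:4)

## Plain CPM: divide each column by its library size, then scale
lib.size <- colSums(toy.counts)
toy.cpm <- t(t(toy.counts) / lib.size) * 1e6

## Fraction of samples in which each gene counts as observed
observed.frac <- rowMeans(toy.cpm >= 1)
keep <- observed.frac >= 0.25
keep
sum(keep)  # number of genes retained
```

Here g.rare is dropped even though it has a non-zero count, because a single read in a library of about two million reads stays below 1 CPM.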
The expression values are raw counts.
- Calculate TMM normalization factors (and add them to your DGEList object) (0.5 pt).
# YOUR CODE HERE
- Show the distribution (density, histogram, or both) of library sizes (0.5 pt).
# YOUR CODE HERE
Examine the distribution of gene expression on the scale of log2 CPM on the first 20 samples using box plots (with samples on the x-axis and expression on the y-axis).
- Hint 1: Add a small pseudo count of 1 before taking the log2 transformation to avoid taking log of zero - you can do this by setting log = TRUE and prior.count = 1 in the cpm function.
- Hint 2: To get the data in a format amenable to plotting with ggplot, some data manipulation steps are required. Take a look at the pivot_longer function in the tidyr package to first get the data in tidy format (with one row per gene and sample combination).
# YOUR CODE HERE
Examine the distribution of gene expression in units of log2 CPM across all samples using overlapping density plots (with expression on the x-axis and density on the y-axis; with one line per sample and lines coloured by sample).
# YOUR CODE HERE
Let’s compare the distributions visually while restricting the analysis to specific gene types. We investigate the first 20 samples and partition genes into two groups, protein-coding and non-protein-coding, for simplicity. What is your interpretation?
# YOUR CODE HERE
ADD YOUR THOUGHTS.
How many protein-coding genes are left after filtering? Can you create a table that counts each type of genes after the filtering (0.5 pt)?
# YOUR CODE HERE
Since protein-coding genes constitute a majority of features and are highly expressed, let’s take yet another subset of data by only retaining protein-coding genes (0.5 pt).
# YOUR CODE HERE
Examine the correlation between samples using a heatmap with samples on the x-axis and the y-axis and colour indicating the correlation between each pair of samples. Again, use the log2-transformed CPM values. Display other variables, such as “cohort,” “consensus_sex,” “vitalStatus,” “overallSurvival,” “plateletCount,” and “wbcCount” along with the correlation matrix.
- HINT: You may suppress sample names in pheatmap (show_rownames=F, show_colnames=F).
# YOUR CODE HERE
Show a different set of annotation variables in the same correlation heatmap. Display the observed cell type fractions (variables ending with “in.PB”). HINT: use dplyr::ends_with for your convenience.
# YOUR CODE HERE
Add the estimated cell type scores (variables ending with “like”) to the same heatmap. For better visualization, apply a sigmoid transformation to your annotation matrix. A sigmoid function converts logits into probability-like scores between 0 and 1.
# YOUR CODE HERE
Provide your interpretation of these results.
ADD YOUR THOUGHTS.
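For reference, the sigmoid (logistic) function is sigmoid(x) = 1 / (1 + exp(-x)). A short sketch on made-up scores; the function name sigmoid is our own choice, and base R offers the same transformation as plogis.

```r
## Logistic function: maps any real number into the open interval (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

toy.scores <- c(-3, -1, 0, 1, 3)  # made-up, unbounded scores
transformed <- sigmoid(toy.scores)
round(transformed, 3)

## Base R's plogis computes the same transformation
all.equal(transformed, plogis(toy.scores))
```

Squashing the scores this way keeps a few extreme values from dominating the colour scale of the annotation tracks.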
First, set up a model matrix with cell type estimation scores (variables ending with “like”) as covariates. Then calculate variance weights with voom, and generate the mean-variance trend plot (1 pt). HINT: Filter out samples with “NA” values in your design matrix. Include all the cell type scores in your design matrix.
# YOUR CODE HERE
Use limma (lmFit and eBayes) to fit the linear model with the model matrix you just created.
# YOUR CODE HERE
Print the 10 top-ranked genes by adjusted p-value for the “GMP.like” coefficient using topTable.
# YOUR CODE HERE
Using the linear model defined above, determine the number of genes differentially expressed in different cell types at a FWER (use adjust.method = "holm" in topTable) less than 0.01. Count the number of DEGs for each cell type and print a table.
- HINT: Count how many genes are significantly associated with each coefficient. Use dplyr::count with dplyr::group_by.
# YOUR CODE HERE
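The difference between controlling the FWER and the FDR can be seen on a handful of made-up p-values with base R's p.adjust, whose method names ("holm", "BH") are the same ones topTable accepts in adjust.method:

```r
## Made-up raw p-values for four hypothetical genes
toy.p <- c(0.001, 0.004, 0.03, 0.2)

p.holm <- p.adjust(toy.p, method = "holm")  # controls the family-wise error rate
p.bh <- p.adjust(toy.p, method = "BH")      # controls the false discovery rate

data.frame(p = toy.p, holm = p.holm, BH = p.bh)

n.fwer <- sum(p.holm < 0.01)  # significant at FWER < 0.01
n.fdr <- sum(p.bh < 0.10)     # significant at FDR < 10%
c(FWER = n.fwer, FDR = n.fdr)
```

Holm adjustment is the stricter of the two, so the number of genes passing a FWER cutoff is generally smaller than the number passing an FDR cutoff on the same test statistics.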
As explored in the previous question (Q.4), many genes fluctuate with cell type scores. It is not surprising that many genes are strongly associated with more than one cell type.
- Count how many cell types each gene is significantly associated with (under the same FWER cutoff). Show your results by drawing a histogram (0.5 pt). HINT: use group_by(stable_id).
# YOUR CODE HERE
- Can you further break down the number of genes while distinguishing the sign of logFC (0.5 pt)?
# YOUR CODE HERE
Of the deceased patients (vitalStatus == "Dead"), we may define two patient groups according to their survival time, using the median value as a threshold. More precisely, we have
- good prognosis: a deceased patient with a survival time greater than or equal to the median;
- poor prognosis: a deceased patient with a survival time less than the median.
First, create a new dge by subsetting on the deceased individuals (0.5 pt).
# YOUR CODE HERE
How many samples do we have in each prognosis group (0.5 pt)?
# YOUR CODE HERE
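The median split defined above can be sketched on made-up survival times. The variable name prog and the level names good and poor are assumptions, inferred from the coefficient name Progenitor.like:progpoor that appears later in the assignment.

```r
## Made-up survival times (days) for five deceased patients
toy.surv <- c(100, 200, 300, 400, 500)

## "good" if survival >= median, "poor" otherwise; "good" is the reference level
prog <- factor(ifelse(toy.surv >= median(toy.surv), "good", "poor"),
               levels = c("good", "poor"))
table(prog)

## With "good" as the reference, the design matrix gets a "progpoor" column
design.names <- colnames(model.matrix(~ prog))
design.names
```

Fixing the level order explicitly matters: the first level becomes the baseline, so the fitted coefficient describes poor prognosis relative to good prognosis.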
Set up the model matrix with cell type estimation scores and the new prognosis variable. Then calculate variance weights with voom, and generate the mean-variance trend plot (1 pt). HINT: Filter out samples with “NA” values in your design matrix. Make sure that the prognosis variable is a factor.
# YOUR CODE HERE
Use limma (lmFit and eBayes) to fit the linear model with the model matrix you just created. Print the 10 top-ranked genes by adjusted p-value for the “progpoor” coefficient using topTable (0.5 pt).
# YOUR CODE HERE
How many genes are differentially expressed between the poor and good prognosis groups at FDR cutoff 10% (0.5 pt)?
# YOUR CODE HERE
Build a new model matrix that includes all the marginal cell type scores and the interaction terms between the prognosis and the cell type scores; as usual, do the weight adjustment via voom (1 pt).
# YOUR CODE HERE
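How interaction columns are named in a design matrix can be checked on a toy data frame. The sketch below uses only two score columns and made-up values; the variable name prog with levels good and poor is an assumption consistent with the Progenitor.like:progpoor coefficient mentioned in this assignment.

```r
## Made-up cell type scores and prognosis labels for six samples
toy <- data.frame(
  Progenitor.like = c(0.1, 0.8, 0.3, 0.6, 0.2, 0.9),
  GMP.like        = c(0.5, 0.2, 0.7, 0.1, 0.4, 0.3),
  prog = factor(c("good", "poor", "good", "poor", "good", "poor"),
                levels = c("good", "poor"))
)

## (a + b) * prog expands to a + b + prog + a:prog + b:prog
design <- model.matrix(~ (Progenitor.like + GMP.like) * prog, data = toy)
colnames(design)
```

Each interaction column holds the cell type score for poor-prognosis samples and zero for the others, so its coefficient measures how the score's slope differs between the two groups.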
Use limma (lmFit and eBayes) to fit the linear model with the model matrix you just created. Print the 10 top-ranked genes by adjusted p-value for the Progenitor.like:progpoor coefficient using topTable (0.5 pt).
# YOUR CODE HERE
Report how many interaction terms are significantly associated at FDR 10%. Break down these numbers into cell types (0.5 pt).
# YOUR CODE HERE
Show condition-specific scatter plots for these top genes (x-axis: the cell type scores; y-axis: gene expression). Show all the significant genes and the relevant cell types. HINT: Create a long-form data frame/tibble after subsetting top genes.
# YOUR CODE HERE