Skip to content

Commit

Permalink
modify md files
Browse files Browse the repository at this point in the history
  • Loading branch information
LorenzoS96 committed Oct 31, 2024
1 parent 6d0ebe5 commit 73fb0e7
Show file tree
Hide file tree
Showing 5 changed files with 126 additions and 121 deletions.
114 changes: 55 additions & 59 deletions docs/usage/DEanalysis/de_rstudio.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,8 @@
---
order: 4
---

# Differential Analysis with DESeq2

In this section of the tutorial, we will guide you through the practical steps necessary to set up the RStudio environment, load the required libraries and data, and execute the DESeq2 analysis. By the end of this section, you will have a fully functional DESeq2 analysis pipeline set up in RStudio, ready to uncover the differentially expressed genes in your dataset.


## Launching the RStudio environment

Once the nf-core/rnaseq pipeline is terminated, the resulting data are stored in the folder `results_star_salmon`. Now, we can analyse the results by running DESeq2 on RStudio. First of all, we need to launch it:
Expand All @@ -31,17 +28,18 @@ This command will keep the gitpod session active for exactly 2 hours, providing

Now come back to our **RStudio session**.


## Differential Expression Analysis

As in all analysis, firstly we need to create a new project:

1. Go to the **File** menu and select **New Project**;
1) Go to the **File** menu and select **New Project**;

2. Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;
2) Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;

![RNAseq](./img/project_R.png)
![R_project](./img/project_R.png)

3. The new project will be automatically opened in RStudio
3) The new project will be automatically opened in RStudio

We can check whether we are in the correct working directory with `getwd()` The path `/workspace/gitpod/training/DE_analysis/` should appear on your console.
To store our results in an organized way, we will create a folder named **de_results** using the **New Folder** button in the bottom right panel. We will save all our resulting tables and plots in this folder. Next, go to the **File menu**, select **New File** and then **R Script** to create a script editor in which we will save all commands required for the analysis. In the editor type:
Expand All @@ -52,7 +50,7 @@ To store our results in an organized way, we will create a folder named **de_res

and save the file as **de_script.R**. From now on, each command described in the tutorial can be added to your script. The resulting working directory should look like this:

![RNAseq](./img/workdir_RStudio.png)
![Workdir](./img/workdir_RStudio.png)

The analysis requires several R packages. To utilise them, we need to load the following libraries:

Expand Down Expand Up @@ -92,8 +90,7 @@ load("/workspace/gitpod/training/results_star_salmon/star_salmon/deseq2_qc/deseq

In this tutorial we will analyse the `dds` object generated by running the alignment with **STAR** and the quantification with **Salmon**. Alternatively, a user could choose to analyse the the `dds` object generated by running only **Salmon** for both lightweight alignment and quantification. The file is stored in `/workspace/gitpod/training/results_star_salmon/salmon/deseq2_qc/deseq2.dds.RData`.

In DESEq2, the `dds` object is a central data structure that contains the following components:

In DESEq2, the `dds` object is a central data structure that contains the following components:
- `countData`: a matrix of raw count data, where each row represents a gene and each column represents a sample;
- `colData`: a data frame containing information about the samples, such as the experimental design, treatment and other relevant metadata;
- `design`: a formula specifying the experimental design utilised to estimate the dispersion and the log2foldchange.
Expand All @@ -110,7 +107,7 @@ colData(dds) # to check the sample info
design(dds) # to check the design formula
```

The `colData` and the `design` are the ones created within the nfcore/rnaseq pipeline and must be reorganised prior to the analysis. With the following commands we will create our metadata starting from the info stored in the `dds`. We will rename the column of the `colData`, we will ensure that the rownames of the metadata are present in the same order as the column names and finally we will update the `colData` of the `dds` object with our newly created metadata.
The `colData` and the `design` are the ones created within the nfcore/rnaseq pipeline and must be reorganised prior to the analysis. With the following commands we will create our metadata starting from the info stored in the `dds`. We will rename the column of the `colData`, we will ensure that the rownames of the metadata are present in the same order as the column names and finally we will update the `colData` of the `dds` object with our newly created metadata.

```r
#### Creation of metadata starting from the dds colData ####
Expand Down Expand Up @@ -151,7 +148,7 @@ Now that everything is setted up, we can proceed to generate a new DESeq2 object

dds_new <- DESeqDataSet(dds, design = ~ condition)

# dds inspection
# dds inspection

head(counts(dds_new)) # to check the raw counts

Expand All @@ -162,7 +159,7 @@ design(dds_new) # to check the design formula

Comparing the structure of the newly created dds (`dds_new`) with the one automatically produced by the pipeline (`dds`), we can observe the differences:

![RNAseq](./img/dds_comparison.png)
![Comparison_dds](./img/dds_comparison.png)

Before running the different steps of the analysis, a good practice consists in pre-filtering the genes to remove those with very low counts. This is useful to improve computional efficiency and enhance interpretability. In general, it is reasonable to keep only genes with a sum counts of at least 10 for a minimal number of 3 samples:

Expand All @@ -171,15 +168,15 @@ Before running the different steps of the analysis, a good practice consists in

# Select a minimal number of samples = 3

smallestGroupSize <- 3
smallestGroupSize <- 3

# Select genes with a sum counts of at least 10 in 3 samples

keep <- rowSums(counts(dds_new) >= 10) >= smallestGroupSize
keep <- rowSums(counts(dds_new) >= 10) >= smallestGroupSize

# Keep only the genes that pass the threshold

dds_filtered <- dds_new[keep,]
dds_filtered <- dds_new[keep,]
```

Now, it is time to run the differential expression analysis with the `DESeq()` function:
Expand All @@ -192,8 +189,7 @@ dds_final <- DESeq(dds_filtered)

The `DESeq()` function is a high-level wrapper that simplifies the process of differential expression analysis by combining multiple steps into a single function call.

This makes the workflow more user-friendly and ensures that all necessary preprocessing and statistical steps are executed in the correct order. The key functions that **DESeq2** calls include:

This makes the workflow more user-friendly and ensures that all necessary preprocessing and statistical steps are executed in the correct order. The key functions that **DESeq2** calls include:
- **estimateSizeFactors**: to normalise the count data;
- **estimateDispersion**: to estimate the dispersion;
- **nbinomWaldTest**: to perform differential expression test.
Expand Down Expand Up @@ -221,12 +217,11 @@ rld <- rlog(dds_final, blind = TRUE)
```

The `rlog` and the `vst` transformations have an argument, **blind** that can be set to:

- **TRUE** (default): useful for QC analysis because it re-estimates the dispersion, allowing for comparison of samples in an unbiased manner with respect to experimental conditions;
- **FALSE**: the function utilizes the already estimated dispersion, generally applied when differences in counts are expected to be due to the experimental design.

Next, we perform Principal Component Analysis (PCA) to explore the data. DESeq2 provides a built-in function, `plotPCA()`, which uses [ggplot2](https://ggplot2.tidyverse.org) for visualisation, taking the `rld` (or the `vst`) object as input.
Since the **treatment** is the principal condition of interest in our metadata, we will use the `condition` information from our metadata to plot the PCA:
Since the **treatment** is the principal condition of interest in our metadata, we will use the `condition` information from our metadata to plot the PCA:

```r
#### Plot PCA ####
Expand All @@ -246,7 +241,7 @@ sampleDists <- dist(t(assay(rld))) # Calculate pairwise distances between sampl

# Convert distances to a matrix

sampleDistMatrix <- as.matrix(sampleDists)
sampleDistMatrix <- as.matrix(sampleDists)

# Set the row and column names of the distance matrix

Expand All @@ -260,11 +255,11 @@ colors <- colorRampPalette(rev(brewer.pal(9, "Greens")))(255) # function from RC

# Create the heatmap

pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors,
fontsize_col = 8,
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors,
fontsize_col = 8,
fontsize_row = 8)
```

Expand All @@ -291,10 +286,10 @@ normalised_counts$gene <- rownames(counts(dds_final))

# Relocate the gene column to the first position

normalised_counts <- normalised_counts %>%
normalised_counts <- normalised_counts %>%
relocate(gene, .before = control_rep1)

# Save normalised counts
# Save normalised counts

write.csv(normalised_counts, file = "de_results/normalised_counts.csv")
```
Expand All @@ -308,7 +303,7 @@ The `results()` function in DESeq2 is used to extract the results of the DE anal
- **pvalue**: the p-value from the Wald test indicates the probability of observing the measured difference in gene expression (log2 fold change) by chance, assuming no true difference exists (null hypothesis). A low p-value suggests that the observed expression change between samples is unlikely due to random chance, so we can reject the null hypothesis;
- **padj**: the adjusted p-value, which takes into account multiple testing corrections, using the Benjamini-Hochberg method to control the false discovery rate;

By default, the `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast. A contrast is a specific comparison between two or more levels of a factor, such as the comparison between the treatment and control groups. The order of the contrast names determines the direction of the fold change that is reported in the results. Specifically, the first level of the contrast is the condition of interest and the second level is the reference level.
By default, the `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast. A contrast is a specific comparison between two or more levels of a factor, such as the comparison between the treatment and control groups. The order of the contrast names determines the direction of the fold change that is reported in the results. Specifically, the first level of the contrast is the condition of interest and the second level is the reference level.
Notice that in this tutorial the contrast is already correctly specified.

```r
Expand All @@ -318,15 +313,15 @@ res <- results(dds_final)

# Visualise the results

head(res)
head(res)

# Summarise the results showing the number of tested genes (genes with non-zero total read count), the genes up- and down-regulated at the selected threshold (alpha) and the number of genes excluded by the multiple testing due to a low mean count
# Summarise the results showing the number of tested genes (genes with non-zero total read count), the genes up- and down-regulated at the selected threshold (alpha) and the number of genes excluded by the multiple testing due to a low mean count

summary(res)

# DESeq2 function to extract the name of the contrast

resultsNames(dds_final)
resultsNames(dds_final)

# res <- results(dds, contrast = c("design_formula", "condition_of_interest", "reference_condition"))
# Command to set the contrast, if necessary
Expand All @@ -341,37 +336,37 @@ res_viz$gene <- rownames(res)

# Convert the results to a tibble for easier manipulation and relocate the gene column to the first position

res_viz <- as_tibble(res_viz) %>%
res_viz <- as_tibble(res_viz) %>%
relocate(gene, .before = baseMean)

# Save the results table
# Save the results table

write.csv(res_viz, file = "de_results/de_result_table.csv")
```

In the _Experimental Design_ section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change greater than 1 or lower than -1 as differentially expressed.
In the *Experimental Design* section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change greater than 1 or lower than -1 as differentially expressed.

```r
#### Extract significant DE genes from the results ####

# Filter the results to include only significantly differentially expressed genes with an adjusted p-value (padj) less than 0.05 and a log2foldchange greater than 1 or less than -1

resSig <- subset(res_viz, padj < 0.05 & abs(log2FoldChange) > 1)
resSig <- subset(res_viz, padj < 0.05 & abs(log2FoldChange) > 1)

# Convert the results to a tibble for easier manipulation and relocate the 'Gene' column to be the first column

resSig <- as_tibble(resSig) %>%
relocate(gene, .before = baseMean)
resSig <- as_tibble(resSig) %>%
relocate(gene, .before = baseMean)

# Order the significant genes by their adjusted p-value (padj) in ascending order

resSig <- resSig[order(resSig$padj),]
resSig <- resSig[order(resSig$padj),]

# Display the final results table of significant genes

resSig
resSig

# Save the significant DE genes
# Save the significant DE genes

write.csv(resSig, file = "de_results/sig_de_genes.csv")
```
Expand All @@ -385,7 +380,7 @@ Now that we have obtained the results of the differential expression analysis, i

plotMA(res, ylim = c(-2,2))
```

- **counts plot**: it plots the normalised counts for a single gene across the different conditions in your experiment. It’s particularly useful for visualising the expression levels of specific genes of interest and comparing them across sample groups.

```r
Expand All @@ -404,14 +399,14 @@ Notiche that this plot is based on the normalised counts.

significant_genes <- resSig[, 1]

# Extract normalised counts for significant genes from the normalised counts matrix and convert the "gene" column to row names
# Extract normalised counts for significant genes from the normalised counts matrix and convert the "gene" column to row names

significant_counts <- inner_join(normalised_counts, significant_genes, by = "gene") %>%
significant_counts <- inner_join(normalised_counts, significant_genes, by = "gene") %>%
column_to_rownames("gene")

# Create the heatmap using pheatmap

pheatmap(significant_counts,
pheatmap(significant_counts,
cluster_rows = TRUE,
fontsize = 8,
scale = "row",
Expand All @@ -427,48 +422,49 @@ pheatmap(significant_counts,

# Convert the results to a tibble and add a column indicating differential expression status

res_tb <- as_tibble(res) %>%
res_tb <- as_tibble(res) %>%
mutate(diffexpressed = case_when(
log2FoldChange > 1 & padj < 0.05 ~ 'upregulated',
log2FoldChange > 1 & padj < 0.05 ~ 'upregulated',
log2FoldChange < -1 & padj < 0.05 ~ 'downregulated',
TRUE ~ 'not_de'))

# Add a new column with gene names

res_tb$gene <- rownames(res)
res_tb$gene <- rownames(res)

# Relocate the 'gene' column to be the first column

res_tb <- res_tb %>%
res_tb <- res_tb %>%
relocate(gene, .before = baseMean)

# Order the table by adjusted p-value (padj) and add a new column for gene labels

res_tb <- res_tb %>% arrange(padj) %>%
res_tb <- res_tb %>% arrange(padj) %>%
mutate(genelabels = "")

# Label the top 5 most significant genes

res_tb$genelabels[1:5] <- res_tb$gene[1:5]

# Create a volcano plot using ggplot2

ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj), col = diffexpressed)) +
geom_point(size = 0.6) +
ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj), col = diffexpressed)) +
geom_point(size = 0.6) +
geom_text_repel(aes(label = genelabels), size = 2.5, max.overlaps = Inf) +
ggtitle("DE genes treatment versus control") +
ggtitle("DE genes treatment versus control") +
geom_vline(xintercept = c(-1, 1), col = "black", linetype = 'dashed', linewidth = 0.2) +
geom_hline(yintercept = -log10(0.05), col = "black", linetype = 'dashed', linewidth = 0.2) +
theme(plot.title = element_text(size = rel(1.25), hjust = 0.5),
axis.title = element_text(size = rel(1))) +
scale_color_manual(values = c("upregulated" = "red",
"downregulated" = "blue",
scale_color_manual(values = c("upregulated" = "red",
"downregulated" = "blue",
"not_de" = "grey")) +
labs(color = 'DE genes') +
xlim(-3,5)

```


## Functional analysis

The output of the differential expression analysis is a list of significant DE genes. To uncover the underlying biological mechanisms, various downstream analyses can be performed, such as functional enrichment analysis (identify overrepresented biological processes, molecular functions, cellular components or pathways) and network analysis (group genes based on similar expression patterns to identify novel interactions). To facilitate the interpretation of the resulting list of DE genes, a range of freely available web- and R-based tools can be employed.
Expand Down Expand Up @@ -534,7 +530,7 @@ go_enrich <- enrichGO(
# Create a bar plot of the top enriched GO terms

barplot <- barplot(
go_enrich,
go_enrich,
title = "Enrichment analysis barplot",
font.size = 8
)
Expand All @@ -550,4 +546,4 @@ dotplot <- dotplot(
# Combine the bar plot and dot plot into a single plot grid

plot_grid(barplot, dotplot, nrow = 2)
```
```
Loading

0 comments on commit 73fb0e7

Please sign in to comment.