modify md files

nf-core · Oct 31, 2024 · 73fb0e7 · 73fb0e7
1 parent 6d0ebe5
commit 73fb0e7
Show file tree

Hide file tree

Showing 5 changed files with 126 additions and 121 deletions.
diff --git a/docs/usage/DEanalysis/de_rstudio.md b/docs/usage/DEanalysis/de_rstudio.md
@@ -1,11 +1,8 @@
----
-order: 4
----
-
 # Differential Analysis with DESeq2
 
 In this section of the tutorial, we will guide you through the practical steps necessary to set up the RStudio environment, load the required libraries and data, and execute the DESeq2 analysis. By the end of this section, you will have a fully functional DESeq2 analysis pipeline set up in RStudio, ready to uncover the differentially expressed genes in your dataset.
 
+
 ## Launching the RStudio environment
 
 Once the nf-core/rnaseq pipeline is terminated, the resulting data are stored in the folder `results_star_salmon`. Now, we can analyse the results by running DESeq2 on RStudio. First of all, we need to launch it:
@@ -31,17 +28,18 @@ This command will keep the gitpod session active for exactly 2 hours, providing
 
 Now come back to our **RStudio session**.
 
+
 ## Differential Expression Analysis
 
 As in all analysis, firstly we need to create a new project:
 
-1. Go to the **File** menu and select **New Project**;
+1) Go to the **File** menu and select **New Project**;
 
-2. Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;
+2) Select **New Directory**, **New Project**, name the project as shown below and click on **Create Project**;
 
-![RNAseq](./img/project_R.png)
+![R_project](./img/project_R.png)
 
-3. The new project will be automatically opened in RStudio
+3) The new project will be automatically opened in RStudio
 
 We can check whether we are in the correct working directory with `getwd()` The path `/workspace/gitpod/training/DE_analysis/` should appear on your console.
 To store our results in an organized way, we will create a folder named **de_results** using the **New Folder** button in the bottom right panel. We will save all our resulting tables and plots in this folder. Next, go to the **File menu**, select **New File** and then **R Script** to create a script editor in which we will save all commands required for the analysis. In the editor type:
@@ -52,7 +50,7 @@ To store our results in an organized way, we will create a folder named **de_res
 
 and save the file as **de_script.R**. From now on, each command described in the tutorial can be added to your script. The resulting working directory should look like this:
 
-![RNAseq](./img/workdir_RStudio.png)
+![Workdir](./img/workdir_RStudio.png)
 
 The analysis requires several R packages. To utilise them, we need to load the following libraries:
 
@@ -92,8 +90,7 @@ load("/workspace/gitpod/training/results_star_salmon/star_salmon/deseq2_qc/deseq
 
 In this tutorial we will analyse the `dds` object generated by running the alignment with **STAR** and the quantification with **Salmon**. Alternatively, a user could choose to analyse the the `dds` object generated by running only **Salmon** for both lightweight alignment and quantification. The file is stored in `/workspace/gitpod/training/results_star_salmon/salmon/deseq2_qc/deseq2.dds.RData`.
 
-In DESEq2, the `dds` object is a central data structure that contains the following components:
-
+In DESEq2, the `dds` object is a central data structure that contains the following components: 
 - `countData`: a matrix of raw count data, where each row represents a gene and each column represents a sample;
 - `colData`: a data frame containing information about the samples, such as the experimental design, treatment and other relevant metadata;
 - `design`: a formula specifying the experimental design utilised to estimate the dispersion and the log2foldchange.
@@ -110,7 +107,7 @@ colData(dds) # to check the sample info
 design(dds) # to check the design formula
 ```
 
-The `colData` and the `design` are the ones created within the nfcore/rnaseq pipeline and must be reorganised prior to the analysis. With the following commands we will create our metadata starting from the info stored in the `dds`. We will rename the column of the `colData`, we will ensure that the rownames of the metadata are present in the same order as the column names and finally we will update the `colData` of the `dds` object with our newly created metadata.
+The `colData` and the `design` are the ones created within the nfcore/rnaseq pipeline and must be reorganised prior to the analysis. With the following commands we will create our metadata starting from the info stored in the `dds`. We will rename the column of the `colData`, we will ensure that the rownames of the metadata are present in the same order as the column names and finally we will update the `colData` of the `dds` object with our newly created metadata. 
 
 ```r
 #### Creation of metadata starting from the dds colData ####
@@ -151,7 +148,7 @@ Now that everything is setted up, we can proceed to generate a new DESeq2 object
 
 dds_new  <- DESeqDataSet(dds, design = ~ condition)
 
-# dds inspection
+# dds inspection 
 
 head(counts(dds_new)) # to check the raw counts
 
@@ -162,7 +159,7 @@ design(dds_new) # to check the design formula
 
 Comparing the structure of the newly created dds (`dds_new`) with the one automatically produced by the pipeline (`dds`), we can observe the differences:
 
-![RNAseq](./img/dds_comparison.png)
+![Comparison_dds](./img/dds_comparison.png)
 
 Before running the different steps of the analysis, a good practice consists in pre-filtering the genes to remove those with very low counts. This is useful to improve computional efficiency and enhance interpretability. In general, it is reasonable to keep only genes with a sum counts of at least 10 for a minimal number of 3 samples:
 
@@ -171,15 +168,15 @@ Before running the different steps of the analysis, a good practice consists in
 
 # Select a minimal number of samples = 3
 
-smallestGroupSize <- 3
+smallestGroupSize <- 3 
 
 # Select genes with a sum counts of at least 10 in 3 samples
 
-keep <- rowSums(counts(dds_new) >= 10) >= smallestGroupSize
+keep <- rowSums(counts(dds_new) >= 10) >= smallestGroupSize 
 
 # Keep only the genes that pass the threshold
 
-dds_filtered <- dds_new[keep,]
+dds_filtered <- dds_new[keep,] 
 ```
 
 Now, it is time to run the differential expression analysis with the `DESeq()` function:
@@ -192,8 +189,7 @@ dds_final <- DESeq(dds_filtered)
 
 The `DESeq()` function is a high-level wrapper that simplifies the process of differential expression analysis by combining multiple steps into a single function call.
 
-This makes the workflow more user-friendly and ensures that all necessary preprocessing and statistical steps are executed in the correct order. The key functions that **DESeq2** calls include:
-
+This makes the workflow more user-friendly and ensures that all necessary preprocessing and statistical steps are executed in the correct order. The key functions that **DESeq2** calls include: 
 - **estimateSizeFactors**: to normalise the count data;
 - **estimateDispersion**: to estimate the dispersion;
 - **nbinomWaldTest**: to perform differential expression test.
@@ -221,12 +217,11 @@ rld <- rlog(dds_final, blind = TRUE)
 ```
 
 The `rlog` and the `vst` transformations have an argument, **blind** that can be set to:
-
 - **TRUE** (default): useful for QC analysis because it re-estimates the dispersion, allowing for comparison of samples in an unbiased manner with respect to experimental conditions;
 - **FALSE**: the function utilizes the already estimated dispersion, generally applied when differences in counts are expected to be due to the experimental design.
 
 Next, we perform Principal Component Analysis (PCA) to explore the data. DESeq2 provides a built-in function, `plotPCA()`, which uses [ggplot2](https://ggplot2.tidyverse.org) for visualisation, taking the `rld` (or the `vst`) object as input.
-Since the **treatment** is the principal condition of interest in our metadata, we will use the `condition` information from our metadata to plot the PCA:
+Since the **treatment** is the principal condition of interest in our metadata, we will use the `condition` information from our metadata to plot the PCA: 
 
 ```r
 #### Plot PCA ####
@@ -246,7 +241,7 @@ sampleDists <- dist(t(assay(rld)))  # Calculate pairwise distances between sampl
 
 # Convert distances to a matrix
 
-sampleDistMatrix <- as.matrix(sampleDists)
+sampleDistMatrix <- as.matrix(sampleDists)  
 
 # Set the row and column names of the distance matrix
 
@@ -260,11 +255,11 @@ colors <- colorRampPalette(rev(brewer.pal(9, "Greens")))(255) # function from RC
 
 # Create the heatmap
 
-pheatmap(sampleDistMatrix,
-        clustering_distance_rows = sampleDists,
-        clustering_distance_cols = sampleDists,
-        col = colors,
-        fontsize_col = 8,
+pheatmap(sampleDistMatrix, 
+        clustering_distance_rows = sampleDists, 
+        clustering_distance_cols = sampleDists, 
+        col = colors, 
+        fontsize_col = 8, 
         fontsize_row = 8)
 ```
 
@@ -291,10 +286,10 @@ normalised_counts$gene <- rownames(counts(dds_final))
 
 # Relocate the gene column to the first position
 
-normalised_counts <- normalised_counts %>%
+normalised_counts <- normalised_counts %>% 
   relocate(gene, .before = control_rep1)
 
-# Save normalised counts
+# Save normalised counts 
 
 write.csv(normalised_counts, file = "de_results/normalised_counts.csv")
 ```
@@ -308,7 +303,7 @@ The `results()` function in DESeq2 is used to extract the results of the DE anal
 - **pvalue**: the p-value from the Wald test indicates the probability of observing the measured difference in gene expression (log2 fold change) by chance, assuming no true difference exists (null hypothesis). A low p-value suggests that the observed expression change between samples is unlikely due to random chance, so we can reject the null hypothesis;
 - **padj**: the adjusted p-value, which takes into account multiple testing corrections, using the Benjamini-Hochberg method to control the false discovery rate;
 
-By default, the `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast. A contrast is a specific comparison between two or more levels of a factor, such as the comparison between the treatment and control groups. The order of the contrast names determines the direction of the fold change that is reported in the results. Specifically, the first level of the contrast is the condition of interest and the second level is the reference level.
+By default, the `results()` function returns the results for all genes in the analysis with an adjusted p-value below a specific FDR cutoff, set by default to 0.1. This threshold can be modified with the parameter `alpha`. The `results()` function can also be customised to filter the results based on certain criteria (log2 fold change or padj) or to set a specific contrast. A contrast is a specific comparison between two or more levels of a factor, such as the comparison between the treatment and control groups. The order of the contrast names determines the direction of the fold change that is reported in the results. Specifically, the first level of the contrast is the condition of interest and the second level is the reference level. 
 Notice that in this tutorial the contrast is already correctly specified.
 
 ```r
@@ -318,15 +313,15 @@ res <- results(dds_final)
 
 # Visualise the results
 
-head(res)
+head(res) 
 
-# Summarise the results showing the number of tested genes (genes with non-zero total read count), the genes up- and down-regulated at the selected threshold (alpha) and the number of genes excluded by the multiple testing due to a low mean count
+# Summarise the results showing the number of tested genes (genes with non-zero total read count), the genes up- and down-regulated at the selected threshold (alpha) and the number of genes excluded by the multiple testing due to a low mean count 
 
 summary(res)
 
 # DESeq2 function to extract the name of the contrast
 
-resultsNames(dds_final)
+resultsNames(dds_final) 
 
 # res <- results(dds, contrast = c("design_formula", "condition_of_interest", "reference_condition"))
 # Command to set the contrast, if necessary
@@ -341,37 +336,37 @@ res_viz$gene <- rownames(res)
 
 # Convert the results to a tibble for easier manipulation and relocate the gene column to the first position
 
-res_viz <- as_tibble(res_viz) %>%
+res_viz <- as_tibble(res_viz) %>% 
   relocate(gene, .before = baseMean)
 
-# Save the results table
+# Save the results table 
 
 write.csv(res_viz, file = "de_results/de_result_table.csv")
 ```
 
-In the _Experimental Design_ section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change greater than 1 or lower than -1 as differentially expressed.
+In the *Experimental Design* section, we emphasised the importance of estimating the log2 fold change threshold using a statistical power calculation, rather than selecting it arbitrarily. This approach ensures that the chosen threshold is statistically appropriate and tailored to the specifics of the experiment. However, since we are working with simulated data for demonstration purposes, we will use a padj threshold of 0.05 and consider genes with a log2 fold change greater than 1 or lower than -1 as differentially expressed.
 
 ```r
 #### Extract significant DE genes from the results ####
 
 # Filter the results to include only significantly differentially expressed genes with an adjusted p-value (padj) less than 0.05 and a log2foldchange greater than 1 or less than -1
 
-resSig <- subset(res_viz, padj < 0.05 & abs(log2FoldChange) > 1)
+resSig <- subset(res_viz, padj < 0.05 & abs(log2FoldChange) > 1) 
 
 # Convert the results to a tibble for easier manipulation and relocate the 'Gene' column to be the first column
 
-resSig <- as_tibble(resSig) %>%
-  relocate(gene, .before = baseMean)
+resSig <- as_tibble(resSig) %>% 
+  relocate(gene, .before = baseMean) 
 
 # Order the significant genes by their adjusted p-value (padj) in ascending order
 
-resSig <- resSig[order(resSig$padj),]
+resSig <- resSig[order(resSig$padj),] 
 
 # Display the final results table of significant genes
 
-resSig
+resSig 
 
-# Save the significant DE genes
+# Save the significant DE genes 
 
 write.csv(resSig, file = "de_results/sig_de_genes.csv")
 ```
@@ -385,7 +380,7 @@ Now that we have obtained the results of the differential expression analysis, i
 
 plotMA(res, ylim = c(-2,2))
 ```
-
+ 
 - **counts plot**: it plots the normalised counts for a single gene across the different conditions in your experiment. It’s particularly useful for visualising the expression levels of specific genes of interest and comparing them across sample groups.
 
 ```r
@@ -404,14 +399,14 @@ Notiche that this plot is based on the normalised counts.
 
 significant_genes <- resSig[, 1]
 
-# Extract normalised counts for significant genes from the normalised counts matrix and convert the "gene" column to row names
+# Extract normalised counts for significant genes from the normalised counts matrix and convert the "gene" column to row names 
 
-significant_counts <- inner_join(normalised_counts, significant_genes, by = "gene") %>%
+significant_counts <- inner_join(normalised_counts, significant_genes, by = "gene") %>% 
   column_to_rownames("gene")
 
 # Create the heatmap using pheatmap
 
-pheatmap(significant_counts,
+pheatmap(significant_counts, 
          cluster_rows = TRUE,
          fontsize = 8,
          scale = "row",
@@ -427,48 +422,49 @@ pheatmap(significant_counts,
 
 # Convert the results to a tibble and add a column indicating differential expression status
 
-res_tb <- as_tibble(res) %>%
+res_tb <- as_tibble(res) %>% 
   mutate(diffexpressed = case_when(
-    log2FoldChange > 1 & padj < 0.05 ~ 'upregulated',
+    log2FoldChange > 1 & padj < 0.05 ~ 'upregulated', 
     log2FoldChange < -1 & padj < 0.05 ~ 'downregulated',
     TRUE ~ 'not_de'))
 
 # Add a new column with gene names
 
-res_tb$gene <- rownames(res)
+res_tb$gene <- rownames(res) 
 
 # Relocate the 'gene' column to be the first column
 
-res_tb <-  res_tb %>%
+res_tb <-  res_tb %>% 
   relocate(gene, .before = baseMean)
 
 # Order the table by adjusted p-value (padj) and add a new column for gene labels
 
-res_tb <- res_tb %>% arrange(padj) %>%
+res_tb <- res_tb %>% arrange(padj) %>% 
   mutate(genelabels = "")
-
+  
 # Label the top 5 most significant genes
 
 res_tb$genelabels[1:5] <- res_tb$gene[1:5]
 
 # Create a volcano plot using ggplot2
 
-ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj), col = diffexpressed)) +
-  geom_point(size = 0.6) +
+ggplot(data = res_tb, aes(x = log2FoldChange, y = -log10(padj), col = diffexpressed)) + 
+  geom_point(size = 0.6) + 
   geom_text_repel(aes(label = genelabels), size = 2.5, max.overlaps = Inf) +
-  ggtitle("DE genes treatment versus control") +
+  ggtitle("DE genes treatment versus control") + 
   geom_vline(xintercept = c(-1, 1), col = "black", linetype = 'dashed', linewidth = 0.2) +
   geom_hline(yintercept = -log10(0.05), col = "black", linetype = 'dashed', linewidth = 0.2) +
   theme(plot.title = element_text(size = rel(1.25), hjust = 0.5),
         axis.title = element_text(size = rel(1))) +
-  scale_color_manual(values = c("upregulated" = "red",
-                                "downregulated" = "blue",
+  scale_color_manual(values = c("upregulated" = "red", 
+                                "downregulated" = "blue", 
                                 "not_de" = "grey")) +
   labs(color = 'DE genes') +
   xlim(-3,5)
 
 ```
 
+
 ## Functional analysis
 
 The output of the differential expression analysis is a list of significant DE genes. To uncover the underlying biological mechanisms, various downstream analyses can be performed, such as functional enrichment analysis (identify overrepresented biological processes, molecular functions, cellular components or pathways) and network analysis (group genes based on similar expression patterns to identify novel interactions). To facilitate the interpretation of the resulting list of DE genes, a range of freely available web- and R-based tools can be employed.
@@ -534,7 +530,7 @@ go_enrich <- enrichGO(
 # Create a bar plot of the top enriched GO terms
 
 barplot <- barplot(
-  go_enrich,
+  go_enrich, 
   title = "Enrichment analysis barplot",
   font.size = 8
 )
@@ -550,4 +546,4 @@ dotplot <- dotplot(
 # Combine the bar plot and dot plot into a single plot grid
 
 plot_grid(barplot, dotplot, nrow = 2)
-```
+```