From 82573a1fc1e209ff571e2dfb076a1601c5a7d940 Mon Sep 17 00:00:00 2001 From: jianhong Date: Tue, 11 Jul 2023 17:26:06 +0000 Subject: [PATCH] =?UTF-8?q?Deploying=20to=20gh-pages=20from=20@=20jianhong?= =?UTF-8?q?/genomictools@9f29649471b9df83b293b7736d82429c233e9e0d=20?= =?UTF-8?q?=F0=9F=9A=80?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- articles/scripts.html | 249 ++++++++++++++++++++++-------------------- index.html | 4 +- pkgdown.yml | 2 +- 3 files changed, 132 insertions(+), 123 deletions(-) diff --git a/articles/scripts.html b/articles/scripts.html index 811796d..268a881 100644 --- a/articles/scripts.html +++ b/articles/scripts.html @@ -111,7 +111,8 @@

Install docker
which docker
-docker pull jianhong/genomictools:latest
+#docker pull jianhong/genomictools:latest +docker pull ghcr.io/jianhong/genomictools:latest

Change the docker Memory Resources to >= 5G in the Preferences setting page of Docker. And then run the following code in a terminal.

@@ -127,34 +128,39 @@

Install dockerhttp://localhost:8787 with username “rstudio” and password “123456”. All the following steps are running in Rstudio “Terminal” or “Console”.

+

You may want to open the source of “scripts.Rmd” at the “Source +Panes” by

+
+rstudioapi::documentOpen("/usr/local/lib/R/site-library/basicBioinformaticsDRC2023/doc/scripts.Rmd")

Run kallisto and Salmon for RNA-seq

-

The sample files are packaged in basicBioinformaticsRNI2022 package +

The sample files are packaged in basicBioinformaticsDRC2023 package and Docker container.

Now we will download the zebrafish cDNA files from ENSEMBL in order to build the Kallisto and Salmon transcript index. If you are doing rRNA depletion library, please download and merge the cDNA and ncDNA files from ENSEMBL to make the full transcriptome.

-
cd RNAseq
-wget ftp://ftp.ensembl.org/pub/release-105/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz
+
cd RNAseq
+wget ftp://ftp.ensembl.org/pub/release-105/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz

Now we can build the transcriptome index. It will take some time and memory for full dataset. This is the reason why we need to set docker memory to > 5G.

-
kallisto index -i danRer.GRCz11_transcrits.idx Danio_rerio.GRCz11.cdna.all.fa.gz
-
-salmon index -i danRer.GRCz11_transcrits.salmon.idx -t Danio_rerio.GRCz11.cdna.all.fa.gz
+
kallisto index -i danRer.GRCz11_transcrits.idx Danio_rerio.GRCz11.cdna.all.fa.gz
+
+salmon index -i danRer.GRCz11_transcrits.salmon.idx -t Danio_rerio.GRCz11.cdna.all.fa.gz

The data are only a subset of the whole genome. We will only use genes in chromosome 4, 13, 16 and 21 to speed up the test run.

-
mkdir data/RNAseq
-cd data/RNAseq
-wget https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/Danio_rerio.GRCz11.cdna.toy.fa.gz
+
mkdir data/RNAseq
+cd data/RNAseq
+# wget https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/Danio_rerio.GRCz11.cdna.toy.fa.gz
+ln -s /usr/local/lib/R/site-library/basicBioinformaticsDRC2023/extdata/Danio_rerio.GRCz11.cdna.toy.fa.gz ./

It will take several seconds to build the toy index.

-
kallisto index -i danRer.GRCz11_transcrits.toy.idx Danio_rerio.GRCz11.cdna.toy.fa.gz
-
-salmon index -i danRer.GRCz11_transcrits.toy.salmon.idx -t Danio_rerio.GRCz11.cdna.toy.fa.gz
+
kallisto index -i danRer.GRCz11_transcrits.toy.idx Danio_rerio.GRCz11.cdna.toy.fa.gz
+
+salmon index -i danRer.GRCz11_transcrits.toy.salmon.idx -t Danio_rerio.GRCz11.cdna.toy.fa.gz

It’s time for quantifying the FASTQ files against our Kallisto index and Salmon index. We run both of them for comparison. For real data, you can select one of them.

@@ -166,28 +172,28 @@

Run kallisto and Salmon for RNA-seq processors in the bootstrapping. We set the bootstrapping by “-b 30” and “–numBootstraps 30”.

It will take several minutes for the sample data.

-
# the toy RNAseq data is saved in /home/rstudio/RNAseq folder
-# create a link to current folder
-ln -s /home/rstudio/RNAseq/fastq ./fastq
-# mapping
-# Ablated and Uninjured are related with fastq file name
-mkdir -p kallisto_quant
-mkdir -p salmon_quant
-for rep in 1 2
-do
-for cond in Ablated Uninjured
-do
-kallisto quant -i danRer.GRCz11_transcrits.toy.idx \
-               -o kallisto_quant/$cond.rep$rep \
-               -b 30 -t 2 fastq/$cond.rep$rep.fastq.gz \
-               --single -l 200 -s 50
-salmon quant -i danRer.GRCz11_transcrits.toy.salmon.idx -l A \
-             -r fastq/$cond.rep$rep.fastq.gz \
-             --validateMappings -p 2 \
-             -o salmon_quant/$cond.rep$rep \
-             --numBootstraps 30 --seqBias --gcBias
-done
-done
+
# the toy RNAseq data is saved in /home/rstudio/RNAseq folder
+# create a link to current folder
+ln -s /home/rstudio/RNAseq/fastq ./
+# mapping
+# Ablated and Uninjured are related with fastq file name
+mkdir -p kallisto_quant
+mkdir -p salmon_quant
+for rep in 1 2
+do
+for cond in Ablated Uninjured
+do
+kallisto quant -i danRer.GRCz11_transcrits.toy.idx \
+               -o kallisto_quant/$cond.rep$rep \
+               -b 30 -t 2 fastq/$cond.rep$rep.fastq.gz \
+               --single -l 200 -s 50
+salmon quant -i danRer.GRCz11_transcrits.toy.salmon.idx -l A \
+             -r fastq/$cond.rep$rep.fastq.gz \
+             --validateMappings -p 2 \
+             -o salmon_quant/$cond.rep$rep \
+             --numBootstraps 30 --seqBias --gcBias
+done
+done

R scripts for RNAseq @@ -200,7 +206,7 @@

prepare the transcripts to g

We are trying to do gene level DE analysis. We need to aggregate the transcript level counts to gene level. We borrowed from the Sleuth documentation by retrieve ENSEMBL transcript id to gene id.

-
+
 setwd("data/RNAseq")
 library(biomaRt)
 library(dplyr)
@@ -219,8 +225,9 @@ 

prepare the transcripts to g head(t2g, n=3)

If you have trouble in downloading the data from ensembl, try to load the pre-saved object for the toy data.

-
-t2g <- readRDS(url("https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/t2g.rds"))
+
+# t2g <- readRDS(url("https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/t2g.rds"))
+t2g <- readRDS(system.file("extdata", "t2g.rds", package = "basicBioinformaticsDRC2023"))
 head(t2g, n=3)

The following codes are using tximport to import the transcripts counts and aggregate to gene level. Downstream are using DESeq2 to do @@ -229,11 +236,11 @@

prepare the transcripts to g

run tximport + DESeq2 for Salmon results

-
+
 library(tximport)
 library(DESeq2)
 
-setwd("RNAseq/") ## RNAseq is the folder where saved the salmon_quant and kallisto_quant
+setwd("~/data/RNAseq") ## RNAseq is the folder where saved the salmon_quant and kallisto_quant
 (salmon_files <- dir("salmon_quant", "sf$", 
                      recursive = TRUE, 
                      full.names = TRUE))
@@ -252,8 +259,8 @@ 

run tximport + DESeq2 for Salmon

run tximport + DESeq2 for kallisto results

-
-(kallisto_files <- dir("kallisto_quant", "abundance.h5", 
+
+(kallisto_files <- dir("kallisto_quant", "abundance.tsv", 
                        recursive = TRUE, full.names = TRUE))
 
 txi.kallisto <- tximport(kallisto_files, type = "kallisto", 
@@ -273,7 +280,7 @@ 

run Sleuth
+
 #BiocManager::install("pachterlab/sleuth")
 library(sleuth)
 samples <- dir("kallisto_quant")
@@ -305,20 +312,21 @@ 

prepare index file for bwa -
## change you working directory
-mkdir -p /home/rstudio/data/ChIPseq
-ln -s /home/rstudio/ChIPseq/fastq /home/rstudio/data/ChIPseq/
-cd /home/rstudio/data/ChIPseq
-## download the zebrafish GRCz11 genome from ENSEMBLE
-wget ftp://ftp.ensembl.org/pub/release-105/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
-## build the index
-## -p: prefix for all index files
-bwa index -p GRCz11 Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
+
## change you working directory
+mkdir -p $HOME/data/ChIPseq
+ln -s $HOME/ChIPseq/fastq $HOME/data/ChIPseq/
+cd $HOME/data/ChIPseq
+## download the zebrafish GRCz11 genome from ENSEMBLE
+wget ftp://ftp.ensembl.org/pub/release-105/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
+## build the index
+## -p: prefix for all index files
+bwa index -p GRCz11 Danio_rerio.GRCz11.dna.primary_assembly.fa.gz

The data are only a subset of the whole genome. We will only use genes in chromosome 4, 13, 16 and 21 to speed up the test run.

-
wget https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/Danio_rerio.GRCz11.dna.toy.fa.gz
-## the following step will take about 3min
-bwa index -p GRCz11.toy Danio_rerio.GRCz11.dna.toy.fa.gz
+
# wget https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/Danio_rerio.GRCz11.dna.toy.fa.gz
+ln -s /usr/local/lib/R/site-library/basicBioinformaticsDRC2023/extdata/Danio_rerio.GRCz11.dna.toy.fa.gz ./
+## the following step will take about 3min
+bwa index -p GRCz11.toy Danio_rerio.GRCz11.dna.toy.fa.gz

Run fastQC, mapping and MACS2 @@ -330,77 +338,78 @@

Run fastQC, mapping and MACS2user manual. MACS2 will be used to call the peaks.

-
group=(Ablated Ablated Uninjured Uninjured)
-tag=(Ablated.rep1 Ablated.rep2 Uninjured.rep1 Uninjured.rep2)
-species=danRer11
-prefix=bwa
-
-for i in {0..3}
-do
-## fastQC
-mkdir -p fastqc/${group[$i]}
-fastqc -o fastqc/${group[$i]} -t 2 \
-       fastq/${tag[$i]}.fastq.gz
-## trim adapter, need trim_galore be installed, here we do not do this step
-# mkdir -p fastq.trimmed
-# trim_galore -q 15 --fastqc -o fastq.trimmed/${group[$i]} fastq/${tag[$i]}.fastq.gz
-
-## mapping by bwa
-mkdir -p sam
-## -t: number of threads
-## -M: mark shorter split hits as secondary, this is optional for Picard compatibility.
-## >: save alignment to a SAM file
-## 2>: save standard error to log file
-bwa mem -M -t 2 GRCz11.toy \
-           fastq/${tag[$i]}.fastq.gz \
-           > sam/$prefix.$species.${tag[$i]}.sam \
-           2> bwa.$prefix.$species.${tag[$i]}.log.txt
-
-## convert sam file to bam and clean-up
-mkdir -p bam
-## -q: skip alignments with MAPQ samller than 30.
-samtools view -bhS -q 30 sam/$prefix.$species.${tag[$i]}.sam > bam/$prefix.$species.${tag[$i]}.bam
-## sort and index the bam file for quick access.
-samtools sort bam/$prefix.$species.${tag[$i]}.bam -o bam/$prefix.$species.${tag[$i]}.srt.bam
-samtools index bam/$prefix.$species.${tag[$i]}.srt.bam
-## remove un-sorted bam file.
-rm bam/$prefix.$species.${tag[$i]}.bam
-
-## we remove the duplicated by picard::MarkDuplicates. 
-mkdir -p bam/picard
-picard MarkDuplicates \
-       INPUT=bam/$prefix.$species.${tag[$i]}.srt.bam \
-       OUTPUT=bam/$prefix.$species.${tag[$i]}.srt.markDup.bam \
-       METRICS_FILE=bam/picard/$prefix.$species.${tag[$i]}.srt.fil.picard_info.txt \
-       REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT
-samtools index bam/$prefix.$species.${tag[$i]}.srt.markDup.bam
-
-## use deeptools::bamCoverage to generate bigwig files
-## the bw file can be viewed in IGV
-mkdir -p bw
-bamCoverage -b bam/$prefix.$species.${tag[$i]}.srt.markDup.bam -o bw/$prefix.$species.${tag[$i]}.bw --normalizeUsing CPM
-
-## call peaks by macs2
-mkdir -p macs3/${tag[$i]}
-## -g: mappable genome size
-## -q: use minimum FDR 0.05 cutoff to call significant regions.
-## -B: ask MACS3 to output bedGraph files for experiment.
-## --nomodel --extsize 150: the subset data is not big enough (<1000 peak) for
-## macs3 to generate a model. We manually feed one.
-## because we used toy genome, the genome size we set as 10M
-macs3 callpeak -t bam/${prefix}.$species.${tag[$i]}.srt.markDup.bam \
-               -f BAM -g 10e6 -n ${prefix}.$species.${tag[$i]} \
-               --outdir macs3/${tag[$i]} -q 0.05 \
-               -B --nomodel --extsize 150
-
-done
+
group=(Ablated Ablated Uninjured Uninjured)
+tag=(Ablated.rep1 Ablated.rep2 Uninjured.rep1 Uninjured.rep2)
+species=danRer11
+prefix=bwa
+
+for i in {0..3}
+do
+## fastQC
+mkdir -p fastqc/${group[$i]}
+fastqc -o fastqc/${group[$i]} -t 2 \
+       fastq/${tag[$i]}.fastq.gz
+## trim adapter, need trim_galore be installed, here we do not do this step
+# mkdir -p fastq.trimmed
+# trim_galore -q 15 --fastqc -o fastq.trimmed/${group[$i]} fastq/${tag[$i]}.fastq.gz
+
+## mapping by bwa
+mkdir -p sam
+## -t: number of threads
+## -M: mark shorter split hits as secondary, this is optional for Picard compatibility.
+## >: save alignment to a SAM file
+## 2>: save standard error to log file
+bwa mem -M -t 2 GRCz11.toy \
+           fastq/${tag[$i]}.fastq.gz \
+           > sam/$prefix.$species.${tag[$i]}.sam \
+           2> bwa.$prefix.$species.${tag[$i]}.log.txt
+
+## convert sam file to bam and clean-up
+mkdir -p bam
+## -q: skip alignments with MAPQ samller than 30.
+samtools view -bhS -q 30 sam/$prefix.$species.${tag[$i]}.sam > bam/$prefix.$species.${tag[$i]}.bam
+rm sam/$prefix.$species.${tag[$i]}.sam
+## sort and index the bam file for quick access.
+samtools sort bam/$prefix.$species.${tag[$i]}.bam -o bam/$prefix.$species.${tag[$i]}.srt.bam
+samtools index bam/$prefix.$species.${tag[$i]}.srt.bam
+## remove un-sorted bam file.
+rm bam/$prefix.$species.${tag[$i]}.bam
+
+## we remove the duplicated by picard::MarkDuplicates. 
+mkdir -p bam/picard
+picard MarkDuplicates \
+       INPUT=bam/$prefix.$species.${tag[$i]}.srt.bam \
+       OUTPUT=bam/$prefix.$species.${tag[$i]}.srt.markDup.bam \
+       METRICS_FILE=bam/picard/$prefix.$species.${tag[$i]}.srt.fil.picard_info.txt \
+       REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT
+samtools index bam/$prefix.$species.${tag[$i]}.srt.markDup.bam
+
+## use deeptools::bamCoverage to generate bigwig files
+## the bw file can be viewed in IGV
+mkdir -p bw
+bamCoverage -b bam/$prefix.$species.${tag[$i]}.srt.markDup.bam -o bw/$prefix.$species.${tag[$i]}.bw --normalizeUsing CPM
+
+## call peaks by macs2
+mkdir -p macs3/${tag[$i]}
+## -g: mappable genome size
+## -q: use minimum FDR 0.05 cutoff to call significant regions.
+## -B: ask MACS3 to output bedGraph files for experiment.
+## --nomodel --extsize 150: the subset data is not big enough (<1000 peak) for
+## macs3 to generate a model. We manually feed one.
+## because we used toy genome, the genome size we set as 10M
+macs3 callpeak -t bam/${prefix}.$species.${tag[$i]}.srt.markDup.bam \
+               -f BAM -g 10e6 -n ${prefix}.$species.${tag[$i]} \
+               --outdir macs3/${tag[$i]} -q 0.05 \
+               -B --nomodel --extsize 150
+
+done

Differential analysis

We will use DiffBind package to do the differential analysis.

-
+
 setwd("/home/rstudio/data/ChIPseq")
 library("DiffBind")
 (bams <- dir("bam", "markDup.bam$", full.names = TRUE))
diff --git a/index.html b/index.html
index 1ae12e1..e58fa61 100644
--- a/index.html
+++ b/index.html
@@ -99,10 +99,10 @@ 

To use the resulting image:

docker file for genomic tools

Dockerfile to build bwa, kallisto, MACS2, samtools, picard-tools, fastQC, bedtools, cutadapt, deeptools, R, ucsc genome tools images Based on Ubuntu

-
docker run -e PASSWORD=<choose_a_password_for_rstudio> -p 8787:8787 YOURDOCKERIMAGENAME
+
docker run -e PASSWORD=<choose_a_password_for_rstudio> -p 8787:8787 ghcr.io/jianhong/genomictools:latest

Once running, navigate to http://localhost:8787/ and then login with rstudio:yourchosenpassword.

To try with this repository docker image:

-
docker run -e PASSWORD=abc -p 8787:8787 ghcr.io/bioconductor/buildabiocworkshop
+
docker run -e PASSWORD=123456 -p 8787:8787 ghcr.io/jianhong/genomictools:latest

NOTE: Running docker that uses the password in plain text like above exposes the password to others in a multi-user system (like a shared workstation or compute node). In practice, consider using an environment variable instead of plain text to pass along passwords and other secrets in docker command lines.

diff --git a/pkgdown.yml b/pkgdown.yml index ad3a83b..7602f67 100644 --- a/pkgdown.yml +++ b/pkgdown.yml @@ -3,7 +3,7 @@ pkgdown: 2.0.7 pkgdown_sha: ~ articles: scripts: scripts.html -last_built: 2023-07-11T14:43Z +last_built: 2023-07-11T17:25Z urls: reference: https://jianhong.github.io/genoimctools/reference article: https://jianhong.github.io/genoimctools/articles