Quality control and pre-processing pipeline for RNA-Seq data
RNA-Seq has become a prominent sequencing technology, and the analysis of the raw data it produces is now a major focus in bioinformatics. This pipeline presents the main steps for building a gene expression matrix from raw RNA-Seq data.
The pipeline covers the following steps:
- quality control
- trimming
- transcript quantification
- annotation of transcripts
- normalization
- batch effect removal
Make sure you have installed all the tools the pipeline needs to run:
Tools: FastQC, MultiQC, Trimmomatic, Salmon, Kallisto, R
R packages: tximport, tximeta, GenomicFeatures, ensembldb, SummarizedExperiment, readxl, AnnotationHub, stringr, edgeR, sva, magrittr
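Before running anything, it can help to confirm that the command-line tools are reachable. The sketch below is not part of the repository; it defines a hypothetical helper, `missing_tools`, that prints every name it cannot find on `PATH`. The executable names are assumptions (for example, Trimmomatic is only available as a `trimmomatic` command when installed through a package manager that provides a wrapper).

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed tool not found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the name cannot be resolved
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Executable names are assumptions; adjust them to match your installation.
missing_tools fastqc multiqc trimmomatic salmon kallisto Rscript
```

An empty output means every listed tool was found. R packages are not covered by this check.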
To simplify installation, we provide the installTools.sh script, which contains the commands for installing each tool.
Below is a quick start for the pipeline. For full details, see the complete pipeline manual.
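cd ~
wget https://github.com/resendejss/PreProcSEQ/archive/refs/heads/main.zip
unzip main.zip
# assumes the scripts sit at the top level of the extracted PreProcSEQ-main directory
cd PreProcSEQ-main
./installTools.sh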
First, check the quality of each FASTQ file. The input files are in the 0-samples directory.
./qualityControl_beforeTrimming.sh
FastQC results are saved to 1-qualityControl_beforeTrimming/outputFastQC and MultiQC results are saved to 1-qualityControl_beforeTrimming/outputMultiQC.
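For readers who want to see what a quality control step of this kind typically involves, here is a minimal dry-run sketch. It is not the content of qualityControl_beforeTrimming.sh; the function name `qc_commands` and the `.fastq.gz` extension are assumptions. It only prints the commands, so it can be inspected without FastQC or MultiQC installed.

```shell
#!/bin/sh
# Hypothetical helper: print (without running) FastQC and MultiQC commands.
# $1: directory holding FASTQ files, $2: base directory for the reports.
qc_commands() {
    in_dir=$1
    out_dir=$2
    # FastQC requires its output directory to exist already
    printf 'mkdir -p %s %s\n' "$out_dir/outputFastQC" "$out_dir/outputMultiQC"
    for fq in "$in_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip the literal pattern when nothing matches
        printf 'fastqc -o %s %s\n' "$out_dir/outputFastQC" "$fq"
    done
    # MultiQC scans the FastQC reports and writes one combined summary
    printf 'multiqc %s -o %s\n' "$out_dir/outputFastQC" "$out_dir/outputMultiQC"
}

qc_commands 0-samples 1-qualityControl_beforeTrimming
```

Piping the output to `sh` would execute the printed commands; reviewing them first is the point of the dry run.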
./trimming_trimmomatic.sh
The files produced by Trimmomatic are in 2-trimming/trimmomatic/paired and 2-trimming/trimmomatic/unpaired. The paired directory holds the trimmed reads for which both mates of a pair survived after low-quality bases were removed. The unpaired directory holds the reads that survived trimming but whose mate was discarded.
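The paired/unpaired split comes from how Trimmomatic's paired-end mode takes two inputs and writes four outputs. The sketch below prints one such command for a single sample. It is an illustration, not the content of trimming_trimmomatic.sh: the function name `trim_command`, the `_R1`/`_R2` file naming, the `trimmomatic` wrapper command, and the trimming settings (taken from the example in the Trimmomatic manual) are all assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print a Trimmomatic paired-end command for one sample.
# $1: sample name, $2: input dir, $3: output dir, $4: adapter FASTA file.
trim_command() {
    s=$1; src=$2; dst=$3; adapters=$4
    # Argument order: two inputs, then paired/unpaired outputs for R1,
    # then paired/unpaired outputs for R2, then the trimming steps.
    printf 'trimmomatic PE -phred33 %s %s %s %s %s %s %s\n' \
        "$src/${s}_R1.fastq.gz" "$src/${s}_R2.fastq.gz" \
        "$dst/paired/${s}_R1.fastq.gz" "$dst/unpaired/${s}_R1.fastq.gz" \
        "$dst/paired/${s}_R2.fastq.gz" "$dst/unpaired/${s}_R2.fastq.gz" \
        "ILLUMINACLIP:$adapters:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
}

# sampleA and TruSeq3-PE.fa are placeholder names
trim_command sampleA 0-samples 2-trimming/trimmomatic TruSeq3-PE.fa
```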
./qualityControl_afterTrimming.sh
FastQC results are in PreProcSEQ-main/3-qualityControl_afterTrimming/outputFastqc and MultiQC results are in PreProcSEQ-main/3-qualityControl_afterTrimming/outputMultiqc.
There are two quantification tool options: Salmon and Kallisto.
# Salmon: index construction
./salmon_index.sh
# Salmon: quantification
./salmon_quant.sh
# Kallisto: index construction
./kallisto_index.sh
# Kallisto: quantification
./kallisto_quant.sh
Salmon results will be in 4-quantification/salmon/quant_salmon. Kallisto results will be in 4-quantification/kallisto/quant_kallisto.
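Both tools follow the same two-stage pattern: build an index from a transcriptome FASTA, then quantify each pair of read files against it. The dry-run sketch below prints the underlying commands for either tool. It is an illustration under assumptions, not the content of the repository scripts: the function name `quant_commands`, the index file names, and the choice of flags (`-l A` asks Salmon to infer the library type) are mine.

```shell
#!/bin/sh
# Hypothetical helper: print index and quantification commands for one tool.
# $1: salmon|kallisto, $2: transcriptome FASTA, $3/$4: read files, $5: output dir.
quant_commands() {
    tool=$1; tx=$2; r1=$3; r2=$4; out=$5
    case $tool in
        salmon)
            printf 'salmon index -t %s -i %s\n' "$tx" "$out/salmon_index"
            printf 'salmon quant -i %s -l A -1 %s -2 %s -o %s\n' \
                "$out/salmon_index" "$r1" "$r2" "$out/quant_salmon"
            ;;
        kallisto)
            printf 'kallisto index -i %s %s\n' "$out/kallisto.idx" "$tx"
            printf 'kallisto quant -i %s -o %s %s %s\n' \
                "$out/kallisto.idx" "$out/quant_kallisto" "$r1" "$r2"
            ;;
        *)
            printf 'unknown tool: %s\n' "$tool" >&2
            return 1
            ;;
    esac
}
```

For example, `quant_commands salmon transcripts.fa a_R1.fastq.gz a_R2.fastq.gz 4-quantification/salmon` prints the two Salmon commands, with `transcripts.fa` and the read file names as placeholders.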
To build the expression matrix with tximeta (Salmon output), run the R script from the terminal:
Rscript matrixConstruction_tximeta_salmon.R
To build the expression matrix with tximport, run the script that matches your quantification tool:
# salmon output
Rscript matrixConstruction_tximport_salmon.R
# kallisto output
Rscript matrixConstruction_tximport_kallisto.R
The matrices will be in 5-expressionMatrix.
To annotate the transcripts, run the script that matches your expression matrix:
# matrix_kallisto_tximport
Rscript annotaionTranscripts_kallisto_matrixTximport.R
# matrix_salmon_tximport
Rscript annotationTranscript_salmon_matrixTximport.R
# matrix_salmon_tximeta
Rscript annotationTranscripts_salmon_tximeta.R
Rscript normalizationTMM.R
The results will be in 7-normalizationCounts/tmm
# counts
Rscript batchEffectRemoval_counts.R
# normalized data
Rscript batchEffectRemoval_TMM.R
The results will be in 8-batchEffect_removal.
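Because each R step reads the output of the previous one, it can be useful to chain them so that a failure stops the run instead of feeding incomplete files downstream. The sketch below is not part of the repository; `run_steps` is a hypothetical helper, and the commented usage assumes the R scripts are in the current directory.

```shell
#!/bin/sh
# Hypothetical helper: run each command in order and stop at the first failure.
# Each argument is one command line, e.g. "Rscript normalizationTMM.R".
run_steps() {
    for step in "$@"; do
        printf 'running: %s\n' "$step"
        # Word splitting on $step is intentional: "program arg ..." per argument
        $step || { printf 'failed: %s\n' "$step" >&2; return 1; }
    done
}

# Example (Salmon + tximport path), left commented so nothing runs by accident:
# run_steps "Rscript matrixConstruction_tximport_salmon.R" \
#           "Rscript annotationTranscript_salmon_matrixTximport.R" \
#           "Rscript normalizationTMM.R" \
#           "Rscript batchEffectRemoval_TMM.R"
```

Passing each step as a single string keeps the helper short, at the cost of not supporting arguments that contain spaces.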