A tool to identify exonization of retrotransposable elements using RNA-seq data.
Report Bug
Fredy is a user-friendly pipeline designed to identify, quantify, and analyze chimeric transcripts from RNA-Seq data. The pipeline utilizes well-established tools such as StringTie2 for transcriptome assembly and quantification. In addition, machine learning algorithms provided by RNASamba are used to predict whether a transcript is coding. To further enhance the analysis, Fredy also incorporates HMMER and Python3 scripts to compare protein domains and identify potential alterations. With these tools, Fredy provides a comprehensive approach to chimeric transcript analysis that is both efficient and effective.
The source code for FREDY can be obtained in our github page using the following command:
git clone https://github.com/galantelab/fredy.git
Inside FREDY’s directory, build a docker image:
cd fredy
sudo docker build -f Dockerfile -t fredy .
You can acquire a built docker image from Docker Hub registry:
sudo docker pull galantelab/fredy
We provide all the necessary databases to run FREDY, catering to human functionality. In our comprehensive documentation available in the supplementary material, we offer a step-by-step guide to generating these exact files for other species.
File | Description |
---|---|
model.hdf5 | RNASamba model (Works for mammals in general) |
Pfam-A.hmm | HMMER model (Works to mammals in general) you need to download all files with .hmm |
Pfam-A.hmm.h3f | HMMER model |
Pfam-A.hmm.h3i | HMMER model |
Pfam-A.hmm.h3m | HMMER model |
Pfam-A.hmm.h3p | HMMER model |
File | Description |
---|---|
human_star_index.tar.gz | Folder with STAR Index built with hg38.fa and gencode v36 |
human_gv36.gtf | GTF file (Gencode V36 Used in TCGA) |
human.fa | Reference Genome (hg38) |
human.fa.fai | Index of reference genome |
human.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
chimp_star_index.tar.gz | Folder with STAR Index |
chimp_pantro6.gtf | GTF file |
chimp.fa | Reference Genome |
chimp.fa.fai | Index of reference genome |
chimp.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
cow_star_index.tar.gz | Folder with STAR Index |
cow_bostau9.gtf | GTF file |
cow.fa | Reference Genome |
cow.fa.fai | Index of reference genome |
cow.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
dog_star_index.tar.gz | Folder with STAR Index |
dog_cfam1.gtf | GTF file |
dog.fa | Reference Genome |
dog.fa.fai | Index of reference genome |
dog.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
marmoset_star_index.tar.gz | Folder with STAR Index |
marmoset_cj1700.gtf | GTF file |
marmoset.fa | Reference Genome |
marmoset.fa.fai | Index of reference genome |
marmoset.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
mouse_star_index.tar.gz | Folder with STAR Index |
mouse_GRCm39.gtf | GTF file |
mouse.fa | Reference Genome |
mouse.fa.fai | Index of reference genome |
mouse.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
opossum_star_index.tar.gz | Folder with STAR Index |
opossum_mondom5.gtf | GTF file |
opossum.fa | Reference Genome |
opossum.fa.fai | Index of reference genome |
opossum.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
platypus_star_index.tar.gz | Folder with STAR Index |
platypus_mornana1.gtf | GTF file |
platypus.fa | Reference Genome |
platypus.fa.fai | Index of reference genome |
platypus.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
rat_star_index.tar.gz | Folder with STAR Index |
rat_rn6.gtff | GTF file |
rat.fa | Reference Genome |
rat.fa.fai | Index of reference genome |
rat.pep.fa | Aminoacid sequences of proteins |
File | Description |
---|---|
rhesus_star_index.tar.gz | Folder with STAR Index |
rhesus_rhemac10.gtf | GTF file |
rhesus.fa | Reference Genome |
rhesus.fa.fai | Index of reference genome |
rhesus.pep.fa | Aminoacid sequences of proteins |
To use ${specie}_star_index.tar.gz
you should uncompress the folder:
tar -xvf ${specie}_star_index.tar.gz
FREDY has seven subcommands: “star”, “string”, “chimeric”, “coding”, “pfam”, “expression” and “results”.
fredy [subcommand] <options>
Subcommands may be invoked by the help menu:
fredy help
Subcommand | Description |
---|---|
star | Aligns RNA-seq data against the genome using STAR (DOI: 10.1093/bioinformatics/bts635) |
string | Assembles sequenced reads (compatible with both short and long reads) using StringTie2 (DOI: 10.1186/s13059-019-1910-1) |
chimeric | Identifies potential chimeric transcripts |
coding | Computes the coding potential of (chimeric) transcripts using RNASamba (DOI: https://doi.org/10.1093/nargab/lqz024) |
pfam | Searches for protein domains using HMMer (DOI: 10.1093/nar/gkr367) and PFAM protein families and domains (https://doi.org/10.1093/nar/gkaa913) |
expression | Estimates transcript expression using StringTie2 (DOI: 10.1186/s13059-019-1910-1) |
results | Compiles the final results of chimeric transcripts incorporating inputs from previous steps |
The first step in the FREDY’s pipeline is the “star”. The inputs to this command are FASTQ files and a STAR index (pre-made available here). The output is a sorted and filtered BAM aligned file, which will become the input to the next step. This command supports all types of RNA-Seq data (paired-end, single-end and long-reads), either compressed (as .gz) or not.
OPTIONS:
Short | Long | Description |
---|---|---|
-o | --output-dir | Output directory. Creates the directory if it does not exist [MANDATORY] |
-i | --index-dir | STAR index directory [MANDATORY] |
-f | --file | File containing a newline separated list of sequencing files in FASTQ format. This option is not mandatory if one or more FASTQ files are passed as argument [MANDATORY] |
-h | --help | Prints help message |
-t | --threads | Number of threads [default: 8] |
-S | --short-reads | Set the sequencing to short reads [default] |
-L | --long-reads | Set the sequencing to long reads |
-s | --single-end | For short reads '-S', set the type of sequencing to single-end |
-p | --paired-end | For short reads '-S', set the type of sequencing to paired-end. In this case, the FASTQ files will be processed, being considered forward (R1) and reverse complement (R2) according to the order in which they are passed [default] |
Example
docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <star_index-path>:/home/fredy/star_index/ -v <fastq-path>:/home/fredy/input/ -v <output-path>:/home/fredy/output/ fredy star -o test -i /home/fredy/star_index/ -f /home/fredy/input/<fastq-path>
Where:
<star_index-path>
is the directory where star_index was downloaded. Ex.: $PWD/star_index/
<fastq-file-path>
is the directory where all FASTQ files are. Ex.: if $PWD/*.fastq.gz
type $PWD/
<output-path>
is the output directory. Ex.: $PWD/output/
<fastq-path>
is a .txt file inside /home/fredy/input/
with the docker path (for instance /home/fredy/input/test.fastq.gz
) to the FASTQ files.
The next step in the pipeline is “string”. This subcommand performs a transcriptome assembly with the BAMs generated in the previous step (or custom BAMs provided by the user). The output of this analysis is a GTF file representing the transcriptome from all samples.
OPTIONS:
Short | Long | Description |
---|---|---|
-o | --output-dir | Output directory. Creates the directory if it does not exist [MANDATORY] |
-a | --annotation | Gene annotation of the reference transcriptome in GTF format [MANDATORY] |
-f | --file | File containing a newline separated list of sequencing files in FASTQ format. This option is not mandatory if one or more FASTQ files are passed as argument [MANDATORY] |
-h | --help | Prints help message |
-t | --threads | Number of threads [default: 8] |
-S | --short-reads | Set the sequencing to short reads [default] |
-L | --long-reads | Set the sequencing to long reads |
Example
docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <gtf-file-path>:/home/fredy/gtf/ -v <output-path>:/home/fredy/output/ fredy string -o test -a /home/fredy/gtf/<gtf-file>`
Where:
<gtf-file-path>
is the directory where gtf was downloaded. Ex.: if $PWD/human_gv36.gtf
type $PWD/
<output-path>
is the output directory. Ex.: $PWD/output/
<gtf-file>
is a GTF inside /home/fredy/gtf/. Ex.: /home/fredy/gtf/human_gv36.gtf
In the “chimeric” step, the pipeline identifies novel transcripts based on the GTF file generated from the “string” subcommand. Here, FREDY uses a list of events provided by the user to find transcripts containing overlaps between exons and the given events. Again, a GTF file and also a FASTA file with all transcripts found are the outputs provided.
OPTIONS:
Short | Long | Description |
---|---|---|
-o | --output-dir | Output directory. Creates the directory if it does not exist [MANDATORY] |
-a | --annotation | Gene annotation of the reference transcriptome in GTF format [MANDATORY] |
-g | --genome | FASTA file of the reference genome, which is the same one file used for reads alignment using STAR [MANDATORY] |
-e | --stringtie-out | ME fixed events in BED4 format [MANDATORY] |
-h | --help | Prints help message |
-T | --tmp-dir | Custom directory for temporary files [default: /tmp] |
-r | --reciprocal | Criteria for identification of chimeric events is at least 50% overlap of the event with the exon and at least 50% overlap of the exon with the event |
-R | --irreciprocal | Criteria for identification of chimeric events is at least 50% overlap of the event with the exon [default] |
Example
docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <gtf-file-path>:/home/fredy/gtf/ -v <genome-file-path>:/home/fredy/ref_fa/ -v <events-file-path>:/home/fredy/events/ -v <output-path>:/home/fredy/output/ fredy chimeric -o test -g /home/fredy/gtf/<gtf-file> -G /home/fredy/ref_fa/<genome-file> -i /home/fredy/events/<event-file>
Where:
<gtf-file-path>
is the directory where GTF file was downloaded. Ex.: if $PWD/human_gv36.gtf
type $PWD/
<genome-file-path>
is the directory where the reference genome and reference genome index were downloaded. Ex.: if $PWD/human.fa
type $PWD/
<events-file-path>
is the directory where the events are. Ex.: if $PWD/events.bed
type $PWD/
<output-path>
is the output directory. Ex.: $PWD/output/
<gtf-file>
is a GTF file inside /home/fredy/gtf/
. Ex.: /home/fredy/gtf/human_gv36.gtf
<genome-file>
is a .fa inside /home/fredy/ref_fa/
. Ex.: /home/fredy/ref_fa/human.fa
<events-file>
is a .bed inside /home/fredy/events/
. Ex.: /home/fredy/events/events.bed
The “coding” subcommand classifies the novel transcripts identified in the “chimeric” step as coding or non-coding. Here, FREDY uses a model trained by RNASamba (available at here) to calculate the probability of a transcript being coding. In the end, a FASTA file with the protein sequences of all coding transcripts considered by our criteria is created.
OPTIONS:
Short | Long | Description |
---|---|---|
-o | --output-dir | Output directory. Creates the directory if it does not exist [MANDATORY] |
-m | --protein-model | File with the RNASamba model [MANDATORY] |
-d | --protein-db | File with the protein sequences [MANDATORY] |
-h | --help | Prints help message |
-P | --probability | Cutoff to consider transcripts protein-coding, based on the probability provided by RNASamba [default: 0.9] |
Example
docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <rnasambamodel-file-path>:/home/fredy/rnasamba/ -v <proteinseq-file-path>:/home/fredy/proteinseq/ -v <output-path>:/home/fredy/output/ fredy coding -o test -m /home/fredy/rnasamba/<rnasambamodel-file> -d /home/fredy/proteinseq/<proteinseq-file>
Where:
<rnasambamodel-file-path>
is the directory where RNASamba model was downloaded. Ex.: if $PWD/model.hdf5
type $PWD/
<proteinseq-file-path>
is the directory where the protein sequences file was downloaded. Ex.: if $PWD/human.pep.fa
type $PWD/
<output-path>
is the output directory. Ex.: $PWD/output/
<rnasambamodel-file>
is a .hdf5 inside /home/fredy/rnasamba/
. Ex.: /home/fredy/rnasamba/model.hdf5
<proteinseq-file>
is a .fa inside /home/fredy/proteinseq/
. Ex.: /home/fredy/proteinseq/human.pep.fa
The “pfam” step searches for protein domains in the novel transcripts that passed the user’s predefined coding probability and subsequently compares them with the host’s protein domains. In order to identify them, we use HMMER trained with the PFAM database. The output of this subcommand is a TSV file comparing the protein domains of the novel transcripts identified with those of the host genes.
OPTIONS:
Short | Long | Description |
---|---|---|
-o | --output-dir | Output directory. Creates the directory if it does not exist [MANDATORY] |
-M | --pfam-model | A database of protein domain families to be used as an index for HMMER [MANDATORY] |
-h | --help | Prints help message |
-T | --tmp-dir | Custom directory for temporary files [default: /tmp] |
-t | --threads | Number of threads [default: 4] |
-E | --e-value | In the HMMER per-target output, reports target sequences with an e-value lesser than NUM [default: 1e-6] |
Example
docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <pfammodel-file-path>:/home/fredy/pfammodel/ -v <output-path>:/home/fredy/output/ fredy pfam -o test -M <pfammodel-file>
Where:
<pfammodel-file-path>
is the directory where PFAM model was downloaded. Ex.: if $PWD/Pfam-A.hmm
type $PWD/
<output-path>
is the output directory. Ex.: $PWD/output/
<pfammodel-file>
is a .hmm inside /home/fredy/pfammodel/
. Ex.: /home/fredy/pfammodel/Pfam-A.hmm
The FREDY's “expression” subcommand quantifies all the transcriptomes assembled by the StringTie2's “expression” function. The expression results, in TPM (transcript per million) per transcript per sample, are made available as a TSV file.
OPTIONS:
Short | Long | Description |
---|---|---|
-o | --output-dir | Output directory. Creates the directory if it does not exist [MANDATORY] |
-f | --file | File containing a newline separated list of sequencing files in FASTQ format. This option is not mandatory if one or more FASTQ files are passed as argument |
-h | --help | Prints help message |
-T | --tmp-dir | Custom directory for temporary files [default: /tmp] |
-t | --threads | Number of threads [default: 8] |
-S | --short-reads | Set the sequencing to short reads [default] |
-L | --long-reads | Set the sequencing to long reads |
Example
docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <output-path>:/home/fredy/output/ fredy expression -o test
Where:
<output-path>
is the output directory. Ex.: $PWD/output/
Finally, the “results” subcommand compiles all relevant information from the previous steps. Moreover, if the novel transcripts contribute to the expression of their respective host genes, this step further generates boxplots to show the relative contribution of such expression patterns.
OPTIONS:
Short | Long | Description |
---|---|---|
-o | --output-dir | Output directory. Creates the directory if it does not exist [MANDATORY] |
-h | --help | Prints help message |
-T | --tmp-dir | Custom directory for temporary files [default: /tmp] |
Example
docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <output-path>:/home/fredy/output/ fredy results -o test
Where:
<output-path>
is the output directory. Ex.: $PWD/output/
In order to execute FREDY, we selected RNA-seq paired-end data of 2 samples related to the cell line K562 from the ENCODE Project.
First, you should download the data:
mkdir fastq
cd fastq/
##ENCLB063ZZZ R1
wget https://www.encodeproject.org/files/ENCFF001RWF/@@download/ENCFF001RWF.fastq.gz
mv ENCFF001RWF.fastq.gz ENCLB063ZZZ_R1.fastq.gz
##ENCLB063ZZZ R2
wget https://www.encodeproject.org/files/ENCFF001RWC/@@download/ENCFF001RWC.fastq.gz
mv ENCFF001RWC.fastq.gz ENCLB063ZZZ_R2.fastq.gz
##ENCLB059ZZZ R1
wget https://www.encodeproject.org/files/ENCFF001RDE/@@download/ENCFF001RDE.fastq.gz
mv ENCFF001RDE.fastq.gz ENCLB059ZZZ_R1.fastq.gz
##ENCLB059ZZZ R2
wget https://www.encodeproject.org/files/ENCFF001RCW/@@download/ENCFF001RCW.fastq.gz
mv ENCFF001RCW.fastq.gz ENCLB059ZZZ_R2.fastq.gz
cd ..
And download the databases of FREDY:
mkdir db
cd db/
## STAR Index (Based on Human hg38 - Gencode v36)
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/index/human_star_index.tar.gz
tar -xvf human_star_index.tar.gz
## Gencode v36 as the human annotation
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/reference_transcript/human_gv36.gtf
## hg38 as the human reference genome
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/reference_genomes/human.fa
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/reference_genomes/human.fa.fai
## Aminoacid sequence
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/reference_protein/human.pep.fa
## RNASamba model
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/rnasamba_model/model.hdf5
## HMMer model
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm.h3f
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm.h3i
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm.h3m
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm.h3p
## Retrocopies events
wget https://bioinfohsl-tools.s3.amazonaws.com/rcpedia/downloads/beds/RCP_9606.bed
cut -f 1,2,3,5 RCP_9606.bed | sort -k1,1 -k2,2n > RCP_9606.bed4
cd ..
Then, prepare a file with the FASTQ PATHs:
ls fastq/*fastq.gz | awk '{print "/home/fredy/"$1}' > files.txt
After that, install and build a docker image:
git clone https://github.com/galantelab/fredy.git
cd fredy
sudo docker build -f Dockerfile -t fredy .
cd ..
Finally, you will be able to execute fredy as follows:
- “star” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy star -o /home/fredy/K562 -i /home/fredy/db/star_index -f /home/fredy/files.txt
- “string” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy string -o /home/fredy/K562 -a /home/fredy/db/human_gv36.gtf
- “chimeric” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy chimeric -o /home/fredy/K562 -a /home/fredy/db/human_gv36.gtf -g /home/fredy/db/human.fa -e /home/fredy/db/RCP_9606.bed4
- “coding” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy coding -o /home/fredy/K562 -m /home/fredy/db/model.hdf5 -d /home/fredy/db/human.pep.fa
- “pfam” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy pfam -o /home/fredy/K562 -M /home/fredy/db/Pfam-A.hmm
- “expression” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy expression -o /home/fredy/K562
- “results” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy results -o /home/fredy/K562
All information related to the chimeric transcripts identified by FREDY are available in the final output named “results.tsv”. If you wish to inspect these transcripts in a Genome Browser such as at UCSC you can easily upload the “K562/chimeric/chimeric.gtf” file, also provided by the FREDY’s pipeline, to the “custom tracks”.
Rafael Luiz Vieira Mercuri - (rmercuri@mochsl.org.br)
Thiago Luiz Araújo Miller - (tmiller@mochsl.org.br)
Pedro Alexandre Favoretto Galante - (pgalante@mochsl.org.br)
Project Link: https://github.com/galantelab/fredy
Rafael Luiz Vieira Mercuri
Thiago Luiz Araújo Miller
Filipe Ferreira dos Santos
Matheus de Lima
Aline Rangel-Pozzo
Pedro Alexandre Favoretto Galante