galantelab/fredy


FREDY

(Finding Retrotransposon Exonization Dynamics)

A tool to identify exonization of retrotransposable elements using RNA-seq data.

Table of Contents

  1. Overview
  2. Installation
  3. Usage
  4. Commands and options
  5. Practical Workflow
  6. Contact
  7. Authors

Overview

FREDY is a user-friendly pipeline designed to identify, quantify, and analyze chimeric transcripts from RNA-Seq data. It builds on well-established tools: StringTie2 for transcriptome assembly and quantification, and the machine learning models provided by RNASamba to predict whether a transcript is coding. FREDY also combines HMMER with Python 3 scripts to compare protein domains and identify potential alterations.

Workflow

Installation

Manual installation

The source code for FREDY can be obtained from our GitHub page using the following command:

git clone https://github.com/galantelab/fredy.git

Inside FREDY’s directory, build a docker image:

cd fredy
sudo docker build -f Dockerfile -t fredy .

Pulling image

Alternatively, you can pull a pre-built Docker image from the Docker Hub registry:

sudo docker pull galantelab/fredy

Databases

We provide all the databases needed to run FREDY on human data, along with files for the other species listed below. The documentation in the supplementary material includes a step-by-step guide to generating these same files for other species.

General

File Description
model.hdf5 RNASamba model (Works for mammals in general)
Pfam-A.hmm HMMER model (works for mammals in general); you need to download all the Pfam-A.hmm* files listed here
Pfam-A.hmm.h3f HMMER model
Pfam-A.hmm.h3i HMMER model
Pfam-A.hmm.h3m HMMER model
Pfam-A.hmm.h3p HMMER model

By Species

Human
File Description
human_star_index.tar.gz Folder with STAR Index built with hg38.fa and gencode v36
human_gv36.gtf GTF file (Gencode V36 Used in TCGA)
human.fa Reference Genome (hg38)
human.fa.fai Index of reference genome
human.pep.fa Amino acid sequences of proteins
Chimp
File Description
chimp_star_index.tar.gz Folder with STAR Index
chimp_pantro6.gtf GTF file
chimp.fa Reference Genome
chimp.fa.fai Index of reference genome
chimp.pep.fa Amino acid sequences of proteins
Cow
File Description
cow_star_index.tar.gz Folder with STAR Index
cow_bostau9.gtf GTF file
cow.fa Reference Genome
cow.fa.fai Index of reference genome
cow.pep.fa Amino acid sequences of proteins
Dog
File Description
dog_star_index.tar.gz Folder with STAR Index
dog_cfam1.gtf GTF file
dog.fa Reference Genome
dog.fa.fai Index of reference genome
dog.pep.fa Amino acid sequences of proteins
Marmoset
File Description
marmoset_star_index.tar.gz Folder with STAR Index
marmoset_cj1700.gtf GTF file
marmoset.fa Reference Genome
marmoset.fa.fai Index of reference genome
marmoset.pep.fa Amino acid sequences of proteins
Mouse
File Description
mouse_star_index.tar.gz Folder with STAR Index
mouse_GRCm39.gtf GTF file
mouse.fa Reference Genome
mouse.fa.fai Index of reference genome
mouse.pep.fa Amino acid sequences of proteins
Opossum
File Description
opossum_star_index.tar.gz Folder with STAR Index
opossum_mondom5.gtf GTF file
opossum.fa Reference Genome
opossum.fa.fai Index of reference genome
opossum.pep.fa Amino acid sequences of proteins
Platypus
File Description
platypus_star_index.tar.gz Folder with STAR Index
platypus_mornana1.gtf GTF file
platypus.fa Reference Genome
platypus.fa.fai Index of reference genome
platypus.pep.fa Amino acid sequences of proteins
Rat
File Description
rat_star_index.tar.gz Folder with STAR Index
rat_rn6.gtf GTF file
rat.fa Reference Genome
rat.fa.fai Index of reference genome
rat.pep.fa Amino acid sequences of proteins
Rhesus
File Description
rhesus_star_index.tar.gz Folder with STAR Index
rhesus_rhemac10.gtf GTF file
rhesus.fa Reference Genome
rhesus.fa.fai Index of reference genome
rhesus.pep.fa Amino acid sequences of proteins

To use ${specie}_star_index.tar.gz, first extract the archive:

tar -xvf ${specie}_star_index.tar.gz

Usage

FREDY has seven subcommands: “star”, “string”, “chimeric”, “coding”, “pfam”, “expression” and “results”.

fredy [subcommand] <options>

The available subcommands are listed in the help menu:

fredy help
Subcommand Description
star Aligns RNA-seq data against the genome using STAR (DOI: 10.1093/bioinformatics/bts635)
string Assembles sequenced reads (compatible with both short and long reads) using StringTie2 (DOI: 10.1186/s13059-019-1910-1)
chimeric Identifies potential chimeric transcripts
coding Computes the coding potential of (chimeric) transcripts using RNASamba (DOI: 10.1093/nargab/lqz024)
pfam Searches for protein domains using HMMER (DOI: 10.1093/nar/gkr367) and PFAM protein families and domains (DOI: 10.1093/nar/gkaa913)
expression Estimates transcript expression using StringTie2 (DOI: 10.1186/s13059-019-1910-1)
results Compiles the final results of chimeric transcripts incorporating inputs from previous steps

Commands and options

Star

The first step in the FREDY pipeline is “star”. The inputs to this command are FASTQ files and a STAR index (pre-built indexes are listed in the Databases section). The output is a sorted and filtered BAM file of aligned reads, which becomes the input to the next step. This command supports all types of RNA-Seq data (paired-end, single-end and long reads), either compressed (as .gz) or not.

OPTIONS:

Short Long Description
-o --output-dir Output directory. Creates the directory if it does not exist [MANDATORY]
-i --index-dir STAR index directory [MANDATORY]
-f --file File containing a newline separated list of sequencing files in FASTQ format. This option is not mandatory if one or more FASTQ files are passed as argument [MANDATORY]
-h --help Prints help message
-t --threads Number of threads [default: 8]
-S --short-reads Set the sequencing to short reads [default]
-L --long-reads Set the sequencing to long reads
-s --single-end For short reads '-S', set the type of sequencing to single-end
-p --paired-end For short reads '-S', set the type of sequencing to paired-end. In this case, the FASTQ files are treated as forward (R1) and reverse (R2) reads according to the order in which they are passed [default]

Example

docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <star_index-path>:/home/fredy/star_index/ -v <fastq-file-path>:/home/fredy/input/ -v <output-path>:/home/fredy/output/ fredy star -o test -i /home/fredy/star_index/ -f /home/fredy/input/<fastq-path>

Where:

<star_index-path> is the directory where star_index was downloaded. Ex.: $PWD/star_index/

<fastq-file-path> is the directory where all FASTQ files are. Ex.: if $PWD/*.fastq.gz type $PWD/

<output-path> is the output directory. Ex.: $PWD/output/

<fastq-path> is a .txt file inside /home/fredy/input/ with the docker path (for instance /home/fredy/input/test.fastq.gz) to the FASTQ files.

String

The next step in the pipeline is “string”. This subcommand performs a transcriptome assembly with the BAMs generated in the previous step (or custom BAMs provided by the user). The output of this analysis is a GTF file representing the transcriptome from all samples.

OPTIONS:

Short Long Description
-o --output-dir Output directory. Creates the directory if it does not exist [MANDATORY]
-a --annotation Gene annotation of the reference transcriptome in GTF format [MANDATORY]
-f --file File containing a newline separated list of sequencing files in FASTQ format. This option is not mandatory if one or more FASTQ files are passed as argument [MANDATORY]
-h --help Prints help message
-t --threads Number of threads [default: 8]
-S --short-reads Set the sequencing to short reads [default]
-L --long-reads Set the sequencing to long reads

Example

docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <gtf-file-path>:/home/fredy/gtf/ -v <output-path>:/home/fredy/output/ fredy string -o test -a /home/fredy/gtf/<gtf-file>

Where:

<gtf-file-path> is the directory where gtf was downloaded. Ex.: if $PWD/human_gv36.gtf type $PWD/

<output-path> is the output directory. Ex.: $PWD/output/

<gtf-file> is a GTF inside /home/fredy/gtf/. Ex.: /home/fredy/gtf/human_gv36.gtf

Chimeric

In the “chimeric” step, the pipeline identifies novel transcripts based on the GTF file generated by the “string” subcommand. Here, FREDY uses a list of events provided by the user to find transcripts whose exons overlap the given events. The outputs are a GTF file and a FASTA file containing all the transcripts found.

Chimeric transcript

OPTIONS:

Short Long Description
-o --output-dir Output directory. Creates the directory if it does not exist [MANDATORY]
-a --annotation Gene annotation of the reference transcriptome in GTF format [MANDATORY]
-g --genome FASTA file of the reference genome, which must be the same file used for read alignment with STAR [MANDATORY]
-e --stringtie-out ME fixed events in BED4 format [MANDATORY]
-h --help Prints help message
-T --tmp-dir Custom directory for temporary files [default: /tmp]
-r --reciprocal Criteria for identification of chimeric events is at least 50% overlap of the event with the exon and at least 50% overlap of the exon with the event
-R --irreciprocal Criteria for identification of chimeric events is at least 50% overlap of the event with the exon [default]
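The two overlap criteria can be illustrated with a small shell sketch. This is not FREDY's implementation: the function names `overlap_len` and `is_chimeric` are hypothetical, coordinates are assumed to be half-open (BED style), and the 50% thresholds are assumed to be inclusive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: length of the overlap between two half-open intervals.
overlap_len() {  # usage: overlap_len s1 e1 s2 e2
  local s=$(( $1 > $3 ? $1 : $3 ))
  local e=$(( $2 < $4 ? $2 : $4 ))
  echo $(( e > s ? e - s : 0 ))
}

# Prints "yes" if the event/exon pair passes the criterion, "no" otherwise.
# mode "irreciprocal": the overlap covers at least 50% of the event.
# mode "reciprocal":   it must also cover at least 50% of the exon.
is_chimeric() {  # usage: is_chimeric mode ev_start ev_end ex_start ex_end
  local mode=$1 es=$2 ee=$3 xs=$4 xe=$5
  local ov ok=no
  ov=$(overlap_len "$es" "$ee" "$xs" "$xe")
  # 2 * ov >= length is an integer-safe test for ov / length >= 0.5
  if (( 2 * ov >= ee - es )); then
    ok=yes
    if [[ $mode == reciprocal ]] && (( 2 * ov < xe - xs )); then
      ok=no
    fi
  fi
  echo "$ok"
}

# Event 100-200 (100 bp) against exon 150-1000 (850 bp): the overlap is 50 bp.
is_chimeric irreciprocal 100 200 150 1000   # prints "yes"
is_chimeric reciprocal   100 200 150 1000   # prints "no"
```

In this example the event is half covered by the exon, so it passes the default criterion, but the same 50 bp cover far less than half of the exon, so the reciprocal criterion rejects it.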

Example

docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <gtf-file-path>:/home/fredy/gtf/ -v <genome-file-path>:/home/fredy/ref_fa/ -v <events-file-path>:/home/fredy/events/ -v <output-path>:/home/fredy/output/ fredy chimeric -o test -a /home/fredy/gtf/<gtf-file> -g /home/fredy/ref_fa/<genome-file> -e /home/fredy/events/<events-file>

Where:

<gtf-file-path> is the directory where GTF file was downloaded. Ex.: if $PWD/human_gv36.gtf type $PWD/

<genome-file-path> is the directory where the reference genome and reference genome index were downloaded. Ex.: if $PWD/human.fa type $PWD/

<events-file-path> is the directory where the events are. Ex.: if $PWD/events.bed type $PWD/

<output-path> is the output directory. Ex.: $PWD/output/

<gtf-file> is a GTF file inside /home/fredy/gtf/. Ex.: /home/fredy/gtf/human_gv36.gtf

<genome-file> is a .fa inside /home/fredy/ref_fa/. Ex.: /home/fredy/ref_fa/human.fa

<events-file> is a .bed inside /home/fredy/events/. Ex.: /home/fredy/events/events.bed
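Since the events must be supplied in BED4 format, it can help to check the file before starting the container. The sketch below is a hypothetical pre-flight check, not part of FREDY; it assumes tab-separated columns (chromosome, start, end, name) with no header or track lines.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: report lines that do not look like BED4.
# Returns a non-zero status if any problem is found.
check_bed4() {  # usage: check_bed4 events.bed
  awk -F'\t' '
    NF != 4 {
      printf "line %d: expected 4 columns, found %d\n", NR, NF; bad = 1; next
    }
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
      printf "line %d: start and end must be integers\n", NR; bad = 1; next
    }
    $2 + 0 >= $3 + 0 {
      printf "line %d: start must be smaller than end\n", NR; bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Toy input: the second record has start > end and is reported.
demo=$(mktemp)
printf 'chr1\t100\t200\tRC1\nchr1\t300\t250\tRC2\n' > "$demo"
check_bed4 "$demo" || echo "events file needs fixing"
rm -f "$demo"
```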

Coding

The “coding” subcommand classifies the novel transcripts identified in the “chimeric” step as coding or non-coding. Here, FREDY uses an RNASamba model (model.hdf5, listed in the Databases section) to calculate the probability of a transcript being coding. At the end, it creates a FASTA file with the protein sequences of all transcripts classified as coding under the chosen cutoff.

OPTIONS:

Short Long Description
-o --output-dir Output directory. Creates the directory if it does not exist [MANDATORY]
-m --protein-model File with the RNASamba model [MANDATORY]
-d --protein-db File with the protein sequences [MANDATORY]
-h --help Prints help message
-P --probability Cutoff to consider transcripts protein-coding, based on the probability provided by RNASamba [default: 0.9]
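The effect of the `-P` cutoff can be pictured as a simple filter over a table of coding scores. The sketch below only illustrates that idea: the function name `filter_coding` is hypothetical, the three-column layout (name, score, label, with a header line) is an assumption about the score table, and treating the cutoff as inclusive is also an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical filter: print the IDs whose coding score reaches the cutoff.
# Assumed layout: one header line, then name <TAB> score <TAB> label.
filter_coding() {  # usage: filter_coding scores.tsv [cutoff]
  awk -F'\t' -v cutoff="${2:-0.9}" \
    'NR > 1 && $2 + 0 >= cutoff + 0 { print $1 }' "$1"
}

# Toy score table with three transcripts.
scores=$(mktemp)
printf 'sequence_name\tcoding_score\tclassification\n' > "$scores"
printf 'tx1\t0.97\tcoding\ntx2\t0.55\tcoding\ntx3\t0.08\tnoncoding\n' >> "$scores"

filter_coding "$scores"        # default cutoff 0.9: prints tx1
filter_coding "$scores" 0.5    # more permissive cutoff: prints tx1 and tx2
rm -f "$scores"
```

A stricter cutoff keeps fewer transcripts for the downstream “pfam” step, at the cost of possibly discarding true coding transcripts with intermediate scores.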

Example

docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <rnasambamodel-file-path>:/home/fredy/rnasamba/ -v <proteinseq-file-path>:/home/fredy/proteinseq/ -v <output-path>:/home/fredy/output/ fredy coding -o test -m /home/fredy/rnasamba/<rnasambamodel-file> -d /home/fredy/proteinseq/<proteinseq-file>

Where:

<rnasambamodel-file-path> is the directory where RNASamba model was downloaded. Ex.: if $PWD/model.hdf5 type $PWD/

<proteinseq-file-path> is the directory where the protein sequences file was downloaded. Ex.: if $PWD/human.pep.fa type $PWD/

<output-path> is the output directory. Ex.: $PWD/output/

<rnasambamodel-file> is a .hdf5 inside /home/fredy/rnasamba/. Ex.: /home/fredy/rnasamba/model.hdf5

<proteinseq-file> is a .fa inside /home/fredy/proteinseq/. Ex.: /home/fredy/proteinseq/human.pep.fa

Pfam

The “pfam” step searches for protein domains in the novel transcripts that passed the user’s predefined coding probability and subsequently compares them with the host’s protein domains. In order to identify them, we use HMMER trained with the PFAM database. The output of this subcommand is a TSV file comparing the protein domains of the novel transcripts identified with those of the host genes.

OPTIONS:

Short Long Description
-o --output-dir Output directory. Creates the directory if it does not exist [MANDATORY]
-M --pfam-model A database of protein domain families to be used as an index for HMMER [MANDATORY]
-h --help Prints help message
-T --tmp-dir Custom directory for temporary files [default: /tmp]
-t --threads Number of threads [default: 4]
-E --e-value In the HMMER per-target output, reports target sequences with an e-value lower than NUM [default: 1e-6]

Example

docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <pfammodel-file-path>:/home/fredy/pfammodel/ -v <output-path>:/home/fredy/output/ fredy pfam -o test -M /home/fredy/pfammodel/<pfammodel-file>

Where:

<pfammodel-file-path> is the directory where PFAM model was downloaded. Ex.: if $PWD/Pfam-A.hmm type $PWD/

<output-path> is the output directory. Ex.: $PWD/output/

<pfammodel-file> is a .hmm inside /home/fredy/pfammodel/. Ex.: /home/fredy/pfammodel/Pfam-A.hmm
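Because the Pfam model consists of the .hmm file plus its four binary index files (.h3f, .h3i, .h3m, .h3p), a quick check that all five are present in the mounted directory can save a failed run. The helper below is a hypothetical sketch, not part of FREDY, and the path `db/Pfam-A.hmm` is only an example location.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the Pfam files that are missing or empty
# next to the given .hmm file. Prints nothing when the set is complete.
missing_pfam_files() {  # usage: missing_pfam_files /path/to/Pfam-A.hmm
  local hmm=$1 ext f
  for ext in "" .h3f .h3i .h3m .h3p; do
    f="${hmm}${ext}"
    [[ -s $f ]] || echo "$f"
  done
}

# Example location (an assumption): report anything that still needs downloading.
missing_pfam_files db/Pfam-A.hmm | sed 's/^/missing or empty: /'
```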

Expression

FREDY's “expression” subcommand quantifies the assembled transcripts in each sample using StringTie2's expression estimation. The expression results, in TPM (transcripts per million) per transcript per sample, are made available as a TSV file.

OPTIONS:

Short Long Description
-o --output-dir Output directory. Creates the directory if it does not exist [MANDATORY]
-f --file File containing a newline separated list of sequencing files in FASTQ format. This option is not mandatory if one or more FASTQ files are passed as argument
-h --help Prints help message
-T --tmp-dir Custom directory for temporary files [default: /tmp]
-t --threads Number of threads [default: 8]
-S --short-reads Set the sequencing to short reads [default]
-L --long-reads Set the sequencing to long reads

Example

docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <output-path>:/home/fredy/output/ fredy expression -o test

Where:

<output-path> is the output directory. Ex.: $PWD/output/

Results

Finally, the “results” subcommand compiles all relevant information from the previous steps. Moreover, if the novel transcripts contribute to the expression of their respective host genes, this step further generates boxplots to show the relative contribution of such expression patterns.

OPTIONS:

Short Long Description
-o --output-dir Output directory. Creates the directory if it does not exist [MANDATORY]
-h --help Prints help message
-T --tmp-dir Custom directory for temporary files [default: /tmp]

Example

docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v <output-path>:/home/fredy/output/ fredy results -o test

Where:

<output-path> is the output directory. Ex.: $PWD/output/

Practical workflow

To demonstrate FREDY, we selected paired-end RNA-seq data from 2 samples of the K562 cell line from the ENCODE Project.

First, you should download the data:

mkdir fastq

cd fastq/

##ENCLB063ZZZ R1
wget https://www.encodeproject.org/files/ENCFF001RWF/@@download/ENCFF001RWF.fastq.gz
mv ENCFF001RWF.fastq.gz ENCLB063ZZZ_R1.fastq.gz

##ENCLB063ZZZ R2
wget https://www.encodeproject.org/files/ENCFF001RWC/@@download/ENCFF001RWC.fastq.gz
mv ENCFF001RWC.fastq.gz ENCLB063ZZZ_R2.fastq.gz

##ENCLB059ZZZ R1
wget https://www.encodeproject.org/files/ENCFF001RDE/@@download/ENCFF001RDE.fastq.gz
mv ENCFF001RDE.fastq.gz ENCLB059ZZZ_R1.fastq.gz

##ENCLB059ZZZ R2
wget https://www.encodeproject.org/files/ENCFF001RCW/@@download/ENCFF001RCW.fastq.gz
mv ENCFF001RCW.fastq.gz ENCLB059ZZZ_R2.fastq.gz

cd ..

And download the databases of FREDY:

mkdir db
cd db/

## STAR Index (Based on Human hg38 - Gencode v36)
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/index/human_star_index.tar.gz
tar -xvf human_star_index.tar.gz

## Gencode v36 as the human annotation
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/reference_transcript/human_gv36.gtf

## hg38 as the human reference genome
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/reference_genomes/human.fa
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/reference_genomes/human.fa.fai

## Amino acid sequences
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/reference_protein/human.pep.fa

## RNASamba model
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/rnasamba_model/model.hdf5

## HMMer model
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm.h3f
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm.h3i
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm.h3m
wget https://bioinfohsl-tools.s3.amazonaws.com/fredy/databases/hmm_model/Pfam-A.hmm.h3p

## Retrocopies events
wget https://bioinfohsl-tools.s3.amazonaws.com/rcpedia/downloads/beds/RCP_9606.bed
cut -f 1,2,3,5 RCP_9606.bed | sort -k1,1 -k2,2n > RCP_9606.bed4

cd ..

Then, prepare a file with the FASTQ paths:

ls fastq/*fastq.gz | awk '{print "/home/fredy/"$1}' > files.txt
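For paired-end data the order of the files matters, since R1 and R2 are assigned according to the order in which they are passed. The sketch below checks that files.txt lists the files as consecutive R1/R2 pairs. It is a hypothetical sanity check, not part of FREDY: it assumes the `_R1.fastq.gz` / `_R2.fastq.gz` naming used for the downloads above and assumes that multiple samples are read as consecutive pairs.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check: every odd line must be an _R1 file and the
# following line must be the matching _R2 file of the same sample.
check_pairs() {  # usage: check_pairs files.txt
  awk '
    NR % 2 == 1 { r1 = $0; next }
    {
      expected = r1
      sub(/_R1\.fastq\.gz$/, "_R2.fastq.gz", expected)
      if (r1 !~ /_R1\.fastq\.gz$/ || $0 != expected) {
        printf "lines %d-%d are not an R1/R2 pair\n", NR - 1, NR; bad = 1
      }
    }
    END {
      if (NR % 2 == 1) { print "odd number of files"; bad = 1 }
      exit bad + 0
    }
  ' "$1"
}

# Toy list with one correctly ordered pair.
list=$(mktemp)
printf '%s\n' /home/fredy/fastq/ENCLB059ZZZ_R1.fastq.gz \
              /home/fredy/fastq/ENCLB059ZZZ_R2.fastq.gz > "$list"
check_pairs "$list" && echo "pairs look consistent"
rm -f "$list"
```

With the naming above, the alphabetical order produced by `ls` already places each R1 file directly before its R2 file, which is what this check confirms.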

After that, clone the repository and build the Docker image:

git clone https://github.com/galantelab/fredy.git
cd fredy
sudo docker build -f Dockerfile -t fredy .
cd ..

Finally, you will be able to execute fredy as follows:

  • “star” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy star -o /home/fredy/K562 -i /home/fredy/db/star_index -f /home/fredy/files.txt
  • “string” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy string -o /home/fredy/K562 -a /home/fredy/db/human_gv36.gtf
  • “chimeric” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy chimeric -o /home/fredy/K562 -a /home/fredy/db/human_gv36.gtf -g /home/fredy/db/human.fa -e /home/fredy/db/RCP_9606.bed4
  • “coding” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy coding -o /home/fredy/K562 -m /home/fredy/db/model.hdf5 -d /home/fredy/db/human.pep.fa
  • “pfam” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy pfam -o /home/fredy/K562 -M /home/fredy/db/Pfam-A.hmm
  • “expression” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy expression -o /home/fredy/K562
  • “results” step:
time docker run --rm -u $(id -u):$(id -g) -w $(pwd) -v $PWD:/home/fredy fredy results -o /home/fredy/K562
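The seven commands above share the same docker prefix, so they can be chained in one script. The following is a minimal sketch, assuming the directory layout of this walkthrough; `fredy_run` and the `DRY_RUN` switch are hypothetical conveniences, not part of FREDY. With `DRY_RUN=1` the commands are only printed, so the script can be reviewed before anything runs.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the docker invocation used in every step above.
# With DRY_RUN=1 it prints the command instead of running it.
fredy_run() {
  local cmd=(docker run --rm -u "$(id -u):$(id -g)" -w "$PWD"
             -v "$PWD:/home/fredy" fredy "$@")
  if [[ ${DRY_RUN:-0} == 1 ]]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

DRY_RUN=1               # set to 0 to actually run the pipeline
out=/home/fredy/K562    # output directory, as seen inside the container
db=/home/fredy/db       # databases directory, as seen inside the container

# "&&" stops the chain at the first step that fails.
fredy_run star       -o "$out" -i "$db/star_index" -f /home/fredy/files.txt &&
fredy_run string     -o "$out" -a "$db/human_gv36.gtf" &&
fredy_run chimeric   -o "$out" -a "$db/human_gv36.gtf" -g "$db/human.fa" -e "$db/RCP_9606.bed4" &&
fredy_run coding     -o "$out" -m "$db/model.hdf5" -d "$db/human.pep.fa" &&
fredy_run pfam       -o "$out" -M "$db/Pfam-A.hmm" &&
fredy_run expression -o "$out" &&
fredy_run results    -o "$out"
```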

All information related to the chimeric transcripts identified by FREDY is available in the final output file, “results.tsv”. If you wish to inspect these transcripts in a genome browser such as the UCSC Genome Browser, you can upload the “K562/chimeric/chimeric.gtf” file, also produced by the FREDY pipeline, as a custom track.

Contact

Rafael Luiz Vieira Mercuri - (rmercuri@mochsl.org.br)

Thiago Luiz Araújo Miller - (tmiller@mochsl.org.br)

Pedro Alexandre Favoretto Galante - (pgalante@mochsl.org.br)

Project Link: https://github.com/galantelab/fredy

Authors

Rafael Luiz Vieira Mercuri

Thiago Luiz Araújo Miller

Filipe Ferreira dos Santos

Matheus de Lima

Aline Rangel-Pozzo

Pedro Alexandre Favoretto Galante