Build Reference
Before running IR analysis, an IRFinder reference is required. The build process works as follows:
- All potential introns are extracted from the GTF; an intron is the region between two exons in any transcript. Regions within each intron that are covered by a GTF feature are then excluded, as they are likely to confound accurate measurement of the true intron level.
- For directional sequencing, only features on the same strand as the intron are excluded. For non-directional sequencing, exclusions are omnidirectional.
- Regions of poor unique mappability are determined by mapping synthetic reads to the genome. Synthetic reads are 70bp single-end, stepped at 10bp across the entire reference genome. Reads uniquely mapping to the correct location are tallied. Any 70bp stretch without at least 5 unique reads is considered poorly mappable and excluded from the measurable intron area regardless of strand/direction.
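The first step above (introns as the gaps between consecutive exons of a transcript) can be sketched in a few lines of shell. This is only an illustration under simplifying assumptions, not IRFinder's own implementation: the function name `gtf_introns` is hypothetical, the input is assumed to be an uncompressed GTF with a `transcript_id` attribute on every exon line, and the later exclusion of overlapping features and poorly mappable regions is not attempted.

```shell
# Hypothetical helper (not part of IRFinder): list candidate introns as
# chrom, start, end, strand, transcript_id (tab-separated, 1-based inclusive).
gtf_introns() {
  # Step 1: keep exon lines and pull out transcript_id, chrom, start, end, strand.
  awk -F'\t' '$3 == "exon" {
      if (match($9, /transcript_id "[^"]+"/)) {
        # skip the 15 characters of: transcript_id "   and drop the closing quote
        tid = substr($9, RSTART + 15, RLENGTH - 16)
        print tid "\t" $1 "\t" $4 "\t" $5 "\t" $7
      }
    }' "$1" |
  # Step 2: group exons by transcript and order them by start coordinate.
  sort -t "$(printf '\t')" -k1,1 -k3,3n |
  # Step 3: every gap between two consecutive exons of one transcript is an intron.
  awk -F'\t' -v OFS='\t' '
    $1 == prev_tid && $3 > prev_end + 1 {
      print $2, prev_end + 1, $3 - 1, $5, $1
    }
    { prev_tid = $1; prev_end = $4 }'
}

# Example: gtf_introns transcripts.gtf | head
```

Because every transcript is processed separately, the same genomic interval can be reported more than once; deduplicating on the first four columns would be a natural next step.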
An IRFinder reference can be built in any of the following modes:
In BuildRef mode, IRFinder requires the Ensembl FTP address of the GTF file of the genome. For example, the following command builds a reference for IRFinder in REF/Human-GRCh38-release100, based on the Ensembl human genome GRCh38 and Release 100 of the gene annotations.
$ bin/IRFinder -m BuildRef -r REF/Human-GRCh38-release100 \
-e REF/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz \
-b REF/extra-input-files/Human_hg38_nonPolyA_ROI.bed \
ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz
Note that the -e, -b and -R arguments are optional. IRFinder provides common ERCC sequences at REF/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz, which can be applied to any genome via the -e option. IRFinder also provides BED files for -b and -R in the REF/extra-input-files folder, but they are genome-specific.
Alternatively, a reference can be built from local files in BuildRefProcess mode if the genome FASTA file and transcriptome GTF file have already been downloaded and unzipped.
For example, for downloads/Homo_sapiens.GRCh37.75.fasta and downloads/Homo_sapiens.GRCh37.75.gtf:
$ mkdir REF/Human-GRCh37-release75 #This must be a new/clean folder
$ ln -s /FULL_PATH/downloads/Homo_sapiens.GRCh37.75.fasta REF/Human-GRCh37-release75/genome.fa #File name MUST be "genome.fa". Case-sensitive. FULL PATH of the original file must be given to make link work.
$ ln -s /FULL_PATH/downloads/Homo_sapiens.GRCh37.75.gtf REF/Human-GRCh37-release75/transcripts.gtf #File name MUST be "transcripts.gtf". Case-sensitive. FULL PATH of the original file must be given to make link work.
# Alternatively, copying the FASTA and GTF files into the folder also works. Again, the file names must be "genome.fa" and "transcripts.gtf".
$ bin/IRFinder -m BuildRefProcess -r REF/Human-GRCh37-release75 \
-e REF/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz \
-b REF/extra-input-files/Human_hg19_wgEncodeDacMapabilityConsensusExcludable.bed.gz \
-R REF/extra-input-files/Human_hg19_nonPolyA_ROI.bed
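Since BuildRefProcess depends on the exact file names and on the symbolic links resolving, a small pre-flight check before launching the build can save a failed run. The following is a minimal sketch; `check_ref_inputs` is a hypothetical helper and not an IRFinder command.

```shell
# Hypothetical helper (not part of IRFinder): verify that a reference folder
# holds the two inputs BuildRefProcess expects, under their exact names.
check_ref_inputs() {
  ref_dir=$1
  status=0
  for name in genome.fa transcripts.gtf; do
    # -s follows symlinks, so a dangling link (for example one created with a
    # relative path) fails here just like a missing or empty file.
    if [ ! -s "$ref_dir/$name" ]; then
      echo "missing or empty: $ref_dir/$name" >&2
      status=1
    fi
  done
  return $status
}

# Example:
# check_ref_inputs REF/Human-GRCh37-release75 && echo "ready for BuildRefProcess"
```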
In the above two modes, IRFinder starts by building a STAR transcriptome reference. If a STAR reference for the transcriptome of interest already exists, IRFinder can reuse it via BuildRefFromSTARRef mode, which avoids rebuilding the STAR reference and saves a significant amount of preparation time.
$ bin/IRFinder -m BuildRefFromSTARRef -r REF/Human-GRCh38-release100 -x EXISTING_STAR_REFERENCE \
-e REF/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz \
-b REF/extra-input-files/Human_hg38_nonPolyA_ROI.bed
Please note: by default, BuildRefFromSTARRef mode automatically looks for the original FASTA and GTF files used to generate the EXISTING_STAR_REFERENCE. Specifically, IRFinder inspects 'genomeParameters.txt' in the EXISTING_STAR_REFERENCE folder. If both files can be located, IRFinder continues to generate the reference, ignoring the '-f' and '-g' options. If either file is missing, IRFinder quits and you have to re-run it with both the '-f' and '-g' options:
# If the original FASTA and GTF files used to generate EXISTING_STAR_REFERENCE cannot be found,
# users can supply both files manually.
# WARNING: users MUST make sure that the manually supplied FASTA and GTF are compatible with EXISTING_STAR_REFERENCE
$ bin/IRFinder -m BuildRefFromSTARRef -r REF/Human-GRCh38-release100 -x EXISTING_STAR_REFERENCE \
-f FULL_PATH_OF_FASTA \
-g FULL_PATH_OF_GTF \
-e REF/extra-input-files/RNA.SpikeIn.ERCC.fasta.gz \
-b REF/extra-input-files/Human_hg38_nonPolyA_ROI.bed
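To see in advance whether '-f' and '-g' will be needed, the STAR reference can be inspected by hand. The sketch below assumes that 'genomeParameters.txt' stores one parameter per line as a name followed by its value, and that the source files are recorded under the keys `genomeFastaFiles` and `sjdbGTFfile`, as STAR writes them; how IRFinder itself reads the file is not documented here, so treat this as an approximation. The helper names are hypothetical.

```shell
# Hypothetical helper: print the first value stored for a parameter in a STAR
# reference's genomeParameters.txt (assumed layout: name, whitespace, value).
star_param() {
  awk -v key="$2" '$1 == key { print $2; exit }' "$1/genomeParameters.txt"
}

# Hypothetical helper: report whether the recorded FASTA and GTF still exist.
# Caveats: paths are tested as recorded, so relative paths only resolve from
# the directory where STAR was originally run, and only the first FASTA is
# checked if several were listed.
check_star_sources() {
  fasta=$(star_param "$1" genomeFastaFiles)
  gtf=$(star_param "$1" sjdbGTFfile)
  if [ -f "$fasta" ] && [ -f "$gtf" ]; then
    echo "FASTA: $fasta"
    echo "GTF:   $gtf"
  else
    echo "original FASTA/GTF not found; supply -f and -g" >&2
    return 1
  fi
}

# Example: check_star_sources EXISTING_STAR_REFERENCE
```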
IRFinder supports using a customized GTF/GFF/GFF3 file to prepare the reference, provided it follows these rules:
For a GTF file, it must have the following attributes: gene_name, gene_id, transcript_id, transcript_biotype and gene_biotype.
For a GFF (version 2) file, it must have the following attributes: ID, gene_id, transcript_id, transcript_biotype and gene_biotype.
For a GFF3 file, it must have the following attributes: Name, gene_id, transcript_id, transcript_biotype and gene_biotype.
Please also note that only genes whose transcript_biotype attribute is protein_coding or processed_transcript are considered for intron retention measurements.
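A custom GTF can be screened against these rules before starting a build. The sketch below covers the GTF case only and assumes the usual `name "value";` attribute syntax; both function names are hypothetical. It checks whether each required attribute appears anywhere in the file, not whether it appears on every line.

```shell
# Hypothetical helper: print each required GTF attribute that never occurs
# in the file (no output means all five were found at least once).
gtf_missing_attrs() {
  for attr in gene_name gene_id transcript_id transcript_biotype gene_biotype; do
    # The attribute name must follow a tab, space or semicolon, so that
    # gene_biotype is not satisfied by transcript_biotype, and so on.
    if ! grep -v '^#' "$1" | grep -q "[[:space:];]$attr \""; then
      echo "$attr"
    fi
  done
}

# Hypothetical helper: count lines whose transcript_biotype makes them
# eligible for intron retention measurement.
count_eligible_lines() {
  grep -v '^#' "$1" |
    grep -E -c 'transcript_biotype "(protein_coding|processed_transcript)"'
}

# Example:
# gtf_missing_attrs transcripts.gtf
# count_eligible_lines transcripts.gtf
```

A count of zero from the second helper suggests that no transcript would be considered, even if all attributes are present.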
After building the reference, if any of the following files (inside the IRFinder folder in the reference directory) is empty, please refer to Issue 1 in the Troubleshoot section for details.
exclude.directional.bed
exclude.omnidirectional.bed
introns.unique.bed
ref-cover.bed
ref-read-continues.ref
ref-sj.ref
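The check for empty files can be scripted. This is a minimal sketch with a hypothetical helper name; it assumes the six files sit directly in the IRFinder subfolder of the reference directory, as described above, and it reports missing files as well as empty ones.

```shell
# Hypothetical helper: print the name of every expected reference file that is
# missing or empty under <reference dir>/IRFinder (no output means all are fine).
empty_ref_files() {
  for f in exclude.directional.bed exclude.omnidirectional.bed \
           introns.unique.bed ref-cover.bed ref-read-continues.ref ref-sj.ref; do
    # -s is true only for an existing file with size greater than zero
    [ -s "$1/IRFinder/$f" ] || echo "$f"
  done
}

# Example: empty_ref_files REF/Human-GRCh38-release100
```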