TAMA Merge
TAMA Merge is a tool that allows you to merge multiple transcriptomes while maintaining source information.
Detailed explanation of TAMA Merge:
Are you interested in:
1. Combining your Iso-Seq data from different tissue types/library preps into a single transcriptome.
2. Comparing your Iso-Seq data to the reference annotation (or short read RNAseq annotation).
3. Combining your Iso-Seq data with a short read RNAseq annotation and with the reference annotation.
4. Doing any of the above while still maintaining source information.
5. Doing any of the above with the power to define merging parameters.
6. Comparing pipelines (use TAMA Merge on annotations made from the same dataset but using different pipelines).
If so, TAMA Merge is probably what you are looking for.
TAMA Merge takes as input multiple transcriptomes in bed12 format. It then compares the transcript models from each transcriptome and merges models based on the similarity of their features (transcription start/end sites and exon start/end sites). The output is a merged transcriptome in bed12 format, along with other files containing source information.
Note that the input bed12 files must have the gene IDs and transcript IDs formatted as "gene_id;transcript_id" in the 4th column. The gene ID must be the first subfield, and the subfields must be delimited with a semicolon (;).
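The ID requirement above can be checked before running a merge. The following is a minimal sketch, not part of TAMA: the function names `split_bed12_id` and `check_bed12_ids` and the example IDs are hypothetical, and the input is assumed to be tab-separated bed12 lines.

```python
def split_bed12_id(name_field):
    """Split a bed12 name field of the form 'gene_id;transcript_id'."""
    parts = name_field.split(";")
    # Require at least two non-empty subfields, gene ID first.
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected 'gene_id;transcript_id', got {name_field!r}")
    return parts[0], parts[1]


def check_bed12_ids(lines):
    """Return (line_number, name_field) pairs whose 4th column is malformed."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 4:
            bad.append((number, ""))  # no 4th column at all
            continue
        try:
            split_bed12_id(columns[3])
        except ValueError:
            bad.append((number, columns[3]))
    return bad


# Hypothetical example lines: the first is well formed, the second lacks a gene ID subfield.
example = [
    "1\t219\t3261\tG1;G1.1\t40\t+\t219\t3261\t255,0,0\t5\t98,93,181,107,714\t0,1457,1757,2132,2328",
    "1\t219\t3261\tG1.1\t40\t+\t219\t3261\t255,0,0\t5\t98,93,181,107,714\t0,1457,1757,2132,2328",
]
print(check_bed12_ids(example))
```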
You can define the threshold for transcription start/end sites (TSS/TES) and exon start/end sites (ESS/EES). You can also give priority to features from specific transcriptomes. For instance, you may want to give priority to Iso-Seq data for transcription start/end sites and priority to your short read RNAseq transcriptome for splice junctions. This means that when you are merging models between these two transcriptomes the final merged model will use the TSS/TES from the Iso-Seq data and the ESS/EES from the short read RNAseq data. The source for each feature prediction is included in the output files so you can see exactly what happened with each merging event.
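As a rough illustration of feature priority (this is not TAMA Merge's actual implementation), each feature of a merged model can be thought of as coming from the matching source with the best rank for that feature type. The function name `pick_feature`, the source names, and the coordinates below are all hypothetical.

```python
def pick_feature(candidates):
    """Return the (source, value) pair with the best priority rank.

    candidates is a list of (rank, source, value) tuples; rank 1 is the highest.
    """
    rank, source, value = min(candidates, key=lambda item: item[0])
    return source, value


# Hypothetical ranks: Iso-Seq is trusted for start sites,
# the short read annotation is trusted for splice junctions.
start_candidates = [(1, "isoseq", 219), (2, "shortread", 225)]
junction_candidates = [(2, "isoseq", (317, 1676)), (1, "shortread", (318, 1676))]

print(pick_feature(start_candidates))
print(pick_feature(junction_candidates))
```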
Manual
usage: tama_merge.py [-h] [-f F] [-p P] [-e E] [-a A] [-m M] [-z Z] [-d D]
This script merges transcriptomes.
optional arguments:
  -h, --help  show this help message and exit
  -f F        File list
  -p P        Output prefix
  -e E        Collapse exon ends flag: common_ends or longest_ends (Default is common_ends)
  -a A        5 prime threshold (Default is 10)
  -m M        Exon ends threshold/ splice junction threshold (Default is 10)
  -z Z        3 prime threshold (Default is 10)
  -d D        Flag for merging duplicate transcript groups (default no_merge quits when duplicates are found, merge_dup will merge duplicates)
Default command would look like this:
python tama_merge.py -f filelist.txt -p merged_annos
Detailed explanation of arguments:
-f filelist.txt
The filelist file contains the names of the files you want to merge as well as some additional information. The file should be formatted like this (tab separated; do not include the header line in the file):
file_name	cap_flag	merge_priority(start,junctions,end)	source_name
annotation_capped.bed	capped	1,1,1	cap_lib
annotation_nocap.bed	no_cap	2,1,1	nocap_lib
"cap_flag" can be one of two options "capped" or "no_cap". This represents whether the transcriptome start sites should be trusted or if transcripts should be merged into longer matching transcripts. If "no_cap" is selected for a dataset, the start priority will be placed at last regardless of what is set in the filelist file.
"merge_priority" designates the rank of the information from each source with respect to start site, splice junctions, and end sites. "1" is the highest rank. So in the example above the "capped" transcriptome will have a start site priority over the "no_cap" transcriptome.
"source_name" is used for the source information files to show where each prediction comes from.
-p P Output prefix
The output prefix is the prefix that will be used to name the output files.
-e E Collapse exon ends flag: common_ends or longest_ends
The collapse exon ends flag is used to determine whether an exon end feature should be chosen based on how common it is (common_ends) or if it makes the longest exon (longest_ends). Default is common_ends.
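The difference between the two modes can be sketched as follows. This is a simplified illustration under stated assumptions, not TAMA Merge's own code: it considers a single exon with a fixed start coordinate, the function name `choose_exon_end` and the coordinates are hypothetical, and ties are resolved in favour of the first candidate seen.

```python
from collections import Counter


def choose_exon_end(exon_start, candidate_ends, mode="common_ends"):
    """Pick one exon end from the candidates contributed by the matching models."""
    if mode == "common_ends":
        # Most frequently observed end coordinate.
        return Counter(candidate_ends).most_common(1)[0][0]
    if mode == "longest_ends":
        # End coordinate that gives the longest exon.
        return max(candidate_ends, key=lambda end: abs(end - exon_start))
    raise ValueError(f"unknown mode: {mode!r}")


ends = [200, 200, 205]  # hypothetical exon ends from three matching models
print(choose_exon_end(100, ends, "common_ends"))
print(choose_exon_end(100, ends, "longest_ends"))
```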
-a A 5 prime threshold
The 5 prime threshold is the amount of tolerance at the 5' end of the transcript for grouping reads to be collapsed.
-m M Exon ends threshold/ splice junction threshold
The Exon/Splice junction threshold is the amount of tolerance for the splice junctions of the transcript for grouping reads to be collapsed.
-z Z 3 prime threshold
The 3 prime threshold is the amount of tolerance for the 3' end of the transcript for grouping reads to be collapsed.
-d D Flag for merging duplicate transcript groups
Either no_merge (default) or merge_dup. This gives you the choice to merge duplicate groups where different transcripts in different groups happen to collapse to the same model. If no_merge is used and there is a duplicate, the program will exit early and not complete the run. You can also adjust the thresholds (increase allowances) to avoid duplicates.
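When running with non-default settings from a script, the flags documented above can be assembled into an argument list. This is a minimal sketch: the function name `build_tama_merge_command` and the example threshold values are illustrative only, and the script path is assumed to be `tama_merge.py` in the current directory.

```python
import shlex


def build_tama_merge_command(filelist, prefix, exon_ends="common_ends",
                             five_prime=10, splice=10, three_prime=10,
                             duplicates="no_merge"):
    """Build the argument list for tama_merge.py using the documented flags."""
    return [
        "python", "tama_merge.py",
        "-f", filelist,
        "-p", prefix,
        "-e", exon_ends,
        "-a", str(five_prime),
        "-m", str(splice),
        "-z", str(three_prime),
        "-d", duplicates,
    ]


# Illustrative values: looser 5' and 3' thresholds, duplicates merged instead of quitting.
command = build_tama_merge_command(
    "filelist.txt", "merged_annos",
    five_prime=50, three_prime=100, duplicates="merge_dup",
)
print(shlex.join(command))
```

The resulting list could then be passed to `subprocess.run(command, check=True)` to launch the merge.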
Outputs:
prefix.bed
prefix_gene_report.txt
prefix_merge.txt
prefix_trans_report.txt
Detailed explanation:
prefix.bed
This is the main merged annotation file. Transcripts are coloured according to the number of sources supporting each model:
1 = red
2 = orange
3 = yellow
4 = lime
5 = light turquoise
6 = light blue
7 = royal blue
8 = dark blue
9 = dark purple
10 = magenta
prefix_gene_report.txt
This contains a report of the genes from the merged file. "num_clusters" refers to the number of source transcripts that were used to make this gene model. "num_final_trans" refers to the number of transcripts in the final gene model. The format is as follows:
gene_id	num_clusters	num_final_trans	sources	chrom	start	end
G1	2	2	tissue1,tissue2	1	225	3214
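The gene report can be loaded for downstream filtering, for example to list genes supported by more than one source. This sketch assumes the file is tab-separated and begins with the header line shown above; the function name `read_gene_report` is hypothetical.

```python
import csv
import io


def read_gene_report(handle):
    """Parse a gene report into a list of dicts with typed fields."""
    genes = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["num_clusters"] = int(row["num_clusters"])
        row["num_final_trans"] = int(row["num_final_trans"])
        row["sources"] = row["sources"].split(",")
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        genes.append(row)
    return genes


# The example row from this section, held in memory instead of a file.
report = io.StringIO(
    "gene_id\tnum_clusters\tnum_final_trans\tsources\tchrom\tstart\tend\n"
    "G1\t2\t2\ttissue1,tissue2\t1\t225\t3214\n"
)
genes = read_gene_report(report)
shared = [gene["gene_id"] for gene in genes if len(gene["sources"]) > 1]
print(shared)
```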
prefix_merge.txt
This contains a bed12 format file which shows the coordinates of each input transcript matched to the merged transcript ID. I used the "txt" extension even though it is a bed file just to avoid confusion with the main bed file. You can use this file to map the final merged transcript models to their pre-merged supporting transcripts. The 1st subfield in the 4th column shows the final merged transcript ID while the 2nd subfield shows the pre-merged transcript ID with source prefix.
1 219 3261 G1.2;spleen_G1.1 40 + 219 3261 255,0,0 5 98,93,181,107,714 0,1457,1757,2132,2328
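The mapping from merged transcripts to their supporting transcripts can be pulled out of the 4th column. A minimal sketch, assuming whitespace-separated bed12 lines as in the example above; the function name `map_merged_to_sources` is hypothetical.

```python
from collections import defaultdict


def map_merged_to_sources(lines):
    """Map each merged transcript ID to the pre-merge transcript IDs behind it."""
    mapping = defaultdict(list)
    for line in lines:
        columns = line.split()
        if len(columns) < 4:
            continue  # skip blank or truncated lines
        # 1st subfield: merged transcript ID; 2nd: source-prefixed original ID.
        merged_id, source_id = columns[3].split(";", 1)
        mapping[merged_id].append(source_id)
    return dict(mapping)


example = [
    "1\t219\t3261\tG1.2;spleen_G1.1\t40\t+\t219\t3261\t255,0,0\t5\t98,93,181,107,714\t0,1457,1757,2132,2328",
]
print(map_merged_to_sources(example))
```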
prefix_trans_report.txt
This contains the source information for each merged transcript. The format is as follows:
transcript_id	num_clusters	sources	start_wobble_list	end_wobble_list	exon_start_support	exon_end_support
G2.1	1	newnormbrain	0	0	newnormbrain_G2.1	newnormbrain_G2.1