Tama Merge

GenomeRIK edited this page Feb 10, 2018 · 36 revisions

TAMA Merge

TAMA Merge is a tool that allows you to merge multiple transcriptomes while maintaining source information.

Detailed explanation of TAMA Merge:

Are you interested in:

  1. Combining your Iso-Seq data from different tissue types/library preps into a single transcriptome.
  2. Comparing your Iso-Seq data to the reference annotation (or short read RNAseq annotation). 
  3. Combining your Iso-Seq data with a short read RNAseq annotation and with the reference annotation.
  4. Doing any of the above while still maintaining source information.
  5. Doing any of the above with the power to define merging parameters.
  6. Comparing pipelines (use TAMA Merge on annotations made from the same dataset but using different pipelines).

If so, TAMA Merge is probably what you are looking for.

TAMA Merge takes multiple transcriptomes in bed12 format as input. It then compares the transcript models from each transcriptome and merges models based on the similarity of their features (transcription start/end sites and exon start/end sites). The output is a merged transcriptome in bed12 format along with other files containing source information.

Note that the input bed12 files must have the gene IDs and transcript IDs formatted as "gene_id;transcript_id" in the 4th column. The gene ID must be the first subfield and the subfields must be delimited with a semicolon (;).
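
As a rough illustration of the naming requirement above, the following Python sketch checks the 4th column of a bed12 line. The function names are hypothetical and are not part of TAMA. The check only enforces what is stated above: semicolon-delimited subfields with the gene ID first.

```python
def split_bed12_name(name_field):
    """Split a bed12 name field of the form 'gene_id;transcript_id'.

    Hypothetical helper, not part of TAMA. Extra subfields after the
    transcript ID are tolerated (an assumption).
    """
    subfields = name_field.split(";")
    if len(subfields) < 2 or not subfields[0] or not subfields[1]:
        raise ValueError(f"expected 'gene_id;transcript_id', got {name_field!r}")
    return subfields[0], subfields[1]


def check_bed12_line(line):
    """Return (gene_id, transcript_id) for one tab-separated bed12 line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 12:
        raise ValueError(f"expected 12 columns, got {len(columns)}")
    # The name field is the 4th column (index 3).
    return split_bed12_name(columns[3])
```

Running `check_bed12_line` over every line of an input file before merging is one way to catch badly formatted IDs early.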

You can define the threshold for transcription start/end sites (TSS/TES) and exon start/end sites (ESS/EES). You can also give priority to features from specific transcriptomes. For instance, you may want to give priority to Iso-Seq data for transcription start/end sites and priority to your short read RNAseq transcriptome for splice junctions. This means that when you are merging models between these two transcriptomes the final merged model will use the TSS/TES from the Iso-Seq data and the ESS/EES from the short read RNAseq data. The source for each feature prediction is included in the output files so you can see exactly what happened with each merging event.
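
The priority idea described above can be sketched in Python as follows. This is a simplified illustration and not TAMA's actual implementation: the data layout (dicts with "source", "exons", and "priority" keys), the function name, and the tie-breaking rule are all assumptions, and the two models are assumed to have already been judged similar enough to merge.

```python
def merge_by_priority(model_a, model_b):
    """Combine two matching transcript models feature by feature.

    Each model is a dict (hypothetical layout):
      "source":   source name
      "exons":    list of (start, end) tuples in transcript order
      "priority": (start, junctions, end) ranks, where 1 is the highest
    Both models must have the same number of exons.
    """
    def pick(index):
        # The lower rank number wins; ties go to model_a (an assumption).
        if model_a["priority"][index] <= model_b["priority"][index]:
            return model_a
        return model_b

    start_src, junction_src, end_src = pick(0), pick(1), pick(2)

    # Take every exon boundary from the junction source, then overwrite
    # the outermost coordinates with those of the start and end sources.
    exons = [list(exon) for exon in junction_src["exons"]]
    exons[0][0] = start_src["exons"][0][0]
    exons[-1][1] = end_src["exons"][-1][1]

    feature_sources = {
        "start": start_src["source"],
        "junctions": junction_src["source"],
        "end": end_src["source"],
    }
    return [tuple(exon) for exon in exons], feature_sources


isoseq = {"source": "isoseq", "exons": [(100, 200), (300, 400)], "priority": (1, 2, 1)}
rnaseq = {"source": "rnaseq", "exons": [(110, 198), (303, 395)], "priority": (2, 1, 2)}
merged_exons, feature_sources = merge_by_priority(isoseq, rnaseq)
# merged_exons is [(100, 198), (303, 400)]: outer ends from "isoseq",
# internal exon boundaries from "rnaseq".
```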

Manual

usage: tama_merge.py [-h] [-f F] [-p P] [-e E] [-a A] [-m M] [-z Z] [-d D]

This script merges transcriptomes.

optional arguments:

  -h, --help  show this help message and exit
  -f F        File list
  -p P        Output prefix
  -e E        Collapse exon ends flag: common_ends or longest_ends (Default is common_ends)
  -a A        5 prime threshold (Default is 10)
  -m M        Exon ends threshold/ splice junction threshold (Default is 10)
  -z Z        3 prime threshold (Default is 10)
  -d D        Flag for merging duplicate transcript groups (default no_merge quits when duplicates are found, merge_dup will merge duplicates)

Default command would look like this:

python tama_merge.py -f filelist.txt -p merged_annos

Detailed explanation of arguments:

-f filelist.txt

The filelist file contains the names of the files you want to merge as well as some additional information. The format of the file should be as follows (tab separated, do not include the header):

  file_name                cap_flag    merge_priority(start,junctions,end)    source_name
  annotation_capped.bed    capped      1,1,1                                  cap_lib
  annotation_nocap.bed     no_cap      2,1,1                                  nocap_lib

"cap_flag" must be either "capped" or "no_cap". It indicates whether the transcript start sites in that transcriptome should be trusted ("capped") or whether its transcripts may be merged into longer matching transcripts ("no_cap"). If "no_cap" is set for a dataset, its start sites are given the lowest priority regardless of the start priority set in the filelist file.

"merge_priority" sets the rank of the information from each source for start sites, splice junctions, and end sites, in that order. "1" is the highest rank. In the example above, the "capped" transcriptome has start site priority over the "no_cap" transcriptome.

"source_name" is used for the source information files to show where each prediction comes from.

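To check a filelist before running a merge, the format described above can be parsed with a short Python sketch like the one below. The `FilelistEntry` type and `parse_filelist` function are hypothetical names used only for illustration. The validation rules are taken from the description above and are not necessarily the checks TAMA itself performs.

```python
import csv
import io
from typing import NamedTuple


class FilelistEntry(NamedTuple):
    file_name: str
    cap_flag: str
    start_priority: int
    junction_priority: int
    end_priority: int
    source_name: str


def parse_filelist(handle):
    """Parse a tab-separated filelist with four columns and no header."""
    entries = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 4:
            raise ValueError(f"expected 4 tab-separated columns, got {row!r}")
        file_name, cap_flag, priority, source_name = row
        if cap_flag not in ("capped", "no_cap"):
            raise ValueError(f"cap_flag must be 'capped' or 'no_cap', got {cap_flag!r}")
        ranks = priority.split(",")
        if len(ranks) != 3:
            raise ValueError(f"merge_priority needs 3 values, got {priority!r}")
        start, junctions, end = (int(rank) for rank in ranks)
        entries.append(
            FilelistEntry(file_name, cap_flag, start, junctions, end, source_name)
        )
    return entries


example = (
    "annotation_capped.bed\tcapped\t1,1,1\tcap_lib\n"
    "annotation_nocap.bed\tno_cap\t2,1,1\tnocap_lib\n"
)
for entry in parse_filelist(io.StringIO(example)):
    print(entry.source_name, entry.cap_flag, entry.start_priority)
```
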
-p P Output prefix

The output prefix is the prefix that will be used to name the output files.

-e E Collapse exon ends flag: common_ends or longest_ends

The collapse exon ends flag is used to determine whether an exon end feature should be chosen based on how common it is (common_ends) or if it makes the longest exon (longest_ends). Default is common_ends.
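
The difference between the two options can be illustrated with a small Python sketch. This is only an interpretation of the description above, not TAMA's code: the function name is hypothetical, the candidates are assumed to be the downstream boundary of an exon on the plus strand (so a larger coordinate gives a longer exon), and the tie-breaking rule is an assumption.

```python
from collections import Counter


def choose_exon_end(candidates, mode="common_ends"):
    """Pick one exon end coordinate from the candidates of several models.

    Assumes plus-strand downstream exon boundaries, so that a larger
    coordinate means a longer exon (an assumption for this sketch).
    """
    if not candidates:
        raise ValueError("no candidate coordinates given")
    if mode == "common_ends":
        counts = Counter(candidates)
        best_count = max(counts.values())
        # Ties are broken by taking the largest coordinate (an assumption).
        return max(pos for pos, count in counts.items() if count == best_count)
    if mode == "longest_ends":
        return max(candidates)
    raise ValueError(f"unknown mode: {mode!r}")


ends = [1200, 1200, 1204]
print(choose_exon_end(ends, "common_ends"))   # 1200, seen twice
print(choose_exon_end(ends, "longest_ends"))  # 1204, gives the longest exon
```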

-a A 5 prime threshold

The 5 prime threshold is the amount of tolerance at the 5' end of the transcript when grouping transcript models to be merged.

-m M Exon ends threshold/ splice junction threshold

The exon ends/splice junction threshold is the amount of tolerance at the splice junctions of the transcript when grouping transcript models to be merged.

-z Z 3 prime threshold

The 3 prime threshold is the amount of tolerance at the 3' end of the transcript when grouping transcript models to be merged.
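
To make the role of the three thresholds concrete, here is a minimal Python sketch of a tolerance-based comparison between two transcript models. It is not TAMA's matching algorithm. It assumes plus-strand models given as lists of (start, end) exon tuples, treats the thresholds as inclusive, and ignores the special handling of "no_cap" sources. The function name is hypothetical.

```python
def models_match(exons_a, exons_b, five_prime=10, junction=10, three_prime=10):
    """Return True if two plus-strand models agree within the thresholds.

    exons_a, exons_b: lists of (start, end) tuples in transcript order.
    The defaults mirror the documented defaults of -a, -m and -z.
    """
    if len(exons_a) != len(exons_b):
        return False
    # Transcript start (5' end on the plus strand).
    if abs(exons_a[0][0] - exons_b[0][0]) > five_prime:
        return False
    # Transcript end (3' end on the plus strand).
    if abs(exons_a[-1][1] - exons_b[-1][1]) > three_prime:
        return False
    # Internal exon boundaries, i.e. both sides of every splice junction.
    for i in range(len(exons_a) - 1):
        if abs(exons_a[i][1] - exons_b[i][1]) > junction:
            return False
        if abs(exons_a[i + 1][0] - exons_b[i + 1][0]) > junction:
            return False
    return True


model_1 = [(100, 200), (300, 400)]
model_2 = [(105, 200), (302, 400)]
print(models_match(model_1, model_2))                # True with the defaults
print(models_match(model_1, model_2, five_prime=3))  # False: starts differ by 5
```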

-d D Flag for merging duplicate transcript groups

Either no_merge (default) or merge_dup. This gives you the choice to merge duplicate groups where different transcripts in different groups happen to collapse to the same model. If no_merge is used and there is a duplicate, the program will exit early and not complete the run. You can also adjust the thresholds (increase allowances) to avoid duplicates.

Outputs:

  prefix.bed
  prefix_gene_report.txt
  prefix_merge.txt
  prefix_trans_report.txt

Detailed explanation:

prefix.bed

This is the main merged annotation file.

prefix_gene_report.txt

This contains a report of the genes from the merged file. The format is as follows:

  gene_id num_clusters    num_final_trans sources chrom   start   end
  G1      2       2       tissue1,tissue2        1       225     3214
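
As an example of using this report, the Python sketch below lists genes that are supported by more than one source. It assumes the file is tab separated and starts with the header line shown above. The function name is hypothetical.

```python
import csv
import io


def read_gene_report(handle):
    """Read a gene report into a list of dicts with typed fields.

    Assumes a tab-separated file whose first line is the header.
    """
    genes = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["sources"] = row["sources"].split(",")
        for key in ("num_clusters", "num_final_trans", "start", "end"):
            row[key] = int(row[key])
        genes.append(row)
    return genes


example = (
    "gene_id\tnum_clusters\tnum_final_trans\tsources\tchrom\tstart\tend\n"
    "G1\t2\t2\ttissue1,tissue2\t1\t225\t3214\n"
)
genes = read_gene_report(io.StringIO(example))
multi_source = [gene["gene_id"] for gene in genes if len(gene["sources"]) > 1]
print(multi_source)  # ['G1']
```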

prefix_merge.txt

This is a bed12 format file which shows the coordinates of each input transcript matched to its merged transcript ID. I used the "txt" extension even though it is a bed file just to avoid confusion with the main bed file. You can use this file to map the final merged transcript models to their pre-merged supporting transcripts. The 1st subfield in the 4th column shows the final merged transcript ID, while the 2nd subfield shows the pre-merged transcript ID with its source prefix.

  1       219     3261    G1.2;spleen_G1.1        40      +       219     3261    255,0,0 5       98,93,181,107,714       0,1457,1757,2132,2328
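
The mapping described above can be built with a few lines of Python. This sketch assumes the file is tab separated, as bed files normally are. The function name is hypothetical.

```python
from collections import defaultdict


def map_merged_to_sources(lines):
    """Map each merged transcript ID to its pre-merged transcript IDs.

    lines: an iterable of tab-separated bed12 lines from prefix_merge.txt.
    """
    mapping = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        columns = line.split("\t")
        # 4th column: "merged_transcript_id;source_transcript_id"
        merged_id, source_id = columns[3].split(";", 1)
        mapping[merged_id].append(source_id)
    return dict(mapping)


example = [
    "1\t219\t3261\tG1.2;spleen_G1.1\t40\t+\t219\t3261\t255,0,0\t5\t"
    "98,93,181,107,714\t0,1457,1757,2132,2328\n"
]
print(map_merged_to_sources(example))  # {'G1.2': ['spleen_G1.1']}
```

With a real file, passing the open file handle directly (for example `map_merged_to_sources(open("prefix_merge.txt"))`) works the same way, since the function only iterates over lines.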

prefix_trans_report.txt

This contains the source information for each merged transcript. The format is as follows:

  transcript_id   num_clusters    sources start_wobble_list       end_wobble_list exon_start_support      exon_end_support
  G2.1    1       newnormbrain    0       0       newnormbrain_G2.1       newnormbrain_G2.1