Tama Merge

GenomeRIK edited this page Aug 2, 2017 · 36 revisions

TAMA Merge

TAMA Merge is a tool that allows you to merge multiple transcriptomes while maintaining source information.

Detailed explanation of TAMA Merge:

Are you interested in:

1. Combining your Iso-Seq data from different tissue types/library preps into a single transcriptome.
2. Comparing your Iso-Seq data to the reference annotation (or short read RNAseq annotation).
3. Combining your Iso-Seq data with a short read RNAseq annotation and with the reference annotation.
4. Doing any of the above while still maintaining source information.
5. Doing any of the above with the power to define merging parameters.

If so, TAMA Merge is probably what you are looking for.

TAMA Merge takes multiple transcriptomes in bed12 format as input. It compares the transcript models from each transcriptome and merges models based on the similarity of their features (transcription start/end sites and exon start/end sites). The output is a merged transcriptome in bed12 format, along with other files containing source information.
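
Because both the inputs and the main output are bed12, it can help to see how the features being compared are encoded. The following is a minimal sketch that reads one bed12 line using the standard BED column layout; the `Bed12Record` type, the `parse_bed12_line` helper, and the example line are hypothetical and are not part of TAMA.

```python
from typing import List, NamedTuple, Tuple


class Bed12Record(NamedTuple):
    chrom: str
    start: int                    # 0-based transcript start
    end: int                      # transcript end (exclusive)
    name: str
    strand: str
    exons: List[Tuple[int, int]]  # absolute (start, end) of each exon


def parse_bed12_line(line: str) -> Bed12Record:
    """Parse one tab-separated bed12 line into transcript and exon coordinates."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
    strand = fields[5]
    block_count = int(fields[9])
    # blockSizes and blockStarts are comma-separated and may carry a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != block_count or len(offsets) != block_count:
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # Block starts are relative to the transcript start.
    exons = [(start + off, start + off + size) for off, size in zip(offsets, sizes)]
    return Bed12Record(chrom, start, end, name, strand, exons)


# Made-up two-exon transcript, for illustration only.
example = "1\t225\t3214\ttx1\t40\t+\t225\t3214\t255,0,0\t2\t100,200\t0,2789"
record = parse_bed12_line(example)
print(record.exons)
```

The exon boundaries recovered this way are the start/end sites and splice junctions that the merge thresholds described below apply to.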

Manual

usage: tama_merge.py [-h] [-f] [-p] [-e] [-a] [-m] [-z]

This script merges transcriptomes.

optional arguments:

  -h, --help  show this help message and exit
  -f F        File list
  -p P        Output prefix
  -e E        Collapse exon ends flag: common_ends or longest_ends (Default is common_ends)
  -a A        5 prime threshold (Default is 10)
  -m M        Exon ends threshold/ splice junction threshold (Default is 10)
  -z Z        3 prime threshold (Default is 10)

Default command would look like this:

python tama_merge.py -f filelist.txt -p merged_annos
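
If you drive TAMA Merge from a pipeline, the command can be assembled programmatically. This is a sketch only: `build_tama_merge_command` is a hypothetical helper, it uses only the flags listed in the manual above, and it assumes `tama_merge.py` is in the working directory and `python` is on the PATH.

```python
from typing import List, Optional

VALID_EXON_END_MODES = ("common_ends", "longest_ends")


def build_tama_merge_command(
    filelist: str,
    prefix: str,
    exon_ends: Optional[str] = None,
    five_prime: Optional[int] = None,
    splice_junction: Optional[int] = None,
    three_prime: Optional[int] = None,
) -> List[str]:
    """Assemble the argument list for tama_merge.py from the documented flags."""
    cmd = ["python", "tama_merge.py", "-f", filelist, "-p", prefix]
    if exon_ends is not None:
        if exon_ends not in VALID_EXON_END_MODES:
            raise ValueError(f"-e must be one of {VALID_EXON_END_MODES}")
        cmd += ["-e", exon_ends]
    # Thresholds are added only when set, so the tool's own defaults apply otherwise.
    for flag, value in (("-a", five_prime), ("-m", splice_junction), ("-z", three_prime)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


command = build_tama_merge_command("filelist.txt", "merged_annos", three_prime=20)
print(" ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` to launch the merge.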

Detailed explanation of arguments:

-f filelist.txt

The filelist file contains the names of the files you want to merge, along with some additional information for each file. The file should be tab separated, without a header line, and formatted like this:

  file_name    cap_flag    merge_priority(start,junctions,end)    source_name
  annotation_capped.bed        capped  1,1,1   cap_lib
  annotation_nocap.bed        no_cap  2,1,1   nocap_lib

"cap_flag" must be either "capped" or "no_cap". Use "capped" when the start sites of the transcriptome should be trusted, and "no_cap" when its transcripts may be merged into longer matching transcripts.

"merge_priority" ranks the information from each source for start sites, splice junctions, and end sites, in that order. "1" is the highest rank, so in the example above the "capped" transcriptome takes start site priority over the "no_cap" transcriptome.

"source_name" is used for the source information files to show where each prediction comes from.

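The filelist is easy to get wrong by hand (spaces instead of tabs, a misspelled cap flag), so a small generator can help. This is a minimal sketch, not part of TAMA: `format_filelist` is a hypothetical helper, and the check that every priority rank is a positive integer is an assumption based on "1" being the highest rank.

```python
from typing import Iterable, Tuple

VALID_CAP_FLAGS = ("capped", "no_cap")

# (file_name, cap_flag, (start, junctions, end) priority, source_name)
Entry = Tuple[str, str, Tuple[int, int, int], str]


def format_filelist(entries: Iterable[Entry]) -> str:
    """Return tab-separated filelist text, one line per transcriptome, no header."""
    lines = []
    for file_name, cap_flag, priority, source_name in entries:
        if cap_flag not in VALID_CAP_FLAGS:
            raise ValueError(f"cap_flag must be one of {VALID_CAP_FLAGS}: {cap_flag!r}")
        # Assumption: ranks are positive integers, with 1 as the highest rank.
        if len(priority) != 3 or any(rank < 1 for rank in priority):
            raise ValueError(f"merge_priority needs three ranks >= 1: {priority!r}")
        priority_text = ",".join(str(rank) for rank in priority)
        lines.append("\t".join([file_name, cap_flag, priority_text, source_name]))
    return "\n".join(lines) + "\n"


filelist_text = format_filelist([
    ("annotation_capped.bed", "capped", (1, 1, 1), "cap_lib"),
    ("annotation_nocap.bed", "no_cap", (2, 1, 1), "nocap_lib"),
])
print(filelist_text, end="")
```

Writing `filelist_text` to `filelist.txt` reproduces the example filelist shown above.
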
-p P Output prefix

The output prefix is the prefix that will be used to name the output files.

-e E Collapse exon ends flag: common_ends or longest_ends

The collapse exon ends flag is used to determine whether an exon end feature should be chosen based on how common it is (common_ends) or if it makes the longest exon (longest_ends). Default is common_ends.
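
The difference between the two modes can be illustrated with a toy example. This is a sketch of the idea as described above, not TAMA's actual implementation: `choose_exon_end` is a hypothetical function, the `extends_right` parameter is an illustrative simplification, and the tie-breaking rule for `common_ends` (first position seen wins) is an assumption.

```python
from collections import Counter
from typing import Sequence


def choose_exon_end(
    candidates: Sequence[int],
    mode: str = "common_ends",
    extends_right: bool = True,
) -> int:
    """Pick one exon boundary from the positions observed across merged models."""
    if not candidates:
        raise ValueError("need at least one candidate position")
    if mode == "common_ends":
        # Most frequent position; ties resolve to the first one seen (an assumption).
        return Counter(candidates).most_common(1)[0][0]
    if mode == "longest_ends":
        # The boundary that makes the exon longest: furthest right for a right-hand
        # boundary, furthest left for a left-hand boundary.
        return max(candidates) if extends_right else min(candidates)
    raise ValueError(f"unknown mode: {mode!r}")


observed = [1000, 1000, 1005]
print(choose_exon_end(observed, "common_ends"))
print(choose_exon_end(observed, "longest_ends"))
```

With two models ending at 1000 and one at 1005, `common_ends` keeps 1000, while `longest_ends` keeps 1005.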

-a A 5 prime threshold

The 5 prime threshold is the amount of tolerance at the 5' end of the transcript when grouping transcript models to be merged.

-m M Exon ends threshold/ splice junction threshold

The exon ends/splice junction threshold is the amount of tolerance at the splice junctions of the transcript when grouping transcript models to be merged.

-z Z 3 prime threshold

The 3 prime threshold is the amount of tolerance at the 3' end of the transcript when grouping transcript models to be merged.
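
The three thresholds can be pictured as independent tolerances on matching features. The sketch below rests on several assumptions: that the thresholds are coordinate differences in bases, that a difference equal to the threshold still counts as a match, and that both models have the same number of splice junctions. It also ignores the cap_flag behaviour, where "no_cap" models may be merged into longer ones. `transcripts_match` is a hypothetical function, not TAMA code.

```python
from typing import Sequence


def within(a: int, b: int, tolerance: int) -> bool:
    """True when two coordinates differ by no more than the tolerance."""
    return abs(a - b) <= tolerance


def transcripts_match(
    five_a: int, five_b: int,
    junctions_a: Sequence[int], junctions_b: Sequence[int],
    three_a: int, three_b: int,
    a: int = 10, m: int = 10, z: int = 10,
) -> bool:
    """Compare two transcript models feature by feature using -a, -m and -z."""
    if len(junctions_a) != len(junctions_b):
        return False
    return (
        within(five_a, five_b, a)
        and all(within(x, y, m) for x, y in zip(junctions_a, junctions_b))
        and within(three_a, three_b, z)
    )


# The 5' ends differ by 8, one junction by 3, and the 3' ends by 5.
print(transcripts_match(100, 108, [500, 900], [503, 900], 2000, 1995))
print(transcripts_match(100, 108, [500, 900], [503, 900], 2000, 1995, a=5))
```

Under these assumptions the two models match with the default thresholds of 10, but not once the 5 prime threshold is tightened to 5.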

Outputs:

  prefix.bed
  prefix_gene_report.txt
  prefix_merge.txt
  prefix_trans_report.txt

Detailed explanation:

prefix.bed

This is the main merged annotation file.

prefix_gene_report.txt

This contains a report of the genes from the merged file. The format is as follows:

  gene_id num_clusters    num_final_trans sources chrom   start   end
  G1      2       2       tissue1,tissue2        1       225     3214
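
For downstream analysis, the gene report can be loaded with the standard `csv` module. This sketch assumes the file is tab separated with the header line shown above and that multiple sources are comma separated; `read_gene_report` is a hypothetical helper, and the sample text mirrors the example row.

```python
import csv
import io
from typing import Dict, List


def read_gene_report(handle) -> List[Dict[str, object]]:
    """Parse a gene report into one dict per gene, with typed values."""
    genes = []
    for row in csv.DictReader(handle, delimiter="\t"):
        genes.append({
            "gene_id": row["gene_id"],
            "num_clusters": int(row["num_clusters"]),
            "num_final_trans": int(row["num_final_trans"]),
            "sources": row["sources"].split(","),
            "chrom": row["chrom"],
            "start": int(row["start"]),
            "end": int(row["end"]),
        })
    return genes


sample = (
    "gene_id\tnum_clusters\tnum_final_trans\tsources\tchrom\tstart\tend\n"
    "G1\t2\t2\ttissue1,tissue2\t1\t225\t3214\n"
)
genes = read_gene_report(io.StringIO(sample))
print(genes[0]["gene_id"], genes[0]["sources"])
```

For a real run, replace the `io.StringIO` sample with `open("prefix_gene_report.txt")`, using your own output prefix.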

prefix_merge.txt

This is a bed12 format file that shows the coordinates of each input transcript matched to its merged transcript ID.

prefix_trans_report.txt

This contains the source information for each merged transcript. The format is as follows:

  transcript_id   num_clusters    sources start_wobble_list       end_wobble_list exon_start_support      exon_end_support
  G2.1    1       newnormbrain    0       0       newnormbrain_G2.1       newnormbrain_G2.1
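
A common use of the transcript report is to count how many merged transcripts each source supports. The sketch below assumes a tab-separated file with the header shown above, and assumes that multiple sources in the "sources" column are comma separated, as in the gene report example. `count_transcripts_per_source` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_transcripts_per_source(handle) -> Counter:
    """Count merged transcripts supported by each source."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        # A transcript supported by several sources is counted once per source.
        for source in row["sources"].split(","):
            counts[source] += 1
    return counts


sample = (
    "transcript_id\tnum_clusters\tsources\tstart_wobble_list\tend_wobble_list"
    "\texon_start_support\texon_end_support\n"
    "G2.1\t1\tnewnormbrain\t0\t0\tnewnormbrain_G2.1\tnewnormbrain_G2.1\n"
)
counts = count_transcripts_per_source(io.StringIO(sample))
print(dict(counts))
```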