Skip to content

Latest commit

 

History

History
165 lines (122 loc) · 6.77 KB

gff3_sort.md

File metadata and controls

165 lines (122 loc) · 6.77 KB

gff3_sort readme

Sort features in a gff3 file by according to their order on a scaffold, their coordinates on a scaffold, and parent-child relationships.

Inputs:

  1. GFF3 file: Specify the file name with the -g argument

Outputs:

  1. Sorted GFF3 file: Specify the file name with the -og argument
    • All related features (with parent-child relationships) are separated by ### directives for easier downstream parsing

Usage:

  1. Specify the input, output file names and options using short arguments:
    • gff3_sort -g example_file/example.gff3 -og example_file/example_sorted.gff
  2. Specify the input, output file names and options using long arguments:
    • gff3_sort --gff_file example_file/example.gff3 --output_gff example_file/example_sorted.gff

Optional arguments:

  1. -h, --help
    • show this help message and exit
  2. -g GFF_FILE, --gff_file GFF_FILE
    • GFF3 file that you would like to sort.
  3. -og OUTPUT_GFF, --output_gff OUTPUT_GFF
    • Sorted GFF3 file
  4. -t, SORT_TEMPLATE, --sort_template SORT_TEMPLATE
    • A file that indicates the sorting order of features within a gene model
  5. -i, --isoform_sort
    • Sort multi-isoform gene models by feature type (default: False)
  6. -v, --version
    • show program's version number and exit
  7. -r, --reference
    • Sort scaffold (seqID) by order of appearance in gff3 file (default is by number)

Example:

Sort gff3 file without a sort template file

  • example command:

gff3_sort --gff_file example.gff3 --output_gff example_sort.gff3

  • Input gff3 file:
LGIB01000001.1  Gnomon  gene    52056   58768   .       +       .       ID=gene1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna1;Parent=gene1
LGIB01000001.1  Gnomon  CDS     52056   52096   .       +       0       ID=cds1;Parent=rna1
LGIB01000001.1  Gnomon  exon    52056   52096   .       +       .       ID=id4;Parent=rna1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna2;Parent=gene1
LGIB01000001.1  Gnomon  CDS     52100   53000   .       +       0       ID=cds2;Parent=rna2
LGIB01000001.1  Gnomon  exon    52056   53000   .       +       .       ID=id19;Parent=rna2
  • Output gff3 file:
LGIB01000001.1  Gnomon  gene    52056   58768   .       +       .       ID=gene1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna1;Parent=gene1
LGIB01000001.1  Gnomon  exon    52056   52096   .       +       .       ID=id4;Parent=rna1
LGIB01000001.1  Gnomon  CDS     52056   52096   .       +       0       ID=cds1;Parent=rna1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna2;Parent=gene1
LGIB01000001.1  Gnomon  exon    52056   53000   .       +       .       ID=id19;Parent=rna2
LGIB01000001.1  Gnomon  CDS     52100   53000   .       +       0       ID=cds2;Parent=rna2

Sort gff3 file with a sort template file

  • sort template file: A file that indicates the sorting order of features within a gene model. Feature type with the same sorting order should be in the same line and split by space.
gene pseudogene
mRNA
exon
CDS

Sort gff3 file without --isoform_sort

  • example command:

gff3_sort --gff_file example.gff3 --sort_template sort_template.txt --output_gff example_sort.gff3

  • Output gff3 file:
LGIB01000001.1  Gnomon  gene    52056   58768   .       +       .       ID=gene1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna1;Parent=gene1
LGIB01000001.1  Gnomon  exon    52056   52096   .       +       .       ID=id4;Parent=rna1
LGIB01000001.1  Gnomon  CDS     52056   52096   .       +       0       ID=cds1;Parent=rna1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna2;Parent=gene1
LGIB01000001.1  Gnomon  exon    52056   53000   .       +       .       ID=id19;Parent=rna2
LGIB01000001.1  Gnomon  CDS     52100   53000   .       +       0       ID=cds2;Parent=rna2

Note:

If not all the feature type are documented in the sort template file. gff3_sort will sort features by level(1st-level, 2nd-level, and etc) and then by the order in sort template file.

  • sort template file:
gene pseudogene
CDS
  • Output gff3 file:
LGIB01000001.1  Gnomon  gene    52056   58768   .       +       .       ID=gene1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna1;Parent=gene1
LGIB01000001.1  Gnomon  CDS     52056   52096   .       +       0       ID=cds1;Parent=rna1
LGIB01000001.1  Gnomon  exon    52056   52096   .       +       .       ID=id4;Parent=rna1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna2;Parent=gene1
LGIB01000001.1  Gnomon  CDS     52100   53000   .       +       0       ID=cds2;Parent=rna2
LGIB01000001.1  Gnomon  exon    52056   53000   .       +       .       ID=id19;Parent=rna2

Sort gff3 file with --isoform_sort

  • example command:

gff3_sort --gff_file example.gff3 --sort_template sort_template.txt --isoform_sort --output_gff example_sort.gff3

  • Output gff3 file:
LGIB01000001.1  Gnomon  gene    52056   58768   .       +       .       ID=gene1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna1;Parent=gene1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna2;Parent=gene1
LGIB01000001.1  Gnomon  exon    52056   53000   .       +       .       ID=id19;Parent=rna2
LGIB01000001.1  Gnomon  exon    52056   52096   .       +       .       ID=id4;Parent=rna1
LGIB01000001.1  Gnomon  CDS     52056   52096   .       +       0       ID=cds1;Parent=rna1
LGIB01000001.1  Gnomon  CDS     52100   53000   .       +       0       ID=cds2;Parent=rna2

Note:

If not all the feature type are documented in the sort template file. gff3_sort will sort features by the order in sort template file and then by level(1st-level, 2nd-level, and etc).

  • sort template file:
gene pseudogene
CDS
  • Output gff3 file:
LGIB01000001.1  Gnomon  gene    52056   58768   .       +       .       ID=gene1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna1;Parent=gene1
LGIB01000001.1  Gnomon  CDS     52056   52096   .       +       0       ID=cds1;Parent=rna1
LGIB01000001.1  Gnomon  exon    52056   52096   .       +       .       ID=id4;Parent=rna1
LGIB01000001.1  Gnomon  mRNA    52056   58768   .       +       .       ID=rna2;Parent=gene1
LGIB01000001.1  Gnomon  CDS     52100   53000   .       +       0       ID=cds2;Parent=rna2
LGIB01000001.1  Gnomon  exon    52056   53000   .       +       .       ID=id19;Parent=rna2

Assumptions:

  1. Any features without a Parent attribute are 'root' features - the program will insert directives (lines beginning with ##) above these features.
  2. All child features occur after their respective Parent feature, but before new Parent features.