Input files

Dina edited this page Sep 24, 2019 · 7 revisions

How to import different files

  1. Importing a file containing the input genomes:

    1. Choose File->Import->Genomes File. If your dataset is large, this may take a few minutes.

    Sample input files are provided in the input directory in the installation folder

    2. The "Run" button should now be enabled. Click on this button to set the parameters.

    3. A progressBar appears. Hover over the question mark icon next to each parameter for an explanation of that parameter. After setting the parameters, click on "Run". This can take a few minutes, depending on the size of the dataset and on the parameters specified.

    4. After the process is done, the lower panel will contain all the discovered CSBs.

  2. Importing gene orthology group information:
    Load it by choosing File->Import->Orthology Information file. This information will be displayed on the lower right panel.

  3. Importing taxonomy information:
    Load it by choosing File->Import->Taxonomy file. This information will be displayed on the upper panel after choosing a specific CSB.

  4. CSB patterns file:
    If this file is provided, CSBs are no longer extracted from the input sequences. This file should contain the specific CSB patterns that the user is interested in finding in the input sequences.

  • This is an optional input text file
  • The path to this file is provided in:
    • User Interface: In the dialog opened after clicking on the "Run" button
    • Command Line: "--patterns" or "-p" option

Sample input files are located in the input directory of the installation folder. You can also download the following zip file:

Sample_input_files.zip

The above zip file contains three files, located inside a folder named 'input':

  • plasmid_genomes.fasta
    Plasmid dataset: 471 prokaryotic genomes with at least one plasmid; chromosomes were removed.
  • chromosomal_genomes.fasta
    Chromosomal dataset: 1,485 prokaryotic genomes with at least one chromosome; plasmids were removed.

    Important: this is a huge dataset. See the instructions below on how to run CSBFinder with a large dataset.

  • cog_info.txt
    Functional information of gene orthology groups

The input genomes file is a text/FASTA file containing all input genomes modeled as strings, where each character is an orthology group ID (for example, a COG ID) that has been assigned to the corresponding gene.

  • This is a mandatory input file
  • The path to this file is provided in:
    • User Interface: Load this file by choosing File->Import->Genomes File
    • Command Line: "-in" option

This file should use the following format:

>[genome name] | [ replicon name (e.g. plasmid or chromosome id)]
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information]
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information] 
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information] 
....

All replicons of the same genome should be consecutive, i.e.:

>genomeA|replicon1
....
>genomeA|replicon2
...
>genomeB|replicon1
...

Genes that do not belong to any gene orthology group should be marked as 'X'.

Example:

>Agrobacterium_H13_3_uid63403|NC_015183
COG1806	+
COG0424	+
COG0169	+
COG0237	+
COG0847	+
COG1952	-
COG3030	-
COG4395	+
COG2821	+
....
>Agrobacterium_H13_3_uid63403|NC_015508
X	+
X	+
COG1487	-
X	-
X	-
X	-
COG1525	-
X	+
COG2253	-
COG5340	-
....
>Agrobacterium_radiobacter_K84_uid58269|NC_011983
COG1192	+
COG1475	+
X	+
X	+
COG0715	+
COG0600	+
....
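
The format above can be checked before importing a file. The following is a minimal Python sketch, not part of CSBFinder: the function name parse_genomes and the returned structure are assumptions made for illustration. It reads the header and gene lines, ignores any additional tab-separated columns, and raises an error if the replicons of a genome are not consecutive.

```python
def parse_genomes(lines):
    """Parse genome-file lines into {genome: {replicon: [(group_id, strand), ...]}}.

    Hypothetical helper for illustration; it only mirrors the format
    described on this page.
    """
    genomes = {}
    current_genome = None
    genes = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: >[genome name]|[replicon name]
            genome, _, replicon = line[1:].partition("|")
            genome, replicon = genome.strip(), replicon.strip()
            if genome != current_genome:
                # A genome seen earlier must not reappear after another genome.
                if genome in genomes:
                    raise ValueError(f"replicons of {genome!r} are not consecutive")
                genomes[genome] = {}
                current_genome = genome
            genes = genomes[genome].setdefault(replicon, [])
            continue
        if genes is None:
            raise ValueError("gene line found before any '>' header")
        # Gene line: [group ID] TAB [strand] TAB [optional extra columns]
        fields = line.split("\t")
        if len(fields) < 2 or fields[1].strip() not in ("+", "-"):
            raise ValueError(f"malformed gene line: {raw!r}")
        genes.append((fields[0].strip(), fields[1].strip()))
    return genomes


sample = (
    ">Agrobacterium_H13_3_uid63403|NC_015183\n"
    "COG1806\t+\n"
    "COG1952\t-\n"
    ">Agrobacterium_H13_3_uid63403|NC_015508\n"
    "X\t+\n"
    ">Agrobacterium_radiobacter_K84_uid58269|NC_011983\n"
    "COG1192\t+\n"
)
genomes = parse_genomes(sample.splitlines())
print(len(genomes), "genomes parsed")
```

To check a real file, pass an open file handle instead of the sample lines, for example parse_genomes(open(path)), where path points to your genomes file.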

Assigning genes to orthologous group identifiers

You can annotate genes with any orthologous group identifiers. The IDs can be numbers or symbols; the only restriction is that each orthology group must have a unique ID.

Examples
  1. The STRING database contains COG and NOG annotations of many publicly available genomes
  2. Newly sequenced genomes can be mapped to known orthology groups such as:
  3. A tool such as Proteinortho detects orthologous genes within different species.
  4. The paper "New Tools in Orthology Analysis: A Brief Review of Promising Perspectives" by Bruno T. L. Nichio et al. reviews several current tools for gene orthology detection.

  • The gene orthology group information file is an optional input file
  • The path to this file is provided in:
    • User Interface: File->Import->Orthology Information file
    • Command Line: "-cog-info" option

COG information input file

If you are using COGs (Clusters of Orthologous Groups) as your gene orthology group identifiers, you can use the file cog_info.txt provided in the input directory in the installation folder (it can also be downloaded from here).

The functional description of gene orthology groups will appear in the legend (User Interface) or in the output catalog file (when clicking on the "Save" button in the User Interface, or when executing via Command Line).

You can also use a custom file of your own. See instructions below.

Custom gene orthology group information input file

This file should use the following format:

COGID;COG description;[COG functional categories separated by a comma (e.g. "E,H"); COG functional category description 1; COG functional category description 2;...;geneID]

The text inside the brackets [] is optional

Example

COG0318;Acyl-CoA synthetase (AMP-forming)/AMP-acid ligase II;I,Q;Lipid transport and metabolism;Secondary metabolites biosynthesis, transport and catabolism;CaiC;
COG0319;ssRNA-specific RNase YbeY, 16S rRNA maturation enzyme;J;Translation, ribosomal structure and biogenesis;YbeY;
COG0320;Lipoate synthase;H;Coenzyme transport and metabolism;LipA;
...
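
To illustrate how the semicolon-separated fields line up, here is a minimal Python sketch that splits one line of this file. It is not CSBFinder's own parser: the function name parse_orthology_line and the dictionary keys are hypothetical, and the sketch assumes one category description per listed category, followed by the gene ID, as in the example above.

```python
def parse_orthology_line(line):
    """Split one line of the orthology information file into named fields.

    Hypothetical helper; assumes one description per functional category,
    followed by an optional gene ID.
    """
    fields = line.strip().split(";")
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field produced by a trailing ';'
    if len(fields) < 2:
        raise ValueError(f"expected at least an ID and a description: {line!r}")
    record = {
        "id": fields[0],
        "description": fields[1],
        "categories": [],
        "category_descriptions": [],
        "gene": None,
    }
    if len(fields) > 2:
        # The optional part: categories, their descriptions, then the gene ID.
        categories = [c.strip() for c in fields[2].split(",") if c.strip()]
        rest = fields[3:]
        record["categories"] = categories
        record["category_descriptions"] = rest[:len(categories)]
        if len(rest) > len(categories):
            record["gene"] = rest[len(categories)]
    return record


record = parse_orthology_line(
    "COG0318;Acyl-CoA synthetase (AMP-forming)/AMP-acid ligase II;I,Q;"
    "Lipid transport and metabolism;"
    "Secondary metabolites biosynthesis, transport and catabolism;CaiC;"
)
print(record["id"], record["categories"], record["gene"])
```

Because the fields are split on semicolons only, a comma inside a category description (as in the second description above) does not break the parsing.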

The taxonomy file contains the taxonomic information of the input genomes.

This file should use the following format:

A header
genome name identical to the name in the input genomes file,kingdom,phylum,class,genus,species

An unknown classification should be indicated using a hyphen "-"

Example

genome,kingdom,phylum,class,genus,species
Acaryochloris_marina_MBIC11017_uid58167,Bacteria,Cyanobacteria,-,Acaryochloris,Acaryochloris_marina
Acetobacter_pasteurianus_IFO_3283_01_uid59279,Bacteria,Proteobacteria,Alphaproteobacteria,Acetobacter,Acetobacter_pasteurianus
Acetohalobium_arabaticum_DSM_5501_uid51423,Bacteria,Firmicutes,Clostridia,Acetohalobium,Acetohalobium_arabaticum
...
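
A file in this format can be read with Python's standard csv module. The sketch below is only an illustration: the function name read_taxonomy is hypothetical, the rank names are taken from whatever header the file provides, and a hyphen is mapped to None to mark an unknown classification.

```python
import csv
import io


def read_taxonomy(handle):
    """Read a taxonomy file into {genome name: {rank: value or None}}.

    Hypothetical helper; the first column is taken as the genome name and
    the remaining header columns as rank names.
    """
    reader = csv.reader(handle)
    header = next(reader)
    ranks = header[1:]
    taxonomy = {}
    for row in reader:
        if not row:
            continue
        # "-" marks an unknown classification
        taxonomy[row[0]] = {
            rank: (None if value == "-" else value)
            for rank, value in zip(ranks, row[1:])
        }
    return taxonomy


sample = io.StringIO(
    "genome,kingdom,phylum,class,genus,species\n"
    "Acaryochloris_marina_MBIC11017_uid58167,Bacteria,Cyanobacteria,-,"
    "Acaryochloris,Acaryochloris_marina\n"
)
taxonomy = read_taxonomy(sample)
print(taxonomy["Acaryochloris_marina_MBIC11017_uid58167"]["class"])
```

The genome names returned here can then be compared with the names in the input genomes file, since the two must be identical.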

If a CSB patterns file is provided, CSBs are no longer extracted from the input sequences. This file should contain the specific CSB patterns that the user is interested in finding in the input sequences.

  • This is an optional input text file
  • The path to this file is provided in:
    • User Interface: In the dialog opened after clicking on the "Run" button
    • Command Line: "--patterns" or "-p" option

This file should use the following format:

>[unique pattern ID, must be an integer]
[homology group IDs separated by commas]
>[unique pattern ID, must be an integer]
[homology group IDs separated by commas]

Example

>1
COG3736,COG3504,COG2948,COG0630
>564654
COG3736,COG3504,COG2948
....

If you are running without segmentation to directons, you should add a strand to each homology group ID, e.g. COG3736+,COG3504+,COG2948-,COG0630+
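
A patterns file can also be generated from a script. The following Python sketch is an illustration only (the function name format_patterns is hypothetical): it turns a dictionary of integer pattern IDs and homology group ID lists into text in the format described above. Dictionary keys are unique by construction, which satisfies the unique-ID requirement.

```python
def format_patterns(patterns):
    """Render {pattern_id: [group IDs]} as patterns-file text.

    Hypothetical helper; group IDs are written as given, so pass IDs
    such as "COG3736+" when a strand is required.
    """
    lines = []
    for pattern_id, group_ids in patterns.items():
        if not isinstance(pattern_id, int):
            raise ValueError(f"pattern ID must be an integer: {pattern_id!r}")
        lines.append(f">{pattern_id}")
        lines.append(",".join(group_ids))
    return "\n".join(lines) + "\n"


text = format_patterns({
    1: ["COG3736", "COG3504", "COG2948", "COG0630"],
    564654: ["COG3736", "COG3504", "COG2948"],
})
print(text, end="")
```

The resulting text can be written to a file and supplied through the "--patterns" or "-p" option, or through the dialog opened after clicking on the "Run" button.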