Input files
- Importing a file containing the input genomes:
  - Choose File->Import->Genomes File. If your dataset is large, this may take a few minutes. Sample input files are provided in the input directory in the installation folder.
  - The "Run" button should now be enabled. Click on this button to set the parameters.
  - Hover over the question mark icon next to each parameter for an explanation of that parameter. After setting the parameters, click on "Run". A progress bar appears. The run can take a few minutes, depending on the size of the dataset and on the parameters specified.
  - After the process is done, the lower panel will contain all the discovered CSBs.
- Importing gene orthology group information:
  Load it by choosing File->Import->Orthology Information file. This information will be displayed on the lower right panel.
- Importing taxonomy information:
  Load it by choosing File->Import->Taxonomy file. This information will be displayed on the upper panel after choosing a specific CSB.
- CSB patterns file:
  If this file is provided, CSBs are no longer extracted from the input sequences. Instead, the file should contain the specific CSB patterns that the user wants to find in the input sequences.
  - This is an optional input text file
  - The path to this file should be provided using:
    - User Interface: the dialog opened after clicking on the "Run" button
    - Command Line: "--patterns" or "-p" option
Sample input files are located in the input directory of the installation folder. You can also download the following zip file:
The above zip file contains three files, located inside a folder named 'input':
- plasmid_genomes.fasta
  Plasmid dataset: 471 prokaryotic genomes with at least one plasmid; chromosomes were removed.
- chromosomal_genomes.fasta
  Chromosomal dataset: 1,485 prokaryotic genomes with at least one chromosome; plasmids were removed. Important: this is a huge dataset. See the instructions below on how to run CSBFinder with a large dataset.
- cog_info.txt
  Functional information of gene orthology groups.
The input genomes file is a text/FASTA file containing all input genomes modeled as strings, where each character is an orthology group ID (for example, a COG ID) that has been assigned to a corresponding gene.
- This is a mandatory input file
- The path to this file is provided in:
  - User Interface: Load this file by choosing File->Import->Genomes File
  - Command Line: "-in" option
This file should use the following format:
>[genome name] | [ replicon name (e.g. plasmid or chromosome id)]
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information]
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information]
[homology group ID] TAB [Strand (+ or -)] TAB [you can add additional information]
....
All replicons of the same genome should be consecutive, i.e.:
>genomeA|replicon1
....
>genomeA|replicon2
...
>genomeB|replicon1
...
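The consecutiveness rule above can be checked before loading a file. The following is a minimal sketch, not part of CSBFinder; the function name `find_split_genomes` is hypothetical, and it assumes every header line starts with `>` and separates the genome name from the replicon name with `|`.

```python
def find_split_genomes(lines):
    """Return genome names whose replicons are not listed consecutively."""
    finished = set()   # genomes whose block of replicons has already ended
    current = None     # genome named in the most recent header line
    split = []
    for line in lines:
        if not line.startswith(">"):
            continue   # gene lines are irrelevant for this check
        genome = line[1:].split("|", 1)[0].strip()
        if genome != current:
            # A genome that reappears after its block ended violates the rule.
            if genome in finished and genome not in split:
                split.append(genome)
            if current is not None:
                finished.add(current)
            current = genome
    return split


sample = [
    ">genomeA|replicon1", "COG0001\t+",
    ">genomeB|replicon1", "COG0002\t-",
    ">genomeA|replicon2", "COG0003\t+",
]
print(find_split_genomes(sample))
```

In this sample, genomeA is reported because its second replicon appears after genomeB.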
Genes that do not belong to any gene orthology group should be marked as 'X'.
>Agrobacterium_H13_3_uid63403|NC_015183
COG1806 +
COG0424 +
COG0169 +
COG0237 +
COG0847 +
COG1952 -
COG3030 -
COG4395 +
COG2821 +
....
>Agrobacterium_H13_3_uid63403|NC_015508
X +
X +
COG1487 -
X -
X -
X -
COG1525 -
X +
COG2253 -
COG5340 -
....
>Agrobacterium_radiobacter_K84_uid58269|NC_011983
COG1192 +
COG1475 +
X +
X +
COG0715 +
COG0600 +
....
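To illustrate how the format above is structured, here is a minimal reading sketch. It is not CSBFinder's own parser; the function name `parse_genomes` is hypothetical, and it assumes the fields on each gene line are separated by a TAB, as the format description states, and that the file contains no elision lines such as "....".

```python
def parse_genomes(lines):
    """Map genome name -> replicon name -> list of (group ID, strand)."""
    genomes = {}
    genes = None  # gene list of the replicon currently being read
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(">"):
            genome, _, replicon = line[1:].partition("|")
            replicons = genomes.setdefault(genome.strip(), {})
            genes = replicons.setdefault(replicon.strip(), [])
            continue
        if genes is None:
            raise ValueError(f"line {number}: gene found before any header")
        fields = line.split("\t")
        if len(fields) < 2 or fields[1].strip() not in ("+", "-"):
            raise ValueError(f"line {number}: expected 'ID<TAB>strand'")
        # Any fields after the strand are free-form and ignored here.
        genes.append((fields[0].strip(), fields[1].strip()))
    return genomes


sample = """\
>Agrobacterium_H13_3_uid63403|NC_015183
COG1806\t+
COG1952\t-
>Agrobacterium_H13_3_uid63403|NC_015508
X\t+
COG1487\t-\tany extra information
"""
genomes = parse_genomes(sample.splitlines())
for genome, replicons in genomes.items():
    for replicon, gene_list in replicons.items():
        print(genome, replicon, gene_list)
```

Genes marked 'X' are kept as ordinary entries so that gene positions within a replicon are preserved.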
You can annotate genes with any orthology group identifiers. The IDs can be numbers or symbols; the only restriction is that each orthology group must have a unique ID.
- The STRING database contains COG and NOG annotations of many publicly available genomes
- Newly sequenced genomes can be mapped to known orthology groups such as:
- COGs using CDD
- NOGs using eggNOG mapper
- A tool such as Proteinortho detects orthologous genes within different species.
- The paper "New Tools in Orthology Analysis: A Brief Review of Promising Perspectives" by Bruno T. L. Nichio et al. reviews several current tools for gene orthology detection
- The orthology information file is an optional input file
- The path to this file is provided in:
  - User Interface: File->Import->Orthology Information file
  - Command Line: "-cog-info" option
If you are using COGs (Clusters of Orthologous Genes) as your gene orthology group identifiers, you can use the file cog_info.txt provided in the input directory in the installation folder (it can also be downloaded from here).
The functional description of gene orthology groups will appear in the legend (User Interface) or in the output catalog file (when clicking on the "Save" button in the User Interface, or when executing via Command Line).
You can also use a custom file of your own. See instructions below.
This file should use the following format:
COGID;COG description;[COG functional categories separated by a comma (e.g. "E,H"); COG functional category description 1; COG functional category description 2;...;geneID]
The text inside the brackets [] is optional
COG0318;Acyl-CoA synthetase (AMP-forming)/AMP-acid ligase II;I,Q;Lipid transport and metabolism;Secondary metabolites biosynthesis, transport and catabolism;CaiC;
COG0319;ssRNA-specific RNase YbeY, 16S rRNA maturation enzyme;J;Translation, ribosomal structure and biogenesis;YbeY;
COG0320;Lipoate synthase;H;Coenzyme transport and metabolism;LipA;
...
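The semicolon-separated layout above can be split into named fields as follows. This is a sketch under stated assumptions rather than CSBFinder's own reader: the function name `parse_orthology_line` and the dictionary keys are hypothetical, and it assumes that the number of category descriptions equals the number of category letters, which is how the sample lines are laid out.

```python
def parse_orthology_line(line):
    """Split one 'ID;description;[categories;descriptions...;geneID]' line."""
    fields = [f.strip() for f in line.strip().rstrip(";").split(";")]
    if len(fields) < 2:
        raise ValueError("expected at least 'ID;description'")
    record = {
        "id": fields[0],
        "description": fields[1],
        "categories": [],
        "category_descriptions": [],
        "gene": None,
    }
    if len(fields) > 2:  # the optional bracketed part is present
        categories = [c.strip() for c in fields[2].split(",") if c.strip()]
        rest = fields[3:]
        record["categories"] = categories
        # Assumption: one description per category letter, in the same order.
        record["category_descriptions"] = rest[:len(categories)]
        if len(rest) > len(categories):
            record["gene"] = rest[len(categories)]
    return record


line = ("COG0318;Acyl-CoA synthetase (AMP-forming)/AMP-acid ligase II;I,Q;"
        "Lipid transport and metabolism;"
        "Secondary metabolites biosynthesis, transport and catabolism;CaiC;")
print(parse_orthology_line(line))
```

Splitting on ";" rather than "," matters here, because descriptions such as "Translation, ribosomal structure and biogenesis" contain commas themselves.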
The taxonomy file contains the taxonomic information of the input genomes.
This file should use the following format:
[a header line]
[genome name, identical to the name in the input genomes file],[kingdom],[phylum],[class],[genus],[species]
An unknown classification should be indicated using a hyphen "-"
genome,kingdom,phylum,class,genus,species
Acaryochloris_marina_MBIC11017_uid58167,Bacteria,Cyanobacteria,-,Acaryochloris,Acaryochloris_marina
Acetobacter_pasteurianus_IFO_3283_01_uid59279,Bacteria,Proteobacteria,Alphaproteobacteria,Acetobacter,Acetobacter_pasteurianus
Acetohalobium_arabaticum_DSM_5501_uid51423,Bacteria,Firmicutes,Clostridia,Acetohalobium,Acetohalobium_arabaticum
...
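Because the taxonomy file is plain comma-separated text with a header, it can be read with Python's standard `csv` module. The sketch below is illustrative only; the function name `read_taxonomy` is hypothetical, and it assumes the header uses the column names shown in the example above, with `genome` as the first column.

```python
import csv
import io


def read_taxonomy(handle):
    """Map genome name -> {rank: value}, with None for unknown ranks."""
    taxonomy = {}
    for row in csv.DictReader(handle):
        name = row.pop("genome")
        # A hyphen marks an unknown classification.
        taxonomy[name] = {
            rank: (None if value == "-" else value)
            for rank, value in row.items()
        }
    return taxonomy


sample = io.StringIO(
    "genome,kingdom,phylum,class,genus,species\n"
    "Acaryochloris_marina_MBIC11017_uid58167,Bacteria,Cyanobacteria,-,"
    "Acaryochloris,Acaryochloris_marina\n"
)
taxonomy = read_taxonomy(sample)
print(taxonomy["Acaryochloris_marina_MBIC11017_uid58167"])
```

For a file on disk, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`.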
If a CSB patterns file is provided, CSBs are no longer extracted from the input sequences. Instead, this file should contain the specific CSB patterns that the user wants to find in the input sequences.
- This is an optional input text file
- The path to this file is provided in:
  - User Interface: the dialog opened after clicking on the "Run" button
  - Command Line: "--patterns" or "-p" option
This file should use the following format:
>[unique pattern ID, must be an integer]
[homology group IDs separated by commas]
>[unique pattern ID, must be an integer]
[homology group IDs separated by commas]
>1
COG3736,COG3504,COG2948,COG0630
>564654
COG3736,COG3504,COG2948
....
If you are running without segmentation to directons, you should add a strand to each homology group ID, e.g. COG3736+,COG3504+,COG2948-,COG0630+
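The patterns format, including the optional strand suffix, can be read as sketched below. This is an illustration under stated assumptions, not CSBFinder's own code: the function names `parse_patterns` and `split_strand` are hypothetical, and the sketch assumes that a trailing "+" or "-" on an ID is always a strand marker and never part of the ID itself.

```python
def parse_patterns(lines):
    """Map integer pattern ID -> list of homology group IDs (as written)."""
    patterns = {}
    pattern_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            pattern_id = int(line[1:])  # raises ValueError if not an integer
            if pattern_id in patterns:
                raise ValueError(f"duplicate pattern ID {pattern_id}")
            patterns[pattern_id] = []
        elif pattern_id is None:
            raise ValueError("pattern line found before any '>' header")
        else:
            patterns[pattern_id].extend(g.strip() for g in line.split(","))
    return patterns


def split_strand(gene):
    """Return (group ID, strand), with strand None when no suffix is given."""
    if len(gene) > 1 and gene[-1] in "+-":
        return gene[:-1], gene[-1]
    return gene, None


sample = [
    ">1",
    "COG3736,COG3504,COG2948,COG0630",
    ">564654",
    "COG3736+,COG3504+,COG2948-",
]
patterns = parse_patterns(sample)
print(patterns[1])
print([split_strand(g) for g in patterns[564654]])
```

Converting the header with `int()` enforces the rule that pattern IDs must be integers, and the duplicate check enforces that they are unique.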