Incremental distance tree directory
The data produced by incremental distance tree building must persist.
They are stored in a specially organized directory, referred to as an incremental distance tree directory, which contains the following files and subdirectories.
- Biological parameters
  - variance: parameters to be placed after makeDistTree -variance. For example, to implement the variance function L^4.5, this file must contain: pow -variance_power 4.5;
  - hybridness_min: min. hybridness, > 1, or 0 meaning no hybrid identification;
  - dissim_boundary: point of discontinuity in the dissimilarity distribution (= genospecies barrier), > 0 or NAN;
  - genogroup_barrier: dissimilarity barrier for genogroup identification, used to find genogroup outliers, > 0, or NAN meaning no genogroup outliers; must be less than or equal to dissim_boundary;
  - delete_criterion_outliers: the presence of this file means that all criterion outliers will be deleted from the tree;
  - good: list of objects which should not be removed as outliers;
  - phen/: link to the directory with "phenotype" attributes of objects, see Taxonomy miscongruence;
- Database
  - server: SQL server name;
  - database: database on the SQL server;
  - bulk: local directory for bulk inserts;
  - bulk_remote: path in Universal Naming Convention to the local bulk directory.
- Grid engine
  - pairs2dissim.grid: min. number of dissimilarity requests to be processed on a grid, > 0;
  - object2dissim.grid: min. number of invocations of object2closest.sh to be processed on a grid, > 0;
  - object2closest.sql: the presence of this file means that request_closest.sh queries an SQL database and the number of concurrent connections must be restricted to 30;
  - nogrid: the presence of this file means that threads on the main computer must be used instead of a grid;
- Computer processing
  - large: the presence of this file means that the files in the new/ and phen/ directories are grouped into subdirectories named file2hash <file name>;
  - threads: number of threads; if this file is absent then the parameter -threads 15 will be used;
- version: version number of the files;
- tree: distance tree in an internal format;
- dissim: file with dissimilarities. Format of a line: obj_name1 obj_name2 dissimilarity;
- indiscern: pairs of indiscernible objects;
- new/: the reservoir of objects: a directory containing the names (as zero-length files) of new objects to be added to the tree and then removed from this directory;
- good.expanded: the objects indiscernible from the objects in good;
- outlier_genogroup: optional file with genogroup outliers;
- runlog: start times of the incrementations;
- hist/: directory with the historic versions of the tree files and the temporary files created by distTree_inc.sh, where each file has the extension .version. The files hybrid.version contain the output of makeDistTree -delete_hybrids;
- finished: empty file created if the iterations of the incremental tree building are finished.
- tree.released: link to the latest released tree.
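As an illustration of how the parameter files above might be consumed, here is a minimal POSIX shell sketch. The function names inc_threads, inc_flag and inc_backend are hypothetical and are not part of the distributed scripts; the sketch only assumes that the threads file holds a single number and that flag files are tested by their presence.

```shell
#!/bin/sh
# Hypothetical helpers for reading an incremental distance tree directory.
# None of these names come from the distributed scripts.

# Print the thread count: the content of the "threads" file if it exists,
# otherwise the documented default of 15.
inc_threads() {
  if [ -f "$1/threads" ]; then
    cat "$1/threads"
  else
    echo 15
  fi
}

# Succeed (exit status 0) if the flag file $2 exists in directory $1.
inc_flag() {
  [ -e "$1/$2" ]
}

# Print "threads" if the nogrid flag file is present,
# otherwise "grid" (meaning that a grid may be used).
inc_backend() {
  if inc_flag "$1" nogrid; then
    echo threads
  else
    echo grid
  fi
}
```

For example, `inc_threads /path/to/inc_dir` prints 15 when the directory has no threads file.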
The presence of each of the following files acts as a Boolean flag controlling the work of distTree_inc.sh:
- stop: distTree_inc.sh must stop;
- skip: incrementations must be skipped and the next steps must start.
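A driver loop could poll these flag files before each incrementation. The sketch below is only an illustration: the function name next_action is hypothetical, and checking stop before skip is an assumption, since the order of precedence is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: report what a driver loop should do next,
# based on the flag files in the incremental tree directory $1.
# Assumption: "stop" takes precedence over "skip".
next_action() {
  if [ -e "$1/stop" ]; then
    echo stop
  elif [ -e "$1/skip" ]; then
    echo skip
  else
    echo increment
  fi
}
```

Creating an empty file with `touch /path/to/inc_dir/stop` would then make next_action print stop on its next call.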
The following scripts act like virtual procedures in object-oriented programming. Information on whether each object of the reservoir is an outlier, is in the tree, or is still unprocessed is stored in a "database", most conveniently a relational database. The database also stores the genogroup partitioning of the objects. The code for database interaction is not provided.
Scripts should finish with exit code 0.
- genogroup2db.sh: input is the file genogroup_table; output is the file outlier-genogroup. Updates genogroup information in the database, finds genogroup outliers and prints them in outlier-genogroup;
- objects_in_tree.sh: tells the database whether a list of objects is or is not in the tree. Parameters:
  - list of objects
  - in_tree: 0/1
- object2closest.sh: finds approximately 100 closest objects which are in the tree for an input object. Parameter: input_obj_name; printed output: file with 100 lines where each line has format: found_obj_name;
- outlier2db.sh
- pairs2dissim.sh: computes dissimilarities for requested pairs of objects. This script is invoked in parallel. Parameters:
  - input file of pairs of objects where each line has format: obj_name1 obj_name2
  - new object file or '': for placement of a new object
  - output file where each line has format: obj_name1 obj_name2 dissimilarity, where dissimilarity is a non-negative number, inf or nan.
  - output log file for error messages. These files are created in the temporary directories search/ and dr.out/.
- qc.sh: quality control: checks whether the tree file, the subdirectory new/ and the database agree.
- qc_object.sh: quality control of one object.
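To make the pairs2dissim.sh interface above more concrete, here is a minimal POSIX shell sketch of such a script. It is not the real implementation: compute_dissim is a toy placeholder (0 for identical names, 1 otherwise), the function names are hypothetical, and reporting malformed lines to the log file while still returning 0 is an assumption.

```shell
#!/bin/sh
# Sketch of a user-supplied pairs2dissim.sh, written as functions.

# Toy placeholder (assumption): 0 for identical names, 1 otherwise.
# A real script would compare the underlying objects here and could
# also print inf or nan.
compute_dissim() {
  if [ "$1" = "$2" ]; then
    echo 0
  else
    echo 1
  fi
}

# Arguments: pairs file, new object file or '', output file, log file.
pairs2dissim() {
  pairs=$1
  new_obj=$2   # unused in this toy version
  out=$3
  log=$4
  : > "$out"   # create or truncate the output file
  : > "$log"   # create or truncate the log file
  while read -r obj1 obj2; do
    [ -n "$obj1" ] || continue        # skip blank lines
    if [ -z "$obj2" ]; then
      echo "Bad line: $obj1" >> "$log"
      continue
    fi
    # Output line format: obj_name1 obj_name2 dissimilarity
    echo "$obj1 $obj2 $(compute_dissim "$obj1" "$obj2")" >> "$out"
  done < "$pairs"
  return 0
}
```

In a standalone script, the last lines could be `pairs2dissim "$@"` followed by `exit 0`, so that the script finishes with exit code 0 as required.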