GitHub - kuredatan/taxocluster

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
meta		meta
.gitignore		.gitignore
LICENSE_MIT.txt		LICENSE_MIT.txt
README		README
actions.py		actions.py
clean.sh		clean.sh
compareClusters.py		compareClusters.py
computeDistances.py		computeDistances.py
dotModule.py		dotModule.py
extractNodes.py		extractNodes.py
featuresVector.py		featuresVector.py
importMatrix.py		importMatrix.py
kMeans.py		kMeans.py
main.py		main.py
misc.py		misc.py
normalization.py		normalization.py
parsingFasta.py		parsingFasta.py
parsingInfo.py		parsingInfo.py
parsingMatch.py		parsingMatch.py
preformat.py		preformat.py
script_format.sh		script_format.sh
script_format_per_file.sh		script_format_per_file.sh
taxoTree.py		taxoTree.py
toto~		toto~
writeOnFiles.py		writeOnFiles.py

Repository files navigation

**** Internship in the CBIB, Bordeaux, 2016.

*** Use main.py to interact gently with the program. ***

LINUX:
In Emacs:
- open main.py
- C-c C-p to open interpreter
- in the code, C-c C-l to evaluate the file
- write down "main.py"
- in the interpreter, write down "main()"

In the terminal:
- write down "python"
- write down "from main import main"
- write down "main()"
- to exit Python, write down "exit()"
- If you stop to stop prematurely Python: Ctrl-C

WINDOWS:
In Pyzo/Anaconda:
- open file "main.py"
- write down "main()" in the interpreter

-----------------------------------------------------------------------------------------------------------------

This program uses a K-Means type clustering to infer correlations between metadata and the phylogenetic trees in samples. 

Notations:
For a certain read Ri, and if T is the whole taxonomic tree, then Ti is the subtree of T rooted at the LCA of the leaves/species matched by Ri. Li is the set of leaves of Ti and Mi is the set of matched leaves. Let Ni be such as Li is the disjoint union of Mi and Ni. We extend these notations to samples, knowing that a sample is a set of reads: for sample i, Ti is the subtree of T rooted at the LCA of the matched leaves in at least one of the reads in sample i. 

Firstly, it clusters the set of samples according to their values of a certain fixed metadatum, obtaining the set of clusters C1. Then the program uses the distance d(sample1,sample2) = |M1| + |M2| - 2*|M1interM2| to get the set of clusters C2. It deletes approximately one quarter of the samples in each cluster, and later on performs a clustering of the new set of samples using distance d(sample1,sample2) = |L1| + |L2| - q*(|N1interM2| + |N2interM1|) - |M1interM2| where q is provided by the user, 0 <= q <= 1. q is intuitively the way to choose a consensus tree between the two samples: when q = 0, the consensus tree is strict (only allowing "nodes" that are matched in both trees), and when q = 1, the consensus tree is the union of the matched nodes. We obtain a set of clusters C3. 

Secondly, the program compares C3 and C1 (computing the distance between two centers of clusters which are supposed to be the same, computing the number of common elements in both clusters) and gives a score S. The closest S is to 1, the more alike the clusters are. When the clustering by microbial population seems relevant accordingly to the clustering by values of metadatum, the program computes and returns the set of common nodes for each cluster of C3.

-----------------------------------------------------------------------------------------------------------------

Details about the files:
**** actions.py ****

**** compareClusters.py ****
Tools to quantify to which extent two clusters are alike.

**** computeDistances.py ****
Different sorts of distances for the clustering.

**** dotModule.py ****
Turns a graph (adjacency matrix) into a .DOT file, that can be seen using GraphViz.

**** extractNodes.py ****
Computes the set of common nodes for each cluster.

**** featuresVector.py ****
Create list of matching nodes for each patient.

**** /files ****
Stores written files using writeOnFiles functions.

**** importMatrix.py ****
Allows to get the distance matrix stored at /meta.

**** kMeans.py ****
Implements K-Means Algorithm.

**** main.py ****
Interface with the user.

**** /meta ****
Contains raw material.

**** misc.py ****
Contains useful macros.

**** normalization.py ****
Normalizes (mean-centers) values of a list.

**** parsingFasta.py ****
Parses FASTA file stored at /meta.

**** parsingInfo.py ****
Parses data table stored at /meta.

**** parsingMatch.py ****
Parses MATCH files stored at /meta/match.

**** preformat.py ****
Formats the .match files to parse them.
Stores the resulting .test files at /meta/match/testfiles.

**** script_format.sh & script_format_per_file.sh ****
These files would not be modified. They are used by the preformat module.

**** taxoTree.py ****
Implements a modified version of taxonomic tree.

**** writeOnFiles ****
Helps saving results in files.
Converts also DOT files into PNG files.

-----------------------------------------------------------------------------------------------------------------

Comments:

*** To clean the folder of unnecessary files ***
- Run ./clean.sh
- In case it would not execute, write in the terminal "chmod +rwx clean.sh"

*** Before starting TAXOCLUSTER ***
- Start process in preformat.py /!\ Beware: this process uses multithreading! To disable multithreading, please comment the "&" in script_format.sh with a "#". It can take several dozens of minutes.
OR IF YOU HAVE ALREADY GOT THE .test FILES
- Create a folder entitled "files" to store your results in main folder.
- Create a folder entitled "meta" to store your .tree, .fasta files. In "meta" create "match" folder (to put your .match files), and then in "match" create "testfiles" folder.

TaxoCluster --
             |
             ---- files
             ---- meta
                 |
                  --- match
                      |
                      -- testfiles

*** To visualize a graph with dot files ***
- LINUX: do "sudo apt install graphviz". Then you can right-click on the DOT file in "files" and see the graph.

*** In case of an error during parsing ***
- LINUX: install cut, awk, and sed with "sudo apt [software]".
- Please be aware the input files should perfectly respect the template of .fasta and .match files.