\documentclass{report}
\usepackage{lastpage}
\usepackage{fancyhdr}
\usepackage{blindtext}
\usepackage{xpatch}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{float}
\usepackage[T1]{fontenc}
\usepackage[normalem]{ulem}
\usepackage[nottoc,notlof,notlot]{tocbibind}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{graphicx}
\pagestyle{headings}
\title{Compared analysis of taxonomic trees}
\author{C. REDA \\ \\ supervised by M. NIKOLSKI and M. RAFFINOT\\CBIB, Bordeaux}
\date{August 21, 2016}
\fancypagestyle{mystyle}{%
\fancyhf{}%
\fancyfoot[LE,RO]{\thepage~of \pageref{LastPage}}%
\renewcommand{\headrulewidth}{0pt}%
}
\begin{document}
\maketitle
\pagenumbering{gobble}
\chapter*{Abstract}
\nocite{*}
Taxonomic trees are trees that represent the classification of species into groups according to their evolutionary past. In metagenomics, that is, the study of genetic material from samples taken directly from a natural environment (as opposed to laboratory cultures), taxonomic trees make it possible to identify the species matching the samples, and thus to determine the composition of the microbial population endemic to the environment.\\
Interest in metagenomics has recently soared in bioinformatics, as a consequence of the decreasing price of DNA sequencing, and also because studying environmental samples has unveiled the existence of many bacteria that cannot be cultivated in labs. Metagenomics has many applications in biology and medicine. The subject of this internship is to improve the analysis of metagenomic data in medical applications.\\
\newpage
\tableofcontents
\newpage
\listoffigures
\newpage
\listoftables
\newpage
\pagenumbering{arabic}
\xpatchcmd{\chapter}{%
\thispagestyle{plain}%
}{%
\pagestyle{mystyle}
}{}{}
\chapter{Scientific context}
First we explain what metagenomics is, how taxonomic trees intervene in this scientific field, and what the most pressing issues in metagenomics are today. Then we introduce the test database.
\section{Definition and goals of metagenomics}
Metagenomics is the study of the genetic material of samples, often taken directly from a natural environment. This material can be fragments of DNA, proteins or RNA.\\
We focus here only on bacterial DNA. Bacteria used to be cultivated in laboratories before having their DNA sequenced and annotated; however, a great number of bacteria cannot be raised in an artificial environment, such as some bacteria living in animal guts or in fragile marine ecosystems. Metagenomics thus helps to study more, and previously unknown, species of bacteria. Along with the decreasing price of DNA sequencing, this explains the development of and the growing interest in metagenomics today.\\
The whole pipeline that turns raw genetic material into reliable data comprises the following steps:
\begin{itemize}
\item After taking samples of genetic material from the environment, one extracts the DNA from each sample by a chemical reaction.
\item Then one sequences the resulting DNA, that is, one obtains strings of characters that correspond to the primary (one-dimensional) structure of the DNA, i.e. the order of the base pairs built from cytosine ($C$), adenine ($A$), guanine ($G$), and thymine ($T$). Note that there exist other base symbols, such as $N$ (standing for any of $A,C,T,G$) or $X$, which can greatly increase the complexity of sequencing. There are two main ways to sequence DNA: the first method, called \emph{Sanger sequencing}, is the slowest, but gives good-quality sequences. The second, NGS (which stands for \textsc{Next-Generation Sequencing} \cite{WikiNGS}), is faster, but the resulting sequences can contain errors. NGS produces only fragments of the DNA, of length 32 to 1,000 base pairs. As a general rule, these two methods are combined to achieve a good tradeoff between quality and price of the sequencing.
\end{itemize}
To distinguish one bacterium from another, sequencing is performed on genes, called \emph{16S}, that are present in all bacteria. These genes contain highly conserved regions (identical in all bacteria) and highly variable regions (\emph{HVR}). The sequencing therefore focuses on the \emph{HVR} to determine to which bacterium a given sequence belongs. Correctly determining the bounds of these \emph{HVR} is particularly hard, and it is one of the causes of errors in sequencing.\\
After sequencing by NGS, one obtains \emph{reads}, fragments of DNA sequences of length 32 to 1,000 base pairs: e.g. this (partial) read is associated with the \emph{16S} gene of species \emph{Hydrothermal vent clone VC2.1 Arc13}:\\
\begin{center}
{\tt AGGCCACTGCTATCGGGCTCCGACTAAGCCATGCGAGTCTAG\\
GGGCTCCTTCGGGAGCACCGGCGGACGGCTCAGTAACACGTGGACAA\\
CCTACCCTCGGGTGGGGGATAACCTCGGGAAACTGAGGCTAATACCC...}\\
\end{center}
Once the sequencing is done and reads are obtained, they are usually matched against the reference sequences found in databases such as NCBI, \textsc{GreenGenes} or RDP: for each read, one looks for the best alignment with one of the reference sequences, that is, the alignment that best preserves the order of the bases (see figure $1.1$). After this step, in the case of human bacteria (unfortunately this is not a general rule), each read corresponds to one or more sequences, that is, one or more species.\\
\begin{figure}[H]
\centering
\includegraphics[scale=0.5]{illustrations/Sequence_gaps.JPG}
\caption{An example of DNA sequence alignment, from the \emph{Gap penalty} article in \textsc{Wikipedia}. The upper read is the reference to which the lower read is aligned}
\end{figure}
The main objective of metagenomics is to determine to which species the sequencing reads belong, by matching them against the reference sequences.
\section{Assignment/identification of reads to a certain species of bacteria}
There exists no unique common phylogeny. A handful of taxonomic trees are used in bioinformatics, coming from different databases such as NCBI, \textsc{GreenGenes} or RDP.\\
A taxonomic tree (see figure $1.2$) is of bounded height, whose value can vary according to the database. In \textsc{GreenGenes}'s taxonomic tree, the one used for our tests, there are seven ranks, that is, seven levels/generations in the tree of growing precision:
\begin{itemize}
\item 'D' (\textsc{Domain}) or 'R' (\textsc{Kingdom}): this rank is basically divided into three nodes: '\emph{Bacteria}', '\emph{Archaea}' and '\emph{Eucarya}'.
\item 'P' (\textsc{Phylum}): '\emph{Chordata}', ...
\item 'C' (\textsc{Class}): '\emph{Mammalia}', ...
\item 'O' (\textsc{Order}): '\emph{Carnivora}', ...
\item 'F' (\textsc{Family}): '\emph{Canidae}', ...
\item 'G' (\textsc{Genus}): '\emph{Vulpes}', ...
\item 'S' (\textsc{Species}): '\emph{Vulpes vulpes}', ... (corresponds to a single species)
\end{itemize}
Note that there also exists another grouping, called OTU, which stands for \textsc{Operational Taxonomic Unit}. An OTU is a set of leaves and internal nodes of the taxonomic tree that are thought to be very close. OTUs are pairwise disjoint.\\
\begin{figure}[H]
\begin{center}
\subfigure\includegraphics[scale=0.30]{illustrations/arbretaxo.png}
\subfigure\includegraphics[scale=0.30]{illustrations/Taxonomic_Rank.png}
\caption{A partial taxonomic tree from \textsc{GreenGenes} and the taxonomy of the fox, from \emph{Taxonomic Rank} article in \textsc{Wikipedia}}
\end{center}
\end{figure}
It is worth noticing that the nodes nearest to the root have the lowest degree, and that a taxonomic tree has a great proportion of leaves (S-ranked nodes) compared to its number of internal nodes. \textsc{GreenGenes}'s taxonomic tree for the domains '\emph{Bacteria}' and '\emph{Archaea}' used in \textsc{TaxoTree} (see the next chapters) contains approximately $9,065$ nodes, among them $5,565$ leaves.\\
Most of the time, despite the fact that one read can match many species (that is, leaves of the taxonomic tree), a given read is assigned to only one node of the taxonomic tree. Generally speaking, a read is assigned to the \emph{Lowest Common Ancestor} (LCA) \cite{Tarjan} of the set of matched sequences/species/S-ranked nodes. The LCA of a set of leaves $L$ is the last common node on the paths from the root to each leaf of $L$ (see figure $1.3$).
\begin{figure}[H]
\centering
\includegraphics[scale=0.35]{illustrations/Lowest_common_ancestor.png}
\caption{The dark green node is the LCA of \emph{x} and \emph{y}, from \emph{Lowest Common Ancestor} article in \textsc{Wikipedia}}
\end{figure}
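To make the LCA concrete, here is a minimal \textsc{Python} sketch that computes the LCA of a set of leaves by intersecting root-to-leaf paths; the \texttt{parent} dictionary and the node names are purely illustrative, and this is not the (more efficient) algorithm of \cite{Tarjan}:
\begin{verbatim}
# Minimal LCA sketch: 'parent' maps each node to its parent
# (the root maps to None); node names are illustrative.
def path_to_root(parent, node):
    """Nodes from 'node' up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def lca(parent, leaves):
    """Lowest common ancestor of a set of leaves."""
    # Reverse each path so that all of them start at the root.
    paths = [path_to_root(parent, leaf)[::-1] for leaf in leaves]
    ancestor = None
    for nodes in zip(*paths):    # walk down from the root
        if len(set(nodes)) > 1:  # the paths diverge here
            break
        ancestor = nodes[0]
    return ancestor

parent = {"root": None, "a": "root", "x": "a", "y": "a"}
print(lca(parent, ["x", "y"]))  # -> 'a'
\end{verbatim}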
One can seldom associate one read with a single species, because of errors in the sequencing. The deeper in the tree the read is assigned, the better the precision of the assignment. Nevertheless, some algorithms have been designed to improve the accuracy of the assignment of reads \cite{Tango1} \cite{Tango3}. Finally, note that a node can match several reads from a same sample, and thus several reads can be assigned to this node in a single sample.
\section{Topics in metagenomics}
There already exist algorithms that enhance the assignment of a read to a node of the taxonomic tree \cite{Tango1}, and measures on phylogenies that quantify the relevance of the tree according to a certain distance matrix between reads, or according to another taxonomic tree \cite{RobinsonFoulds} \cite{Alignment}. But there is a lack of efficient tools \cite{Enaud} when it comes to comparing several trees/forests by quantifying their similarities as well as their differences, and above all to drawing biological conclusions on the presence or absence of certain nodes, given some information that cannot be provided by the taxonomic tree. This sort of information is called \emph{metadata} in this report, that is, information that casts light on the natural environment in which the samples were taken.\\
For instance, given a group of patients afflicted with cystic fibrosis, one can extract samples from their guts and may want to evaluate the influence of age or treatment on the population of \emph{E. coli} bacteria in the patients' guts. Statistical methods, using e.g. \textsc{Wilcoxon} \cite{Wilcoxon} \cite{Whitney} or \textsc{McNemar}'s \cite{McNemar} tests, are nowadays used to address these questions, but their interpretation requires a manual review of the analysed data. To guide the diagnosis or the evaluation of the efficiency of a treatment, practitioners are in need of a semi-automatic processing of the data.
\section{The test database}
The test database used to design our algorithms comes from a study which took place at the Children's Hospital of Bordeaux, from November $2015$ to May $2016$ \cite{Enaud}. All patients were afflicted with cystic fibrosis, and some of them needed a heavy antibacterial treatment by intravenous injection (denoted ATB-IV) to cure their chronic bacterial infections. The aim of the study was to assess the influence of antibacterials on the microbial population of the patients' guts.\\
Patients were divided into controls ($W$) and treated patients ($T$). Those in the $W$ group saw the paediatrician twice, on Day $0$ and Day $90$, whereas those in the $T$ group saw him three times, on Day $0$, Day $45$ and Day $90$, and were treated from Day $0$ to Day $45$. In total there are $47$ samples from $21$ patients.\\
Each time the patients saw the doctor, they provided samples of their stool and a completed survey on their quality of life. The samples were then sequenced. Later on, the results were filtered and normalized as follows:
\begin{itemize}
\item Variables unlikely to be useful (e.g. a node with too small a number of assignments) are identified and removed: for instance, if the number of assignments falls below the first quartile, the associated variable is discarded.
\item Assignment numbers have been normalized relative to the total number of reads assigned in the taxonomic tree.
\item Then these numbers have been mean-centered and scaled to unit variance (standardized); a sketch of this preprocessing follows the list.
\end{itemize}
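For concreteness, the sketch below reproduces this preprocessing on a toy occurrence matrix with \textsc{NumPy}; the quartile filter and the standardization are written from the description above, not taken from the study's actual scripts:
\begin{verbatim}
import numpy as np

# Toy occurrence matrix: rows = samples, columns = taxonomic nodes.
occM = np.array([[10., 0., 3.], [7., 1., 0.], [12., 0., 5.]])

# 1) Discard nodes whose total number of assignments falls
#    below the first quartile of the per-node totals.
totals = occM.sum(axis=0)
occM = occM[:, totals >= np.percentile(totals, 25)]

# 2) Normalize each sample by its total number of assigned reads.
occM = occM / occM.sum(axis=1, keepdims=True)

# 3) Mean-center and scale each node to unit variance.
occM = (occM - occM.mean(axis=0)) / occM.std(axis=0)
print(occM)
\end{verbatim}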
\newpage
\chapter{Subject of the internship}
Our research aimed at solving two quite similar problems, which mainly target the \emph{clustering} of patients, that is, separating them into groups that maximize the resemblance between two patients of a same group or \emph{cluster}, and minimize it for a pair of patients from different \emph{clusters}. For all these problems, a common numbering of samples, nodes in the taxonomic tree, and non-taxonomic information (\emph{metadata}) has been chosen. Furthermore, we use as input a reference taxonomic tree denoted \textsc{T} (from \textsc{GreenGenes} for our test database), and a \emph{data/information matrix} denoted \textsc{infoM} (see Figure $2.2$), such that \textsc{infoM}[$i$][$j$] is the value of metadatum $j$ in sample $i$ (being $0$ or $1$ for boolean values).
\section{Problem of most different pairs (MDPairs)}
\textsc{MDPairs} is:\\
\textsc{Supplementary input:}
\begin{itemize}
\item A matrix \textsc{occM} (called \emph{occurrence matrix}) such that \textsc{occM}[$i$][$j$] (see Figure $2.1$) is the number of assignments to node $j$ in sample $i$.
\item A set of groups of samples $S$.
\end{itemize}
\bigskip
\textsc{Output:} \begin{itemize}
\item The list of the pairs in $S$ that are the most different.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[scale=0.3]{illustrations/occmatrix.png}
\caption{A screenshot of the occurrence matrix from the test database}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[scale=0.3]{illustrations/infomatrix.png}
\caption{A screenshot of the info matrix from the test database}
\end{figure}
\section{Problems of clustering (CLust)}
The following two problems deal with the clustering of the groups of samples by their respective microbial populations, and with the comparison of such a clustering with the values of metadata for these groups, in order to find a potential correspondence between bacteria and metadata.
\subsection{Problem of compatibility (CL.Comp)}
\textsc{CL.Comp} is:\\
\textsc{Supplementary input:} \begin{itemize}
\item The set of lists of nodes \textsc{nodesS} (called \emph{matchingNodes}) such that \textsc{nodesS}[$i$] is the list of nodes matched by at least one of the reads in sample $i$.
\item A subset $N$ of the set of nodes in the taxonomic tree.
\item A set of metadata $M$.
\end{itemize}
\bigskip
\textsc{Output:} \begin{itemize}
\item Decide if the clustering of samples according to the microbial population restricted to $N$ is compatible with the clustering of samples according to their values of metadata in $M$.
\end{itemize}
\subsection{Problem of best clustering (CL.BCLust)}
\textsc{CL.BCLust} is:\\
\textsc{Supplementary input:}
\begin{itemize}
\item The set of lists of nodes \textsc{nodesS} (called \emph{matchingNodes}) such that \textsc{nodesS}[$i$] is the list of nodes matched by at least one of the reads in sample $i$.
\item A set of metadata $M$.
\end{itemize}
\bigskip
\textsc{Output:} \begin{itemize}
\item Find the best subset $N$ of the set of nodes in \textsc{T} for a compatible classification of samples according to values of metadata in $M$.
\end{itemize}
\newpage
\chapter*{Methods}
Three different approaches have been designed, each of them solving one or more of these problems. In chronological order:
\begin{itemize}
\item First, a method, \textsc{TaxoTree}, that uses a statistical approach with customized measures of microbial populations, and determines a distance between the samples to solve problem \textsc{MDPairs}. However, in order to be relevant, this method requires robust measures, and biological phenomena are not so easy to describe with only a few characteristics.
\item Second, an approach, called \textsc{TaxoClassifier}, that uses the \emph{supervised learning} paradigm and tries to identify nodes (of the taxonomic tree)/bacteria that discriminate the samples according to the values of metadata, that is, finds the nodes which seem to have a relationship with the considered metadata. Strong \emph{a priori} hypotheses on the microbial populations have to be made, which could bias the results, but this approach can solve problems \textsc{CLust}.
\item Third, a method, called \textsc{TaxoCluster}, that uses \emph{unsupervised learning}. It clusters the samples using two distances depending only on node populations, and compares the resulting clusters with the groups of metadata values. It can solve problem \textsc{CL.Comp}.
\end{itemize}
For \textsc{TaxoTree} and \textsc{TaxoCluster}, we have put in place a novel implementation of taxonomic trees (see \textsc{Appendix C}).
\chapter{Statistics-inspired approach (TaxoTree software)}
\section{Definitions}
There are a few useful definitions to know:
\begin{enumerate}
\item \textsc{A tree induced by a group of samples $G$} is the tree (or the forest) containing only nodes that have been assigned at least once in one of the samples in $G$ (see figure $3.1$).
\begin{figure}[H]
\centering
\includegraphics[scale=0.5]{illustrations/inducedtree.png}
\caption{Subtrees induced by three groups of samples (each corresponding to a single colour). Induced subtrees can of course overlap.}
\end{figure}
\item \textsc{A pattern in the tree $A$} is a connected subgraph of $A$. For instance, the blue tree from the example above.
\end{enumerate}
\newpage
\section{Description of the method}
This approach tries to solve \textsc{MDPairs}; it computes a certain number of scores for a group of samples, independently of the values of metadata. Each score corresponds to a partial definition of identity between two groups of samples. Together, these scores aim at quantifying as accurately as possible the difference between two sets of samples.
\subsection{Total Ratio}
\begin{itemize}
\item \textsc{Total Ratio:} given two groups $G_{1}$ and $G_{2}$ of samples, if $n$ is the number of assignments to common nodes of both groups (i.e. the sum of assignments in common nodes in each group), and $n_{1}$ and $n_{2}$ are the numbers of assignments of these two groups in non-common nodes respectively, then Total Ratio is $TR(G_{1},G_{2}) = \frac{n}{n_{1} + n_{2} + n}$ (see figure $3.2$). If $n_{1} = n_{2} = n = 0$, it means that no read from the samples of $G_{1}$ and $G_{2}$ has matched a node in \textsc{T}; in this case we set $TR = +\infty$.
\begin{figure}[H]
\centering
\includegraphics[scale=0.5]{illustrations/totalratio.png}
\caption{Example for \textsc{Total Ratio}: blue nodes are common to green and red trees. Green and red nodes correspond to two distinct groups of samples. Here, $TR$ = $\frac{10 + 3 + 34}{(1 + 34 + 5) + (10 + 3 + 34) + (24 + 16)} \simeq 0.37$}
\end{figure}
\bigskip
\textsc{Biological interpretation:} Total Ratio focuses on the node population of the samples. If $n_{1} + n_{2} = 0$, it means that $G_{1}$ and $G_{2}$ have the same set of nodes and thus $TR = 1$. If $TR = 0$, it means $G_{1}$ and $G_{2}$ have no node in common.
\end{itemize}
\subsection{Pattern Ratio}
\begin{itemize}
\item \textsc{Pattern Ratio:} if $n_{common}$ is the number of assignments in common patterns of size $> 1$ in the trees induced by the selected groups of samples $G_{1}$ and $G_{2}$, and $n_{specific}$ the number of assignments in specific patterns, then Pattern Ratio is the quantity $PR(G_{1},G_{2}) = \frac{n_{common}}{n_{specific}}$ (see figure $3.3$). If $n_{specific} = 0$, it means that the two trees induced by $G_{1}$ and $G_{2}$ are the same, so having $PR = +\infty$ is consistent.
\begin{figure}[H]
\centering
\includegraphics[scale=0.5]{illustrations/patternratio.png}
\caption{Example for \textsc{Pattern Ratio}: violet nodes are common to cyan and yellow trees. Cyan and yellow nodes correspond to two distinct groups of samples. Here, $PR$ = $\frac{10 + 34 + 3 + 24}{(1 + 34 + 12) + (23 + 45 + 16 + 5)} \simeq 0.52$}
\end{figure}
\bigskip
\textsc{Biological interpretation:} Pattern Ratio focuses on the phylogenetic proximity between the nodes of the two groups. When computed for the whole set of samples, it corresponds to the 'functional kernel' of the gut, that is, a group of bacteria that is usually common to every human and that enables the various chemical reactions in the gut.
\end{itemize}
\subsection{Microbial Diversity}
\begin{itemize}
\item \textsc{Microbial Diversity:} given a group of samples $G$, if $n_{nodes}$ is the number of nodes in the tree induced by $G$, and $n_{tree}$ the number of nodes in the whole taxonomic tree, then Microbial Diversity is $MD(G) = \frac{n_{nodes}}{n_{tree}}$ (see figure $3.4$). It is the definition of diversity used in \cite{Enaud}, which must be distinguished from the \textsc{Phylogenetic Diversity} \cite{PhyloD} and \textsc{Mean Diversity} \cite{MeanD} coefficients.
\begin{figure}[H]
\centering
\includegraphics[scale=0.5]{illustrations/diversity.png}
\caption{Example for \textsc{Microbial Diversity}: Here, \textsc{Microbial Diversity} coefficient for violet tree is: $MD$ = $\frac{5}{12} \simeq 0.42$}
\end{figure}
\bigskip
\textsc{Biological interpretation:} The diversity of bacteria, for instance in the gut, is of paramount importance, because the equilibrium of the ecosystem is essential to the healthy development of the gut. Therefore two samples having the same prominent sorts of bacteria, but one having less diversity than the other, must be considered different. This coefficient may seem somewhat superfluous in the light of the previous scores, but it is justified by biology.
\end{itemize}
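The following \textsc{Python} sketch computes \textsc{Total Ratio} and \textsc{Microbial Diversity} from per-group assignment counts; \textsc{Pattern Ratio} is analogous, but additionally requires extracting the common and specific patterns of the induced trees. The dictionaries are illustrative, and assignments to common nodes are summed over both groups, following the definition above:
\begin{verbatim}
# counts1/counts2 map node -> number of assignments in each group
# (illustrative values, not taken from the test database).
counts1 = {"A": 10, "B": 3, "C": 34, "D": 1, "E": 5}
counts2 = {"A": 10, "B": 3, "C": 34, "F": 24, "G": 16}

def total_ratio(c1, c2):
    """TR = n / (n1 + n2 + n); +infinity when nothing matched."""
    common = set(c1) & set(c2)
    n = sum(c1[v] + c2[v] for v in common)  # common-node assignments
    n1 = sum(c1[v] for v in c1 if v not in common)
    n2 = sum(c2[v] for v in c2 if v not in common)
    total = n + n1 + n2
    return float("inf") if total == 0 else n / float(total)

def microbial_diversity(counts, n_tree):
    """MD = size of the induced tree / size of the whole tree."""
    return len(counts) / float(n_tree)

print(total_ratio(counts1, counts2))
print(microbial_diversity(counts1, 9065))
\end{verbatim}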
\subsection{Computation of the similarity coefficient}
The final similarity coefficient $s$ between two groups of samples $G_{1}$ and $G_{2}$ (the distance $d$ can be obtained from $s$ with the formula $d = \frac{1}{s}$, $d = +\infty$ if $s = 0$) used to compare the groups of patients is a mix of these scores. Equal importance is accorded to each score in the following formula:\\
\begin{center}
if $MD(G_{1}) - MD(G_{2}) = 0$:\\
$s(G_{1},G_{2}) = TR(G_{1},G_{2}) + PR(G_{1},G_{2})$\\
else:\\
$s(G_{1},G_{2}) = TR(G_{1},G_{2}) + PR(G_{1},G_{2}) - | MD(G_{1}) - MD(G_{2}) |$\\
\end{center}
Then this similarity coefficient is normalized:\\
\begin{center}
$\overline{s}(G_{1},G_{2}) = \frac{s(G_{1},G_{2}) - E(s)}{\sigma(s)}$\\
\end{center}
Then, if $m_{M}$ is the number of classes induced by the values of a given metadatum $M$, the program asks to cluster the set of samples into groups $(G_{i})_{1 \le i \le m_{M}}$, each group corresponding to one single known value of the metadatum. After computing $\overline{s}$ for every pair of $(G_{i})_{1 \le i \le m_{M}}$, it returns the list of pairs of groups $(G_{i},G_{j})_{i \neq j, 1 \le i,j \le m_{M}}$ such that $\overline{s}(G_{i},G_{j}) \le FQ$, where $FQ$ is the value of the first quartile.
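Note that, since the first case of the formula has $|MD(G_{1}) - MD(G_{2})| = 0$, the piecewise definition reduces to the single expression $s = TR + PR - |MD(G_{1}) - MD(G_{2})|$. A minimal sketch of the pair-selection step follows; the similarity function is passed as a parameter, and the toy \texttt{jaccard} stand-in below is not the actual coefficient:
\begin{verbatim}
import itertools
import numpy as np

def most_different_pairs(groups, similarity):
    """groups: dict group name -> data; similarity: function of two
    groups computing s = TR + PR - |MD1 - MD2| as defined above.
    Returns the pairs whose normalized similarity is <= the first
    quartile FQ."""
    pairs = list(itertools.combinations(sorted(groups), 2))
    s = np.array([similarity(groups[a], groups[b]) for a, b in pairs])
    s_norm = (s - s.mean()) / s.std()  # normalization step above
    fq = np.percentile(s_norm, 25)     # first quartile FQ
    return [p for p, v in zip(pairs, s_norm) if v <= fq]

# Toy demo with a set-based similarity used as a stand-in:
groups = {"G1": {"A", "B"}, "G2": {"A", "C"},
          "G3": {"D"}, "G4": {"A", "B", "C"}}
jaccard = lambda a, b: len(a & b) / float(len(a | b))
print(most_different_pairs(groups, jaccard))
\end{verbatim}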
\section{Implementation}
Our implementation of \textsc{TaxoTree} in \textsc{Python 2.7.9} offers a few other features:
\begin{enumerate}
\item After selection of a group of bacteria and of a group of samples/metadata, it gives an array with the percentage of assignments to this family of bacteria depending on the samples/metadata.
\item Computes the Pearson correlation coefficient $r$ between the number of assignments to a group of bacteria and a group of metadata, or between the numbers of assignments to two groups of bacteria (see the sketch after this list). \\
\textsc{Pearson sample product-moment correlation coefficient:} \cite{Pearson} for vectors $x$ and $y$ of values of size $n$, if $\overline{x}$ is the mean of the $(x_{i})_{i}$, and $\overline{y}$ the mean of the $(y_{i})_{i}$, then $r = \frac{\sum_{i = 1}^{n}{(x_{i} - \overline{x})(y_{i} - \overline{y})}}{\sqrt{\sum_{i = 1}^{n}{(x_{i} - \overline{x})^{2}}} \sqrt{\sum_{i = 1}^{n}{(y_{i} - \overline{y})^{2}}}}$. $r$ lies between $-1$ and $1$:
\begin{itemize}
\item $r = 1$ meaning there is a perfect uphill linear relationship between $x$ and $y$.
\item $r = -1$ meaning there is a perfect downhill linear relationship between $x$ and $y$.
\item $r = 0$ meaning there is no linear relationship at all.
\end{itemize}
\item Draws graphs and pie charts.
\item Computes a similarity coefficient $sim(k,l)$ between two patients $P_{k}$ and $P_{l}$: after selecting a subset $M' = \{ m_{i_{1}}, m_{i_{2}}, ..., m_{i_{m}}\}$ of the set of metadata, if $n_{j,k}$ is the known value in $P_{k}$ for $m_{i_{j}}$, then $dist(k,l) = \sum_{j = 1}^{m}{|n_{j,k} - n_{j,l}|}$. The corresponding similarity coefficient is $sim(k,l) = \frac{1}{dist(k,l)}$, with $sim(k,l) = +\infty$ iff $dist(k,l) = 0$.
\end{enumerate}
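The Pearson coefficient translates directly into code; a minimal sketch (the toy vectors are illustrative):
\begin{verbatim}
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient, in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / float(n), sum(y) / float(n)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# e.g. assignments to a group of bacteria vs. a numeric metadatum:
print(pearson([2, 4, 6, 8], [1, 2, 3, 4]))  # -> 1.0 (perfect uphill)
\end{verbatim}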
Code is available at: \emph{github.com/cbib/taxotree}.\\
The overall worst-case time complexity of the computation of the distance matrix is O($n_{taxo-nodes}^{2} \times n_{samples}^{3}$) (see the appendix for more details). On our computer (see \textsc{Appendix A} for the hardware characteristics), it took $10$ minutes to compute.
\section{Results}
\subsection{Tests}
The results have been compared to those obtained by statistical methods on the test database \cite{Enaud}:
\begin{itemize}
\item \textsc{Comparison of microbial populations at Day 0:}
\textsc{Output of the statistical analyses (chapter 5.2.2):} Microbial populations of the groups of samples without antibacterials (samples selected with parameters $Day = 0$, $ATB-IV = 0$) and with antibacterials (samples selected with parameters $Day = 0$, $ATB-IV = 1$) are said to be very alike according to the statistical results of the study. \emph{Clostridium bolteae} and \emph{Pediococcus acidilactici} were overrepresented in the $ATB-IV = 1$ group.\\
According to the table and pie charts (see figure $3.5$), the results of \textsc{TaxoTree} seem to confirm the output of the statistical analyses.
\begin{table}[H]
\caption{Results from \textsc{TaxoTree} for the comparison at Day 0, (*) with Day = 0, (**) with (Clostridium bolteae,S),(Pediococcus acidilactici,S), where S is one of the taxonomic ranks}
\begin{tabular}{|l|c|r|}
\hline
\textsc{Score} & \textsc{Result} & \textsc{Parameters}\\
\hline
Normalized Total Ratio & 0.92837592759 & ATB-IV=0,1 (*)\\
\hline
Pattern Ratio & 9.46867603736 & ATB-IV=0,1 (*)\\
\hline
Percentage Assignments & 59.954342\% & ATB-IV=1, option "bacteria" (**)\\
\hline
Percentage Assignments & 5.877773\% & ATB-IV=0, option "bacteria" (**)\\
\hline
Microbial Diversity & 0.0227222589896 & ATB-IV=0 (*) \\
\hline
Microbial Diversity & 0.0196337966027 & ATB-IV=1 (*) \\
\hline
\end{tabular}
\end{table}
\begin{figure}
\begin{center}
\subfigure\includegraphics[scale=0.3]{illustrations/diversityDay0ATB0.png}
\subfigure\includegraphics[scale=0.3]{illustrations/diversityDay0ATB1.png}
\caption{Microbial Diversity Day=0, ATB-IV=0 (top) and ATB-IV=1 (bottom)}
\end{center}
\end{figure}
\item \textsc{Evolution of microbial populations:}
\textsc{Output of the statistical analyses (chapter 5.2.3):} Microbial populations are said to change between Day $0$, Day $45$ and Day $90$ for all patients, with or without $ATB-IV$. Nevertheless, microbial diversity did not seem to be appreciably different between Day $0$, Day $45$ and Day $90$ in patients treated with antibacterials.\\
The result of Total Ratio does not match the statistical results. However, Pattern Ratio shows the variations of the microbial populations between Day $0$, Day $45$ and Day $90$. Indeed, when Pattern Ratio is much greater than $1$, it means the samples have a similar node population and share some similar clusters of nodes. It could mean that patients' guts recover their initial Day-$0$ microbial population at Day $90$, which would explain the high coefficient of similarity between Day $0$ and Day $90$ samples, while the Day-$45$ samples have a node population different from the initial one, due to the treatment for treated patients and to an unknown cause for untreated ones (the paper underlines this problem for $ATB-IV = 0$ patients). Microbial diversity does not seem to change much between Day $0$, Day $45$ and Day $90$ for $ATB-IV = 1$ patients (see figure $3.6$).
\begin{table}
\caption{Results from \textsc{TaxoTree} for the comparison at Day 0, Day 45, Day 90}
\begin{tabular}{|l|c|r|}
\hline
\textsc{Score} & \textsc{Result} & \textsc{Parameters}\\
\hline
Normalized Total Ratio & 0.969346342836 & Day=0,45\\
\hline
Normalized Total Ratio & 0.929612804449 & Day=45,90\\
\hline
Normalized Total Ratio & 0.94998844733 & Day=0,90\\
\hline
Pattern Ratio & 3.75530416309 & Day=0,45 \\
\hline
Pattern Ratio & 13.0716519314 & Day=0,90 \\
\hline
Pattern Ratio & 1.9536860416 & Day=45,90 \\
\hline
Microbial Diversity & 0.0196337966027 & Day=0,ATB-IV=1 \\
\hline
Microbial Diversity & 0.0195234943746 & Day=45,ATB-IV=1 \\
\hline
Microbial Diversity & 0.0163247297595 & Day=90, ATB-IV=1\\
\hline
\end{tabular}
\end{table}
\begin{figure}
\begin{center}
\subfigure\includegraphics[scale=0.3]{illustrations/diversityDay451.png}
\subfigure\includegraphics[scale=0.3]{illustrations/diversityDay901.png}
\caption{Microbial Diversity Day=45 (top) and Day=90 (bottom), ATB-IV=1}
\end{center}
\end{figure}
\item \textsc{Comparison of the groups of samples at Day 90:}
\textsc{Output of the statistical analyses (chapter 5.2.4):} The $ATB-IV = 1$ and $ATB-IV = 0$ groups still bear a significant difference.
\begin{table}
\caption{Results from \textsc{TaxoTree} for the comparison at Day 90, (*) for Day 0, (**) for Day 90}
\begin{tabular}{|l|c|r|}
\hline
\textsc{Score} & \textsc{Result} & \textsc{Parameters}\\
\hline
Normalized Total Ratio & 0.92837592759 & ATB-IV=0,1 (*)\\
\hline
Normalized Total Ratio & 0.907051044683 & ATB-IV=0,1 (**)\\
\hline
Pattern Ratio & 9.46867603736 & ATB-IV=0,1 (*) \\
\hline
Pattern Ratio & 1.80113934679 & ATB-IV=0,1 (**) \\
\hline
Microbial Diversity & 0.0227222589896 & ATB-IV=0 (*)\\
\hline
Microbial Diversity & 0.0197440988308 & ATB-IV=1 (*)\\
\hline
Microbial Diversity & 0.0254798146923 & ATB-IV=0 (**)\\
\hline
Microbial Diversity & 0.0163247297595 & ATB-IV=1 (**)\\
\hline
\end{tabular}
\end{table}
Although the microbial diversity is said to evolve in the same way in both the $ATB-IV = 1$ and $ATB-IV = 0$ groups, here the gap between the two diversity coefficients is wider at Day 90 than at Day 0. We cannot tell whether it is a natural variation or an effect of the antibacterials. The difference between the samples increases according to the variation of the \textsc{Pattern Ratio} coefficient, meaning the microbial populations of the two groups have diverged (under antibacterials?). Here, \textsc{Total Ratio} does not give any information.
\end{itemize}
\subsection{Overview and discussion}
\subsubsection{About the method}
The main difficulty when using this method is thus to find relevant measures to characterize the samples. Since many elements of the biological mechanisms remain unknown today, we cannot be sure to have the best interpretation of these results. The algorithms below try to get rid of the \emph{a priori} hypotheses we could have about the biological mechanisms, which may influence the resulting scores.\\
This method is rather naive, and its results tend to be close to those of the previous statistical analyses.\\
\subsubsection{About the numerical results}
Although it is the most intuitive score, the \textsc{Total Ratio} measure appears not to be really relevant here. This can be explained by the fact that it does not take into account the proportions of assignments to nodes. As long as the two groups of samples share a node, it is considered a common node, even though group $1$ might have many more reads assigned to this node than group $2$. By contrast, \textsc{Pattern Ratio} seems justified by biology and gives consistent results.\\
Some of these results contradict the statistical results. We would need to test the method on other databases to determine which is right.\\
\chapter{Supervised learning (TaxoClassifier software)}
The section just below explains some important concepts that will help to understand the second method, described in the following sections.
\section{Machine Learning and supervised learning: the Naive Bayes classifier}
This section aims at explaining what \emph{Machine Learning} and \emph{supervised learning} are, and at describing the \emph{Naive Bayes} classifier and \textsc{Youden's J coefficient}, which will quantify the relevance of the classification.
\subsection{Machine Learning}
The last two approaches, \textsc{TaxoClassifier} and \textsc{TaxoCluster}, use Machine Learning algorithms. \emph{Machine Learning} is a paradigm that automates the recognition of meaningful patterns in data \cite{SSS}, that is, the machine is technically able to sort the input according to certain criteria, without the sorting being explicitly programmed.\\
For instance, it can be used to detect spam emails: the program searches the emails for keywords such as "fortune", "bank code", or "money". Then it evaluates (by computing the probability of having these keywords if the email is not spam) whether the email should be discarded or not. If the probability of this email being spam, given this number of keywords, is greater than the probability of it being a regular email with these keywords, then the email is deleted. By contrast, a 'regular' program would need a threshold on the number of occurrences above which it discards the email; but which threshold value should we choose to be sure to keep (most of) our non-spam emails?\\
Along with the ability to deal with tremendously large files, \emph{Machine Learning} has nowadays become a quite common tool in bioinformatics \cite{Nikolski}. One of the ways to make the machine "learn" is called \emph{supervised learning}.
\subsection{Supervised Learning}
\emph{Supervised learning} algorithms try to classify data into fixed labeled categories, basing their decisions on prior knowledge provided by a training set of data, e.g. the algorithm from the spam example above. A set of emails comprising safe emails and spam (where the spam or non-spam nature of each email is indicated and constitutes the labels of the categories) is given to the algorithm. It then computes (for instance) the probability of occurrence of the words "fortune", "money", "love" in spam and non-spam emails. If the difference between the two probabilities is significant, it can help distinguish the unsafe emails in the mailbox with the previous strategy.\\
One of the most used classifiers is the \emph{Naive Bayes} one.
\subsection{Naive Bayes Classifier}
Given $k$ (disjoint) classes of data $(C_{i})_{1 \le i \le k}$, a set of criteria $(F_{i})_{1 \le i \le m}$, and a datum $d$ to classify having $(F_{i} = x_{i})_{1 \le i \le m}$, the \emph{Naive Bayes classifier} \cite{NaiveBayes} computes $(P(C_{j} | F_{1} = x_{1},...,F_{m} = x_{m}))_{1 \le j \le k}$, that is, the probability of $d$ being in $C_{j}$ given these values of $(F_{i})_{1 \le i \le m}$. Then $d$ belongs to the class $C_{j}$ such that:\\
\begin{center}
$P(C_{j} | F_{1} = x_{1}, ..., F_{m} = x_{m}) = \max_{h \in \{ 1, ..., k \}}P(C_{h} | F_{1} = x_{1}, ... F_{m} = x_{m})$.\\
\end{center}
Since we do not know these probabilities, using Bayes' theorem \cite{NaiveBayes}, the conditional probabilities can be rewritten as follows (let $(F = x)$ denote $(F_{1} = x_{1}, ..., F_{m} = x_{m})$):\\
\begin{center}
$P(C_{j} | F = x) = \frac{P(C_{j},F = x)}{P(F = x)}$\\
\end{center}
Using the chain rule \cite{NaiveBayes}, we can get:\\
\begin{center}
$P(C_{j},F_{1} = x_{1}, ..., F_{m} = x_{m})$\\
$= P(C_{j})P(F_{1} = x_{1} | C_{j})P(F_{2} = x_{2} | C_{j}, F_{1} = x_{1}) ... P(F_{m} = x_{m} | C_{j}, F_{1} = x_{1}, F_{2} = x_{2}, ..., F_{m-1} = x_{m-1})$\\
\end{center}
Under the assumption of independence of the $(F_{i})_{i}$ \cite{NaiveBayes}, this can be rewritten as follows:\\
\begin{center}
$P(C_{j},F_{1} = x_{1}, ..., F_{m} = x_{m})$\\
$= P(C_{j})P(F_{1} = x_{1} | C_{j})P(F_{2} = x_{2} | C_{j}) ... P(F_{m} = x_{m} | C_{j})$\\
\end{center}
Eventually, the probability we are looking for is:\\
\begin{center}
$P(C_{j} | F = x) = \frac{P(C_{j})\prod{_{i = 1}^{m}}{P(F_{i} = x_{i} | C_{j})}}{P(F = x)}$\\
\end{center}
Later in this report, we use the \emph{Bernoulli model} \cite{NaiveBayes} to compute $P(F_{i} = x_{i} | C_{j})$. This model has a few drawbacks, but can be a good solution when the relations between the $(F_{i})_{i}$ are unknown, and when the $(F_{i})_{i}$ can only take two (boolean) values.\\
If $p_{i}$ is the probability of $F_{i}$ being true, then:\\
\begin{center}
$P(F_{i} = x_{i} | C_{j}) = p_{i}^{x_{i}}(1 - p_{i})^{1 - x_{i}}$
\end{center}
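As an illustration, here is a minimal sketch of the resulting decision rule under the Bernoulli model; the class names, priors and probabilities are invented for the example (equiprobable priors correspond to the hypothesis used later):
\begin{verbatim}
def bernoulli_posteriors(x, classes):
    """x: boolean vector (presence of each node in the sample).
    classes: class name -> (prior, [p_i for each node]).
    Returns the unnormalized posteriors
    P(C_j) * prod_i P(F_i = x_i | C_j)."""
    post = {}
    for c, (prior, p) in classes.items():
        likelihood = 1.0
        for xi, pi in zip(x, p):
            # Bernoulli model: p_i^x_i * (1 - p_i)^(1 - x_i)
            likelihood *= pi if xi else (1.0 - pi)
        post[c] = prior * likelihood
    return post

# Invented example with two classes and two nodes:
classes = {"ATB=0": (0.5, [0.8, 0.1]), "ATB=1": (0.5, [0.2, 0.6])}
post = bernoulli_posteriors([1, 0], classes)
print(max(post, key=post.get))  # class maximizing the posterior
\end{verbatim}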
\subsection{Youden's J coefficient}
The measure used here to quantify the relevance of the classification is \textsc{Youden's J coefficient} \cite{Youden}. Note that many other measures exist, such as the \emph{F-measure} \cite{F-measure}, the \emph{study of the ROC space graph} \cite{IntroROC}, etc. Since the \emph{F-measure} raises certain issues \cite{FMproblems}, and knowing that \textsc{Youden's J coefficient} is often used for medical diagnosis, we have considered the latter more helpful.\\
Some basics in statistics are required here. For a certain class $C$, we denote (see figure $4.1$):
\begin{itemize}
\item $TP(C)$ the True Positive data, that is, the data that belong to $C$ and are assigned by the algorithm to $C$.
\item $TN(C)$ the True Negative data, not assigned to $C$ and not belonging to $C$.
\item $FN(C)$ the False Negative data, not assigned to $C$ and belonging to $C$.
\item $FP(C)$ the False Positive data, assigned to $C$ and not belonging to $C$.
\end{itemize}
\begin{figure}[H]
\centering
\includegraphics[scale=0.5]{illustrations/confusionmatrix.png}
\caption{\textsc{Confusion matrix} describing the four classes from gepsoft.com}
\end{figure}
Then the \textsc{Youden's J coefficient} for this class $C$ is:\\
\begin{center}
$J(C) = \frac{TP(C)}{TP(C) + FN(C)} + \frac{TN(C)}{TN(C) + FP(C)} - 1$.\\
\end{center}
It combines the \emph{sensitivity} (or \emph{recall}) \cite{F-measure}, i.e. $\frac{TP(C)}{TP(C)+FN(C)}$, the proportion of objects belonging to $C$ that are correctly assigned to $C$, and the \emph{specificity} \cite{F-measure}, i.e. $\frac{TN(C)}{TN(C)+FP(C)}$, the proportion of objects not belonging to $C$ that are correctly not assigned to $C$.\\
\begin{itemize}
\item When $J(C) = 1$, it means that the classification is perfect (no $FN$ and no $FP$).
\item When $J(C) = 0$, it means that the test does not do any better than a random classifier.
\item When $J(C) = -1$, it means that the test went all wrong (no $TP$ and no $TN$).
\end{itemize}
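A direct transcription of this coefficient from the confusion counts:
\begin{verbatim}
def youden_j(tp, fn, tn, fp):
    """J = sensitivity + specificity - 1, in [-1, 1]."""
    sensitivity = tp / float(tp + fn)  # TP / (TP + FN)
    specificity = tn / float(tn + fp)  # TN / (TN + FP)
    return sensitivity + specificity - 1.0

print(youden_j(tp=40, fn=0, tn=50, fp=0))    # -> 1.0 (perfect)
print(youden_j(tp=20, fn=20, tn=25, fp=25))  # -> 0.0 (random)
\end{verbatim}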
\section{Description of the method}
\textsc{TaxoClassifier} tries to solve problems \textsc{CL.Comp} and \textsc{CL.BCLust}. It uses a \emph{Naive Bayes Classifier}, which attempts to classify the samples/patients into one of the groups of values of metadata depending on their microbial populations.\\
\begin{enumerate}
\item First the user is asked to give the metadata $M$ which will partition the set of samples, and the node population $N$ to consider in the classification. This sets the number of classes, according to the different values for each metadatum of $M$.
\begin{figure}[H]
\centering
\includegraphics[scale=0.3]{illustrations/classessimples.png}
\caption{Let $M_1$ be a metadatum having values 0 (red), or 1 (green), or 2 (blue). Then there are three classes associated with $M_1$.}
\end{figure}
\begin{figure}[H]
\centering
\includegraphics[scale=0.3]{illustrations/classescomplexes.png}
\caption{Let $M_2$ be a metadatum having values 0 (yellow), or 1 (cyan). Then there are two classes associated with $M_2$, that may intersect classes of $M_1$. Thus there are four classes associated with $M_1$ and $M_2$ (classes do not take into account unknown values)}
\end{figure}
\item Then a random strict subset $R$ of the initial set of samples is chosen to be the \emph{training set}. Its size is chosen by the user. From this training set, the algorithm computes an estimated probability of having a certain node, for each node in $N$ (corresponding to the $p_{i}$ described above), and also, for each sample $s$ and each node $n$, whether $n$ appears in $s$ (this information is stored as a boolean value 0 or 1: with the previous notations, if $F_{i}$ is associated with node $n$, whether $F_{i} = x_{i} = 1$ in sample $s$).
\item Hence, for every sample not in $R$ (since samples in $R$ are already assigned), under the hypothesis of equiprobability of being in any class, the algorithm computes, for each class $C$, the probability of being in $C$ given the actual presence of the nodes of $N$ in this sample.
\item The algorithm eventually assigns the sample to the class which maximizes this probability.
\end{enumerate}
The relevance of the resulting classification $f_{class}$ is quantified by \textsc{Youden's J coefficient} $J_{f_{class}}$, much as above. The best classification (which highlights a significant correspondence between the values of the selected metadata and the microbial populations) is the classification such that, for every class $C$, $J_{f_{class}}(C)$ is closest to 1.\\
In other words, if there are $k$ classes $(C_{i})_{1 \le i \le k}$,\\
\begin{center}
$k - \sum_{i = 1}^{k}{J_{f_{class}}(C_{i})}$ is minimal and non-negative.
\end{center}
\section{Implementation}
This method has also been implemented in \textsc{Python 2.7.9}. The program allows two operations, each of them solving either \textsc{CL.Comp} or \textsc{CL.BCLust}:\\
\begin{itemize}
\item The user chooses the sets $M$ and $N$ as above, and the program returns the \textsc{Youden's J coefficient} associated with the resulting classification, giving a hint about a possible correspondence between $M$ and $N$.
\item The user chooses two integers $s$ and $n$, and the set $M$. The program then randomly picks, $s$ times, $n$ distinct nodes in the taxonomic tree, and returns the set $N$ of nodes of size $n$ which ensures the best classification (among the $s$ tries) for $M$. It may give a hint of a possible correspondence between $M$ and the resulting set $N$.
\end{itemize}
The two main difficulties with the \textsc{K-Means} algorithm are the choice of $k$ and the initialization of the clusters. Fortunately, our method avoids these problems, since $k$ is the number of metadata classes associated with the user-selected metadata, and each cluster is initialized with a sample having the values of metadata associated with this cluster.\\
To deal with multiple metadata classes (see the example above), multi-dimensional lists have been implemented (see annex).\\
The overall worst-case time complexity of a classification is roughly O($n_{samples}^{2} \times n_{taxo-nodes}^{2} \times n_{values}$) (see the appendix for more details).\\
(Still in progress) code can be found at:\\ \begin{center}\emph{github.com/kuredatan/taxoclassifier}.\end{center}
\section{Results}
\subsection{Tests}
Due to time limitations, tests have not yet been completed.\\
\subsection{Overview and discussion}
\subsubsection{About the method}
Although the algorithm classifies samples not by focusing on specific characteristics of the node population, but by directly comparing the microbial populations (which is theoretically better than the previous program), we still have to make \emph{a priori} hypotheses on the set of samples.\\
First, we assume the equiprobability of being in any class, that is, the equiprobability of having a certain value of the selected metadatum. However, not only does this strongly depend on the values of other metadata, but it can even be plainly wrong for some metadata. In our database for instance, $ATB-IV = 0$ and $ATB-Os = 0$ imply $ATB-IV-Os = 0$: $ATB-IV = 0$ means that the patient has not been treated with antibacterials by intravenous injection (\emph{IV}), $ATB-Os = 0$ means that the patient has not been treated with antibacterials \emph{per os}, and $ATB-IV-Os = 0$ means that the patient has been treated with antibacterials neither \emph{per os} nor \emph{IV}. Unfortunately, since we do not know the exact relationships between the different metadata (e.g. does the treatment really affect the quality of life of the patient?), we cannot fix this issue.\\
The same goes for the hypothesis of independence of the probabilities of having a certain node, which is at the core of the Naive Bayes model. Biologically speaking, we can wonder whether having a certain species of bacteria may prevent another species from developing in the gut.\\
Second, the size and the content of the starting set for the classification given by the \emph{Naive Bayes Classifier} are of paramount importance. Prior probabilities of having a certain node are computed from this starting set by default (because assuming equiprobability would lead to contradictions, since we know antibacterials, for instance, do affect the microbial population). If the starting set is badly chosen, some nodes might not have matched any read in the starting group of samples, and thus the probability of having such a node is theoretically zero. The Bernoulli model would then always return zero for the probability of being in a certain class, if this node is selected.\\
We thus let the user choose the size of the starting set themselves, given the total size of the set of samples to classify.\\
A solution to this problem is to use the \emph{Bayesian average} \cite{BayesianRating}: the probability $p_{n}$ of having node $n$ is computed from the starting set $S_{start}$ as follows. Let:
\begin{itemize}
\item $N_{n}$ be the number of samples having node $n$ among the samples in the starting set,
\item Boolean variable $x_{i,j} = 1$ iff node $j$ is matched in sample $i$ (else $x_{i,j} = 0$),
\item $M_{n} = \frac{N_{n}}{|S_{start}|}$,
\item $v_{n} = \sqrt{\sum_{i=1}^{|S_{start}|}{(x_{i,n} - M_{n})^{2}}}$,
\item $C_{n} = \frac{|S_{start}|}{v_{n} + 1}$,
\item $m_{n} = \frac{C_{n}}{|S_{start}|}$,
\end{itemize}
\begin{center}
Then $p_{n} = \frac{C_{n}.m_{n} + N_{n}}{C_{n} + |S_{start}|}$.\\
\end{center}
The idea of the \emph{Bayesian average} is not to fully rely on the samples of the starting set to compute the probabilities of having a certain node. $C_{n}$ is a coefficient denoting how confident we are in the probabilities given by the starting set. The more we believe the probabilities are distributed over the whole set of samples the same way as in the starting set, the larger $C_{n}$ is. This is quantified by the variation of the values in the starting set: if most of the samples have matched this node, that is, the variation is low, it is likely that all samples may actually own this node.\\
Note that the maximum value of $C_{n}$ is $|S_{start}|$. Having $C_{n} = |S_{start}|$ (that is, $v_{n} = 0$) would mean that we completely trust the starting set (replacing $C_{n}$ in the formula above leads to $p_{n} = \frac{|S_{start}| + N_{n}}{2|S_{start}|}$), whereas $C_{n} = 0$ gives $p_{n} = \frac{N_{n}}{|S_{start}|}$, that is, we consider the non-amplified probabilities from the starting set.\\
For instance, if node $n$ matches in $5$ samples out of $17$, having $C_{n} = 0$ would give $p_{n} = \frac{5}{17} \simeq 0.29$, while having $C_{n} = |S_{start}|$ gives $p_{n} = \frac{17 + 5}{2 \times 17} \simeq 0.65$. If node $n$ does not match in any of the $17$ samples, then for $C_{n} = \frac{|S_{start}|}{2}$ (hence $m_{n} = \frac{1}{2}$), $p_{n} = \frac{|S_{start}|/4}{3|S_{start}|/2} = \frac{1}{6} \simeq 0.17$.\\
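A transcription of these formulas in \textsc{Python} (the node/match values are illustrative):
\begin{verbatim}
import math

def bayesian_average(matches):
    """matches: list of 0/1 values, one per sample of the starting
    set (1 iff the node is matched in that sample). Returns the
    smoothed probability p_n of having the node."""
    size = len(matches)                  # |S_start|
    N = sum(matches)                     # N_n
    M = N / float(size)                  # M_n
    v = math.sqrt(sum((x - M) ** 2 for x in matches))  # v_n
    C = size / (v + 1.0)                 # C_n: confidence coefficient
    m = C / float(size)                  # m_n
    return (C * m + N) / (C + size)      # p_n

# Node matched in 5 samples out of 17:
print(bayesian_average([1] * 5 + [0] * 12))  # between 0.29 and 0.65
\end{verbatim}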
%\subsubsection{About the numerical results}
\chapter{Unsupervised learning (TaxoCluster software)}
The following sections introduce some useful concepts and notations, that will help to understand the third method described at the end of this chapter.
\section{Machine Learning and unsupervised learning: the K-Means algorithm}
This section explains what \emph{unsupervised learning} is in \emph{Machine Learning}, and presents one of the most common tools in this category, the \emph{K-Means algorithm}.
\subsection{Unsupervised Learning}
Another broad category of \emph{Machine Learning} is \emph{unsupervised learning}: unlike in \emph{supervised learning}, the different classes to which the data belong are not known at first. Given the set of elements, the algorithm has to distinguish by itself several different classes according to the similarity between the pieces of data of the initial set. This is why one of the most common approaches to \emph{unsupervised learning} is \emph{clustering}.
\subsection{Clustering}
Given a certain distance, \emph{clustering} is the task of gathering objects into disjoint clusters, such that the resulting sets maximize the proximity (in terms of the chosen distance) between elements of a same cluster, and maximize the distance between objects of different clusters.
\subsection{K-Means Algorithm}
Although the problem of partitioning $n$ elements into $k$ clusters \cite{PartitionIsNPhard} is \textsc{NP-complete} \cite{NPhard} (see the note in the bibliography for a definition of \textsc{NP-hardness} and \textsc{NP-completeness}), the \emph{K-Means algorithm} \cite{KMeans} is a rather efficient clustering algorithm. After choosing an integer $k$, the estimated number of clusters, the user initializes each cluster with one of the elements of the set. Then, for each non-clustered object, the algorithm computes the distance from this object to the mean of every cluster (in our implementation, the mean is the sample that minimizes the sum of all distances to the other samples of the same cluster), and assigns the object to the closest cluster. It then updates the mean of this cluster, and iterates the last two steps until convergence of the solution (see the sketch below). For a more detailed explanation, see \cite{KMeans}.
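Below is a minimal sketch of this variant of \textsc{K-Means}, with medoid-style means and a generic distance function; it is a simplified illustration, not the exact code of \textsc{TaxoCluster}:
\begin{verbatim}
def kmeans(elements, init, dist, n_iter=100):
    """K-Means with medoid-style means: the 'mean' of a cluster is
    the element minimizing the sum of distances to the others.
    elements: list of (distinct) objects; init: k initial elements;
    dist: distance function over pairs of objects."""
    medoids = list(init)
    for _ in range(n_iter):
        clusters = [[m] for m in medoids]
        for e in elements:
            if e in medoids:
                continue
            # assign e to the cluster with the closest medoid
            i = min(range(len(medoids)),
                    key=lambda j: dist(e, medoids[j]))
            clusters[i].append(e)
        new = [min(c, key=lambda g: sum(dist(g, f) for f in c))
               for c in clusters]
        if new == medoids:  # convergence
            break
        medoids = new
    return clusters

pts = [1, 2, 3, 10, 11, 12]
print(kmeans(pts, init=[1, 10], dist=lambda a, b: abs(a - b)))
\end{verbatim}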
\section{Definitions}
Before describing our \textsc{TaxoCluster} method, here are some helpful notations (adapted from \cite{Tango1}):\\
\subsection{Notations}
Let $T$ be the whole taxonomic tree. For a certain read $i$, let $M_{i}$ be the set of matched leaves for this read, and $T_{i}$ the subtree of $T$ rooted at the LCA (see above for the definition) of the nodes of $M_{i}$. Let $L_{i}$ be the set of leaves of $T_{i}$, and let $N_{i}$ be such that $L_{i} = M_{i} \sqcup N_{i}$ ($M_{i}$ and $N_{i}$ are disjoint, see figure $5.1$).
\begin{figure}[H]
\centering
\includegraphics[scale=0.3]{illustrations/timili.png}
\caption{Let T be the whole taxonomic tree. Then for the set of red nodes, the subtree associated is rooted at the violet node, and the total set of leaves for this subtree is the set of yellow and red nodes.}
\end{figure}
\subsection{Distances}
Let us define two distances $d_{matched}$ and $d_{consensus}$ over pairs of reads ($R_{i},R_{j}$) for the clustering:
\begin{itemize}
\item \begin{center} $d_{matched}(R_{i},R_{j}) = |M_{i}| + |M_{j}| - 2|M_{i} \cap M_{j}|$. \end{center}
It is quite easy to check that $d_{matched}$ is indeed a distance: $d_{matched}$ is symmetric, satisfies the triangle inequality, is non-negative, and $d_{matched}(R_{i},R_{j})$ equals zero iff $R_{i} = R_{j}$ (the only relevant equality relation between reads here is the equality of their respective sets of matched nodes, because nobody can know to which node a read should truly be assigned).
\item For a fixed parameter $q \in [0,1]$,\\
\begin{center}
$d_{consensus}(R_{i},R_{j}) = |L_{i}| + |L_{j}| - q*(|N_{i}\cap M_{j}| + |N_{j} \cap M_{i}|) - |M_{i} \cap M_{j}|$.\\
\end{center}
It is also easy to check that $d_{consensus}$ is a distance. This distance corresponds to searching for a consensus taxonomic tree between the trees induced by the sets of nodes matched by the reads \cite{Consensus}.\\
When $q = 0$, we consider a \emph{strict} consensus tree, keeping only leaves matched in both reads.\\
When $q = 1$, we consider a tree whose leaves are matched in at least one of the two reads (see figure $5.2$).
\begin{figure}[H]
\centering
\subfigure\includegraphics[scale=0.35]{illustrations/distance2-1.png}
\subfigure\includegraphics[scale=0.35]{illustrations/distance2-2.png}
\caption{Let us consider the trees $T_{1}$ and $T_{2}$ rooted at A (drawn separately for simplicity's sake), such that $M_{1}$ = \{D, E, F, H\} and $M_{2}$ = \{F, G, H\} (matched leaves are thus in green). Then $d_{consensus}(T_{1},T_{2}) = 5 + 5 - q \times (1 + 2) - 2$. When $q = 0$, $d_{consensus}(T_{1},T_{2}) = 8$. When $q = 1$, $d_{consensus}(T_{1},T_{2}) = 5$}
\end{figure}
\end{itemize}
These distances can easily be extended to samples: if $Reads_{k}$ is the set of reads of sample $S_{k}$,\\
\begin{center}
$d_{consensus}(S_{k},S_{l}) = \sum_{r_{k} \in Reads_{k}}{\sum_{r_{l} \in Reads_{l}}{d_{consensus}(r_{k},r_{l})}}$.
\end{center}
The same goes for $d_{matched}$.
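These two distances translate directly into set operations; the sketch below reproduces the example of figure $5.2$, with leaf sets reconstructed to be consistent with the counts given in the caption:
\begin{verbatim}
def d_matched(M_i, M_j):
    """|M_i| + |M_j| - 2|M_i inter M_j|, over sets of matched leaves."""
    return len(M_i) + len(M_j) - 2 * len(M_i & M_j)

def d_consensus(L_i, M_i, L_j, M_j, q):
    """L = leaves below the read's LCA, M = matched leaves, N = L - M."""
    N_i, N_j = L_i - M_i, L_j - M_j
    return (len(L_i) + len(L_j)
            - q * (len(N_i & M_j) + len(N_j & M_i))
            - len(M_i & M_j))

# Example of figure 5.2 (leaf sets consistent with the caption):
L1 = {"D", "E", "F", "G", "H"}
M1, M2 = {"D", "E", "F", "H"}, {"F", "G", "H"}
print(d_matched(M1, M2))               # -> 3
print(d_consensus(L1, M1, L1, M2, 0))  # -> 8
print(d_consensus(L1, M1, L1, M2, 1))  # -> 5
\end{verbatim}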
\section{Description of the method}
\textsc{TaxoCluster} tries to solve problem \textsc{CL.Comp}. It compares the trees induced by the samples before the assignment of reads, and avoids making \emph{a priori} hypotheses on the probabilities of being in a certain class of metadata values, or of having a certain node. The number $k$ of clusters used in the \textsc{K-Means} algorithm is thus the number of vectors of metadata values that can be obtained.\\
We use the very same definition of classes of metadata as in the second method.
\begin{itemize}
\item First, the set of samples is clustered into $k$ clusters using the $d_{matched}$ distance.
\item For every cluster, the most remote elements are deleted, in order to keep only relevant elements. In our implementation, we discard the samples whose sum of distances to the other samples of the same cluster is above the third quartile of the list of such sums (see the sketch after this list).
\item Then the union of the elements remaining in all clusters is clustered again into $k$ clusters, using the $d_{consensus}$ distance.
\end{itemize}
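A minimal sketch of the pruning step (the second step above), in \textsc{Python} with illustrative names, assuming a simple index-based estimate of the third quartile:
\begin{verbatim}
# Discard, within a cluster, the samples whose total distance to the
# other samples exceeds the third quartile of these totals.
def prune_cluster(cluster, dist):
    if len(cluster) < 2:
        return list(cluster)
    totals = {s: sum(dist(s, t) for t in cluster if t is not s)
              for s in cluster}
    values = sorted(totals.values())
    q3 = values[(3 * (len(values) - 1)) // 4]  # rough third quartile
    return [s for s in cluster if totals[s] <= q3]
\end{verbatim}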
The clusters resulting from the second clustering are then compared to the clusters obtained by directly looking at the values of metadata in the samples. If the two groups of clusters are alike (this is quantified by a dedicated similarity score), this could mean that there exists a correspondence between the selected metadata and the microbial populations. The score comparing two clusters $C_{1}$ and $C_{2}$ is their Jaccard index $d(1,2) = \frac{|C_{1} \cap C_{2}|}{|C_{1}| + |C_{2}| - |C_{1} \cap C_{2}|}$ (if both clusters are empty, it returns \textsc{None}). The overall score is $d_{clusters} = \frac{\sum_{1 \le i < j \le k} d(i,j)}{k}$, where $k$ is the total number of clusters (if one of the clusters is empty, it returns \textsc{None}). The program also returns (on demand) the bacteria common to the samples of a same cluster.\\
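The pairwise score can be written in a few lines; again this is only a sketch with illustrative names:
\begin{verbatim}
# Pairwise cluster comparison: the Jaccard index of the two clusters,
# or None when both clusters are empty.
def d_pair(c1, c2):
    c1, c2 = set(c1), set(c2)
    inter = len(c1 & c2)
    union = len(c1) + len(c2) - inter
    return float(inter) / union if union else None
\end{verbatim}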
\section{Implementation}
The whole pipeline has been implemented in \textsc{Python 2.7}, and allows the user to draw and inspect the clusters through a DOT file.\\
The worst case time complexity is O($n_{samples}^{2} \times (n_{samples} \times n_{paths} \times n_{taxo-nodes} + n_{taxo-nodes}^{2})$) (see annex for details).\\
The (still in progress) code can be seen at: \\\begin{center}\emph{github.com/kuredatan/taxocluster}.\end{center}
\section{Results}
\subsection{Tests}
The current implementation of \textsc{TaxoCluster} can only cluster the samples according to the values of a single metadatum. With the use of MDL, or Multi-Dimensional Lists (see \textsc{Appendix D}), it would be able to handle several metadata at once. As for the previous method, tests have not yet been completed.\\
The tests that could be executed on our test database show that the computed value of the distance $d_{consensus}$ does not depend on the value of $q$: indeed, the LCA lies too high in the taxonomic tree (it has rank $R$ or $K$), due to the large number of nodes matched for each patient. A solution would be to count the number of matches for every node, and to keep only the most frequently matched ones.\\
\subsection{Overview and discussion}
\subsubsection{About the method}
The main issue with this method is that, unlike the first \emph{Machine Learning} approach, it does not return the nodes that actually discriminate the samples: we only get access to the list of bacteria common to the samples of a same cluster. One solution would be to post-process the clusters with PCA (\emph{Principal Component Analysis}) \cite{PCA}, or to test this pipeline with a clustering algorithm other than \emph{K-Means}.
%\subsubsection{About the numerical results}
\newpage
\chapter{Comparison of the three approaches}
Theoretically, \textsc{TaxoCluster} is the best method, as it aims at fixing the issues of the first two methods, \textsc{TaxoTree} and \textsc{TaxoClassifier}. Indeed, it does not require a special measure on microbial populations, which would induce \emph{a priori} hypotheses on how these populations should behave with respect to the values of metadata. This is of paramount importance since most of the underlying biological phenomena remain partially unknown. Moreover, it does not require strong mathematical hypotheses, as is the case for the \emph{Naive Bayes Classifier}. The latter allows more flexibility, although this flexibility might be irrelevant if the problem is specific enough.\\
Notice that \textsc{TaxoCluster} can only solve the \textsc{Problem of Clustering Compatibility (CL.Comp)}, and not the \textsc{Problem of Best Clustering (CL.BClust)}, unlike \textsc{TaxoClassifier}. Furthermore, in terms of worst case time complexity, given the actual values of each variable in our test database, it is unfortunately the slowest algorithm:\\
\begin{table}
\caption{Complexity for each method: for our test database, $n_{taxo-nodes} = 9,065$, $n_{values} \le 10$, $n_{samples} = 47$ and $n_{paths} = 5,565$}
\begin{tabular}{|l|c|}
\hline
\textsc{Method} & \textsc{Theoretical complexity} \\
\hline
Statistics & O($n_{taxo-nodes}^{2} \times n_{samples}^{3}$)\\
\hline
Supervised learning & O($n_{samples}^{2} \times n_{taxo-nodes}^{2} \times n_{values}$)\\
\hline
Non-supervised learning & O($n_{samples}^{2} \times n_{taxo-nodes}\times (n_{samples} \times n_{paths})$)\\
\hline
\end{tabular}
\end{table}
It is worth noting that the three methods can nevertheless be executed in full on the test computer, even though it is just a regular machine (see \textsc{Appendix A}). However, to fully validate the results of the algorithms, we should get access to several other databases, and compare the outputs to current biological knowledge and to the results of statistical methods. Due to time limitations, we unfortunately could not use any other database to perform these necessary tests. Moreover, the consistency of the results greatly depends on the operations previously applied to the input data (have the numeric results been normalized prior to running the algorithm? Has all raw material for the samples been collected in the same way, following the same standard procedure? Has each assignment of reads to nodes been performed with the same parameters? And so on). Last but not least, the final interpretation of the results still cannot be relied upon without the validation of the practitioner or an experienced biologist.
\chapter{Outlook}
\section{Personal work}
During this internship, we suggested and implemented three methods to analyse sequencing data according to the provided metadata: \textsc{TaxoTree}, \textsc{TaxoClassifier} and \textsc{TaxoCluster}, each of them using a different general paradigm of computer science. We tried to design each method as an improvement of the previous one. These methods solve various clustering-like problems that often occur in biology and medicine. Although their time complexity can probably be improved, the numerical tests done so far appear to be consistent with the medical observations.\\
We also implemented a class of multi-dimensional lists, and a rather efficient algorithm (with regard to the size of our data) to reconstruct taxonomic trees from the list of paths from the root to every leaf.\\
\section{Comments}
The use of \emph{Machine Learning} algorithms in metagenomics is not new \cite{Nikolski}. However, the last two approaches use a hitherto unseen mix of common \emph{supervised} and \emph{non-supervised learning} algorithms, statistics and tree algorithms.\\
On the one hand, quite a large number of comparison problems over labeled unordered trees (taxonomic trees being a special case of this sort of tree) are \textsc{NP-complete}, for instance the problem of tree inclusion \cite{TreeInclusion}. What makes efficient algorithms possible here is that the relevant comparison between trees mainly focuses on their sets of leaves. However, the time complexity of the last two methods, that is \textsc{TaxoClassifier} and \textsc{TaxoCluster}, should still be improved.\\
On the other hand, it would be interesting to test these algorithms on databases other than the one of cystic fibrosis-afflicted patients, so as to check whether the correspondences given by the software are relevant.\\
The code for all three methods is available on \textsc{GitHub}.\\
\newpage
\bibliographystyle{plain}
\bibliography{rapport.bib}
\newpage
\appendix
\chapter*{Annex}
Here are gathered details about the hardware limits of our computer for testing, detailed complexity function by function for each of the three methods described above, different algorithms to reconstruct a taxonomic tree as well as the implementation of multi-dimensional lists used in two out of the three methods.
\chapter{Hardware characteristics of the testing computer}
\begin{itemize}
\item \uline{Processor:} Intel Core i3-3227U (3rd generation).
\item \uline{Clock speed:} 1.90 GHz.
\item \uline{L3 cache:} 3 MB.
\item \uline{Hard disk:} 1 TB, rotation speed: 5,400 r/min.
\item \uline{RAM:} 6 GB (max: 16 GB), DDR3 1,600 MHz.
\end{itemize}
\chapter{Complexity}
This chapter gives details about the complexity of each method, per function and for the whole pipeline.
\section{Worst case complexity for the first approach (TaxoTree)}
Let $n_{samples}$ be the total number of available samples (in our test database, $n_{samples} = 47$), and $n_{taxo-nodes}$ the number of nodes in the taxonomic tree ($n_{taxo-nodes} = 9,065$).
\subsection{Per function}
\begin{itemize}
\item \textsc{Total Ratio:} See functions \emph{misc/inSample}, \emph{misc/takeNodesInTree}, \emph{totalRatio/compute} and \emph{totalRatio/countAssignmentsInCommon}.\\
\uline{inSample}: @sampleNameList (the list of user-selected samples) can be as large as the whole list of samples: O($n_{samples}$).\\
\uline{takeNodesInTree}: This procedure does a top-down search in the taxonomic tree, and executes \emph{inSample} for each list of assignments, that is, for each node of the taxonomic tree: O($n_{taxo-nodes} \times n_{samples}$).\\
\uline{compute}: \emph{compute} executes \emph{takeNodesInTree} twice. The member function \emph{mem} is in O($n_{taxo-nodes}$). In the worst case, where the two user-selected lists are the same, the following operations are in O(size of the first subforest) $=$ O($n_{taxo-nodes}$). So \emph{compute} is in O($n_{taxo-nodes} \times n_{samples}$).\\
\uline{countAssignmentsInCommon}: It can easily be seen that \emph{countAssignmentsInCommon} is in O($n_{taxo-nodes} \times n_{samples}^{2}$).\\
The overall complexity is thus O($n_{taxo-nodes} \times n_{samples}^{2}$). The most costly part is \emph{countAssignmentsInCommon}.
\item \textsc{Pattern Ratio:} See functions in \emph{patternRatio} and \emph{taxoTree} modules.
\uline{TaxoTree.search}: The procedure does a top-down search in the tree for the node in argument: O($n_{taxo-nodes}$).\\
\uline{misc/mergeList}: This procedure takes two lists of lengths $n$ and $m$, sorts both of them and then merges the two lists without duplicates (assuming there is no duplicate within either list): O($n\log(n) + m\log(m)$).\\
\uline{misc/trimList}: This procedure takes two lists of lengths $n$ and $m$, and deletes from the first list the elements that belong to the second one. It sorts both lists, and then scans the elements of the two lists in linear time. Thus the time complexity is O($n\log(n) + m\log(m)$) (possible implementations of \emph{mergeList} and \emph{trimList} are sketched after this list).\\
\uline{enumerateCommonPatterns}: This function uses \emph{takeNodesInTree} and considers every node of the tree induced by the user-selected set of samples as a potential starting point for a pattern, and then tries to spread the pattern to the node's children, whenever possible (that is, if the child is matched in at least one of the samples of the user-selected list: the test is in O($n_{samples}$)). The two lists of assignments to this node are merged using the \emph{mergeList} procedure: O($n_{samples}\log(n_{samples})$). The final time complexity is thus O($n_{taxo-nodes} \times n_{samples} + n_{taxo-nodes}^{2} \times (n_{samples} + n_{samples}\log(n_{samples}))$) = O($n_{taxo-nodes}^{2} \times n_{samples}\log(n_{samples})$).\\
\uline{enumerateSpecificPatterns}: This function takes as arguments the two user-selected lists of samples, uses the \emph{trimList} procedure to make them disjoint (so that no pattern of the first list belongs to the second list, and the other way around), and then executes pretty much the same procedure as \emph{enumerateCommonPatterns}. Thus the time complexity is O($n_{taxo-nodes}^{2} \times n_{samples} + n_{samples}\log(n_{samples})$).\\
\uline{patternRatio}: This function counts the number of assignments for all (common or specific) patterns. These patterns are disjoint, by construction, so in the worst case, it counts assignments for every node of the taxonomic tree: O($n_{taxo-nodes}$).\\
Hence the \textsc{Pattern Ratio} procedure runs\\
in O($n_{taxo-nodes}^{2} \times n_{samples}\log(n_{samples})$) time.
\item \textsc{Microbial Diversity:} See the \emph{computeDiversityCoefficient} function in the \emph{diversityCoefficient} module.
\uline{computeDiversityCoefficient}: This procedure uses \emph{takeNodesInTree} (see above). It considers every node of the tree induced by the user-selected list of samples, and counts the total number of assignments to this node among the samples of the list (O($n_{samples}$) if the list contains all available samples).\\
Thus the overall complexity is O($n_{taxo-nodes} \times n_{samples}$).
\end{itemize}
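For illustration, here are possible \textsc{Python} implementations of \emph{mergeList} and \emph{trimList} matching the complexities claimed above; the names are illustrative and this is not the project's actual code:
\begin{verbatim}
def merge_list(l1, l2):
    # Sort both lists, then merge them linearly without duplicates:
    # O(n log n + m log m).
    a, b = sorted(l1), sorted(l2)
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            x, i = a[i], i + 1
        elif i == len(a) or b[j] < a[i]:
            x, j = b[j], j + 1
        else:                        # equal heads: keep a single copy
            x, i, j = a[i], i + 1, j + 1
        if not out or out[-1] != x:
            out.append(x)
    return out

def trim_list(l1, l2):
    # Delete from l1 the elements belonging to l2, via a sorted linear
    # scan (the kept elements are returned in sorted order):
    # O(n log n + m log m).
    a, b = sorted(l1), sorted(l2)
    out, j = [], 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out
\end{verbatim}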
\subsection{Overall complexity}
Therefore, to compute the distance described above for every pair of samples ($\frac{n_{samples}(n_{samples} - 1)}{2}$ pairs, distance being symmetric):\\
The time complexity is roughly in O($n_{taxo-nodes}^{2} \times n_{samples}^{3}$).
\section{Worst case complexity for second approach (TaxoClassifier)}
Without loss of generality, the complexity is evaluated without multi-dimensional lists, since a single metadatum can have an arbitrary number of different values.\\
Let $n_{samples}$ be the total number of available samples (in our test database, $n_{samples} = 47$), and $n_{taxo-nodes}$ the number of nodes in the taxonomic tree ($n_{taxo-nodes} = 9,000$). Let $n_{values}$ be the maximum number of effective values that a metadatum can have in the information matrix, that is, the maximum number of classes (in our test database, $n_{values}$ is bounded by $10$, metadatum $Sample$ excepted). Finally, let $n_{metadata}$ be the total number of metadata ($n_{metadata} = 21$).
\subsection{Per function}
\begin{itemize}
\item \textsc{Training:} See functions in \emph{training} module.\\
\uline{computeClasses}: See \emph{misc/partitionSampleByMetadatumValue}:\\
O($n_{samples}log(n_{samples}) + n_{metadata}$).\\
\uline{selectTrainingSample}: O($n_{samples}$) (reservoir sampling, Algorithm R; see the sketch after this list).\\
\uline{assignClass}: O($n_{samples}^{2} \times n_{values}$).\\
\uline{getPriorProbability}: O($n_{samples}^{2} \times n_{taxo-nodes}^{2}$).\\
The overall complexity is hence O($n_{samples}^{2} \times n_{taxo-nodes}^{2} + n_{samples}^{2} \times n_{values} + n_{metadata}$)\\
= O($n_{samples}^{2} \times n_{taxo-nodes}^{2}$). \\
\item \textsc{Classification:} See functions in \emph{classifier} module.\\
\uline{probabilityKnowingClass}: O($n_{taxo-nodes} \times n_{samples} \times n_{taxo-nodes}$)\\
= O($n_{taxo-nodes}^{2} \times n_{samples}$).\\
\uline{bayesCalculus}: The most costly part is the call to \uline{probabilityKnowingClass}: O($n_{taxo-nodes}^{2} \times n_{samples}$). \\
\uline{classifyIt}: with the call to \uline{trainingPart}: O($n_{samples}^{2} \times n_{taxo-nodes}^{2} \times n_{values}$).\\
\item \textsc{Computation of Youden's J coefficient:} See the \emph{countYouden} function in the \emph{youden} module.
The whole set of classes contains at most the $n_{samples}$ samples. Thus the loop over the classes is in fact a loop over the set of samples. The overall time complexity is then O($n_{samples}^{2} + n_{values}$).\\
\end{itemize}
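For reference, here is a sketch of Algorithm R (reservoir sampling), which draws a uniform training subset in a single O($n_{samples}$) pass; the names are illustrative, not those of our implementation:
\begin{verbatim}
import random

def select_training_sample(samples, m):
    # Algorithm R: fill a reservoir with the first m samples, then
    # replace a uniformly chosen slot with probability m/(i+1).
    reservoir = list(samples[:m])
    for i in range(m, len(samples)):
        j = random.randint(0, i)     # uniform over 0..i inclusive
        if j < m:
            reservoir[j] = samples[i]
    return reservoir
\end{verbatim}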
\subsection{Overall complexity}
The worst case time complexity is roughly in O($n_{samples}^{2} \times n_{taxo-nodes}^{2} \times n_{values}$).
\section{Worst case complexity for third approach (TaxoCluster)}
The same notations as in the previous chapter are used here. The most costly parts of the clustering are the call to \textsc{K-means}, the computation of the two distance matrices, and the comparison between clusters.\\
\subsection{Per function}
\begin{itemize}
\item \textsc{Distances:} Between two given samples:\\
\begin{itemize}
\item \textsc{Distance 1:} The maximum length of the list of matching nodes is $n_{taxo-nodes}$ (lists of matching nodes are stored in a dictionary). Thus the time complexity is in O($n_{taxo-nodes}^{2}$).\\
Thus the computation of the whole distance matrix has a worst case time complexity of O($n_{taxo-nodes}^{2} \times n_{samples}^{2}$).
\item \textsc{Distance 2:} The computation of the LCA of a list of $m$ nodes ($m$ is bounded by $n_{samples}$) is in O($m \times n_{paths} \times l_{path}$) (see \emph{misc/taxoLCA}). The search of subtrees is in O($n_{taxo-nodes}$) (see above), as is the call to \emph{TaxoTree.leaves}, which counts the leaves of the tree. The final loop is in O($n_{taxo-nodes}$).\\
Thus the computation of the whole distance matrix has a worst case time complexity of O($n_{samples}^{2} \times (n_{samples} \times n_{paths} \times l_{path} + n_{taxo-nodes})$).
\end{itemize}
\item \textsc{K-means:} O($(n_{samples} - n_{values}) \times n_{values} \times n_{iterations}$), where $n_{iterations}$ is the number of iterations needed until convergence ($n_{iterations} = 30$ in our implementation).
\item \textsc{Cluster comparison:} For two given clusters (the maximum sum of their lengths being $n_{samples}$, the maximum value of the product of their lengths is $(\frac{n_{samples}}{2})^{2}$):
\uline{compareCluster}: O($n_{samples}^{2}$).\\
\uline{compareCenters}: for two given clusters: O($1$) (distances between samples are stored in a dictionary).\\
\end{itemize}
\subsection{Overall complexity}
Thus the worst case time complexity is\\
O($(n_{samples} - n_{values}) \times n_{values} \times n_{iterations} + n_{samples}^{2} \times n_{values}^{2} + n_{samples}^{2} \times (n_{samples} \times n_{paths} \times l_{path} + n_{taxo-nodes}) + n_{taxo-nodes}^{2} \times n_{samples}^{2}$)\\
= O($n_{samples}^{2} \times (n_{samples} \times n_{paths} \times n_{taxo-nodes} + n_{taxo-nodes}^{2})$).
\chapter{Construction of taxonomic trees}
\section{State-of-the-art and input}
There are of course many algorithms to construct a taxonomic tree, often starting from a distance matrix between the \emph{taxa} (plural of \emph{taxon}, that is, the nodes of the taxonomic tree); for instance, \textsc{Neighbor Joining} \cite{NeighborJoining}, or the \textsc{Unweighted/Weighted Pair Group Method with Arithmetic Mean} \cite{UMPGA}. However, here we already have the whole taxonomic tree and the phylogenetic relations between the \emph{taxa}.\\
Provided a list \emph{paths} of paths from the root to every leaf of the taxonomic tree, we thus need an efficient algorithm to construct the tree, which must grant easy (i.e. as cheap as possible) access to the list of assignments to a node, to its children and also to its phylogeny (the so-called ``lineage''). It should also label the nodes with integer identifiers. Since we could not find an article dealing with this subject (trace analysis does use trees, but since we are interested in a very specific sort of tree, namely taxonomic trees, it would be somewhat off topic here), we had to design a customized algorithm.\\
Let $n_{taxo-nodes}$ be the number of nodes in the taxonomic tree (in our test database, $n_{taxo-nodes} = 9,000$), $n_{samples}$ the number of samples ($n_{samples} = 47$), and $n_{paths}$ the number of paths from the root to every leaf (as it is a tree, this number is also the number of leaves, that is $n_{paths} \simeq 8,000$). Let also $l_{path}$ be the maximum length of a root-to-leaf path (that is, the number of ranks: in our test database, $l_{path} = 8$), and $n_{metadata}$ the number of metadata ($n_{metadata} = 21$).\\
We present three methods in chronological order of design:
\begin{itemize}
\item The first one is a top-down construction, where we go through every path from \emph{root} to a leaf, and construct the missing nodes.
\item Unlike the method above, the second one is a bottom-up construction, where for each taxonomic rank (from S to R, that is, from the most specific to the broadest rank), we gather siblings (nodes with the same father) and construct the corresponding trees until obtaining the tree rooted at \emph{root}.
\item The third one is an improvement of the second method. The first two methods are quite time-consuming, and this one succeeds in being the fastest of the three. A preprocessing step precomputes the groups of siblings and the associated father node, using hash tables (a sketch of this idea is given after this list).
\end{itemize}
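To illustrate the idea behind the third method, here is a minimal \textsc{Python} sketch that precomputes, with hash tables (dictionaries), the children of every node from the list of root-to-leaf paths, and then materializes the tree; the names and the nested-dictionary representation are illustrative only:
\begin{verbatim}
# Each path is a list of (name, rank) pairs starting at the root.
def group_children(paths):
    children = {}                  # hash table: node -> set of children
    for path in paths:
        for father, child in zip(path, path[1:]):
            children.setdefault(father, set()).add(child)
    return children

def build_tree(node, children):
    # Materialize the tree rooted at 'node' as nested dictionaries
    # (recursion depth is bounded by l_path).
    return {child: build_tree(child, children)
            for child in children.get(node, ())}

# Usage: tree = build_tree(root, group_children(paths))
\end{verbatim}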
\section{A naive top-down construction}
The method is to consider every path of \emph{paths} one by one, and to create the corresponding branch of nodes whenever the algorithm encounters a node that has not been created yet while following the path (Algorithm 1).\\
\textsc{INPUT:} \emph{paths} the list of paths from the root to every leaf of the tree, \emph{root} the root of the empty taxonomic tree.\\
\textsc{OUTPUT:} The corresponding taxonomic tree rooted at \emph{root}.\\
\begin{algorithm}
\caption{The naive top-down construction}
\begin{algorithmic}
\FOR{path in \emph{paths}}
\STATE $currNode \leftarrow \emph{root}$
\STATE $currLineage \leftarrow [ (\emph{root}.name, \emph{root}.rank) ]$
\STATE $currChildren \leftarrow \emph{root}.children$
\STATE $currPath \leftarrow \emph{path}.tail()$
\WHILE{currPath is non-empty}
\STATE $name,rank \leftarrow currPath.head()$
\STATE $nextNode \leftarrow$ currNode's child which name, rank are $name$, $rank$