slc-path
crate is based on the single-linkage_clustering
method. In statistics, single-linkage clustering is one of several methods of hierarchical clustering.
This method tends to produce long thin clusters in which nearby elements of the same cluster have small distances, but elements at opposite ends of a cluster may be much farther from each other than two elements of other clusters. For some classes of data, this may lead to difficulties in defining classes that could usefully subdivide the data. However, it is popular in astronomy for analyzing galaxy clusters, which may often involve long strings of matter; in this application, it is also known as the friends-of-friends algorithm.
It is based on grouping clusters in bottom-up fashion (agglomerate clustering), at each step combining two clusters that contain the closest pair of elements not yet belonging to the same cluster as each other.
This module principally use the kodama
crate. With a
forked version.
Clustering allows to trace the extent of the impact of a program on the file tree.
Let us assume that we have five elements (a,b,c,d,e)
and the following matrix D1
of pairwise
distances between them:
a | b | c | d | e | |
---|---|---|---|---|---|
a | 0 | 17 |
21 | 31 | 23 |
b | 17 |
0 | 30 | 34 | 21 |
c | 21 | 30 | 0 | 28 | 39 |
d | 31 | 34 | 28 | 0 | 43 |
e | 23 | 21 | 39 | 43 | 0 |
In this example, D1(a,b)=17
is the lowest value of D1
, so we cluster elements a
and b
.
We then proceed to update the initial proximity matrix D1
into a new proximity matrix D2
(see below), reduced in size by one row and one column because of the clustering of a
with b
.
Bold values in D2
correspond to the new distances, calculated by retaining the minimum distance
between each element of the first cluster (a,b)
and each of the remaining elements:
D2((a,b),c)=min(D1(a,c),D1(b,c))=min(21,30)=21
D2((a,b),d)=min(D1(a,d),D1(b,d))=min(31,34)=31
D2((a,b),e)=min(D1(a,e),D1(b,e))=min(23,21)=21
Italicized values in D2
are not affected by the matrix update as they correspond to distances
between elements not involved in the first cluster.
We now reiterate the three previous actions, starting from the new distance matrix D2
:
(a,b) | c | d | e | |
---|---|---|---|---|
(a,b) | 0 | 21 |
31 | 21 |
c | 21 |
0 | 28 | 39 |
d | 31 | 28 | 0 | 43 |
e | 21 |
39 | 43 | 0 |
Here, D2((a,b),c)=21
and D2((a,b),e)=21
are the lowest values of D2
, so we join cluster (a,b)
with element c
and with element e
.
We then proceed to update the D2
matrix into a new distance matrix D3
(see below), reduced in
size by two rows and two columns because of the clustering of (a,b)
with c
and with e
:
D3(((a,b),c,e),d)=min(D2((a,b),d),D2(c,d),D2(e,d))=min(31,28,43)=28
The final D3
matrix is:
((a,b),c,e) | d | |
---|---|---|
((a,b),c,e) | 0 | 28 |
d | 28 |
0 |
So we join clusters ((a,b),c,e)
and d
.
Let r
denote the (root)
node to which ((a,b),c,e)
and d
are now connected.
The dendrogram is now complete. It is ultrametric because all tips (a,b,c,e, and d)
are equidistant from r
:
δ(a,r) = δ(b,r) = δ(c,r) = δ(e,r) = δ(d,r) = 14
The dendrogram is therefore rooted by r
, its deepest node.