MATLAB functions designed to construct dissimilarity matrices using a variety of distance metric functions. It provides a comprehensive toolkit for analyzing and comparing data sets through different distance measures.

aungpyaeap/distfun-matlab

Distance Metrics Toolkit (DISTFUN-MATLAB): Distance metric functions for numerical and categorical data dissimilarity

This repository contains MATLAB functions designed to construct dissimilarity matrices using a variety of distance metric functions. It provides a comprehensive toolkit for analyzing and comparing data sets through different distance measures.

Definitions

  • A dataset is denoted by $\mathfrak{X}^{n \times m}$, where $n$ is the number of data points (rows) and $m$ is the number of features (columns).
  • A data point is denoted by $x_i \in \mathbb{R}^m$, where each $x_i$ is a vector of $m$ features.
  • A distance metric $d: \mathbb{R}^m \times \mathbb{R}^m \rightarrow \mathbb{R}_+$ is a function that quantifies the degree of separation (distance) between a pair of data points.
  • A similarity metric $s: \mathbb{R}^m \times \mathbb{R}^m \rightarrow \mathbb{R}$ is a function that quantifies the degree of likeness (similarity) between a pair of data points.
  • The distance matrix $D$ is an $n \times n$ matrix where each entry $D_{ij} = d(x_i, x_j)$ is the distance between data points $x_i$ and $x_j$. Over all pairs, $D = [D_{ij}] \in \mathbb{R}^{n \times n}$ is a symmetric matrix of distances.
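
The definitions above can be illustrated with a minimal sketch that fills a distance matrix by hand, using only base MATLAB and the Euclidean distance. The toy dataset `X` is an assumption made for illustration, and the loop is not claimed to be how `compdist` is implemented.

```matlab
% Toy dataset: n = 4 data points (rows), m = 2 features (columns).
X = [0 0; 3 4; 6 8; 0 1];
n = size(X, 1);

% Fill D(i,j) with the Euclidean distance between rows i and j.
D = zeros(n, n);
for i = 1:n
    for j = i+1:n
        D(i, j) = sqrt(sum((X(i, :) - X(j, :)).^2));
        D(j, i) = D(i, j);   % symmetry: d(x_i, x_j) = d(x_j, x_i)
    end
end
disp(D);
```

The diagonal stays at zero because each point has zero distance to itself, and only the upper triangle is computed because the matrix is symmetric.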

For any distance metric, the following conditions must be satisfied for any three data points $x_i, x_j, x_k$ [6][7].

  • Identity: $d(x_i, x_j) = 0 \Leftrightarrow x_i = x_j$
  • Symmetry: $d(x_i, x_j) = d(x_j, x_i)$
  • Triangle inequality: $d(x_i, x_j) \leq d(x_i, x_k) + d(x_k, x_j)$
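
As a small numerical check of these conditions, the sketch below evaluates them for three collinear points under the Euclidean and the squared Euclidean distance. The anonymous functions `euc` and `sqeuc` are hypothetical helpers written for this example, not functions from the repository.

```matlab
% Three collinear points on the real line (m = 1).
xi = 0; xk = 1; xj = 2;

euc   = @(a, b) sqrt(sum((a - b).^2));   % Euclidean distance
sqeuc = @(a, b) sum((a - b).^2);         % squared Euclidean distance

% Identity and symmetry hold for both functions on these points.
assert(euc(xi, xi) == 0 && sqeuc(xi, xi) == 0);
assert(euc(xi, xj) == euc(xj, xi) && sqeuc(xi, xj) == sqeuc(xj, xi));

% Triangle inequality: d(xi, xj) <= d(xi, xk) + d(xk, xj)
eucHolds   = euc(xi, xj)   <= euc(xi, xk)   + euc(xk, xj);     % 2 <= 1 + 1
sqeucHolds = sqeuc(xi, xj) <= sqeuc(xi, xk) + sqeuc(xk, xj);   % 4 <= 1 + 1
fprintf('euclidean: %d, sqeuclidean: %d\n', double(eucHolds), double(sqeucHolds));
```

The Euclidean distance passes the triangle inequality here, while the squared Euclidean distance fails it, since $4 > 1 + 1$.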

Example of use

For a numerical dataset

points = rand(10, 2); % Example data: 10 data points with 2 features
% Using the predefined function
D = compdist(points, "euclidean");
disp(D);

% Using pdist2
D = pdist2(points, points, @(XI, XJ) distfun(XI, XJ, "euclidean"));
disp(D);
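
For readers who want to pass their own function to `pdist2`, the sketch below shows the calling convention that `pdist2` documents for custom distance handles: the first argument is a single observation (a 1-by-m row), the second is a block of observations (k-by-m), and the result is a k-by-1 column. The handle `cheb` is a hypothetical example, not part of this repository. It relies on implicit expansion (MATLAB R2016b or later), and `pdist2` requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical custom handle (not part of this repository).
% ZI: 1-by-m single observation; ZJ: k-by-m block of observations.
% Returns a k-by-1 column of Chebyshev distances.
cheb = @(ZI, ZJ) max(abs(ZJ - ZI), [], 2);

points = [0 0; 1 3; 4 1];
D = pdist2(points, points, cheb);
disp(D);
```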

For a categorical dataset

T = readtable('sample.csv', VariableNamingRule='preserve');
X = T{:,:}; % Convert table to array
dname = 'hamming'; % Choose distance name
D = compdist(X, dname);
disp(D);

Distance metrics included in repository

euclidean	- Euclidean distance.
sqeuclidean - Squared Euclidean distance. (Does not satisfy triangle inequality.)
cityblock	- City block distance.
chebyshev	- Chebyshev distance.
canberra	- Canberra distance.
cosine		- Cosine distance. (Does not satisfy triangle inequality.)
corr		- Correlation distance.
clark		- Clark distance.
soergel		- Soergel distance.
hamming		- Hamming distance.
jaccard		- Jaccard distance.
dice		- Dice distance.

Distance metric formulas

  • $\text{Euclidean distance: }d(x_i,x_j) = \Vert x_i - x_j\Vert_2 = \sqrt{\sum_{k=1}^m (x_{ik} - x_{jk})^2}$

  • $\text{Squared Euclidean distance: }d(x_i,x_j) = \Vert x_i - x_j\Vert_2^2 = \sum_{k=1}^m (x_{ik} - x_{jk})^2$

  • $\text{City block distance: }d(x_i,x_j) = \Vert x_i - x_j\Vert_1 = \sum_{k=1}^m |x_{ik} - x_{jk}|$

  • $\text{Chebyshev distance: }d(x_i,x_j) = \Vert x_i - x_j\Vert_\infty = \max_{1 \leq k \leq m} |x_{ik} - x_{jk}|$

  • $\text{Canberra distance: }d(x_i,x_j) = \sum_{k=1}^{m} \frac{|x_{ik} - x_{jk}|}{|x_{ik}| + |x_{jk}|}$

  • $\text{Cosine distance: }d(x_i,x_j) = 1 - \frac{x_i \cdot x_j}{\Vert x_i\Vert \Vert x_j\Vert} = 1 - \frac{\sum_{k=1}^m x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^m x_{ik}^2} \sqrt{\sum_{k=1}^m x_{jk}^2}}$

  • $\text{Correlation distance: }d(x_i, x_j) = 1 - \frac{\sum_{k=1}^{m} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m} (x_{ik} - \bar{x}_i)^2} \sqrt{\sum_{k=1}^{m} (x_{jk} - \bar{x}_j)^2}}$, where $\bar{x}_i = \frac{1}{m}\sum_{k=1}^{m} x_{ik}$ is the mean of the features of $x_i$.

  • $\text{Clark distance: }d(x_i,x_j) = \sqrt{\sum_{k=1}^{m} \left(\frac{|x_{ik} - x_{jk}|}{x_{ik} + x_{jk}}\right)^2}$

  • $\text{Soergel distance: }d(x_i,x_j) = \frac{\sum_{k=1}^{m} |x_{ik} - x_{jk}|}{\sum_{k=1}^{m} \max(x_{ik},x_{jk})}$

  • $\text{Hamming distance: }d(x_i,x_j) = \sum_{k=1}^m \mathbb{I}(x_{ik} \neq x_{jk})$

  • $\text{Jaccard distance: }d(x_i,x_j) = 1 - \frac{|x_i \cap x_j|}{|x_i \cup x_j|} = 1 - \frac{\sum_{k=1}^{m} x_{ik} \cdot x_{jk}}{\sum_{k=1}^{m} (x_{ik} + x_{jk}) - \sum_{k=1}^{m} x_{ik} \cdot x_{jk}}$, where the sum form assumes binary vectors $x_i, x_j \in \{0,1\}^m$.

  • $\text{Dice distance: }d(x_i,x_j) = 1 - \frac{2|x_i \cap x_j|}{|x_i| + |x_j|} = 1 - \frac{2 \cdot \sum_{k=1}^{m} x_{ik} \cdot x_{jk}}{\sum_{k=1}^{m} x_{ik} + \sum_{k=1}^{m} x_{jk}}$, where the sum form assumes binary vectors $x_i, x_j \in \{0,1\}^m$.
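
To make a few of the formulas above concrete, the following sketch evaluates them directly in base MATLAB on small hand-picked vectors. The vectors and variable names are assumptions chosen for illustration, and the expressions transcribe the formulas rather than reproduce the repository's own code. The Jaccard and Dice lines assume binary (0/1) vectors.

```matlab
% Two non-negative numerical vectors (m = 3).
xi = [1 2 3];
xj = [2 2 5];

cityblock = sum(abs(xi - xj));                          % 1 + 0 + 2 = 3
chebyshev = max(abs(xi - xj));                          % max(1, 0, 2) = 2
canberra  = sum(abs(xi - xj) ./ (abs(xi) + abs(xj)));   % 1/3 + 0 + 2/8
soergel   = sum(abs(xi - xj)) / sum(max(xi, xj));       % 3 / (2 + 2 + 5)

% Two binary vectors (m = 4), as assumed by the sum forms of Jaccard and Dice.
bi = [1 1 0 0];
bj = [1 0 1 0];

hamming = sum(bi ~= bj);                                % 2 positions differ
common  = sum(bi .* bj);                                % size of the intersection = 1
jaccard = 1 - common / (sum(bi + bj) - common);         % 1 - 1/3
dice    = 1 - 2 * common / (sum(bi) + sum(bj));         % 1 - 2/4

fprintf('cityblock=%g chebyshev=%g canberra=%.4f soergel=%.4f\n', ...
    cityblock, chebyshev, canberra, soergel);
fprintf('hamming=%g jaccard=%.4f dice=%.4f\n', hamming, jaccard, dice);
```

Note that the Canberra term is undefined when both coordinates are zero (0/0), so real data may need that case handled explicitly.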

Distance metric data type and ranges

| Distance Name | Data Type | Range | Origin | Ref. |
|---|---|---|---|---|
| Euclidean distance | Numerical | $[0, +\infty\rangle$ | $d$ | [2][7] |
| Squared Euclidean distance | Numerical | $[0, +\infty\rangle$ | $d$ | [2][7] |
| City block distance | Numerical | $[0, +\infty\rangle$ | $d$ | [2][7] |
| Chebyshev distance | Numerical | $[0, +\infty\rangle$ | $d$ | [2][7] |
| Canberra distance | Numerical | $[0, +\infty\rangle$ | $d$ | [2] |
| Cosine distance | Numerical | $[0, +2]$ | $s$ | [7] |
| Correlation distance | Numerical | $[0, +2]$ | $d$ | [7] |
| Clark distance | Numerical | $[0, +\infty\rangle$ | $d$ | [1] |
| Soergel distance | Numerical | $[0, +1]$ | $d$ | [2] |
| Hamming distance | Categorical | $[0, +\infty\rangle$ | $d$ | [4] |
| Jaccard distance | Categorical | $[0, +1]$ | $s$ | [5] |
| Dice distance | Categorical | $[0, +1]$ | $s$ | [3] |
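
The $[0, +2]$ range listed for cosine distance can be seen from three hand-picked cases: vectors pointing the same way, orthogonal vectors, and vectors pointing in opposite directions. The helper `cosdist` below is a hypothetical one-line transcription of the cosine distance formula, not the repository's implementation.

```matlab
% Hypothetical helper: cosine distance between two row vectors.
cosdist = @(a, b) 1 - dot(a, b) / (norm(a) * norm(b));

x = [1 2];
dSame     = cosdist(x, 3 * x);     % same direction
dOrth     = cosdist(x, [-2 1]);    % orthogonal (dot product is 0)
dOpposite = cosdist(x, -x);        % opposite direction
fprintf('%.4f %.4f %.4f\n', dSame, dOrth, dOpposite);
```

Up to floating-point rounding, the three values are 0, 1 and 2, which are the lower end, midpoint and upper end of the range.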

Transform distance to similarity

For bounded distances like cosine, Jaccard, and Dice, the relationship can be expressed as $d(x_i, x_j) = 1 - s(x_i, x_j)$, allowing for straightforward transformation in both directions [8].

Example usage

X = rand(10,2);
D = compdist(X, 'cosine');
S = 1 - D;
disp(S);

For unbounded distance metrics, the transformation to similarity is typically given by $s(x_i,x_j):= \exp(-\frac{\Vert x_i - x_j \Vert^2}{2\sigma^2})$ where $\sigma > 0$ is a parameter [9][10].

Example usage

X = rand(10,2);
D = compdist(X, 'euclidean');
sigma = 1; % Kernel width parameter (sigma > 0); tune for your data
S = exp(-D.^2 ./ (2 * sigma^2));
disp(S);

Citation

@INPROCEEDINGS{10730392,
  author={Pyae, Aung and Low, Yeh-Ching and Chua, Hui Na},
  booktitle={2024 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET)}, 
  title={A Combined Distance Metric Approach with Weight Adjustment For Improving Mixed Data Clustering Quality}, 
  year={2024},
  volume={},
  number={},
  pages={183-188},
  keywords={Measurement;Refining;Clustering algorithms;Learning (artificial intelligence);Complexity theory;Optimization;Distance Metrics;Mixed Data;Hierarchical Clustering;Unsupervised Learning},
  doi={10.1109/IICAIET62352.2024.10730392}
}

References

  1. Abu Alfeilat, H. A., Hassanat, A. B., Lasassmeh, O., Tarawneh, A. S., Alhasanat, M. B., Eyal Salman, H. S., & Prasath, V. S. (2019). Effects of distance measure choice on k-nearest neighbor classifier performance: a review. Big data, 7(4), 221-248.
  2. Cha, S. H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300-307.
  3. Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3), 297-302.
  4. Hamming, R. W. (1950). Error detecting and error correcting codes. The Bell system technical journal, 29(2), 147-160.
  5. Murphy, A. H. (1996). The Finley affair: A signal event in the history of forecast verification. Weather and forecasting, 11(1), 3-20.
  6. Scitovski, R., Sabo, K., Martínez-Álvarez, F., & Ungar, Š. (2021). Cluster analysis and applications (2021st ed.). Cham, Switzerland: Springer Nature.
  7. Wierzchon, S. T., & Klopotek, M. (2018). Modern Algorithms of Cluster Analysis (1st ed.). Cham, Switzerland: Springer International Publishing.
  8. Miyamoto, S. (2022). Theory of agglomerative hierarchical clustering (2022nd ed.). Singapore, Singapore: Springer.
  9. Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17, 395-416.
  10. The MathWorks, Inc. (2025). Statistics and machine learning toolbox (Version 24.2.0 R2024b). The MathWorks, Inc.
