Fix Wasserstein Distance Computation Bug #79

MohammadYasinKarbasian · 2024-03-18T16:23:48Z

This pull request addresses an issue related to the computation of Wasserstein distance in our codebase. Specifically, the UL matrix is removed due to a phenomenon where certain rows after the M-th row in the Hungarian algorithm, match with themselves which creates a distance greater than zero so it creates inaccuracies in the resultant distance calculation.

Changes Made
Removing the UL matrix results in eliminating erroneous values in the Wasserstein distance calculation.

Impact
This fix ensures more accurate computation of Wasserstein distance by eliminating erroneous additions to the distance metric, thereby improving the overall reliability of our codebase.

catanzaromj · 2024-03-18T21:51:29Z

@MohammadYasinKarbasian What bug does this PR specifically fix? Also, it seems that your fix incorrectly computes the Wasserstein distance when points in the diagram contain an infinite death--see this test failing here.

MohammadYasinKarbasian · 2024-03-19T03:16:06Z

If you run this code:

data = np.random.random((1000,2))
diagrams = ripser(data)['dgms']
data2 = np.random.random((1000,2))
rips2 = ripser(data2)
diagrams2 = rips2['dgms']
dgm1 = diagrams[1] # a list of M pairs
dgm2 = diagrams2[1] # a list of N pairs
matchlist, distance = wasserstein(dgm1, dgm2, matching=True)

You see that when we create the matrix UL, values on the diagonal are not zero. Then we place that matrix on D[M: N+M, 0:N], so we have those non-zero values on D[M: N+M, 0:N]. Then we run the Hungarian algorithm using this line "matchi, matchj = optimize.linear_sum_assignment(D)". If you look at matchj as a result, you can see that usually at least for one i from [M: N+M], matchj[i] = i - M. So it means that some values from dgm2 match with diagonal and add some extra distance on matchdist. so this problem is happening here and the computed matchdist is not correct.

In the code I have fixed, this issue does not happen because the cost of matching dgm2 data to all points is zero and I only calculate the distance for matching dgm1 points to dgm2 or diagonal. so for every matching on i from [M: N+M], matchj[i] will match to some point that creates no distance and all the distance is for i from [0: M]. This is the real distance because in Wasserstein distance we want to move the points from dgm1 to dgm2. right? So we should not add distance for points that are in dgm2.

Also, I have debugged the test error and figured out what happens when we have this error.
Because empty is [0, np.inf] it will be replaced with [0, 0]. So then when we compute distances we see that [0, 0] is closer to diagonal than [1, 2]. So it matches with the diagonal and creates a distance of 0. If you consider it is wrong and it should be matched with [1, 2] I can fix it and create another pull request but based on my assumption it is true because I think that each point in dgm1 should be matched to either a point in dgm2 or diagonal and there are no other constraints.

catanzaromj · 2024-03-21T02:27:04Z

Thanks for the detailed reply. I still think I'm failing to understand what the problem is. Are you saying you believe the code is inefficient or that it is outright wrong? It might be helpful for me if you could give a concrete example (instead of using random persistence diagrams) of an incorrect or extraneous computation being done.

However, I think that your explanation is incorrect because it seems you are not treating the two persistence diagrams on equal footing. On your PR branch, I get the following

In [1]: import persim

In [2]: import numpy as np

In [3]: dgm1 = np.array([[5,6]])

In [4]: dgm2 = np.array([[6,7],[7,8]])

In [5]: persim.wasserstein(dgm1, dgm2)
Out[5]: 0.7071067811865483

In [6]: persim.wasserstein(dgm2, dgm1)
Out[6]: 1.4142135623730971

Your proposed solution is not symmetric, so it can't be a metric. Also, calculating by hand gives $3/2\cdot\sqrt{2}$ which I believe is correct (and agrees with the answer on master):

# Switch to master branch
In [7]: persim.wasserstein(dgm1, dgm2)
Out[7]: 2.1213203435596437

In [8]: np.sqrt(2)*3/2
Out[8]: 2.121320343559643

MohammadYasinKarbasian · 2024-03-22T02:19:47Z

Yes, you're right, the code I've modified causes other problems, including that it's no longer symmetrical. I think the problem is due to the difference in our view of Wasserstein's Distance calculation. And with these assumptions that you have, there is probably no problem in the code.
Thanks for the answers

ctralie · 2024-03-22T03:08:17Z

Thank you very much for bringing this up though, and thank you Michael for your oversight. I haven't had the cycles to contribute the past week (pun intended), but I will keep this in my inbox and check it out later. Message ID: ***@***.***>

…

Fix Wasserstein Distance Computation Bug

67e6cde

MohammadYasinKarbasian closed this Mar 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Wasserstein Distance Computation Bug #79

Fix Wasserstein Distance Computation Bug #79

MohammadYasinKarbasian commented Mar 18, 2024

catanzaromj commented Mar 18, 2024

MohammadYasinKarbasian commented Mar 19, 2024

catanzaromj commented Mar 21, 2024

MohammadYasinKarbasian commented Mar 22, 2024

ctralie commented Mar 22, 2024 via email

Fix Wasserstein Distance Computation Bug #79

Fix Wasserstein Distance Computation Bug #79

Conversation

MohammadYasinKarbasian commented Mar 18, 2024

catanzaromj commented Mar 18, 2024

MohammadYasinKarbasian commented Mar 19, 2024

catanzaromj commented Mar 21, 2024

MohammadYasinKarbasian commented Mar 22, 2024

ctralie commented Mar 22, 2024 via email