Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Wasserstein Distance Computation Bug #79

Conversation

MohammadYasinKarbasian
Copy link

This pull request addresses an issue related to the computation of Wasserstein distance in our codebase. Specifically, the UL matrix is removed due to a phenomenon where certain rows after the M-th row in the Hungarian algorithm, match with themselves which creates a distance greater than zero so it creates inaccuracies in the resultant distance calculation.

Changes Made
Removing the UL matrix results in eliminating erroneous values in the Wasserstein distance calculation.

Impact
This fix ensures more accurate computation of Wasserstein distance by eliminating erroneous additions to the distance metric, thereby improving the overall reliability of our codebase.

@catanzaromj
Copy link
Contributor

@MohammadYasinKarbasian What bug does this PR specifically fix? Also, it seems that your fix incorrectly computes the Wasserstein distance when points in the diagram contain an infinite death--see this test failing here.

@MohammadYasinKarbasian
Copy link
Author

If you run this code:

data = np.random.random((1000,2))
diagrams = ripser(data)['dgms']
data2 = np.random.random((1000,2))
rips2 = ripser(data2)
diagrams2 = rips2['dgms']
dgm1 = diagrams[1] # a list of M pairs
dgm2 = diagrams2[1] # a list of N pairs
matchlist, distance = wasserstein(dgm1, dgm2, matching=True)

You see that when we create the matrix UL, values on the diagonal are not zero. Then we place that matrix on D[M: N+M, 0:N], so we have those non-zero values on D[M: N+M, 0:N]. Then we run the Hungarian algorithm using this line "matchi, matchj = optimize.linear_sum_assignment(D)". If you look at matchj as a result, you can see that usually at least for one i from [M: N+M], matchj[i] = i - M. So it means that some values from dgm2 match with diagonal and add some extra distance on matchdist. so this problem is happening here and the computed matchdist is not correct.

In the code I have fixed, this issue does not happen because the cost of matching dgm2 data to all points is zero and I only calculate the distance for matching dgm1 points to dgm2 or diagonal. so for every matching on i from [M: N+M], matchj[i] will match to some point that creates no distance and all the distance is for i from [0: M]. This is the real distance because in Wasserstein distance we want to move the points from dgm1 to dgm2. right? So we should not add distance for points that are in dgm2.

Also, I have debugged the test error and figured out what happens when we have this error.
Because empty is [0, np.inf] it will be replaced with [0, 0]. So then when we compute distances we see that [0, 0] is closer to diagonal than [1, 2]. So it matches with the diagonal and creates a distance of 0. If you consider it is wrong and it should be matched with [1, 2] I can fix it and create another pull request but based on my assumption it is true because I think that each point in dgm1 should be matched to either a point in dgm2 or diagonal and there are no other constraints.

@catanzaromj
Copy link
Contributor

Thanks for the detailed reply. I still think I'm failing to understand what the problem is. Are you saying you believe the code is inefficient or that it is outright wrong? It might be helpful for me if you could give a concrete example (instead of using random persistence diagrams) of an incorrect or extraneous computation being done.

However, I think that your explanation is incorrect because it seems you are not treating the two persistence diagrams on equal footing. On your PR branch, I get the following

In [1]: import persim

In [2]: import numpy as np

In [3]: dgm1 = np.array([[5,6]])

In [4]: dgm2 = np.array([[6,7],[7,8]])

In [5]: persim.wasserstein(dgm1, dgm2)
Out[5]: 0.7071067811865483

In [6]: persim.wasserstein(dgm2, dgm1)
Out[6]: 1.4142135623730971

Your proposed solution is not symmetric, so it can't be a metric. Also, calculating by hand gives $3/2\cdot\sqrt{2}$ which I believe is correct (and agrees with the answer on master):

# Switch to master branch
In [7]: persim.wasserstein(dgm1, dgm2)
Out[7]: 2.1213203435596437

In [8]: np.sqrt(2)*3/2
Out[8]: 2.121320343559643

@MohammadYasinKarbasian
Copy link
Author

Yes, you're right, the code I've modified causes other problems, including that it's no longer symmetrical. I think the problem is due to the difference in our view of Wasserstein's Distance calculation. And with these assumptions that you have, there is probably no problem in the code.
Thanks for the answers

@ctralie
Copy link
Member

ctralie commented Mar 22, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants