Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of reconnectOrphanedNodes #359

Merged
merged 10 commits into from
Sep 27, 2024
Merged

Conversation

jkni
Copy link
Collaborator

@jkni jkni commented Sep 19, 2024

reconnectOrphanedNodes's ability to produce connected graphs was improved by #335. One change in particular produces a performance regression when reconnecting large (1M+ node graphs) with a sizeable partitioning (e.g., only ~600K nodes are reachable from the entry point, meaning 400k nodes need to be reconnected).

The change is as follows:

Exclude connectionTargets when searching for new connection points. If all of a node's existing neighbors have been considered and a search is performed, it is not unusual to see that the existing neighbors are many of the search results. This is particularly important if approach 2 is used, as connection targets will grow across loop iterations.

This works very well when a small number of nodes need to be reconnected, as the set of connection targets is small. When it's large, like when reconnecting 400k nodes, the performance of the searches is extremely poor.

This PR proposes several changes:

  1. Limit neighbors used as connection targets to nodes connected at the start of the loop iteration. Testing the current reconnection behavior for graphs with large numbers of disconnected nodes showed that the neighbor reconnection has a tendency to connect to nodes already in the partition. This change improves the effectiveness of neighbor reconnection.
  2. Instead of using excludeBits, use resumable searches and post-filter results based on connection targets. This removes the performance hit of connection targets. The threshold of 50 resumes was decided by a basic histogram analysis of the number of resumes needed. Past this point, we see diminishing returns.
  3. Increase reconnect loops. The previous decrease from 5 to 3 is no longer appropriate to see highly effective restoration of connectivity on large, partitioned graphs.
  4. Introduce slf4j-api for basic debug logging. java.util.logging bridged to other logging frameworks via slf4j has some performance concerns. This seemed like the appropriate way to bridge logging to the broader ecosystem.

@jkni jkni requested a review from jbellis September 19, 2024 15:53
jkni and others added 2 commits September 19, 2024 12:33
…nnection targets to nodes that were reachable by the entry node at the start of the pass. Instead of using exclusion bits for connection targets, perform several rounds of resumes and post-filter for connectionTargets. Log basic debugging information when reconnecting orphaned nodes by introducing slf4j-api.
no need for resuming the search again

add backlinking of new edges from search
@jbellis jbellis force-pushed the reconnect-improvements branch from 48ac258 to c47b907 Compare September 19, 2024 17:50
@jbellis
Copy link
Owner

jbellis commented Sep 19, 2024

Looks good overall.

I'm not sure about switching from excludebits to search/resume, especially since we end up with code that uses both. Here's my attempt to make excludebits less painful.

@jkni
Copy link
Collaborator Author

jkni commented Sep 19, 2024

I'm seeing poor reconnect behavior via searches with the added commits. I'll update once I figure out what's going on.

@jbellis
Copy link
Owner

jbellis commented Sep 19, 2024

Try without the backlinking?

@jkni
Copy link
Collaborator Author

jkni commented Sep 20, 2024

It was a simple fix once I cleared my mind a bit -- no need to invert the bits on the search.

I ran some tests with the changes, as I like the simplicity of not having both resumes/bits to include, but it negates most of the performance benefit. For the graphs I was testing with (approximately 1 million nodes, with hundreds of thousands of disconnected nodes), on my test machine, my original PR can complete reconnection with 0 disconnected nodes in around 12 minutes. With the revised PR, it is about an order of magnitude slower at around 2 hours. I think this is because the searches using include bits walk a very significant portion of the graph, as they need beamWidth connected results. The resume approach only needs one connected result to not have to resume, so most connections happen with a small number of resumes.

@jbellis
Copy link
Owner

jbellis commented Sep 20, 2024

Okay, I added back the split search. It retains most of the simplicity I think.

If it's still slow then again I blame backlink. :)

@jbellis
Copy link
Owner

jbellis commented Sep 20, 2024

(I changed the 50 to 2*degree so that it doesn't break down in the corner case with larger degree, since I left out the excludebits for neighbors as well.)

@jkni
Copy link
Collaborator Author

jkni commented Sep 27, 2024

Pushed another commit to recover some performance, as the recent round still left things 50% slower than the original commit. But, it looks like we get to keep backlink!

The split between connected nodes/global connection targets appears to be important. When they're unified, initialized to the first pass of connected nodes, we don't benefit from improved connectivity on future passes, slowing down their performance meaningfully. I've also found that filtering candidates discovered via search using connectedNodes is harmful, as we already have connectivity by virtue of being discovered via search.

@jbellis
Copy link
Owner

jbellis commented Sep 27, 2024

okay, what's your preferred version at this point?

@jkni
Copy link
Collaborator Author

jkni commented Sep 27, 2024

@jbellis -- tip of this branch as-is. I think your other changes are good, so this is somewhat simplified relative to the initial PR + speed of the original + backlinking.

@jbellis
Copy link
Owner

jbellis commented Sep 27, 2024

LGTM

nit: can we combine the two connectToClosestNeighbors by passing Bits.ALL as connectedNodes to the four-parameter version?

@jkni
Copy link
Collaborator Author

jkni commented Sep 27, 2024

Good call on the nit. Pushed and merging when CI is clean.

@jkni jkni merged commit 78ee760 into main Sep 27, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants