Improve performance of reconnectOrphanedNodes #359

jkni · 2024-09-19T15:53:19Z

reconnectOrphanedNodes's ability to produce connected graphs was improved by #335. One change in particular causes a performance regression when reconnecting large graphs (1M+ nodes) with sizeable partitioning (e.g., only ~600K nodes are reachable from the entry point, meaning ~400K nodes need to be reconnected).

The change is as follows:

Exclude connectionTargets when searching for new connection points. If all of a node's existing neighbors have been considered and a search is performed, it is not unusual for many of the search results to be those same existing neighbors. This is particularly important if approach 2 is used, as connection targets will grow across loop iterations.

This works very well when a small number of nodes need to be reconnected, as the set of connection targets is small. When the set is large, as when reconnecting ~400K nodes, search performance is extremely poor.

This PR proposes several changes:

Limit neighbors used as connection targets to nodes connected at the start of the loop iteration. Testing the current reconnection behavior for graphs with large numbers of disconnected nodes showed that the neighbor reconnection has a tendency to connect to nodes already in the partition. This change improves the effectiveness of neighbor reconnection.
Instead of using excludeBits, use resumable searches and post-filter results based on connection targets. This removes the performance hit of connection targets. The threshold of 50 resumes was decided by a basic histogram analysis of the number of resumes needed. Past this point, we see diminishing returns.
Increase reconnect loops. The previous decrease from 5 to 3 is no longer appropriate if we want highly effective restoration of connectivity on large, partitioned graphs.
Introduce slf4j-api for basic debug logging. Bridging java.util.logging to other logging frameworks via slf4j has some performance concerns, so depending on slf4j-api directly seemed like the appropriate way to bridge logging to the broader ecosystem.
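The resume-and-post-filter idea in the second change can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ResumableSearch`, `findConnectionPoint`, `overRanking`, and the `isUsable` predicate are hypothetical names, and the predicate stands in for whatever connection-target check the real method applies.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch; none of these names are JVector's API. */
final class ResumeSketch {
    /** Stand-in for a search that can hand back further batches of results on demand. */
    interface ResumableSearch {
        /** Returns the next batch of node ids, nearest first; empty when exhausted. */
        List<Integer> nextBatch();
    }

    /**
     * Pulls batches until one candidate passes the post-filter or the resume budget runs out.
     * Returns the accepted node id, or -1 if none was found within the budget.
     */
    static int findConnectionPoint(ResumableSearch search, IntPredicate isUsable, int maxResumes) {
        // Attempt 0 is the initial search; attempts 1..maxResumes are resumes.
        for (int attempt = 0; attempt <= maxResumes; attempt++) {
            List<Integer> batch = search.nextBatch();
            if (batch.isEmpty()) return -1; // search exhausted
            for (int node : batch) {
                if (isUsable.test(node)) return node; // a single accepted result is enough
            }
        }
        return -1;
    }

    /** Toy search over a fixed ranking, handing out batchSize ids per call. */
    static ResumableSearch overRanking(List<Integer> ranking, int batchSize) {
        Iterator<Integer> it = ranking.iterator();
        return () -> {
            List<Integer> batch = new ArrayList<>();
            while (it.hasNext() && batch.size() < batchSize) batch.add(it.next());
            return batch;
        };
    }
}
```

In this sketch the search itself never sees the filter, so its cost does not depend on how large the set of connection targets has grown; the filter is applied only to the results that come back.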

…nnection targets to nodes that were reachable by the entry node at the start of the pass. Instead of using exclusion bits for connection targets, perform several rounds of resumes and post-filter for connectionTargets. Log basic debugging information when reconnecting orphaned nodes by introducing slf4j-api.

no need for resuming the search again add backlinking of new edges from search

jbellis · 2024-09-19T17:52:22Z

Looks good overall.

I'm not sure about switching from excludeBits to search/resume, especially since we end up with code that uses both. Here's my attempt to make excludeBits less painful.

jkni · 2024-09-19T20:08:47Z

I'm seeing poor reconnect behavior via searches with the added commits. I'll update once I figure out what's going on.

jbellis · 2024-09-19T20:50:46Z

Try without the backlinking?

jkni · 2024-09-20T19:39:13Z

It was a simple fix once I cleared my mind a bit -- no need to invert the bits on the search.

I ran some tests with the changes, as I like the simplicity of not having both resumes and include bits, but the revision negates most of the performance benefit. For the graphs I was testing with (approximately 1 million nodes, with hundreds of thousands of disconnected nodes), on my test machine, my original PR completes reconnection with 0 disconnected nodes in around 12 minutes. The revised PR is about an order of magnitude slower, at around 2 hours. I think this is because the searches using include bits walk a very significant portion of the graph, as they need beamWidth connected results. The resume approach only needs one connected result to avoid another resume, so most connections happen with a small number of resumes.
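A toy model of that hypothesis, assuming eligible results are sparse near the query: scan candidates in rank order and count how many must be examined before a given number pass the filter. `VisitCountSketch` and `examinedUntil` are made-up names, and this measures nothing about JVector itself.

```java
import java.util.function.IntPredicate;

/** Hypothetical illustration only; not a measurement of the real search. */
final class VisitCountSketch {
    /**
     * Counts how many candidates, scanned in rank order 0..candidateCount-1,
     * are examined before `needed` of them pass the filter.
     */
    static int examinedUntil(int candidateCount, IntPredicate accepted, int needed) {
        int found = 0;
        for (int i = 0; i < candidateCount; i++) {
            if (accepted.test(i) && ++found == needed) return i + 1;
        }
        return candidateCount; // ran out of candidates before finding enough
    }
}
```

With one eligible candidate in every ten, requiring a single accepted result means examining 10 candidates, while requiring 100 accepted results means examining 1000.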

jbellis · 2024-09-20T20:28:55Z

Okay, I added back the split search. It retains most of the simplicity I think.

If it's still slow then again I blame backlink. :)

jbellis · 2024-09-20T20:33:13Z

(I changed the 50 to 2*degree so that it doesn't break down in the corner case with larger degree, since I left out the excludeBits for neighbors as well.)

jkni · 2024-09-27T20:41:17Z

Pushed another commit to recover some performance, as the recent round still left things 50% slower than the original commit. But, it looks like we get to keep backlink!

The split between connected nodes and global connection targets appears to be important. When they're unified and initialized to the connected nodes from the first pass, later passes don't benefit from the connectivity gained in earlier ones, which slows them down meaningfully. I've also found that filtering candidates discovered via search using connectedNodes is harmful, as those candidates are already connected by virtue of being discovered via search.
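A rough sketch of that split, under loudly stated assumptions: connectedNodes is recomputed at the start of every pass, connectionTargets persists across passes, and a node already used as a target is skipped (one reading of the description above). The class, the method names, and the "lowest id first" candidate order are all hypothetical stand-ins for the real neighbor and search logic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical toy model of the pass structure; not the PR's implementation. */
final class ReconnectPassSketch {
    /** Nodes reachable from entry by following out-edges (breadth-first). */
    static Set<Integer> reachableFrom(int entry, Map<Integer, List<Integer>> edges) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        seen.add(entry);
        queue.add(entry);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            for (int next : edges.getOrDefault(node, List.of())) {
                if (seen.add(next)) queue.add(next);
            }
        }
        return seen;
    }

    /** Runs up to maxPasses passes over nodes 0..nodeCount-1; returns how many remain unreachable. */
    static int reconnect(int entry, int nodeCount, Map<Integer, List<Integer>> edges, int maxPasses) {
        Set<Integer> connectionTargets = new HashSet<>(); // global: survives across passes
        for (int pass = 0; pass < maxPasses; pass++) {
            // Snapshot for this pass only; nodes reconnected during the pass show up in the next one.
            Set<Integer> connectedNodes = reachableFrom(entry, edges);
            if (connectedNodes.size() == nodeCount) return 0;
            for (int orphan = 0; orphan < nodeCount; orphan++) {
                if (connectedNodes.contains(orphan)) continue;
                // Stand-in for neighbor/search selection: first connected node not yet used as a target.
                for (int candidate = 0; candidate < nodeCount; candidate++) {
                    if (connectedNodes.contains(candidate) && connectionTargets.add(candidate)) {
                        edges.computeIfAbsent(candidate, k -> new ArrayList<>()).add(orphan);
                        break;
                    }
                }
            }
        }
        return nodeCount - reachableFrom(entry, edges).size();
    }
}
```

In this toy, a six-node graph whose only edge is 0 -> 1 still has two unreachable nodes after one pass, because only nodes 0 and 1 are in the first snapshot. A second pass sees the enlarged snapshot and connects the rest, which is the benefit that is lost if the snapshot is frozen at the first pass.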

jbellis · 2024-09-27T21:50:03Z

okay, what's your preferred version at this point?

jkni · 2024-09-27T22:00:05Z

@jbellis -- tip of this branch as-is. I think your other changes are good, so this is somewhat simplified relative to the initial PR + speed of the original + backlinking.

jbellis · 2024-09-27T22:04:53Z

LGTM

nit: can we combine the two connectToClosestNeighbors by passing Bits.ALL as connectedNodes to the four-parameter version?
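The shape of that suggestion, sketched with invented parameter lists: the shorter overload delegates to the general one and passes an accept-everything filter. The local `Bits` interface below is a stand-in defined only for this example, and `connectToClosest` is not the real method signature.

```java
import java.util.List;

/** Hypothetical sketch of the overload merge; signatures are made up for illustration. */
final class OverloadSketch {
    /** Local stand-in for a bitset-like filter; not the library's type. */
    interface Bits {
        boolean get(int index);

        /** Accepts every index. */
        Bits ALL = index -> true;
    }

    /** General form: returns the first candidate passing both filters, or -1 if none does. */
    static int connectToClosest(List<Integer> candidates, Bits allowedTargets, Bits connectedNodes) {
        for (int c : candidates) {
            if (allowedTargets.get(c) && connectedNodes.get(c)) return c;
        }
        return -1;
    }

    /** Search-path form: no connectivity filter is wanted, so delegate with Bits.ALL. */
    static int connectToClosest(List<Integer> candidates, Bits allowedTargets) {
        return connectToClosest(candidates, allowedTargets, Bits.ALL);
    }
}
```

Delegating this way keeps the selection logic in one place, so a fix to the general form applies to both call paths.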

jkni · 2024-09-27T22:38:32Z

Good call on the nit. Pushed and merging when CI is clean.

jkni requested a review from jbellis September 19, 2024 15:53

jkni and others added 2 commits September 19, 2024 12:33

simplify connectionTargets, connectedNodes, and excludedBits

c47b907

no need for resuming the search again add backlinking of new edges from search

jbellis force-pushed the reconnect-improvements branch from 48ac258 to c47b907 Compare September 19, 2024 17:50

jbellis approved these changes Sep 19, 2024

preserve connectionTargets across passes

20447d0

No need to invert bits -- the argument is for results to include, which already matches connectionTargets

be26004

switch back to Joel's search + resume approach

908879c

jbellis and others added 4 commits September 20, 2024 15:35

we do need to exclude self

8268f07

r/m unused

ff3c44e

Re-split connected nodes/global connection targets. Don't filter neighbors found via search by the connected set

fd68612

Javadoc updates

c9674ee

DRY connectToClosestNeighbor by passing Bits.ALL for connectedNodes when connecting through search

2d79d3e

jkni merged commit 78ee760 into main Sep 27, 2024
6 checks passed

jkni mentioned this pull request Sep 30, 2024

CNDB-10870: Upgrade JVector dependency to 3.0.1 datastax/cassandra#1312

Merged
