
Improve multi-CTA algorithm #492

Open · wants to merge 10 commits into base: branch-25.02

Conversation

anaruse (Contributor) commented Nov 25, 2024

It has been reported that when the number of search results is large (for example, 100), using the multi-CTA algorithm can cause a decrease in recall. This PR is intended to alleviate this low-recall issue.

Closes #208

@anaruse anaruse requested a review from a team as a code owner November 25, 2024 10:46

copy-pr-bot bot commented Nov 25, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Nov 25, 2024
@tfeher tfeher added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Nov 25, 2024
cjnolet (Member) commented Dec 3, 2024

/ok to test

tfeher (Contributor) left a comment

Thanks @anaruse for this PR; it is great to see these improvements. Overall the changes look good, and the benchmarks that you shared offline look very encouraging. I just have a few questions below.

tfeher (Contributor) commented Dec 5, 2024

@anaruse there are some unsigned commits that block CI from testing the changes automatically. To fix this, could you rebase the PR?

cjnolet (Member) commented Dec 5, 2024

/ok to test

…when the number of results is large

Fix some issues

Fix lower recall issue with new multi-cta algo

Removing redundant code and changing some parameters

Update cpp/src/neighbors/detail/cagra/search_plan.cuh

Co-authored-by: Tamas Bela Feher <tfeher@nvidia.com>

Remove an unnecessary line and satisfy clang-format
@anaruse anaruse force-pushed the improved_multi_cta_algo branch from 3dce160 to 6223fd2 Compare December 5, 2024 06:58
tfeher (Contributor) left a comment

Thanks Akira for the updates, the PR looks good to me.

cjnolet (Member) commented Dec 5, 2024

/merge

cjnolet (Member) commented Dec 5, 2024

/ok to test

tfeher (Contributor) commented Dec 5, 2024

/ok to test

cjnolet (Member) commented Dec 5, 2024

/merge

cjnolet (Member) commented Dec 5, 2024

/ok to test

cjnolet (Member) commented Dec 5, 2024

/merge

cjnolet (Member) commented Dec 5, 2024

We are on the brink of missing code freeze for this PR. Please, anyone reading this: don't click the "update" button. It inserts a merge commit, which reruns CI in its entirety, and that is not needed to merge the PR. We can re-run individual flaky tests that fail without rerunning the entire CI (the former takes minutes; the latter can take several hours).

cjnolet (Member) commented Dec 6, 2024

@anaruse @tfeher CI seems to be running successfully for other PRs, but the gtests seem to be consistently timing out for this PR. As far as I can tell, there are no updates to any of the tests in this PR, yet the timeouts don't seem flaky; they seem isolated to these changes, somehow.

We are pushing back code freeze by 1 day. Do you guys think we can still make this in time for 24.12?

@achirkin achirkin changed the base branch from branch-24.12 to branch-25.02 December 9, 2024 09:55
Handle the case when the search result contains invalid indices when building the updated graph in add_nodes.
For debugging purposes, fail if any invalid indices are found; in the future, we can replace RAFT_FAIL with RAFT_LOG_WARN to make the add_nodes routine more robust.
achirkin (Contributor) commented Dec 9, 2024

I took the liberty of adding a workaround to add_nodes, which handles the case when CAGRA search doesn't return enough valid indices. With this, the tests should fail with a descriptive message instead of the segfault.
When we find the source of the bug, we can relax the RAFT_FAIL to RAFT_LOG_WARN.

achirkin (Contributor) commented Dec 9, 2024

/ok to test

anaruse (Contributor, Author) commented Dec 10, 2024

The cause has not yet been identified, but it seems that this problem only occurs when the dimensionality of the dataset is 1.

anaruse (Contributor, Author) commented Dec 23, 2024

The problem of invalid results being included in the search results when the dataset dimensionality is small has been fixed for the new multi-CTA algorithm.
The checks that @achirkin added to add_nodes revealed that even with the single-CTA or multi-kernel algorithms, if the dataset dimensionality is small (for example, 1), the search results may include invalid entries. Since add_nodes can still execute without error even when some of the search results are invalid, I have made the necessary changes to add_nodes.

anaruse (Contributor, Author) commented Dec 23, 2024

If the search results for add_nodes include invalid entries, a message is output using RAFT_LOG_WARN; if there are too many invalid entries for add_nodes to execute without error, it is terminated using RAFT_FAIL.

anaruse (Contributor, Author) commented Dec 23, 2024

@enp1s0, could you please check the changes I made to add_nodes?

enp1s0 (Member) left a comment

@anaruse Thank you for adding invalid node handling to the node addition function. I have a minor comment, but the changes look good to me.

@@ -136,9 +153,13 @@ void add_node_core(
for (std::uint32_t i = 0; i < base_degree; i++) {
std::uint32_t detourable_node_count = 0;
const auto a_id = host_neighbor_indices(vec_i, i);
if (a_id >= idx.size()) {
detourable_node_count_list[i] = std::make_pair(a_id, base_degree + 1);
enp1s0 (Member)

It would be good to write a comment about this process, e.g., in what case this conditional branch becomes true and why the detourable node count is set to base_degree + 1.

anaruse (Contributor, Author)

Thanks, I've added the comments.

achirkin (Contributor) left a comment

I think the error in add_nodes was not observed prior to this PR (without the RAFT_FAIL safeguard, the tests segfault in this PR, but not in the main branch). This leads me to believe that the problem in the search may have been either introduced or just surfaced by the new changes.
In any case, I don't see a reason why the CAGRA search algorithms have to return invalid values under the small-dimensionality condition, and thus I believe we should fix the search rather than throwing the error at the use site. Were you able to pinpoint why the search returns invalid values in the small-dimensionality cases?

Comment on lines 136 to 142
RAFT_LOG_WARN(
"Invalid edges found in search results "
"(vec_i:%lu, invalid_edges:%lu, degree:%lu, base_degree:%lu)",
(uint64_t)vec_i,
(uint64_t)invalid_edges,
(uint64_t)degree,
(uint64_t)base_degree);
achirkin (Contributor)

I believe it's good to have this check here, but I'm afraid it could be a bit too much when printed in a loop. Could you please add some sort of internal counter to prevent it from spitting out too many lines (say, no more than ten messages)?

anaruse (Contributor, Author)

Thanks. It is undesirable to output lots of warnings. I've changed the code to limit the output to 3 warnings.

anaruse (Contributor, Author) commented Dec 23, 2024

It is not easy to explain why the new multi-CTA algo sometimes produces invalid search results, but it is mainly because it allows multiple CTAs to use the same node as a search path, and allows multiple CTAs to hold the same node in their local itopk buffers. If the number of dimensions is small, the paths to reach the query are rather limited, so it is likely that many nodes will sit in the local itopk buffers of multiple CTAs at the same time. Even then, duplications are eventually removed using hash tables, but the result of that de-duplication is that the specified number of valid results may not be produced.

So far, we know that the problem only occurs when the dimensionality of the dataset is 1, and a one-dimensional dataset is not really a target for vector search in the first place. Although not ideal, one way to deal with this is to stop testing on datasets with dimension 1.

@anaruse anaruse force-pushed the improved_multi_cta_algo branch from 5a3519a to b61126a Compare December 25, 2024 09:58
Labels
cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[BUG] Decreasing recall when increasing threads in ANN benchmark
5 participants