Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add cosine similarity query #3464

Conversation

fredcarle
Copy link
Collaborator

Relevant issue(s)

Resolves #3349

Description

This PR adds the possibility to calculate the cosine similarity between a vector field and a given vector.

To achieve this we added the _similarity system field which take a target field (part of the parent object) and vector as parameter.

query {
  User{
    _similarity(pointsList: {vector: [1, 2, 0]})
  }
}

Note that the added code to mapper and planner is more of a "bolt on" addition given the current state of that part of the code base. A refactor is expected in the future.

Future work will allow giving a content parameter instead of the vector if the target field has embedding generation configured. This will enable out-of-the-box RAG queries.

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

Added cosine similarity to integration testing

Specify the platform(s) on which this was tested:

  • MacOS

@fredcarle fredcarle added area/query Related to the query component area/parser Related to the parser components area/mapper Related to the mapper components area/planner Related to the planner system labels Feb 19, 2025
@fredcarle fredcarle added this to the DefraDB v0.16 milestone Feb 19, 2025
@fredcarle fredcarle requested a review from a team February 19, 2025 05:28
@fredcarle fredcarle self-assigned this Feb 19, 2025
Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, easy to read considering the planner packed stuff we went through. Approving assuming todo's etc are addressed before merge.

todo: Please make sure you add an introspection test(s) to make sure the new input type shows up in the correct shape in the correct place without breaking the GQL spec.

Copy link
Contributor

@islamaliev islamaliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good (ignoring the hacky part).

I have mostly doc suggestions and few test requests.

Copy link
Contributor

@islamaliev islamaliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks for quick adjustments

@fredcarle fredcarle force-pushed the fredcarle/feat/3349-vector-similarity branch from 11f4374 to 69feaa2 Compare February 19, 2025 17:08
Copy link

codecov bot commented Feb 19, 2025

Codecov Report

Attention: Patch coverage is 75.29880% with 62 lines in your changes missing coverage. Please review.

Project coverage is 78.44%. Comparing base (7107187) to head (3f75395).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
internal/planner/similarity.go 65.35% 41 Missing and 3 partials ⚠️
internal/request/graphql/schema/generate.go 72.73% 8 Missing and 4 partials ⚠️
internal/planner/select.go 85.71% 3 Missing ⚠️
internal/request/graphql/parser/request.go 50.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3464      +/-   ##
===========================================
- Coverage    78.51%   78.44%   -0.07%     
===========================================
  Files          396      397       +1     
  Lines        37334    37571     +237     
===========================================
+ Hits         29310    29470     +160     
- Misses        6342     6405      +63     
- Partials      1682     1696      +14     
Flag Coverage Δ
all-tests 78.44% <75.30%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
internal/planner/errors.go 57.14% <100.00%> (+32.14%) ⬆️
internal/planner/mapper/mapper.go 87.38% <100.00%> (+0.21%) ⬆️
internal/planner/planner.go 81.92% <100.00%> (+0.35%) ⬆️
internal/planner/scan.go 89.04% <100.00%> (+0.10%) ⬆️
internal/request/graphql/parser/query.go 89.47% <100.00%> (+0.90%) ⬆️
internal/request/graphql/schema/types/types.go 100.00% <ø> (ø)
internal/planner/select.go 85.38% <85.71%> (+0.31%) ⬆️
internal/request/graphql/parser/request.go 68.31% <50.00%> (-0.62%) ⬇️
internal/request/graphql/schema/generate.go 86.44% <72.73%> (-0.57%) ⬇️
internal/planner/similarity.go 65.35% <65.35%> (ø)

... and 11 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7107187...3f75395. Read the comment docs.

@fredcarle fredcarle merged commit 4d90e6e into sourcenetwork:develop Feb 19, 2025
43 of 44 checks passed
@fredcarle fredcarle deleted the fredcarle/feat/3349-vector-similarity branch February 19, 2025 22:13
@shahzadlone
Copy link
Member

shahzadlone commented Feb 20, 2025

Bug bash + PR Review: I am leaving/ moving this back to the bug bash Todo because it can probably benefit from being tested by one more person, my testing might have missed some other edge cases:

  1. However I noticed a minor bug that was introduced in the debug explain testing framework: Cosine similarity debug testing framework explain bug #3471

Should be very easy to fix and add a debug explain test for it, explained it all in the issue.

  1. There is no compile time check to ensure the new introduced node is an explainableNode (specially when the explainable features were added), I will resolve this in: Cosine similarity debug testing framework explain bug #3471

  2. There is no tests for simple explain with similarity when new attributes were added (vector and target). Made an issue here: Add simple explain cosine similarity tests #3472

@islamaliev
Copy link
Contributor

Bug bash result:
After manually updating the embedding field for one of the document I got this message:
"source and vector must be of the same length. Source: 3, Vector: 768"
This is expected, but if there are many document, it might be hard to find one, especially considering that we don't have a way of querying docs with array of certain length. So I would suggest to also return the doc's docID.

After deleting the document querying works as normal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/mapper Related to the mapper components area/parser Related to the parser components area/planner Related to the planner system area/query Related to the query component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compute similarity between two vectors
4 participants