-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 424: Add support for handling of vector fields. #489
Issue 424: Add support for handling of vector fields. #489
Conversation
0dc1d37
to
44d3b19
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contributing this @dgoldenberg-ias, this is great!
There's just a compilation error at the moment as it looks like you've missed adding the enum variant to FieldType
Great catch @Xtansia, thank you! :) I was going between a couple of my branches and this slipped through the cracks. Should be good now. |
Hi @Xtansia, I was wondering if the PR is good to go now. Let me know pls, thanks. |
Sorry for the delay! Changes look good, would you be able to add a short test like these geo_point ones here and here? |
Hi @Xtansia, sure I can do that. The tests look the same but I get it, we want coverage for both spark sql 20 and 30.. |
Hi @Xtansia, I've added two unit tests, as discussed. However, I've started experiencing build problems. Could you lend a hand? I can email you more details but basically below are the two main issues I'm seeing: I run I get the below error:
I temporarily commented out these lines in
Then all the integration tests fail, like this:
Then I can skip the integration tests OK (using I have:
It seems setting the default Java to 8 or 11 should be OK (?) Also, is the PR now all set and can it be merged? |
Hi @Xtansia, Any word on the build failure? Looking at https://github.com/opensearch-project/opensearch-hadoop/actions/runs/10022839453/job/27729806497#step:10:437,
Also would appreciate some help on the local build. Thanks. |
@dgoldenberg-ias The CI jobs are failing as the default OpenSearch cluster used for integration testing does not have the knn plugin installed. I am looking into the best way to handle this. I am currently unable to reproduce your issues fetching dependencies locally, it might have been a transient issue with our snapshots maven repository. Additionally it looks like some of your commits are missing the final |
852c64c
to
ea8d5a1
Compare
Hi @Xtansia Thanks for looking into the install of the knn plugin. I have added the necessary Signed-off-by's.
Yes, this seems intermittent and it does seem to be something to do with the snapshots maven repo. |
Hi @Xtansia, I was just wondering if you've had a chance to take a look at the install of the knn plugin. Anything I could perhaps help with? Let me know, thanks. |
Hi @Xtansia, I was just wondering if you've had a chance to take a look at the install of the knn plugin. Anything I could perhaps help with? Let me know, thanks. It would be great to wrap this up :) |
Hi @dgoldenberg-ias, very sorry for the delay, I've been swamped lately. I had a quick look into getting the k-nn plugin setup in the integ tests and doesn't look like a super simple job. For now I think easiest to get this merged we can just mark the tests as |
Hi @Xtansia, thanks for getting back to me! My other question for you was, will this PR become a part of the next release; do you know what the expected/planned schedule for it is? |
I think the two tests added in this PR will need it to pass CI. Could you please rebase onto the latest main and re-push your branch to re-trigger the CI runs to confirm first? I'll then open an issue regarding fixing the integ tests/k-nn tests. Yes it would become part of the next release, we don't have a fixed release schedule, but I could look into kicking one off in the next week. |
Hi @Xtansia,
A rebase did not rake anything in. I just added a couple of comments in this latest commit. Could you let me know how the CI went? and whether those @ Ignore's are necessary. We could reference the new ticket in those although I don't see a pattern for that in other @ Ignore's.
Great.
Perfect, thanks. |
Hi @Xtansia, The build logs seem a bit cryptic. But there are build failures in tests of spark 20 and spark 30. Safe to assume we just need an @ Ignore on testKnnVectorAsArrayOfFloats()? and that the |
@dgoldenberg-ias I've fixed the broken CI in 766ae02, could you please rebase ontop of that? |
@Xtansia I'm not able to; it's really odd.
Somehow I can only get to that commit from Jul 2 and it's not seeing yours. Any ideas? |
@dgoldenberg-ias If you navigate to your fork, you should see something like below, click sync fork to bring in the latest changes from this repo and then do the rebase: |
@Xtansia Done, I believe :) |
@dgoldenberg-ias It is just the |
Hi @Xtansia,
|
I'm not sure where it's gone wrong as in theory that should work. You should be able to fixup your branch like so: # Move most recent commit back to original commits, dropping rebased main commits
git rebase --onto faf01c9ad8eac0222ece9c872b38fee541a4e453 HEAD~1
# Rebase all PR commits onto main
git rebase --onto main $(git merge-base main issue-424-support-knn-vectors) |
Hi Thomas @Xtansia , thanks for that. I did that but things still aren't looking right in the modified files list in the PR. I don't grok why git keeps wanting me to merge CHANGELOG.md... Here's the details of what went on: git rebase --onto faf01c9 HEAD~1 git rebase --onto main $(git merge-base main issue-424-support-knn-vectors) vi CHANGELOG.md nothing to commit, working tree clean nothing to commit, working tree clean nothing to commit, working tree clean |
A couple things:
Let's try the following: # Stash any un-commited changes to be safe
git stash -u
# Bring in up-to-date remote info
git fetch origin
# Switch to main branch and match it to remote
git switch main
git reset --hard origin/main
# Switch to feature branch and ensure it matches remote
git switch issue-424-support-knn-vectors
git reset --hard origin/issue-424-support-knn-vectors
# Undo duplicated commit that adds second copy of tests
git reset --hard HEAD~1
# Move most recent commit back to original commits, dropping rebased main commits
git rebase --onto faf01c9ad8eac0222ece9c872b38fee541a4e453 HEAD~1
# Rebase all PR commits onto main
git rebase --onto origin/main $(git merge-base origin/main issue-424-support-knn-vectors)
# Push to remote
git push -u -f origin issue-424-support-knn-vectors |
…Dmitry Goldenberg <dgoldenberg@integralads.com> Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com>
… the CHANGELOG file. Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com> Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com>
…mber to the CHANGELOG file. Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com>
…ng enum variant for KNN_VECTOR to the FieldType enum. Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com>
…Dmitry Goldenberg <dgoldenberg@integralads.com> Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com>
…tests. Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com>
…of comments. Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com>
…r testKnnVectorAsArrayOfFloats for now as the k-NN plugin is currently missing and needs to be added. Signed-off-by: Dmitry Goldenberg <dgoldenberg@integralads.com>
9379374
to
a64cf03
Compare
That is great, thank you so much @Xtansia. It looks like we're at a good, clean state now. Is this good to go or is anything else needed? I see the PR saying "12 workflows awaiting approval". |
Thanks for all the help, @Xtansia, much appreciated! I assume you'll be merging this into main? Could you let me know about the new release? (dgoldenberg@integralads.com) |
@dgoldenberg-ias I recommend "watching" the repo to be notified. |
Got it, @Xtansia, thanks for the detailed explanation :) Please let me know if there are other units of work I could pick up; I'd love to contribute more. Cheers! |
Any and all contributions would be greatly appreciated! Feel free to have a look through the open issues if there's something there that peaks your interest, or any other improvements/fixes you can think of yourself. |
Hi @Xtansia, we're looking to cut a release soon and were hoping to reference this next upcoming release of openseach-hadoop. Any word on when the next release will be out? Thanks. |
@dgoldenberg-ias v1.3.0 has now been released |
Awesome! Thanks, @Xtansia |
Description
This change will allow the users of opensearch-hadoop to get values of fields of type knn-vector to be loaded into a Spark dataframe properly from OpenSearch or persisted to OpenSearch from a Spark dataframe. Up till now, the type had not yet gotten added to the codebase and thus, for example, one of the effects was that knn-vector field values were simply not returned as a column in the Spark dataframe. With this change, they will be.
Issues Resolved
Closes #424
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.