Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow users to pass customField to /rerank endpoint #303

Open
wants to merge 10 commits into
base: rc/2024-10
Choose a base branch
from

Conversation

aulorbe
Copy link
Collaborator

@aulorbe aulorbe commented Oct 15, 2024

Problem

The current implementation of /rerank in the TS client does not (correctly) allow users to pass a custom field upon which to rerank.

Solution

Allow custom fields!

Please reference this PR to account for all expected functionality: https://github.com/pinecone-io/python-plugin-inference/pull/21/files

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Infrastructure change (CI configs, etc)
  • Non-code change (docs, etc)
  • None of the above: (explain here)

Test Plan

CI passes + reviewer xreferences PR above w/functionality introduced in this PR.


@aulorbe aulorbe marked this pull request as ready for review October 15, 2024 22:27
Copy link
Contributor

@austin-denoble austin-denoble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, I did leave some question and comments about a few things. I feel like I haven't read much TypeScript in a while and am a bit rusty. 😅

Comment on lines 174 to 178
} else {
throw new PineconeArgumentError(
'Documents must be a list of strings or a list of objects'
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for it to be a mix of the two? I'm actually not sure, but I'm assuming we don't allow that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that's a good q! Right now, the TS client doesn't account for that. I'll add it in, thank you!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean if someone mixed strings and documents in the same request? That would be kinda odd, I don't think we need to bend over backwards to support that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay cool. I think my final else stmt would actually account for this anyways, no?

The logic is:

  • if docs is an array of strings, do something
  • else if docs is an array of objs, do something else
  • else, error out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think the logic should be fine for that atm.

Comment on lines 156 to 165
if (documents.every((doc) => typeof doc === 'string')) {
// Set rankFields to default value of 'text'
computedRankFields = ['text'];
} else if (
// If`documents` is an array of objects:
documents.every((doc) => typeof doc === 'object' && doc !== null)
) {
// If there is a `text` key (other keys can be present too):
if (
documents.every((doc) => (doc as { [key: string]: string })['text'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there maybe a more efficient way to parse the documents that are passed and make some determinations about the contents one time up front rather than calling documents.every so many times? In the last code path we'd be iterating over everything in the array 3 times which is a bit heavy.

I'd try maybe looping over it once and building some parameters/bools that we use in these if statements rather than iterating repeatedly.

computedRankFields = ['text'];
} else if (
// If`documents` is an array of objects:
documents.every((doc) => typeof doc === 'object' && doc !== null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we check doc !== null here but not with the string case above?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added above!

}
// Standardize documents to ensure they are in object format
const newDocuments = documents.map((doc) =>
typeof doc === 'string' ? { text: doc } : doc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the text key here be different if they passed a custom rankFields?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a valid scenario, here's my thinking:

  • If someone passes in an array of strings, there are no keys in their documents
  • If there are no keys in their documents, on the backend we artificially transform their documents into an array of key:value pairs, with a text key, simply b/c that's what the model needs for reranking
  • If someone passes in an array of strings and then also passes in a custom field to rerank on... their docs would inherently be missing that custom field (by way of not being key:value pairs), so it would error out

Thoughts @ssmith-pc ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not fully following the original question since the impl has changed, but:

  • If there are no keys in their documents, on the backend we artificially transform their documents into an array of key:value pairs, with a text key, simply b/c

By "on the backend we transform" I think you mean, it's transformed in the client right? This is the right behavior

  • If someone passes in an array of strings and then also passes in a custom field to rerank on... their docs would inherently be missing that custom field (by way of not being key:value pairs), so it would error out

Correct, but it's fine if that error is detected in the server and sent back to the client. The client doesn't have to try and detect that.

Comment on lines 24 to 25
const params = { inputType: 'text', truncate: 'END' };
const expected = { inputType: 'text', truncate: 'END' };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to test if we're not expecting this to change anything? Maybe I'm unclear on the test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird... this must be a holdover from something... I'll remove method (it does nothing currently) + this test. thanks!

{ text: 'hi', customField: 'doc1' },
{ text: 'bye', customField: 'doc2' },
];
expect(pinecone.inference.rerank(model, query, myDocuments)).toBeDefined();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little rusty on ts at this point, if the call fails it would avoid the expect and fail the test, right? Just trying to understand what this is asserting.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh man, extremely good catch! I'll change to assert that rerankUnits are >= 1

Comment on lines 18 to 20
const expectedError = new PineconeArgumentError(
'You must pass at least one document to rerank'
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Personal preference, but I'd maybe just leave the expected error inline on the expect().toEqual(). Feels easier to parse.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated!

Copy link
Contributor

@jseldess jseldess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, @aulorbe, but can you add an example to the README?

@aulorbe
Copy link
Collaborator Author

aulorbe commented Oct 17, 2024

Looks good to me, @aulorbe, but can you add an example to the README?

Thank you!

I think the existing example in the README on the RC branch is still valid, right? Or were you think I should add more to that one?

@jseldess
Copy link
Contributor

Looks good to me, @aulorbe, but can you add an example to the README?

Thank you!

I think the existing example in the README on the RC branch is still valid, right? Or were you think I should add more to that one?

Yeah, I think it's still valid. But perhaps we should add rankFields to the second example. Also, I'm not sure if it's worth NOT setting the other rank options in the first example (top_n, return documents, etc.). I'd suggest the first example has everything but rankfields, and the second example shows how to use rankfields to rank on a custom field.

@aulorbe
Copy link
Collaborator Author

aulorbe commented Oct 17, 2024

Looks good to me, @aulorbe, but can you add an example to the README?

Thank you!
I think the existing example in the README on the RC branch is still valid, right? Or were you think I should add more to that one?

Yeah, I think it's still valid. But perhaps we should add rankFields to the second example. Also, I'm not sure if it's worth NOT setting the other rank options in the first example (top_n, return documents, etc.). I'd suggest the first example has everything but rankfields, and the second example shows how to use rankfields to rank on a custom field.

Cool, will do!

Copy link
Contributor

@jseldess jseldess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

README LGTM, with a nit

README.md Outdated Show resolved Hide resolved
aulorbe and others added 2 commits October 17, 2024 12:22
Co-authored-by: Jesse Seldess <j_seldess@hotmail.com>
* should to rank the documents, and additional model-specific parameters. See {@link RerankOptions} for more details.
* @param documents - (Required) An array of documents to rerank. The array can either be an array of strings or
* an array of objects.
* @param options - (Optional) Additional options to send with the reranking request. See {@link RerankOptions} for more details.
* */
@prerelease('2024-10')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove this pre-release tag now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

});
});
if (!options.rankFields) {
// If each obj in newDocs does not have a 'text' field, throw an error

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the comment here should probably read "if any obj does not have... "

But may be more clear as simply:
"Ensure each obj in newDocs has a 'text' field"

Comment on lines 191 to 193
if (
!newDocuments.every((doc) => typeof doc === 'object' && 'text' in doc)
) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server will validate that the documents contain the fields they're supposed to, so it's not entirely necessary to validate on the client. However, if we want to validate on the client, why not also validate the case where a custom rank field was provided?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah good call, I remove this part actually (the validation part)

parameters = {},
rankFields = undefined,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to destructure rankFields and set to undefined? Isn't it undefined by default? I think I don't entirely understand why we can't keep the rankFields = ['text'] default

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good call, good call

Copy link
Contributor

@austin-denoble austin-denoble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating and helping me iterate in return! 🚢

const {
topN = documents.length,
returnDocuments = true,
rankFields = ['text'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be left in, right? If they haven't provided a rankFields array explicitly, we should just default to "text".

Copy link

@ssmith-pc ssmith-pc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants