The passed-in tokenizer is not being used in the get_scores method. #38

Open
Alkacid opened this issue Apr 18, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@Alkacid

Alkacid commented Apr 18, 2024

I noticed that a custom tokenizer can be passed in during initialization to tokenize the input documents, but it is not applied to the query in the get_scores method. This means callers have to tokenize the query themselves before calling get_scores. Would it be possible to add the following at the beginning of the get_scores method:

if self.tokenizer:
    query = self.tokenizer(query)

@dorianbrown added the enhancement label Oct 8, 2024

@dorianbrown
Owner

I do like the idea for this change, but am a little worried about making backward-incompatible changes considering how many people seem to be using the package.

I'd like to leave this issue open and see if there's more support for this.
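
One way the backward-compatibility concern might be addressed is to tokenize the query only when it arrives as a raw string, and to pass already-tokenized lists through unchanged. The sketch below illustrates that idea on a toy class. `TinyScorer` and `simple_tokenizer` are hypothetical names made up for this example, and the scoring is plain term-frequency overlap rather than BM25, so this is not the package's actual implementation.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Union

Tokenizer = Callable[[str], List[str]]


class TinyScorer:
    """Hypothetical stand-in for a BM25 class; scores by term-frequency overlap."""

    def __init__(self, corpus: Sequence, tokenizer: Optional[Tokenizer] = None):
        self.tokenizer = tokenizer
        # Tokenize documents only when a tokenizer was supplied;
        # otherwise the corpus is assumed to be pre-tokenized.
        docs = [tokenizer(doc) for doc in corpus] if tokenizer else corpus
        self.doc_freqs = [Counter(doc) for doc in docs]

    def get_scores(self, query: Union[str, Sequence[str]]) -> List[int]:
        # A raw string is tokenized with the registered tokenizer.
        # A list of tokens passes through untouched, so callers that
        # already tokenize the query themselves keep working.
        if self.tokenizer is not None and isinstance(query, str):
            query = self.tokenizer(query)
        return [sum(freqs[term] for term in query) for freqs in self.doc_freqs]


def simple_tokenizer(text: str) -> List[str]:
    # Assumed tokenizer for the example: lowercase, split on whitespace.
    return text.lower().split()


scorer = TinyScorer(
    ["Hello there world", "Goodbye world world"],
    tokenizer=simple_tokenizer,
)
raw_scores = scorer.get_scores("World")             # raw string query
pretokenized_scores = scorer.get_scores(["world"])  # pre-tokenized query
```

With the `isinstance` check, both call styles produce the same scores, whereas the unconditional `self.tokenizer(query)` from the original proposal would hand a list to the tokenizer for callers who already pre-tokenize.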

@jankovicsandras

I think it's very important to ensure that the same tokenizer function is used in index creation and at query time. I made an optimized rewrite where the tokenizer function is registered in the class: https://github.com/jankovicsandras/bm25opt
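
To illustrate why the tokenizer mismatch matters, here is a minimal sketch, assuming a lowercase-and-split tokenizer at index time. `index_tokenizer` and the sample strings are made up for the example; it only checks token overlap and does not compute BM25 scores.

```python
from typing import List


def index_tokenizer(text: str) -> List[str]:
    # Assumed index-time tokenizer: lowercase, split on whitespace.
    return text.lower().split()


doc_tokens = set(index_tokenizer("The Quick Brown Fox"))
query = "Quick Fox"

# Query tokenized differently from the index (no lowercasing):
# none of the query tokens are found among the document tokens.
mismatched = [tok for tok in query.split() if tok in doc_tokens]

# Query tokenized with the same function used for the index:
# both query tokens are found.
matched = [tok for tok in index_tokenizer(query) if tok in doc_tokens]
```

Here `mismatched` is empty while `matched` contains both query terms, so a scorer fed the mismatched tokens would see no overlap with the document at all.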
