I noticed that a custom tokenizer can be passed in during initialization to tokenize the input documents, but that tokenizer is not applied to the query in the get_scores method. As a result, the caller has to tokenize the query manually before passing it in. Would it be possible to add the following lines at the beginning of the get_scores method:
if self.tokenizer:
    query = self.tokenizer(query)
I do like the idea for this change, but am a little worried about making backward-incompatible changes considering how many people seem to be using the package.
I'd like to leave this issue open and see if there's more support for this.
I think it's very important to ensure that the same tokenizer function is used in index creation and at query time. I made an optimized rewrite where the tokenizer function is registered in the class: https://github.com/jankovicsandras/bm25opt
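To make the discussion concrete, here is a minimal, self-contained sketch of the idea: the tokenizer is stored on the object and reused for both indexing and querying, while pre-tokenized list queries are passed through unchanged, which is one possible way to address the backward-compatibility concern. The class name `TinyBM25`, its parameters, and the scoring details are all hypothetical and written for illustration only; this is not the package's actual implementation, nor the code from the linked rewrite.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Sequence, Union

Tokenizer = Callable[[str], List[str]]
TextOrTokens = Union[str, Sequence[str]]


class TinyBM25:
    """Toy BM25-style scorer (hypothetical, for illustration only)."""

    def __init__(self, corpus: Sequence[TextOrTokens],
                 tokenizer: Optional[Tokenizer] = None,
                 k1: float = 1.5, b: float = 0.75) -> None:
        self.tokenizer = tokenizer
        self.k1 = k1
        self.b = b
        # The same _tokenize helper is used here and in get_scores.
        docs = [self._tokenize(doc) for doc in corpus]
        self.doc_freqs = [Counter(doc) for doc in docs]
        self.doc_len = [len(doc) for doc in docs]
        self.avgdl = sum(self.doc_len) / len(docs)  # assumes a non-empty corpus
        n = len(docs)
        df = Counter(term for freqs in self.doc_freqs for term in freqs)
        # Smoothed IDF that stays positive for every term in the corpus.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def _tokenize(self, text: TextOrTokens) -> List[str]:
        # Raw strings go through the registered tokenizer;
        # anything else is assumed to be tokenized already.
        if isinstance(text, str):
            if self.tokenizer is None:
                raise TypeError("string input requires a tokenizer")
            return self.tokenizer(text)
        return list(text)

    def get_scores(self, query: TextOrTokens) -> List[float]:
        terms = self._tokenize(query)
        scores = []
        for freqs, length in zip(self.doc_freqs, self.doc_len):
            norm = self.k1 * (1 - self.b + self.b * length / self.avgdl)
            score = 0.0
            for term in terms:
                tf = freqs.get(term, 0)
                if tf:
                    score += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
            scores.append(score)
        return scores


def lowercase_split(text: str) -> List[str]:
    """Example tokenizer (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()


bm25 = TinyBM25(["Hello World", "Goodbye moon"], tokenizer=lowercase_split)
# A raw string query and the equivalent pre-tokenized query score identically.
same = bm25.get_scores("HELLO") == bm25.get_scores(["hello"])
```

With this shape, existing callers that already pass a token list keep working, and callers that pass a raw string get the same tokenization that was applied to the documents.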