n-grams can considerably accelerate `PhraseQuery`.

Let's say one is trying to search "The quest for the holy grail"...

A phrase query computes the intersection of the posting lists associated with each of these words, and then refines the candidate documents by checking whether the term positions actually form a phrase match. This works by visiting the terms in increasing docfreq order (term frequency is assumed to be strongly correlated with docfreq), and checking whether each new term is at the right position relative to the terms visited so far.
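The position check described above can be sketched as follows. This is a simplified illustration, not tantivy's actual implementation: it assumes we already intersected the posting lists down to one candidate document and now only verify positions (a real implementation would interleave this with the intersection and visit terms in docfreq order).

```rust
// Sketch: checking a phrase match inside one candidate document.
// `positions` holds, for each query term in phrase order, the sorted
// positions at which that term occurs in the document.
fn is_phrase_match(positions: &[Vec<u32>]) -> bool {
    // For every occurrence of the first term, check that each following
    // term occurs exactly one position further.
    positions[0].iter().any(|&start| {
        positions[1..]
            .iter()
            .enumerate()
            .all(|(i, ps)| ps.binary_search(&(start + 1 + i as u32)).is_ok())
    })
}

fn main() {
    // "the quest for the holy grail" occurring at positions 10..=15
    let positions = vec![
        vec![3, 10],  // "the"
        vec![11],     // "quest"
        vec![12],     // "for"
        vec![10, 13], // "the"
        vec![14],     // "holy"
        vec![15],     // "grail"
    ];
    assert!(is_phrase_match(&positions));
    // terms present but never adjacent -> no phrase match
    assert!(!is_phrase_match(&[vec![1], vec![5]]));
}
```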
We might accelerate intersection and matching by indexing n-grams. This can be seen as a way of caching part of the phrase query. For instance, assuming we had indexed "the quest" with its respective positions, our phrase query would apply on `["the quest", for, the, holy, grail]`.

This problem can be entirely addressed using our tokenizer pipeline...
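The rewriting step can be sketched like this (hypothetical helper, not an existing tantivy function): whenever two consecutive query terms form a cached n-gram, the pair is replaced by the single cached term, so one posting list lookup stands in for two.

```rust
// Sketch: rewriting the phrase terms when a cached bigram exists.
// `indexed_bigrams` is the (hypothetical) set of cached n-grams.
fn rewrite_phrase(terms: &[&str], indexed_bigrams: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < terms.len() {
        if i + 1 < terms.len() {
            let bigram = format!("{} {}", terms[i], terms[i + 1]);
            if indexed_bigrams.contains(&bigram.as_str()) {
                // The cached bigram stands in for both terms; it is assumed
                // to be indexed at the position of its first word.
                out.push(bigram);
                i += 2;
                continue;
            }
        }
        out.push(terms[i].to_string());
        i += 1;
    }
    out
}

fn main() {
    let terms = ["the", "quest", "for", "the", "holy", "grail"];
    let rewritten = rewrite_phrase(&terms, &["the quest"]);
    assert_eq!(rewritten, vec!["the quest", "for", "the", "holy", "grail"]);
}
```

Note that the bigram spans two positions in the original phrase, so the position checks for the terms following it must account for that offset.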
At indexing time, a token filter could add an extra token to the ones it emits.
At query time, the token filter would have a slightly different behavior and emit a different token sequence.
So the `TokenFilter` would have a different behavior for search and for indexing. Note that this is similar to the synonym token filter situation. This therefore requires adding something to the tokenizer API to tell whether we are in search mode or in indexing mode as we build the pipeline.
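The two behaviors could look like the following sketch. The `Mode` enum, the `Token` struct, and the `bigram_filter` function are hypothetical illustrations, not tantivy's actual `TokenFilter` API: at indexing time every unigram is emitted *plus* the cached bigram (at the position of its first word), so both posting lists exist; at search time only the rewritten sequence is emitted.

```rust
#[derive(Clone, Copy)]
enum Mode {
    Index,
    Search,
}

/// A token together with its position in the token stream.
struct Token {
    text: String,
    position: u32,
}

// Sketch of a bigram token filter with mode-dependent behavior.
fn bigram_filter(tokens: &[Token], mode: Mode, bigrams: &[&str]) -> Vec<Token> {
    let mut out = Vec::new();
    let mut skip_next = false;
    for (i, tok) in tokens.iter().enumerate() {
        if skip_next {
            skip_next = false;
            continue;
        }
        let pair = tokens.get(i + 1).map(|n| format!("{} {}", tok.text, n.text));
        match (mode, pair) {
            (Mode::Index, Some(p)) if bigrams.contains(&p.as_str()) => {
                // Emit the unigram and, at the same position, the bigram.
                out.push(Token { text: tok.text.clone(), position: tok.position });
                out.push(Token { text: p, position: tok.position });
            }
            (Mode::Search, Some(p)) if bigrams.contains(&p.as_str()) => {
                // Emit only the bigram and skip the word it absorbed.
                out.push(Token { text: p, position: tok.position });
                skip_next = true;
            }
            _ => out.push(Token { text: tok.text.clone(), position: tok.position }),
        }
    }
    out
}

fn main() {
    let toks = |ws: &[&str]| -> Vec<Token> {
        ws.iter()
            .enumerate()
            .map(|(i, w)| Token { text: w.to_string(), position: i as u32 })
            .collect()
    };
    let indexed = bigram_filter(&toks(&["the", "quest", "for"]), Mode::Index, &["the quest"]);
    let texts: Vec<_> = indexed.iter().map(|t| t.text.as_str()).collect();
    assert_eq!(texts, vec!["the", "the quest", "quest", "for"]);

    let searched = bigram_filter(&toks(&["the", "quest", "for"]), Mode::Search, &["the quest"]);
    let texts: Vec<_> = searched.iter().map(|t| t.text.as_str()).collect();
    assert_eq!(texts, vec!["the quest", "for"]);
}
```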
Not all n-grams?
Indexing all n-grams would be overkill for this problem. Ideally the `TokenFilter` should be configurable by the user:
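The concrete configuration options are not part of this excerpt; as one hypothetical shape, the user could explicitly list the n-grams worth caching (e.g. frequent pairs starting with stop words). The `NgramFilterConfig` type below is purely illustrative.

```rust
// Hypothetical sketch of a user-facing configuration for the filter.
struct NgramFilterConfig {
    /// Explicit n-grams to index, e.g. ["the quest"].
    cached_ngrams: Vec<String>,
}

impl NgramFilterConfig {
    fn new<I, S>(ngrams: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        NgramFilterConfig {
            cached_ngrams: ngrams.into_iter().map(Into::into).collect(),
        }
    }

    /// Is this n-gram one the user asked us to cache?
    fn is_cached(&self, ngram: &str) -> bool {
        self.cached_ngrams.iter().any(|n| n == ngram)
    }
}

fn main() {
    let config = NgramFilterConfig::new(["the quest", "of the"]);
    assert!(config.is_cached("the quest"));
    assert!(!config.is_cached("holy grail"));
}
```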