Elective n-grams #314

Open
fulmicoton opened this issue Jun 7, 2018 · 1 comment

Comments

@fulmicoton
Collaborator

fulmicoton commented Jun 7, 2018

n-grams can considerably accelerate PhraseQuery.

Let's say one is searching for The quest for the holy grail...
A phrase query will compute the intersection of the posting lists associated with each of these words,
and then refine the candidate documents by checking whether each one actually contains the phrase.

The check works by taking the terms in ascending docfreq order (tf is assumed to be strongly correlated with docfreq), and verifying that each new term appears at the right position relative to the terms visited so far.
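
To make the position check concrete, here is a minimal sketch (not tantivy's actual code; the function and its input layout are assumptions for illustration). For a single candidate document, each query term comes with its offset inside the phrase and its sorted positions within the document, terms ordered by ascending docfreq:

```rust
// A minimal sketch of the position check (not tantivy's actual code).
// Input: for one candidate document, one entry per query term, already
// sorted by ascending docfreq, holding the term's offset inside the phrase
// and its sorted positions within the document.
fn phrase_matches(term_positions: &[(usize, Vec<u32>)]) -> bool {
    let Some((first_offset, first_positions)) = term_positions.first() else {
        return false;
    };
    // Every occurrence of the rarest term is a candidate start for the phrase.
    'candidates: for &pos in first_positions {
        let Some(start) = pos.checked_sub(*first_offset as u32) else {
            continue;
        };
        // Every remaining term must appear exactly at `start + its offset`.
        for (offset, positions) in &term_positions[1..] {
            if positions.binary_search(&(start + *offset as u32)).is_err() {
                continue 'candidates;
            }
        }
        return true;
    }
    false
}
```

The posting-list intersection already narrows down the candidate documents; this routine is the per-document refinement step mentioned above.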

We might accelerate intersection and matching by indexing n-grams. It can be seen as a way of caching part of the phrase query. For instance, assuming we had indexed the quest with its respective positions, our phrase query would apply to ["The quest", for, the, holy, grail].

This problem can be entirely addressed using our tokenizer pipeline...

At indexing time, a token filter could add an extra token and emit:

- text:"The", position_increment: 0, token_length: 1
- text:"The quest", position_increment: 2, token_length: 2
- text:"for", position_increment: 1, token_length: 1
- text:"the", position_increment: 1, token_length: 1
- text:"holy", position_increment: 1, token_length: 1
- text:"grail", position_increment: 1, token_length: 1

At query time, the token filter would behave slightly differently and emit:

- text:"The quest", position_increment: 2, token_length: 2
- text:"for", position_increment: 1, token_length: 1
- text:"the", position_increment: 1, token_length: 1
- text:"holy", position_increment: 1, token_length: 1
- text:"grail", position_increment: 1, token_length: 1

So the TokenFilter would behave differently for search and for indexing. Note that this is similar to the synonym token filter situation.

This therefore requires adding something to the tokenizer API to tell whether we are in search mode or indexing mode as we build the pipeline.
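
One possible shape for that mode switch (hypothetical names, not an existing tantivy API): the analyzer pipeline would be built with an explicit mode, and mode-aware filters such as the elective n-gram filter would consult it.

```rust
// Hypothetical sketch of an indexing/search mode flag for the pipeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AnalyzerMode {
    Indexing,
    Search,
}

struct ElectiveNgramFilter {
    mode: AnalyzerMode,
    // ... configured n-grams / triggers would live here as well
}

impl ElectiveNgramFilter {
    fn for_mode(mode: AnalyzerMode) -> Self {
        ElectiveNgramFilter { mode }
    }

    // At indexing time the leading word of a matched n-gram is still emitted
    // (with position_increment 0) alongside the combined token, as in the
    // first list above; at search time only the combined token is emitted,
    // as in the second list.
    fn emit_leading_word(&self) -> bool {
        self.mode == AnalyzerMode::Indexing
    }
}
```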

Not all n-grams?

Indexing all n-grams would be overkill for this problem. Ideally the TokenFilter should be configurable by the user (a possible shape for this configuration is sketched after the list), by:

  • explicitly passing a list of n-grams
  • passing a list of triggers (e.g. "the *" would generate the 2-grams that start with the).
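
As a rough illustration of that configuration (all names are made up for the sketch), both options could be expressed as rules that decide whether a given 2-gram should be indexed:

```rust
// Rough illustration of the user-facing configuration (names are made up):
// a rule either names an explicit n-gram, or a trigger such as "the *"
// meaning "every 2-gram whose first word is the".
enum NgramRule {
    /// Index exactly this n-gram, e.g. ["the", "quest"].
    Explicit(Vec<String>),
    /// Index every 2-gram starting with this word, e.g. "the *".
    StartsWith(String),
}

struct ElectiveNgramConfig {
    rules: Vec<NgramRule>,
}

impl ElectiveNgramConfig {
    /// Should the pair (first, second) also be indexed as a 2-gram?
    fn matches_two_gram(&self, first: &str, second: &str) -> bool {
        self.rules.iter().any(|rule| match rule {
            NgramRule::Explicit(words) => {
                words.len() == 2 && words[0] == first && words[1] == second
            }
            NgramRule::StartsWith(first_word) => first_word.as_str() == first,
        })
    }
}
```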
@fulmicoton
Collaborator Author

Related to #291

@fulmicoton fulmicoton added this to the Undefined milestone Jun 12, 2018
@fulmicoton fulmicoton modified the milestones: Undefined, 0.7.0 Jun 13, 2018
@fulmicoton fulmicoton modified the milestones: 0.7.0, 0.8 Sep 16, 2018
@fulmicoton fulmicoton modified the milestones: 0.8, 0.9 Dec 26, 2018
Projects: None yet
Development: No branches or pull requests
1 participant