Elective n-grams #314

Open
fulmicoton opened this issue Jun 7, 2018 · 1 comment

Comments

@fulmicoton
Collaborator

fulmicoton commented Jun 7, 2018

n-grams can considerably accelerate PhraseQuery.

Let's say one is searching for The quest for the holy grail...
A phrase query will compute the intersection of the posting lists associated with each of these words,
and then refine the candidate documents by checking whether each one actually contains the phrase.

The check works by taking the terms in ascending docfreq order (tf is assumed to be strongly correlated with docfreq), and verifying that each new term appears at the right position relative to the terms visited so far.
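
To make the position check concrete, here is a minimal sketch (not tantivy's actual code; the function and its input layout are assumptions for illustration). For a single candidate document, each query term comes with its offset inside the phrase and its sorted positions within the document, terms ordered by ascending docfreq:

```rust
// A minimal sketch of the position check (not tantivy's actual code).
// Input: for one candidate document, one entry per query term, already
// sorted by ascending docfreq, holding the term's offset inside the phrase
// and its sorted positions within the document.
fn phrase_matches(term_positions: &[(usize, Vec<u32>)]) -> bool {
    let Some((first_offset, first_positions)) = term_positions.first() else {
        return false;
    };
    // Every occurrence of the rarest term is a candidate start for the phrase.
    'candidates: for &pos in first_positions {
        let Some(start) = pos.checked_sub(*first_offset as u32) else {
            continue;
        };
        // Every remaining term must appear exactly at `start + its offset`.
        for (offset, positions) in &term_positions[1..] {
            if positions.binary_search(&(start + *offset as u32)).is_err() {
                continue 'candidates;
            }
        }
        return true;
    }
    false
}
```

The posting-list intersection already narrows down the candidate documents; this routine is the per-document refinement step mentioned above.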

We might accelerate intersection and matching by indexing n-grams. It can be seen as a way of caching part of the phrase query. For instance, assuming we had indexed the quest with its respective positions, our phrase query would apply to ["The quest", for, the, holy, grail].

This problem can be entirely addressed using our tokenizer pipeline...

At indexing time, a token filter could add an extra token and emit:

- text:"The", position_increment: 0, token_length: 1
- text:"The quest", position_increment: 2, token_length: 2
- text:"for", position_increment: 1, token_length: 1
- text:"the", position_increment: 1, token_length: 1
- text:"holy", position_increment: 1, token_length: 1
- text:"grail", position_increment: 1, token_length: 1

At query time, the token filter would behave slightly differently and emit:

- text:"The quest", position_increment: 2, token_length: 2
- text:"for", position_increment: 1, token_length: 1
- text:"the", position_increment: 1, token_length: 1
- text:"holy", position_increment: 1, token_length: 1
- text:"grail", position_increment: 1, token_length: 1

So the TokenFilter would behave differently for search and for indexing. Note that this is similar to the synonym token filter situation.

This therefore requires adding something to the tokenizer API to tell whether we are in search mode or indexing mode as we build the pipeline.
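
One possible shape for that mode switch (hypothetical names, not an existing tantivy API): the analyzer pipeline would be built with an explicit mode, and mode-aware filters such as the elective n-gram filter would consult it.

```rust
// Hypothetical sketch of an indexing/search mode flag for the pipeline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AnalyzerMode {
    Indexing,
    Search,
}

struct ElectiveNgramFilter {
    mode: AnalyzerMode,
    // ... configured n-grams / triggers would live here as well
}

impl ElectiveNgramFilter {
    fn for_mode(mode: AnalyzerMode) -> Self {
        ElectiveNgramFilter { mode }
    }

    // At indexing time the leading word of a matched n-gram is still emitted
    // (with position_increment 0) alongside the combined token, as in the
    // first list above; at search time only the combined token is emitted,
    // as in the second list.
    fn emit_leading_word(&self) -> bool {
        self.mode == AnalyzerMode::Indexing
    }
}
```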

Not all n-grams?

Indexing all n-grams would be overkill for this problem. Ideally the TokenFilter should be configurable by the user (a possible shape for this configuration is sketched after the list), by:

  • explicitly passing a list of n-grams
  • passing a list of triggers (e.g. "the *" would generate the 2-grams that start with the).
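
As a rough illustration of that configuration (all names are made up for the sketch), both options could be expressed as rules that decide whether a given 2-gram should be indexed:

```rust
// Rough illustration of the user-facing configuration (names are made up):
// a rule either names an explicit n-gram, or a trigger such as "the *"
// meaning "every 2-gram whose first word is the".
enum NgramRule {
    /// Index exactly this n-gram, e.g. ["the", "quest"].
    Explicit(Vec<String>),
    /// Index every 2-gram starting with this word, e.g. "the *".
    StartsWith(String),
}

struct ElectiveNgramConfig {
    rules: Vec<NgramRule>,
}

impl ElectiveNgramConfig {
    /// Should the pair (first, second) also be indexed as a 2-gram?
    fn matches_two_gram(&self, first: &str, second: &str) -> bool {
        self.rules.iter().any(|rule| match rule {
            NgramRule::Explicit(words) => {
                words.len() == 2 && words[0] == first && words[1] == second
            }
            NgramRule::StartsWith(first_word) => first_word.as_str() == first,
        })
    }
}
```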
@fulmicoton
Collaborator Author

Related to #291

@fulmicoton fulmicoton added this to the Undefined milestone Jun 12, 2018
@fulmicoton fulmicoton modified the milestones: Undefined, 0.7.0 Jun 13, 2018
@fulmicoton fulmicoton modified the milestones: 0.7.0, 0.8 Sep 16, 2018
@fulmicoton fulmicoton modified the milestones: 0.8, 0.9 Dec 26, 2018
Projects: None yet
Development: No branches or pull requests
1 participant