Improve boolean queries #127

Open
fulmicoton opened this issue Apr 27, 2017 · 4 comments

fulmicoton commented Apr 27, 2017

Depends on #11

The way tantivy evaluates boolean queries right now is very naive.

There is a lot of room to improve performance there.
For instance, the MUST postings should be intersected while all of the other legs are skipped to their next item; better skipping could be done (via blocks or a skip list); and different algorithms could kick in depending on the sizes of the lists considered.
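
For illustration, here is a rough, simplified sketch (not tantivy's actual code) of the intersection-with-skipping idea over sorted doc-id lists; the `seek` helper stands in for the block-based or skip-list skipping a real implementation would use:

```rust
/// Seek a sorted posting list forward to the first doc id >= `target`.
/// Here it is a binary search from the current position; real skipping
/// would use block metadata or a skip list to avoid touching most entries.
fn seek(postings: &[u32], pos: &mut usize, target: u32) -> Option<u32> {
    *pos += postings[*pos..].partition_point(|&d| d < target);
    postings.get(*pos).copied()
}

/// Leapfrog-style intersection: keep a candidate doc id and seek every
/// list to it; whenever a list overshoots, the larger doc id becomes the
/// new candidate, so long stretches of non-matching docs are skipped.
fn intersect(lists: &[Vec<u32>]) -> Vec<u32> {
    let mut positions = vec![0usize; lists.len()];
    let mut result = Vec::new();
    let mut candidate = match lists.first().and_then(|l| l.first()) {
        Some(&d) => d,
        None => return result,
    };
    'outer: loop {
        for (list, pos) in lists.iter().zip(positions.iter_mut()) {
            match seek(list, pos, candidate) {
                Some(d) if d == candidate => {}                // this list agrees, check the next one
                Some(d) => { candidate = d; continue 'outer; } // overshoot: retry with the larger doc id
                None => break 'outer,                          // one list is exhausted: no more matches
            }
        }
        result.push(candidate); // every list contains `candidate`
        candidate += 1;         // search for the next common doc id
    }
    result
}

fn main() {
    let lists = vec![vec![1, 3, 5, 7, 9], vec![3, 4, 5, 9], vec![0, 3, 9, 12]];
    assert_eq!(intersect(&lists), vec![3, 9]);
}
```

Driving the candidate from the shortest list, and choosing a different strategy altogether when one list is much smaller than the others, is where much of the potential speedup would come from.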

fulmicoton changed the title from "Improve query matching" to "Improve boolean queries" on Apr 27, 2017

stevemk14ebr commented Nov 15, 2020

I'd like to comment on something I found interesting, and wonder if you can explain why I might be seeing what I am. I am implementing something that is most easily described as Elasticsearch's 'more_like_this' query. It selects a set of terms from the current document that are infrequent and then does a disjunctive (OR) boolean search to find other documents that have those same infrequent terms. The idea being that similar documents will be those that have their most globally infrequent terms in common. When performing this search it is very fast for small queries of, say, 1-10 terms, but gets noticeably slow at around 25-50 terms, and extremely slow beyond that (upwards of 1 minute to search at a few hundred terms).

The query is done on a u64 FAST | INDEXED field and looks something like this

HASH1 OR HASH2 OR HASH3 OR HASH4 OR ...

I decided to attempt to parallelize this potentially many-term query into a set of few-term chunk queries, each done on a new thread. So a query that would normally be like 1000 terms might become 40 queries of 25 terms each, with the load spread across all available cores using rayon. This brings performance back down to a reasonable ~10 seconds or so of total time on an 8-core machine, where the same single large query would take over a minute. Is this expected based on how the boolean query is implemented, and is this an optimization that could be done by the library itself? The code I use is:

use rayon::prelude::*; // for par_chunks

let mut search_query_terms: Vec<tantivy::query::TermQuery> = Vec::new();
// I have an external DB that tracks which terms are infrequent for the current doc; for each such term do this:
let term_query = tantivy::query::TermQuery::new(
    tantivy::Term::from_field_u64(sha256_value_field, val_hash),
    tantivy::schema::IndexRecordOption::Basic
);
search_query_terms.push(term_query);

// split what would normally be a huge single query into smaller chunks,
// collecting the per-chunk results so they can be merged below
let sample_chunks: Vec<SimilarSamplesResponse> = search_query_terms.par_chunks(50).filter_map(|term_chunks| {
    let mut term_queries: Vec<(tantivy::query::Occur, Box<dyn tantivy::query::Query>)> =  Vec::new();
    for t in term_chunks {
        term_queries.push((tantivy::query::Occur::Should,  Box::new(t.clone())));
    }

    let boolean_query = tantivy::query::BooleanQuery::from(term_queries);
    let (searcher, docs) = match state.index_handle.search_vals_query(&boolean_query, &tantivy::collector::TopDocs::with_limit(10), vec![sha256_value_field]) {
        Ok(v) => v,
        Err(e) => {
            return None
        }
    };

    let mut similar_samples = SimilarSamplesResponse::new();
    for (score, doc_address) in docs {
        let retrieved_doc = searcher.doc(doc_address).unwrap();
        let sha256_field = state.index_handle.search_vals.schema.get_field("sha256").unwrap();

        let mut field_sha256 = "".to_string();
        for field in retrieved_doc.field_values() {
            if field.field() == sha256_field {
                field_sha256 = field.value().text().unwrap().to_string();
                break;
            }
        }

        similar_samples.insert(field_sha256, score as f64);
    }
        
    Some(similar_samples)
}).collect();

// approximately average the chunk scores back together, not exact but good enough for such a rough metric
let mut samples: SimilarSamplesResponse = sample_chunks.iter().fold(SimilarSamplesResponse::new(), |mut acc, sample_chunk| {
    for (sample, score) in sample_chunk {
        acc.entry(sample.to_string()).and_modify(|e| *e = (*e + score) / 2.0).or_insert(score.clone());
    }
    acc
});

// find self, get its score which should be the max possible, then divide the others by it to convert to a percentage
let max_score = samples.get(&sha256.to_string()).expect("Similarity search should always return self!").clone();
for (_hash, score) in samples.iter_mut() {
    *score /= max_score;
}

The u64 field here is the 'farmhash' of a STRING field, which is the content I actually care about. I use a hash because I assume that it's faster to search over a u64 than a whole string; is this a correct assumption? (My tests suggest yes.) My schema is:

let search_field_sha256 = self.search_vals.schema.get_field("sha256").unwrap();
let search_field_feature_value =  self.search_vals.schema.get_field("feature_value").unwrap();

// this is the hash of the above feature_value, makes search faster.
let search_field_feature_value_sha256 = self.search_vals.schema.get_field("feature_value_sha256").unwrap();

let mut sample_search_doc = Document::default();
sample_search_doc.add_text(search_field_sha256, &sha256);
for feature in features.iter() {
    sample_search_doc.add_text(search_field_feature_value, &feature.value);
    sample_search_doc.add_u64(search_field_feature_value_sha256, farmhash::hash64(feature.value.as_bytes()));
}

I am searching over 60,000 documents, where each has about 4,000 terms, or 'features' as I call them in the code, on average.


fulmicoton commented Nov 16, 2020

It sounds extremely slow indeed. Can you share the entire project with an index? I suspect there might be another problem lurking there.

If you cannot, can you at least include the version of tantivy you are using, confirm you are compiling the project with --release, and tell me how many segments you have in your index.

To get the number of segments, you can tell me how many files ending in .idx you have in your directory, or attach your meta.json file.
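
A minimal sketch of checking this programmatically, assuming you have the opened `Index` at hand (and that `searchable_segment_metas` exists in your tantivy version):

```rust
use tantivy::Index;

// Print how many searchable segments the index currently has.
// Many small segments can noticeably slow queries down.
fn print_segment_count(index: &Index) -> tantivy::Result<()> {
    let segment_count = index.searchable_segment_metas()?.len();
    println!("searchable segments: {}", segment_count);
    Ok(())
}
```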

Searching on the int rather than the original string should make no difference.
The hash is very useful, however, as a fast field. I cannot spot where you actually use the fast field in this code.

This part can be considerably accelerated by using fast fields instead of fetching the entire document, but I am not sure this is your bottleneck.

let retrieved_doc = searcher.doc(doc_address).unwrap();
let sha256_field = state.index_handle.search_vals.schema.get_field("sha256").unwrap();

let mut field_sha256 = "".to_string();
for field in retrieved_doc.field_values() {
    if field.field() == sha256_field {
        field_sha256 = field.value().text().unwrap().to_string();
        break;
    }
}

similar_samples.insert(field_sha256, score as f64);
@stevemk14ebr

I've responded on Gitter, let's chat there.

lnicola removed their assignment Sep 12, 2021