Improve boolean queries #127

Open
fulmicoton opened this issue Apr 27, 2017 · 4 comments

fulmicoton commented Apr 27, 2017

Depends on #11

The way tantivy evaluates boolean queries right now is very naive.

There is a lot of room to improve performance there.
For instance, the MUST postings should be intersected while all of the other legs are skipped to their next item; better skipping could be done (via blocks or a skip list); and different algorithms could kick in depending on the sizes of the lists considered.
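
For illustration, here is a rough, simplified sketch (not tantivy's actual code) of the intersection-with-skipping idea over sorted doc-id lists; the `seek` helper stands in for the block-based or skip-list skipping a real implementation would use:

```rust
/// Seek a sorted posting list forward to the first doc id >= `target`.
/// Here it is a binary search from the current position; real skipping
/// would use block metadata or a skip list to avoid touching most entries.
fn seek(postings: &[u32], pos: &mut usize, target: u32) -> Option<u32> {
    *pos += postings[*pos..].partition_point(|&d| d < target);
    postings.get(*pos).copied()
}

/// Leapfrog-style intersection: keep a candidate doc id and seek every
/// list to it; whenever a list overshoots, the larger doc id becomes the
/// new candidate, so long stretches of non-matching docs are skipped.
fn intersect(lists: &[Vec<u32>]) -> Vec<u32> {
    let mut positions = vec![0usize; lists.len()];
    let mut result = Vec::new();
    let mut candidate = match lists.first().and_then(|l| l.first()) {
        Some(&d) => d,
        None => return result,
    };
    'outer: loop {
        for (list, pos) in lists.iter().zip(positions.iter_mut()) {
            match seek(list, pos, candidate) {
                Some(d) if d == candidate => {}                // this list agrees, check the next one
                Some(d) => { candidate = d; continue 'outer; } // overshoot: retry with the larger doc id
                None => break 'outer,                          // one list is exhausted: no more matches
            }
        }
        result.push(candidate); // every list contains `candidate`
        candidate += 1;         // search for the next common doc id
    }
    result
}

fn main() {
    let lists = vec![vec![1, 3, 5, 7, 9], vec![3, 4, 5, 9], vec![0, 3, 9, 12]];
    assert_eq!(intersect(&lists), vec![3, 9]);
}
```

Driving the candidate from the shortest list, and choosing a different strategy altogether when one list is much smaller than the others, is where much of the potential speedup would come from.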

fulmicoton changed the title from "Improve query matching" to "Improve boolean queries" on Apr 27, 2017

stevemk14ebr commented Nov 15, 2020

I'd like to comment on something I found interesting, and wonder if you can explain why I might be seeing what I am. I am implementing something that is most easily described as Elasticsearch's 'more_like_this' query. It selects a set of terms from the current document that are infrequent and then does a disjunctive (OR) boolean search to find other documents that have those same infrequent terms. The idea being that similar documents will be those that have their most globally infrequent terms in common. When performing this search it is very fast for small queries of, say, 1-10 terms, but gets noticeably slow at around 25-50 terms, and extremely slow beyond that (upwards of 1 minute to search at a few hundred terms).

The query is done on a u64 FAST | INDEXED field and looks something like this

HASH1 OR HASH2 OR HASH3 OR HASH4 OR ...

I decided to attempt to parallelize this potentially many-term query into a set of few-term chunk queries, each done on a new thread. So a query that would normally be like 1000 terms might become 40 queries of 25 terms each, with the load spread across all available cores using rayon. This brings performance back down to a reasonable ~10 seconds or so of total time on an 8-core machine, where the same single large query would take over a minute. Is this expected based on how the boolean query is implemented, and is this an optimization that could be done by the library itself? The code I use is:

use rayon::prelude::*; // for par_chunks

let mut search_query_terms: Vec<tantivy::query::TermQuery> = Vec::new();
// I have an external DB that tracks which terms are infrequent for the current doc; for each such term do this:
let term_query = tantivy::query::TermQuery::new(
    tantivy::Term::from_field_u64(sha256_value_field, val_hash),
    tantivy::schema::IndexRecordOption::Basic
);
search_query_terms.push(term_query);

// split what would normally be a huge single query into smaller chunks,
// collecting the per-chunk results so they can be merged below
let sample_chunks: Vec<SimilarSamplesResponse> = search_query_terms.par_chunks(50).filter_map(|term_chunks| {
    let mut term_queries: Vec<(tantivy::query::Occur, Box<dyn tantivy::query::Query>)> =  Vec::new();
    for t in term_chunks {
        term_queries.push((tantivy::query::Occur::Should,  Box::new(t.clone())));
    }

    let boolean_query = tantivy::query::BooleanQuery::from(term_queries);
    let (searcher, docs) = match state.index_handle.search_vals_query(&boolean_query, &tantivy::collector::TopDocs::with_limit(10), vec![sha256_value_field]) {
        Ok(v) => v,
        Err(e) => {
            return None
        }
    };

    let mut similar_samples = SimilarSamplesResponse::new();
    for (score, doc_address) in docs {
        let retrieved_doc = searcher.doc(doc_address).unwrap();
        let sha256_field = state.index_handle.search_vals.schema.get_field("sha256").unwrap();

        let mut field_sha256 = "".to_string();
        for field in retrieved_doc.field_values() {
            if field.field() == sha256_field {
                field_sha256 = field.value().text().unwrap().to_string();
                break;
            }
        }

        similar_samples.insert(field_sha256, score as f64);
    }
        
    Some(similar_samples)
}).collect();

// approximately average the chunk scores back together, not exact but good enough for such a rough metric
let mut samples: SimilarSamplesResponse = sample_chunks.iter().fold(SimilarSamplesResponse::new(), |mut acc, sample_chunk| {
    for (sample, score) in sample_chunk {
        acc.entry(sample.to_string()).and_modify(|e| *e = (*e + score) / 2.0).or_insert(score.clone());
    }
    acc
});

// find self, get its score which should be the max possible, then divide the others by it to convert to a percentage
let max_score = samples.get(&sha256.to_string()).expect("Similarity search should always return self!").clone();
for (_hash, score) in samples.iter_mut() {
    *score /= max_score;
}

The u64 field here is the 'farmhash' of a STRING field, which is the content I actually care about. I use a hash because I assume that it's faster to search over a u64 than a whole string; is this a correct assumption? (My tests suggest yes.) My schema is:

let search_field_sha256 = self.search_vals.schema.get_field("sha256").unwrap();
let search_field_feature_value =  self.search_vals.schema.get_field("feature_value").unwrap();

// this is the hash of the above feature_value, makes search faster.
let search_field_feature_value_sha256 = self.search_vals.schema.get_field("feature_value_sha256").unwrap();

let mut sample_search_doc = Document::default();
sample_search_doc.add_text(search_field_sha256, &sha256);
for feature in features.iter() {
    sample_search_doc.add_text(search_field_feature_value, &feature.value);
    sample_search_doc.add_u64(search_field_feature_value_sha256, farmhash::hash64(feature.value.as_bytes()));
}

I am searching over 60,000 documents, where each has about 4,000 terms, or 'features' as I call them in the code, on average.


fulmicoton commented Nov 16, 2020

It sounds extremely slow indeed. Can you share the entire project with an index? I suspect there might be another problem lurking there.

If you cannot, can you at least include the version of tantivy you are using, confirm you are compiling the project with --release, and tell me how many segments you have in your index.

To get the number of segments, you can tell me how many files ending in .idx you have in your directory, or attach your meta.json file.
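
A minimal sketch of checking this programmatically, assuming you have the opened `Index` at hand (and that `searchable_segment_metas` exists in your tantivy version):

```rust
use tantivy::Index;

// Print how many searchable segments the index currently has.
// Many small segments can noticeably slow queries down.
fn print_segment_count(index: &Index) -> tantivy::Result<()> {
    let segment_count = index.searchable_segment_metas()?.len();
    println!("searchable segments: {}", segment_count);
    Ok(())
}
```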

Searching on the int rather than the original string should make no difference.
The hash is very useful, however, as a fast field. I cannot spot where you actually use the fast field in this code.

This part can be considerably accelerated by using fast fields instead of fetching the entire document, but I am not sure this is your bottleneck.

let retrieved_doc = searcher.doc(doc_address).unwrap();
let sha256_field = state.index_handle.search_vals.schema.get_field("sha256").unwrap();

let mut field_sha256 = "".to_string();
for field in retrieved_doc.field_values() {
    if field.field() == sha256_field {
        field_sha256 = field.value().text().unwrap().to_string();
        break;
    }
}

similar_samples.insert(field_sha256, score as f64);
@stevemk14ebr

I've responded on Gitter, let's chat there.

lnicola removed their assignment Sep 12, 2021