Improve boolean queries #127
I'd like to share something I found interesting, and wonder if you can explain why I might be seeing what I am. I am implementing something that is most easily described as Elasticsearch's `more_like_this` query. It selects a set of terms from the current document that are infrequent, then does a disjunctive (OR) boolean search to find other documents that contain those same infrequent terms, the idea being that similar documents will be those that share their most globally infrequent terms. This search is very fast for small queries of, say, 1-10 terms, but gets noticeably slow at around 25-50 terms, and extremely slow beyond that (upwards of a minute at a few hundred terms). The query is done on a u64 FAST | INDEXED field and looks something like this:
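(The original snippet was not preserved in this copy of the thread. As a hedged reconstruction, a purely disjunctive boolean query over a u64 field in tantivy might look like the sketch below; the field name `feature_hash` and the helper name are hypothetical.)

```rust
use tantivy::query::{BooleanQuery, Occur, Query, TermQuery};
use tantivy::schema::{Field, IndexRecordOption};
use tantivy::Term;

// Hedged sketch: build one big OR query over the selected infrequent
// term hashes. Not the OP's actual code, which was lost.
fn build_disjunction(feature_field: Field, hashes: &[u64]) -> BooleanQuery {
    let subqueries: Vec<(Occur, Box<dyn Query>)> = hashes
        .iter()
        .map(|&hash| {
            let term = Term::from_field_u64(feature_field, hash);
            let query: Box<dyn Query> =
                Box::new(TermQuery::new(term, IndexRecordOption::Basic));
            // Occur::Should makes this a disjunction (OR) leg.
            (Occur::Should, query)
        })
        .collect();
    BooleanQuery::new(subqueries)
}
```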
I decided to try parallelizing this potentially many-term query as a set of few-term chunk queries, each run on its own thread. So a query that would normally be ~1000 terms might become 40 queries of 25 terms each, with the load spread across all available cores using rayon. This brings performance back down to something reasonable, around ~10 seconds of total time on an 8-core machine, where the same large single query would take over a minute. Is this expected based on how the boolean query is implemented, and is this an optimization the library itself could do? The code I use is:
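(This snippet is missing as well. A hedged sketch of the chunked, rayon-parallel approach described above might look like the following; the chunk size of 25, the per-chunk limit, and the merging strategy are assumptions, and `build_disjunction` is the sketch from above.)

```rust
use rayon::prelude::*;
use tantivy::collector::TopDocs;
use tantivy::schema::Field;
use tantivy::{DocAddress, Searcher};

// Hedged sketch: split the hashes into chunks, run one small disjunction
// per chunk on the rayon pool, then merge the per-chunk results.
fn parallel_search(
    searcher: &Searcher,
    feature_field: Field,
    hashes: &[u64],
) -> tantivy::Result<Vec<(f32, DocAddress)>> {
    let mut results: Vec<(f32, DocAddress)> = hashes
        .par_chunks(25)
        .map(|chunk| {
            let query = build_disjunction(feature_field, chunk);
            // Searcher is Send + Sync, so it can be shared across workers.
            searcher.search(&query, &TopDocs::with_limit(100))
        })
        .collect::<tantivy::Result<Vec<_>>>()?
        .into_iter()
        .flatten()
        .collect();
    // Best-scoring hits first.
    results.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    Ok(results)
}
```

Note that merging per-chunk top-k lists is not strictly equivalent to scoring one large disjunction: a document matching terms from several chunks has its score split across those chunks.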
The u64 field here is the farmhash of a STRING field, which is the content I actually care about. I use a hash because I assume it's faster to search over a u64 than over a whole string; is this a correct assumption? (My tests suggest yes.) My schema is:
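(The schema block was also lost. A minimal sketch of a schema matching the description above, with hypothetical field names:)

```rust
use tantivy::schema::{Schema, FAST, INDEXED, STORED, TEXT};

// Hedged reconstruction of the schema described above.
fn build_schema() -> Schema {
    let mut builder = Schema::builder();
    // farmhash of each feature; this is the field actually searched.
    builder.add_u64_field("feature_hash", INDEXED | FAST);
    // The original string content the hashes are derived from.
    builder.add_text_field("content", TEXT | STORED);
    builder.build()
}
```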
I am searching over 60,000 documents, each with about 4,000 terms on average, or "features" as I call them in the code.
It sounds extremely slow indeed. Can you share the entire project with an index? I suspect there might be another problem lurking there. If you cannot, can you at least include information about the version of tantivy you use, confirm you are compiling the project with `--release`, and give the number of segments in your index? To get the number of segments, you can tell me how many files you get ending in `.idx`. Searching on the int rather than the original string should make no difference. This part can be considerably accelerated using fast fields instead of fetching the entire document, but I am not sure this is the actual bottleneck:

```rust
let retrieved_doc = searcher.doc(doc_address).unwrap();
```
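(For reference, a hedged sketch of the fast-field alternative in a recent tantivy version; the fast-field API has changed across releases, and the field name is hypothetical.)

```rust
// Hedged sketch: read the u64 values straight from the fast-field column
// instead of deserializing the whole stored document with searcher.doc().
let segment_reader = searcher.segment_reader(doc_address.segment_ord);
let feature_column = segment_reader
    .fast_fields()
    .u64("feature_hash")
    .expect("u64 fast field");
let values: Vec<u64> = feature_column.values_for_doc(doc_address.doc_id).collect();
```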
I've responded on Gitter, let's chat there.
Depends on #11
The way `tantivy` works right now is a very naive algorithm. There is a lot of room to improve performance there. For instance, the must postings should be intersected, and all of the other legs should be skipped to their next item; better skipping could be done (via blocks or using a skip list); and different algorithms could kick in depending on the size of the lists considered.
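(To illustrate the kind of skipping being described, not tantivy's actual implementation: a minimal sketch of intersecting two sorted posting lists where each side can leap past runs of irrelevant doc ids rather than advancing one posting at a time.)

```rust
// Minimal sketch: intersection of two sorted doc-id lists with skipping.
// A real engine would skip via per-block skip pointers on disk; the
// control flow is the same.
fn intersect_with_skipping(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0usize, 0usize);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(a[i]);
            i += 1;
            j += 1;
        } else if a[i] < b[j] {
            // Skip ahead: find the first element >= b[j] by binary search
            // instead of stepping through every posting in between.
            i += a[i..].partition_point(|&d| d < b[j]);
        } else {
            j += b[j..].partition_point(|&d| d < a[i]);
        }
    }
    out
}
```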