Inconsistent results when searching in attachment #38
Comments
I need to find some time to create test scripts for this, so I haven't been able to triage it yet. I will note that if this statement is true: 'it doesn't find "attachment" because it scans the mail without decoding the PDF.', then this would be a bug in Dovecot core, not flatcurve. Dovecot core code (i.e. the core "maybe" matching) should see the exact same text from a decoded attachment that flatcurve does, so if that's not working correctly it will need to be fixed there.
I hope this is helpful: I created a test in my fork: edieterich@ae73576 All the FTS plugins that come with Dovecot 2.3 pass the test, Flatcurve fails. Here's the test run: https://github.com/edieterich/dovecot-fts-flatcurve/actions/runs/3836744310
I had to patch Squat so that it does not run into a "NO [SERVERBUG]" failure. There appears to be some minimum search term length in fts_lucene, so I'm searching for "bodybody" instead of "body" as in my original description to get a match.
Thank you for your assistance in generating tests! I've pushed a much smaller commit that isolates the issue. Currently, a single test fails in that branch:
Here is what is needed to trigger the failing result:
The issue is that the flatcurve query correctly finds the term in the indexed attachment text, but it also does a header search on the non-indexed headers. This is the one search category that causes "maybe" matches, since flatcurve itself can't verify which header a term is located in if it is not one of the indexed headers (to, from, cc, etc.). Here, this part of the query returns no results, but the ENTIRE query is marked as a "maybe" search due to current search limitations.

Thus, the message is marked as a maybe match and passed back to the FTS core code. That code does not do any attachment decoding when doing a manual search (manual decoding would crush resource use if done in real time for all non-FTS-indexed searches), so it finds neither the body match nor the header match and returns no match.

AND searches are not affected because the FTS core code breaks up the queries before passing them to flatcurve. The body search is returned as a real match and the header search as a maybe match, and the FTS core can correctly do a manual query for the latter since it has access to header data.

The solution here is tricky and will take some thinking. Either we manually separate ALL OR searches internally within flatcurve, or we flag non-indexed header searches and do those queries separately from the rest of the search string.
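To make the failure mode concrete, here is a toy Python model of the behavior described above. It is not Dovecot or flatcurve code: the `Message` class and all function names are hypothetical, and it assumes the index holds body text, decoded attachment text, and header values, while the core fallback only sees the raw body and headers.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    attachment_text: str = ""  # decoded text, present only in the FTS index
    headers: dict = field(default_factory=dict)


def fts_index_matches(msg, term):
    # Assumption: the index covers body, decoded attachments and header values.
    haystacks = [msg.body, msg.attachment_text, *msg.headers.values()]
    return any(term in text for text in haystacks)


def core_fallback_matches(msg, term, header):
    # Fallback scan of the raw message: attachments are NOT decoded here.
    return term in msg.body or term in msg.headers.get(header, "")


def or_search_flagged_maybe(msg, term, header):
    # Models OR(BODY term, HEADER header term) when the whole result is "maybe":
    # the index only nominates candidates, and the fallback has the final say.
    if not fts_index_matches(msg, term):
        return False
    return core_fallback_matches(msg, term, header)


mail = Message(
    body="body",
    attachment_text="attachment",
    headers={"Reply-To": "user@example.com"},
)
print(or_search_flagged_maybe(mail, "body", "Reply-To"))        # True
print(or_search_flagged_maybe(mail, "attachment", "Reply-To"))  # False
```

The second search loses the message even though the index found the term, because the only code allowed to confirm the match cannot see the decoded attachment.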
Use an ODT file (instead of PDF) as this is natively handled with Dovecot without needing to install PDF helpers on the OS.
In some edge cases, a maybe query can potentially break search results because Dovecot's internal fallback querying does not have the necessary contextual information to properly query (i.e. if the original text was decoded from an attachment). Fixed by handling the one "maybe" query flatcurve does (queries on non-indexed headers) separately from the "main" full query that will return definite results. Fixes GitHub Issue #38
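The idea in this commit message can be sketched as merging two result sets. This is only an illustration in Python, not the actual flatcurve change: `merge_results` and the `verify` callback are hypothetical names, and UIDs are modeled as plain integers.

```python
def merge_results(definite_uids, maybe_uids, verify):
    """Combine a definite result set with a separately run "maybe" set.

    Definite matches are accepted as-is. Only "maybe" UIDs that are not
    already definite are handed to the fallback verification callback.
    """
    result = set(definite_uids)
    for uid in sorted(set(maybe_uids) - result):
        if verify(uid):
            result.add(uid)
    return sorted(result)


checked = []


def verify(uid):
    # Stand-in for the core fallback scan; records which UIDs it was asked about.
    checked.append(uid)
    return uid == 3


print(merge_results({1}, {1, 2, 3}, verify))  # [1, 3]
print(checked)                                # [2, 3]
```

UID 1 is never re-scanned, so a match that came from decoded attachment text cannot be discarded by a fallback that lacks that text.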
Fixed by MR #41
I have a mail with the word "body" in the body and the word "attachment" in an attached PDF (see attached mail.txt). I enabled fts_decode with decode2text.sh to index the PDF.
Searching for "body" and "attachment" returns the mail as expected:
A more complicated search also works as expected when searching for "body":
Because of the "Reply-To" header it's just a "maybe match", but it works.
But searching for "attachment" doesn't return the mail:
My understanding is that "maybe match" means that Dovecot searches the message again, but it doesn't find "attachment" because it scans the mail without decoding the PDF.
I checked with Squat and it returns the mail:
I would expect Flatcurve to return the mail as well. I think the problem is that Flatcurve returns either "definite matches" or "maybe matches", but in this case it should probably return both, the BODY match as a "definite match" and the HEADER match as a "maybe match". Dovecot then wouldn't need to scan the mail again, because it knows it has a "definite match" for the mail. This would lead to more consistent (and probably faster) results.