Inconsistent results when searching in attachment #38

Closed
edieterich opened this issue Nov 24, 2022 · 4 comments

Comments

@edieterich
Contributor

I have a mail with the word "body" in the body and the word "attachment" in an attached PDF (see attached mail.txt). I enabled fts_decode with decode2text.sh to index the PDF.

Searching for "body" and "attachment" returns the mail as expected:

sudo doveadm -D search -u ewald BODY body MAILBOX INBOX
... fts-flatcurve(INBOX): Query (body:body* OR body:bodi*) matches=1 uids=8
d1c791351b174863987e00006a82f8f2 8
sudo doveadm -D search -u ewald BODY attachment MAILBOX INBOX
... fts-flatcurve(INBOX): Query (body:attachment* OR body:attach*) matches=1 uids=8
d1c791351b174863987e00006a82f8f2 8

A more complicated search also works as expected when searching for "body":

sudo doveadm -D search -u ewald \( BODY body OR HEADER Reply-To body \) MAILBOX INBOX
... fts-flatcurve(INBOX): Query (body:body* OR allhdrs:body* OR body:bodi*) maybe_matches=1 uids=8
d1c791351b174863987e00006a82f8f2 8

Because of the "Reply-To" header it's just a "maybe match", but it works.

But searching for "attachment" doesn't return the mail:

sudo doveadm -D search -u ewald \( BODY attachment OR HEADER Reply-To attachment \) MAILBOX INBOX | wc -l
0
... fts-flatcurve(INBOX): Query (body:attachment* OR allhdrs:attachment* OR body:attach*) maybe_matches=1 uids=8

My understanding is that "maybe match" means that Dovecot searches the message again, but it doesn't find "attachment" because it scans the mail without decoding the PDF.

I checked with Squat and it returns the mail:

sudo doveadm search -u ewald \( BODY attachment OR HEADER Reply-To attachment \) MAILBOX INBOX
d1c791351b174863987e00006a82f8f2 8

I would expect Flatcurve to return the mail as well. I think the problem is that Flatcurve returns either "definite matches" or "maybe matches", but in this case it should probably return both, the BODY match as a "definite match" and the HEADER match as a "maybe match". Dovecot then wouldn't need to scan the mail again, because it knows it has a "definite match" for the mail. This would lead to more consistent (and probably faster) results.

@slusarz
Owner

slusarz commented Dec 9, 2022

I need to find some time to create test scripts for this, so I haven't yet been able to triage it.

I will note that if this statement is true: 'it doesn't find "attachment" because it scans the mail without decoding the PDF.' -- this would be a bug in Dovecot core, not flatcurve. Dovecot core code (i.e. core maybe matching) should see the exact same text from a decoded attachment that flatcurve would, so if that's not working correctly it will need to be fixed there.

edieterich added a commit to edieterich/dovecot-fts-flatcurve that referenced this issue Jan 3, 2023
@edieterich
Contributor Author

I hope this is helpful: I created a test in my fork: edieterich@ae73576

All the FTS plugins that come with Dovecot 2.3 pass the test, Flatcurve fails. Here's the test run: https://github.com/edieterich/dovecot-fts-flatcurve/actions/runs/3836744310

2023-01-04T09:39:35.9613970Z Testing GitHub Issue #38 using Solr
2023-01-04T09:39:38.0055653Z 1 test groups: 0 failed, 0 skipped due to missing capabilities
2023-01-04T09:39:38.0056124Z base protocol: 0/5 individual commands failed
2023-01-04T09:39:38.0056465Z extensions: 0/0 individual commands failed
2023-01-04T09:39:38.0071773Z
2023-01-04T09:39:38.0072108Z Testing GitHub Issue #38 using Lucene
2023-01-04T09:39:39.5179508Z 1 test groups: 0 failed, 0 skipped due to missing capabilities
2023-01-04T09:39:39.5180465Z base protocol: 0/5 individual commands failed
2023-01-04T09:39:39.5181480Z extensions: 0/0 individual commands failed
2023-01-04T09:39:39.5188413Z
2023-01-04T09:39:39.5192999Z Testing GitHub Issue #38 using Squat
2023-01-04T09:39:41.0278305Z 1 test groups: 0 failed, 0 skipped due to missing capabilities
2023-01-04T09:39:41.0280452Z base protocol: 0/5 individual commands failed
2023-01-04T09:39:41.0281061Z extensions: 0/0 individual commands failed
2023-01-04T09:39:41.0281386Z
2023-01-04T09:39:41.0281625Z Testing GitHub Issue #38 using Flatcurve
2023-01-04T09:39:42.5530073Z *** Test issue-38 command 4/5 (line 9)
2023-01-04T09:39:42.5530968Z  - failed: Missing 1 untagged replies (1 mismatches)
2023-01-04T09:39:42.5531629Z  - first unexpanded: search 1
2023-01-04T09:39:42.5532214Z  - first expanded: search 1
2023-01-04T09:39:42.5532787Z  - best match: SEARCH
2023-01-04T09:39:42.5533603Z  - Command: search or body attachment header reply-to attachment
2023-01-04T09:39:42.5533993Z
2023-01-04T09:39:42.5543942Z 1 test groups: 1 failed, 0 skipped due to missing capabilities
2023-01-04T09:39:42.5544380Z base protocol: 1/5 individual commands failed
2023-01-04T09:39:42.5544727Z extensions: 0/0 individual commands failed
2023-01-04T09:39:42.5549769Z ERROR: Failed test (/dovecot/imaptest/issue-38/issue-38)!

I had to patch Squat to make it not run into a "NO [SERVERBUG]" failure.

There appears to be a minimum search term length in fts_lucene, so to get a match I'm searching for "bodybody" instead of "body" as in my original description.

slusarz added a commit that referenced this issue Jan 10, 2023
@slusarz
Owner

slusarz commented Jan 10, 2023

Thank you for your assistance in generating tests!

I've pushed a much smaller commit that isolates the issue. Currently, a single test fails in that branch:

ok search or body attachment header x-foo test2
* search 1

Here is what is needed to trigger the failing result:

  • It MUST be an OR search
  • The matching OR clause MUST be in decoded text (i.e. attachment data that is decoded via Tika or decode2text.sh)
  • The non-matching OR clause MUST be in a non-indexed header

The issue is that the flatcurve query correctly finds the term in the indexed attachment text, but it also does a header search on the non-indexed headers. This is the one search category that causes "maybe" matches, since flatcurve itself can't verify which header a term is located in if it is not one of the indexed headers (to, from, cc, etc.). Here, this part of the query will return no results ... but the ENTIRE query is marked as a "maybe" search due to current search limitations. Thus, the message is marked as a maybe match and passed back to FTS core code. That code does not do any attachment decoding when doing a manual search (decoding in real time for all non-FTS indexed searches would be crushing for resource use), so it finds neither the body term nor the header term and returns no match.
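
To make the failure mode easier to follow, here is a toy Python model of the behavior described above. It is only a sketch, not flatcurve or Dovecot code: the `Message` class and the functions `fts_lookup`, `core_recheck`, and `search` are hypothetical names invented for illustration, and substring checks stand in for real index and IMAP matching.

```python
from dataclasses import dataclass


@dataclass
class Message:
    uid: int
    headers: dict       # header name -> value, as stored in the mail
    raw_body: str       # body as stored (attachment still encoded)
    decoded_text: str   # text the indexer saw after decoding the attachment


def fts_lookup(msg, body_term, header_term):
    """Toy index lookup for 'BODY x OR HEADER <name> y'.

    Returns (matched, is_maybe). The index knows a term occurs in *some*
    header, but not in which one, so the header clause can only be a guess.
    """
    body_hit = body_term in msg.decoded_text
    hdr_hit = any(header_term in v for v in msg.headers.values())
    # Assumption modelled here: one non-indexed header clause marks the
    # ENTIRE OR query as "maybe", even when the body clause matched.
    return (body_hit or hdr_hit), True


def core_recheck(msg, body_term, header_name, header_term):
    """Toy fallback scan: reads the stored mail without decoding attachments."""
    body_hit = body_term in msg.raw_body
    hdr_hit = header_term in msg.headers.get(header_name, "")
    return body_hit or hdr_hit


def search(msg, body_term, header_name, header_term):
    matched, maybe = fts_lookup(msg, body_term, header_term)
    if not matched:
        return False
    # A "maybe" result is re-verified against the raw mail.
    return core_recheck(msg, body_term, header_name, header_term) if maybe else True


mail = Message(
    uid=8,
    headers={"Subject": "test mail"},
    raw_body="body\nJVBERi0xLjQK",        # encoded attachment data
    decoded_text="body attachment",
)
```

In this model, `search(mail, "body", "Reply-To", "body")` still succeeds because the word is visible in the raw body, while `search(mail, "attachment", "Reply-To", "attachment")` fails because the word exists only in the decoded attachment text.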

AND searches are not affected because the fts core code breaks up the queries before passing to flatcurve - thus the body is returned as a real match and the header search is returned as a maybe match, but the fts core will correctly do a manual query since it has access to header data.

Solution here is tricky and will take some thinking. Either we manually separate ALL OR searches internally within flatcurve, or we flag non-indexed header searches and do those queries separately from the rest of the search string.
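
As a rough illustration of the second option (running the non-indexed header part separately from the rest of the query), here is a minimal Python sketch. It is not the flatcurve implementation: `search_or_split` and the dictionary layout of `mailbox` are hypothetical, and substring checks stand in for real index queries.

```python
def search_or_split(mailbox, body_term, header_name, header_term):
    """Toy 'BODY x OR HEADER <name> y' search with the header part split out.

    mailbox maps uid -> {"decoded_text": str, "headers": {name: value}}.
    """
    definite, maybe = set(), set()
    for uid, msg in mailbox.items():
        if body_term in msg["decoded_text"]:
            # Answered from the index, which saw the decoded attachment text.
            definite.add(uid)
        elif any(header_term in v for v in msg["headers"].values()):
            # The index only knows the term is in *some* header.
            maybe.add(uid)

    # Only the "maybe" candidates need a re-check, and that re-check
    # needs header data only, never attachment decoding.
    verified = {
        uid for uid in maybe
        if header_term in mailbox[uid]["headers"].get(header_name, "")
    }
    return sorted(definite | verified)


mailbox = {
    8: {"decoded_text": "body attachment", "headers": {"Subject": "test"}},
    9: {"decoded_text": "hello", "headers": {"Reply-To": "attachment@example.com"}},
    10: {"decoded_text": "hello", "headers": {"Subject": "attachment inside"}},
}
```

Here `search_or_split(mailbox, "attachment", "Reply-To", "attachment")` keeps UID 8 as a definite match from the decoded text, confirms UID 9 through its Reply-To header, and drops UID 10, whose term sits in a different header.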

slusarz added a commit that referenced this issue Nov 22, 2023
slusarz added a commit that referenced this issue Nov 23, 2023
Use an ODT file (instead of PDF) as this is natively handled with
Dovecot without needing to install PDF helpers on the OS.
slusarz added a commit that referenced this issue Nov 23, 2023
In some edge cases, a maybe query can potentially break search results
because Dovecot's internal fallback querying does not have the necessary
contextual information to properly query (i.e. if the original text was
decoded from an attachment).

Fixed by handling the one "maybe" query flatcurve does (queries on
non-indexed headers) separately from the "main" full query that will
return definite results.

Fixes GitHub Issue #38
@slusarz
Owner

slusarz commented Nov 23, 2023

Fixed by MR #41
