Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

imap segfault with fts flatcurve #54

Closed
kadlec-g opened this issue Feb 19, 2024 · 14 comments · Fixed by #58
Closed

imap segfault with fts flatcurve #54

kadlec-g opened this issue Feb 19, 2024 · 14 comments · Fixed by #58

Comments

@kadlec-g
Copy link

We started to use fts flatcurve on Debian bookworm with the following
components:

  • Dovecot 1:2.3.19.1+dfsg1-2.1 debian packages
  • libxapian 1.4.22-1 debian package
  • dovecot-fts-flatcurve recent from git repo, manually compiled

Everything seems to work fine, except imap crashes from time to time
with different users:

Feb 16 21:52:06 mail2 kernel: [459605.167041] imap[55238]: segfault at 8 ip 0000620dfcc4aee9 sp 0000713494feb4e0 error 4 in libxapian.so.30.12.3[620dfcc14000+173000]

Enabling coredumps the backtrace shows:

#0 0x00006ffff4529ee9 in Xapian::Database::get_document(unsigned int) const ()
from /lib/x86_64-linux-gnu/libxapian.so.30
#1 0x00006ffff6e67cb0 in fts_flatcurve_xapian_uid_exists_db (uid=uid@entry=8040,
backend=0x1a4f0a594a70) at fts-backend-flatcurve-xapian.cpp:916
#2 0x00006ffff6e6bb34 in fts_flatcurve_xapian_write_db_by_uid (uid=8040,
backend=0x1a4f0a594a70) at fts-backend-flatcurve-xapian.cpp:935
#3 fts_flatcurve_xapian_expunge (backend=0x1a4f0a594a70, uid=8040)
at fts-backend-flatcurve-xapian.cpp:1179
#4 0x00006ffff75050ba in mailbox_sync_notify ()
from /usr/lib/dovecot/libdovecot-storage.so.0
#5 0x00006ffff75292e1 in sdbox_sync_finish ()

It happens consistently at IMAP EXPUNGE, in all cases. The mailboxes are on NFS but a single machine accesses
them. Any hint is really appreciated.

@kadlec-g
Copy link
Author

In order to make sure it's not related to locking over NFS, the mount option local_lock=all was enabled and in mail_location we set VOLATILEDIR=/dev/shm/dovecot/%u. Still, the segfaults happen.

@kadlec-g
Copy link
Author

Enabling debuggin revealed more informations:

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; index.6340); DatabaseNotFoundError: Couldn't detect type of database
Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Opened DB (RO) messages=4 version=1 shards=1
Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Fatal: master: service(imap): child 42566 killed with signal 11

We have got the following settings to exclude mailboxes:

fts_autoindex_exclude = Trash
fts_autoindex_exclude2 = Spam

[According to the documentation, mailbox name or special use flag can be used.]
Checking the database of the Spam mailbox does not show an error:

doveadm fts-flatcurve check -u XXXX mailbox Spam
Spam guid=784d14359d22265d94c0000048cecde4 errors=0 shards=1

@edieterich
Copy link
Contributor

I think I have the same problem. Can you check if you have empty Flatcurve DB directories for the affected users? And if you do, does the crash go away if you delete them?

The empty DB directories appear on NFS if Flatcurve optimizes the index. There's is code in Flatcurve that deletes empty DB directories, but that fails on NFS because .nfs files are kept open in the DB directory.

The attached patch (flatcurve_fix_empty_db_dir.patch) fixes the problem for me.

I added a second patch (flatcurve_work_around_unlink_directory_bug.patch), because this problem goes unnoticed when Flatcurves tries to delete empty DB directories. This is actually a bug in Dovecot core, I guess, because the used unlink_directory() function doesn't report an error if the final rmdir fails.

@kadlec-g
Copy link
Author

No, the directories are not empty. As I'm unfamiliar with xapian, I'm just guessing the meaning of the log lines:

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; index.6340); DatabaseNotFoundError: Couldn't detect type of database

As far as I see there was an dbox/mailboxes/Spam/dbox-Mails/fts-flatcurve/index.6340/ directory under the user's home and either it was empty or it was broken.

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; current.1708356974118889); DatabaseNotFoundError: Couldn't detect type of database

I missed to include this line: is it a test to create a new database? But the line says RO, so it should not create new database. The mailbox is explicitly excluded from indexing, so it should be left out. What's " current.1708356974118889" then?

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Opened DB (RO) messages=4 version=1 shards=1

According to the line, the db was successfully openen, it contains 4 messages, it's version number is 1 and contains one shard.

Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Fatal: master: service(imap): child 42566 killed with signal 11

And then comes the segmentation fault. Looking into the directory in question, there's an "index.3514" directory there
with modification date earlier:

drwx------ 2 XXXX XXXX 4096 Feb 20 06:27 index.3514

The directory looks OK (except it should not exist at all):

-rw------- 1 XXXX XXXX 0 Feb 20 06:27 flintlock
-rw------- 1 XXXX XXXX 93 Feb 20 06:27 iamglass
-rw------- 1 XXXX XXXX 32768 Feb 20 06:27 postlist.glass
-rw------- 1 XXXX XXXX 8192 Feb 20 06:27 termlist.glass

At 6:25 we run a cron job which calls "doveadm fts optimize -A".
That command messes up the excluded mailboxes?

@kadlec-g
Copy link
Author

I think I have the same problem. Can you check if you have empty Flatcurve DB directories for the affected users? And if you do, does the crash go away if you delete them?

The empty DB directories appear on NFS if Flatcurve optimizes the index. There's is code in Flatcurve that deletes empty DB directories, but that fails on NFS because .nfs files are kept open in the DB directory.

The .nfs files are created by the NFS client when while a process keeps open a file, another one removes it. So while the file is kept open, the .nfs file is there and cannot be removed. It migth be some syncronization, timing issue between two processes.

@slusarz
Copy link
Owner

slusarz commented Feb 22, 2024

Need to work on this more ... but I'm fairly certain that @edieterich first patch is the correct fix here.

The code that is crashing is iterating through the internal DB hash table - which would potentially include stale DB entries containing Xapian objects that point to data directories that no longer exist. These DB entries should be removed from the table when they are determined to be invalid, which is missing in the current code; only the physical directories are being removed, not the internal cache state.

The log entries indicate flatcurve is otherwise working as expected. "Couldn't detect type of database" is the error Xapian would throw if the shard directory was deleted. After an optimize, you should end up with a single, newly created shard, and the successful load of this shard doesn't have a debug entry. It thus looks like flatcurve is correctly detecting the missing directories, it's just not deleting them from the internal table.

Need to verify with a unit test before committing the fix. Pretty sure this can only happen with long running mail sessions that had previously accessed the mailbox - this is essentially a caching issue.

@edieterich : any chance you can submit your first patch as an official merge request?

@edieterich
Copy link
Contributor

My pull request: #55

@kadlec-g
Copy link
Author

kadlec-g commented Feb 22, 2024 via email

@slusarz
Copy link
Owner

slusarz commented Feb 24, 2024

"I attached the patches from the fork dovecot-fts-flatcurve-nfs." - no patches showed up here. (I don't think replying to GitHub via email does allows attachments to be converted to MRs...)

slusarz added a commit that referenced this issue Feb 27, 2024
@slusarz slusarz linked a pull request Feb 27, 2024 that will close this issue
@slusarz
Copy link
Owner

slusarz commented Feb 27, 2024

Thanks to @kadlec-g, I added a patch that deals with the volatiledir situation. (VOLATILEDIR used to be handled natively by locking code, but the dev team made changes when flatcurve was merged to core team that switched to a different locking method.)

Probably need to document recommendation of VOLATILEDIR for NFS installations.

Also, still TBD what needs to be done regarding extra files in the directory. I don't like the solution in #56 as it is too specific to a certain kind of file (".nfs") - this needs to be more generic imho. @kadlec-g has provided a different solution, but I still need to review further and think about it.

slusarz added a commit that referenced this issue Feb 27, 2024
@slusarz
Copy link
Owner

slusarz commented Feb 27, 2024

These commits should fix the various issues being reported (none that I could reproduce locally, so relying on external feedback):

  • Optimization: if directory is deleted, remove from internal DB cache
  • Honor VOLATILEDIR setting
  • Ignore unknown files when deleting in data directory

If these issues don't fix things, please reopen this ticket.

@kadlec-g
Copy link
Author

kadlec-g commented Jun 23, 2024 via email

@slusarz
Copy link
Owner

slusarz commented Jun 24, 2024

I'm not seeing what this patch does.

backend->volatile_dir is guaranteed to be an empty string at this point - if a previous mailbox is opened, volatile_dir is truncated to a zero-length string by fts_backend_flatcurve_close_mailbox(). (This is the same mechanism that clears the mailbox name, for example, so if this truncation code isn't being run you would not be able to open the mailbox at all.)

What exactly are you seeing without this change?

@kadlec-g
Copy link
Author

kadlec-g commented Jun 25, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants