-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
imap segfault with fts flatcurve #54
Comments
In order to make sure it's not related to locking over NFS, the mount option local_lock=all was enabled and in mail_location we set VOLATILEDIR=/dev/shm/dovecot/%u. Still, the segfaults happen. |
Enabling debuggin revealed more informations: Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; index.6340); DatabaseNotFoundError: Couldn't detect type of database We have got the following settings to exclude mailboxes:
[According to the documentation, mailbox name or special use flag can be used.] doveadm fts-flatcurve check -u XXXX mailbox Spam |
I think I have the same problem. Can you check if you have empty Flatcurve DB directories for the affected users? And if you do, does the crash go away if you delete them? The empty DB directories appear on NFS if Flatcurve optimizes the index. There's is code in Flatcurve that deletes empty DB directories, but that fails on NFS because .nfs files are kept open in the DB directory. The attached patch (flatcurve_fix_empty_db_dir.patch) fixes the problem for me. I added a second patch (flatcurve_work_around_unlink_directory_bug.patch), because this problem goes unnoticed when Flatcurves tries to delete empty DB directories. This is actually a bug in Dovecot core, I guess, because the used unlink_directory() function doesn't report an error if the final rmdir fails. |
No, the directories are not empty. As I'm unfamiliar with xapian, I'm just guessing the meaning of the log lines: Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; index.6340); DatabaseNotFoundError: Couldn't detect type of database As far as I see there was an dbox/mailboxes/Spam/dbox-Mails/fts-flatcurve/index.6340/ directory under the user's home and either it was empty or it was broken. Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Cannot open DB (RO; current.1708356974118889); DatabaseNotFoundError: Couldn't detect type of database I missed to include this line: is it a test to create a new database? But the line says RO, so it should not create new database. The mailbox is explicitly excluded from indexing, so it should be left out. What's " current.1708356974118889" then? Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Debug: fts-flatcurve(Spam): Opened DB (RO) messages=4 version=1 shards=1 According to the line, the db was successfully openen, it contains 4 messages, it's version number is 1 and contains one shard. Feb 20 08:44:44 mail2 dovecot: imap(XXXX)<42566><T+YPYssR2RhUAtZM>: Fatal: master: service(imap): child 42566 killed with signal 11 And then comes the segmentation fault. Looking into the directory in question, there's an "index.3514" directory there drwx------ 2 XXXX XXXX 4096 Feb 20 06:27 index.3514 The directory looks OK (except it should not exist at all): -rw------- 1 XXXX XXXX 0 Feb 20 06:27 flintlock At 6:25 we run a cron job which calls "doveadm fts optimize -A". |
The .nfs files are created by the NFS client when while a process keeps open a file, another one removes it. So while the file is kept open, the .nfs file is there and cannot be removed. It migth be some syncronization, timing issue between two processes. |
Need to work on this more ... but I'm fairly certain that @edieterich first patch is the correct fix here. The code that is crashing is iterating through the internal DB hash table - which would potentially include stale DB entries containing Xapian objects that point to data directories that no longer exist. These DB entries should be removed from the table when they are determined to be invalid, which is missing in the current code; only the physical directories are being removed, not the internal cache state. The log entries indicate flatcurve is otherwise working as expected. "Couldn't detect type of database" is the error Xapian would throw if the shard directory was deleted. After an optimize, you should end up with a single, newly created shard, and the successful load of this shard doesn't have a debug entry. It thus looks like flatcurve is correctly detecting the missing directories, it's just not deleting them from the internal table. Need to verify with a unit test before committing the fix. Pretty sure this can only happen with long running mail sessions that had previously accessed the mailbox - this is essentially a caching issue. @edieterich : any chance you can submit your first patch as an official merge request? |
My pull request: #55 |
Hello,
On Wed, 21 Feb 2024, Michael M Slusarz wrote:
Need to work on this more ... but I'm fairly certain that @edieterich
first patch is the correct fix here.
The code that is crashing is iterating through the internal DB hash
table - which would potentially include stale DB entries containing
Xapian objects that point to data directories that no longer exist.
These DB entries should be removed from the table when they are
determined to be invalid, which is missing in the current code; only the
physical directories are being removed, not the internal cache state.
The log entries indicate flatcurve is otherwise working as expected.
"Couldn't detect type of database" is the error Xapian would throw if
the shard directory was deleted. After an optimize, you should end up
with a single, newly created shard, and the successful load of this
shard doesn't have a debug entry. It thus looks like flatcurve is
correctly detecting the missing directories, it's just not deleting them
from the internal table.
Need to verify with a unit test before committing the fix. Pretty sure
this can only happen with long running mail sessions that had previously
accessed the mailbox - this is essentially a caching issue.
@edieterich : any chance you can submit your first patch as an official
merge request?
As I wrote, .nfs file cannot be deleted: a new one is created if you
delete one. So my approach was different: in
fts_flatcurve_xapian_db_iter_init() I ignored all files which are not
valid db types.
It did not help.
So I created a second patch on top of it: flatcurve uses the user
directory for lockfile creation. But we use VOLATILEDIR in order to avoid
locking over NFS. In the attached second patch I added supporting
VOLATILEDIR.
Since then (~12h) we had a single segfault instead of 2-3/h. Unfortunately
debugging was not enabled then so I'm waiting if any other segfault
happens again, with debugging enabled.
I attached the patches from the fork dovecot-fts-flatcurve-nfs.
Best regards,
Jozsef
--
E-mail : ***@***.***
PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
H-1525 Budapest 114, POB. 49, Hungary
|
"I attached the patches from the fork dovecot-fts-flatcurve-nfs." - no patches showed up here. (I don't think replying to GitHub via email does allows attachments to be converted to MRs...) |
Thanks to @kadlec-g, I added a patch that deals with the volatiledir situation. (VOLATILEDIR used to be handled natively by locking code, but the dev team made changes when flatcurve was merged to core team that switched to a different locking method.) Probably need to document recommendation of VOLATILEDIR for NFS installations. Also, still TBD what needs to be done regarding extra files in the directory. I don't like the solution in #56 as it is too specific to a certain kind of file (".nfs") - this needs to be more generic imho. @kadlec-g has provided a different solution, but I still need to review further and think about it. |
These commits should fix the various issues being reported (none that I could reproduce locally, so relying on external feedback):
If these issues don't fix things, please reopen this ticket. |
Hello,
On Thu, 22 Feb 2024, Kadlecsik József wrote:
As I wrote, .nfs file cannot be deleted: a new one is created if you
delete one. So my approach was different: in
fts_flatcurve_xapian_db_iter_init() I ignored all files which are not
valid db types.
It did not help.
So I created a second patch on top of it: flatcurve uses the user
directory for lockfile creation. But we use VOLATILEDIR in order to avoid
locking over NFS. In the attached second patch I added supporting
VOLATILEDIR.
The patch was not prepared to handle when multiple mail is handled with
the same process and fts_backend_flatcurve_set_mailbox() was called
multiple times. I applied the next patch on top of the most recent version
of your git tree:
diff --git a/src/fts-backend-flatcurve.c b/src/fts-backend-flatcurve.c
index e46b2b0..b9cf9fc 100644
--- a/src/fts-backend-flatcurve.c
+++ b/src/fts-backend-flatcurve.c
@@ -135,7 +135,7 @@ void fts_backend_flatcurve_set_mailbox(struct
flatcurve_fts_backend *backend,
user = mail_storage_get_user(storage);
volatile_dir = mail_user_get_volatile_dir(user);
- if (volatile_dir != NULL)
+ if (volatile_dir != NULL && str_len(backend->volatile_dir) == 0)
str_append(backend->volatile_dir, volatile_dir);
if (!backend->debug_init) {
Best regards,
Jozsef
--
E-mail : ***@***.***
PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
H-1525 Budapest 114, POB. 49, Hungary
|
I'm not seeing what this patch does. backend->volatile_dir is guaranteed to be an empty string at this point - if a previous mailbox is opened, volatile_dir is truncated to a zero-length string by fts_backend_flatcurve_close_mailbox(). (This is the same mechanism that clears the mailbox name, for example, so if this truncation code isn't being run you would not be able to open the mailbox at all.) What exactly are you seeing without this change? |
On Mon, 24 Jun 2024, Michael M Slusarz wrote:
I'm not seeing what this patch does.
backend->volatile_dir is guaranteed to be an empty string at this point - if a
previous mailbox is opened, volatile_dir is truncated to a zero-length string by
fts_backend_flatcurve_close_mailbox(). (This is the same mechanism that clears
the mailbox name, for example, so if this truncation code isn't being run you
would not be able to open the mailbox at all.)
Sorry, drop the patch: in my original one which was running here the
truncate line for volatile_dir was missing from
fts_backend_flatcurve_close_mailbox(). That missing line caused the issue.
Best regards,
Jozsef
--
E-mail : ***@***.***
PGP key: https://wigner.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
H-1525 Budapest 114, POB. 49, Hungary
|
We started to use fts flatcurve on Debian bookworm with the following
components:
Everything seems to work fine, except imap crashes from time to time
with different users:
Feb 16 21:52:06 mail2 kernel: [459605.167041] imap[55238]: segfault at 8 ip 0000620dfcc4aee9 sp 0000713494feb4e0 error 4 in libxapian.so.30.12.3[620dfcc14000+173000]
Enabling coredumps the backtrace shows:
#0 0x00006ffff4529ee9 in Xapian::Database::get_document(unsigned int) const ()
from /lib/x86_64-linux-gnu/libxapian.so.30
#1 0x00006ffff6e67cb0 in fts_flatcurve_xapian_uid_exists_db (uid=uid@entry=8040,
backend=0x1a4f0a594a70) at fts-backend-flatcurve-xapian.cpp:916
#2 0x00006ffff6e6bb34 in fts_flatcurve_xapian_write_db_by_uid (uid=8040,
backend=0x1a4f0a594a70) at fts-backend-flatcurve-xapian.cpp:935
#3 fts_flatcurve_xapian_expunge (backend=0x1a4f0a594a70, uid=8040)
at fts-backend-flatcurve-xapian.cpp:1179
#4 0x00006ffff75050ba in mailbox_sync_notify ()
from /usr/lib/dovecot/libdovecot-storage.so.0
#5 0x00006ffff75292e1 in sdbox_sync_finish ()
It happens consistently at IMAP EXPUNGE, in all cases. The mailboxes are on NFS but a single machine accesses
them. Any hint is really appreciated.
The text was updated successfully, but these errors were encountered: