
Ingester behavior when disk is full #5589

Open
fulmicoton opened this issue Dec 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@fulmicoton
Contributor

fulmicoton commented Dec 17, 2024

Currently, ingesters may end up accepting persist requests when their disk is full.
If the OS buffer is not full, no error may be returned.

We need to poll-check disk usage and change Quickwit's behavior when it goes above
a threshold.

The behavior is yet to be decided. The closest analogue is probably decommissioning: close all shards and refuse the creation of new shards. In addition, it might not be possible to run indexing/merge pipelines, which could make the control plane's task genuinely hard.
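The poll-check described above could look something like the following sketch, assuming a `df`-based probe. `WAL_DIR` and `WAL_USAGE_THRESHOLD_PCT` are hypothetical names for illustration, not actual Quickwit configuration:

```shell
#!/bin/sh
# Sketch only: poll the WAL mount's usage and report when it crosses a
# threshold, at which point the ingester would close shards.
WAL_DIR="${WAL_DIR:-wal}"
WAL_USAGE_THRESHOLD_PCT="${WAL_USAGE_THRESHOLD_PCT:-90}"

check_wal_usage() {
  # df --output=pcent prints a header and e.g. " 42%"; keep only the digits.
  usage=$(df --output=pcent "$WAL_DIR" 2>/dev/null | tail -n 1 | tr -dc '0-9')
  if [ "${usage:-0}" -ge "$WAL_USAGE_THRESHOLD_PCT" ]; then
    echo "WAL usage ${usage}% >= ${WAL_USAGE_THRESHOLD_PCT}%: close shards, refuse new ones"
    return 1
  fi
  echo "WAL usage ${usage}%: OK"
}

check_wal_usage || true
```

In a real ingester this check would run on a timer inside the process rather than shelling out, but the decision logic (usage above threshold implies stop accepting new shards) would be the same.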

@fulmicoton fulmicoton added the bug Something isn't working label Dec 17, 2024
@rdettai
Contributor

rdettai commented Jan 6, 2025

@fulmicoton How did you identify that the main problem comes from records accumulating in the OS buffer? I thought the OS buffer would usually be quite small (a few MBs).

It seems to me that the problem might also come from the persist policy that is configured on mrecordlogs. A full disk is only detected after the persist delay (5s), and when that happens, the error is bubbled up and converted to a persist failure here. The problem is that when that happens, a transient error is returned to the user, but meanwhile the shard is closed, a new one is opened, and records are accepted again during the mrecordlog persist delay. I didn't manage to reproduce it yet, but does this seem like a plausible explanation to you?

EDIT: I tried to mimic the WAL disk being full using a small loop device mounted on wal/

# Create a 10 MiB backing file, format it, and loop-mount it as the WAL directory.
sudo dd if=/dev/zero of=virtual_disk.img bs=1M count=10
sudo mkfs.ext4 virtual_disk.img
mkdir wal
sudo mount -o loop virtual_disk.img wal/

The error I get (and I get it consistently) when the disk is full is:

{
  "message": "ingest service is unavailable (no shards available)"
}
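As an aside (not from the thread): a lighter-weight way to provoke the write-path error, without root or a loop device, is Linux's `/dev/full`, which fails every write with ENOSPC and can stand in for a full WAL device when probing error handling:

```shell
# /dev/full always returns ENOSPC on write, standing in for a full WAL device.
if echo "mrecord" > /dev/full 2>/dev/null; then
  echo "write unexpectedly succeeded"
else
  echo "write failed with ENOSPC, as on a full WAL"
fi
```

Unlike the loop device, this only exercises the immediate write error, not the "small writes land in the page cache, error surfaces later at fsync" scenario hypothesized above.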

@fulmicoton
Contributor Author

How did you identify that the main problem comes from records accumulating in the OS buffer? I thought the OS buffer would usually be quite small (a few MBs).

Just a hypothesis to explain how we could accept messages and eventually lose them.

The problem is that when that happens, a transient error is returned to the user, but meanwhile the shard is closed, a new one is opened, and records are accepted again during the mrecordlog persist delay. I didn't manage to reproduce it yet, but does this seem like a plausible explanation to you?

Plausible, yes, but we still need to know by which mechanism we sometimes end up accepting writes. Mezmo mentions they lost data.
