-
Describe the bug
On a single-node RabbitMQ on Windows x64 (OTP 26.2.5.2, RabbitMQ v3.13.6), with a stream that is being published to and consumed from, recovery from crashes fairly often results in an unavailable stream.
1- A node with just 50-100 clients, with just a couple of messages of a couple of KB published to exchanges and streams per hour, can try to allocate 450 MB and fail (with an erl crash dump file) despite 2 GB being available. I assume that there is no contiguous 450 MB available at that point. I have not nailed down what can cause a very lightly used RabbitMQ with few clients to allocate that much RAM.
2- Once RabbitMQ crashes because it could not allocate memory (because of (1)), or because it was killed at a particular moment, it is sometimes missing an index file for which it has a segment file. These leftover segments do not cause issues but become unmanaged resources and are never removed. In simple words, 00000000000000725366.segment is left there forever.
3- The directory under %APPDATA%\db\streams...\ also has an index file with zero bytes. This causes an infinite loop within the stream coordinator managing replicas (despite it being a single-node cluster), which keeps the stream unavailable instead of recovering by skipping that ghost index file.
4- The
I created my own version of osiris_log.erl where I added the stack trace to the file_size() call that fails. I hope it is helpful. This is how I know it is accessing the empty index file and failing on the non-existent corresponding stream file instead of recovering.
log_with_modified_osiris_log_to_add_stack_trace_to_file_size_that_tries_to_open_stream_file.log
5- Sometimes when RabbitMQ is violently shut down with
Reproduction steps
Expected behavior
Automated recovery from any failure scenario for streams. Data in the mnesia and db subdirectories should be fully managed by RabbitMQ and no orphaned files should be left around, even when there was a failure case. Logs need to remain relevant, even in failure cases.
Additional context
I tried to nail down the root cause and to produce reliable reproduction steps, but sadly I ran into new issues every time I tried to replicate. So I have presented what I observed and my attempts to reproduce. Which of these issues you will see apparently depends on luck, but I think all of them are unacceptable.
While I regularly use Elixir, I have limited experience with Erlang. So I can be coached to gather additional information or modify code if I can get some support in doing so. I have been using RabbitMQ for 10 years but am new to streams, and I have been more on the developer side than managing my own instances as an administrator. But I am a fast learner. Feel free to point me to documentation and ask me to digest it. If I can contribute as a junior contributor to RabbitMQ, I am happy to support the team in whatever way helps most with these issues.
Replies: 9 comments 21 replies
-
Thank you for providing the details. We will never promise "recovery from any failure scenario" because there are scenarios where there is no path to recovery, or rather, in order to recover you must throw away some data. Modern quorum queues already recover from many similar scenarios (in at least some cases, by deleting corrupt segment files 🤷), so streams can be more defensive.
-
@sysupbda can you share the relevant directories from the node that runs into this "error logging loop", since you have a way to reproduce with test data?
-
Thank you @michaelklishin for your prompt answer. I have no node where I can reproduce all issues. But I can share a node where I have the empty index and a missing segment issue (stream unavailable with infinite retry loop and log spamming), and the node where I get a consumption error but no other information. Are these the directories that are most relevant? The %APPDATA%\RabbitMQ directory?
RabbitMQ_stream unavailable with infinite retry loop empty index and missing segment.zip
-
Here is the other one: RabbitMQ_with_segment_file_laying_around_and_rejecting_consumption_from_the_stream.zip
-
@sysupbda thanks for providing the data directories. I have confirmed there are a couple of issues in the code that I am working to fix. However, my main concern is the behaviour of Windows in this case. From your description it appears that a simple process crash/forced exit is enough to lose page cache data (or file cache, cache manager cache, whatever it is called on Windows). We tested similar cases heavily on Linux and at no point did we lose page cache data from a process crash (as that isn't how the page cache works). To lose page cache data we had to force terminate the machine or VM. I just can't believe that is how Windows works, so there must be something else going on in this case, but we test very little on Windows and don't have as much familiarity with it. If that is how Windows is designed to work w.r.t. unflushed disk data, then I don't think Windows should be used for production systems that use RabbitMQ streams. At least not without further OS-specific changes.
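To make the distinction in this comment concrete, here is a minimal sketch (Python, with a hypothetical helper name `sizes_during_write`) of the three places written data can sit: a user-space buffer inside the process, the OS page cache, and stable storage. It only illustrates the general model described for Linux above. It is not how the Erlang VM or osiris writes files, and it makes no claim about what Windows does.

```python
import os
import tempfile


def sizes_during_write(payload: bytes):
    """Return the file size seen by the OS before flush, after flush, after fsync.

    Hypothetical illustration only; the payload is assumed to be smaller
    than Python's default write buffer.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:      # default, buffered file object
            f.write(payload)             # bytes sit in the process's own buffer
            before_flush = os.path.getsize(path)
            f.flush()                    # hand the bytes to the OS (page cache)
            after_flush = os.path.getsize(path)
            os.fsync(f.fileno())         # ask the OS to push them to the disk
            after_fsync = os.path.getsize(path)
        return before_flush, after_flush, after_fsync
    finally:
        os.remove(path)
```

In this model, only the bytes still held in the user-space buffer (before `flush()`) are lost when the process is killed. The `fsync` step matters when the whole machine or VM is terminated.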
-
Hi Karl !
Thank you for your fast and detailed reply!
I understand and agree with your concerns.
I had in mind to support the project with two fixes that I believe would
take us a very long way:
1- a stage before streams are loaded where the environment is "recovered".
If we did this, we could eventually maybe share recovery best practices
between quorums and streams.
I would categorize them into three groups:
A) can self-heal, and I think nearly all the issues I listed can be solved
by this. Essentially we start by validating our storage and if recovery is
possible without any intervention, we recover.
=> I think we already have a process to copy from other replicas. Do we
maybe just need to add the check whether any 0 byte index files exist
without segment file?
B) can be healed but needs the owner of the system to make some decisions.
One such decision might be to abandon a subset of data that was not
recoverable.
=> I don't think that any of the issues I mentioned require this. Maybe it
could be done at a later stage?
2- a process that runs as a service and cleans up orphaned, useless segment
files that have no corresponding index file. This can happen with latency,
but probably should happen.
I had no time to dive into it, but I wanted to discuss it on the mailing
list. I wasn't sure whether this issue or the mailing list was the best
place.
What do you think?
Is there any way I can support the process?
Thanks!
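A minimal sketch of the two checks proposed above (zero-byte index files without a segment file, and segment files without an index file), written in Python rather than Erlang purely for illustration. The function name `scan_stream_dir` and the assumption that each `NNN.index` pairs with `NNN.segment` in the same directory are hypothetical and not taken from the osiris code. The sketch only reports findings and deletes nothing.

```python
from pathlib import Path


def scan_stream_dir(stream_dir):
    """Report suspicious files in one stream directory.

    Returns (empty_indexes, orphaned_segments) as sorted lists of file names.
    Assumption (hypothetical): NNN.index and NNN.segment belong together.
    """
    root = Path(stream_dir)
    index_stems = {p.stem for p in root.glob("*.index")}
    segment_stems = {p.stem for p in root.glob("*.segment")}

    # Zero-byte index files that have no matching segment file
    empty_indexes = sorted(
        p.name
        for p in root.glob("*.index")
        if p.stat().st_size == 0 and p.stem not in segment_stems
    )
    # Segment files that have no matching index file
    orphaned_segments = sorted(
        p.name for p in root.glob("*.segment") if p.stem not in index_stems
    )
    return empty_indexes, orphaned_segments
```

A recovery stage could run such a scan before streams are loaded and decide per finding whether to self-heal or to ask the operator. A background cleanup could reuse the second list.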
-
The most helpful thing now would be to share some insight into how Windows works w.r.t. the page cache. Are these services running in some kind of sandboxing/isolation mode that could cause a simple process crash to lose unsynced file caches?
-
@sysupbda I have made some changes that should at least handle the two cases you reported. Please can you test this? |
-
Hi Karl,
PS: I am going to naively do a
Edit: I see it is a branch, not a tag. Will try to run it that way. I only used installers until today.
@sysupbda I have made some changes that should at least handle the two cases you reported. Please can you test this?
#12073