Streams replica can fail to recover and fail to accept consumers in case of an abrupt (`kill -9`-like) node process termination on Windows #12054

sysupbda · 2024-08-16T16:36:26Z

sysupbda
Aug 16, 2024

Describe the bug

On a single-node RabbitMQ v3.13.6 on Windows x64 (OTP 26.2.5.2), with a stream that is being published to and consumed from, recovery from a crash fairly often leaves the stream unavailable.

1- A node with just 50-100 clients, publishing only a couple of messages of a couple of kB each to exchanges and streams per hour, can try to allocate 450 MB and fail (leaving an erl crash dump file) even though 2 GB is available. I assume that no contiguous 450 MB block is available at that point. I have not nailed down what makes a very lightly used RabbitMQ node with few clients allocate that much RAM.
(I don't know how to investigate or reproduce this one, but I am coachable.)

2- Once RabbitMQ crashes because it could not allocate memory (because of (1)), or because it was killed at a particular moment, it is sometimes missing an index file for a segment file that still exists. These leftover segments do not cause issues, but they become unmanaged resources and are never removed. Put simply, 00000000000000725366.segment is left there forever.

3- The directory under %APPDATA%\db\streams...\ also has an index file with zero bytes. This causes an infinite loop within the stream coordinator managing replicas (despite it being a single-node cluster), which keeps the stream unavailable instead of recovering by skipping that ghost index file.

4- The missing_file warning from the stream coordinator is warranted, although it might be best to say explicitly that the stream is not available. The coordinator also logs the warning repeatedly with no retry delay, which can cause a self-inflicted denial of service or force poor logging trade-offs. I am not sure whether it is possible, but it would be better to warn periodically instead of continuously.

...
2024-08-15 20:20:59.254000+08:00 [warning] <0.3811.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}
2024-08-15 20:20:59.258000+08:00 [warning] <0.3812.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}
2024-08-15 20:20:59.262000+08:00 [warning] <0.3813.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}
...

I created my own version of osiris_log.erl where I added the stacktrace to the file_size() call that fails. I hope it is helpful. This is how I know it is accessing the empty index file and failing on the non-existent corresponding stream file instead of recovering.
See attachment:

log_with_modified_osiris_log_to_add_stack_trace_to_file_size_that_tries_to_open_stream_file.log

5- Sometimes, when RabbitMQ is violently shut down with taskkill /f /im erl.exe and with erlang:halt(2). from a connected erl -node ..., it does not try to recover from the ghost index file and simply rejects client consumption from the stream (see crash_instead_of_consume.txt)

crash_instead_of_consume.txt
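
The on-disk states described in items 2 and 3 can be checked for with a small script. What follows is a minimal sketch in Python, not part of RabbitMQ or osiris. It assumes that a stream directory holds `<offset>.segment` files paired with `<offset>.index` files, as the file name 00000000000000725366.segment above suggests; `scan_stream_dir` and the keys of its report are hypothetical names chosen for illustration.

```python
from pathlib import Path


def scan_stream_dir(stream_dir):
    """Report suspicious files in one stream directory.

    Assumption (not confirmed against the osiris source): every
    <offset>.segment file is paired with an <offset>.index file.
    """
    root = Path(stream_dir)
    segments = {p.stem for p in root.glob("*.segment")}
    indexes = {p.stem: p for p in root.glob("*.index")}
    index_names = set(indexes)

    return {
        # Segment files that have no index file (item 2).
        "orphaned_segments": sorted(segments - index_names),
        # Index files that are zero bytes long (item 3).
        "empty_indexes": sorted(
            name for name, path in indexes.items() if path.stat().st_size == 0
        ),
        # Index files whose segment file is missing.
        "indexes_without_segment": sorted(index_names - segments),
    }
```

The script only reads file names and sizes, so it should be safe to run against a copy of the data directory. It does not decide what to do with the files it finds.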

Reproduction steps

1- Set up Windows Server 2022, Windows 10 (version 10.0.19045, build 19045), or an AWS instance from the Windows_Server-2022-English-Full-Base-2024.07.10 image (Microsoft Windows 2022 Datacenter edition; t2.large, 8 GB RAM, 2 vCPUs, HVM, 64-bit, ENA enabled, EBS root device, 40 GB gp2)
2- Install https://github.com/erlang/otp/releases/download/OTP-26.2.5.2/otp_win64_26.2.5.2.exe and https://github.com/rabbitmq/rabbitmq-server/releases/download/v3.13.6/rabbitmq-server-3.13.6.exe
3- Enable streams and friends with rabbitmq-plugins.bat enable rabbitmq_stream_management
4- Create a stream that will have frequent eval_retention calls. I attached a config, definitions.json, that can be imported with rabbitmqctl import_definitions definitions.json and sets up administrator/administrator
5- Publish and crash at the same time:
   - Publish data to the stream that will result in retention evaluations (I am attaching this silly example to help publish and consume: break_rabbit.zip)
   - Run taskkill /f /im erl.exe while publishing
   - Run erlang:halt(2). while connected to the cluster of one node
6- Repeat step 5 (publish and crash) until you hit some of the described failure cases

Expected behavior

Automated recovery from any failure scenario for streams.

Data in the mnesia and db subdirectories should be fully managed by RabbitMQ and no orphaned files should be left around, even when there was a failure case.

Logs need to remain relevant, even in failure cases.
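
One way to keep logs relevant while a retry loop is failing is to back off between attempts and to emit a repeated warning at most once per interval. The following is a minimal sketch of that idea in Python. It does not describe how rabbit_stream_coordinator works; `ThrottledWarning`, `backoff_delays`, and all default values are hypothetical.

```python
import time


class ThrottledWarning:
    """Emit a repeated warning at most once per `interval` seconds."""

    def __init__(self, interval=30.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_emit = None
        self.suppressed = 0

    def warn(self, message, emit=print):
        """Emit `message` if the interval has elapsed; return True if emitted."""
        now = self.clock()
        if self.last_emit is None or now - self.last_emit >= self.interval:
            suffix = ""
            if self.suppressed:
                suffix = f" ({self.suppressed} similar suppressed)"
            emit(message + suffix)
            self.last_emit = now
            self.suppressed = 0
            return True
        # Inside the quiet period: count the warning instead of logging it.
        self.suppressed += 1
        return False


def backoff_delays(base=0.5, factor=2.0, cap=60.0):
    """Yield retry delays that grow exponentially up to `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)
```

With a loop like the one in the log excerpt above, which retries every few milliseconds, a throttle of this kind would reduce the output to one line per interval while still reporting how many attempts failed in between.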

Additional context

I tried to nail down the root cause, and tried to make very reliable reproduction steps, but sadly I run into new issues every time I try to replicate. So I presented what I have observed, and the attempts to reproduce. Which of these issues you will see depends on luck apparently, but I think all these issues are unacceptable.

While I do regularly use Elixir, I have limited experience with Erlang. So I can be coached to get additional information or modify code if I can get some support in doing so.

I have been using RabbitMQ for 10 years but am new to streams, and was more on the developer side than managing my own instances as an administrator. But I am a fast learner. Feel free to point me to documentation and ask me to digest it.

If I can contribute in some way as a junior contributor to RabbitMQ, I am happy to support the team in the way that supports the team the best for these issues.

Answered by kjnilsson

Aug 21, 2024

@sysupbda I have made some changes that should at least handle the two cases you reported. Please can you test this?

#12073

michaelklishin · 2024-08-16T16:53:43Z

michaelklishin
Aug 16, 2024
Maintainer

Thank you for providing the details.

We will never promise "recovery from any failure scenario" because there are scenarios where there is no path to recovery, or rather, in order to recover you must throw away some data.

Modern quorum queues already recover from many similar scenarios (in at least some cases, by deleting corrupt segment files 🤷), so streams can be more defensive.

michaelklishin · 2024-08-16T16:54:30Z

michaelklishin
Aug 16, 2024
Maintainer

@sysupbda can you share the relevant directories from the node that runs into this "error logging loop"? Since you have a way to reproduce with test data.

sysupbda · 2024-08-16T17:10:41Z

sysupbda
Aug 16, 2024
Author

Thank you @michaelklishin for your prompt answer.

I have no node where I can reproduce all issues. But I can share a node where I have the empty index and a missing segment issue (stream unavailable with infinite retry loop and log spamming), and the node where I get a consumption error but no other information.

Are these the directories that are most relevant? The %APPDATA%\RabbitMQ directory?

RabbitMQ_stream unavailable with infinite retry loop empty index and missing segment.zip

sysupbda · 2024-08-16T17:11:03Z

sysupbda
Aug 16, 2024
Author

Here is the other one:

RabbitMQ_with_segment_file_laying_around_and_rejecting_consumption_from_the_stream.zip

kjnilsson · 2024-08-19T13:03:39Z

kjnilsson
Aug 19, 2024
Maintainer

@sysupbda thanks for providing the data directories. I have confirmed there are a couple of issues in the code that I am working to fix.

However, my main concern is the behaviour of Windows in this case. From your description it appears that a simple process crash/forced exit is enough to lose page cache data (or file cache, cache manager cache, whatever it is called on Windows).

We tested similar cases heavily on Linux and at no point did we lose page cache data from a process crash (as that isn't how the page cache works). To lose page cache data we had to force terminate the machine or VM.

I just can't believe that is how Windows works, so there must be something else going on in this case, but we test very little on Windows and don't have as much familiarity with it. If that is how Windows is designed to work w.r.t. unflushed disk data, then I don't think Windows should be used for production systems that use RabbitMQ streams. At least not without further OS-specific changes.

sysupbda · 2024-08-19T15:03:22Z

sysupbda
Aug 19, 2024
Author

Hi Karl! Thank you for your fast and detailed reply! I understand and agree with your concerns.

I had in mind to support the project with two fixes that I believe would take us a very long way:

1- A stage before streams are loaded where the environment is "recovered". If we did this, we could eventually share recovery best practices between quorum queues and streams. I would categorize the cases into three groups:

A) Can self-heal. I think nearly all the issues I listed can be solved by this. Essentially, we start by validating our storage, and if recovery is possible without any intervention, we recover.
=> I think we already have a process to copy from other replicas. Do we maybe just need to add a check for 0-byte index files that exist without a segment file?

B) Can be healed, but needs the owner of the system to make some decisions. One such decision might be to abandon a subset of data that was not recoverable.
=> I don't think that any of the issues I mentioned require this. Maybe it could be done at a later stage?

2- A process that runs as a service and cleans up orphaned, useless segment files that have no corresponding index file. This can happen with latency, but probably should happen.

I had no time to dive into it, but I wanted to discuss it on the mailing list. I wasn't sure whether this issue or the mailing list was the best place. What do you think?

Is there any way I can support the process?

Thanks!
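
The second proposal above (a cleanup that is allowed to act with latency) could look roughly like the sketch below. This is an illustration in Python only, not a proposal for the osiris code. It assumes that a segment file is orphaned when no `.index` file with the same base name sits next to it, and it uses the file's modification time plus a grace period as the "latency". The function names and the one-hour default are hypothetical.

```python
import time
from pathlib import Path


def stale_orphaned_segments(stream_dir, grace_seconds=3600.0, now=None):
    """List .segment files without an .index sibling that have not been
    modified for at least `grace_seconds`."""
    now = time.time() if now is None else now
    stale = []
    for segment in sorted(Path(stream_dir).glob("*.segment")):
        if segment.with_suffix(".index").exists():
            continue  # paired with an index file, so still managed
        if now - segment.stat().st_mtime >= grace_seconds:
            stale.append(segment)
    return stale


def remove_stale_orphans(stream_dir, grace_seconds=3600.0, dry_run=True):
    """Delete stale orphaned segments; with dry_run=True only report them."""
    victims = stale_orphaned_segments(stream_dir, grace_seconds)
    if not dry_run:
        for segment in victims:
            segment.unlink()
    return [path.name for path in victims]
```

The dry-run default is deliberate: deleting files from a node's data directory is only safe if the node really no longer manages them, which this sketch cannot verify.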

1 reply

sysupbda Aug 23, 2024
Author

@kjnilsson, regarding your main concern about caching and lost page cache data: is this something we still want to investigate? Is that something I could contribute to?

kjnilsson · 2024-08-19T15:26:11Z

kjnilsson
Aug 19, 2024
Maintainer

> Is there any way I can support the process?

The most helpful thing now would be to share some insight into how Windows works w.r.t. the page cache. Are these services running in some kind of sandboxing/isolation mode that could cause a simple process crash to lose unsynced file caches?

1 reply

sysupbda Aug 19, 2024
Author

I see. I would have to learn about how Windows handles page caches myself too. This might take me some time. Can I come back to you on that?

The data corruption I experienced was reproduced in three different environments:
- My home PC: Windows 10 Pro, typical desktop setup
- An AWS Windows Server t2.large (HVM) on an EBS disk
- A Windows Server VM that my client, a bank, runs and to which I have only very limited access. It might be running VMware, but honestly I don't know, and they are not sharing much information.

As a side note:
I chose RabbitMQ because of its features. However, a big reason was that it advertises that it runs on Windows. The choice of systems that officially support Windows (and can be run on-premises) is thin. I thought Erlang had a special trick up its sleeve, but I guess Erlang is not magical. It might make sense to officially not support Windows until a battery of tests is created for it?

kjnilsson · 2024-08-21T07:52:01Z

kjnilsson
Aug 21, 2024
Maintainer

@sysupbda I have made some changes that should at least handle the two cases you reported. Please can you test this?

#12073

sysupbda · 2024-08-22T06:57:04Z

sysupbda
Aug 22, 2024
Author

Hi Karl,
Thank you for the prompt update! I am setting up a development environment on the AWS EC2 instance I used to reproduce the issue. I will let you know if I can still reproduce when the new version of osiris is used.

PS: I am going to naively do a git checkout osiris-1.8.3 on the rabbitmq-server repo. I hope that is what you are also expecting from me.

Edit: I see it is a branch, not a tag. Will try to run it that way. I only used installers until today.

19 replies

sysupbda Aug 24, 2024
Author

In this particular config, is the segment size not set to 4k and the stream to 40k? Did I misunderstand that part? Or is that not supposed to work?

michaelklishin Aug 24, 2024
Maintainer

It's not a configuration that makes much sense, even though it may be good for reproducing certain edge cases.

Streams are meant to be used for reasonably long storage of shared data sets that should be read repeatedly. A 40 kB limit is hardly very practical for a stream (or even a queue). RabbitMQ defaults assume multi-megabyte segment size.

Of course, if these were set specifically to reproduce what was reported in this discussion, then you can disregard my comment. Otherwise why not stick to stream defaults and a limit of some 128 MB, for example?
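
For illustration, the two sets of limits discussed here can be written as the optional queue arguments a client passes when declaring a stream. This is a minimal sketch in Python; `stream_arguments` is a hypothetical helper, and reading "128 MB" as 128,000,000 bytes is an assumption. The argument names `x-queue-type`, `x-max-length-bytes` and `x-stream-max-segment-size-bytes` are the ones RabbitMQ documents for streams.

```python
MB = 1_000_000  # assumption: decimal megabytes


def stream_arguments(max_length_bytes=128 * MB, segment_size_bytes=None):
    """Build optional queue arguments for declaring a stream."""
    if max_length_bytes <= 0:
        raise ValueError("max_length_bytes must be positive")
    args = {
        "x-queue-type": "stream",
        "x-max-length-bytes": max_length_bytes,
    }
    # Leave the segment size at the broker default unless asked otherwise.
    if segment_size_bytes is not None:
        args["x-stream-max-segment-size-bytes"] = segment_size_bytes
    return args
```

`stream_arguments()` gives the "stream defaults plus a 128 MB limit" variant, while `stream_arguments(40_000, 4_000)` approximates the small reproduction config, assuming "4k" and "40k" there mean 4,000 and 40,000 bytes. The resulting dictionary would be passed as the `arguments` of a queue declaration in an AMQP 0-9-1 client.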

sysupbda Aug 25, 2024
Author

Thank you @michaelklishin for highlighting the concern. I strongly welcome input, particularly from the team that has a lot more experience with streams than I have. I did indeed choose a very small segment and overall stream size, in the hope that it should work. My architecture involves a variety of teams, and needed a strong decoupling of each component. Conceptually, this particular stream represents a kind of config that may be updated by the customer and nearly immediately reflected in all the workers (consumers of the queue).

If using a stream this way makes sense, even if it is not a common use case for it, I would argue for keeping the stream small and allowing the consumers to start at "offset 1" at startup. I felt it would add quite a lot of complexity for no benefit to start storing the offset to read from on startup. Most consumers are stateless. I might have missed an alternative solution here?

Is my use case a poor fit for streams? If it is an acceptable fit, would using larger segment/stream sizes not result in a longer startup time for every consumer rereading from offset 1? I was also worried it would create a sudden spike in networking that could ripple into other side effects that are hard to investigate for colocated systems.
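
The startup-cost question can be made concrete with some rough arithmetic. The sketch below is a back-of-the-envelope estimate in Python under stated assumptions: every consumer re-reads the whole retained stream at startup, all consumers start at once, and they share one network link. The consumer count (100) and link speed (125 MB/s, roughly 1 Gbit/s) are assumed figures, not measurements, and `replay_estimate` is a hypothetical helper.

```python
def replay_estimate(stream_bytes, consumers, link_bytes_per_second):
    """Rough cost of `consumers` stateless consumers re-reading a whole stream."""
    total_bytes = stream_bytes * consumers
    return {
        "total_bytes": total_bytes,
        # Time to move everything over one shared link, ignoring protocol
        # overhead, disk speed and broker-side limits.
        "seconds_if_link_shared": total_bytes / link_bytes_per_second,
    }


small = replay_estimate(40_000, 100, 125_000_000)       # 40 kB stream
large = replay_estimate(128_000_000, 100, 125_000_000)  # 128 MB stream
```

Under these assumptions the 40 kB stream moves 4 MB in total (about 0.03 s), while a full 128 MB stream moves 12.8 GB (about 100 s), so the concern about a startup spike scales with how much data the stream actually retains.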

michaelklishin Aug 26, 2024
Maintainer

@sysupbda 3.13.7 has shipped https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.13.7. Thank you for reporting this recovery scenario and helping verify the fixes.

sysupbda Aug 26, 2024
Author

Thank you! I am extremely thankful!
