Mimir does not start because of WAL corruption #5567

suikast42 · 2023-04-12T06:14:32Z

Version 2.7.1

Describe the bug

Mimir fails to start with this error:

ts=2023-04-12T05:42:15.506982069Z caller=mimir.go:804 level=error msg="module failed" module=ingester-service err="invalid service state: Failed, expected: Running, failure: opening existing TSDBs: unable to open TSDB for user anonymous: failed to compact TSDB: /data/tsdb/anonymous: WAL truncation in Compact: create checkpoint: read segments: corruption in segment /data/tsdb/anonymous/wal/00000011 at 3756: unexpected full record"
ts=2023-04-12T05:42:15.507465367Z caller=mimir.go:804 level=error msg="module failed" module=ruler err="context canceled"
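The path of the damaged segment is embedded in the first error line. As a minimal sketch, assuming the message keeps the "corruption in segment <path> at <offset>" wording shown above, a small helper can pull it out of a log stream. The function name `extract_corrupt_segment` is made up for illustration and is not part of Mimir.

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin and prints the path of every
# WAL segment that the log reports as corrupt.
# Assumes the "corruption in segment <path> at <offset>" wording from the
# error above; other Mimir versions may phrase it differently.
extract_corrupt_segment() {
  sed -n 's/.*corruption in segment \([^ ]*\) at [0-9][0-9]*.*/\1/p'
}
```

Piping the ingester log through it (for example `kubectl logs <pod> | extract_corrupt_segment`) should print one path per matching line.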

I can't reproduce errors like this on demand, but they appear suddenly when I reboot my virtual machines (sometimes with a hard shutdown for testing purposes).

I have set up Loki, Mimir, and Tempo in monolithic mode with local file storage.

Sometimes Loki or Mimir hangs on corrupted WAL entries. Did I miss some tweaks in the config?

I want to go to production with the LGTM stack, but I can't without understanding this issue. As it stands, this is really not resilient.

lookcrabs · 2024-03-19T22:20:45Z

This is the top hit on Google when I searched for "mimir" "corruption" "file not found".
I don't know if this helps, but this is what I did to fix my corrupt ingester.

If you're in Kubernetes, replace the existing pod with a debug pod:

kubectl -n mimir debug mimir-ingester-zone-a-4 -it --copy-to mimir-ingester-zone-a-4 --same-node --replace --image=ubuntu:latest --share-processes -- /bin/bash

Inside your debug container, find the PID of the Mimir process and cd into /proc/${MIMIR_PID}/root/data/ (for me this was /proc/7/root/data/).
Find the wdb or wal file for your corrupt chunk and rename it to ${CHUNK}.repair.
For example: mv /proc/7/root/data/tsdb/anonymous/wal/0000000011{,.repair} and/or mv /proc/7/root/data/tsdb/anonymous/wdb/0000000011{,.repair}

This let my ingester start again, but I don't know whether it causes any issues for the compactor or any other part of the cluster. My guess is that anything in that chunk will be lost, so keep that in mind.
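The rename step can be wrapped in a small function so that a mistyped segment name fails loudly instead of silently doing nothing. This is only a sketch of the workaround described above, not an official Mimir repair procedure: `quarantine_segment` is a hypothetical name, and the directory and segment name are whatever your own error message reports.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: quarantine_segment DIR SEGMENT
# Renames DIR/SEGMENT to DIR/SEGMENT.repair so the ingester no longer reads it.
# The file is kept rather than deleted, so the rename can be undone later.
quarantine_segment() {
  src="$1/$2"
  if [ ! -f "$src" ]; then
    echo "segment not found: $src" >&2
    return 1
  fi
  mv -- "$src" "$src.repair"
  echo "renamed $src -> $src.repair"
}
```

Inside the debug container this would be called with the directory under /proc/${MIMIR_PID}/root/ and the segment name exactly as it appears in the error, for example `quarantine_segment /proc/7/root/data/tsdb/anonymous/wal 00000011`. The same caveat applies: whatever was in that segment is most likely lost.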

I'm on the mimir-distributed Helm chart 5.2.1, which ships Mimir 2.10.
