Mimir does not start because of WAL corruption #5567
Replies: 1 comment
-
This was the top hit on Google when I searched for "mimir" "corruption" "file not found". If you're in Kubernetes:
This let my ingester start again, but I don't know whether it causes any issues for the compactor or any other part of the cluster. My guess is that anything in that chunk will be removed, so keep that in mind. I'm on the helm chart for |
-
Version 2.7.1
Describe the bug
Mimir fails to start with this error:
ts=2023-04-12T05:42:15.506982069Z caller=mimir.go:804 level=error msg="module failed" module=ingester-service err="invalid service state: Failed, expected: Running, failure: opening existing TSDBs: unable to open TSDB for user anonymous: failed to compact TSDB: /data/tsdb/anonymous: WAL truncation in Compact: create checkpoint: read segments: corruption in segment /data/tsdb/anonymous/wal/00000011 at 3756: unexpected full record"
ts=2023-04-12T05:42:15.507465367Z caller=mimir.go:804 level=error msg="module failed" module=ruler err="context canceled"
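For anyone triaging the same failure, the useful parts of the first log line are the segment path and the byte offset. Here is a minimal sketch that pulls them out of a log line. It assumes the message keeps the exact wording shown above; `parse_wal_corruption` is a hypothetical helper name and not part of Mimir or Prometheus.

```python
import re

# Matches the corruption message exactly as worded in the log line above.
# Assumption: other Mimir/Prometheus versions may phrase this differently.
CORRUPTION_RE = re.compile(
    r"corruption in segment (?P<segment>\S+) at (?P<offset>\d+): (?P<reason>[^\"]+)"
)


def parse_wal_corruption(log_line):
    """Return (segment_path, offset, reason), or None if no corruption message is found."""
    match = CORRUPTION_RE.search(log_line)
    if match is None:
        return None
    return (
        match.group("segment"),
        int(match.group("offset")),
        match.group("reason").strip(),
    )


if __name__ == "__main__":
    line = (
        'err="read segments: corruption in segment '
        '/data/tsdb/anonymous/wal/00000011 at 3756: unexpected full record"'
    )
    print(parse_wal_corruption(line))
```

This only reports which WAL segment the ingester complained about; it does not repair or remove anything.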
I can't reproduce these errors reliably, but they appear suddenly when I reboot my virtual machines (sometimes with a hard shutdown for testing purposes).
I have set up Loki, Mimir, and Tempo in monolithic mode with local file storage.
Sometimes Loki or Mimir hangs with corrupted WAL entries. Did I miss some tweaks in the config?
I want to go to production with the LGTM stack, but I can't without understanding this issue. This is really not resilient.