Current levelDB panics on crash recovery #38

Open
aphyr opened this issue Aug 12, 2017 · 0 comments

aphyr commented Aug 12, 2017

If a Merkleeyes LevelDB file is truncated (e.g. due to power failure or backup-and-restore), Merkleeyes can panic on startup, throwing:

no existing db, creating new db
loading existing db
error reading MerkleEyesState
panic: EOF

goroutine 1 [running]:
panic(0x8fab00, 0xc420606670)
    /home/balloo/go/src/runtime/panic.go:500 +0x1a1
github.com/tendermint/merkleeyes/app.NewMerkleEyesApp(0x7ffe977e5d61, 0x6, 0x0, 0x2)
    /home/balloo/goApps/src/github.com/tendermint/merkleeyes/app/app.go:93 +0x815
github.com/tendermint/merkleeyes/cmd.StartServer(0xc941e0, 0xc420058dc0, 0x0, 0x4)
    /home/balloo/goApps/src/github.com/tendermint/merkleeyes/cmd/app.go:33 +0x49
github.com/tendermint/merkleeyes/vendor/github.com/spf13/cobra.(*Command).execute(0xc941e0, 0xc420058d40, 0x4, 0x4, 0xc941e0, 0xc420058d40)
    /home/balloo/goApps/src/github.com/tendermint/merkleeyes/vendor/github.com/spf13/cobra/command.go:660 +0x44c
github.com/tendermint/merkleeyes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc94840, 0xc42000c0b8, 0x0, 0xc4201adf18)
    /home/balloo/goApps/src/github.com/tendermint/merkleeyes/vendor/github.com/spf13/cobra/command.go:735 +0x367
github.com/tendermint/merkleeyes/vendor/github.com/spf13/cobra.(*Command).Execute(0xc94840, 0x0, 0x0)
    /home/balloo/goApps/src/github.com/tendermint/merkleeyes/vendor/github.com/spf13/cobra/command.go:694 +0x2b
github.com/tendermint/merkleeyes/cmd.Execute()
    /home/balloo/goApps/src/github.com/tendermint/merkleeyes/cmd/root.go:25 +0x31
main.main()
    /home/balloo/goApps/src/github.com/tendermint/merkleeyes/cmd/merkleeyes/main.go:8 +0x14
loading existing db
error reading MerkleEyesState
panic: EOF

In effect, a power failure (or similar event) that hits more than 1/3 of a cluster's nodes has a chance to render the cluster unusable, at least until you can fix the LevelDB recovery code or restore from non-corrupt backups.

I can't see tmlibs, so I'm not exactly sure how this dependency works, but from our conversation in channel, I think Merkleeyes uses goleveldb for its on-disk storage. There are a few other reports of issues like this: Prometheus hit crash-recovery problems in Fall 2016, Syncthing hit panics around the same time, and there's also a report of disk imaging resulting in "corrupted or incomplete meta file" errors in Spring 2017. The goleveldb maintainer suggests in those threads that syndtr/goleveldb@1996ac2 and syndtr/goleveldb@69e19a4 may help, so it might be worth upgrading, or cherry-picking those commits into the goleveldb version Merkleeyes uses.

I also suggest developing a test suite to verify specifically whether Merkleeyes recovers correctly from arbitrary truncations of its various LevelDB files.
