Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

int/faq: add DKG failure docs #286

Merged
merged 5 commits into from
Nov 27, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
82 changes: 82 additions & 0 deletions docs/int/faq/dkg_failure.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
---
sidebar_position: 4
description: Handling DKG failure
---

# Handling DKG failure

While the DKG process has been tested and validated against many different configuration instances, it can still encounter issues which might result in failure.

Our DKG is designed in a way that doesn't allow for inconsistent results: either it finishes correctly for every peer, or it fails.

This is a **safety** feature: you don't want to deposit an Ethereum distributed validator that not every operator is able to participate to, or even reach threshold for.

The most common source of issues lies in the network stack: if any of the peer's Internet connection glitches substantially, the DKG will fail.

Charon's DKG doesn't allow peer reconnection once the process is started, but it does allow for re-connections before that.

When you see the following message:

```
14:08:34.505 INFO dkg Waiting to connect to all peers...
```

this means your Charon instance is waiting for all the other cluster peers to start their DKG process: at this stage, peers can disconnect and reconnect at will, the DKG process will still continue.

A log line will confirm the connection of a new peer:

```
14:08:34.523 INFO dkg Connected to peer 1 of 3 {"peer": "fantastic-adult"}
14:08:34.529 INFO dkg Connected to peer 2 of 3 {"peer": "crazy-bunch"}
14:08:34.673 INFO dkg Connected to peer 3 of 3 {"peer": "considerate-park"}
```

As soon as all the peers are connected, this message will be shown:

```
14:08:34.924 INFO dkg All peers connected, starting DKG ceremony
```

Past this stage **no disconnections are allowed**, and _all peers must leave their terminals open_ in order for the DKG process to complete: this is a synchronous phase, and every peer is required in order to reach completion.

If for some reason the DKG process fails, you would see error logs that resemble this:

```
14:28:46.691 ERRO cmd Fatal error: sync step: p2p connection failed, please retry DKG: context canceled
```

As the error message suggests, the DKG process needs to be retried.

## Cleaning up the `.charon` directory

One cannot simply retry the DKG process: Charon refuses to overwrite any runtime file in order to avoid inconsistencies and private key loss.

When attempting to re-run a DKG with an unclean data directory -- which is either `.charon` or what was specified with the `--data-dir` CLI parameter -- this is the error that will be shown:

```
14:44:13.448 ERRO cmd Fatal error: data directory not clean, cannot continue {"disallowed_entity": "cluster-lock.json", "data-dir": "/compose/node0"}
```

The `disallowed_entity` field lists all the files that Charon refuses to overwrite, while `data-dir` is the full path of the runtime directory the DKG process is using.

In order to retry the DKG process one must delete the following entities, if present:

- `validator_keys` directory
- `cluster-lock.json` file
- `deposit-data.json` file
:::caution
The `charon-enr-private-key` file **must be preserved**, failure to do so requires the DKG process to be restarted from the beginning by creating a new cluster definition.
:::
If you're doing a DKG with a custom cluster definition - for example, create with `charon create dkg` rather than the Obol Launchpad - you can re-use the same file.

Once this process has been completed, the cluster operators can retry a DKG.

## Further debugging

If for some reason the DKG process fails again, node operators are adviced to reach out to the Obol team by opening an [issue](https://github.com/ObolNetwork/charon/issues), detailing what troubleshooting steps were taken and providing **debug logs**.

To enable debug logs first clean up the Charon data directory as explained in [the previous paragraph](#cleaning-up-the-charon-directory), then run your DKG command by appending `--log-level=debug` at the end.

In order for the Obol team to debug your issue as quickly and precisely as possible please provide full logs in textual form, not through screenshots or display photos.

Providing complete logs is particularly important, since it allows the team to reconstruct precisely what happened.
Loading