int/faq: add DKG failure docs (#286)

* int/faq: add DKG failure docs Add a FAQ page that explains how the DKG can fail, how it looks like on peer's terminals, and what to do to retry. * add debug logs section * Update docs/int/faq/dkg_failure.md Co-authored-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com> * Update docs/int/faq/dkg_failure.md Co-authored-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com> * Update docs/int/faq/dkg_failure.md Co-authored-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com> --------- Co-authored-by: Oisín Kyne <4981644+OisinKyne@users.noreply.github.com>
ObolNetwork · Nov 27, 2023 · 85e9503 · 85e9503
1 parent 8cd1de0
commit 85e9503
Showing 1 changed file with 82 additions and 0 deletions.
diff --git a/docs/int/faq/dkg_failure.md b/docs/int/faq/dkg_failure.md
@@ -0,0 +1,82 @@
+---
+sidebar_position: 4
+description: Handling DKG failure
+---
+
+# Handling DKG failure
+
+While the DKG process has been tested and validated against many different configuration instances, it can still encounter issues which might result in failure.
+
+Our DKG is designed in a way that doesn't allow for inconsistent results: either it finishes correctly for every peer, or it fails.
+
+This is a **safety** feature: you don't want to deposit an Ethereum distributed validator that not every operator is able to participate to, or even reach threshold for.
+
+The most common source of issues lies in the network stack: if any of the peer's Internet connection glitches substantially, the DKG will fail.
+
+Charon's DKG doesn't allow peer reconnection once the process is started, but it does allow for re-connections before that.
+
+When you see the following message:
+
+```
+14:08:34.505 INFO dkg        Waiting to connect to all peers...
+```
+
+this means your Charon instance is waiting for all the other cluster peers to start their DKG process: at this stage, peers can disconnect and reconnect at will, the DKG process will still continue.
+
+A log line will confirm the connection of a new peer:
+
+```
+14:08:34.523 INFO dkg        Connected to peer 1 of 3                 {"peer": "fantastic-adult"}
+14:08:34.529 INFO dkg        Connected to peer 2 of 3                 {"peer": "crazy-bunch"}
+14:08:34.673 INFO dkg        Connected to peer 3 of 3                 {"peer": "considerate-park"}
+```
+
+As soon as all the peers are connected, this message will be shown:
+
+```
+14:08:34.924 INFO dkg        All peers connected, starting DKG ceremony
+```
+
+Past this stage **no disconnections are allowed**, and _all peers must leave their terminals open_ in order for the DKG process to complete: this is a synchronous phase, and every peer is required in order to reach completion.
+
+If for some reason the DKG process fails, you would see error logs that resemble this:
+
+```
+14:28:46.691 ERRO cmd        Fatal error: sync step: p2p connection failed, please retry DKG: context canceled
+```
+
+As the error message suggests, the DKG process needs to be retried.
+
+## Cleaning up the `.charon` directory
+
+One cannot simply retry the DKG process: Charon refuses to overwrite any runtime file in order to avoid inconsistencies and private key loss.
+
+When attempting to re-run a DKG with an unclean data directory -- which is either `.charon` or what was specified with the `--data-dir` CLI parameter -- this is the error that will be shown:
+
+```
+14:44:13.448 ERRO cmd        Fatal error: data directory not clean, cannot continue {"disallowed_entity": "cluster-lock.json", "data-dir": "/compose/node0"}
+```
+
+The `disallowed_entity` field lists all the files that Charon refuses to overwrite, while `data-dir` is the full path of the runtime directory the DKG process is using.
+
+In order to retry the DKG process one must delete the following entities, if present:
+
+ - `validator_keys` directory
+ - `cluster-lock.json` file
+ - `deposit-data.json` file
+:::caution
+The `charon-enr-private-key` file **must be preserved**, failure to do so requires the DKG process to be restarted from the beginning by creating a new cluster definition.
+:::
+If you're doing a DKG with a custom cluster definition - for example, create with `charon create dkg` rather than the Obol Launchpad - you can re-use the same file.
+
+Once this process has been completed, the cluster operators can retry a DKG.
+
+## Further debugging
+
+If for some reason the DKG process fails again, node operators are adviced to reach out to the Obol team by opening an [issue](https://github.com/ObolNetwork/charon/issues), detailing what troubleshooting steps were taken and providing **debug logs**.
+
+To enable debug logs first clean up the Charon data directory as explained in [the previous paragraph](#cleaning-up-the-charon-directory), then run your DKG command by appending `--log-level=debug` at the end.
+
+In order for the Obol team to debug your issue as quickly and precisely as possible please provide full logs in textual form, not through screenshots or display photos.
+
+Providing complete logs is particularly important, since it allows the team to reconstruct precisely what happened.