Update networking docs (#1649)

ch1bo · web-flow · commit b1c3914d376d · 2024-09-24T18:08:26.000+02:00
Adds more introductory information on our `Network` component in the
Architecture > Networking page.

This also drops old/outdated information and investigations of the
Cardano network stack (not really relevant anymore). Ideally we would be
able to find these again if we wanted to, but I am not sure how to ensure this.
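
For orientation, the `broadcast`/`deliver` interface described in the new Interface section can be sketched roughly as follows. This is only an illustrative sketch: the record shapes are assumptions modeled on the names used in the docs, and `newInMemoryNetwork`, `recordingCallback` and `demo` are hypothetical helpers that are not part of `hydra-node`.

```haskell
import Control.Monad (forM_)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- | Handle used by the node logic to send a message to all parties.
-- (Assumed shape, modeled on the names used in the docs.)
newtype Network m msg = Network {broadcast :: msg -> m ()}

-- | Callback invoked by the network layer when a message arrives.
-- (Assumed shape, modeled on the names used in the docs.)
newtype NetworkCallback msg m = NetworkCallback {deliver :: msg -> m ()}

-- | Hypothetical in-memory "network": broadcasting a message invokes every
-- registered callback in order, including the sender's own.
newInMemoryNetwork :: [NetworkCallback msg IO] -> Network IO msg
newInMemoryNetwork callbacks =
  Network{broadcast = \msg -> forM_ callbacks (\cb -> deliver cb msg)}

-- | Hypothetical callback that appends each delivered message to a shared log.
recordingCallback :: String -> IORef [String] -> NetworkCallback String IO
recordingCallback name logRef =
  NetworkCallback
    { deliver = \msg ->
        modifyIORef' logRef (++ [name <> " delivered " <> msg])
    }

-- | Broadcast one placeholder message to two parties and return the log.
demo :: IO [String]
demo = do
  logRef <- newIORef []
  let network =
        newInMemoryNetwork
          [ recordingCallback "alice" logRef
          , recordingCallback "bob" logRef
          ]
  broadcast network "ping"
  readIORef logRef
```

Running `demo` in GHCi returns `["alice delivered ping","bob delivered ping"]`. In this toy setting every registered callback, including the broadcaster's own, sees the message exactly once. The sketch does not model crashes, lossy links or signatures.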

---

* [x] CHANGELOG update not needed
* [x] Documentation updated
* [x] Haddocks update not needed
* [x] No new TODOs introduced
diff --git a/docs/docs/dev/architecture/index.md b/docs/docs/dev/architecture/index.md
@@ -23,21 +23,7 @@ $ plantuml -Tsvg architecture-c4.puml
 
 ### Network
 
-The _network_ component is responsible for all communications between Hydra nodes related to the off-chain part of the Hydra protocol. The [current implementation](./networking) is based on the [typed protocols](https://github.com/input-output-hk/typed-protocols) library, which is also used by the Cardano networking. It is asynchronous by nature and uses a push-based protocol with a uniform _broadcast_ abstraction.
-
-Messages are exchanged between nodes during different internal transitions and are authenticated using each peer's _Hydra key_. Each message sent is signed by the sender, and the signature is verified by the receiver.
-
-#### Authentication and authorization
-
-The messages exchanged through the _Hydra networking_ layer between participants are authenticated. Each message is [signed](https://github.com/input-output-hk/hydra/issues/727) using the Hydra signing key of the emitting party, which is identified by the corresponding verification key. When a message with an unknown or incorrect signature is received, it is dropped, and a notification is logged.
-
-However, messages are not encrypted. If confidentiality is required, an external mechanism must be implemented to prevent other parties from observing the messages exchanged within a head.
-
-#### Fault tolerance
-
-The Hydra Head protocol guarantees the safety of all honest participants' funds but does not inherently guarantee liveness. Therefore, for the protocol to progress, all parties involved in a head must be online and reactive.
-
-This means that if one or more participants' Hydra nodes become permanently unreachable due to a crash or network partition, no further transactions can occur in the head, and it must be closed. However, the [Hydra networking layer](https://hydra.family/head-protocol/unstable/haddock/hydra-node/Hydra-Node-Network.html) is tolerant to transient disconnections and (non-Byzantine) crashes.
+The _network_ component is responsible for communication between Hydra nodes related to the off-chain part of the Hydra protocol. See [Networking](./networking) for details.
 
 ### Chain interaction
 
diff --git a/docs/docs/dev/architecture/networking.md b/docs/docs/dev/architecture/networking.md
@@ -1,32 +1,96 @@
 # Networking
 
-This page provides details about the Hydra networking layer, which encompasses
-the network of Hydra nodes where heads can be opened.
-
-## Questions
-
-- What's the expected topology of the transport layer?
-  - Are connected peers a subset, superset, or identical set of the head parties?
-- Do we need the delivery ordering and reliability guarantees TCP provides?
-  - TCP provides full-duplex, stream-oriented, persistent connections between nodes
-  - The Hydra networking layer is based on asynchronous message passing, which seems better suited to UDP
-- Do we need to consider nodes being reachable through firewalls?
-  - This responsibility could be delegated to end users, allowing them to configure their firewalls/NATs to align with Hydra node requirements
-  - This may be more manageable for business, corporate, or organizational parties than for individual end-users
-- Do we want _privacy_ within a head?
-  - Transactions' details should be opaque to outside observers, with only the final outcome of the head's fanout being observable
-- How do we identify/discover peers/parties?
-  - The paper assumes a _setup_ phase where:
-    > To create a head-protocol instance, an initiator invites a set of participants \{p1,...,pn\} (including themselves) to join by announcing protocol parameters: the participant list, parameters of the (multi-)signature scheme, etc.
-    > Each party subsequently establishes pairwise authenticated channels with all other parties involved.
-- What constitutes a _list of participants_? Should each participant be uniquely identifiable? If so, what identification method should be used — naming scheme, IP: port address, public key, certificate?
-  - What do 'pairwise authenticated channels' entail? Are these actual TCP/TLS connections, or do they operate at the Transport (layer 4) or Session (layer 5) level?
-- How open do we want our network protocol to be?
-  - Currently leveraging the Ouroboros stack with CBOR message encoding, integrating other tools into the Hydra network may pose challenges.
-
-## Investigations
-
-### Network resilience
+This page provides details about the Hydra networking layer, through which Hydra
+nodes exchange off-chain protocol messages. The off-chain protocol relies
+heavily on the correct operation of the **multicast** abstraction (`broadcast`
+in our fully connected topology) as [it is
+specified](../specification). The following sections explain how the Hydra node
+implementation realizes it.
+
+## Interface
+
+Within a `hydra-node`, a `Network` component provides the capability to reliably
+`broadcast` a message to the whole Hydra network. In turn, when a message is
+received from the network, the `NetworkCallback` signals this by invoking
+`deliver`. This interface follows the reliable broadcast terminology of the
+distributed systems literature.
+
+Given the way the [off-chain protocol is specified](../specification), the
+`broadcast` abstraction required from the `Network` interface is a so-called
+_uniform reliable broadcast_ with the following properties:
+
+1. **Validity**: If a correct process p broadcasts a message m, then p eventually delivers m.
+2. **No duplication**: No message is delivered more than once.
+3. **No creation**: If a process delivers a message m with sender s, then m was
+previously broadcast by process s.
+4. **Agreement**: If a message m is delivered by some correct process, then m is
+eventually delivered by every correct process.
+
+See also Module 3.3 in [Introduction to Reliable and Secure Distributed
+Programming](https://www.distributedprogramming.net) by Cachin et al, or
+[Self-stabilizing Uniform Reliable Broadcast by Oskar
+Lundström](https://arxiv.org/abs/2001.03244); or [atomic
+broadcast](https://en.m.wikipedia.org/wiki/Atomic_broadcast) for an even
+stronger abstraction.
+
+## Topology
+
+Currently, the `hydra-node` operates in a static, **fully connected** network
+topology where each node connects to every other node and a message is broadcast
+to all nodes. For this, we need to pass publicly reachable endpoints of *all
+other nodes* via `--peer` options to each hydra node and *all links* must be
+operational to achieve liveness.
+
+Alternative implementations of the `Network` interface could improve upon this
+by enabling **mesh** topologies where messages are forwarded across links. This
+would simplify configuration to only need to provide *at least one* `--peer`,
+while *peer sharing* in such a network could still allow for redundant
+connections and better fault tolerance.
+
+## Authentication
+
+The messages exchanged through the _Hydra networking_ layer between participants
+are authenticated. Each message is
+[signed](https://github.com/input-output-hk/hydra/issues/727) using the Hydra
+signing key of the emitting party, which is identified by the corresponding
+verification key. When a message with an unknown or incorrect signature is
+received, it is dropped, and a notification is logged.
+
+Currently, messages are not encrypted. If confidentiality is required, an
+external mechanism must be implemented to prevent other parties from observing
+the messages exchanged within a head.
+
+## Fault model
+
+Although the Hydra protocol can only progress when nodes of all participants are
+online and responsive, the network layer should still provide a certain level of
+tolerance to crashes, transient connection problems and *non-Byzantine* faults.
+
+Concretely, a _fail-recovery_ distributed systems model (again, see Cachin et al.) seems to fit these requirements best. This means that processes may crash and later recover, and should then still be able to participate in the protocol. Processes may forget what they did prior to crashing, but may use stable storage to persist knowledge. Links may fail and are _fair-loss_; techniques to improve them to _stubborn_ or _perfect_ links will likely be required.
+
+See also [this ADR](/adr/27) for a past discussion on making the network component resilient against faults.
+
+## Implementations
+
+### Current network stack
+
+See the [haddocks](/haddock/hydra-node/Hydra-Node-Network.html).
+
+- Hydra nodes form a network of pairwise connected *peers* using point-to-point (eg, TCP) connections that are expected to remain active at all times:
+  - Nodes use [Ouroboros](https://github.com/input-output-hk/ouroboros-network/) as the underlying network abstraction, which manages connections with peers via a reliable point-to-point stream-based communication framework known as a `Snocket`
+  - All messages are _broadcast_ to peers using the PTP connections
+  - Due to the nature of the Hydra protocol, the lack of a connection to a peer halts any progress of the head.
+- A `hydra-node` can only open a head with *all* its peers and exclusively with them. This necessitates that nodes possess prior knowledge of the topology of both peers and heads they intend to establish.
+- Connected nodes implement basic _failure detection_ through heartbeats and monitoring exchanged messages.
+- Messages exchanged between peers are signed using the party's Hydra key and validated upon receiving.
+
+### Gossip diffusion network
+
+The following diagram illustrates one possible implementation of a pull-based messaging system for Hydra, developed from discussions with IOG’s networking engineers:
+
+![Hydra pull-based network](./hydra-pull-based-network.jpg)
+
+## Network resilience testing
 
 In August 2024 we added some network resilience tests, implemented as a GitHub
 action step in [network-test.yaml](https://github.com/cardano-scaling/hydra/blob/master/.github/workflows/network-test.yaml).
@@ -82,56 +146,3 @@ The main things to note are:
 - It's okay to see certain configurations fail, but it's certainly not
   expected to see them _all_ fail; certainly not the zero-loss cases. Anything
   that looks suspicious should be investigated.
-
-
-### Ouroboros
-
-We held a meeting with the networking team on February 14, 2022, to explore the integration of the Ouroboros network stack into Hydra. During the discussion, there was a notable focus on performance, with Neil Davies providing insightful performance metrics.
-
-- World circumference: 600ms
-- Latency w/in 1 continent: 50-100ms
-- Latency w/in DC: 2-3ms
-- Subsecond roundtrip should be fine wherever the nodes are located
-- Basic reliability of TCP connections decreases w/ distance:
-  - w/in DC connection can last forever
-  - outside DC: it's hard to keep a single TCP cnx up forever; if a reroute occurs because some intermediate node is down, it takes 90s to resettle a route
-  - this implies that as the number of connections goes up, the probability of having at least one connection down at all times increases
-- Closing of the head must be dissociated from network connections => a TCP cnx disappearing =/=> closing the head
-- Within the Cardano network, propagation of a single empty block takes 400ms (to reach 10K nodes)
-  - the Ouroboros network should withstand 1000s of connections (there are some system-level limits)
-- Modelling the Hydra network
-  - a logical framework for modelling the performance of network associate CDF with time for a message to appear at all nodes (this is what is done in the [hydra-sim](https://github.com/input-output-hk/hydra-sim)
-  - we could define a layer w/ the semantics we expect; for example, Snocket = PTP connection w/ ordered guaranteed messages delivery – do we need that in Hydra?
-- How about [Wireguard](https://wireguard.io)? It's a very interesting approach, with some shortcomings:
-  - no global addressing scheme
-  - there is one `eth` interface/connection
-  - on the plus side, it transparently manages IP address changes
-  - does not help w/ Firewalls, eg NAT needs to be configured on each node.
-
-### Cardano networking
-
-See [this Wiki page](https://github.com/input-output-hk/hydra.wiki/blob/master/Networking.md#L1) for detailed notes about how the Cardano network works and uses Ouroboros.
-
-- Cardano is a global network spanning thousands of nodes, with nodes constantly joining and leaving, resulting in a widely varying topology. Its primary function is block propagation: blocks produced by certain nodes according to consensus rules must reach every node in the network within 20 seconds.
-- Nodes cannot maintain direct connections to all other nodes; instead, block diffusion occurs through a form of _gossiping_. Each node is connected to a limited set of peers with whom it exchanges blocks.
-- Nodes must withstand adversarial behavior from peers and other nodes, necessitating control over the amount and rate of data they ingest. Hence, a _pull-based_ messaging layer is essential.
-- Producer nodes, which require access to signing keys, are considered sensitive assets. They are typically operated behind *relay nodes* to enhance security and mitigate the risks of DoS attacks or other malicious activities.
-- Nodes often operate behind ADSL or cable modems, firewalls, or in other complex networking environments that prevent direct addressing. Therefore, nodes must initiate connections to externally reachable *relay nodes*, and rely on a *pull-based* messaging approach.
-
-## Implementations
-
-### Current state
-
-- Hydra nodes form a network of pairwise connected *peers* using point-to-point (eg, TCP) connections that are expected to remain active at all times:
-  - Nodes use [Ouroboros](https://github.com/input-output-hk/ouroboros-network/) as the underlying network abstraction, which manages connections with peers via a reliable point-to-point stream-based communication framework known as a `Snocket`
-  - All messages are _broadcast_ to peers using the PTP connections
-  - Due to the nature of the Hydra protocol, the lack of a connection to a peer halts any progress of the head.
-- A `hydra-node` can only open a head with *all* its peers and exclusively with them. This necessitates that nodes possess prior knowledge of the topology of both peers and heads they intend to establish.
-- Connected nodes implement basic _failure detection_ through heartbeats and monitoring exchanged messages.
-- Messages exchanged between peers are signed using the party's Hydra key and validated upon receiving.
-
-### Gossip diffusion network
-
-The following diagram illustrates one possible implementation of a pull-based messaging system for Hydra, developed from discussions with IOG’s networking engineers:
-
-![Hydra pull-based network](./hydra-pull-based-network.jpg)