Commit b1c3914

Update networking docs (#1649)
Adds more introductory information on our `Network` component in the Architecture > Networking page. This also drops old/outdated information and investigations of the Cardano network stack (not really relevant anymore). Ideally we would be able to find these back if we'd want, but not sure how to ensure this?

---

* [x] CHANGELOG update not needed
* [x] Documentation updated
* [x] Haddocks update not needed
* [x] No new TODOs introduced
2 parents d5729d3 + f8578bc commit b1c3914

2 files changed

+92
-95
lines changed

docs/docs/dev/architecture/index.md

Lines changed: 1 addition & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -23,21 +23,7 @@ $ plantuml -Tsvg architecture-c4.puml
23 23

24 24
### Network
25 25

26-
The _network_ component is responsible for all communications between Hydra nodes related to the off-chain part of the Hydra protocol. The [current implementation](./networking) is based on the [typed protocols](https://github.com/input-output-hk/typed-protocols) library, which is also used by the Cardano networking. It is asynchronous by nature and uses a push-based protocol with a uniform _broadcast_ abstraction.
27-
28-
Messages are exchanged between nodes during different internal transitions and are authenticated using each peer's _Hydra key_. Each message sent is signed by the sender, and the signature is verified by the receiver.
29-
30-
#### Authentication and authorization
31-
32-
The messages exchanged through the _Hydra networking_ layer between participants are authenticated. Each message is [signed](https://github.com/input-output-hk/hydra/issues/727) using the Hydra signing key of the emitting party, which is identified by the corresponding verification key. When a message with an unknown or incorrect signature is received, it is dropped, and a notification is logged.
33-
34-
However, messages are not encrypted. If confidentiality is required, an external mechanism must be implemented to prevent other parties from observing the messages exchanged within a head.
35-
36-
#### Fault tolerance
37-
38-
The Hydra Head protocol guarantees the safety of all honest participants' funds but does not inherently guarantee liveness. Therefore, for the protocol to progress, all parties involved in a head must be online and reactive.
39-
40-
This means that if one or more participants' Hydra nodes become permanently unreachable due to a crash or network partition, no further transactions can occur in the head, and it must be closed. However, the [Hydra networking layer](https://hydra.family/head-protocol/unstable/haddock/hydra-node/Hydra-Node-Network.html) is tolerant to transient disconnections and (non-Byzantine) crashes.
26+
The _network_ component is responsible for communication between Hydra nodes related to the off-chain part of the Hydra protocol. See [Networking](./networking) for details.
41 27

42 28
### Chain interaction
43 29

docs/docs/dev/architecture/networking.md

Lines changed: 91 additions & 80 deletions
Original file line number | Diff line number | Diff line change
@@ -1,32 +1,96 @@
1 1
# Networking
2 2

3-
This page provides details about the Hydra networking layer, which encompasses
4-
the network of Hydra nodes where heads can be opened.
5-
6-
## Questions
7-
8-
- What's the expected topology of the transport layer?
9-
- Are connected peers a subset, superset, or identical set of the head parties?
10-
- Do we need the delivery ordering and reliability guarantees TCP provides?
11-
- TCP provides full-duplex, stream-oriented, persistent connections between nodes
12-
- The Hydra networking layer is based on asynchronous message passing, which seems better suited to UDP
13-
- Do we need to consider nodes being reachable through firewalls?
14-
- This responsibility could be delegated to end users, allowing them to configure their firewalls/NATs to align with Hydra node requirements
15-
- This may be more manageable for business, corporate, or organizational parties than for individual end-users
16-
- Do we want _privacy_ within a head?
17-
- Transactions' details should be opaque to outside observers, with only the final outcome of the head's fanout being observable
18-
- How do we identify/discover peers/parties?
19-
- The paper assumes a _setup_ phase where:
20-
> To create a head-protocol instance, an initiator invites a set of participants \{p1,...,pn\} (including themselves) to join by announcing protocol parameters: the participant list, parameters of the (multi-)signature scheme, etc.
21-
> Each party subsequently establishes pairwise authenticated channels with all other parties involved.
22-
- What constitutes a _list of participants_? Should each participant be uniquely identifiable? If so, what identification method should be used — naming scheme, IP: port address, public key, certificate?
23-
- What do 'pairwise authenticated channels' entail? Are these actual TCP/TLS connections, or do they operate at the Transport (layer 4) or Session (layer 5) level?
24-
- How open do we want our network protocol to be?
25-
- Currently leveraging the Ouroboros stack with CBOR message encoding, integrating other tools into the Hydra network may pose challenges.
26-
27-
## Investigations
28-
29-
### Network resilience
3+
This page provides details about the Hydra networking layer, through which Hydra
4+
nodes exchange off-chain protocol messages. The off-chain protocol relies
5+
heavily on the correct operation of the **multicast** abstraction (`broadcast`
6+
in our fully connected topology here) in the way [it is
7+
specified](../specification), and the following sections explain our realization
8+
in the Hydra node implementation.
9+
10+
## Interface
11+
12+
Within a `hydra-node`, a `Network` component provides the capability to reliably
13+
`broadcast` a message to the whole Hydra network. In turn, when a message is
14+
received from the network, the `NetworkCallback` signals this by invoking
15+
`deliver`. This interface follows reliable broadcast terminology of distributed
16+
systems literature.
17+
18+
Given the way the [off-chain protocol is specified](../specification), the
19+
`broadcast` abstraction required from the `Network` interface is a so-called
20+
_uniform reliable broadcast_ with properties:
21+
22+
1. **Validity**: If a correct process p broadcasts a message m, then p eventually delivers m.
23+
2. **No duplication**: No message is delivered more than once.
24+
3. **No creation**: If a process delivers a message m with sender s, then m was
25+
previously broadcast by process s.
26+
4. **Agreement**: If a message m is delivered by some correct process, then m is
27+
eventually delivered by every correct process.
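
As a rough illustration of the four properties above, the following Python sketch checks a finished trace of broadcasts and deliveries against them. The `Message` type, the `violated_properties` function and the trace layout are hypothetical names chosen for this example and are not part of `hydra-node`; "eventually" is approximated as "by the end of the trace", and messages are assumed to be unique.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Message:
    sender: str   # process that claims to have broadcast the message
    payload: str  # assumed unique per sender in this sketch


def violated_properties(
    broadcasts: Dict[str, List[Message]],
    deliveries: Dict[str, List[Message]],
    correct: Set[str],
) -> Set[str]:
    """Return the names of the broadcast properties a finished trace violates.

    broadcasts: process -> messages it broadcast
    deliveries: process -> messages it delivered, in order
    correct:    processes that never crashed
    """
    violated = set()

    # Validity: a correct process delivers everything it broadcast itself.
    for p in correct:
        delivered = deliveries.get(p, [])
        if any(m not in delivered for m in broadcasts.get(p, [])):
            violated.add("validity")

    for delivered in deliveries.values():
        # No duplication: no message appears twice in one delivery log.
        if len(delivered) != len(set(delivered)):
            violated.add("no duplication")
        # No creation: every delivered message was broadcast by its sender.
        if any(m not in broadcasts.get(m.sender, []) for m in delivered):
            violated.add("no creation")

    # Agreement: whatever one correct process delivered, all of them delivered.
    by_correct = {m for p in correct for m in deliveries.get(p, [])}
    for p in correct:
        if not by_correct <= set(deliveries.get(p, [])):
            violated.add("agreement")

    return violated
```

For example, a trace in which `alice` delivers her own message twice while `bob` never receives it would be reported as violating both "no duplication" and "agreement".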
28+
29+
See also Module 3.3 in [Introduction to Reliable and Secure Distributed
30+
Programming](https://www.distributedprogramming.net) by Cachin et al., or
31+
[Self-stabilizing Uniform Reliable Broadcast by Oskar
32+
Lundström](https://arxiv.org/abs/2001.03244); or [atomic
33+
broadcast](https://en.m.wikipedia.org/wiki/Atomic_broadcast) for an even
34+
stronger abstraction.
35+
36+
## Topology
37+
38+
Currently, the `hydra-node` operates in a static, **fully connected** network
39+
topology where each node connects to every other node and a message is broadcast
40+
to all nodes. For this, we need to pass publicly reachable endpoints of *all
41+
other nodes* via `--peer` options to each hydra node and *all links* must be
42+
operational to achieve liveness.
43+
44+
Alternative implementations of the `Network` interface could improve upon this
45+
by enabling **mesh** topologies where messages are forwarded across links. This
46+
would simplify configuration to only need to provide *at least one* `--peer`,
47+
while *peer sharing* in such a network could still allow for redundant
48+
connections and better fault tolerance.
49+
50+
## Authentication
51+
52+
The messages exchanged through the _Hydra networking_ layer between participants
53+
are authenticated. Each message is
54+
[signed](https://github.com/input-output-hk/hydra/issues/727) using the Hydra
55+
signing key of the emitting party, which is identified by the corresponding
56+
verification key. When a message with an unknown or incorrect signature is
57+
received, it is dropped, and a notification is logged.
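
The drop-and-log behaviour described above can be sketched in Python as follows. This is only an illustration: Python's standard library has no public-key signature scheme, so HMAC-SHA256 with a per-party secret stands in for real signing and verification keys, and the names `sign`, `receive` and `known_keys` are hypothetical and not taken from `hydra-node`.

```python
import hashlib
import hmac
import logging
from typing import Dict, Optional

log = logging.getLogger("network")


def sign(key: bytes, payload: bytes) -> bytes:
    """Stand-in for signing: HMAC-SHA256 instead of a real digital signature."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def receive(
    known_keys: Dict[str, bytes],
    sender: str,
    payload: bytes,
    signature: bytes,
) -> Optional[bytes]:
    """Return the payload if it is authentic, otherwise drop it and log."""
    key = known_keys.get(sender)
    # Unknown party, or the signature does not match the payload.
    if key is None or not hmac.compare_digest(sign(key, payload), signature):
        log.warning("dropping message from %s: unknown party or bad signature", sender)
        return None
    return payload
```

A message from a known party with a matching signature is passed on; a tampered payload or a message from an unknown party yields `None` and a log entry.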
58+
59+
Currently, messages are not encrypted. If confidentiality is required, an
60+
external mechanism must be implemented to prevent other parties from observing
61+
the messages exchanged within a head.
62+
63+
## Fault model
64+
65+
Although the Hydra protocol can only progress when nodes of all participants are
66+
online and responsive, the network layer should still provide a certain level of
67+
tolerance to crashes, transient connection problems and *non-byzantine* faults.
68+
69+
Concretely, a _fail-recovery_ distributed systems model (again, see Cachin et al.) seems to fit these requirements best. This means that processes may crash and later recover, and should then still be able to participate in the protocol. Processes may forget what they did prior to crashing, but may use stable storage to persist knowledge. Links may fail and are _fair-loss_; techniques to improve them to _stubborn_ or _perfect_ links will likely be required.
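
To make the step from fair-loss links towards stubborn or perfect links more concrete, here is a minimal Python sketch under simplifying assumptions: a simulated link with a scripted sequence of outcomes, a sender that retransmits until it sees an acknowledgement (bounded here, whereas a true stubborn link retransmits forever), and a receiver that filters duplicates by message id. All names (`FairLossLink`, `send_reliably`, `deliver_once`) are hypothetical and do not describe the actual `hydra-node` implementation.

```python
from typing import List, Tuple


class FairLossLink:
    """Simulated link whose attempts follow a scripted list of outcomes.

    Each outcome is "lost" (message dropped), "ack_lost" (message arrives but
    the acknowledgement is dropped) or "ok". Once the script is exhausted,
    every further attempt succeeds.
    """

    def __init__(self, outcomes: List[str]):
        self.outcomes = list(outcomes)
        self.attempts = 0
        self.received: List[Tuple[int, str]] = []

    def send(self, msg_id: int, payload: str) -> bool:
        """Try to send once; return True only if an acknowledgement came back."""
        outcome = self.outcomes[self.attempts] if self.attempts < len(self.outcomes) else "ok"
        self.attempts += 1
        if outcome == "lost":
            return False
        self.received.append((msg_id, payload))
        return outcome == "ok"


def send_reliably(link: FairLossLink, msg_id: int, payload: str, max_retries: int = 10) -> bool:
    """Retransmit until acknowledged, giving up after max_retries attempts."""
    for _ in range(max_retries):
        if link.send(msg_id, payload):
            return True
    return False


def deliver_once(received: List[Tuple[int, str]]) -> List[str]:
    """Receiver side: suppress retransmitted duplicates by message id."""
    seen = set()
    delivered = []
    for msg_id, payload in received:
        if msg_id not in seen:
            seen.add(msg_id)
            delivered.append(payload)
    return delivered
```

Retransmission alone can cause the same message to arrive more than once (when only the acknowledgement is lost), which is why the receiver-side duplicate filter is needed before messages are handed to the layer above.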
70+
71+
See also [this ADR](/adr/27) for a past discussion on making the network component resilient against faults.
72+
73+
## Implementations
74+
75+
### Current network stack
76+
77+
See [haddocks](/haddock/hydra-node/Hydra-Node-Network.html)
78+
79+
- Hydra nodes form a network of pairwise connected *peers* using point-to-point (e.g., TCP) connections that are expected to remain active at all times:
80+
- Nodes use [Ouroboros](https://github.com/input-output-hk/ouroboros-network/) as the underlying network abstraction, which manages connections with peers via a reliable point-to-point stream-based communication framework known as a `Snocket`
81+
- All messages are _broadcast_ to peers using the PTP connections
82+
- Due to the nature of the Hydra protocol, the lack of a connection to a peer halts any progress of the head.
83+
- A `hydra-node` can only open a head with *all* its peers and exclusively with them. This necessitates that nodes possess prior knowledge of the topology of both peers and heads they intend to establish.
84+
- Connected nodes implement basic _failure detection_ through heartbeats and monitoring exchanged messages.
85+
- Messages exchanged between peers are signed using the party's Hydra key and validated upon receiving.
86+
87+
### Gossip diffusion network
88+
89+
The following diagram illustrates one possible implementation of a pull-based messaging system for Hydra, developed from discussions with IOG’s networking engineers:
90+
91+
![Hydra pull-based network](./hydra-pull-based-network.jpg)
92+
93+
## Network resilience testing
30 94

31 95
In August 2024 we added some network resilience tests, implemented as a GitHub
32 96
action step in [network-test.yaml](https://github.com/cardano-scaling/hydra/blob/master/.github/workflows/network-test.yaml).
@@ -82,56 +146,3 @@ The main things to note are:
82 146
- It's okay to see certain configurations fail, but it's certainly not
83 147
expected to see them _all_ fail; certainly not the zero-loss cases. Anything
84 148
that looks suspicious should be investigated.
85-
86-
87-
### Ouroboros
88-
89-
We held a meeting with the networking team on February 14, 2022, to explore the integration of the Ouroboros network stack into Hydra. During the discussion, there was a notable focus on performance, with Neil Davies providing insightful performance metrics.
90-
91-
- World circumference: 600ms
92-
- Latency w/in 1 continent: 50-100ms
93-
- Latency w/in DC: 2-3ms
94-
- Subsecond roundtrip should be fine wherever the nodes are located
95-
- Basic reliability of TCP connections decreases w/ distance:
96-
- w/in DC connection can last forever
97-
- outside DC: it's hard to keep a single TCP cnx up forever; if a reroute occurs because some intermediate node is down, it takes 90s to resettle a route
98-
- this implies that as the number of connections goes up, the probability of having at least one connection down at all times increases
99-
- Closing of the head must be dissociated from network connections => a TCP cnx disappearing =/=> closing the head
100-
- Within the Cardano network, propagation of a single empty block takes 400ms (to reach 10K nodes)
101-
- the Ouroboros network should withstand 1000s of connections (there are some system-level limits)
102-
- Modelling the Hydra network
103-
- a logical framework for modelling the performance of network associate CDF with time for a message to appear at all nodes (this is what is done in the [hydra-sim](https://github.com/input-output-hk/hydra-sim))
104-
- we could define a layer w/ the semantics we expect; for example, Snocket = PTP connection w/ ordered guaranteed messages delivery – do we need that in Hydra?
105-
- How about [Wireguard](https://wireguard.io)? It's a very interesting approach, with some shortcomings:
106-
- no global addressing scheme
107-
- there is one `eth` interface/connection
108-
- on the plus side, it transparently manages IP address changes
109-
- does not help w/ Firewalls, eg NAT needs to be configured on each node.
110-
111-
### Cardano networking
112-
113-
See [this Wiki page](https://github.com/input-output-hk/hydra.wiki/blob/master/Networking.md#L1) for detailed notes about how the Cardano network works and uses Ouroboros.
114-
115-
- Cardano is a global network spanning thousands of nodes, with nodes constantly joining and leaving, resulting in a widely varying topology. Its primary function is block propagation: blocks produced by certain nodes according to consensus rules must reach every node in the network within 20 seconds.
116-
- Nodes cannot maintain direct connections to all other nodes; instead, block diffusion occurs through a form of _gossiping_. Each node is connected to a limited set of peers with whom it exchanges blocks.
117-
- Nodes must withstand adversarial behavior from peers and other nodes, necessitating control over the amount and rate of data they ingest. Hence, a _pull-based_ messaging layer is essential.
118-
- Producer nodes, which require access to signing keys, are considered sensitive assets. They are typically operated behind *relay nodes* to enhance security and mitigate the risks of DoS attacks or other malicious activities.
119-
- Nodes often operate behind ADSL or cable modems, firewalls, or in other complex networking environments that prevent direct addressing. Therefore, nodes must initiate connections to externally reachable *relay nodes*, and rely on a *pull-based* messaging approach.
120-
121-
## Implementations
122-
123-
### Current state
124-
125-
- Hydra nodes form a network of pairwise connected *peers* using point-to-point (eg, TCP) connections that are expected to remain active at all times:
126-
- Nodes use [Ouroboros](https://github.com/input-output-hk/ouroboros-network/) as the underlying network abstraction, which manages connections with peers via a reliable point-to-point stream-based communication framework known as a `Snocket`
127-
- All messages are _broadcast_ to peers using the PTP connections
128-
- Due to the nature of the Hydra protocol, the lack of a connection to a peer halts any progress of the head.
129-
- A `hydra-node` can only open a head with *all* its peers and exclusively with them. This necessitates that nodes possess prior knowledge of the topology of both peers and heads they intend to establish.
130-
- Connected nodes implement basic _failure detection_ through heartbeats and monitoring exchanged messages.
131-
- Messages exchanged between peers are signed using the party's Hydra key and validated upon receiving.
132-
133-
### Gossip diffusion network
134-
135-
The following diagram illustrates one possible implementation of a pull-based messaging system for Hydra, developed from discussions with IOG’s networking engineers:
136-
137-
![Hydra pull-based network](./hydra-pull-based-network.jpg)
