Skip to content

Commit

Permalink
Remove path-aware content-id generation
Browse files Browse the repository at this point in the history
  • Loading branch information
pipermerriam committed Dec 20, 2023
1 parent fcb4ca8 commit b73ede9
Showing 1 changed file with 24 additions and 161 deletions.
185 changes: 24 additions & 161 deletions state-network.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ This document is the specification for the sub-protocol that supports on-demand

The execution state network is a [Kademlia](https://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf) DHT that uses the [Portal Wire Protocol](./portal-wire-protocol.md) to establish an overlay network on top of the [Discovery v5](https://github.com/ethereum/devp2p/blob/master/discv5/discv5-wire.md) protocol.

State data from the execution chain consists of all account data from the main storage trie, all contract storage data from all of the individual contract storage tries, and the individul bytecodes for all contracts.
State data from the execution chain consists of all account data from the main storage trie, all contract storage data from all of the individual contract storage tries, and the individul bytecodes for all contracts across all historical state roots. This is traditionally referred to as an "archive node".

### Data

Expand Down Expand Up @@ -40,33 +40,7 @@ The state network uses the stock XOR distance metric defined in the portal wire

### Content ID Derivation Function

The derivation function for Content Id values are defined separately for each data type.

#### Path Aware Content ID Derivation

The state network is primarily concerned with storing state data which is
cryptographically anchored into trie structure. A naive approach to
distribution of this data in the network would be to address trie nodes by
their node-hash. This approach is effective, however, it negatively impacts
the efficiency of lookups. The production patricia merkle trie for ethereum
accounts tends to be at least 7 layers deep, which in a production portal
network environment would mean nodes would have to perform seven (7) recursive
lookups in the DHT to find a single piece of state.

For this reason, the state network implements a custom scheme that uses the
trie path for the high bits in the content-id and fills the remaining low bits
from the node hash. The result of this is as follows.

- Content that is very *shallow* in the trie such as the state root node is
relatively randomly distributed around the DHT regardless of the path
component.
- Content that is very deep in the trie such as leaf nodes, or deep
intermediate nodes are clustered in the DHT address space with the most bits
in common with the path to those nodes in the trie.

This leads to a beneficial situation where leaves that are *close* to each
other in the trie will also be likely to be close to each other in the DHT,
which allows for more efficient network retrieval.
The history network uses the SHA256 Content ID derivation function from the portal wire protocol specification.

### Wire Protocol

Expand Down Expand Up @@ -137,161 +111,50 @@ MPTWitness := List(witness: WitnessNode, max_length=32)

#### Paths (Nibbles)

We define nibbles as a sequence of single hex values
A naive approach to storage of trie nodes would be to simply use the `node_hash` value of the trie node for storage. This scheme however results in stored data not being tied in any direct way to it's location in the trie. In a situation where a participant in the DHT wished to re-gossip data that they have stored, they would need to reconstruct a valid trie proof for that data in order to construct the appropriate OFFER/ACCEPT payload. We include the `path` metadata in state network content keys so that it is possible to reconstruct this proof.

We define path as a sequences of "nibbles" which represent the path through the MPT to reach the trie node.

```
nibble := {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f} // 16 possible values
nibble := {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f} // 16 possible values
NibblePair := Byte // 2 nibbles tightly packed into a single byte
Nibbles := Vector(NibblePair, length=8) // fixed path length of 8 bytes
Nibbles := List(NibblePair, max_length=32) // fixed path length of 8 bytes
```


#### Helper Functions

We define these helper functions.

##### `combine_path_and_node_hash(path: Nibbles, node_hash: Bytes32) -> Bytes32`
##### `encode_path(path: Nibbles) -> ???`

This function is designed to combine the "nibbles" which define the path of a
node in the hexary merkle trie with the `node_hash` of that node in the trie
into a 32 byte content-id value. This function is designed such that the trie
path occupies the *highest* bits of the content-id and the remaining bits are
sourced from the `node_hash`. The result of this is that trie nodes which are
close to eacch other in the trie will be statistically likely to also be close
to each other when compared using the XOR distance metric in the DHT.
> TODO: what is the most efficient way to encode nibbles and place them in an SSZ container...
One thing to notice about this scheme is that for intermediate trie nodes, the
`path` component will often be short, or even empty in the case of the state
root. However, for leaf nodes, the path component passed into this function
MUST be the full 32 byte path in the trie where the account leaf lives. This
ensures that the content key and id for a leaf node of the trie doesn't change
as the trie around it changes.
- for state roots, the path is the empty list.
- for intermediate nodes we expect between 1-9 nibbles values, aka roughly 3-6 bytes when tightly packed and using terminator byte
- for leaf nodes we can use the 20-byte address pre-image to compress the actual 32-byte path in the trie.

In an SSZ context we end up with 4-byte length elements for anything variable length which suggests we should potentially do something like:

```
# The maximum number of bytes in the resulting content-id that may be sourced
# from the trie path
MAX_PATH_BYTES = 8
# The allowed values for a 'nibble'
NIBBLES_VALUES = {
0x00, 0x01, 0x02, 0x03,
0x04, 0x05, 0x06, 0x07,
0x08, 0x09, 0x0a, 0x0b,
0x0c, 0x0d, 0x0e, 0x0f,
}
def tightly_pack_nibbles(path: Nibbles) -> bytes:
"""
Take an even length bytestring of loosely packed nibbles values and return
them tightly packed as a byte string
>>> pack_nibbles(b'\x0a\x01\x03\x0f')
b'\xa1\x3f'
"""
assert len(path) % 2 == 0
ipath = iter(path)
packed_values = []
for _ in range(len(path) // 2):
high, low = next(ipath), next(ipath)
packed = high << 4 | low
packed_values.append(packed)
return bytes(packed_values)
def construct_trie_node_content_id(path: Nibbles, node_hash: Bytes32) -> Bytes32:
# `node_hash` must be a 32 byte value
assert len(node_hash) == 32
# `path` may not exceed 64 nibbles which is maximum possible trie depth
assert len(path) <= 64
# `path` must be valid `nibbles` values
assert set(path).issubset(NIBBLES_VALUES)
trimmed_path = path[:2 * MAX_PATH_BYTES]
if len(trimmed_path) % 2 == 0: # path length is "even"
#
# path = (0xa, 0xb, 0xc, 0x1, 0x2, 0x3)
# node_hash = 0xdeadbeefdeadbeef....
# content_id = 0xabc123efdeadbeef....
#
# +============+======+======+======+======+======+=====+
# | byte # | #0 | #1 | #2 | #3 | #4 | ... |
# +============+======+======+======+======+======+=====+
# | path | 0xab | 0xc1 | 0x23 | | | |
# +------------+------+------+------+------+------+-----+
# | content_id | ^^ | | ^^ | | | |
# | content_id | 0xab | 0xc1 | 0x23 | 0xef | 0xde | ... |
# | content_id | | | | vv | vv | |
# +------------+------+------+------+------+------+-----+
# | node_hash | 0xde | 0xad | 0xbe | 0xef | 0xde | ... |
# +------------+------+------+------+------+------+-----+
#
packed_path = tightly_pack_nibbles(trimmed_path)
node_hash_low_part = node_hash[len(packed_path):]
return packed_path + node_hash_low_part
else: # path length is "odd"
#
# path = (0xa, 0xb, 0xc, 0x1, 0x2)
# node_hash = 0xdeadbeefdeadbeef....
# content_id = 0xabc12eefdeadbeef....
#
# +============+======+======+======+======+======+=====+
# | byte # | #0 | #1 | #2 | #3 | #4 | ... |
# +============+======+======+======+======+======+=====+
# | path | 0xab | 0xc1 | 0x2 | | | |
# +------------+------+------+------+------+------+-----+
# | content_id | ^^ | | ^ | | | |
# | content_id | 0xab | 0xc1 | 0x2e | 0xef | 0xde | ... |
# | content_id | | | v | vv | vv | |
# +------------+------+------+------+------+------+-----+
# | node_hash | 0xde | 0xad | 0xbe | 0xef | 0xde | ... |
# +------------+------+------+------+------+------+-----+
#
packed_path_high_part = tightly_pack_nibbles(trimmed_path[:-1])
# Since the nibbles value in this case is odd length, then we must
# combine the last nibble with the lower 4 bits of the byte in that
# position from the node_hash value.
middle_high_part = trimmed_path[-1] << 4
middle_low_part = node_hash[len(packed_path_high_part)] & 0x0f
middle_byte = middle_high_part | middle_low_part
node_hash_low_part = node_hash[len(packed_path_high_part) + 1:]
return packed_path_high_part + bytes((middle_byte,)) + node_hash_low_part
AddressPath := Bytes20
max_possible_nibbles_depth := ??? # What is the maximum depth we should account for....
IntermediatePath := Vector(Byte, length=max_possible_nibbles_depth)
EncodedPath := Union(AddressPath, IntermediatePath)
```

#### `rotate(address: Bytes20, location: Bytes32) -> Bytes32`

This function is used to rotate the content-id values of contract storage trie
nodes in the DHT space, in order to ensure that contract storage tries are
evenly distributed accross the DHT key space.


```python
U256_MAX = 2**256 - 1

def rotate(address: Bytes20, location: Bytes32) -> Bytes32:
"""
Location is a 32-byte value representation a location in the DHT key space.
"""
rotation_amount = keccak256(address)
rotated_location = rotation_amount + location

# The modulo here ensures that the resulting location is constrained
# appropriately to the 32-byte DHT key space.
return rotated_location % U256_MAX
```

#### Account Trie Node


```
account_trie_node_key := Container(path: Nibbles, node_hash: Bytes32)
account_trie_node_key := Container(encoded_path: EncodedPath, node_hash: Bytes32)
selector := 0x20
content_for_offer := Container(proof: MPTWitness, block_hash: Bytes32)
content_for_retrieval := Container(node: WitnessNode)
content_id := construct_trie_node_content_id(path: Nibbles, node_hash: Bytes32)
content_key := selector + SSZ.serialize(account_trie_node_key)
content_id := sha256(content_key)
```

#### Contract Trie Node *
Expand All @@ -303,8 +166,8 @@ selector := 0x21
content_for_offer := Container(account_proof: MPTWitness, storage_proof: MPTWitness, block_hash: Bytes32)
content_for_retrieval := Container(node: WitnessNode)
content_id := rotate(storage_trie_node_key.address: Address, construct_trie_node_content_id(storage_trie_node_key.path, storage_trie_node_key.node_hash))
content_key := selector + SSZ.serialize(storage_trie_node_key)
content_id := sha256(content_key)
```


Expand All @@ -319,8 +182,8 @@ selector := 0x22
content_for_offer := Container(code: ByteList, account_proof: MPTWitness, block_hash: Bytes32)
content_for_retrieval := Container(code: ByteList)
content_id := sha256(contract_code_key.address + contract_code_key.code_hash)
content_key := selector + SSZ.serialize(contract_code_key)
content_id := sha256(content_key)
```


Expand Down

0 comments on commit b73ede9

Please sign in to comment.