From 051ada0786e079c928c514ac923c5b027f125f31 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Felf=C3=B6ldi=20Zsolt?= Date: Mon, 4 Nov 2024 04:35:50 +0100 Subject: [PATCH] Update EIP-7745: improve hashing scheme, simplify proving process --- EIPS/eip-7745.md | 376 +++++++++++++++++++++-------------------------- 1 file changed, 168 insertions(+), 208 deletions(-) diff --git a/EIPS/eip-7745.md b/EIPS/eip-7745.md index 53cb9e18cf4aeb..2f749c62cc6731 100644 --- a/EIPS/eip-7745.md +++ b/EIPS/eip-7745.md @@ -12,34 +12,143 @@ created: 2024-07-17 ## Abstract -Replace the fixed 2048 bit log event bloom filters in block headers with a new data structure that can adapt to the changing number of events per block and consistently guarantee a sufficiently low false positive ratio. Unlike the per-block bloom filters, the proposed structure allows searching for specific events by accessing only a small portion of the entire dataset which can also be proven with a Merkle proof, making the search both efficient and light client friendly. +Replace the fixed 2048 bit log event bloom filters in block headers with a new data structure that can adapt to the changing number of events per block and consistently guarantee a sufficiently low false positive ratio. -As an additional benefit, the new structure provides more precise position information on the potential hits (block number and transaction index) allowing the searcher to only look up a single receipt for each potential hit. +The proposed structure maps all log entries onto a global linear log index space and hashes them into a Merkle tree based on that index. It also contains a _filter map_ for every fixed length section of the log index space, hashed into the same Merkle tree. These are two dimensional sparse bit maps similar to bloom filters that allow searching for log address/topic patterns. Unlike the per-block bloom filters, they allow searching for specific events by accessing only a small portion of the entire dataset which can also be proven with a Merkle proof, making the search both efficient and light client friendly. They also provide exact position information for potential hits. Instead of signaling probable presence in a given block, potential hits derived from a _filter map_ are pointers to the log index space and can be directly used to look up the log at the given index from the Merkle tree of log entries. + +The proposed design uses the same structure for efficient local search and remote proof generation/verification, thereby simplifying implementation of provers and verifiers. It also allows validators that are not interested in either searching or proving logs to generate the log index root hash by maintaining a minimal log index state with a relatively small (hard capped) size. ## Motivation -Bloom filters are only useful as long as they are sufficiently sparse. False positive ratio rises rapidly with the number of events per filter and the density of `1` bits in the filter bit vector. In the currently existing filter each log address and topic sets 3 out of a fixed length of 2048 bits which resulted in sufficiently sparse filters in the beginning but with the increase of the block gas limits the false positive ratio soon became practically useless. +Bloom filters are only useful as long as they are sufficiently sparse. False positive ratio rises rapidly with the number of events per filter and the density of `1` bits in the filter bit vector. 
In the currently existing bloom filter each log address and topic sets 3 bits out of a fixed length of 2048, which resulted in sufficiently sparse filters in the beginning, but as block gas limits increased the false positive ratio soon made the filter practically useless. Mainnet blocks currently add over 1000 log addresses and topics on average, so the bloom filter size would need to increase about tenfold in order to achieve acceptable false positive rates again, allowing log search with a proof size of about 3 kilobytes per block, which is still rather expensive.

-Another issue with per-block filters in headers is that searching for an event requires access to the entire header chain of the searched section (5-600 bytes per block). The main purpose of making logs a part of consensus was to be able to prove these events with a low amount of data and making trustless interaction with smart contracts possible without operating a full node. While a receipt with a Merkle proof can prove the existence of such events (also worth noting here that since the merge the historical state roots structure of the beacon chain also allows proving the canonicalness of the host blocks without possessing the entire header chain), a log filter structure can prove the non-existence of more such events. The proposed structure using the proposed constants allows a trustless proof of all potential hits within a block range with at least two orders of magnitude less data than the bloom filters currently embedded in headers.

+The main purpose of making logs a part of consensus was to be able to prove these events with a low amount of data and to make trustless interaction with smart contracts possible without operating a full node. The _filter maps_ achieve this by sorting events into rows, reducing log range proof sizes by about three orders of magnitude.

## Specification

### Terms and definitions

-- _log value_: either an _address value_ or a _topic value_. Each `LOG` opcode adds one _address value_ and 0..4 _topic values_. A _log value_ is represented by a 32 byte hash which is calculated as `SHA2("A" + address)` or `SHA2("T" + topic)`
-- _log value index_: values are globally mapped to a linear index space, with a monotonically increasing _log value index_ assigned to each added _log value_. The _log values_ are added in the order of EVM execution (_address value_ first, then the _topic values_) so the logs generated in each block and in each transaction of the block occupy a continuous range in the index space, with boundary markers added to each block header and transaction receipt.
-- _log value pointer_: a global pointer to the index space that always points to the next available _log value index_ and is increased by one each time a _log value_ is added. The actual value of _log value pointer_ is stored in each block receipt and each block header after adding the _log values_ generated in the given transaction/block.
-- _filter map_: a `MAP_WIDTH` by `MAP_HEIGHT` map intended to help searching for _log values_ in a fixed `VALUES_PER_MAP` length section of the _log value index_ space. A mark is placed on the map for each _log value_. Multiple marks can be placed at the same position if multiple _log values_ are mapped onto the same position. Rows are sparsely encoded as a list of column indices where marks are placed. _Filter maps_ are indexed, the _log value index_ range covered by each _map index_ is `map_index * VALUES_PER_MAP` to `(map_index+1) * VALUES_PER_MAP - 1`.
The sub-position of a _log value_ inside the range covered by a certain map is called _value subindex_ and is calculated as `log_value_index % VALUES_PER_MAP` and used for calculating the _column index_ where a mark should be placed. Filter maps are tree hashed and referenced in each block header as `log_filter_root`. Note that after processing a block the last filter map usually covers an index range that's not fully populated with values yet and therefore it can contain less than `VALUES_PER_MAP` marks and can be further populated in subsequent blocks.
-- _filter epoch_: a group of filter maps stored in the hash tree in a way so that multiple rows of adjacent _filter maps_ with the same _row index_ can be efficiently retrieved in a single Merkle multiproof. The _map index_ range covered by each _epoch index_ is `epoch_index * MAPS_PER_EPOCH` to `(epoch_index+1) * MAPS_PER_EPOCH - 1`.
+- _log value_: either an _address value_ or a _topic value_. Each `LOG` opcode adds one _address value_ and 0..4 _topic values_. A _log value_ is represented by a 32 byte hash which is calculated as `SHA2(address)` or `SHA2(topic)`.
+- _log index_: values are globally mapped to a linear index space, with a monotonically increasing _log index_ assigned to each added _log value_. The _log values_ are added in the order of EVM execution (_address value_ first, then the _topic values_) so the logs generated in each block and in each transaction of the block occupy a continuous range in the index space. A _block delimiter_ is also added between blocks; it has its own _log index_ and is added to the Merkle tree of _log entries_ but not to the _filter maps_.
+- _log entry_: an SSZ encoded log event with position metadata (a `LogEntry` container) is added to the `log_entries` Merkle tree at the first _log index_ assigned to the event (the one assigned to the _address value_). The entries at indices assigned to _topic values_ are left empty (a `LogEntry` with all zero fields).
+- _block delimiter_: uses the same encoding as regular `LogEntry` containers but is always distinguishable from them (see `BlockDelimiterMeta` below). It is added to the `log_entries` tree before each block and has its own _log index_ assigned.
+- _filter map_: a `MAP_WIDTH` by `MAP_HEIGHT` sized sparse map intended to help searching for _log values_ in a fixed `VALUES_PER_MAP` length section of the _log index_ space. Each _log value_ is marked on the map at a row and column that depend on the _log index_ and the _log value_ itself. Rows are sparsely encoded as a list of marked column indices (in the order of occurrence, duplicates not filtered out for simplicity). Each map contains at most `VALUES_PER_MAP` marks and therefore the chance of false positives is kept at a constant low level.
+- _filter epoch_: a `MAPS_PER_EPOCH` sized group of consecutive filter maps stored in the hash tree in such a way that multiple rows of adjacent _filter maps_ with the same _row index_ can be efficiently retrieved in a single Merkle multiproof. The _log value_ to _row index_ mapping is constant during a single epoch but changes between epochs.
+
+### Consensus data format
+
+#### Block headers
+
+Beginning at the execution timestamp `FORK_TIMESTAMP`, execution clients MUST replace the `logs_bloom` field of the header schema with `log_index_root`, which is the root hash of the `LogIndex` structure after adding the logs emitted in the given block.
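+
+As a non-normative illustration of the sentence above (the helper name `seal_block_header` is only used for this sketch; `hash_tree_root` denotes the standard SSZ root function), the new header field is simply the SSZ root of the `LogIndex` container defined below:
+
+```
+def seal_block_header(header, log_index):
+    # after add_block_logs() (defined under Block processing below) has processed
+    # the block, the header commits to the whole LogIndex structure via its SSZ root
+    header.log_index_root = hash_tree_root(log_index)
+```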
+ +The resulting RLP encoding of the header is therefore: + +``` +rlp([ + parent_hash, + 0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, # ommers hash + coinbase, + state_root, + txs_root, + receipts_root, + log_index_root, + 0, # difficulty + number, + gas_limit, + gas_used, + timestamp, + extradata, + prev_randao, + 0x0000000000000000, # nonce + base_fee_per_gas, + withdrawals_root, + blob_gas_used, + excess_blob_gas, + parent_beacon_block_root, +]) +``` + +#### Container types + +``` +class LogIndex(Container): + epochs: Vector[LogIndexEpoch, MAX_EPOCH_HISTORY] + next_index: uint64 # next log index to be added + +class LogIndexEpoch(Container): + filter_maps: Vector[Vector[Bytes32, MAPS_PER_EPOCH], MAP_HEIGHT] # linear SHA2 hashes of filter map rows, zero for empty/non-existent rows + log_entries: Vector[LogEntry, MAPS_PER_EPOCH * VALUES_PER_MAP] # LogEntry containers at the first index of each log event, empty otherwise + +class LogEntry(Container): + log: Log + meta: LogMeta + +class LogMeta(Container): + block_number: uint64 + transaction_hash: Root + transaction_index: uint64 + log_in_tx_index: uint64 + +class Log(Container): + address: ExecutionAddress + topics: List[Bytes32, MAX_TOPICS_PER_LOG] + data: ByteList[MAX_LOG_DATA_SIZE] + +class BlockDelimiterEntry(Container): + dummyLog: Log # zero address and empty lists + meta: BlockDelimiterMeta + +class BlockDelimiterMeta(Container): + block_number: uint64 + block_hash: Root + timestamp: uint64 + dummy_value: uint64 # 2**64-1 +``` + +#### Log entries and block delimiters + +Log events with position metadata are added to the `log_entries` tree at the _log index_ assigned to the _address value_ of the log. This allows a simple Merkle proof of all fields of the `eth_getLogs` JSON-RPC response except for the `blockHash` which cannot be hashed into the `log_entries` tree at the time of block processing because the hash of the processed block is not known yet. In order to make `blockHash` provable too a `BlockDelimiterEntry` is added with the parent block's number and hash before adding entries of the next block. This allows the prover to either prove the _block delimiter_ of the same block number as the found _log entry_ or in case of the current head block prove that the last _log entry_ belongs to the same block number. + +#### Filter map row encoding -### Log value mapping +Each row of the filter map is encoded as a series of little endian binary encoded column indices. With the proposed value of `MAP_WIDTH = 2**32` this results in a simple and efficient encoding as a series of 4 byte values. Since the order of entries in this list should not affect the filter algorithm, for simplicity the indices are not sorted but simply appended to the list in the original order of the log events. In the rare case of column index collision (two events generating the same column index in the same row) the index is also added twice. This simplifies the in-memory maintenance of the tree. + +Note that the number of indices in a row may vary. The total number of indices in a fully populated _filter map_ is `VALUES_PER_MAP` minus the number of block delimiters which are not marked on the map. Though the average row size is about `VALUES_PER_MAP // MAP_HEIGHT`, the upper limit of individual row length is `VALUES_PER_MAP`. 
Though such a list could be represented for consensus hashing purposes as `List[Bytes4, VALUES_PER_MAP]` it would just add an unnecessary layer of complexity and extra processing time to tree hash this list since typically the entire row is required for any meaningful search operation. Therefore the rows regardless of their length are simply `SHA2` hashed and the hashes are stored in the consensus tree format. + +#### Proposed constants + +| Name | Value | +|-------------------|-------| +| MAP_WIDTH | 2**32 | +| MAP_HEIGHT | 2**12 | +| VALUES_PER_MAP | 2**16 | +| MAPS_PER_EPOCH | 2**6 | +| MAX_EPOCH_HISTORY | 2**24 | + +### Initialization and minimal state + +Since the Merkle trees are updated with increasing keys, they can be initialized with a Merkle branch of the next leaf to be updated, plus the previous leaf value itself except in case of `log_entries_branch` where this is always zero. Providing `LogIndexMinimalState` on request needs to be added to the sync protocol in order to allow freshly synced nodes to bootstrap. A validator that wants to keep its state minimal and does not want to prove historical log index data can also discard old Merkle tree nodes after update and use this structure as its log index state. + +``` +class LogIndexMinimalState: + next_index: uint64 # next log index to be added + epoch_branch: Vector[Bytes32, floorlog2(MAX_EPOCH_HISTORY)] # merkle branch of the epoch where next_index points + filter_map_rows: Vector[List[uint32, VALUES_PER_MAP], MAP_HEIGHT] # rows of the filter map where next_index points + filter_map_branches: Vector[Vector[Bytes32, floorlog2(MAPS_PER_EPOCH)], MAP_HEIGHT] # merkle branches of each row of the filter map where next_index points + log_entries_branch: Vector[Bytes32, floorlog2(MAPS_PER_EPOCH * VALUES_PER_MAP) + 3] # merkle branch of log entry where next_index points +``` + +This structure consists of `floorlog2(MAX_EPOCH_HISTORY)+MAP_HEIGHT*floorlog2(MAPS_PER_EPOCH)+floorlog2(MAPS_PER_EPOCH * VALUES_PER_MAP)` branch hashes and at most `VALUES_PER_MAP` column indices in total. With the proposed constants this amounts to `32*(24+4096*6+6+16)+4*65536 = 1050048` bytes plus encoding overhead. + +### Updating the log index + +#### Filter map row and column mapping The log value mapping functions `get_row_index` and `get_column_index` take a _log value_ and its position in the linear index space and return a position of the _filter map_ where a mark for the given _log value_ should be placed. ``` -def get_row_index(log_value_index, log_value): - map_index = log_value_index // VALUES_PER_MAP - epoch_index = map_index // MAPS_PER_EPOCH +def get_row_index(epoch_index, log_value): + # epoch_index = log_value_index // VALUES_PER_MAP // MAPS_PER_EPOCH return SHA2(log_value + uint32_to_littleEndian(epoch_index)) % MAP_HEIGHT def get_column_index(log_value_index, log_value): @@ -49,7 +158,7 @@ def get_column_index(log_value_index, log_value): return column_transform(value_subindex, transform_hash) ``` -The function `column_transform` performs a reversible quasi-random transformation on the range `0 .. MAP_WIDTH-1`. It can be realized with a series of reversible operations (modulo multiplication with odd numbers, modulo addition, binary XOR) where the second arguments of these operations are taken from different sections of the `transform_hash`. +The function `column_transform` and its inverse `reverse_transform` realize a quasi-random bijective mapping on the set of integers between `0 .. MAP_WIDTH-1`. 
They are realized with a series of reversible operations (modulo multiplication with odd numbers, modulo addition, binary XOR) where the second arguments of these operations are taken from different sections of the `transform_hash`. When adding a mark on the _filter map_ the sub-index of the `log value` inside the map (`value_subindex = log_index % VALUES_PER_MAP`) is used as an input for `column_transform`. ``` def column_transform(value_subindex, transform_hash): @@ -77,7 +186,7 @@ def reverse_transform(column_index, transform_hash): return x ``` -Note that with a non-reversible `column_transform` the searcher would need to iterate through the entire `VALUES_PER_MAP` range for each map row and searched log value and check whether there is a mark in the resulting `column_index`, indicating a potential hit. Having a matching `reverse_transform` function makes it possible to just iterate through the actual marks found in the map row and see whether they can be transformed back into a potentially valid `value_subindex`: +Note that with a non-reversible `column_transform` the searcher would need to iterate through the entire `VALUES_PER_MAP` range for each map row and searched log value and check whether there is a mark in the resulting `column_index`, indicating a potential hit. Having a matching `reverse_transform` function makes it possible to just iterate through the actual marked column indices found in the map row and see whether they can be transformed back into a potentially valid `value_subindex` that is less than `VALUES_PER_MAP`: ``` def potential_match_index(map_index, column_index, log_value): @@ -88,178 +197,59 @@ def potential_match_index(map_index, column_index, log_value): return -1 ``` -### Adding and searching for values - -For pseudocode readability the filter map rows are represented here as `filter_map_rows[map_index][row_index]` where each row is a list of column indices. Note that this does not match the consensus hash tree representation which is different for efficiency reasons (see below). 
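+
+A non-normative usage sketch (the helper name `potential_matches_in_row` is illustrative; `row` stands for the locally decoded list of marked column indices of the relevant row of one filter map):
+
+```
+def potential_matches_in_row(map_index, row, log_value):
+    # collect the log indices at which log_value may have been marked on this map;
+    # each potential match still has to be confirmed against the actual log entry
+    matches = []
+    for column_index in row:
+        match_index = potential_match_index(map_index, column_index, log_value)
+        if match_index >= 0:
+            matches.append(match_index)
+    return matches
+```
+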
- -Adding _log values_ generated by a block: +#### Block processing ``` -def add_block_logs(block): - for receipt in block.receipts: - for log in receipt.logs: - add_log_value(address_value(log.address)) +# Add all log values emitted in the block to the log index; should be called even if the block is empty +def add_block_logs(log_index, block): + if block.number > 0: + # add block delimiter entry + block_delimiter_meta = BlockDelimiterMeta(block_hash = block.parent_hash, block_number = block.number-1, timestamp = block.parent.timestamp, dummy_value = 2**64-1) + block_delimiter_entry = BlockDelimiterEntry(meta = block_delimiter_meta) + log_index.epochs[log_index.next_index // (VALUES_PER_MAP*MAPS_PER_EPOCH)].log_entries[log_index.next_index % (VALUES_PER_MAP*MAPS_PER_EPOCH)] = block_delimiter_entry + log_index.next_index += 1 + # add log entries and mark log values on filter maps + for tx_index, receipt in enumerate(block.receipts): + tx_hash = SHA3(block.transactions[tx_index]) + for log_in_tx_index, log in enumerate(receipt.logs): + log_meta = LogMeta(transaction_hash = tx_hash, block_number = block.number, transaction_index = tx_index, log_in_tx_index = log_in_tx_index) + log_entry = LogEntry(meta = log_meta, log = log) + log_index.epochs[log_index.next_index // (VALUES_PER_MAP*MAPS_PER_EPOCH)].log_entries[log_index.next_index % (VALUES_PER_MAP*MAPS_PER_EPOCH)] = log_entry + add_log_value(log_index, address_value(log.address)) for topic in log.topics: - add_log_value(topic_value(topic)) - receipt.log_value_pointer = log_value_pointer - block.log_value_pointer = log_value_pointer + add_log_value(log_index, topic_value(topic)) + block.log_index_root = hash_tree_root(log_index.root) -def add_log_value(log_value): - map_index = log_value_pointer // VALUES_PER_MAP - row_index = get_row_index(map_index // MAPS_PER_EPOCH, log_value) - column_index = get_column_index(log_value_pointer, log_value) - filter_map_rows[map_index][row_index].append(column_index) - log_value_pointer += 1 +# Mark a single log value on the filter maps +def add_log_value(log_index, log_value): + map_index = log_index.next_index // VALUES_PER_MAP + epoch_index = map_index // MAPS_PER_EPOCH + map_subindex = map_index % MAPS_PER_EPOCH + row_index = get_row_index(epoch_index, log_value) + column_index = get_column_index(log_index.next_index, log_value) + log_index.epochs[epoch_index].filter_maps[row_index][map_subindex].append(column_index) + log_index.next_index += 1 def address_value(address): - return sha2("A" + address) + return SHA2(address) def topic_value(topic): - return sha2("T" + topic) + return SHA2(topic) ``` -Searching for a _log value_ in a range of blocks: - -``` -def search_block_range(first_block_number, last_block_number, log_value): - # convert block number range to log value index range - first_index = 0 - if first_block_number > 0: - first_index = canonical_blocks(first_block_number - 1).log_value_pointer - last_index = canonical_blocks(last_block_number).log_value_pointer - 1 - return search_index_range(first_index, last_index, log_value) - -def search_index_range(first_index, last_index, log_value): - results = [] - # iterate through the filter map range where relevant logs might be marked - for map_index in range(first_index // VALUES_PER_MAP, last_index // VALUES_PER_MAP): - # select relevant row and iterate through column indices - row_index = get_row_index(map_index // MAPS_PER_EPOCH, log_value) - row = filter_map_rows[map_index][row_index] - for column_index in row: - # reverse transform column 
index and check if it is a potential match - match_index = potential_match_index(map_index, column_index, log_value) - if match_index >= first_index and match_index <= last_index: - # find receipt that generated the reverse transformed log value index - receipt = receipt_with_index(match_index) - # check if it actually lists the searched log value - if receipt != null and receipt_has_value(receipt, log_value): - results.append(receipt) - return results - -def receipt_with_index(log_value_index): - # return first canonical block receipt where receipt.log_value_pointer > log_value_index - -def receipt_has_value(receipt, log_value): - for log in receipt.logs: - if log_value == address_value(log.address): - return true - for topic in log.topics: - if log_value == topic_value(topic): - return true - return false -``` - -Note that the example pseudocode shown above realizes a search for a single log value but it is also possible to efficiently filter for specific address/topic combinations and patterns. - -### Consensus data format - -#### Block headers - -Beginning at the execution timestamp `FORK_TIMESTAMP`, execution clients MUST replace the `logs_bloom` field of the header schema with two new fields: the `log_filter_root` (the root hash of the filter map hash tree structure) and `log_value_pointer` (the first unoccupied position in the log value index space after processing the block). - -The resulting RLP encoding of the header is therefore: - -``` -rlp([ - parent_hash, - 0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, # ommers hash - coinbase, - state_root, - txs_root, - receipts_root, - log_filter_root, - log_value_pointer, - 0, # difficulty - number, - gas_limit, - gas_used, - timestamp, - extradata, - prev_randao, - 0x0000000000000000, # nonce - base_fee_per_gas, - withdrawals_root, - blob_gas_used, - excess_blob_gas, - parent_beacon_block_root, -]) -``` - -#### Receipts - -Beginning at the execution timestamp `FORK_TIMESTAMP`, execution clients MUST replace the `bloom` field of the transaction receipt schema with a new field, the `log_value_pointer` (the first unoccupied position in the log value index space after executing the transaction). - -The resulting RLP encoding of the receipt is therefore: - -``` -rlp([ - type, - post_state, - status, - cumulative_gas_used, - log_value_pointer, - logs, -]) -``` - -#### Log filter map - -Each row of the filter map is encoded as a series of little endian binary encoded column indices. With the proposed value of `MAP_WIDTH = 2**32` this results in a simple and efficient encoding as a series of 4 byte values. Since the order of entries in this list should not affect the filter algorithm, for simplicity the indices are not sorted but simply appended to the list in the original order of the log events. In the rare case of column index collision (two events generating the same column index in the same row) the index is also added twice. This simplifies the in-memory maintenance of the tree and also allows clients to maintain only a single version of the tree and easily roll back events if necessary. - -Note that the number of indices in a row of a single map may vary, the global average number of entries per row is `VALUES_PER_MAP / MAP_HEIGHT` while the upper limit is `VALUES_PER_MAP`. 
Though such a list could be represented for consensus hashing purposes as `List[Bytes4, VALUES_PER_MAP]` it would just add an unnecessary layer of complexity to tree hash this list since typically the entire row is required for any meaningful search operation. Therefore the rows regardless of their length are simply `SHA2` hashed and the hashes are stored in the consensus tree format. - -#### Log filter tree - -The `SHA2` hashes of encoded map rows are stored in `log_filter_tree` which is defined as an SSZ binary Merkle tree structured as a list of vector of vectors: - -``` - log_filter_tree: List[Vector[Vector[Bytes32, MAPS_PER_EPOCH], MAP_HEIGHT], MAX_EPOCH_HISTORY] -``` - -The hash of row `row_index` of map `map_index` is stored in `log_filter_tree[map_index // MAPS_PER_EPOCH][row_index][map_index % MAPS_PER_EPOCH]`. The length of the list of epoch subtrees is always in direct relationship with `log_value_pointer`: - -``` -epoch_count = (log_value_pointer + VALUES_PER_MAP * MAPS_PER_EPOCH - 1) // VALUES_PER_MAP // MAPS_PER_EPOCH -``` - -When a new epoch subtree is added, it should be initialized by filling it with the `SHA2` hash of an empty string which is the canonical encoding of a map row with no marks in it. - -### Proposed constants - -- `MAP_WIDTH`: `2**32` -- `MAP_HEIGHT`: `2**12` -- `VALUES_PER_MAP`: `2**16` -- `MAPS_PER_EPOCH`: `2**6` -- `MAX_EPOCH_HISTORY`: `2**24` - ## Rationale -### Two dimensional mapping - -The currently existing bloom filter maps the log events of each block onto a one dimensional bit vector. Searching in such vectors requires accessing and processing the entire vectors from each block in the search range while only a few bits of each of those vectors are actually required in order to search for a specific value. To achieve higher bandwidth efficiency, log data from multiple blocks is collected and mapped in two dimensions where the filter data generated by different log values are sorted into a fixed number of rows. The same log value is mapped onto the same row index during a sufficiently long period (_filter epoch_), making it possible to access only the interesting rows and making a typical search for a small number of values in a long range of blocks more efficient by multiple orders of magnitude. +### Log index space -### Log value index space +In each block a varying number of _log values_ are emitted. In addition to inefficient search, another drawback of per-block fixed size bloom filters is the varying filter utilization leading to over-utilized filters giving many false positives in some blocks and/or wastefully under-utilized filters in some blocks. Block gas limits also tend to change significantly over the long term so any future-proof solution has to be able to adapt to the varying number of _log values_ per block. -In each block a varying number of log values are emitted. In addition to inefficient search, another drawback of per-block fixed size bloom filters is the varying filter utilization leading to over-utilized filters giving many false positives in some blocks and/or wastefully under-utilized filters in some blocks. Block gas limits also tend to change significantly over the long term so any future-proof solution has to be able to adapt to the varying number of _log values_ per block. 
-
-Since _filter maps_ of the proposed design are no longer associated with individual blocks but contain filter data from multiple blocks, it is possible to detach them from block boundaries and put a fixed number of _log values_ into each _filter map_. The horizontal mapping realized by `column_transform` and `reverse_transform` ensures that potential hits returned by a search give not only the index of the _filter map_ but the exact _log value index_ where the searched _log value_ was probably added. Since a `log_value_pointer` field is added to each header and transaction receipt, it is always possible to find the exact transaction and receipt where the _log value_ has probably been added and a canonical receipt covering the _log value index_ of the potential hit always proves whether it was a real hit or a false positive.
+Since _filter maps_ of the proposed design are no longer associated with individual blocks but contain filter data from multiple blocks, it is possible to detach them from block boundaries and put a fixed number of _log values_ into each _filter map_. The horizontal mapping realized by `column_transform` and `reverse_transform` ensures that potential hits returned by a search give not only the index of the _filter map_ but the exact _log index_ where the searched _log value_ was probably added, which can be directly used to look up the single log entry from `log_entries`.

Mapping _log values_ on their own linear index space ensures uniform filter utilization of identically structured _filter maps_. Compared to the alternative of constantly changing adaptive filter size, this approach greatly simplifies the storage and tree hashing scheme and the construction of Merkle multiproofs covering longer block ranges. It also allows mapping each address and topic value separately to consecutive indices and implementing specific address/topic pattern filters.

### Sparse horizontal mapping

-False positive rate depends on map density and number of marks placed on the map per log value. It can be estimated as `FALSE_POSITIVE_RATE = (VALUES_PER_MAP * MARKS_PER_VALUE / MAP_WIDTH / MAP_HEIGHT) ** MARKS_PER_VALUE`. The proposed data structure achieves a low false positive rate by choosing a high `MAP_WIDTH` while only adding one mark per value. An alternative would be using multiple marks per log value and a lower `MAP_WIDTH`. With a fixed target `FALSE_POSITIVE_RATE` the necessary `MAP_WIDTH` can be calculated as follows:
+False positive rate depends on map density and number of marks placed on the map per log value. It can be estimated as `FALSE_POSITIVE_RATE = (VALUES_PER_MAP * MARKS_PER_VALUE / MAP_WIDTH / MAP_HEIGHT) ** MARKS_PER_VALUE`. The proposed data structure achieves a low false positive rate by choosing a high `MAP_WIDTH` while only adding one mark per value. An alternative would be using multiple marks per _log value_ and a lower `MAP_WIDTH`. With a fixed target `FALSE_POSITIVE_RATE` the necessary `MAP_WIDTH` can be calculated as follows:

```
AVG_MARKS_PER_ROW = VALUES_PER_MAP * MARKS_PER_VALUE / MAP_HEIGHT
@@ -290,41 +280,11 @@ FALSE_POSITIVE_RATE = 1 / 2**28
It shows that the smaller encoding of individual mark indices can almost offset the larger number of marks but still, `MARKS_PER_VALUE = 1` results in the best data efficiency when the total size of the filter data set is considered.
The advantage of very sparse rows is even greater if we consider the data amount to be accessed when searching for a specific value which requires `MARKS_PER_VALUE` rows to be accessed and matched to each other. -### Hash tree structure - -The hash tree structure and parameters are determined as a tradeoff between two kinds of costs: the bandwidth cost of searching for a _log value_ in a long range of _filter maps_ and the processing/memory cost of updating the canonical hash tree of this structure after processing a block and adding new _log values_. This tradeoff exists because either proving or modifying multiple leaf nodes of a Merkle tree is more efficient if those nodes are located close to each other. - -Searching for a certain _log value_ requires access to a given row index that is calculated as a function of the searched value and the epoch, meaning that in each epoch the same row index of consecutive filter maps are accessed. A row-major indexing of the two-dimensional tree (`log_filter_tree[row_index][map_index]`) places these rows next to each other, allowing an efficient Merkle multiproof of an entire epoch that only contains 12 sibling nodes on the `row_index` path, 24 on the `map_index` path and the actual filter rows. - -On the other hand, _log values_ added during block processing typically placed into many different rows but occupies a continuous _log value index_ range which in most cases falls into the range of a single _filter map_. A column-major indexing (`log_filter_tree[map_index][row_index]`) places all rows of the same filter map next to each other, ensuring that during block processing typically only the subtree belonging to a single _filter map_ and the `map_index` path leading up to it needs to be re-hashed. - -In order to achieve good performance in both aspects, a third, mixed index ordering approach has been chosen for `log_filter_tree` that splits `map_index` into two parts, `epoch_index = map_index // MAPS_PER_EPOCH` and `map_subindex = map_index % MAPS_PER_EPOCH` and indexes the tree as `log_filter_tree[epoch_index][row_index][map_subindex]`. This approach allows efficient Merkle proofs like row-major ordering for entire epochs and actually performs even better when multiple epochs are proven because the expensive `epoch_index` path needs to be proven only once instead of once per epoch. Tree update hashing cost is higher than in case of column-major ordering because it hash to also update a `map_subindex` path for every updated row. As calculations below show though, the extra cost is significantly smaller than in case of row-major ordering which has to update an entire `map_index` path for every updated row. - -The calculations below are based on the proposed constants: - -|**Name** |**Formula** |**Value** | -|----------------------------|--------------------------------------------|----------| -| `epoch_index` path length | `log2(MAX_EPOCH_HISTORY)` | 24 hashes| -| `map_subindex` path length | `log2(MAPS_PER_EPOCH)` | 6 hashes | -| `map_index` path length | `log2(MAX_EPOCH_HISTORY * MAPS_PER_EPOCH)` | 30 hashes| -| `row_index` path length | `log2(MAP_HEIGHT)` | 12 hashes| -| average row length | `AVG_BITS_PER_ROW / 8` | 64 bytes | - -Merkle proof size is estimated for three different search range lengths: single map, one epoch and 64 epochs. Note that the leaves of the binary Merkle tree are hashes of actual rows but the proof contains the actual row data which has a varying size. For these calculations the average row length is used. 
|**indexing order**|**1 map** |**1 epoch** |**64 epochs** |
|------------------|---------------------|------------------------------|-----------------------------------|
|column-major |(30+12)\*32+64 = 1408|24\*32+64\*(12\*32+64) = 29440|18\*32+4096\*(12\*32+64) = 1835584 |
|row-major |(30+12)\*32+64 = 1408|(12+24)\*32+64\*64 = 5248 |64\*((12+24)\*32+64\*64) = 335872 |
|mixed order |(30+12)\*32+64 = 1408|(24+12)\*32+64\*64 = 5248 |18\*32+64\*(12\*32+64\*64) = 287296|

+### Log entries tree

-Tree update costs are estimated as a worst-case marginal cost of each additional _log value_ in terms of updated tree nodes and amount of hashed data including actual row data. Note that it is assumed here that all updated rows belong to one or two _filter maps_ and therefore the updated tree nodes leading up to these _filter maps_ are a negligible one time cost. Also note that just like in the calculations above the average encoded row data length is used for hashed data amount estimation.
+Hashing entire logs along with position info into a tree greatly simplifies the remote proving/verifying process. There is no need to separately prove the canonicalness of block headers and the receipts referenced in them; everything can be proven with Merkle proofs of a single `LogIndex` structure.

-|**indexing order**|**updated nodes**|**hashed bytes**|
|------------------|-----------------|----------------|
-|column-major |12 |12\*64+64 = 832 |
-|row-major |30+12 = 42 |42\*64+64 = 2752|
-|mixed order |12+6 = 18 |18\*64+64 = 1216|

+Storing the `log_entries` subtrees directly in their proposed merkleized format is not particularly efficient, though. This is not an issue for validators that want to maintain a minimal state; for them, updating the `log_entries` subtrees is cheap as they only need to maintain a single Merkle branch pointing to the next _log index_. Provers can implement `log_entries` efficiently by storing a subset of the Merkle tree nodes (the ones close to the root of each epoch's subtree) and generating the rest on demand based on the receipts, which encode the same data in a more compact form. This also requires storing the _log index_ of each block delimiter, along with reverse pointers from the beginning of each on-demand generated `log_entries` subtree to the corresponding block number.
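+
+As a rough, non-normative sketch of this on-demand generation (the helper name `regenerate_log_entries` and the `first_index` parameter are illustrative; `first_index` is the _log index_ of the block's first entry, i.e. the index right after the parent's block delimiter):
+
+```
+def regenerate_log_entries(block, receipts, first_index):
+    leaves = {}
+    next_index = first_index
+    for tx_index, receipt in enumerate(receipts):
+        tx_hash = SHA3(block.transactions[tx_index])
+        for log_in_tx_index, log in enumerate(receipt.logs):
+            log_meta = LogMeta(transaction_hash = tx_hash, block_number = block.number, transaction_index = tx_index, log_in_tx_index = log_in_tx_index)
+            # the entry occupies the index assigned to the address value;
+            # the indices assigned to the topic values remain empty
+            leaves[next_index] = LogEntry(log = log, meta = log_meta)
+            next_index += 1 + len(log.topics)
+    return leaves
+```
+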
## Backwards Compatibility

@@ -342,21 +302,21 @@ The existing log filter API (`eth_getLogs`, `eth_newFilter`, `eth_getFilterLogs`, `eth_getFilterChanges`)

### Safe access with a remote prover

-In order to trustlessly access relevant data with the help of a remote prover, the prover needs to
+In order to prove a complete set of matches for a given search pattern in a given block range, the prover needs to

-- prove canonical execution headers by number in order to determine the _log value index_ range belonging to the searched block number range
-- prove the relevant rows of filter maps based on map index and row index
-- prove an individual transaction receipt with a Merkle proof from a canonical execution header, based on the _log value index_ of potential matches
+- prove the _log index_ range that corresponds to the searched block number range by proving the _block delimiter_ entries of `first_block - 1` and `last_block`
+- prove the relevant rows of filter maps based on map index and row index (the verifier can determine the relevant rows in advance based on the _log values_ in the search pattern and the relevant _log index_ range)
+- prove the actual _log entry_ belonging to any potentially matching _log index_ and also the _block delimiter_ entry with the same block number if the `blockHash` of the log is needed

-The relevant parts of filter maps can be proved with SSZ multiproofs from the `log_filter_root` of the latest execution header. Proving canonical execution headers requires a canonical beacon header from which it is possible to prove any beacon state root through the `state_roots` and `historical_summary` structures, from which it is possible to prove the belonging execution block hash. In conclusion, any search with a remove prover is as safe as the client's knowledge about a recent canonical beacon header.
+Since all three steps can be realized with Merkle proofs of the same `LogIndex` structure referenced in the block headers, any search with a remote prover is as safe as the client's knowledge about the most recent available canonical block header.

### False positives

-From the filter maps a set of _potential matches_ can be derived for any block range and log value or pattern of log values. A canonical transaction receipt that covers the _log value index_ for the potential match can prove whether it was actual match or a false positive. The design guarantees that the set of potential matches includes all actual matches but it should also be ensured that an excessive amount false positives will not make the bandwidth and processing costs of the search prohibitively high.
+From the filter maps a set of _potential matches_ can be derived for any block range and _log value_ or pattern of _log values_. These matches can then be looked up in the `log_entries` trees and the actually matching logs can be added to the set of results. The design guarantees that the set of potential matches includes all actual matches but it should also be ensured that an excessive amount of false positives will not make the bandwidth and processing costs of the search prohibitively high.

False positives can happen when `reverse_transform(column_index, transform_hash)` gives a valid `potential_value_subindex` that is less than `VALUES_PER_MAP` while the `column_index` was actually generated by `column_transform(value_subindex, another_transform_hash)` where `another_transform_hash` belongs to a different log value.
By assuming a random uniform distribution of column indices the chance of a false positive can be estimated as `VALUES_PER_MAP / MAP_WIDTH = 1 / 2**16` for each mark in the map row. Since each map contains a maximum of `VALUES_PER_MAP` marks, by assuming a random uniform distribution of row indices the rate of false positives can be estimated as `VALUES_PER_MAP^2 / MAP_WIDTH / MAP_HEIGHT = 1 / 2**12` per map or `VALUES_PER_MAP / MAP_WIDTH / MAP_HEIGHT = 1 / 2**28` per individual log value. Assuming `2**11` log values per block this gives a rate of one false positive per `2**17` blocks.

-Though certain log values might be used a lot more than others and therefore the row index distribution might not be entirely uniform, changing the row mapping per _filter epoch_ ensures that even very often used log values will not raise the false positive rate of certain other log values over the long term. A deliberate attack on a certain important log value (address value in this case, since topic values can be freely used by anyone) in order to raise its false positive rate can not be ruled out entirely since with a low amount of filter data generated per log value it is always possible to "mine" another value that generates colliding filter data. The sparse quasi-randomized horizontal mapping makes this attack a lot harder though, since the column index depends on both the log value and the exact log value index, making this attack only possible for block creators who are probably offered MEV rewards for much more lucrative manipulations of the transaction set.
+Though certain _log values_ might be emitted a lot more than others and therefore the row index distribution might not be entirely uniform, changing the row mapping per _filter epoch_ ensures that even very often used _log values_ will not raise the false positive rate of certain other _log values_ over the long term. A deliberate attack on a certain important _log value_ in order to raise its false positive rate cannot be ruled out entirely since with a low amount of filter data generated per log value it is always possible to "mine" another value that generates colliding filter data. The sparse quasi-randomized horizontal mapping makes this attack a lot harder though, since the column index depends on both the _log value_ and the exact _log index_, making this attack only possible for block creators who are probably offered MEV rewards for much more lucrative manipulations of the transaction set.

## Copyright