Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Root specific lookup #84

Merged
merged 3 commits into from
Mar 21, 2019
Merged

Root specific lookup #84

merged 3 commits into from
Mar 21, 2019

Conversation

carver
Copy link
Contributor

@carver carver commented Mar 15, 2019

What was wrong?

Want to look up the data in a node without first switching the root hash, eg, update API from:

original_root_hash = trie.root_hash
trie.root_hash = target_root_hash
assert trie.get(key) == expected_value_at_target_hash
trie.root_hash = original_root_hash

to:

assert trie.get(key, target_root_hash) == expected_value_at_target_hash

How was it fixed?

Builds on #83 -- this review can focus on just the last two commits (bf726c and e0c7ff5 currently) and I won't merge until #83 is approved.

Dropped a few places where HexaryTrie was relying on self.root_hash exclusively. Keep the old behavior for backwards compatibility.

Extended a current test for checking values against old roots.

Cute Animal Picture

Cute animal picture

db = {}
trie = HexaryTrie(db=db)
for key, val in changes:
if val is None:
del trie[key]
missing_by_root[trie.root_hash].add(key)
Copy link
Contributor Author

@carver carver Mar 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this value to a set instead of just assigning = key might be a little overboard, since the trie never goes back to an old state root right now. But it seemed easy enough to keep it general right now, in case we change the input data later.

@carver carver requested a review from lithp March 15, 2019 18:49
@carver carver force-pushed the root-specific-lookup branch from e0c7ff5 to 69f872f Compare March 18, 2019 19:35
@carver
Copy link
Contributor Author

carver commented Mar 19, 2019

TODO: use a different API, since the Mapping API already has a meaning for the 2nd-position: get(key, default). Some options:

  • get(key, *, at_root_hash=None)
  • lookup(key, at_root_hash)
  • get_by_root(key, at_root_hash)

@pipermerriam
Copy link
Member

ideas

# local mutation
with trie.at_root(root_hash):
    trie.get(key)
# non-mutative to local trie
with trie.at_root(root_hash) as other:
    other.get(key)

or no context manager at all and encourage chaining.

trie.at_root(root_hash).get(key)

@carver
Copy link
Contributor Author

carver commented Mar 19, 2019

I was avoiding dealing with what happens if you try to change a trie at a different root. There are a lot of open questions about how this should work, especially if pruning is turned on:

with trie.at_root(root_hash) as trie_at_root:
  trie_at_root[b'key'] = b'val'

(It's neat that using the context manager means you get all the methods on the trie upgraded to work with any root, for ~free)

I could make trie_at_root read-only, I suppose, but that concept isn't implemented yet, so would take some extra work. Maybe the open questions go away if pruning is disallowed in combination with this feature. Have to think about it a bit.

@carver carver force-pushed the root-specific-lookup branch from 69f872f to 3bfb4ab Compare March 19, 2019 18:36
@carver carver force-pushed the root-specific-lookup branch from 3bfb4ab to fda28be Compare March 19, 2019 18:40

def test_hexary_trie_raises_on_pruning_snapshot():
db = {}
trie = HexaryTrie(db, prune=True)
Copy link
Contributor Author

@carver carver Mar 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style flub due to copy-paste. Could combine previous lines since we don't need later access to db. I'll save that fix until after review.

Copy link

@lithp lithp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have nothing but nitpicks and vague concerns, left you some feedback but feel free to :shipit: !

@@ -535,6 +542,14 @@ def squash_changes(self):
yield memory_trie
self.root_node = memory_trie.root_node

@contextlib.contextmanager
def at_root(self, at_root_hash):
if self.is_pruning:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little dangerous, though I'm not sure how it could be fixed.

I guess the idea is: we're about to yield a copy of the trie and the two versions are likely to be used concurrently, if we allowed is_pruning to be true then operations such as delete() in one trie could mess with the other trie.

I scanned through get() and set() and some of the other methods and it seems this always works, is_pruning means the underlying trie won't change out from under us, but there's also nothing which enforces that being true. This is probably just me being paranoid, but "no operations on the other trie will change our nodes" and "we don't prune away outdated nodes" don't seem like equivalent concepts.

In particular, an optimization which modified nodes in place rather than building new ones and pruning the old ones would be allowed under the is_pruning interface but would break this method.

Definitely not suggesting this PR needs to do this! But: right now there are no restrictions on self.db, a solution might involve requiring db to be something which supports snapshotting. This would remove some potential for bugs and it'd also allow us to call at_root on tries which have pruning enabled.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sad because of Piper's suggestions I prefer the last two, the fluent interface is pretty!

The first one might be better though:

# local mutation
with trie.at_root(root_hash):
    trie.get(key)

If it mutates the old trie then there's no chance for users to accidentally concurrently modify a copy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about inverting this and having the copy do something like acquire a lock on the parent trie, and the parent trie refuses to prune if that lock isn't free? That should allow more freedom of movement in this API while still providing roughly equal levels of safety.

Copy link
Contributor Author

@carver carver Mar 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular, an optimization which modified nodes in place rather than building new ones and pruning the old ones would be allowed under the is_pruning interface but would break this method.

A trie that modifies in place (removing references to old nodes from the db) would certainly be "pruning" semantically, so is_pruning should be True there.

Definitely not suggesting this PR needs to do this! But: right now there are no restrictions on self.db, a solution might involve requiring db to be something which supports snapshotting. This would remove some potential for bugs and it'd also allow us to call at_root on tries which have pruning enabled.

Yup, I am on board with doing this in future work. Having an immutable version of the trie has been on my wishlist for a long time. I listed it as a potential breaking change to make the default trie immutable: #85

The first one might be better though:

# local mutation
with trie.at_root(root_hash):
    trie.get(key)

If it mutates the old trie then there's no chance for users to accidentally concurrently modify a copy.

Here's my concern with that API. root_hash mutability can really mess with you in an async context:

class TrieUser:
    def __init__(self, trie):
        self.trie = trie

    async def look_up_old_val(self, key, root):
        trie = self.trie
        with trie.at_root(root):

            # this checks if nodes are missing from the trie
            while busted(trie):
                await fix_up_trie(trie)  # this is adding nodes to the trie

            # this is returning the key at whatever root hash `trie` has
            return trie[key]

    def trie_update(self, key, val):
        # Updates might come in while awaiting above.
        self.trie[key] = val

If trie_update() runs while await-ing fix_up_trie(), then it:

  1. modifies against a different state root than the class was initialized with
  2. sets the new state root, which affects the trie await-ing inside the context above

trie/hexary.py Outdated
@@ -8,7 +8,11 @@
keccak,
)

from eth_utils import to_list, to_tuple
from eth_utils import (
ValidationError,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: trie.exceptions has a ValidationError which might be better

import fnmatch
import itertools
import json
import os

from eth_utils import (
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick of all nitpicks: I think our convention is to alphabetize these as well, maybe that could be part of this PR?

@@ -166,13 +168,20 @@ def unread_keys(self):
def test_hexary_trie_saves_each_root():
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: This test now tests a couple different things.

Well, it seems like an oversight that it was previously only checking whether there were extra items at all! It's a nice addition that it also verifies that at least part of the older tries are accessible.

However at_root maybe deserves it's own test.

@lithp
Copy link

lithp commented Mar 20, 2019

Also, get_from_proof does the root_hash thing, maybe it could be updated as part of this PR:

py-trie/trie/hexary.py

Lines 166 to 176 in fda28be

@classmethod
def get_from_proof(cls, root_hash, key, proof):
trie = cls({})
for node in proof:
trie._set_raw_node(node)
trie.root_hash = root_hash
try:
return trie.get(key)
except KeyError as e:
raise BadTrieProof("Missing proof node with hash {}".format(e.args))

- Split big test in two
- Alphabetize some imports
- Use as_root in get_from_proof
- Use trie.exceptions.ValidationError instead of eth_utils
@carver
Copy link
Contributor Author

carver commented Mar 20, 2019

I'll hold on merging, in case someone still wants to convince me that mutable context manager is the better API. :)

@pipermerriam
Copy link
Member

pipermerriam commented Mar 20, 2019

pinged you in chat but will do so here too.

Re-suggesting my lock-based approach as I'm not sure you saw it:

with trie.at_root(new_root) as other_trie:
    ...

Implementation Details:

  • other_trie would be a separate object from trie, but sharing the same database.
  • other_trie would acquire a "lock" from trie which isn't released until the context exits.
  • trie will refuse to prune if this lock is not free.
  • other_trie would be created with pruning disabled.

I'm sure I'm missing something but this seems ideal.

  • allows use of the API even when pruning is enabled on the base trie
  • seems to protect against the weird concurrency issues you referred to.

@carver
Copy link
Contributor Author

carver commented Mar 20, 2019

Sorry I didn't explicitly respond, I thought it was aimed at enabling the mutable API. This is the part I had trouble imagining a good solution for:

  • trie will refuse to prune if this lock is not free.

These are the implementations of that idea I came up with so far:

  1. raise an exception if trying to prune while "lock" is held by context
  2. hold an asyncio.Lock() until context ends
  3. hold a threading.Lock() until context ends

Concerns about these approaches:

  1. Getting a surprise exception sometime later during a get or set seems worse than just disabling the option up-front, until we have a solution that actually works for a snapshot on a pruning trie.
  2. AFAICT requires all of .get(), .set() etc to be converted to async, and drop the old APIs -- I don't see the win as big enough to justify this kind of API upset
  3. If the trie is being used by an async method, using a threading.Lock() would lock up the event loop indefinitely

Any other implementations I'm missing?


Not supporting pruning seems like a relatively clean way to get this feature out now. I don't have a need for pruning in the syncing context. We can figure out how to support it if/when we need it later. Is there something I'm missing about why it needs that support right away?

Added: Any objections to me merging this and making a follow-up issue?
Otherwise, I am happy to leave it open and get back to syncing stuff.

@pipermerriam
Copy link
Member

:shipit:

@pipermerriam pipermerriam mentioned this pull request Mar 20, 2019
@carver carver merged commit 76071d2 into ethereum:master Mar 21, 2019
@carver carver deleted the root-specific-lookup branch March 21, 2019 00:38
pacrob pushed a commit to pacrob/py-trie that referenced this pull request May 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants