Fluffy: Only send offers to peers when content falls within radius #3242
Conversation
I've found that this change significantly improves the gossip performance. Before this change the majority of gossip requests required a nodes lookup, but after this change the majority don't. Some stats collected from running the Fluffy state bridge on mainnet for around 15 mins:
Also (almost) no DeclinedNotWithinRadius return codes.
I'm assuming that the improvement of having fewer lookups comes from the increase in the radius cache? I don't really see any other changes that could have to do with this.
And the improvement in fewer DeclinedNotWithinRadius return codes comes from the fact that you ping and check the radius.
```nim
if p.radiusCache.get(node.id).isNone():
  # Send ping to add the node to the radius cache
  (await p.ping(node)).isOkOr:
    continue
```
I'm not sure if you saw this but there was recently discussion and resulting spec change about removing the additional lookup in the NHgossip. I realized that that part was there mostly (only..?) for initial gossip.
So my plan was to remove this completely here, but keep it for the "putContent" json-rpc call. E.g. have a much simpler neighborhoodGossip call, and then a putIntoNetwork() or similar that does this more complex version with the extra lookup.
The additional ping request here is something that thus definitely would not remain in the simpler call. I'm not fully sure if we could keep it in the more complex call either though, as from my previous measurements the lookups were not really done often and it will add a delay on the total time.
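For illustration, here is a rough sketch of that split, using simplified stand-in procs (neighboursInRange, nodeLookup, sendOffer are assumptions for this example, not Fluffy's actual API): the plain gossip call only offers to peers already known from the routing table, while the putIntoNetwork-style call pays for an extra lookup first.

```nim
# Sketch only (assumed, simplified types and procs; not Fluffy's implementation).
import std/sequtils

type
  NodeId = uint64
  ContentId = uint64

proc neighboursInRange(contentId: ContentId): seq[NodeId] =
  # Stand-in for querying the local routing table for peers close to the content.
  @[0x1111'u64, 0x2222'u64]

proc nodeLookup(contentId: ContentId): seq[NodeId] =
  # Stand-in for a recursive FindNodes lookup on the network (the expensive part).
  @[0x2222'u64, 0x3333'u64]

proc sendOffer(node: NodeId, contentId: ContentId) =
  echo "offering ", contentId, " to ", node

proc neighborhoodGossip(contentId: ContentId) =
  # Simple variant: rely only on peers already in the routing table.
  for node in neighboursInRange(contentId):
    sendOffer(node, contentId)

proc putIntoNetwork(contentId: ContentId) =
  # More complex variant (e.g. for the putContent JSON-RPC call):
  # do an extra node lookup to reach peers beyond the local routing table.
  for node in deduplicate(neighboursInRange(contentId) & nodeLookup(contentId)):
    sendOffer(node, contentId)

when isMainModule:
  neighborhoodGossip(0xABCD'u64)
  putIntoNetwork(0xABCD'u64)
```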
> I'm not sure if you saw this but there was recently discussion and resulting spec change about removing the additional lookup in the NHgossip. I realized that that part was there mostly (only..?) for initial gossip. So my plan was to remove this completely here, but keep it for the "putContent" json-rpc call. E.g. have a much simpler neighborhoodGossip call, and then a putIntoNetwork() or similar that does this more complex version with the extra lookup.
That idea sounds reasonable to me. Before doing that change, we should probably check that Fluffy can populate the full routing table without too much delay at startup. I didn't really look into this part.
> The additional ping request here is something that thus definitely would not remain in the simpler call. I'm not fully sure if we could keep it in the more complex call either though, as from my previous measurements the lookups were not really done often and it will add a delay on the total time.
Actually, I agree that removing the node lookup is the right direction, and perhaps we should focus more on filling the routing table, even making the routing table larger to hold more of the network if somehow possible. I watched a talk from an IPFS developer who said one of the optimizations they do is to create a routing table that holds the entire network (the IPFS network is very large), and if I remember correctly that part only uses a few MBs of memory in total. This design greatly improves performance because you have most/all of the nodes in the network locally.
Having said that, I would say this change is still an improvement in the short term. I did notice quite a decent speed up when running the state bridge locally.
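To put rough numbers on that memory claim (all values below are assumptions picked for illustration, not measurements of IPFS or the Portal network), even a generous per-node footprint keeps a full-network table in the low-MB range:

```nim
# Back-of-the-envelope estimate: holding every known node of the network in memory.
# Both numbers are assumptions chosen only for illustration.
const
  nodesInNetwork = 20_000  # assumed upper bound on network size
  bytesPerEntry = 500      # assumed: ENR + node id + radius + bookkeeping

echo "approx. ", nodesInNetwork * bytesPerEntry / (1024 * 1024), " MiB"  # ~9.5 MiB
```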
Yes, increasing the size of the radius cache did help a bit, but I noticed a bigger improvement from adding the ping, which forces the node to be cached in the radius cache so that any future neighborhoodGossip calls can use the cached value without doing the node lookup. So the overall result is that we end up doing fewer node lookups in total across all neighborhoodGossip calls, not just for the first request to a particular node.
Yes, that's right. This was the main reason for this change, as it seems wasteful to send offers when the content is outside the radius.
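As a minimal sketch of the check in the PR title (with assumed names and 64-bit ids for brevity; real Portal content/node ids and radii are 256-bit values): an offer is only sent when the XOR distance between the content id and the peer's node id falls within the peer's advertised radius.

```nim
# Minimal sketch, not Fluffy's actual code: offer content only when it falls
# within the peer's advertised radius. uint64 ids keep the example small.
import std/tables

type
  NodeId = uint64
  ContentId = uint64
  Radius = uint64

proc distance(a: NodeId, b: ContentId): uint64 =
  # XOR metric, as used in Kademlia-style DHTs such as the Portal network.
  a xor b

proc shouldOffer(radiusCache: Table[NodeId, Radius],
                 node: NodeId, content: ContentId): bool =
  # Only offer when the peer's radius is known and the content is within it.
  # If the radius is unknown, the caller would first ping the peer to learn it.
  if node notin radiusCache:
    return false
  distance(node, content) <= radiusCache[node]

when isMainModule:
  var cache = initTable[NodeId, Radius]()
  cache[0xAAAA'u64] = 0x0FFF'u64
  echo shouldOffer(cache, 0xAAAA'u64, 0xAFFF'u64)  # true: distance 0x0555 <= 0x0FFF
  echo shouldOffer(cache, 0xAAAA'u64, 0xFFFF'u64)  # false: distance 0x5555 > 0x0FFF
```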
I did some more testing for this change comparing the gossip performance before and after. In each case, I ran a single Fluffy node, gave it two minutes to start up and build its routing table, and then ran the state bridge with 10 workers for approx. 5 minutes. Here are the results from running Fluffy built from the master branch (before changes):
Metrics:
Here are the results from running Fluffy built from this branch (after changes):
Metrics:
The performance improvement is significant and is likely related to the ping forcing the radius to get cached, plus the increase in cache size. It doesn't appear to be related to the efficiency gains from only sending offers to in-range peers, because these results show that only approx. 500 offers were sent to peers that were not in range.
@kdeme I updated this change to include the recent spec changes so that now we only do the node lookup in
Yes, I have no doubt that the increased cache + added pings to get this cache quickly filled in will improve the amount of offers it can send out, because it lowers the number of potential node lookups required before it sends those offers. The actual total value that is useful is the
The actual "efficiency" rate (accepted per successful) is also a little bit higher; not sure if that is just due to more data that was not gossiped before or not (did you send the same data to start from?).
Now the part that slightly bothers me in this solution is that it will not automatically scale well when the size of the network (= amount of nodes) increases. This currently works great because the total size of the network in terms of nodes fits in the radius cache. Now, this is probably not much of an issue as this functionality is typically used for bridge nodes, and for those we can increase the cache by a huge amount (perhaps even for regular nodes we can do this). So I am fine with merging this solution as long as we remember this (= document it somewhere, probably add some comment to the code?).
Yes, I sent the same data block range for both test scenarios. Perhaps the improved efficiency is because we no longer send content that is not in range.
Good point. For the nodes in the network which have the node lookup disabled (nodes not running with a bridge), this won't be an issue, and for this scenario the radius cache should be a bit bigger than the routing table. As the size of the network grows we should probably increase the default size of the radius cache. For nodes that run connected to a bridge node, we should allow these nodes to be 'larger' and increase the size of the radius cache (there is already a debug parameter on the cli for this) so that the scenario you describe doesn't occur in practice, due to the larger cache.
Sure, I'll add some comments to make this part clear.
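To make the documentation point concrete, here is a sketch of the scaling concern (assumed names and a simplified FIFO eviction policy, not Fluffy's actual cache implementation): once the network has more nodes than the radius cache can hold, entries get evicted and later gossip calls fall back to pings or lookups again, which is why bridge nodes should run with a much larger cache.

```nim
# Sketch only: a bounded radius cache with simplified FIFO eviction.
# The actual Fluffy cache type and its eviction policy may differ.
import std/[deques, options, tables]

type
  NodeId = uint64
  Radius = uint64
  RadiusCache = object
    capacity: int
    order: Deque[NodeId]            # insertion order, used for eviction
    entries: Table[NodeId, Radius]

proc initRadiusCache(capacity: int): RadiusCache =
  RadiusCache(capacity: capacity, order: initDeque[NodeId](),
              entries: initTable[NodeId, Radius]())

proc put(c: var RadiusCache, node: NodeId, radius: Radius) =
  if node notin c.entries:
    if c.entries.len >= c.capacity:
      # Cache is full: evict the oldest entry to make room.
      let evicted = c.order.popFirst()
      c.entries.del(evicted)
    c.order.addLast(node)
  c.entries[node] = radius

proc get(c: RadiusCache, node: NodeId): Option[Radius] =
  if node in c.entries: some(c.entries[node]) else: none(Radius)

when isMainModule:
  var cache = initRadiusCache(capacity = 2)
  cache.put(1, 100)
  cache.put(2, 200)
  cache.put(3, 300)            # more nodes than the cache can hold
  echo cache.get(1).isNone()   # true: node 1 was evicted, so a ping/lookup is needed again
```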
Changes in this PR:
- Improve neighborhoodGossip by caching more of the network in memory.
- Make the nodes lookup in neighborhoodGossip optional. Nodes lookup is disabled by default and only enabled when called via the JSON-RPC API. This update is related to the recent spec change here: Move Neighborhood Gossip's RecursiveFindNodes requirement to portal_*PutContent ethereum/portal-network-specs#394. I've added this into this PR to reduce the impact of the node lookup changes on the nodes in the network, as they won't be using the node lookup at all unless called via the JSON-RPC API.