fix(kad): enforce a timeout for inbound substreams#6009
fix(kad): enforce a timeout for inbound substreams#6009teor2345 wants to merge 4 commits intolibp2p:masterfrom
Conversation
925b8ac to
36ce9b5
Compare
This comment was marked as resolved.
This comment was marked as resolved.
|
Thank you for the PR @teor2345! For outbound substreams, the timeout is implemented by using the That would also match the implementation in other protocols, as you already stated in #5981:
|
Unfortunately not (or not without a larger refactor). rust-libp2p/protocols/kad/src/handler.rs Lines 906 to 907 in 5377558 It would be possible to hold the streams in a But we can't iterate through a rust-libp2p/protocols/kad/src/handler.rs Lines 557 to 582 in 5377558 So the only solution I could find is to add a timeout field to some of the inbound substream states. ( The underlying issue is that the code couples the state of the substream with the stream of items from it. A refactor could put Is this a change that would be acceptable in a bug fix? Particularly one that other users might want backported? |
Could we not use EDIT: Or maybe poll the stream in the future for the set? |
Sure, but that doesn't deal with substream reuse, because the There are two pieces of functionality that this code needs:
Here's one possible way to implement that:
This will involve some quite weird types, like Is this a change that would be acceptable in a bug fix? Particularly one that other users might want backported? I also can't guarantee I'll have time to work on this any time soon, because the current PR code works, and fixes our downstream bug. |
|
Thank you for the follow-ups and detailed explanation @teor2345.
Opened thomaseizinger/rust-futures-bounded#8 to see if we can implement iterator for |
#6015 made the timeout for outbound substreams in kademlia configurable by introducing a `outbound_substreams_timeout` config option. There is currently an open PR #6009 that plans to also apply that timeout to inbound substream, in which case we probably want toe rename the config option to `substreams_timeout` (without the prefix). Because we plan to release soon and might not get #6009 in in-time, I propose that we already do the renaming now, to avoid breaking the API again with the next release. Pull-Request: #6076.
|
friendly ping @teor2345, I saw thomaseizinger/rust-futures-bounded#10 (comment) but just to confirm, do you plan on continuing this? |
I think that thomaseizinger/rust-futures-bounded#10 will be a better alternative once it's released. |
|
This pull request has merge conflicts. Could you please resolve them @teor2345? 🙏 |
5377558 to
5fd5603
Compare
teor2345
left a comment
There was a problem hiding this comment.
@jxs @elenaf9 this is now ready for review again!
Most of the changes are for the futures-bounded 0.3.0 upgrade, happy to split them out into a separate PR if that would help.
I've marked it as draft because the kad/StreamSet changes depend on an unreleased version of futures-bounded (they are waiting for us to test these changes).
I can test this branch with subspace, but are there any other tests you'd like to do?
This pull request has merge conflicts. Could you please resolve them @teor2345? 🙏
It really doesn't 🤣
|
|
||
| if let Poll::Ready(Some(event)) = self.inbound_substreams.poll_next_unpin(cx) { | ||
| return Poll::Ready(event); | ||
| if let Poll::Ready(Some(event_result)) = self.inbound_substreams.poll_next_unpin(cx) { | ||
| match event_result { | ||
| Ok(event) => return Poll::Ready(event), | ||
| Err(_stream_set_timeout) => { | ||
| tracing::trace!( | ||
| "Inbound substream timed out waiting for peer, send, or close" | ||
| ); | ||
| continue; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
I wasn't sure what to do with an inbound substream timeout.
We could:
- explicitly close the substream
- return some synthetic inbound timeout event
- something else???
There was a problem hiding this comment.
The current solution to log the event is fine IMO.
elenaf9
left a comment
There was a problem hiding this comment.
Sorry for the delay. Code looks good to me, but still need to do some testing.
|
|
||
| if let Poll::Ready(Some(event)) = self.inbound_substreams.poll_next_unpin(cx) { | ||
| return Poll::Ready(event); | ||
| if let Poll::Ready(Some(event_result)) = self.inbound_substreams.poll_next_unpin(cx) { | ||
| match event_result { | ||
| Ok(event) => return Poll::Ready(event), | ||
| Err(_stream_set_timeout) => { | ||
| tracing::trace!( | ||
| "Inbound substream timed out waiting for peer, send, or close" | ||
| ); | ||
| continue; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The current solution to log the event is fine IMO.
Description
In the
kadsubstream handler, outbound substreams have a 10s timeout, but inbound substreams don't have any timeout.This results in large numbers of warnings under specific heavy load conditions, which we have encountered in subspace:
After this fix, if a substream times out, it is dropped from the set. The existing reusable substream behaviour is preserved: idle substreams are replaced when the substream limit is reached.
Fixes #5981
Cleanups
The substreams are pinned during iteration, so we can't remove them from the set. But we can replace them with a new substream. Since reusable streams are replaced directly, the
Canceledstate is no longer needed.This PR also includes an upgrade to
futures-bounded0.3.0, in a separate commit.Notes & open questions
Should the substream be closed on a timeout?
The existing code doesn't close them on (most) substream errors, so this PR handles timeouts the same way, by dropping the substream without closing it.
I have been unable to replicate the specific conditions leading to this bug in a test or my local subspace node, but we've confirmed the fix works on multiple nodes in the subspace network.
Change checklist