Skip to content

Commit

Permalink
fix(iroh-blobs): do not hit the network when downloading blobs which …
Browse files Browse the repository at this point in the history
…are complete (#2586)

## Description

Two changes to the downloader:

* Never try to download from ourselves. If the only provider node added
is our own node, fail with error "no providers".
* The actual download request flow is turned into a generator (while
keeping API compatibility for the existing `get_to_db` public function).
A new `get_to_db_in_steps` function either runs to completion if the
requested data is fully available locally, or yields a `NeedsConn`
struct at the point where it needs a network connection to proceed. The
`NeedsConn` has an `async proceed(self, conn: Connection)`, which must
be called with a connection for the actual download to start.

This two-step process allows the downloader to check if we should dial
nodes at all, or are already done without doing anything, while emitting
the exact same flow of events (because we run the same loop) to the
client.

To achieve this, `get_to_db` now uses a genawaiter generator internally.
This means that the big loop that is the iroh-blobs protocol request
flow does not have to be changed at all, only that instead of a closure
we yield and resume, which makes this much easier to integrate into an
external state machine like the downloader.

The changes needed for this for the downloader are a bit verbose because
the downloader itself is generic over a `Getter`, with impls for the
actual impl and a test impl that does not use networking; therefore the
new `NeedsConn` state has to be modeled with an additional associated
type and trait here.

This PR adds three tests:

* Downloading a missing blob from the local node fails without trying to
connect to ourselves
* Downloading an existing blob succeeds without trying to download
* Downloading an existing collection succeeds without trying to download

Closes #2575
Replaced #2576

## Notes and open questions



## Breaking changes

None, only an API addition to the public API of iroh_blobs:
`iroh_blobs::get::check_local_with_progress_if_complete`

---------

Co-authored-by: dignifiedquire <me@dignifiedquire.com>
  • Loading branch information
Frando and dignifiedquire authored Aug 5, 2024
1 parent 43ef8b6 commit 0784403
Show file tree
Hide file tree
Showing 11 changed files with 594 additions and 163 deletions.
191 changes: 134 additions & 57 deletions iroh-blobs/src/downloader.rs
Original file line number Diff line number Diff line change
Expand Up @@ -27,8 +27,12 @@
//! requests to a single node is also limited.
use std::{
collections::{hash_map::Entry, HashMap, HashSet},
collections::{
hash_map::{self, Entry},
HashMap, HashSet,
},
fmt,
future::Future,
num::NonZeroUsize,
sync::{
atomic::{AtomicU64, Ordering},
Expand All @@ -46,7 +50,7 @@ use tokio::{
sync::{mpsc, oneshot},
task::JoinSet,
};
use tokio_util::{sync::CancellationToken, time::delay_queue};
use tokio_util::{either::Either, sync::CancellationToken, time::delay_queue};
use tracing::{debug, error_span, trace, warn, Instrument};

use crate::{
Expand Down Expand Up @@ -75,13 +79,15 @@ pub struct IntentId(pub u64);
/// Trait modeling a dialer. This allows for IO-less testing.
pub trait Dialer: Stream<Item = (NodeId, anyhow::Result<Self::Connection>)> + Unpin {
/// Type of connections returned by the Dialer.
type Connection: Clone;
type Connection: Clone + 'static;
/// Dial a node.
fn queue_dial(&mut self, node_id: NodeId);
/// Get the number of dialing nodes.
fn pending_count(&self) -> usize;
/// Check if a node is being dialed.
fn is_pending(&self, node: NodeId) -> bool;
/// Get the node id of our node.
fn node_id(&self) -> NodeId;
}

/// Signals what should be done with the request when it fails.
Expand All @@ -97,20 +103,39 @@ pub enum FailureAction {
RetryLater(anyhow::Error),
}

/// Future of a get request.
type GetFut = BoxedLocal<InternalDownloadResult>;
/// Future of a get request, for the checking stage.
type GetStartFut<N> = BoxedLocal<Result<GetOutput<N>, FailureAction>>;
/// Future of a get request, for the downloading stage.
type GetProceedFut = BoxedLocal<InternalDownloadResult>;

/// Trait modelling performing a single request over a connection. This allows for IO-less testing.
pub trait Getter {
/// Type of connections the Getter requires to perform a download.
type Connection;
/// Return a future that performs the download using the given connection.
type Connection: 'static;
/// Type of the intermediary state returned from [`Self::get`] if a connection is needed.
type NeedsConn: NeedsConn<Self::Connection>;
/// Returns a future that checks the local store if the request is already complete, returning
/// a struct implementing [`NeedsConn`] if we need a network connection to proceed.
fn get(
&mut self,
kind: DownloadKind,
conn: Self::Connection,
progress_sender: BroadcastProgressSender,
) -> GetFut;
) -> GetStartFut<Self::NeedsConn>;
}

/// Trait modelling the intermediary state when a connection is needed to proceed.
pub trait NeedsConn<C>: std::fmt::Debug + 'static {
/// Proceeds the download with the given connection.
fn proceed(self, conn: C) -> GetProceedFut;
}

/// Output returned from [`Getter::get`].
#[derive(Debug)]
pub enum GetOutput<N> {
/// The request is already complete in the local store.
Complete(Stats),
/// The request needs a connection to continue.
NeedsConn(N),
}

/// Concurrency limits for the [`Downloader`].
Expand Down Expand Up @@ -280,7 +305,7 @@ pub struct DownloadHandle {
receiver: oneshot::Receiver<ExternalDownloadResult>,
}

impl std::future::Future for DownloadHandle {
impl Future for DownloadHandle {
type Output = ExternalDownloadResult;

fn poll(
Expand Down Expand Up @@ -424,10 +449,12 @@ struct IntentHandlers {
}

/// Information about a request.
#[derive(Debug, Default)]
struct RequestInfo {
#[derive(Debug)]
struct RequestInfo<NC> {
/// Registered intents with progress senders and result callbacks.
intents: HashMap<IntentId, IntentHandlers>,
progress_sender: BroadcastProgressSender,
get_state: Option<NC>,
}

/// Information about a request in progress.
Expand Down Expand Up @@ -529,7 +556,7 @@ struct Service<G: Getter, D: Dialer> {
/// Queue of pending downloads.
queue: Queue,
/// Information about pending and active requests.
requests: HashMap<DownloadKind, RequestInfo>,
requests: HashMap<DownloadKind, RequestInfo<G::NeedsConn>>,
/// State of running downloads.
active_requests: HashMap<DownloadKind, ActiveRequestInfo>,
/// Tasks for currently running downloads.
Expand Down Expand Up @@ -666,48 +693,85 @@ impl<G: Getter<Connection = D::Connection>, D: Dialer> Service<G, D> {
on_progress: progress,
};

// early exit if no providers.
if nodes.is_empty() && self.providers.get_candidates(&kind.hash()).next().is_none() {
self.finalize_download(
kind,
[(intent_id, intent_handlers)].into(),
Err(DownloadError::NoProviders),
);
return;
}

// add the nodes to the provider map
let updated = self
.providers
.add_hash_with_nodes(kind.hash(), nodes.iter().map(|n| n.node_id));
// (skip the node id of our own node - we should never attempt to download from ourselves)
let node_ids = nodes
.iter()
.map(|n| n.node_id)
.filter(|node_id| *node_id != self.dialer.node_id());
let updated = self.providers.add_hash_with_nodes(kind.hash(), node_ids);

// queue the transfer (if not running) or attach to transfer progress (if already running)
if self.active_requests.contains_key(&kind) {
// the transfer is already running, so attach the progress sender
if let Some(on_progress) = &intent_handlers.on_progress {
// this is async because it sends the current state over the progress channel
if let Err(err) = self
.progress_tracker
.subscribe(kind, on_progress.clone())
.await
{
debug!(?err, %kind, "failed to subscribe progress sender to transfer");
match self.requests.entry(kind) {
hash_map::Entry::Occupied(mut entry) => {
if let Some(on_progress) = &intent_handlers.on_progress {
// this is async because it sends the current state over the progress channel
if let Err(err) = self
.progress_tracker
.subscribe(kind, on_progress.clone())
.await
{
debug!(?err, %kind, "failed to subscribe progress sender to transfer");
}
}
entry.get_mut().intents.insert(intent_id, intent_handlers);
}
} else {
// the transfer is not running.
if updated && self.queue.is_parked(&kind) {
// the transfer is on hold for pending retries, and we added new nodes, so move back to queue.
self.queue.unpark(&kind);
} else if !self.queue.contains(&kind) {
// the transfer is not yet queued: add to queue.
hash_map::Entry::Vacant(entry) => {
tracing::warn!("is new, queue");
let progress_sender = self.progress_tracker.track(
kind,
intent_handlers
.on_progress
.clone()
.into_iter()
.collect::<Vec<_>>(),
);

let get_state = match self.getter.get(kind, progress_sender.clone()).await {
Err(_err) => {
self.finalize_download(
kind,
[(intent_id, intent_handlers)].into(),
// TODO: add better error variant? this is only triggered if the local
// store failed with local IO.
Err(DownloadError::DownloadFailed),
);
return;
}
Ok(GetOutput::Complete(stats)) => {
self.finalize_download(
kind,
[(intent_id, intent_handlers)].into(),
Ok(stats),
);
return;
}
Ok(GetOutput::NeedsConn(state)) => {
// early exit if no providers.
if self.providers.get_candidates(&kind.hash()).next().is_none() {
self.finalize_download(
kind,
[(intent_id, intent_handlers)].into(),
Err(DownloadError::NoProviders),
);
return;
}
state
}
};
entry.insert(RequestInfo {
intents: [(intent_id, intent_handlers)].into_iter().collect(),
progress_sender,
get_state: Some(get_state),
});
self.queue.insert(kind);
}
}

// store the request info
let request_info = self.requests.entry(kind).or_default();
request_info.intents.insert(intent_id, intent_handlers);
if updated && self.queue.is_parked(&kind) {
// the transfer is on hold for pending retries, and we added new nodes, so move back to queue.
self.queue.unpark(&kind);
}
}

/// Cancels a download intent.
Expand Down Expand Up @@ -860,7 +924,6 @@ impl<G: Getter<Connection = D::Connection>, D: Dialer> Service<G, D> {
) {
self.progress_tracker.remove(&kind);
self.remove_hash_if_not_queued(&kind.hash());
let result = result.map_err(|_| DownloadError::DownloadFailed);
for (_id, handlers) in intents.into_iter() {
handlers.on_finish.send(result.clone()).ok();
}
Expand Down Expand Up @@ -1082,14 +1145,9 @@ impl<G: Getter<Connection = D::Connection>, D: Dialer> Service<G, D> {
/// Panics if hash is not in self.requests or node is not in self.nodes.
fn start_download(&mut self, kind: DownloadKind, node: NodeId) {
let node_info = self.connected_nodes.get_mut(&node).expect("node exists");
let request_info = self.requests.get(&kind).expect("hash exists");

// create a progress sender and subscribe all intents to the progress sender
let subscribers = request_info
.intents
.values()
.flat_map(|state| state.on_progress.clone());
let progress_sender = self.progress_tracker.track(kind, subscribers);
let request_info = self.requests.get_mut(&kind).expect("request exists");
let progress = request_info.progress_sender.clone();
// .expect("queued state exists");

// create the active request state
let cancellation = CancellationToken::new();
Expand All @@ -1098,17 +1156,32 @@ impl<G: Getter<Connection = D::Connection>, D: Dialer> Service<G, D> {
node,
};
let conn = node_info.conn.clone();
let get_fut = self.getter.get(kind, conn, progress_sender);

// If this is the first provider node we try, we have an initial state
// from starting the generator in Self::handle_queue_new_download.
// If this not the first provider node we try, we have to recreate the generator, because
// we can only resume it once.
let get_state = match request_info.get_state.take() {
Some(state) => Either::Left(async move { Ok(GetOutput::NeedsConn(state)) }),
None => Either::Right(self.getter.get(kind, progress)),
};
let fut = async move {
// NOTE: it's an open question if we should do timeouts at this point. Considerations from @Frando:
// > at this stage we do not know the size of the download, so the timeout would have
// > to be so large that it won't be useful for non-huge downloads. At the same time,
// > this means that a super slow node would block a download from succeeding for a long
// > time, while faster nodes could be readily available.
// As a conclusion, timeouts should be added only after downloads are known to be bounded
let fut = async move {
match get_state.await? {
GetOutput::Complete(stats) => Ok(stats),
GetOutput::NeedsConn(state) => state.proceed(conn).await,
}
};
tokio::pin!(fut);
let res = tokio::select! {
_ = cancellation.cancelled() => Err(FailureAction::AllIntentsDropped),
res = get_fut => res
res = &mut fut => res
};
trace!("transfer finished");

Expand Down Expand Up @@ -1433,4 +1506,8 @@ impl Dialer for iroh_net::dialer::Dialer {
fn is_pending(&self, node: NodeId) -> bool {
self.is_pending(node)
}

fn node_id(&self) -> NodeId {
self.endpoint().node_id()
}
}
Loading

0 comments on commit 0784403

Please sign in to comment.