Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documenting "unconfirmed block transfer" history #123

Open
rvagg opened this issue Sep 14, 2022 · 6 comments
Open

Documenting "unconfirmed block transfer" history #123

rvagg opened this issue Sep 14, 2022 · 6 comments
Assignees

Comments

@rvagg
Copy link
Collaborator

rvagg commented Sep 14, 2022

This is now our next highest transfer error after #120. But it's also new. Discovered last week, validated here with #121 and turned from a silent error success into an explicit failure coming out of Filclient with application-research/filclient#101. So now we get data transfer failed: unconfirmed block transfer in the events database in event_details>error.

Summary of problem: a data-transfer completes, giving no indication of errors or problems, but the blockstore doesn't have anything new. Graphsync is given the blockstore to store new blocks in and check for the pre-existence of blocks for a given transfer so it doesn't need to double-up. This error is triggered when we finish a data transfer and check the blockstore for the original requested payload CID and it isn't there. So Graphsync isn't receiving it, and it shouldn't be skipping it either.

Debugging such a transfer, we can see that graphsync gets two requests to deal with:

  1. GraphSyncResponse<id=df88263a-6e62-49db-8983-82ebd1f5d848, status=15, exts=fil/data-transfer/incoming-request/1.1|fil/data-transfer/1.1|>
  2. GraphSyncResponse<id=df88263a-6e62-49db-8983-82ebd1f5d848, status=35, exts=>

i.e. RequestPaused then RequestCancelled, and no blocks get mentioned or transferred at all for these. There’s no partial, completed, failure, or anything else.

@hannahhoward's initial take is:

I bet good money on the root of this problem: unseal errors. When retrieval markets code encounters an error, it cancels the data transfer. Based on where this error occurs in the sequence, it’s almost surely an unseal error.

There’s nothing wrong with data transfer considering the transfer successful on cancel per se ... But retrieval code should send a new voucher result explaining it was a failure first so the other side can pick up what happened. Unfortunately v1 retrieval code has no way to directly send an arbitrary voucher result. ...

... some relevant lines of code to help explain:
https://github.com/filecoin-project/go-fil-markets/blob/ad3d7d3af4351e11544db840665a6a2f34433884/retrievalmarket/impl/providerstates/provider_fsm.go#L41
https://github.com/filecoin-project/go-fil-markets/blob/ad3d7d3af4351e11544db840665a6a2f34433884/retrievalmarket/impl/providerstates/provider_fsm.go#L131
https://github.com/filecoin-project/go-fil-markets/blob/9859795f76ad654d46d460acdc829731f3d42803/retrievalmarket/impl/providerstates/provider_states.go#L69


For now we've just turned it into an explicit failure on the client side, but we need to understand the causes and figure out if we can decrease the number of these we're encountering.

@davidd8 davidd8 assigned rvagg and unassigned rvagg Feb 1, 2023
@davidd8
Copy link
Collaborator

davidd8 commented Feb 1, 2023

Marking this as a P2 for now, since "1m timeouts" are currently the most common autoretrieve error, and this is more of a data gathering task.

@jacobheun
Copy link

I was able to replicate this locally on Boost doing retrieval testing with lassie, and it was due to an issue with the unsealing RPC connection resetting between lotus and boost. It looks like the data transfer is cancelled by Boost but instead of getting that event on the client side it's seeing the unconfirmed block.

Running the retrieval:

lassie fetch -p --provider=/ip4/127.0.0.1/tcp/50000/p2p/12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ -o bafykbzacea3m7wgziavfqkppcl7wv4pos3yhtrqwhlaoqtqsyddfr6ikuljiy.car bafykbzacea3m7wgziavfqkppcl7wv4pos3yhtrqwhlaoqtqsyddfr6ikuljiy
Fetching bafykbzacea3m7wgziavfqkppcl7wv4pos3yhtrqwhlaoqtqsyddfr6ikuljiy from {12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ: [/ip4/127.0.0.1/tcp/50000]}
Querying indexer for bafykbzacea3m7wgziavfqkppcl7wv4pos3yhtrqwhlaoqtqsyddfr6ikuljiy...
Found 1 storage providers candidates from the indexer, querying it:
	12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ
Querying [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (started)...
Querying [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (connected)...
Query response from [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ]: size=7.9 MiB, price-per-byte=0, unseal-price=0, message=
Retrieving from [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (started)...
Retrieving from [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (proposed)...
Retrieving from [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (accepted)...

retrieval failed: data transfer failed: unconfirmed block transfer
Retrieval failure for [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ]: retrieval failed: data transfer failed: unconfirmed block transfer
all retrievals failed

Event Logs on Boosts side

2023-02-23 12:59:57.393   DT:Open    
2023-02-23 12:59:57.516 123ms Open New  
2023-02-23 12:59:57.690 174ms DT:NewVoucherResult    
2023-02-23 12:59:57.749 59ms DT:Accept    
2023-02-23 12:59:57.858 109ms DealAccepted Unsealing  
2023-02-23 12:59:57.906 48ms DT:PauseResponder    
2023-02-23 12:59:58.090 184ms UnsealError Failing failed to load blockstore for piece baga6ea4seaqpdt7hwh6gdcuowwvx3v5ooblgbcdbnsax3x77dtr7azqsrhxoaiy: failed to acquire shard for piece CID baga6ea4seaqpdt7hwh6gdcuowwvx3v5ooblgbcdbnsax3x77dtr7azqsrhxoaiy: failed to unseal deal 2: RPCConnectionError
2023-02-23 12:59:58.148 58ms DT:Cancel    
2023-02-23 12:59:58.334 186ms CancelComplete Errored failed to load blockstore for piece baga6ea4seaqpdt7hwh6gdcuowwvx3v5ooblgbcdbnsax3x77dtr7azqsrhxoaiy: failed to acquire shard for piece CID baga6ea4seaqpdt7hwh6gdcuowwvx3v5ooblgbcdbnsax3x77dtr7azqsrhxoaiy: failed to unseal deal 2: RPCConnectionError
2023-02-23 12:59:58.376 42ms DT:ReceiveDataError   stream reset
2023-02-23 12:59:58.549 173ms DT:CleanupComplete   stream reset

@hannahhoward
Copy link
Collaborator

So @jacobheun what I read here is that the SP had an unseal error, canceled the retrieval, but the Lassie side recorded it as success (until it saw the lack of blocks)

@hannahhoward
Copy link
Collaborator

Does that look right to you?

@hannahhoward
Copy link
Collaborator

^^^ Damn my initial hot take looks right! Who knew 6 months ago me.

@jacobheun
Copy link

So @jacobheun what I read here is that the SP had an unseal error, canceled the retrieval, but the Lassie side recorded it as success (until it saw the lack of blocks)

Yes, exactly this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants