
~70% of block requests fail due to strn L1 routing error or timeout #42

Closed
lidel opened this issue Feb 23, 2023 · 5 comments


lidel commented Feb 23, 2023

Problem

iiuc Rhea produces ~130% of the expected HTTP errors compared with the old setup (snapshot from bifrost-gw-staging-metrics; see the Mean values):

[Screenshot: bifrost-gw staging metrics - Bifrost - Dashboards - Grafana]

Hypothesis

Look at the Mean error rates from Caboose. A success rate of 25%-30% feels painfully inefficient and may explain the end-user errors.

[Screenshot: Caboose mean error rates]

Questions / Ideas

  • Are there other/better explanations for the ~130% of expected errors in the new setup?
  • How can we avoid wasting 75% of the time? (Ideas welcome)
    • Routing errors (404s from L1s) are something the Lassie team can investigate
      • Are these CIDs truly not findable? Last time I checked, IPNI just did not see as much as the Accelerated DHT client from Kubo. Is this still the case?
    • Could Caboose keep a cache of recently/frequently failing CIDs and adjust behavior for them? For example (a rough sketch of the cooldown idea follows this list):
      • if a CID fails due to a context timeout, the next time it is requested, instead of a 19s timeout and 3 retries, spend the full 60s on a single fetch to get some timeouts over the finish line.
      • if a CID fails multiple times across the pool, add it to a cooldown bucket and return an error that bifrost-gateway could turn into a 429 (Too Many Requests) response with a Retry-After header matching the cooldown duration.
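To make the cooldown-bucket idea concrete, here is a minimal Go sketch assuming a simple in-memory failure cache; the type names, thresholds, and the ErrCoolingDown sentinel are illustrative only, not actual Caboose or bifrost-gateway APIs:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// ErrCoolingDown is a hypothetical sentinel error a Caboose-like client could
// return so the gateway can translate it into 429 + Retry-After.
type ErrCoolingDown struct {
	RetryAfter time.Duration
}

func (e *ErrCoolingDown) Error() string {
	return fmt.Sprintf("cid is cooling down, retry after %s", e.RetryAfter)
}

// failureCache tracks recent failures per CID (illustrative, not Caboose internals).
type failureCache struct {
	mu       sync.Mutex
	failures map[string]int       // CID -> failures seen across the pool
	until    map[string]time.Time // CID -> cooldown deadline
}

func newFailureCache() *failureCache {
	return &failureCache{failures: map[string]int{}, until: map[string]time.Time{}}
}

const (
	maxFailures = 3               // pool-wide failures before cooling down (assumed)
	cooldown    = 1 * time.Minute // how long to back off a repeatedly failing CID (assumed)
)

// Check returns ErrCoolingDown if the CID is currently in the cooldown bucket.
func (f *failureCache) Check(cid string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if deadline, ok := f.until[cid]; ok {
		if remaining := time.Until(deadline); remaining > 0 {
			return &ErrCoolingDown{RetryAfter: remaining}
		}
		delete(f.until, cid)
		delete(f.failures, cid)
	}
	return nil
}

// RecordFailure bumps the failure count and starts a cooldown past the threshold.
func (f *failureCache) RecordFailure(cid string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.failures[cid]++
	if f.failures[cid] >= maxFailures {
		f.until[cid] = time.Now().Add(cooldown)
	}
}

// On the gateway side, the sentinel error would map to 429 with Retry-After.
func writeCooldownResponse(w http.ResponseWriter, err error) bool {
	var cd *ErrCoolingDown
	if errors.As(err, &cd) {
		w.Header().Set("Retry-After", fmt.Sprintf("%d", int(cd.RetryAfter.Seconds())))
		http.Error(w, err.Error(), http.StatusTooManyRequests)
		return true
	}
	return false
}

func main() {
	fc := newFailureCache()
	for i := 0; i < 3; i++ {
		fc.RecordFailure("bafy...")
	}
	fmt.Println(fc.Check("bafy...")) // cid is cooling down, retry after ~1m
}
```

In this shape, Caboose would call Check before dialing any L1 and RecordFailure after a pool-wide failure; the thresholds and cooldown duration would need tuning against the staging metrics above.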

cc @willscott @aarshkshah1992 for a sanity check / ideas on how to mitigate

@willscott

  • This is an umbrella issue, and not just in Caboose

For your breakdown:

  • IPNI is continuing to work on stabilizing cascadht / the accelerated DHT client so that all queries properly return DHT results.
    • A fix was applied this morning that stabilized non-Lassie queries.
    • Filed an issue to track the ongoing work. @masih is the DRI for this.
  • Support for streaming indexer responses should help TTFB from Lassie.
    • The roll-out of v0.4.4 will help with responsiveness on cache misses and make Lassie more performant in the current setup.
  • The next Saturn L1 release will transition the 404s to 502s, as you've hoped for.

The mitigations you suggest for Caboose behavior make sense as well. I'll break those out into separate issues.

@willscott

Filed the final suggestions as #43 and #44.

I think #43 probably makes more sense once we start fetching CAR files rather than blocks.


masih commented Feb 24, 2023

About code 0, i.e. timeouts: how can one differentiate between a provider lookup timeout and a retrieval timeout? @lidel, do you have any suggestions?


lidel commented Feb 27, 2023

@masih from the perspective of the end user's HTTP client talking to ipfs.io, both are HTTP 504 (Gateway Timeout).

If we want to bubble the reason up to the end user, then the L1 and Caboose should pass the reason in the error response body.
bifrost-gateway returns a 504 with the wrapped error message in the text/plain response body:

https://github.com/ipfs/bifrost-gateway/blob/c305b3ba95dc13b06392975ab2bbb9b475e319a7/blockstore.go#L70-L87
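As a rough illustration (not the linked bifrost-gateway code), a handler can map an upstream timeout to a 504 and copy the wrapped error text into a text/plain body, which is the only place a "provider lookup" vs "retrieval" distinction would survive to the client; the helper name and error strings here are assumptions:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// writeBlockError is a hypothetical helper: timeouts become 504, and the
// wrapped error chain (e.g. "provider lookup: context deadline exceeded")
// is written as text/plain so the caller can see which stage timed out.
func writeBlockError(w http.ResponseWriter, err error) {
	status := http.StatusInternalServerError
	if errors.Is(err, context.DeadlineExceeded) {
		status = http.StatusGatewayTimeout // 504
	}
	w.Header().Set("Content-Type", "text/plain; charset=utf-8")
	w.WriteHeader(status)
	fmt.Fprintln(w, err.Error())
}
```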

lidel closed this as completed Apr 4, 2023

lidel commented Apr 4, 2023

Closing, as a lot has changed and we are now working on the CAR + Block Fetch backend, which needs fine-tuning first:

[Screenshot: bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana]

Will file new issues for this.
