Client: graceful way to pause/continue and deal with runtime errors? #1900

mkhorton · 2023-12-17T20:42:03Z

Hi there! Thanks for the great work on the OPTIMADE client, it's really pleasant to use, and looks very well designed. I appreciate it especially having written the (quite poorer) interface in pymatgen, which one day perhaps we can deprecate :)

While using the interface, I frequently encounter timeout errors with some databases. These might look like:

returned: [RuntimeError: 502 -

or

returned: ['RuntimeError: 500 -

This raises a few questions, and I'm not sure the best resolution:

If using --output-file, the output file will be empty when an error is encountered, even if many structures were successfully retrieved. I understand using a callback is probably the best option here.
I can't see a good way to automatically skip the first N pages, if they have already been successfully retrieved.
I'm not sure if these two return codes will trigger the automatic retry; it looks like this will only happen for a TimeoutError, and not a RuntimeError?
The automatic retry does not seem to have a backoff period. Some databases request waiting for 30s for example. Perhaps we could take inspiration from the retry library or similar, and have an automatically increasing sleep time after each retry?

Apologies if there is an existing issue for this. I did have a look but couldn't find one. If I'm mis-using the library and this is already supported somehow, I'd be glad to know! Thanks again!

The text was updated successfully, but these errors were encountered:

ml-evs · 2023-12-18T00:05:10Z

Hi @mkhorton, glad you're finding it useful -- and all good questions!

To lump the first two questions together, callbacks are the only way of really doing this right now; my most robust workflow uses callbacks to rewrite the next link of a query such that it skips to the last result it finds in the database for a given provider. This is austerely documented at here and here, but I can try to dig out the specific code I've been using if that's helpful (and maybe add it as a selectable callback). The usage of --output-file was initially meant to just replicate doing a stdio redirect but could definitely be improved/specialised (e.g., writing a JSONLines file per database queried, then only needing to read the last line to do continuation -- this could also be achieved with a callback but its probably asking a lot of a user to write this).

Just found the code for the restartable mongo callback and its not pretty (note in this case there is no filter; potentially could consider having baked in features for making per-filter databases/output files):

if __name__ == "__main__":
    import pymongo as pm
    from optimade.client import OptimadeClient
    from httpx import URL
    import urllib.parse

    client = pm.MongoClient("mongodb://localhost:27017/optimade_example")
    collection = client.optimade_example.structures
    collection.create_index("immutable_id", unique=True)
    collection.create_index("prefix")

    def insert_into_mongo(url, results):
        """Inserts data into a MongoDB collection."""
        prefix = results["meta"].get("provider", {}).get("prefix", None)
        url = URL(url)
        next_url = None
        duplicates = False
        # start = time.monotonic_ns()
        for entry in results["data"]:
            formula = entry.pop("attributes")["chemical_formula_reduced"]
            entry["chemical_formula_reduced"] = formula
            entry["prefix"] = prefix
            entry["immutable_id"] = f"{url.scheme}://{url.host}{url.path}/{entry['id']}"
            try:
                collection.insert_one(entry)
            except pm.errors.DuplicateKeyError:
                duplicates = True

        if duplicates:
            number_of_results_for_prefix = collection.count_documents(
                {"prefix": prefix}
            )
            suggested_page_offset = number_of_results_for_prefix - 1
            _next_url = results.get("links", {}).get("next")
            if isinstance(_next_url, dict):
                _next_url = _next_url.get("href")
            # If we have already reset the page offset once, don't do it again
            page_offset = urllib.parse.parse_qs(
                urllib.parse.urlparse(_next_url).query
            ).get("page_offset", [None])[0]

            if page_offset is None:
                return
            page_offset = int(page_offset)

            if _next_url and page_offset < 110:
                # Change the page offset to the suggested value using urllib.parse
                next_url = str(
                    URL(_next_url).copy_set_param("page_offset", suggested_page_offset)
                )

        if next_url:
            print(
                f"Overwriting next_url to {next_url}, existing results {suggested_page_offset + 1}"
            )
            return {"next": next_url, "advance_results": number_of_results_for_prefix}

        # (time.monotonic_ns() - start) / 1e9
        # print(f"Callback ran in {elapsed:.2f} s")

        return None

    download_structures = False

    client = OptimadeClient(
        max_results_per_provider=-1,
        # include_providers=["mpds", "omdb", "aflow"],
        callbacks=[insert_into_mongo],
    )

    all_formulae = client.get(response_fields=["chemical_formula_reduced"])

For final 2 questions:

There's no good way to retry in that scenario currently; do you think the errors you are running into are really solved with this though? Most of the runtime errors I see with the client are missing database features or buggy implementations, rather than server flakiness. Bear in mind that there is already an implicit packet retransmission "retry" in TCP that the default timeout should allow for multiple windows of. This doesn't fix your next query though...
This can definitely be added, I've just been holding off for Support request_delay meta value in client #1247, Support new meta->request_delay field #1418 and Support for OPTIMADE v1.2.x #1419 to solve this properly, as OPTIMADE 1.2 now has a field by which server's can request a backoff time in-band. If you think a simple staggered retry would solve your problems then we can definitely add it. I think I had it in an earlier version but always found that the databases that needed the retry were broken anyway, so the retry would never succeed and it made all queries significantly slower as a result.

mkhorton · 2023-12-18T21:48:05Z

Thanks @ml-evs, regarding:

This is austerely documented at here and here, but I can try to dig out the specific code I've been using if that's helpful (and maybe add it as a selectable callback).

If you do have code on hand, that'd be super helpful! otherwise I can muddle through.

do you think the errors you are running into are really solved with this though?

I think it's a mix. Some providers just need some extra time/backoff period, other providers have genuine issues. I think a good test maybe to use some filter that returns a large amount of documents, and then arbitrarily try a page with a very high page number. If it works, great, if it doesn't work, it probably suggests some underlying server issue.

If you think a simple staggered retry would solve your problems then we can definitely add it.

Difficult to say; I'd be in favor of adding it for politeness regardless, since some people might not add that request_delay field to their response. But if you've already tried it... One point of confusion for me is the flow for when a RuntimeError is encountered; is this handled the same as a TimeoutError (i.e., it will re-attempt 5 times), or does it fail immediately?

ml-evs · 2023-12-19T00:11:01Z

If you do have code on hand, that'd be super helpful! otherwise I can muddle through.

Sorry I hid it somewhat in my comment above, you should be able to expand the sentence "Just found the code..." to see the snippet.

I think it's a mix. Some providers just need some extra time/backoff period, other providers have genuine issues. I think a good test maybe to use some filter that returns a large amount of documents, and then arbitrarily try a page with a very high page number. If it works, great, if it doesn't work, it probably suggests some underlying server issue.

Large queries are still a bit of an issue; until recently we still had a whole COLSCAN going on in the reference implementation as we needed to get the number of returned entries, but instead we have now made this optional (and obey a MongoDB internal timeout). Mostly we have been getting away with this by just running sufficiently small databases with enough memory to make this access fast as I really don't want to have to mess around with cursor pagination and such in MongoDB (the OPTIMADE spec is designed such that this should be possible though). I know that the Elasticsearch-based implementations also struggle with more than 10,000 results by default unless you implement the Scroll API (which I do not have the bandwidth or expertise to do in optimade-python-tools) (see #1291). We can definitely try to be more robust to this though.

Difficult to say; I'd be in favor of adding it for politeness regardless, since some people might not add that request_delay field to their response. But if you've already tried it... One point of confusion for me is the flow for when a RuntimeError is encountered; is this handled the same as a TimeoutError (i.e., it will re-attempt 5 times), or does it fail immediately?

I'll remind myself of our current approach and consider adding this, should be straightforward.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Client: graceful way to pause/continue and deal with runtime errors? #1900

Client: graceful way to pause/continue and deal with runtime errors? #1900

mkhorton commented Dec 17, 2023

ml-evs commented Dec 18, 2023 •

edited

Loading

mkhorton commented Dec 18, 2023 •

edited

Loading

ml-evs commented Dec 19, 2023

Client: graceful way to pause/continue and deal with runtime errors? #1900

Client: graceful way to pause/continue and deal with runtime errors? #1900

Comments

mkhorton commented Dec 17, 2023

ml-evs commented Dec 18, 2023 • edited Loading

mkhorton commented Dec 18, 2023 • edited Loading

ml-evs commented Dec 19, 2023

ml-evs commented Dec 18, 2023 •

edited

Loading

mkhorton commented Dec 18, 2023 •

edited

Loading