Better LLM retry behavior #6557
Conversation
RateLimitError,
ServiceUnavailableError,
503 is a transitory error, we could probably keep it?
Hmm. It's transitory but also unexpected...
I'm open to it but I lean towards telling the user their LLM is flaking out rather than OpenHands looking like it's slow
I kinda agree with you actually. We've always had a problem understanding our retry settings, because it's a bit weird to figure out a sensible default for "unexpected stuff happened".
And now we do allow the user to continue normally after reporting the error.
eval is the exception, I'd love to hear from Xingyao on that.
There are some issues on litellm about this: the exceptions as defined mix permanent and transitory errors from the provider. We have some weird code due to that. I would agree that cleaning them up and starting again is reasonable. 😅
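For illustration, here is a minimal sketch of how one could split transitory from permanent provider errors before deciding to retry. The exception classes come from litellm, but the particular classification below is an assumption made for the example, not the project's final list:

```python
# Illustrative only: a rough split of litellm exceptions into "retry" vs "fail fast".
# Which exceptions belong in each bucket is an assumption for this sketch.
from litellm.exceptions import (
    APIConnectionError,
    AuthenticationError,
    BadRequestError,
    RateLimitError,
)

TRANSIENT = (
    RateLimitError,       # 429: worth waiting out
    APIConnectionError,   # network blips
)
PERMANENT = (
    AuthenticationError,  # 401: retrying will never help
    BadRequestError,      # 400: the request itself is wrong
)


def should_retry(exc: Exception) -> bool:
    """Retry only errors the provider is likely to recover from."""
    return isinstance(exc, TRANSIENT) and not isinstance(exc, PERMANENT)
```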
Small related detail, there's a try/except due to retries in
Please see also a small follow-up here:
Thanks @enyst! Any lingering issues here?
I think it would be great if @xingyaoww can take a look, because it's possible that the removed exceptions are relevant in evals.
Up to you.
Let's do this now to unblock stuff -- I'll probably make some of these handling specific for evaluation when I run into them :)
Fixes All-Hands-AI#6942
Removed in All-Hands-AI#6557
Co-authored-by: Engel Nyst <enyst@users.noreply.github.com>
This part wasn't correct, unfortunately:
Tenacity counts attempts, not retries, so you actually get 3 retries after the first attempt fails. That said, it's not 5 + 10 + 20 = 35s either, because Tenacity uses binary exponential backoff. The actual total wait time is 18s: 5 + 5 + 8. If you add the waits introduced by LiteLLM's own attempts, it becomes roughly 24s, less than a minute, and that's a problem, because it's not enough time for per-minute rate-limiting blocks to reset.
This could explain some (at least 3) weird open rate-limiting OH issues where the agent keeps stopping: there's not enough time for the per-minute limit to reset. I suggest you change the values again so the waits span over 60s. Ideally, to cover all cases, one of the waits should be over 60s, because some providers also count failed attempts.
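For reference, a small sketch of how Tenacity counts attempts and produces these waits. The values (multiplier=2, min=5, max=30, 4 attempts) are illustrative choices that reproduce the 5 + 5 + 8 schedule above, not necessarily the PR's final settings:

```python
# Illustrative values only. stop_after_attempt(4) means 1 initial call + 3 retries,
# i.e. 3 waits between attempts.
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=2, min=5, max=30),
    reraise=True,  # surface the original exception once retries are exhausted
)
def call_llm():
    ...  # placeholder for the real completion call


# wait_exponential yields multiplier * 2**(attempt - 1), clamped to [min, max]:
#   after attempt 1: max(5, 2 * 1) = 5s
#   after attempt 2: max(5, 2 * 2) = 5s
#   after attempt 3: max(5, 2 * 4) = 8s
# Total sleep: 5 + 5 + 8 = 18s, which never spans a full per-minute rate-limit window.
```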
End-user friendly description of the problem this fixes or functionality that this introduces
no changelog
Give a summary of what the PR does, explaining any non-trivial design decisions
The LLM is retrying a lot of unrecoverable exceptions, which makes it look like the app is just stuck.
The current configuration also waits a total of 11 minutes (!) for a good response, not including the request time, which can add ~5-8 minutes to that total. So the app looks VERY stuck.
We could potentially move this into a config if these errors are common enough that eval needs them. CC @xingyaoww
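If that route is taken, one possible shape for it is sketched below. The field name retry_service_unavailable is invented for illustration, as are the default values; the idea is just that eval runs could opt back into retrying errors that interactive use fails fast on:

```python
# Hypothetical sketch: a config-driven retryable-exception set, so evaluation runs
# can opt back into retrying ServiceUnavailableError. Field names are invented.
from dataclasses import dataclass

from litellm.exceptions import APIConnectionError, RateLimitError, ServiceUnavailableError


@dataclass
class RetryConfig:
    num_retries: int = 4
    retry_min_wait: int = 5
    retry_max_wait: int = 30
    retry_service_unavailable: bool = False  # invented flag, off for interactive use

    def retryable_exceptions(self) -> tuple[type[Exception], ...]:
        base: tuple[type[Exception], ...] = (RateLimitError, APIConnectionError)
        if self.retry_service_unavailable:
            base += (ServiceUnavailableError,)
        return base
```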
Link of any specific issues this addresses
To run this PR locally, use the following command: