fix: minimize the occurrence of deadlocks #15281

CAFxX · 2025-10-07T15:13:05Z

Title

Attempt to avoid deadlocks during upserts in daily spend tables.

In our deployments we are observing repeated deadlocks under load that end in exceptions. The documentation suggests adding Redis to accumulate writes and, primarily, to avoid concurrent write transactions. With appropriate care, however, the database should be able to handle this workload without Redis and without write coordination (which, FWIW, would not require Redis to begin with, because RDBMSes normally provide some mechanism to emulate a semaphore/mutex¹).

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

I have Added testing in the tests/litellm/ directory, Adding at least 1 test is a hard requirement - see details
I have added a screenshot of my new test passing locally
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🐛 Bug Fix
🧹 Refactoring
✅ Test

Changes

The main change is to provide a consistent ordering, across transactions, of the rows being updated, which is the textbook solution to deadlocks. The actual ordering does not matter much (it affects data locality, but that depends on other factors as well); what matters is that all concurrent writes use the same sort order.
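As a rough illustration of the ordering idea, here is a minimal sketch that sorts queued rows by their key before they would be sent to the database. The key layout, the `sorted_upserts` helper, and the sample values are all hypothetical and are not taken from the actual diff; the only point is that every writer derives the same order from the same keys.

```python
from typing import Dict, List, Tuple

# Hypothetical key layout: (entity_id, date, api_key, model, provider).
SpendKey = Tuple[str, str, str, str, str]


def sorted_upserts(queued: Dict[SpendKey, dict]) -> List[Tuple[SpendKey, dict]]:
    """Return queued rows in one deterministic order shared by all writers."""
    # Tuples compare element by element, so this orders by entity, then date, etc.
    return sorted(queued.items(), key=lambda item: item[0])


# Sample data (made up) queued in arbitrary insertion order.
queued = {
    ("user-b", "2025-10-07", "key-1", "gpt-4o", "openai"): {"spend": 0.2},
    ("user-a", "2025-10-08", "key-1", "gpt-4o", "openai"): {"spend": 0.1},
    ("user-a", "2025-10-07", "key-2", "gpt-4o", "openai"): {"spend": 0.3},
}

for key, row in sorted_upserts(queued):
    print(key, row)
```

If two transactions both touch the rows for `user-a` and `user-b`, each one now locks `user-a` first, so neither can hold a lock the other needs while waiting on a lock the other already holds.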

A smaller secondary change is adding a bit of randomness to the wait durations before retries. When two or more transactions deadlock, the database aborts all of the deadlocked transactions at the same time. If the delay before retry is the same for all involved transactions, as it can be currently, they will all be retried at roughly the same time, which makes further deadlocks more likely, especially under load. With this change we instead wait a random time (still with exponential backoff) before retrying.
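One common way to randomize retry delays is "full jitter" exponential backoff, sketched below. This is an assumption about the general technique, not the exact formula used in this PR; the `retry_delay` name and the `base` and `cap` defaults are hypothetical.

```python
import random


def retry_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full-jitter backoff: a uniform random delay in [0, min(cap, base * 2**attempt)]."""
    # The upper bound doubles with each attempt until it reaches the cap.
    upper = min(cap, base * (2 ** attempt))
    # Drawing uniformly below the bound spreads concurrent retries apart in time.
    return random.uniform(0.0, upper)


for attempt in range(4):
    print(attempt, round(retry_delay(attempt), 3))
```

In an async retry loop the value would typically be passed to `asyncio.sleep(retry_delay(attempt))` before the next attempt.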

Other formatting changes to the files being modified are added by make format.

One test is added to check that the queued rows are sorted before being sent to the database:

tests/test_litellm/proxy/db/test_db_spend_update_writer.py::test_update_daily_spend_sorting 
[gw0] [ 91%] PASSED tests/test_litellm/proxy/db/test_db_spend_update_writer.py::test_update_daily_spend_sorting

https://github.com/BerriAI/litellm/actions/runs/18333952249/job/52214304926?pr=15281#step:8:6926

Some other tests are failing but they seem unrelated.

¹ e.g. GET_LOCK in MySQL and pg_advisory_lock in PostgreSQL

vercel · 2025-10-07T15:13:09Z

@CAFxX is attempting to deploy a commit to the CLERKIEAI Team on Vercel.

A member of the Team first needs to authorize it.

CLAassistant · 2025-10-07T15:13:12Z

All committers have signed the CLA.

JehandadK · 2025-10-08T01:12:47Z

Alert type: db_exceptions
Level: High
Timestamp: 07:30:38
Message: DB read/write call failed: Error occurred during query execution:
ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(PostgresError { code: "40P01", message: "deadlock detected", severity: "ERROR", detail: Some("Process 1837691 waits for ShareLock on transaction 2775259; blocked by process 1837653.\nProcess 1837653 waits for ShareLock on transaction 2775258; blocked by process 1837691."), column: None, hint: Some("See server log for query details.") }), transient: false })[Non-Blocking]LiteLLM Prisma Client Exception - update spend logs: Error occurred during query execution:
ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(PostgresError { code: "40P01", message: "deadlock detected", severity: "ERROR", detail: Some("Process 1837691 waits for ShareLock on transaction 2775259; blocked by process 1837653.\nProcess 1837653 waits for ShareLock on transaction 2775258; blocked by process 1837691."), column: None, hint: Some("See server log for query details.") }), transient: false })
Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/litellm/proxy/db/db_spend_update_writer.py", line 862, in _update_daily_spend
    async with prisma_client.db.batch_() as batcher:
               ~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/site-packages/prisma/client.py", line 844, in __aexit__
    await self.commit()
  File "/usr/lib/python3.13/site-packages/prisma/client.py", line 820, in commit
    await self.__clien

Error log for deadlocks.

CAFxX force-pushed the cafxx-avoid-deadlock branch 2 times, most recently from 1771e42 to f735257 Compare October 7, 2025 15:31

CAFxX force-pushed the cafxx-avoid-deadlock branch 5 times, most recently from 17f1257 to c7fd3f4 Compare October 8, 2025 04:37

CAFxX marked this pull request as ready for review October 8, 2025 05:32

CAFxX force-pushed the cafxx-avoid-deadlock branch from c7fd3f4 to 83674d9 Compare October 8, 2025 05:41

attempt to avoid/minimize deadlocks

8e32bac

CAFxX force-pushed the cafxx-avoid-deadlock branch from 83674d9 to 8e32bac Compare October 8, 2025 05:45

CAFxX changed the title ~~attempt to avoid/minimize deadlocks~~ fix: attempt to minimize the occurrence of deadlocks Oct 9, 2025

CAFxX changed the title ~~fix: attempt to minimize the occurrence of deadlocks~~ fix: minimize the occurrence of deadlocks Oct 9, 2025
