@CAFxX CAFxX commented Oct 7, 2025

Title

Attempt to avoid deadlocks during upserts in daily spend tables.

In our deployments we observe repeated deadlocks under load that terminate in exceptions. The documentation suggests adding Redis to accumulate writes and, primarily, to avoid concurrent write transactions. With appropriate care, however, the database should be able to handle this workload without Redis and without write coordination (which, FWIW, would not require Redis to begin with, because RDBMSes normally provide some mechanism to emulate a semaphore/mutex [1]).
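As an aside on the mutex remark, a database-level lock could be taken roughly as follows in PostgreSQL. This is only a sketch and not part of this PR: the helper names and the lock name are hypothetical, and the only database feature assumed is `pg_advisory_xact_lock(bigint)`, which takes a signed 64-bit key.

```python
import hashlib


def advisory_lock_key(name: str) -> int:
    """Map a lock name to a signed 64-bit integer (hypothetical helper).

    PostgreSQL advisory locks are keyed by a bigint, so the name is hashed
    and the first 8 bytes of the digest are read as a signed integer.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def advisory_lock_statement(name: str) -> str:
    """Build the SQL that acquires the lock inside the current transaction.

    pg_advisory_xact_lock blocks until the lock is free and releases it
    automatically when the transaction commits or rolls back.
    """
    return f"SELECT pg_advisory_xact_lock({advisory_lock_key(name)})"


# Hypothetical lock name; every writer must use the same one.
print(advisory_lock_statement("daily_spend_writer"))
```

Running the statement as the first step of each write transaction would serialize the writers, which is the coordination the documentation delegates to Redis.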

Relevant issues

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement - see details)
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🐛 Bug Fix
🧹 Refactoring
✅ Test

Changes

The main change is to apply a consistent ordering across transactions to the rows being updated, which is the textbook solution to deadlocks. The actual ordering does not matter much (it affects data locality, but that depends on other factors as well); what matters is that all concurrent writes use the same sort order.
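The ordering idea can be illustrated with a minimal sketch. The key layout `(entity_id, date, model)` and the function name are assumptions made for illustration and do not reflect the actual queue structure in `db_spend_update_writer.py`.

```python
from typing import Dict, List, Tuple

# Hypothetical key identifying one row of a daily spend table.
SpendKey = Tuple[str, str, str]  # (entity_id, date, model)


def ordered_updates(queued: Dict[SpendKey, float]) -> List[Tuple[SpendKey, float]]:
    """Return the queued upserts sorted by key.

    If every writer sorts with the same rule, any two transactions that
    touch the same rows lock them in the same order, so neither can hold
    a row the other needs while waiting for a row the other holds.
    """
    return sorted(queued.items(), key=lambda item: item[0])


# Two writers that queued the same rows in different orders.
writer_a = {("user-2", "2025-10-07", "gpt-4o"): 0.5, ("user-1", "2025-10-07", "gpt-4o"): 0.2}
writer_b = {("user-1", "2025-10-07", "gpt-4o"): 0.1, ("user-2", "2025-10-07", "gpt-4o"): 0.3}

order_a = [key for key, _ in ordered_updates(writer_a)]
order_b = [key for key, _ in ordered_updates(writer_b)]
print(order_a == order_b)
```

Without the sort, writer A would lock the `user-2` row first and writer B the `user-1` row first, which is the lock cycle reported by the `40P01` errors.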

A smaller secondary change adds some randomness to the wait durations before retries. When two or more transactions deadlock, the database aborts all of the deadlocked transactions at the same time. If the delay before retry is the same for all involved transactions, as it can be currently, they will all be retried at roughly the same time, which makes further deadlocks more likely, especially under load. With this change we instead wait a random time (still with exponential backoff) before retrying.
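A randomized exponential backoff of this kind can be sketched as below. The sketch uses the "full jitter" variant (a uniform draw between zero and the exponential ceiling), which is an assumption and not necessarily the formula used in this PR; the function names, defaults, and the `is_retryable` callback are hypothetical.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def jittered_backoff(
    attempt: int,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0, ceiling)


def retry_with_backoff(
    operation: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying retryable failures after a jittered delay."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not retryable.
            if attempt == max_retries or not is_retryable(exc):
                raise
            sleep(jittered_backoff(attempt))
    raise AssertionError("unreachable")
```

Because each aborted transaction draws its own delay, two transactions that failed together are unlikely to retry at the same instant, while the growing ceiling still spreads retries further apart under sustained contention.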

Other formatting changes to the modified files were produced by make format.

One test is added to check that the queued rows are sorted before being sent to the database:

tests/test_litellm/proxy/db/test_db_spend_update_writer.py::test_update_daily_spend_sorting 
[gw0] [ 91%] PASSED tests/test_litellm/proxy/db/test_db_spend_update_writer.py::test_update_daily_spend_sorting 

https://github.com/BerriAI/litellm/actions/runs/18333952249/job/52214304926?pr=15281#step:8:6926

Some other tests are failing but they seem unrelated.

Footnotes

  1. e.g. GET_LOCK in MySQL and pg_advisory_lock in PostgreSQL


vercel bot commented Oct 7, 2025

@CAFxX is attempting to deploy a commit to the CLERKIEAI Team on Vercel.

A member of the Team first needs to authorize it.


CLAassistant commented Oct 7, 2025

CLA assistant check
All committers have signed the CLA.

@CAFxX CAFxX force-pushed the cafxx-avoid-deadlock branch 2 times, most recently from 1771e42 to f735257 Compare October 7, 2025 15:31
@JehandadK

Alert type: db_exceptions
Level: High
Timestamp: 07:30:38
Message: DB read/write call failed: Error occurred during query execution:
ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(PostgresError { code: "40P01", message: "deadlock detected", severity: "ERROR", detail: Some("Process 1837691 waits for ShareLock on transaction 2775259; blocked by process 1837653.\nProcess 1837653 waits for ShareLock on transaction 2775258; blocked by process 1837691."), column: None, hint: Some("See server log for query details.") }), transient: false })[Non-Blocking]LiteLLM Prisma Client Exception - update spend logs: Error occurred during query execution:
ConnectorError(ConnectorError { user_facing_error: None, kind: QueryError(PostgresError { code: "40P01", message: "deadlock detected", severity: "ERROR", detail: Some("Process 1837691 waits for ShareLock on transaction 2775259; blocked by process 1837653.\nProcess 1837653 waits for ShareLock on transaction 2775258; blocked by process 1837691."), column: None, hint: Some("See server log for query details.") }), transient: false })
Traceback (most recent call last):
  File "/usr/lib/python3.13/site-packages/litellm/proxy/db/db_spend_update_writer.py", line 862, in _update_daily_spend
    async with prisma_client.db.batch_() as batcher:
               ~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3.13/site-packages/prisma/client.py", line 844, in __aexit__
    await self.commit()
  File "/usr/lib/python3.13/site-packages/prisma/client.py", line 820, in commit
    await self.__clien
    
Error log for deadlocks.

@CAFxX CAFxX force-pushed the cafxx-avoid-deadlock branch 5 times, most recently from 17f1257 to c7fd3f4 Compare October 8, 2025 04:37
@CAFxX CAFxX marked this pull request as ready for review October 8, 2025 05:32
@CAFxX CAFxX force-pushed the cafxx-avoid-deadlock branch from c7fd3f4 to 83674d9 Compare October 8, 2025 05:41
@CAFxX CAFxX force-pushed the cafxx-avoid-deadlock branch from 83674d9 to 8e32bac Compare October 8, 2025 05:45
@CAFxX CAFxX changed the title attempt to avoid/minimize deadlocks fix: attempt to minimize the occurrence of deadlocks Oct 9, 2025
@CAFxX CAFxX changed the title fix: attempt to minimize the occurrence of deadlocks fix: minimize the occurrence of deadlocks Oct 9, 2025