fix: minimize the occurrence of deadlocks #15281
Title
Attempt to avoid deadlocks during upserts in daily spend tables.
In our deployments we are observing repeated deadlocks that terminate in exceptions under load. The documentation suggests adding Redis to accumulate writes and, primarily, to avoid concurrent write transactions; however, with appropriate care the database should be able to handle this workload even without Redis and without write coordination (which, for what it's worth, would not require Redis to begin with, because RDBMSes normally provide some mechanism to emulate a semaphore/mutex¹).
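To illustrate the footnote's point about RDBMS-native mutexes, a cross-process lock can be built on PostgreSQL advisory locks, which take a 64-bit signed key. A minimal sketch (the function and lock names here are illustrative, not part of LiteLLM):

```python
import hashlib

def advisory_lock_sql(name: str) -> tuple[str, int]:
    """Map a lock name to a stable signed 64-bit key and the SQL that
    acquires a PostgreSQL advisory lock on it.

    pg_advisory_lock blocks until the lock is free, giving mutex
    semantics across processes that share the same database.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    key = int.from_bytes(digest[:8], "big", signed=True)  # fits in bigint
    return "SELECT pg_advisory_lock(%s)", key

# Example: every process hashing the same name gets the same key,
# so they contend on the same database-side lock.
sql, key = advisory_lock_sql("daily_spend_flush")
```

The session-scoped lock above must be paired with `pg_advisory_unlock`; `pg_advisory_xact_lock` is a transaction-scoped variant that releases automatically at commit or rollback.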
Relevant issues
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
I have added at least 1 test in the tests/litellm/ directory (adding at least 1 test is a hard requirement - see details)
I have run make test-unit
Type
🐛 Bug Fix
🧹 Refactoring
✅ Test
Changes
The main change is enforcing a consistent ordering across transactions for the rows being updated, which is the textbook solution to deadlocks. The actual ordering does not matter much (it matters for data locality, but that depends on other factors as well); what matters is ensuring that all concurrent writes use the same sort order.
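The fix can be sketched as follows (names are illustrative, not LiteLLM's actual ones): before flushing the queued daily-spend updates, sort them by their unique key so that every concurrent writer touches rows in the same order.

```python
# Hypothetical sketch of the ordering fix. If every transaction upserts
# rows in the same key order, two transactions can still block each
# other, but they can no longer wait on each other in a cycle, so
# deadlocks cannot form.
def sort_queued_updates(queued: dict[str, dict]) -> list[tuple[str, dict]]:
    """Return the queued (key, row) pairs in a deterministic order."""
    return sorted(queued.items(), key=lambda kv: kv[0])

# Example: two writers with overlapping keys now lock rows in the same order.
writer_a = sort_queued_updates({"team-b": {"spend": 2.0}, "team-a": {"spend": 1.0}})
writer_b = sort_queued_updates({"team-a": {"spend": 0.5}, "team-b": {"spend": 3.0}})
assert [k for k, _ in writer_a] == [k for k, _ in writer_b] == ["team-a", "team-b"]
```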
A smaller, secondary change adds randomness (jitter) to the wait durations before retries. When two or more transactions deadlock, the database aborts all of them at the same time. If the retry delay is identical for all involved transactions, as it currently can be, they will all retry at roughly the same time, which, especially under load, makes further deadlocks more likely. With this change we instead wait a random amount of time (still with exponential backoff) before retrying.
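A minimal sketch of jittered exponential backoff, assuming illustrative parameter values rather than the ones in the PR:

```python
import random

def retry_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    The un-jittered bound doubles per attempt up to `cap`; the actual
    delay is drawn uniformly from [0, bound], so transactions that were
    aborted together are unlikely to retry at the same instant.
    """
    bound = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, bound)
```

Other jitter strategies exist (e.g. "equal jitter", which keeps at least half of the deterministic delay); full jitter is the simplest and spreads retries the widest.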
Other formatting changes to the files being modified are added by make format.

One test is added to check that the queued rows are sorted before being sent to the database:
https://github.com/BerriAI/litellm/actions/runs/18333952249/job/52214304926?pr=15281#step:8:6926
Some other tests are failing but they seem unrelated.
Footnotes
1. e.g. GET_LOCK in MySQL and pg_advisory_lock in PostgreSQL ↩