Fix ListQueues: iterator leak, unbounded CQL round-trips, and PageState after Close#9523
Open
mykaul wants to merge 4 commits intotemporalio:mainfrom
Open
Fix ListQueues: iterator leak, unbounded CQL round-trips, and PageState after Close#9523mykaul wants to merge 4 commits intotemporalio:mainfrom
mykaul wants to merge 4 commits intotemporalio:mainfrom
Conversation
Both QueueStore.ReadMessages and queueV2Store.ReadMessages rely on CQL LIMIT to bound results but don't set gocql's PageSize. Without it, gocql uses a default page size of 5000, which can cause implicit multi-round-trip fetching when the requested count differs. Setting PageSize explicitly aligns the gocql fetch size with the CQL LIMIT.
The function exhaustively fetches all pages into memory with no cap. Under pathological conditions (many thousands of task queues per build ID), this could cause unbounded memory growth. Add a 10,000 result limit and mark the query as idempotent since it is read-only.
The failingQuery mock lacked a PageSize method, causing a nil pointer dereference after PageSize was added to queueV2Store.ReadMessages.
…lose in ListQueues
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed?
Set explicit
PageSizefor Cassandra queueReadMessagesqueries — ensures CQL pagingis respected even when a LIMIT clause is present.
Add upper bound to
GetTaskQueuesByBuildIdresult set — caps results at 10,000 toprevent unbounded iteration over CQL pages.
Fix iterator leak in
ListQueues— three error paths inside the scan loop(
getQueueFromMetadata,GetPartitionForQueueV2,getMessageCountAndLastID) returnedwithout calling
iter.Close(), leaking the gocql iterator (server-side cursor andconnection resources).
Add CQL round-trip cap to
ListQueues— withALLOW FILTERING, Cassandra may returnunder-filled pages because it scans a fixed number of partitions per page and then
post-filters. The old single-page fetch could return fewer results than requested even
when more exist. The new code loops over CQL pages (up to
maxListQueuesPages = 10)until the requested
PageSizeis filled or all pages are exhausted, returning a validNextPageTokenif the cap is hit.Capture
PageState()beforeiter.Close()— the old code callediter.PageState()after
iter.Close(), which is incorrect per gocql semantics.Pre-allocate
queuesslice withmake([]QueueInfo, 0, PageSize)to avoid repeatedslice growth.
Export
TemplateGetQueueNamesQueryfor test coverage consistency with other templateconstants.
Why?
ListQueuesusesALLOW FILTERINGbecause the partition key is(queue_type, queue_name)but we filter only on
queue_type. Cassandra handles this by scanning a fixed number ofpartitions per CQL page and then discarding non-matching rows. This means a single page
fetch can return 0 matching rows even when many exist — the caller must fetch additional
pages to fill the requested page size.
The iterator leak is a resource correctness issue: on any error during row processing, the
gocql iterator was not closed, leaking a server-side cursor and potentially a connection.
How did you test it?
TestCassandraQueueV2Persistencetests pass)ListQueuesGetQueueNamesQuery— verifies that when theALLOW FILTERINGquery itself fails,iter.Close()error is properly surfaced as*serviceerror.Unavailablewith theQueueV2ListQueuesoperation namefailingQuery.PageStatemock method to support the new multi-page loopPotential risks
maxListQueuesPages = 10cap means that in extreme cases (very large number ofnon-matching partitions),
ListQueuesmay return fewer results thanPageSizewith avalid
NextPageToken. This is a deliberate trade-off to prevent unbounded CQL queries.Callers that need all results should paginate using
NextPageToken.