Fix ListQueues: iterator leak, unbounded CQL round-trips, and PageState after Close#9523

Open
mykaul wants to merge 4 commits into temporalio:main from mykaul:fix-cassandra-paging

@mykaul mykaul commented Mar 15, 2026

What changed?

  1. Set explicit PageSize for Cassandra queue ReadMessages queries — ensures CQL paging
    is respected even when a LIMIT clause is present.

  2. Add upper bound to GetTaskQueuesByBuildId result set — caps results at 10,000 to
    prevent unbounded iteration over CQL pages.

  3. Fix iterator leak in ListQueues — three error paths inside the scan loop
    (getQueueFromMetadata, GetPartitionForQueueV2, getMessageCountAndLastID) returned
    without calling iter.Close(), leaking the gocql iterator (server-side cursor and
    connection resources).

  4. Add CQL round-trip cap to ListQueues — with ALLOW FILTERING, Cassandra may return
    under-filled pages because it scans a fixed number of partitions per page and then
    post-filters. The old single-page fetch could return fewer results than requested even
    when more exist. The new code loops over CQL pages (up to maxListQueuesPages = 10)
    until the requested PageSize is filled or all pages are exhausted, returning a valid
    NextPageToken if the cap is hit.

  5. Capture PageState() before iter.Close() — the old code called iter.PageState()
    after iter.Close(), which is incorrect per gocql semantics.

  6. Pre-allocate queues slice with make([]QueueInfo, 0, PageSize) to avoid repeated
    slice growth.

  7. Export TemplateGetQueueNamesQuery for test coverage consistency with other template
    constants.
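
Items 3 and 5 can be illustrated with a minimal sketch. It does not reproduce the actual change: `rowIter`, `fakeIter`, `scanQueues`, and `errBadRow` are hypothetical names that stand in for the gocql iterator and the ListQueues scan loop. The point is only the ordering: close the iterator on every error path, and read the paging state while the iterator is still open.

```go
package main

import "errors"

// errBadRow is a hypothetical error used to simulate a failing row.
var errBadRow = errors.New("bad row")

// rowIter is a hypothetical stand-in for the subset of a gocql iterator
// that a scan loop needs.
type rowIter interface {
	Scan(dest *string) bool
	PageState() []byte
	Close() error
}

// fakeIter serves rows from memory and records whether Close was called.
type fakeIter struct {
	rows   []string
	state  []byte
	closed bool
}

func (f *fakeIter) Scan(dest *string) bool {
	if len(f.rows) == 0 {
		return false
	}
	*dest, f.rows = f.rows[0], f.rows[1:]
	return true
}

func (f *fakeIter) PageState() []byte { return f.state }

func (f *fakeIter) Close() error {
	f.closed = true
	return nil
}

// scanQueues reads every row and runs process on it. The iterator is
// closed on all return paths, and the paging state is captured before Close.
func scanQueues(iter rowIter, process func(string) error) ([]string, []byte, error) {
	var names []string
	var name string
	for iter.Scan(&name) {
		if perr := process(name); perr != nil {
			// Close before returning so the iterator is not leaked.
			_ = iter.Close()
			return nil, nil, perr
		}
		names = append(names, name)
	}
	// Read the paging state while the iterator is still open.
	next := iter.PageState()
	if cerr := iter.Close(); cerr != nil {
		return nil, nil, cerr
	}
	return names, next, nil
}
```

A `defer iter.Close()` would also cover the error paths, but an explicit close keeps the `Close()` error available for the success path, which is why the sketch spells it out.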

Why?

ListQueues uses ALLOW FILTERING because the partition key is (queue_type, queue_name)
but we filter only on queue_type. Cassandra handles this by scanning a fixed number of
partitions per CQL page and then discarding non-matching rows. This means a single page
fetch can return 0 matching rows even when many exist — the caller must fetch additional
pages to fill the requested page size.
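
The bounded multi-page fetch described above might look roughly like the following sketch. It is an illustration under assumptions, not the code in this PR: `page`, `fetchPage`, and `listQueues` are hypothetical, and `fetchPage` is assumed to return at most `limit` rows that already passed the filter.

```go
package main

// page is one hypothetical CQL page: the rows that survived filtering and
// the paging state for the next page (empty when no pages remain).
type page struct {
	rows []string
	next []byte
}

// fetchPage is a hypothetical single round trip. It is assumed to return
// at most limit rows for the given paging state.
type fetchPage func(state []byte, limit int) (page, error)

// listQueues fetches pages until pageSize rows are collected, the pages
// are exhausted, or maxPages round trips have been made. The returned
// token is non-empty whenever more pages may exist.
func listQueues(fetch fetchPage, token []byte, pageSize, maxPages int) ([]string, []byte, error) {
	out := make([]string, 0, pageSize)
	for i := 0; i < maxPages; i++ {
		p, err := fetch(token, pageSize-len(out))
		if err != nil {
			return nil, nil, err
		}
		out = append(out, p.rows...)
		token = p.next
		if len(token) == 0 || len(out) >= pageSize {
			break
		}
	}
	return out, token, nil
}
```

With two empty pages followed by a page holding two matches, a cap of 10 returns both matches, while a cap of 2 returns no rows but still hands back a token from which the scan can resume.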

The iterator leak is a resource correctness issue: on any error during row processing, the
gocql iterator was not closed, leaking a server-side cursor and potentially a connection.

How did you test it?

  • built
  • run locally and tested manually — against both Cassandra and ScyllaDB
  • covered by existing tests (all TestCassandraQueueV2Persistence tests pass)
  • added new unit test ListQueuesGetQueueNamesQuery: verifies that when the
    ALLOW FILTERING query itself fails, the error from iter.Close() is surfaced as
    *serviceerror.Unavailable with the QueueV2ListQueues operation name
  • added failingQuery.PageState mock method to support the new multi-page loop

Potential risks

  • The maxListQueuesPages = 10 cap means that in extreme cases (very large number of
    non-matching partitions), ListQueues may return fewer results than PageSize with a
    valid NextPageToken. This is a deliberate trade-off to prevent unbounded CQL queries.
    Callers that need all results should paginate using NextPageToken.
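
For callers, the consequence is that an empty or short result is not an end-of-list signal; only an empty token is. A minimal caller-side sketch, assuming a hypothetical `listFunc` wrapper around a ListQueues-style call (the real request and response types are not shown here):

```go
package main

// listFunc is a hypothetical wrapper around one ListQueues-style call:
// it takes a page token and returns queue names plus the next token.
type listFunc func(token []byte) (names []string, next []byte, err error)

// listAll follows next-page tokens until the token is empty. It does not
// stop on an empty page, because a capped scan can return zero rows
// together with a valid token.
func listAll(list listFunc) ([]string, error) {
	var all []string
	var token []byte
	for {
		names, next, err := list(token)
		if err != nil {
			return nil, err
		}
		all = append(all, names...)
		if len(next) == 0 {
			return all, nil
		}
		token = next
	}
}
```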

mykaul added 3 commits March 15, 2026 13:59

Both QueueStore.ReadMessages and queueV2Store.ReadMessages rely on
CQL LIMIT to bound results but don't set gocql's PageSize. Without
it, gocql uses a default page size of 5000, which can cause implicit
multi-round-trip fetching when the requested count differs from that
default. Setting PageSize explicitly aligns the gocql fetch size with
the CQL LIMIT.

GetTaskQueuesByBuildId exhaustively fetches all pages into memory with
no cap. Under pathological conditions (many thousands of task queues
per build ID), this could cause unbounded memory growth. Add a 10,000
result limit and mark the query as idempotent since it is read-only.

The failingQuery mock lacked a PageSize method, causing a nil pointer
dereference after PageSize was added to queueV2Store.ReadMessages.
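
The round-trip effect mentioned in the first commit message can be made concrete with a little arithmetic. The helper below is a simplified model, not driver code: `fetchesNeeded` is a hypothetical name, and the model assumes the driver issues one request per page of `pageSize` rows until `rows` rows have been returned.

```go
package main

// fetchesNeeded estimates how many page requests are needed to return
// rows rows when each request carries at most pageSize rows. This is a
// simplified model; it returns 0 for non-positive inputs.
func fetchesNeeded(rows, pageSize int) int {
	if rows <= 0 || pageSize <= 0 {
		return 0
	}
	// Ceiling division without floating point.
	return (rows + pageSize - 1) / pageSize
}
```

Under this model, a LIMIT of 12,000 with a page size of 5,000 takes three requests, while a page size equal to the LIMIT takes one.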
@mykaul mykaul requested review from a team as code owners March 15, 2026 15:03
CLAassistant commented Mar 15, 2026

CLA assistant check
All committers have signed the CLA.

mykaul changed the title from "Fix cassandra paging" to "Fix ListQueues: iterator leak, unbounded CQL round-trips, and PageState after Close" Mar 15, 2026