Cache segment metadata on the Overlord to speed up segment allocation and other task actions #17653

Open · wants to merge 26 commits into master
Conversation

@kfaraz (Contributor) commented Jan 22, 2025

Description

The Overlord performs several metadata operations on the tables druid_segments and druid_pendingSegments,
such as segment commit, allocation, upgrade and mark used / unused.

Segment allocation, in particular, involves several reads from and writes to the metadata store and can often become a bottleneck,
causing ingestion to slow down. This effect is particularly pronounced when streaming ingestion is enabled for
multiple datasources or when there is a lot of late-arriving data.

This patch adds an in-memory segment cache to the Overlord to speed up all segment metadata operations.

Design

Summary

  • Add Overlord runtime property druid.manager.segments.useCache to enable the cache (see the sample config after this list)
  • Keep cache disabled by default
  • When cache is enabled, read metadata from cache and write metadata to both metadata store and cache
  • Poll metadata store periodically in case any metadata update did not make it to the cache (should not happen under stable operational conditions)
  • Non-leader Overlords also poll the metadata store and keep the cache ready in case of failover
  • Upon becoming leader, Overlord needs to finish one poll successfully before cache can be used in transactions
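For reference, here is a minimal Overlord runtime.properties snippet using the properties described in this patch (values are illustrative; pollDuration is covered under "Lifecycle of cache" below):

```properties
# Enable the segment metadata cache on the Overlord (defaults to false)
druid.manager.segments.useCache=true
# Period at which the cache polls the metadata store (default PT1M)
druid.manager.segments.pollDuration=PT1M
```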

Segment metadata transaction with cache enabled

  • If cache is not enabled, fall back to old flow
  • If not leader, do not proceed
  • If cache is waiting for sync to finish after becoming leader, block until cache is ready
  • Acquire a lock on cache to ensure that another thread does not update it while we are reading from it.
  • Start transaction
  • Get leader term
  • Perform computations
  • For every read, just read from the cache
  • For every write
    • check if leadership has been lost or the leader term has changed; this safeguards against losing leadership in the middle of the transaction
    • if so, roll back the transaction
    • if not, record the write action to commit to the cache later
  • If transaction has succeeded, commit pending writes to cache
  • Close transaction
  • Release lock
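A condensed, hypothetical Java sketch of the flow above (illustrative names only, not the PR's actual classes, which are listed under "Code changes" below):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

// Illustrative only: shows the ordering of lock, leadership checks,
// deferred cache writes, and final cache commit described above.
class CachedTransactionFlowSketch
{
  interface MetadataWriter
  {
    void insertSegment(String segmentId);
  }

  private final ReentrantReadWriteLock cacheLock = new ReentrantReadWriteLock(true);

  <T> T runTransaction(long leaderTermAtStart, Function<MetadataWriter, T> action)
  {
    cacheLock.writeLock().lock(); // hold the cache lock for the whole transaction
    try {
      final List<Runnable> pendingCacheWrites = new ArrayList<>();
      final MetadataWriter writer = segmentId -> {
        // Every write re-checks leadership so we can roll back if it was lost
        if (getCurrentLeaderTerm() != leaderTermAtStart) {
          throw new IllegalStateException("Lost leadership mid-transaction; rolling back");
        }
        writeToMetadataStore(segmentId);
        // Do not touch the cache yet; remember the write for after commit
        pendingCacheWrites.add(() -> writeToCache(segmentId));
      };
      final T result = action.apply(writer);      // reads inside action hit only the cache
      pendingCacheWrites.forEach(Runnable::run);  // transaction succeeded: apply to cache
      return result;
    }
    finally {
      cacheLock.writeLock().unlock();
    }
  }

  private long getCurrentLeaderTerm() { return 1L; }     // placeholder
  private void writeToMetadataStore(String segmentId) {} // placeholder
  private void writeToCache(String segmentId) {}         // placeholder
}
```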

Lifecycle of cache

  • Cache is enabled if druid.manager.segments.useCache=true on the Overlord
  • When the Overlord starts, the cache's start() method is called, putting it in STANDBY mode.
  • In STANDBY mode, the cache polls the metadata store at a period of druid.manager.segments.pollDuration to do the following:
    • Retrieve all segment IDs and their last updated timestamps
    • If the cache has stale or no information for any unused segment, update it
    • If the cache has stale or no information for any used segment, fetch entire segment payload
    • Retrieve all pending segments and update cache if it has stale information
  • Cache cannot be used for transactions while it is in STANDBY mode.
  • When Overlord becomes leader, cache moves to SYNC_PENDING mode.
  • When the next poll starts, it moves the cache from SYNC_PENDING to SYNC_STARTED
  • When this poll completes, cache is marked as READY
  • In READY mode, cache can be used for read and write transactions
  • When Overlord loses leadership, cache is moved back to STANDBY mode
  • When the Overlord stops, the cache's stop() method is called, stopping the poll
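The same lifecycle, condensed into code form (a hypothetical enum; the review threads below suggest the actual constants include LEADER_FIRST_SYNC_PENDING and LEADER_FIRST_SYNC_STARTED):

```java
// Illustrative only: states and transitions as described above.
enum CacheState
{
  STANDBY,       // Overlord started; cache polls but cannot serve transactions
  SYNC_PENDING,  // Overlord became leader; waiting for the next poll to start
  SYNC_STARTED,  // first post-leadership poll is in progress
  READY          // poll completed; cache usable for read/write transactions
}
// start()          -> STANDBY
// become leader    -> SYNC_PENDING
// next poll starts -> SYNC_STARTED
// poll completes   -> READY
// lose leadership  -> STANDBY
// stop()           -> polling halted
```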

Contents of cache

The cache maintains the following fields for every datasource.

| Field | Needed In |
|-------|-----------|
| Map<String, DataSegmentPlus> idToUsedSegment | Reads/writes for used segments |
| Set<String> unusedSegmentIds | Checking the set of existing segment IDs to avoid duplicate inserts |
| Map<Interval, Map<String, Integer>> intervalVersionToHighestUnusedPartitionNumber | Segment allocation, to avoid duplicate IDs |
| Map<Interval, Map<String, PendingSegmentRecord>> intervalToPendingSegments | Segment allocation |
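As an illustration of how segment allocation might consult these structures (a hypothetical helper; Interval is the Joda-Time type used throughout Druid):

```java
import java.util.Map;
import org.joda.time.Interval;

class AllocationLookupSketch
{
  // Hypothetical: next partition number for (interval, version) that will not
  // collide with any unused segment, per the third field in the table above.
  static int nextPartitionNumber(
      Map<Interval, Map<String, Integer>> intervalVersionToHighestUnusedPartitionNumber,
      Interval interval,
      String version
  )
  {
    final Integer highest = intervalVersionToHighestUnusedPartitionNumber
        .getOrDefault(interval, Map.of())
        .get(version);
    return highest == null ? 0 : highest + 1;
  }
}
```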

Code changes

Class / Description

SegmentsMetadataManagerConfig.useCache
  • Enables/disables the cache on the Overlord
DatasourceSegmentMetadataReader
  • Interface to perform segment metadata read operations
DatasourceSegmentMetadataWriter
  • Interface to perform segment metadata write operations
HeapMemorySegmentMetadataCache
  • Polls committed and pending segments from the metadata store
DatasourceSegmentCache
  • Caches committed and pending segments of a single datasource
SegmentMetadataTransaction
  • Encapsulates all read/write operations performed within a transaction
  • This abstraction allows the code to redirect all reads/writes within a transaction to either the cache or the metadata store itself
SqlSegmentMetadataTransaction
  • Performs reads/writes directly on the metadata store when the cache is disabled or not ready
CachedSegmentMetadataTransaction
  • Performs reads only from the cache, and writes to both the metadata store and the cache
SqlSegmentMetadataTransactionFactory
  • Creates a transaction based on the state of the cache
IndexerSQLMetadataStorageCoordinator
  • Performs all metadata transactions using the transaction factory
  • Metadata read methods moved to SqlSegmentsMetadataQuery
  • Metadata write methods moved to SqlSegmentMetadataTransaction

Testing

  • Run all ITs successfully with cache enabled in this commit
  • Update existing tests to run both with and without cache
    • IndexerSQLMetadataStorageCoordinatorTest
    • SegmentAllocateActionTest
    • SegmentAllocationQueueTest
  • Add DatasourceSegmentCacheTest

Pending items

  • Cluster testing

Release note

Add Overlord runtime property druid.manager.segments.useCache (default value false).
Set this to true to turn on segment metadata caching on the Overlord, which significantly speeds up
segment metadata operations such as reads and segment allocation.

Upgrade notes

The flag druid.manager.segments.useCache to enable the segment cache should be turned on only after
Druid has been upgraded to a version containing both this patch (#17653) and #17545.
When Druid is being downgraded to an older version, the feature flag must first be turned off.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz marked this pull request as ready for review January 28, 2025 12:34
```java
// Assume that the metadata write operation succeeded
// Do not update the cache just yet, add to the list of pending writes
pendingCacheWrites.add(writer -> {
  T ignored = action.apply(writer);
```

Check notice (Code scanning / CodeQL): Unread local variable. Variable 'T ignored' is never read.
@AmatyaAvadhanula (Contributor) commented:

@kfaraz thank you for the changes.

Could you please call out the dependency on #17545 in the description?
I also think that it is important to turn off this feature flag prior to downgrades, because it is potentially dangerous if the Coordinator is downgraded to Druid version <= 31.0.1 while the Overlord briefly remains on version >= 33.x.y with this config turned on.

@kfaraz (Contributor, Author) commented Feb 5, 2025

Thanks for the suggestion, @AmatyaAvadhanula . I have added an Upgrade Notes section.

@AmatyaAvadhanula (Contributor) left a review comment:

I had a couple of questions about certain choices, which I have left as comments.
I was also hoping to understand the pros and cons of having pending segments and segments being accessed by the same ReadWriteLock in the cache.

Otherwise, the segment allocation changes look good to me.

Will try to wrap up the review of the caching mechanism soon.


```java
HeapMemoryDatasourceSegmentCache(String dataSource)
{
  super(true);
```
@AmatyaAvadhanula (Contributor):

Could you please add more details explaining the usage of a fair lock?
I recall a discussion where we wanted to change the lock in VersionedIntervalTimeline to non-fair (false) as well.

@kfaraz (Contributor, Author), Feb 6, 2025:

Sure, will add that in the javadoc.

Edit: I haven't really given much thought to whether we should go with fair or non-fair for the cache. Reading through the javadoc of ReentrantReadWriteLock, a fair lock seemed an appropriate choice. But I will take another look and document whatever we decide to do.
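For context, the fairness flag in question is the boolean ReentrantReadWriteLock constructor argument (plain JDK behavior, not specific to this PR):

```java
// true  -> fair: threads acquire the lock roughly in arrival order, which
//          avoids writer starvation at some cost in throughput.
// false -> non-fair (the default): allows barging; usually faster under contention.
ReentrantReadWriteLock fairLock = new ReentrantReadWriteLock(true);
ReentrantReadWriteLock nonFairLock = new ReentrantReadWriteLock();
```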

```java
/**
 * Not being used right now. Could allow lookup of visible segments for a given interval.
 */
private final SegmentTimeline usedSegmentTimeline = SegmentTimeline.forSegments(Set.of());
```
@AmatyaAvadhanula (Contributor):

Adding a timeline with potentially expensive operations because of its own lock seems risky and wasteful given that we are not using it.

Is there a reason we aren't using a TreeMap of Interval -> (DataSegment / SegmentId) instead?
I believe the perf impact would be significant when there are several intervals and segments.
We are creating a Timeline after fetching segments anyway.

@kfaraz (Contributor, Author):

Yeah, the timeline is not really being used in the code right now. I will just get rid of it for the time being.

@kfaraz (Contributor, Author), Feb 6, 2025:

> Is there a reason we aren't using a TreeMap of Interval -> (DataSegment / SegmentId) instead?
> I believe the perf impact would be significant when there are several intervals and segments.

Hmm, let me evaluate this once. I had decided against it originally since most segment searches are for overlapping intervals rather than exact matches. But I guess we can have some additional logic to prune out intervals which are disjoint, thus benefiting perf as you point out.

Other searches are by segment ID, so keeping an id -> segment map helped.

Eventually, we would most likely end up keeping a timeline itself.
That timeline could also be used to replace the timeline maintained in SqlSegmentsMetadataManager used by CompactionScheduler (and also the coordinator).

@kfaraz (Contributor, Author) commented Feb 6, 2025

> I was also hoping to understand the pros and cons of having pending segments and segments being accessed by the same ReadWriteLock in the cache.

There are certain transactions that modify both pending segments and regular segments.
To maintain consistency (and also for general simplicity), it is best to have a single lock that restricts access to the entire cache.
The locks are held only at the datasource level, and the cache operations are expected to be fast enough that this won't be a concern.
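A minimal sketch of that arrangement (field names taken from the "Contents of cache" table; the class shape is illustrative, not the PR's actual DatasourceSegmentCache, and DataSegmentPlus / PendingSegmentRecord are the Druid types referenced in that table):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.joda.time.Interval;

// One lock guards all of a datasource's cached state, so a transaction that
// touches both pending and committed segments sees a consistent view of both.
class DatasourceCacheSketch
{
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
  private final Map<String, DataSegmentPlus> idToUsedSegment = new HashMap<>();
  private final Map<Interval, Map<String, PendingSegmentRecord>> intervalToPendingSegments = new HashMap<>();
}
```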

@gianm (Contributor) commented Feb 6, 2025

> Handle race conditions

@kfaraz this is listed under "pending items". What race conditions are you aware of?

@kfaraz (Contributor, Author) commented Feb 7, 2025

> @kfaraz this is listed under "pending items". What race conditions are you aware of?

Thanks for calling this out, @gianm .
There were some race conditions that I had identified but I have handled them now. I have updated the PR description accordingly. For the race conditions and corner cases, there are also relevant comments/javadocs in the code.

  • Races between two different transactions initiated by IndexerSQLMetadataStorageCoordinator. The initial design did not acquire a lock on the cache for the entirety of a transaction, but this has been fixed now (see SqlSegmentMetadataTransactionFactory).
  • Race between polling the metadata store and writing to the cache. The polling thread could remove a segment ID from the cache if it was not found in the latest poll. This logic now removes only segments that were last updated before the poll started, so that we don't remove an entry that was added to the cache after the poll began (and is therefore not in the poll results).
  • Other minor stuff found during unit testing, which has already been handled.

I will take another pass through the code, just to ensure that I haven't missed anything, adding comments where necessary.
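For the polling race described above, the eviction guard presumably has this shape (a hypothetical fragment; all names are illustrative):

```java
// Evict only entries that are both absent from the poll result AND were last
// updated before the poll began; anything newer was simply added after the
// poll's snapshot and must not be removed.
if (!polledSegmentIds.contains(segmentId)
    && cacheEntry.getLastUpdatedTime().isBefore(pollStartTime)) {
  removeFromCache(segmentId);
}
```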

@kfaraz (Contributor, Author) commented Feb 7, 2025

@AmatyaAvadhanula , thanks for the suggestions! I have removed the timeline and added an interval map instead.

@gianm (Contributor) commented Feb 7, 2025

> There were some race conditions that I had identified but I have handled them now. I have updated the PR description accordingly. For the race conditions and corner cases, there are also relevant comments/javadocs in the code.

Thanks for the notes!

```java
  segmentAllocationQueue.becomeLeader();
  taskMaster.becomeHalfLeader(taskRunner, taskQueue);
}

@Override
public void stop()
{
  segmentMetadataCache.stopBeingLeader();
```
@gianm (Contributor):

Generally, the order of items in stop() should be the reverse of the order in start(), in case there are dependencies.

@kfaraz (Contributor, Author):

Fixed.

```diff
@@ -228,6 +230,10 @@ public void configure(Binder binder)
     JsonConfigProvider.bind(binder, "druid.indexer.task.default", DefaultTaskConfig.class);
     binder.bind(RetryPolicyFactory.class).in(LazySingleton.class);

+    binder.bind(SegmentMetadataCache.class)
```
@gianm (Contributor):

Is this meant to override the binding to NoopSegmentMetadataCache in SQLMetadataStorageDruidModule? I thought multiple bindings typically weren't allowed.

I wonder why we need the binding outside of CliOverlord at all: why do other server types need to create one?

@kfaraz (Contributor, Author):

Ideally, the other server types shouldn't need the binding at all.
But we have this dep graph:
CoreInjectorBuilder -> DerbyMetadataStorageModule -> SQLMetadataStorageModule -> SqlSegmentMetadataTransactionFactory (required for IndexerSQLMetadataStorageCoordinator) -> SegmentMetadataCache.

CoreInjectorBuilder should not even load DerbyMetadataStorageModule; it should be loaded only in CliOverlord and CliCoordinator. But I decided to make this change in a separate PR so that I could test it out properly, just to be on the safe side.

Please let me know if you think I should include it in this PR itself.

```java
this.pollDuration = Configs.valueOrDefault(pollDuration, Period.minutes(1));
this.useCache = Configs.valueOrDefault(useCache, true);
```
@gianm (Contributor):

If this is what controls whether the new caching feature is enabled, then we want this to be false by default for now. We could change the default to true when it's more proven out.

@kfaraz (Contributor, Author):

Yes, I had enabled this temporarily to have all UTs and ITs work with the cache enabled.
I will disable it now as all tests seem to work as expected.
There are already some tests which run in both modes: cache enabled and disabled.

@kfaraz (Contributor, Author):

Fixed, default is now false (disabled).

```java
int maxPartitionNum = -1;
for (String id : unusedSegmentIds) {
  final SegmentId segmentId = SegmentId.tryParse(datasource, id);
  if (segmentId == null) {
```
@gianm (Contributor):

Warn (or perhaps even throw?) if the segment ID is unparseable, since in that case, the method may not be returning the correct answer.

@kfaraz (Contributor, Author):

Fixed, throws exception now.
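The fixed check plausibly takes this shape (DruidException.defensive appears in another hunk of this PR; the message text here is illustrative):

```java
final SegmentId segmentId = SegmentId.tryParse(datasource, id);
if (segmentId == null) {
  throw DruidException.defensive(
      "Could not parse stored segment ID[%s] for datasource[%s]", id, datasource
  );
}
```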

```diff
 for (List<String> partition : partitionedSegmentIds) {
-  fetchedSegments.addAll(retrieveSegmentBatchById(datasource, partition, false));
+  fetchedSegments.add(retrieveSegmentBatchById(datasource, partition, false));
```
@gianm (Contributor):

I wonder what this code will do exactly. There will be multiple CloseableIterator from retrieveSegmentBatchById existing at once. What effect does that have?

Does the metadata query get made lazily when the iterator first has hasNext() called? If so then it would lead to the metadata queries being issued sequentially, which seems fine. But, if the query is issued as part of iterator creation, this would lead to quite a lot of simultaneously open queries, which might cause problems with the metadata store.

@kfaraz (Contributor, Author), Feb 8, 2025:

Thanks for the suggestion! Added a CloseableIterator which keeps only one result set open at a time.
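The general shape of such an iterator, as a standalone sketch (not the PR's actual class): each batch's metadata query runs only when the previous batch is exhausted, so at most one result set is open at a time.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

interface CloseableIterator<T> extends Iterator<T>, Closeable {}

class LazyBatchIterator<T> implements CloseableIterator<T>
{
  private final Iterator<List<String>> batches;
  private final Function<List<String>, CloseableIterator<T>> fetchBatch;
  private CloseableIterator<T> current;

  LazyBatchIterator(
      Iterator<List<String>> batches,
      Function<List<String>, CloseableIterator<T>> fetchBatch
  )
  {
    this.batches = batches;
    this.fetchBatch = fetchBatch;
  }

  @Override
  public boolean hasNext()
  {
    // Advance past empty or exhausted batches, closing each before opening
    // the next; the batch query is issued only when fetchBatch is applied here.
    while ((current == null || !current.hasNext()) && batches.hasNext()) {
      closeCurrent();
      current = fetchBatch.apply(batches.next());
    }
    return current != null && current.hasNext();
  }

  @Override
  public T next()
  {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }

  @Override
  public void close()
  {
    closeCurrent();
  }

  private void closeCurrent()
  {
    if (current != null) {
      try {
        current.close();
      }
      catch (IOException e) {
        throw new RuntimeException(e);
      }
      current = null;
    }
  }
}
```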

```java
while (currentCacheState == CacheState.LEADER_FIRST_SYNC_PENDING
       || currentCacheState == CacheState.LEADER_FIRST_SYNC_STARTED) {
  try {
    cacheStateLock.wait(5 * 60_000);
```
@gianm (Contributor):

Use a static constant please. Btw, if the intent here is to have a specific timeout on waiting for sync, it should be checked again after wait returns (and then continue to wait if the timeout hasn't been reached yet). It is possible for wait to return early in case of spurious wakeup.

To ensure the thread wakes up timely, all of the cache state transitions should include a notifyAll.

@kfaraz (Contributor, Author), Feb 8, 2025:

No, the intent is just to avoid waiting forever.
The spurious wakeup is handled by verifyCacheIsReady (renamed to verifyCacheIsUsableAndAwaitSync()) itself.
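The conventional guarded-wait pattern under discussion, as a hypothetical helper (isSyncPending() and the constant name are illustrative):

```java
private static final long SYNC_WAIT_TIMEOUT_MILLIS = 5 * 60_000L;

// Re-check the condition after every wakeup (wait can return spuriously) and
// track the deadline so the total wait stays bounded. For this to wake up
// promptly, every cache state transition must call cacheStateLock.notifyAll().
void awaitSync(Object cacheStateLock) throws InterruptedException
{
  synchronized (cacheStateLock) {
    final long deadline = System.currentTimeMillis() + SYNC_WAIT_TIMEOUT_MILLIS;
    long remaining = SYNC_WAIT_TIMEOUT_MILLIS;
    while (isSyncPending() && remaining > 0) {
      cacheStateLock.wait(remaining);
      remaining = deadline - System.currentTimeMillis();
    }
  }
}
```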

```java
    cacheStateLock.wait(5 * 60_000);
  }
  catch (Exception e) {
    log.noStackTrace().error(e, "Error while waiting for cache to be ready");
```
@gianm (Contributor):

I think the only error wait will throw during normal operation is InterruptedException. Consider special-casing that, and logging at a lower level. Other exception types can continue to be logged at error level and re-thrown.

@kfaraz (Contributor, Author):

Added a catch for InterruptedException.

```java
}
catch (Throwable t) {
  log.error(t, "Error occurred while polling metadata store");
  log.makeAlert(t, "Error occurred while polling metadata store");
```
@gianm (Contributor):

Missing call to emit() after makeAlert. Also, no reason to call both log.error and log.makeAlert. The alert is logged when it is emitted.
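That is, the corrected call would collapse to a single line (emit() both sends the alert and logs it):

```java
log.makeAlert(t, "Error occurred while polling metadata store").emit();
```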

@kfaraz (Contributor, Author):

Thanks for catching this!

@kfaraz (Contributor, Author):

fixed.

```java
  throw DruidException.defensive("Cache has not been started yet");
}

currentCacheState = CacheState.LEADER_FIRST_SYNC_PENDING;
```
@gianm (Contributor):

Interrupt and re-start the current sync? That could help the leader gain leadership faster.

@kfaraz (Contributor, Author):

Updated.

```java
 * <li>Emit metrics</li>
 * </ul>
 */
private void syncWithMetadataStore()
```
@gianm (Contributor):

Have you been able to benchmark this method with a cluster with lots of segments (e.g. millions)? It will need to complete before allocation can work post-leadership-election, so I'm wondering how much time that will take.

@kfaraz (Contributor, Author):

Not yet. I am about to start testing on some clusters with a large number of segments. I will share the results here once the testing is done.

@kfaraz (Contributor, Author) commented Feb 9, 2025

@gianm , thanks for the review!
I have updated the PR based on your feedback.
