
Metadata scan limit: monitoring gap analysis and options #234

@eazyhozy

Description

Problem

The metadata scan has a hard limit (metadata-fetch-limit, default 1,000) applied at the storage scan level. INACTIVE (soft-deleted) rows consume this limit quota alongside active rows, but there is currently no way to monitor the true quota usage before silent truncation occurs.

When the total rows (ACTIVE + INACTIVE) in the metastore exceed the limit, active entities are silently dropped from the scan result, potentially causing serving issues.

Scenario

With metadata-fetch-limit = 1,000 (default) and a single database:

| Step | Metastore rows | Edge ACTIVE | Edge INACTIVE |
| --- | --- | --- | --- |
| Create 1,000 tables | 1,000 | 1,000 | 0 |
| Delete 500 tables | 1,000 | 500 | 500 |
| Create 1 new table ⚠️ | 1,001 | 501 | 500 |

After the last step, scanStorage() fetches 1,000 rows (sorted by key, capped by limit). One row is truncated — which may be the newly created active table. The isActive filter then removes ~500 INACTIVE edges, leaving DdlPage.count ≈ 500.

Result: An active table is silently missing from the scan, but DdlPage.count (≈500) gives no indication that truncation occurred.

Note: "Delete" here means the full 2-step soft delete (deactivate → delete), which sets edge to INACTIVE while keeping the metastore row.
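To make the failure mode concrete, here is a minimal simulation of the scenario above. `Row` and `scanStorage` are simplified stand-ins for illustration, not the project's actual types:

```kotlin
// Hypothetical sketch of the scan-limit truncation; not the real API.
data class Row(val key: String, val isActive: Boolean)

// The hard cap is applied at the storage level, before any activity filter.
fun scanStorage(rows: List<Row>, limit: Int): List<Row> =
    rows.sortedBy { it.key }.take(limit)

fun main() {
    val limit = 1_000
    // 1,000 tables created, the first 500 soft-deleted, then one new table
    // added whose key sorts last.
    val metastore = (0 until 1_000).map { Row("t%04d".format(it), isActive = it >= 500) } +
        Row("t1000", isActive = true)

    val scanned = scanStorage(metastore, limit)  // 1,000 rows; "t1000" is truncated
    val active = scanned.filter { it.isActive }  // what DdlPage.count reflects

    println(metastore.size)  // 1001 true metastore rows
    println(scanned.size)    // 1000 = scannedRowCount (discarded today)
    println(active.size)     // 500 — no hint that an active row was dropped
}
```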

No existing API can detect this

| API | What it exposes | Why it's insufficient |
| --- | --- | --- |
| DDL API (`DdlPage.count`) | Edge-ACTIVE entity count (after `isActive` filter) | INACTIVE (deleted) rows are filtered out. In the scenario above, count ≈ 500 while the true metastore row count is 1,001. No truncation signal. |
| Metastore dump (`/graph/v2/metastore/{global,local}`) | Raw metastore rows | Shows the true row count, but MetastoreInspector queries the database directly (bypassing AbstractLabel). Requires merging local + global dumps and phase-prefix filtering. Not storage-agnostic. |
| Actuator (`/actuator/env`) | `metadata-fetch-limit` config value | Limit value only. No row count information. |
| In-memory cache | Active entities loaded by `updateAllMetadata()` | Same as the DDL API — only edge-ACTIVE entities after the filter. |

Root cause: The true quota usage (scannedRowCount — rows returned from storage before isActive filter) is computed in AbstractLabel.scan() but discarded. No API exposes this value.

Proposed Solution

The server already scans the metastore periodically (updateAllMetadata()). During each scan, scannedRowCount (rows returned from storage, before the isActive filter) is computed in AbstractLabel.scan() but immediately discarded.

Proposal: Retain scannedRowCount and expose it through DdlPage.

```kotlin
// Current
DdlPage(count = activeEntities.size)

// Proposed
DdlPage(count = activeEntities.size, scanCount = scannedRowCount)
```

This requires minimal changes — no scan logic modification, just passing a count that is already computed:

| Component | Change |
| --- | --- |
| `ScanResult` | Add `scannedRowCount: Int` field |
| `AbstractLabel.scan()` | Pass `scannedRowCount` to `ScanResult` |
| `DdlPage` | Add `scanCount: Long` (default = `count` for backward compat) |
| `DdlService.getAll()` | Populate `scanCount` from the scan result |
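The changes above can be sketched as follows. Apart from `scannedRowCount` and `scanCount`, the type shapes and field names are simplified assumptions, not the real signatures:

```kotlin
// Simplified stand-ins for the real ScanResult / DdlPage types.
data class ScanResult<T>(
    val rows: List<T>,
    val hasNext: Boolean,
    val scannedRowCount: Int,              // new: rows returned by storage, pre-filter
)

data class DdlPage<T>(
    val entities: List<T>,
    val count: Int = entities.size,
    val scanCount: Long = count.toLong(),  // new: defaults to count for backward compat
)

// Roughly what DdlService.getAll() would do: filter for serving,
// but keep the pre-filter count for monitoring.
fun <T> toPage(scan: ScanResult<T>, isActive: (T) -> Boolean): DdlPage<T> {
    val active = scan.rows.filter(isActive)
    return DdlPage(active, active.size, scan.scannedRowCount.toLong())
}

fun main() {
    val scan = ScanResult(listOf(1, 2, 3, 4), hasNext = false, scannedRowCount = 4)
    val page = toPage(scan) { it % 2 == 0 }  // pretend odd ids are INACTIVE
    println(page.count)      // 2
    println(page.scanCount)  // 4
}
```

Defaulting `scanCount` to `count` keeps existing `DdlPage` construction sites compiling unchanged.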

With scanCount exposed, monitoring becomes straightforward: scanCount / metadata-fetch-limit = quota usage.
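Assuming `scanCount` is exposed, a monitor could compute the ratio directly. The helper name and the 80% alert threshold below are illustrative, not part of the proposal:

```kotlin
// Quota usage as a fraction of the configured fetch limit.
fun quotaUsage(scanCount: Long, fetchLimit: Int): Double =
    scanCount.toDouble() / fetchLimit

fun main() {
    val limit = 1_000                          // metadata-fetch-limit default
    println(quotaUsage(800, limit))            // 0.8 — e.g. alert above an 80% threshold
    println(quotaUsage(1_000, limit) >= 1.0)   // true: at the cap, truncation is possible
}
```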

Note on LocalBackedJdbcHashLabel

LocalBackedJdbcHashLabel.scan() merges results from two stores:

```kotlin
local.zipWith(global) { a, b -> a + b }  // DataFrame.plus()
```

Since ScanResult is consumed within AbstractLabel.scan() and only DataFrame is returned, the scannedRowCount from each store needs to be carried through the merge. This may require either:

  • Passing scannedRowCount via DataFrame metadata (e.g., stats field), or
  • Adjusting LocalBackedJdbcHashLabel.scan() to sum row counts from both stores

This is an implementation detail to consider.
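A rough sketch of the second option (summing row counts from both stores), again using a simplified stand-in for `ScanResult` rather than the real merge code:

```kotlin
// Simplified stand-in; the real merge operates on DataFrames via zipWith.
data class ScanResult<T>(val rows: List<T>, val scannedRowCount: Int)

// Merge local + global results and sum each store's pre-filter row count.
fun <T> merge(local: ScanResult<T>, global: ScanResult<T>): ScanResult<T> =
    ScanResult(
        rows = local.rows + global.rows,  // analogue of DataFrame.plus()
        scannedRowCount = local.scannedRowCount + global.scannedRowCount,
    )

fun main() {
    val merged = merge(ScanResult(listOf("a"), 3), ScanResult(listOf("b", "c"), 5))
    println(merged.scannedRowCount)  // 8
}
```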

Alternatives Considered

  • External monitoring only (shell script + existing DDL API): DdlPage.count alone cannot detect truncation when INACTIVE rows are present (see Scenario). Only viable as a rough approximation.
  • Metastore dump API: Can show true row counts, but requires merging local + global stores and phase-prefix filtering. Also, MetastoreInspector queries the underlying database directly (bypassing AbstractLabel), so it is not storage-agnostic — a future storage change (e.g., JDBC → HBase) would break this approach.

Additional Context

  • metadata-fetch-limit (default 1,000) is queryable via /actuator/env/kc.graph.metadata-fetch-limit
  • ScanResult already carries hasNext (scan metadata) — carrying scannedRowCount is analogous
  • DdlPage is only constructed from DdlService.getAll() (scan-based), so scanCount is not semantically out of place

Feedback on this approach is welcome.

Internal scan flow reference

```mermaid
sequenceDiagram
    participant DdlService
    participant Label as AbstractLabel.scan()
    participant SQL as scanStorage()
    participant Filter as .filter { isActive }

    DdlService->>Label: scan(ScanFilter(limit=metadata-fetch-limit))
    Label->>SQL: SELECT ... WHERE k LIKE '{prefix}%' LIMIT {limit}
    SQL-->>Label: allRows (ACTIVE + INACTIVE mixed)
    Note right of SQL: 📊 scannedRowCount = true quota usage
    Label->>Filter: allRows.filter { isActive }
    Filter-->>Label: rows (INACTIVE removed)
    Note right of Filter: ⚠️ scannedRowCount discarded here
    Label-->>DdlService: DataFrame(rows)
    DdlService->>DdlService: DdlPage(count = rows.size)
    Note over DdlService: ❌ Only post-filter count exposed
```

Labels: enhancement (New feature or request)