
Metadata scan limit: monitoring gap analysis and options #234

@eazyhozy

Description

Problem

The metadata scan has a hard limit (metadata-fetch-limit, default 1,000) applied at the storage scan level. INACTIVE (soft-deleted) rows consume this limit quota alongside active rows, but there is currently no way to monitor the true quota usage before silent truncation occurs.

When the total rows (ACTIVE + INACTIVE) in the metastore exceed the limit, active entities are silently dropped from the scan result, potentially causing serving issues.

Scenario

With metadata-fetch-limit = 1,000 (default) and a single database:

| Step | Metastore rows | Edge ACTIVE | Edge INACTIVE |
| --- | --- | --- | --- |
| Create 1,000 tables | 1,000 | 1,000 | 0 |
| Delete 500 tables | 1,000 | 500 | 500 |
| Create 1 new table ⚠️ | 1,001 | 501 | 500 |

After the last step, scanStorage() fetches 1,000 rows (sorted by key, capped by limit). One row is truncated — which may be the newly created active table. The isActive filter then removes ~500 INACTIVE edges, leaving DdlPage.count ≈ 500.

Result: An active table is silently missing from the scan, but DdlPage.count (≈500) gives no indication that truncation occurred.

Note: "Delete" here means the full 2-step soft delete (deactivate → delete), which sets edge to INACTIVE while keeping the metastore row.
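To make the failure mode concrete, here is a minimal simulation of the scenario above. `Row` and `scanStorage` are simplified stand-ins for illustration, not the project's actual types:

```kotlin
// Hypothetical sketch of the scan-limit truncation; not the real API.
data class Row(val key: String, val isActive: Boolean)

// The hard cap is applied at the storage level, before any activity filter.
fun scanStorage(rows: List<Row>, limit: Int): List<Row> =
    rows.sortedBy { it.key }.take(limit)

fun main() {
    val limit = 1_000
    // 1,000 tables created, the first 500 soft-deleted, then one new table
    // added whose key sorts last.
    val metastore = (0 until 1_000).map { Row("t%04d".format(it), isActive = it >= 500) } +
        Row("t1000", isActive = true)

    val scanned = scanStorage(metastore, limit)  // 1,000 rows; "t1000" is truncated
    val active = scanned.filter { it.isActive }  // what DdlPage.count reflects

    println(metastore.size)  // 1001 true metastore rows
    println(scanned.size)    // 1000 = scannedRowCount (discarded today)
    println(active.size)     // 500 — no hint that an active row was dropped
}
```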

No existing API can detect this

| API | What it exposes | Why it's insufficient |
| --- | --- | --- |
| DDL API (`DdlPage.count`) | Edge-ACTIVE entity count (after `isActive` filter) | INACTIVE (deleted) rows are filtered out. In the scenario above, count ≈ 500 while the true metastore row count is 1,001. No truncation signal. |
| Metastore dump (`/graph/v2/metastore/{global,local}`) | Raw metastore rows | Shows the true row count, but MetastoreInspector queries the database directly (bypassing AbstractLabel). Requires merging local + global dumps and phase-prefix filtering. Not storage-agnostic. |
| Actuator (`/actuator/env`) | `metadata-fetch-limit` config value | Limit value only. No row count information. |
| In-memory cache | Active entities loaded by `updateAllMetadata()` | Same as the DDL API — only edge-ACTIVE entities after the filter. |

Root cause: The true quota usage (scannedRowCount — rows returned from storage before isActive filter) is computed in AbstractLabel.scan() but discarded. No API exposes this value.

Proposed Solution

The server already scans the metastore periodically (updateAllMetadata()). During each scan, scannedRowCount (rows returned from storage, before the isActive filter) is computed in AbstractLabel.scan() but immediately discarded.

Proposal: Retain scannedRowCount and expose it through DdlPage.

```kotlin
// Current
DdlPage(count = activeEntities.size)

// Proposed
DdlPage(count = activeEntities.size, scanCount = scannedRowCount)
```

This requires minimal changes — no scan logic modification, just passing a count that is already computed:

| Component | Change |
| --- | --- |
| `ScanResult` | Add `scannedRowCount: Int` field |
| `AbstractLabel.scan()` | Pass `scannedRowCount` to `ScanResult` |
| `DdlPage` | Add `scanCount: Long` (default = `count` for backward compat) |
| `DdlService.getAll()` | Populate `scanCount` from the scan result |
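The changes above can be sketched as follows. Apart from `scannedRowCount` and `scanCount`, the type shapes and field names are simplified assumptions, not the real signatures:

```kotlin
// Simplified stand-ins for the real ScanResult / DdlPage types.
data class ScanResult<T>(
    val rows: List<T>,
    val hasNext: Boolean,
    val scannedRowCount: Int,              // new: rows returned by storage, pre-filter
)

data class DdlPage<T>(
    val entities: List<T>,
    val count: Int = entities.size,
    val scanCount: Long = count.toLong(),  // new: defaults to count for backward compat
)

// Roughly what DdlService.getAll() would do: filter for serving,
// but keep the pre-filter count for monitoring.
fun <T> toPage(scan: ScanResult<T>, isActive: (T) -> Boolean): DdlPage<T> {
    val active = scan.rows.filter(isActive)
    return DdlPage(active, active.size, scan.scannedRowCount.toLong())
}

fun main() {
    val scan = ScanResult(listOf(1, 2, 3, 4), hasNext = false, scannedRowCount = 4)
    val page = toPage(scan) { it % 2 == 0 }  // pretend odd ids are INACTIVE
    println(page.count)      // 2
    println(page.scanCount)  // 4
}
```

Defaulting `scanCount` to `count` keeps existing `DdlPage` construction sites compiling unchanged.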

With scanCount exposed, monitoring becomes straightforward: scanCount / metadata-fetch-limit = quota usage.
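Assuming `scanCount` is exposed, a monitor could compute the ratio directly. The helper name and the 80% alert threshold below are illustrative, not part of the proposal:

```kotlin
// Quota usage as a fraction of the configured fetch limit.
fun quotaUsage(scanCount: Long, fetchLimit: Int): Double =
    scanCount.toDouble() / fetchLimit

fun main() {
    val limit = 1_000                          // metadata-fetch-limit default
    println(quotaUsage(800, limit))            // 0.8 — e.g. alert above an 80% threshold
    println(quotaUsage(1_000, limit) >= 1.0)   // true: at the cap, truncation is possible
}
```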

Note on LocalBackedJdbcHashLabel

LocalBackedJdbcHashLabel.scan() merges results from two stores:

```kotlin
local.zipWith(global) { a, b -> a + b }  // DataFrame.plus()
```

Since ScanResult is consumed within AbstractLabel.scan() and only DataFrame is returned, the scannedRowCount from each store needs to be carried through the merge. This may require either:

  • Passing scannedRowCount via DataFrame metadata (e.g., stats field), or
  • Adjusting LocalBackedJdbcHashLabel.scan() to sum row counts from both stores

This is an implementation detail to consider.
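A rough sketch of the second option (summing row counts from both stores), again using a simplified stand-in for `ScanResult` rather than the real merge code:

```kotlin
// Simplified stand-in; the real merge operates on DataFrames via zipWith.
data class ScanResult<T>(val rows: List<T>, val scannedRowCount: Int)

// Merge local + global results and sum each store's pre-filter row count.
fun <T> merge(local: ScanResult<T>, global: ScanResult<T>): ScanResult<T> =
    ScanResult(
        rows = local.rows + global.rows,  // analogue of DataFrame.plus()
        scannedRowCount = local.scannedRowCount + global.scannedRowCount,
    )

fun main() {
    val merged = merge(ScanResult(listOf("a"), 3), ScanResult(listOf("b", "c"), 5))
    println(merged.scannedRowCount)  // 8
}
```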

Alternatives Considered

  • External monitoring only (shell script + existing DDL API): DdlPage.count alone cannot detect truncation when INACTIVE rows are present (see Scenario). Only viable as a rough approximation.
  • Metastore dump API: Can show true row counts, but requires merging local + global stores and phase-prefix filtering. Also, MetastoreInspector queries the underlying database directly (bypassing AbstractLabel), so it is not storage-agnostic — a future storage change (e.g., JDBC → HBase) would break this approach.

Additional Context

  • metadata-fetch-limit (default 1,000) is queryable via /actuator/env/kc.graph.metadata-fetch-limit
  • ScanResult already carries hasNext (scan metadata) — carrying scannedRowCount is analogous
  • DdlPage is only constructed from DdlService.getAll() (scan-based), so scanCount is not semantically out of place

Feedback on this approach is welcome.

Internal scan flow reference

```mermaid
sequenceDiagram
    participant DdlService
    participant Label as AbstractLabel.scan()
    participant SQL as scanStorage()
    participant Filter as .filter { isActive }

    DdlService->>Label: scan(ScanFilter(limit=metadata-fetch-limit))
    Label->>SQL: SELECT ... WHERE k LIKE '{prefix}%' LIMIT {limit}
    SQL-->>Label: allRows (ACTIVE + INACTIVE mixed)
    Note right of SQL: 📊 scannedRowCount = true quota usage
    Label->>Filter: allRows.filter { isActive }
    Filter-->>Label: rows (INACTIVE removed)
    Note right of Filter: ⚠️ scannedRowCount discarded here
    Label-->>DdlService: DataFrame(rows)
    DdlService->>DdlService: DdlPage(count = rows.size)
    Note over DdlService: ❌ Only post-filter count exposed
```

Labels: enhancement (New feature or request)