ISSUE: Project Hierarchy Aggregation Architecture Decision
Summary
We need to support hierarchical path-based aggregation (file counts + total sizes per directory prefix) for datasets, keyed by resource/project. Two implementation options have been proposed.
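To make the target output concrete, here is a minimal sketch of the rollup itself. It assumes each record is a `(resource, file_path, size_bytes)` tuple with an absolute POSIX-style path; the function names `path_prefixes` and `rollup` and the resource string in the usage note are illustrative and do not come from the dataframe ETL or Guppy.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def path_prefixes(file_path):
    """Return every ancestor directory of file_path, root first.

    Assumes an absolute POSIX-style path, e.g.
    "/data/a/x.bin" -> ["/", "/data", "/data/a"].
    """
    parents = PurePosixPath(file_path).parents  # nearest ancestor first
    return [str(p) for p in reversed(parents)]


def rollup(records):
    """Aggregate (resource, file_path, size_bytes) records per directory prefix.

    Returns {(resource, prefix): {"file_count": int, "total_size": int}}.
    """
    totals = defaultdict(lambda: [0, 0])  # [file_count, total_size]
    for resource, file_path, size in records:
        # A file counts toward every directory above it, not just its parent.
        for prefix in path_prefixes(file_path):
            entry = totals[(resource, prefix)]
            entry[0] += 1
            entry[1] += size
    return {
        key: {"file_count": count, "total_size": size}
        for key, (count, size) in totals.items()
    }
```

For example, two files of sizes 4 and 6 under `/data` for one (hypothetical) resource would both contribute to the `(resource, "/data")` row, giving `file_count=2` and `total_size=10`.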
Option A — Extend dataframe / Guppy
Concept: Add a new path_aggregation datatype to the existing Guppy / dataframe pipeline.
Benefits:
- Reuses existing Guppy service (no new Helm or CI/CD deployments).
- Reuses existing authN/authZ model — users only see projects they can access.
- Exposes existing GraphQL API — immediately usable by frontend teams.
- Leverages Guppy caching and query optimizations.
- Simple operational footprint (no new service).
Prototype:
Trade-offs:
- Depends on ETL cadence → not real-time.
- Additional computation during ETL refreshes.
- Must keep prefix logic synchronized with indexd definitions.
Delivery path:
- Extend dataframe ETL to compute `(resource, path_prefix)` rollups.
- Add `path_aggregation` type and schema to Guppy.
- Backfill aggregates and expose via GraphQL.
- Frontend can query immediately using the existing Guppy endpoint.
Option B — Create a new indexd-driven service
Concept: Create a standalone microservice (FastAPI/Flask) backed by Postgres triggers that maintain a path_agg_res table in near-real-time.
Benefits:
- Real-time freshness — triggers fire on every indexd insert/update/delete.
- Direct coupling with indexd source-of-truth.
- Lightweight REST API (`GET /{resource}/path/{path}`) for analytics or dashboards.
- Easy to scale independently.
Trade-offs:
- Requires new Helm chart and operational support. 🚧
- Must implement auth parity (Fence/Arborist integration). 🚧
- Adds another service surface (new REST endpoint vs. GraphQL). 🚧
- Will capture all updates to indexd ❓
- Adds DB overhead to every indexd insert (1 insert becomes 1 insert + 2 reads + 1 write). 👎
- ALL indexd writes are aggregated - even ones that don't wind up being published 👎
Delivery path:
- Define and deploy Postgres triggers for path aggregation.
- Implement FastAPI endpoint with authz middleware.
- Helm deployment, metrics, alerts.
- Optional Guppy resolver to proxy to this API for single-client integration.
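The delta logic a trigger would have to apply on each indexd write can be sketched in plain Python. This is not Postgres trigger code and not the proposed implementation: `PathAggTable` is a hypothetical in-memory stand-in for `path_agg_res`, and the one-upsert-per-ancestor strategy is an assumption (the issue does not specify the trigger design, so the write cost here may differ from the figure quoted in the trade-offs).

```python
from pathlib import PurePosixPath


class PathAggTable:
    """In-memory stand-in for a path_agg_res table keyed by (resource, prefix)."""

    def __init__(self):
        self.rows = {}  # (resource, prefix) -> [file_count, total_size]

    def _apply(self, resource, file_path, count_delta, size_delta):
        # One upsert per ancestor directory; this is the extra work
        # that would ride along with every indexd write.
        for parent in PurePosixPath(file_path).parents:
            key = (resource, str(parent))
            row = self.rows.setdefault(key, [0, 0])
            row[0] += count_delta
            row[1] += size_delta
            if row[0] == 0:
                del self.rows[key]  # drop prefixes that no longer hold files

    def on_insert(self, resource, file_path, size):
        self._apply(resource, file_path, 1, size)

    def on_delete(self, resource, file_path, size):
        self._apply(resource, file_path, -1, -size)

    def on_update(self, resource, file_path, old_size, new_size):
        # Same path, new size: counts stay put, only the size delta propagates.
        self._apply(resource, file_path, 0, new_size - old_size)
```

Because every handler fires regardless of publication state, this sketch also illustrates the concern above: unpublished records are aggregated unless the trigger filters them out.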
Comparison Summary
| Aspect | Option A: Guppy Extension | Option B: New Service |
|---|---|---|
| Service footprint | No new service | New Helm deployment |
| AuthN/AuthZ | Inherits Guppy model | Must replicate Fence/Arborist auth |
| API | GraphQL (existing clients) | REST (new endpoint) |
| Freshness | ETL cadence (batch) | Near-real-time (DB triggers) |
| Ops effort | Low | Higher (new CI/CD, SLOs) |
| Frontend changes | None | Requires new client integration |
| Complexity locus | ETL + Guppy schema | DB triggers + API layer |
Recommendation
Phase 1: Implement Option A (Guppy extension) to deliver immediate value with minimal operational change.
Phase 2: Pilot Option B (indexd-driven service) where near-real-time is required (e.g., admin dashboards).
Acceptance Criteria
- ✅ Given sample inputs (two files under `/data/...`), `/data` aggregates correctly with `file_count=2`, `total_size=10`.
- ✅ Unauthorized users cannot access aggregates for restricted projects.
- ✅ GraphQL (Option A) or REST (Option B) returns expected JSON payloads.
- ✅ Load testing demonstrates acceptable latency (P95 read latency < 200 ms).
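The authorization criterion could be checked with something like the sketch below. It assumes aggregate rows are dicts carrying a `resource` field and that the caller's readable resources have already been resolved elsewhere (in practice by Arborist, or by Guppy's existing auth layer); `visible_aggregates` and the resource paths are hypothetical names for illustration only.

```python
def visible_aggregates(aggregates, allowed_resources):
    """Return only the aggregate rows whose resource the caller may read."""
    allowed = set(allowed_resources)
    return [row for row in aggregates if row["resource"] in allowed]


# Hypothetical sample rows: one open project, one restricted project.
rows = [
    {"resource": "/programs/demo/projects/open", "path_prefix": "/data",
     "file_count": 2, "total_size": 10},
    {"resource": "/programs/demo/projects/restricted", "path_prefix": "/data",
     "file_count": 5, "total_size": 99},
]

# A caller with access to the open project only must not see the restricted row.
result = visible_aggregates(rows, ["/programs/demo/projects/open"])
```

Filtering by exact resource match is the simplest possible policy; if access is granted at a parent level of the resource hierarchy, the membership test would need to honor that.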
Next Steps (Sprint Backlog)
For Option A
- Add path rollup computation to dataframe ETL.
- Register `path_aggregation` index in Guppy schema.
- Validate output via GraphQL query.
- Update documentation and example queries.
For Option B (pilot)
- Implement `path_agg_res` table and triggers.
- Create FastAPI REST endpoint with JWT/Fence validation.
- Deploy as separate service behind feature flag.
- Compare results to Guppy aggregates for correctness and freshness.
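For the comparison step, a small diff helper may be enough to start with. The sketch assumes both sides can be loaded into dicts mapping `(resource, path_prefix)` to `(file_count, total_size)`; `diff_aggregates` is a hypothetical name, and how each side is fetched is left out.

```python
def diff_aggregates(guppy_rows, service_rows):
    """Return keys whose values differ, mapped to (guppy_value, service_value).

    A key missing on one side shows up with None in that position.
    """
    keys = guppy_rows.keys() | service_rows.keys()
    return {
        key: (guppy_rows.get(key), service_rows.get(key))
        for key in sorted(keys)
        if guppy_rows.get(key) != service_rows.get(key)
    }
```

A non-empty diff does not by itself mean one side is wrong: since Option A refreshes on ETL cadence, differences may simply reflect writes that landed after the last ETL run.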