Path Hierarchy Solutions #1

@bwalsh

ISSUE: Project Hierarchy Aggregation Architecture Decision

Summary

We need to support hierarchical path-based aggregation (file counts + total sizes per directory prefix) for datasets, keyed by resource/project. Two implementation options have been proposed.
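
To make the target output concrete, here is a minimal Python sketch of the rollup itself, independent of either option. It assumes absolute POSIX-style paths and input records shaped as `(resource, file_path, size)`; the function names, the record shape, and the sample file names and sizes are illustrative assumptions, not part of the existing ETL or indexd schema.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def path_prefixes(file_path):
    """Return every ancestor directory of file_path, nearest first, ending at "/"."""
    return [str(parent) for parent in PurePosixPath(file_path).parents]


def rollup(records):
    """Aggregate (resource, file_path, size) records into per-prefix totals.

    Each file contributes to every ancestor directory, so "/data/sub/b.txt"
    counts toward "/data/sub", "/data", and "/".
    """
    totals = defaultdict(lambda: {"file_count": 0, "total_size": 0})
    for resource, file_path, size in records:
        for prefix in path_prefixes(file_path):
            entry = totals[(resource, prefix)]
            entry["file_count"] += 1
            entry["total_size"] += size
    return dict(totals)


# Hypothetical sample: two files under /data for one project.
records = [
    ("proj-1", "/data/a.txt", 4),
    ("proj-1", "/data/sub/b.txt", 6),
]
aggregates = rollup(records)
print(aggregates[("proj-1", "/data")])
```

Keying the result by `(resource, prefix)` keeps projects separate, which is what lets access control be applied per resource at query time.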


Option A — Extend dataframe / Guppy

Concept: Add a new path_aggregation datatype to the existing Guppy / dataframe pipeline.
Benefits:

  • Reuses existing Guppy service (no new Helm or CI/CD deployments).
  • Reuses existing authN/authZ model — users only see projects they can access.
  • Exposes existing GraphQL API — immediately usable by frontend teams.
  • Leverages Guppy caching and query optimizations.
  • Simple operational footprint (no new service).

Prototype:

Trade-offs:

  • Depends on ETL cadence → not real-time.
  • Additional computation during ETL refreshes.
  • Must keep prefix logic synchronized with indexd definitions.

Delivery path:

  1. Extend dataframe ETL to compute (resource, path_prefix) rollups.
  2. Add path_aggregation type and schema to Guppy.
  3. Backfill aggregates and expose via GraphQL.
  4. Frontend can query immediately using the existing Guppy endpoint.

Option B — Create a new indexd-driven service

Concept: Create a standalone microservice (FastAPI/Flask) backed by Postgres triggers that maintain a path_agg_res table in near-real-time.
Benefits:

  • Real-time freshness — triggers fire on every indexd insert/update/delete.
  • Direct coupling with indexd source-of-truth.
  • Lightweight REST API (GET /{resource}/path/{path}) for analytics or dashboards.
  • Easy to scale independently.
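
As a rough model of what the triggers would maintain, the following Python sketch applies insert/update/delete events incrementally to an in-memory stand-in for the path_agg_res table. It is not the trigger SQL; the dict layout `(resource, prefix) -> (file_count, total_size)` and all function names are assumptions for illustration.

```python
from pathlib import PurePosixPath


def apply_change(path_agg, resource, file_path, size_delta, count_delta):
    """Apply one change to every ancestor prefix of file_path."""
    for parent in PurePosixPath(file_path).parents:
        key = (resource, str(parent))
        count, size = path_agg.get(key, (0, 0))
        count, size = count + count_delta, size + size_delta
        if count == 0:
            path_agg.pop(key, None)  # drop prefixes that no longer hold files
        else:
            path_agg[key] = (count, size)


def on_insert(path_agg, resource, file_path, size):
    apply_change(path_agg, resource, file_path, size, 1)


def on_delete(path_agg, resource, file_path, size):
    apply_change(path_agg, resource, file_path, -size, -1)


def on_update(path_agg, resource, file_path, old_size, new_size):
    # File count is unchanged; only the size difference propagates.
    apply_change(path_agg, resource, file_path, new_size - old_size, 0)


path_agg = {}
on_insert(path_agg, "proj-1", "/data/a.txt", 4)
on_insert(path_agg, "proj-1", "/data/sub/b.txt", 6)
on_update(path_agg, "proj-1", "/data/a.txt", 4, 5)
on_delete(path_agg, "proj-1", "/data/sub/b.txt", 6)
print(path_agg)
```

In this model every write touches one row per ancestor prefix, so deep paths cost more per write than shallow ones.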

Trade-offs:

  • Requires new Helm chart and operational support. 🚧
  • Must implement auth parity (Fence/Arborist integration). 🚧
  • Adds another service surface (new REST endpoint vs. GraphQL). 🚧
  • Captures every write to indexd ❓
    • Adds DB overhead to each indexd insert (one insert becomes 1 insert + 2 reads + 1 write) 👎
    • ALL indexd writes are aggregated, even ones that are never published 👎

Delivery path:

  1. Define and deploy Postgres triggers for path aggregation.
  2. Implement FastAPI endpoint with authz middleware.
  3. Helm deployment, metrics, alerts.
  4. Optional Guppy resolver to proxy to this API for single-client integration.
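
The read path in step 2 could look roughly like the handler below. It is written as a plain function so it stays framework-agnostic; in practice it would sit behind a FastAPI route for `GET /{resource}/path/{path}`. The `allowed_resources` set stands in for whatever the Fence/Arborist check returns, and the response field names are assumptions.

```python
def get_path_aggregate(path_agg, allowed_resources, resource, path):
    """Return (status_code, payload) for GET /{resource}/path/{path}.

    path_agg maps (resource, prefix) -> (file_count, total_size).
    allowed_resources is a hypothetical stand-in for the caller's authz result.
    """
    # Check authorization first so a 404 never reveals whether a
    # restricted project contains a given path.
    if resource not in allowed_resources:
        return 403, {"detail": "forbidden"}

    normalized = "/" + path.strip("/")  # "data/" and "/data" both become "/data"
    row = path_agg.get((resource, normalized))
    if row is None:
        return 404, {"detail": "no aggregate for path"}

    file_count, total_size = row
    return 200, {
        "resource": resource,
        "path": normalized,
        "file_count": file_count,
        "total_size": total_size,
    }


table = {("proj-1", "/data"): (2, 10)}
print(get_path_aggregate(table, {"proj-1"}, "proj-1", "data"))
print(get_path_aggregate(table, set(), "proj-1", "data"))
```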

Comparison Summary

| Aspect | Option A: Guppy Extension | Option B: New Service |
| --- | --- | --- |
| Service footprint | No new service | New Helm deployment |
| AuthN/AuthZ | Inherits Guppy model | Must replicate Fence/Arborist auth |
| API | GraphQL (existing clients) | REST (new endpoint) |
| Freshness | ETL cadence (batch) | Near-real-time (DB triggers) |
| Ops effort | Low | Higher (new CI/CD, SLOs) |
| Frontend changes | None | Requires new client integration |
| Complexity locus | ETL + Guppy schema | DB triggers + API layer |

Recommendation

Phase 1: Implement Option A (Guppy extension) to deliver immediate value with minimal operational change.
Phase 2: Pilot Option B (indexd-driven service) where near-real-time is required (e.g., admin dashboards).


Acceptance Criteria

  • ✅ Given sample inputs (two files under /data/...), /data aggregates correctly with file_count=2, total_size=10.
  • ✅ Unauthorized users cannot access aggregates for restricted projects.
  • ✅ GraphQL (Option A) or REST (Option B) returns expected JSON payloads.
  • ✅ Load testing demonstrates acceptable latency (<200 ms read P95).
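
For the latency criterion, one possible way to evaluate load-test output is a nearest-rank P95 over the recorded read latencies. The helper names and the choice of the nearest-rank method are assumptions; a load-testing tool may report percentiles with a slightly different interpolation.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a collection of latencies (ms)."""
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("no latency samples")
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_target(latencies_ms, threshold_ms=200):
    """True when read P95 is strictly below the threshold."""
    return p95(latencies_ms) < threshold_ms


samples = list(range(1, 101))  # hypothetical latencies: 1..100 ms
print(p95(samples), meets_latency_target(samples))
```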

Next Steps (Sprint Backlog)

For Option A

  • Add path rollup computation to dataframe ETL.
  • Register path_aggregation index in Guppy schema.
  • Validate output via GraphQL query.
  • Update documentation and example queries.

For Option B (pilot)

  • Implement path_agg_res table and triggers.
  • Create FastAPI REST endpoint with JWT/Fence validation.
  • Deploy as separate service behind feature flag.
  • Compare results to Guppy aggregates for correctness and freshness.
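
The comparison in the last pilot item could be sketched as a keyed diff between the two sets of aggregates. This assumes both sides have already been loaded into dicts keyed by `(resource, prefix)`; how they are fetched from Guppy or the pilot service is left out, and the function name is hypothetical.

```python
def diff_aggregates(service_agg, guppy_agg):
    """Return entries that differ or exist on only one side.

    Both inputs map (resource, prefix) -> (file_count, total_size).
    A missing entry is reported as None for that side.
    """
    mismatches = {}
    for key in service_agg.keys() | guppy_agg.keys():
        left, right = service_agg.get(key), guppy_agg.get(key)
        if left != right:
            mismatches[key] = {"service": left, "guppy": right}
    return mismatches


service = {("proj-1", "/data"): (3, 15), ("proj-1", "/"): (3, 15)}
guppy = {("proj-1", "/data"): (2, 10), ("proj-1", "/"): (2, 10), ("proj-2", "/"): (1, 1)}
for key, sides in sorted(diff_aggregates(service, guppy).items()):
    print(key, sides)
```

A non-empty diff is not necessarily a bug: the service side may simply be fresher than the last ETL run, so mismatches are worth re-checking after the next refresh.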
