Path Hierarchy Solutions

# ISSUE: Project Hierarchy Aggregation Architecture Decision

## Summary

We need to support hierarchical path-based aggregation (file counts + total sizes per directory prefix) for datasets, keyed by resource/project. Two implementation options have been proposed.

---

## Option A — Extend dataframe / Guppy

**Concept:** Add a new `path_aggregation` datatype to the existing Guppy / dataframe pipeline.  
**Benefits:**
- Reuses **existing Guppy service** (no new Helm or CI/CD deployments).
- Reuses **existing authN/authZ** model — users only see projects they can access.
- Exposes **existing GraphQL API** — immediately usable by frontend teams.
- Leverages Guppy caching and query optimizations.
- Simple operational footprint (no new service).

**Prototype:**
- https://github.com/ACED-IDP/aced_etl_pod/pull/55
- https://github.com/ACED-IDP/gen3_util/pull/148

**Trade-offs:**
- Freshness depends on ETL cadence, so aggregates are not real-time.
- Additional computation during ETL refreshes.
- Must keep prefix logic synchronized with indexd definitions.

**Delivery path:**
1. Extend dataframe ETL to compute `(resource, path_prefix)` rollups.
2. Add `path_aggregation` type and schema to Guppy.
3. Backfill aggregates and expose via GraphQL.
4. Frontend can query immediately using the existing Guppy endpoint.
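
Step 1 above can be sketched in Python as follows. This is a minimal illustration, not the prototype's code: the record fields (`resource`, `path`, `size`), the function names, and the example resource string are all hypothetical, and it assumes absolute POSIX-style paths.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def path_prefixes(file_path: str) -> list[str]:
    """Return every ancestor directory of file_path, nearest first."""
    return [str(p) for p in PurePosixPath(file_path).parents]


def rollup(records):
    """Aggregate file_count and total_size per (resource, path_prefix).

    `records` is an iterable of dicts with hypothetical keys
    'resource', 'path', and 'size' (bytes).
    """
    totals = defaultdict(lambda: {"file_count": 0, "total_size": 0})
    for rec in records:
        # Each file contributes to every directory above it.
        for prefix in path_prefixes(rec["path"]):
            agg = totals[(rec["resource"], prefix)]
            agg["file_count"] += 1
            agg["total_size"] += rec["size"]
    return dict(totals)


# Hypothetical sample: two files under /data whose sizes sum to 10.
RESOURCE = "/programs/demo/projects/p1"
sample = [
    {"resource": RESOURCE, "path": "/data/a.bin", "size": 4},
    {"resource": RESOURCE, "path": "/data/sub/b.bin", "size": 6},
]
aggregates = rollup(sample)
```

With this sample, `aggregates[(RESOURCE, "/data")]` holds `file_count=2` and `total_size=10`, matching the shape of the acceptance criterion. Because the rollup is a pure function of the input records, it can run in full on every ETL refresh.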

---

## Option B — Create a new indexd-driven service

**Concept:** Create a standalone microservice (FastAPI/Flask) backed by Postgres triggers that maintain a `path_agg_res` table in near-real-time.  
**Benefits:**
- **Real-time freshness** — triggers fire on every indexd insert/update/delete.
- **Direct coupling** with indexd source-of-truth.
- Lightweight REST API (`GET /{resource}/path/{path}`) for analytics or dashboards.
- Easy to scale independently.

**Trade-offs:**
- Requires **new Helm chart** and operational support. 🚧 
- Must **implement auth parity** (Fence/Arborist integration).   🚧 
- Adds another service surface (new REST endpoint vs. GraphQL).  🚧 
- Captures all updates to indexd ❓
  - Adds DB overhead to every indexd insert (one insert becomes 1 insert + 2 reads + 1 write). 👎
  - **All** indexd writes are aggregated, including records that are never published. 👎

**Delivery path:**
1. Define and deploy Postgres triggers for path aggregation.
2. Implement FastAPI endpoint with authz middleware.
3. Helm deployment, metrics, alerts.
4. Optional Guppy resolver to proxy to this API for single-client integration.
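
The incremental bookkeeping that the triggers in step 1 would perform can be modelled in plain Python. This is only an in-memory sketch of the delta logic, not trigger code: the class `PathAggTable`, its method names, and the `(file_count, total_size)` row layout are assumptions, and a real implementation would be written as Postgres trigger functions against the `path_agg_res` table.

```python
from pathlib import PurePosixPath


class PathAggTable:
    """In-memory stand-in for path_agg_res, keyed by (resource, path_prefix)."""

    def __init__(self):
        # (resource, prefix) -> (file_count, total_size)
        self.rows = {}

    def _apply(self, resource, file_path, count_delta, size_delta):
        # Apply the delta to every ancestor directory of the file.
        for prefix in PurePosixPath(file_path).parents:
            key = (resource, str(prefix))
            count, size = self.rows.get(key, (0, 0))
            count, size = count + count_delta, size + size_delta
            if count == 0:
                self.rows.pop(key, None)  # drop prefixes with no files left
            else:
                self.rows[key] = (count, size)

    def on_insert(self, resource, file_path, size):
        self._apply(resource, file_path, +1, size)

    def on_delete(self, resource, file_path, size):
        self._apply(resource, file_path, -1, -size)

    def on_update(self, resource, file_path, old_size, new_size):
        # Size changed in place; the file count is unaffected.
        self._apply(resource, file_path, 0, new_size - old_size)


table = PathAggTable()
table.on_insert("p1", "/data/a.bin", 4)
table.on_insert("p1", "/data/sub/b.bin", 6)
table.on_delete("p1", "/data/sub/b.bin", 6)
```

Each write touches one row per ancestor directory, so the per-insert cost grows with path depth. That is the overhead noted in the trade-offs above.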

---

## Comparison Summary

| Aspect | Option A: Guppy Extension | Option B: New Service |
|--------|---------------------------|-----------------------|
| **Service footprint** | No new service | New Helm deployment |
| **AuthN/AuthZ** | Inherits Guppy model | Must replicate Fence/Arborist auth |
| **API** | GraphQL (existing clients) | REST (new endpoint) |
| **Freshness** | ETL cadence (batch) | Near-real-time (DB triggers) |
| **Ops effort** | Low | Higher (new CI/CD, SLOs) |
| **Frontend changes** | None | Requires new client integration |
| **Complexity locus** | ETL + Guppy schema | DB triggers + API layer |

---

## Recommendation

**Phase 1:** Implement **Option A (Guppy extension)** to deliver immediate value with minimal operational change.  
**Phase 2:** Pilot **Option B (indexd-driven service)** where near-real-time is required (e.g., admin dashboards).

---

## Acceptance Criteria

- ✅ Given sample inputs (two files under `/data/...`), `/data` aggregates correctly with `file_count=2`, `total_size=10`.
- ✅ Unauthorized users cannot access aggregates for restricted projects.
- ✅ GraphQL (Option A) or REST (Option B) returns expected JSON payloads.
- ✅ Load testing demonstrates acceptable read latency (P95 < 200 ms).
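
The authorization criterion could be checked with a filter along these lines. It is a sketch under stated assumptions: the row shape and the function `visible_aggregates` are hypothetical, and the set of allowed resources is taken as given rather than fetched from Fence/Arborist.

```python
def visible_aggregates(aggregates, allowed_resources):
    """Return only the aggregate rows whose resource the caller may access."""
    allowed = set(allowed_resources)
    return [row for row in aggregates if row["resource"] in allowed]


# Hypothetical rows: one open project and one restricted project.
rows = [
    {"resource": "open-project", "path": "/data", "file_count": 2, "total_size": 10},
    {"resource": "restricted-project", "path": "/data", "file_count": 7, "total_size": 99},
]

# A caller authorized only for 'open-project' never sees the restricted row.
payload = visible_aggregates(rows, allowed_resources=["open-project"])
```

Filtering by resource before serialization means the same check applies whether the payload is returned through GraphQL or REST.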

---

## Next Steps (Sprint Backlog)

### For Option A
- [ ] Add path rollup computation to dataframe ETL.
- [ ] Register `path_aggregation` index in Guppy schema.
- [ ] Validate output via GraphQL query.
- [ ] Update documentation and example queries.

### For Option B (pilot)
- [ ] Implement `path_agg_res` table and triggers.
- [ ] Create FastAPI REST endpoint with JWT/Fence validation.
- [ ] Deploy as separate service behind feature flag.
- [ ] Compare results to Guppy aggregates for correctness and freshness.
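
The comparison in the last pilot task could start from a small diff helper such as the one below. It is a sketch only: `diff_aggregates` and the `(resource, path) -> (file_count, total_size)` mapping are assumed shapes, not an existing API of Guppy or the proposed service.

```python
def diff_aggregates(guppy_rows, service_rows):
    """Return {key: (guppy_value, service_value)} for every key that disagrees.

    Both inputs map (resource, path) -> (file_count, total_size).
    A key missing on one side is reported with None for that side.
    """
    keys = guppy_rows.keys() | service_rows.keys()
    return {
        key: (guppy_rows.get(key), service_rows.get(key))
        for key in sorted(keys)
        if guppy_rows.get(key) != service_rows.get(key)
    }


# Hypothetical snapshot: the service has already seen a file the ETL has not.
guppy = {("p1", "/data"): (2, 10)}
service = {("p1", "/data"): (3, 15), ("p1", "/data/new"): (1, 5)}
mismatches = diff_aggregates(guppy, service)
```

An empty result indicates the two sources agree. Mismatches that clear after the next ETL run point to freshness lag rather than a correctness bug.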

