# design-v2: resolve architecture inconsistencies before implementation #19
## Summary
A cross-document review of `docs/design-v2/` found several design inconsistencies that should be resolved before implementation proceeds. The mathematical examples look internally consistent, but the architecture docs still disagree on crate boundaries, identity/caching rules, backend IR boundaries, and a few primitive-lowering details.
## Blocking inconsistencies
- `computegraph-rs` is described as AD-agnostic, but the core graph identity includes `OpMode::Primal` / `OpMode::Linear { active_mask }`. This makes AD-specific semantics part of the supposedly AD-agnostic layer.
  - Relevant docs: `docs/design-v2/README.md`, `docs/design-v2/computegraph-design.md`, `docs/design-v2/ad-architecture.md`
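One way to resolve this is to keep structural identity in the core layer mode-free and attach `OpMode` as metadata in the AD layer above. The following is a minimal sketch of that layering, not the project's actual types; `CoreNode`, `core_id`, and the hashing scheme are all illustrative assumptions:

```rust
// Illustrative sketch only: all names here are hypothetical stand-ins for
// whatever computegraph-rs actually defines.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// AD-agnostic core node: structural identity is (op, inputs) only.
#[derive(Hash, PartialEq, Eq)]
struct CoreNode {
    op: &'static str,
    inputs: Vec<u64>,
}

/// AD mode metadata lives in a layer above the core graph, so it never
/// feeds the structural hash.
enum OpMode {
    Primal,
    Linear { active_mask: u64 },
}

fn core_id(node: &CoreNode) -> u64 {
    let mut h = DefaultHasher::new();
    node.hash(&mut h);
    h.finish()
}

fn main() {
    let n = CoreNode { op: "add", inputs: vec![1, 2] };
    // Annotate the same core node with two different AD modes:
    let primal = (core_id(&n), OpMode::Primal);
    let linear = (core_id(&n), OpMode::Linear { active_mask: 0b01 });
    // Structural identity is unchanged by the AD mode:
    assert_eq!(primal.0, linear.0);
    if let OpMode::Linear { active_mask } = linear.1 {
        println!("linear mode, active_mask = {active_mask:#b}");
    }
    let _ = primal.1;
}
```

Under this layering, `OpMode` would move out of the graph identity entirely; whether that is acceptable depends on whether any core-level pass needs to distinguish primal from linear nodes.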
- The cache/identity story is not compatible with the current `InputKey` design.
  - `GlobalValKey::Input(InputKey)` participates in structural identity.
  - `differentiate` generates fresh tangent keys via unique `DiffPassId` values.
  - The API docs still claim "same graph structure -> cache hit", but without a stable input-key normalization rule this is not well-defined.
  - Relevant docs: `docs/design-v2/computegraph-design.md`, `docs/design-v2/chainrules-design.md`, `docs/design-v2/tidu-design.md`, `docs/design-v2/tensor-api-pseudocode.md`
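One candidate normalization rule, sketched under the assumption that input keys are opaque integers: rewrite each distinct key to a dense index in first-use order before hashing, so two graphs that differ only in the raw key values (e.g. tangent keys minted by different `DiffPassId`s) still produce the same cache key. `normalize` and `RawKey` are hypothetical names:

```rust
// Hypothetical input-key normalization for cache lookup; not the project's API.
use std::collections::HashMap;

/// Raw input keys as generated (e.g. fresh tangent keys per differentiate pass).
type RawKey = u64;

/// Map each distinct key to its first-use index, so structurally identical
/// graphs normalize to identical key sequences regardless of raw key values.
fn normalize(inputs_in_use_order: &[RawKey]) -> Vec<u32> {
    let mut seen: HashMap<RawKey, u32> = HashMap::new();
    inputs_in_use_order
        .iter()
        .map(|k| {
            let next = seen.len() as u32;
            *seen.entry(*k).or_insert(next)
        })
        .collect()
}

fn main() {
    // Two differentiate passes mint different raw tangent keys...
    let pass_a = [10, 77, 10]; // primal x, tangent from one DiffPassId, x again
    let pass_b = [10, 91, 10]; // same structure, different DiffPassId
    // ...but normalize to the same canonical key sequence:
    assert_eq!(normalize(&pass_a), normalize(&pass_b));
    println!("{:?}", normalize(&pass_a)); // [0, 1, 0]
}
```

Whatever rule is chosen, the point is that the docs must state it explicitly, otherwise "same graph structure -> cache hit" has no defined meaning.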
- The public AD API is underspecified relative to the lower-level transform contract.
  - `differentiate` creates tangent `InputKey`s inside the returned fragment.
  - The user-facing API shows `y.jvp(&x, &t_x)` but does not explain how `t_x` is bound to those generated tangent keys.
  - `grad()` / VJP seed semantics for non-scalar outputs are also left implicit.
  - Relevant docs: `docs/design-v2/ad-architecture.md`, `docs/design-v2/tidu-design.md`, `docs/design-v2/tensor-api-pseudocode.md`
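One possible contract that would close this gap: `differentiate` returns, alongside the fragment, a map from each primal input key to the tangent key it minted, and `jvp` resolves the binding through that map. The sketch below is an assumption about shape, not the actual API; `Fragment`, `bind_tangent`, and the key-minting scheme are all invented for illustration:

```rust
// Hypothetical binding contract; every name here is illustrative.
use std::collections::HashMap;

type InputKey = u64;

struct Fragment {
    /// Map from primal input key -> the fresh tangent key minted for it.
    /// (The fragment's graph body is elided.)
    tangent_keys: HashMap<InputKey, InputKey>,
}

fn differentiate(primal_inputs: &[InputKey], pass_id: u64) -> Fragment {
    let tangent_keys = primal_inputs
        .iter()
        .map(|&k| (k, (pass_id << 32) | k)) // illustrative key-minting scheme
        .collect();
    Fragment { tangent_keys }
}

/// What y.jvp(&x, &t_x) would consult: which generated key does t_x feed?
fn bind_tangent(frag: &Fragment, x: InputKey) -> Option<InputKey> {
    frag.tangent_keys.get(&x).copied()
}

fn main() {
    let x: InputKey = 10;
    let frag = differentiate(&[x], 7);
    let key_for_t_x = bind_tangent(&frag, x).expect("x is a primal input");
    println!("t_x binds to generated key {key_for_t_x}");
}
```

A returned mapping like this would also give `grad()` / VJP a natural place to document seed semantics for non-scalar outputs, since the seed is just another bound input.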
- Backend IR boundaries are inconsistent.
  - The overview says all three standard backends accept StableHLO.
  - Later sections say faer/custom GPU interpret `CompiledProgram` directly.
  - Another section describes faer as a StableHLO interpreter.
  - This changes crate boundaries, lowering responsibilities, and cache layering.
  - Relevant doc: `docs/design-v2/backend-architecture.md`
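One way to force the docs (and later the code) to tell a single story is to make the backend input a closed type, so each backend's consumed IR is written down exactly once. This is a sketch of the idea only; `BackendInput`, `input_kind`, and the `CompiledProgram` stand-in are hypothetical:

```rust
// Hypothetical sketch: a closed type for "what does this backend consume?".
struct CompiledProgram; // stand-in for the real type

enum BackendInput {
    /// Backend consumes a serialized StableHLO module (e.g. XLA).
    StableHlo(String),
    /// Backend interprets CompiledProgram directly (one reading of faer/GPU).
    Direct(CompiledProgram),
}

fn input_kind(i: &BackendInput) -> &'static str {
    match i {
        BackendInput::StableHlo(_) => "StableHLO",
        BackendInput::Direct(_) => "CompiledProgram",
    }
}

fn main() {
    let xla = BackendInput::StableHlo("module { ... }".to_string());
    let faer = BackendInput::Direct(CompiledProgram); // or StableHlo: pick ONE
    println!("xla consumes {}", input_kind(&xla));
    println!("faer consumes {}", input_kind(&faer));
}
```

The enum itself is not the point; the point is that whichever pipeline is chosen per backend determines where lowering lives and at which IR level the compilation cache keys.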
- `Dup` lowering is inconsistent with the stated 1:1 StableHLO lowering rule.
  - `Dup` is defined as a multi-output primitive.
  - The backend doc maps it to `stablehlo.broadcast_in_dim`, which is not a multi-output duplication op.
  - Relevant docs: `docs/design-v2/primitive-catalog.md`, `docs/design-v2/backend-architecture.md`
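If `Dup` turns out to be transform-time only, one resolution is to erase it before lowering by rewiring consumers straight to the duplicated value, so no multi-output StableHLO op is ever needed. A toy sketch of that rewrite, with an invented `Op` IR that is not the project's real representation:

```rust
// Toy IR to illustrate erasing a transform-time Dup; names are invented.
#[derive(Clone, Debug, PartialEq)]
enum Op {
    Input(&'static str),
    Dup { src: usize, n_outputs: usize },
    Use { src: usize },
}

/// Rewire every consumer of a Dup output to the Dup's source, leaving the
/// Dup node dead (removable by ordinary DCE before lowering).
fn rewire_past_dups(ops: &[Op]) -> Vec<Op> {
    ops.iter()
        .map(|op| match op {
            Op::Use { src } => match &ops[*src] {
                Op::Dup { src: orig, .. } => Op::Use { src: *orig },
                _ => op.clone(),
            },
            _ => op.clone(),
        })
        .collect()
}

fn main() {
    let ops = vec![
        Op::Input("x"),                   // 0
        Op::Dup { src: 0, n_outputs: 2 }, // 1: fan x out to two consumers
        Op::Use { src: 1 },               // 2
        Op::Use { src: 1 },               // 3
    ];
    let rewired = rewire_past_dups(&ops);
    // Both consumers now read x directly; nothing reaches the lowering
    // stage that would need a multi-output duplication op.
    assert_eq!(rewired[2], Op::Use { src: 0 });
    assert_eq!(rewired[3], Op::Use { src: 0 });
    println!("{rewired:?}");
}
```

Note that `stablehlo.broadcast_in_dim` replicates data across dimensions of a single result; it does not model a one-input, n-output fan-out, which is why the current mapping cannot be right under the 1:1 rule.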
## Important but secondary inconsistencies
- The roadmap phases do not match the stated einsum decomposition requirements.
  - `einsum` decomposition depends on `Reshape`, `Transpose`, and `BroadcastInDim`.
  - The backend roadmap still places several of those in Phase 2 while claiming Phase 1 einsum support.
  - Relevant docs: `docs/design-v2/tensor-design.md`, `docs/design-v2/backend-architecture.md`
- The linalg-to-StableHLO boundary is not fixed consistently.
  - `backend-architecture.md` treats linalg ops broadly as `custom_call`.
  - `stablehlo-primitives.md` lists a direct `cholesky` op.
  - `jax-stablehlo-primitives-needed-for-tenferro.md` uses a different decomposition story again.
  - Relevant docs: `docs/design-v2/backend-architecture.md`, `docs/design-v2/stablehlo-primitives.md`, `docs/design-v2/jax-stablehlo-primitives-needed-for-tenferro.md`
## Minor doc issues
- `PrimitiveOp`'s `InputKey: ADKey` bound is documented inconsistently.
- `Tensor.strides` uses both `Vec<isize>` and `Vec<usize>` across docs.
- The SVD example in `tensor-api-pseudocode.md` uses `diag(&s)` even though `tensor-design.md` explicitly argues for hyper-edge reconstruction using `s` directly.
- `computegraph-design.md` has a stray code fence in the `GraphOp` section.
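On the strides type: if the design ever wants views with reversed axes, `Vec<isize>` is the forcing argument, since a reversed view needs a negative stride that `usize` cannot express. A minimal illustration (the array contents are arbitrary):

```rust
// Why strides may need to be signed: a reversed 1-D view over `data`
// starts at the last element and steps by -1.
fn main() {
    let data = [1.0f64, 2.0, 3.0, 4.0];
    let (start, stride): (isize, isize) = (3, -1);
    let reversed: Vec<f64> = (0..4isize)
        .map(|i| data[(start + i * stride) as usize])
        .collect();
    assert_eq!(reversed, vec![4.0, 3.0, 2.0, 1.0]);
    println!("{reversed:?}");
}
```

If reversed views are out of scope, `Vec<usize>` is simpler; either way the docs should commit to one type and state the reason.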
## Suggested resolution
Before implementation, pick and document one coherent answer for each of the following:
- Is `OpMode` part of `computegraph-rs`, or does AD-specific mode metadata live above `computegraph-rs`?
- What is the canonical cache key for compiled programs?
- How are user-facing `TracedTensor` inputs mapped onto stable `InputKey`s?
- What is the exact IR pipeline for faer, custom GPU, and XLA?
- Is `Dup` a real persistent primitive, or only a transform-time/internal construct?
- Which linalg ops lower directly to StableHLO ops, and which always lower to `custom_call`?
## Acceptance criteria
- Update the architecture docs so they tell one consistent story about graph identity, AD layering, caching, and backend lowering.
- Make the public API examples consistent with the lower-level transform contracts.
- Reconcile the primitive catalog, backend architecture, and StableHLO planning docs.