# Extract tracing into separate tracing-server with OTLP upsert + platform-server integration (one-shot)

## User request
We have our own custom tracing in the app (LLM calls, tool calls, etc.). It currently lives in the monolith and shares the DB with the main app. We want to:

- Extract tracing into a separate service
- Use an isolated database
- Make the ingestion API backward compatible with OpenTelemetry, with upsert semantics and support for running spans (no end_time)

Constraints:
- Keep existing UX: real-time visibility of running events and rich views (LLM details, shell streaming output). OpenTelemetry doesn’t allow updating a span once it has been submitted, so our API must accept upserts and spans with no end time.
- Support custom views via `$schema` attributes (e.g., `agyn.io/tools/shell`, `agyn.io/llm_call`) and include parameters to fetch additional UI data (e.g., `context_id`, `stream_url`).
- Do not start implementation without approval. Implement in one PR with full tests for tracing-server and platform-server integration.

---

## Technical specification

### Goals / constraints
1) Extract tracing into a dedicated `tracing-server` with its own Postgres DB.
2) Provide an OTLP/HTTP ingestion API with upsert semantics and support for running spans (missing end_time).
3) Preserve current UI and realtime behavior without changing platform-ui. Socket.IO remains the realtime mechanism, owned by platform-server.
4) Keep tracing data minimal; avoid duplication. Context items remain owned by the agent/LLM loop module (platform-server); tracing holds references only and does not request/read platform data.
5) Operate ingestion and internal/query APIs on separate ports so they can be registered as separate services in the mesh.

### Current references in platform-server
- Tracing CRUD + serialization: `packages/platform-server/src/events/run-events.service.ts`
- In-memory bus + Socket.IO gateway: `packages/platform-server/src/events/events-bus.service.ts`, `packages/platform-server/src/gateway/graph.socket.gateway.ts`
- REST endpoints consumed by UI: `packages/platform-server/src/agents/threads.controller.ts` and `packages/platform-server/src/agents/contextItems.controller.ts`
- DB schema for tracing-related tables: `packages/platform-server/prisma/schema.prisma`

### Target architecture
- New package: `packages/tracing-server` (NestJS + Prisma + Postgres) using `TRACING_DATABASE_URL`.
- Two listeners on separate ports (same image/codebase):
  - OTLP ingestion server on port 4318 (the default OTLP/HTTP port): handles `/v1/traces` (JSON + protobuf)
  - Internal/query API server on port 4100: serves the domain endpoints and issues callbacks to platform-server
- platform-server acts strictly as a client/bridge for tracing data:
  - Writes and reads tracing data via tracing-server’s internal/query API
  - Continues to emit Socket.IO to the UI; payload shapes remain unchanged
  - Does not enrich tracing payloads with platform-owned data

### OTLP/HTTP ingestion (OpenTelemetry-compatible)
- Endpoint: `POST /v1/traces` on port 4318
  - Content types: `application/json` (OTLP JSON), `application/x-protobuf` (OTLP protobuf)
  - No authentication for now
- Upsert key: `(trace_id, span_id)` with merge rules:
  - start time: first-write-wins
  - end time: set when provided (running → completed)
  - attributes: upsert/overwrite by key (no deletions)
  - status: last-write-wins
  - events: append-only with dedupe by `(timeUnixNano, name, attributesHash)`
  - links: append-only with dedupe
  - resource attrs + scope: upsert by key; last-write-wins
  - track `lastSeenAt/lastUpdatedAt`
- Custom views via `$schema` attribute (`agyn.io/llm_call`, `agyn.io/tools/shell`, etc.) and routing attrs:
  - required: `run_id`, `thread_id`
  - optional: `node_id`, `context_id`, `stream_url`
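
The merge rules above can be sketched as a pure function over two versions of the same span. This is a minimal illustration under stated assumptions, not the planned implementation: the `SpanRecord` and `SpanEvent` shapes and the `eventKey` helper are hypothetical, a sorted-key JSON string stands in for `attributesHash`, and links, resource attributes, and scope are left out for brevity.

```typescript
// Hypothetical stored-span shape; field names are illustrative, not the real schema.
interface SpanEvent {
  timeUnixNano: string;
  name: string;
  attributes: Record<string, unknown>;
}

interface SpanRecord {
  traceId: string;
  spanId: string;
  startTimeUnixNano: string;
  endTimeUnixNano: string | null; // null while the span is still running
  attributes: Record<string, unknown>;
  status: { code: number; message?: string } | null;
  events: SpanEvent[];
  lastUpdatedAt: number; // epoch millis of the last accepted write
}

// Dedupe key (timeUnixNano, name, attributesHash). A sorted-key JSON string
// stands in for a real hash; nested attribute values are not normalized.
function eventKey(e: SpanEvent): string {
  const sorted = Object.keys(e.attributes)
    .sort()
    .map((k) => [k, e.attributes[k]]);
  return [e.timeUnixNano, e.name, JSON.stringify(sorted)].join("|");
}

// Merge an incoming write into the stored span for the same (trace_id, span_id).
function mergeSpan(existing: SpanRecord | null, incoming: SpanRecord, now: number): SpanRecord {
  if (existing === null) {
    return { ...incoming, lastUpdatedAt: now };
  }
  const seen = new Set(existing.events.map(eventKey));
  const appended: SpanEvent[] = [];
  for (const e of incoming.events) {
    const key = eventKey(e);
    if (!seen.has(key)) {
      seen.add(key);
      appended.push(e);
    }
  }
  return {
    traceId: existing.traceId,
    spanId: existing.spanId,
    startTimeUnixNano: existing.startTimeUnixNano, // first-write-wins
    endTimeUnixNano: incoming.endTimeUnixNano ?? existing.endTimeUnixNano, // set when provided
    attributes: { ...existing.attributes, ...incoming.attributes }, // overwrite by key, no deletions
    status: incoming.status ?? existing.status, // last-write-wins when a status is sent
    events: [...existing.events, ...appended], // append-only with dedupe
    lastUpdatedAt: now,
  };
}
```

Keeping the merge free of I/O would let the upsert path load the row, call `mergeSpan`, and write the result inside one transaction, and it makes the running → completed transition easy to unit test.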

### Tracing-server internal domain API (for platform-server)
- Served on port 4100; no authentication for now
- Endpoints (responses match current platform-server serializers):
  - POST `/internal/runs/:runId/events/llm/start` → `{ eventId, timelineEvent }`
  - PATCH `/internal/events/:eventId/llm/complete` → `{ timelineEvent }`
  - POST `/internal/runs/:runId/events/tools/start` → `{ eventId, timelineEvent }`
  - PATCH `/internal/events/:eventId/tools/complete` → `{ timelineEvent }`
  - POST `/internal/events/:eventId/tool-output/chunk` → ToolOutputChunkPayload
  - PUT  `/internal/events/:eventId/tool-output/terminal` → ToolOutputTerminalPayload
  - GET  `/internal/runs/:runId/summary` → RunTimelineSummary
  - GET  `/internal/runs/:runId/events` → RunTimelineEventsResult (query parameters: `types`, `statuses`, `cursorTs`, `cursorId`, `limit`, `order`)
  - GET  `/internal/runs/:runId/events/:eventId/output` → ToolOutputSnapshot (query parameters: `sinceSeq`, `limit`, `order`)
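
As an illustration of how a caller might address the events endpoint, the sketch below builds the request URL from the listed query parameters. The wire format is an assumption: the spec names the parameters but not their encoding, so comma-separated lists for `types`/`statuses` and `asc`/`desc` for `order` are guesses, and `buildRunEventsUrl` is a hypothetical helper.

```typescript
// Query options named after the parameters listed for GET /internal/runs/:runId/events.
interface RunEventsQuery {
  types?: string[];
  statuses?: string[];
  cursorTs?: string;
  cursorId?: string;
  limit?: number;
  order?: "asc" | "desc"; // assumed values
}

function buildRunEventsUrl(baseUrl: string, runId: string, query: RunEventsQuery = {}): string {
  const pairs: Array<[string, string]> = [];
  // Assumption: list-valued filters travel as comma-separated strings.
  if (query.types && query.types.length > 0) pairs.push(["types", query.types.join(",")]);
  if (query.statuses && query.statuses.length > 0) pairs.push(["statuses", query.statuses.join(",")]);
  // Assumption: the cursor is the (timestamp, id) pair of the last event already seen.
  if (query.cursorTs !== undefined) pairs.push(["cursorTs", query.cursorTs]);
  if (query.cursorId !== undefined) pairs.push(["cursorId", query.cursorId]);
  if (query.limit !== undefined) pairs.push(["limit", String(query.limit)]);
  if (query.order !== undefined) pairs.push(["order", query.order]);

  const qs = pairs.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  const path = `/internal/runs/${encodeURIComponent(runId)}/events`;
  // Strip trailing slashes from the base URL so the path is not doubled up.
  return baseUrl.replace(/\/+$/, "") + path + (qs ? `?${qs}` : "");
}
```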

### Realtime behavior (Socket.IO only; no REST polling)
- Socket.IO remains the only realtime mechanism and is owned by platform-server
- To avoid REST polling and cross-service data lookups, tracing-server invokes **internal callback endpoints** on platform-server whenever a run event is appended or updated, or a tool output chunk or terminal record is written. Platform-server forwards these payloads over Socket.IO without modification
- Callback endpoints (platform-server, no auth for now):
  - POST `/internal/tracing/run-events` with `{ runId, threadId, mutation: 'append'|'update', event: RunTimelineEvent }`
  - POST `/internal/tracing/tool-output/chunk` with ToolOutputChunkPayload
  - POST `/internal/tracing/tool-output/terminal` with ToolOutputTerminalPayload
- Later, we will extract the realtime gateway out of platform-server; for now, these callbacks are the only calls tracing-server makes into platform-server
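
A minimal sketch of the forward-without-modification rule for the `/internal/tracing/run-events` callback follows. Everything outside the payload contract is an assumption: `RealtimeEmitter` stands in for the Socket.IO gateway, and the room name `run:<runId>` and event name `run_event` are placeholders rather than the names `GraphSocketGateway` actually uses.

```typescript
// RunTimelineEvent is treated as opaque here; platform-server must not reshape it.
type RunTimelineEvent = Record<string, unknown>;

interface RunEventCallback {
  runId: string;
  threadId: string;
  mutation: "append" | "update";
  event: RunTimelineEvent;
}

// Hypothetical emitter abstraction standing in for the Socket.IO gateway.
interface RealtimeEmitter {
  emit(room: string, eventName: string, payload: unknown): void;
}

// Shallow structural check of the callback body before forwarding it.
function isRunEventCallback(body: unknown): body is RunEventCallback {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return (
    typeof b["runId"] === "string" &&
    typeof b["threadId"] === "string" &&
    (b["mutation"] === "append" || b["mutation"] === "update") &&
    typeof b["event"] === "object" &&
    b["event"] !== null
  );
}

// Returns false for malformed bodies so the HTTP handler can answer with a 4xx.
function forwardRunEvent(body: unknown, emitter: RealtimeEmitter): boolean {
  if (!isRunEventCallback(body)) return false;
  // Room and event names are placeholders; the payload is passed through untouched.
  emitter.emit(`run:${body.runId}`, "run_event", body);
  return true;
}
```

Because the handler passes the same object it received, a test can assert on object identity to confirm that nothing was enriched or rewritten on the way to the socket.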

### Tracing-server data model (Prisma) — minimal projection, no full DB clone
- Canonical OTLP `spans` table with routing fields (runId, threadId, nodeId, schema) denormalized for query
- Minimal projection tables scoped to tracing needs (no foreign keys to platform-server):
  - `run_events` (type/status/timestamps/nodeId/sourceKind/sourceSpanId/metadata/error fields/idempotencyKey)
  - `llm_calls` (provider/model/temperature/topP/stopReason/responseText/rawResponse/usage/toolCalls, references eventId)
  - `tool_executions` (toolName/toolCallId/input/output/execStatus/errorMessage/raw, references eventId)
  - `tool_output_chunks` and `tool_output_terminals` (to support shell streaming snapshots)
  - `event_attachments` (optional; only if needed for payload richness)
  - `injections` and `summarizations` (only if they originate from tracing)
- Context items: tracing-server stores only references (ids, isNew, order). No content or role duplication. UI will retrieve context item details from platform-server via existing endpoints

### Platform-server integration
- Add an HTTP client adapter replacing direct Prisma writes/reads at `RunEventsService` call sites, preserving method signatures
- After write operations, emit Socket.IO using returned payloads from tracing-server (no mutation)
- Read endpoints (`threads.controller.ts`) proxy to tracing-server internal API and return payloads as-is (no enrichment). UI will fetch any additional data (e.g., context items) from platform-server’s dedicated endpoints
- Implement the internal callback endpoints to receive tracing-server notifications and emit Socket.IO without polling
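
The HTTP client adapter could look roughly like the sketch below, shown for the two LLM endpoints only. The class name `TracingApiClient`, the `FetchLike` type, and the request body shape are assumptions made for illustration; only the paths, HTTP methods, and response fields come from the endpoint list in this spec.

```typescript
// Structural subset of fetch used by the adapter, so a fake can be injected in tests.
interface HttpResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<HttpResponseLike>;

interface StartEventResult {
  eventId: string;
  timelineEvent: unknown; // serialized RunTimelineEvent, passed through untouched
}

class TracingApiClient {
  private readonly baseUrl: string;
  private readonly fetchImpl: FetchLike;

  constructor(baseUrl: string, fetchImpl: FetchLike) {
    this.baseUrl = baseUrl.replace(/\/+$/, "");
    this.fetchImpl = fetchImpl;
  }

  // POST /internal/runs/:runId/events/llm/start
  startLlmCall(runId: string, payload: Record<string, unknown>): Promise<StartEventResult> {
    const path = `/internal/runs/${encodeURIComponent(runId)}/events/llm/start`;
    return this.send("POST", path, payload) as Promise<StartEventResult>;
  }

  // PATCH /internal/events/:eventId/llm/complete
  completeLlmCall(eventId: string, payload: Record<string, unknown>): Promise<{ timelineEvent: unknown }> {
    const path = `/internal/events/${encodeURIComponent(eventId)}/llm/complete`;
    return this.send("PATCH", path, payload) as Promise<{ timelineEvent: unknown }>;
  }

  private async send(method: string, path: string, body: unknown): Promise<unknown> {
    const res = await this.fetchImpl(this.baseUrl + path, {
      method,
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
    });
    if (!res.ok) {
      throw new Error(`tracing-server ${method} ${path} failed with status ${res.status}`);
    }
    return res.json();
  }
}
```

Injecting the fetch implementation keeps the adapter testable without a running tracing-server; in production the global `fetch` available in Node 18+ could be passed in, with `TRACING_API_BASE_URL` as the base URL.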

### Backward-compatibility / UX guarantees
- platform-ui REST responses unchanged in shape and endpoint surface (`RunTimelineSummary`, `RunTimelineEventsResult`, `ToolOutputSnapshot`)
- Socket.IO event names and payload shapes unchanged and validated by existing schemas in `GraphSocketGateway`
- Running spans are represented by a missing `endTimeUnixNano` (status=running, endedAt=null)
- `$schema` + routing attrs required for projection into the run timeline; otherwise spans remain only in `spans`
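
The last two rules can be sketched as one decision function: a span is projected into the run timeline only when `$schema`, `run_id`, and `thread_id` are present, and it counts as running while it has no end time. The `IngestedSpan` and `TimelineProjection` shapes are hypothetical, and treating an empty or `"0"` end time the same as a missing one is an assumption (protobuf-decoded spans may carry a zero default instead of an absent field).

```typescript
// Hypothetical view of a decoded span with its attributes flattened to a plain object.
interface IngestedSpan {
  endTimeUnixNano?: string | null;
  attributes: Record<string, unknown>;
}

interface TimelineProjection {
  schema: string;
  runId: string;
  threadId: string;
  nodeId: string | null;
  running: boolean;
  endTimeUnixNano: string | null;
}

function asString(v: unknown): string | null {
  return typeof v === "string" && v.length > 0 ? v : null;
}

// Returns null when the span lacks `$schema` or a required routing attribute,
// in which case it would stay in `spans` only.
function projectSpan(span: IngestedSpan): TimelineProjection | null {
  const schema = asString(span.attributes["$schema"]);
  const runId = asString(span.attributes["run_id"]);
  const threadId = asString(span.attributes["thread_id"]);
  if (schema === null || runId === null || threadId === null) return null;

  // Assumption: absent, null, empty, and "0" end times all mean "still running".
  const raw = span.endTimeUnixNano;
  const endTime = typeof raw === "string" && raw !== "" && raw !== "0" ? raw : null;

  return {
    schema,
    runId,
    threadId,
    nodeId: asString(span.attributes["node_id"]), // optional routing attribute
    running: endTime === null,
    endTimeUnixNano: endTime,
  };
}
```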

### Configuration & deployment
- Ports:
  - Tracing Ingest (OTLP HTTP): 4318
  - Tracing Internal/Query API: 4100
- platform-server env: `TRACING_API_BASE_URL` (e.g., `http://tracing-api:4100`)
- tracing-server env: `TRACING_DATABASE_URL`; no auth vars for now
- docker-compose:
  - Add `tracing-db` (Postgres)
  - Add two services backed by the same image (or one container exposing two ports):
    - `tracing-ingest` on 4318
    - `tracing-api` on 4100
  - Wire platform-server to `tracing-api`
- Health: `/healthz`, `/readyz` on the internal/query API
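
One way the tracing-server startup could read this configuration is sketched below. Only `TRACING_DATABASE_URL` and the default ports 4318 and 4100 come from this spec; the `TRACING_INGEST_PORT` and `TRACING_API_PORT` variables and the `loadTracingServerConfig` helper are hypothetical additions for illustration.

```typescript
interface TracingServerConfig {
  databaseUrl: string;
  ingestPort: number; // OTLP/HTTP ingestion listener
  apiPort: number; // internal/query API listener
}

function parsePort(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid port: ${raw}`);
  }
  return port;
}

function loadTracingServerConfig(env: Record<string, string | undefined>): TracingServerConfig {
  const databaseUrl = env["TRACING_DATABASE_URL"];
  if (!databaseUrl) throw new Error("TRACING_DATABASE_URL is required");
  // Hypothetical override variables; the defaults are the ports named in this spec.
  const ingestPort = parsePort(env["TRACING_INGEST_PORT"], 4318);
  const apiPort = parsePort(env["TRACING_API_PORT"], 4100);
  if (ingestPort === apiPort) {
    throw new Error("Ingestion and internal/query API must listen on different ports");
  }
  return { databaseUrl, ingestPort, apiPort };
}
```

Calling `loadTracingServerConfig(process.env)` once at startup would fail fast on a missing database URL or a port clash before either listener is bound.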

### Tests
Tracing-server:
- OTLP JSON & protobuf ingestion with upsert semantics; running → completed; attribute/event merging
- `$schema` projection (LLM/tool) → correct `RunTimelineEvent`
- Tool output chunk/terminal storage + snapshot logic
- Emission of platform-server callbacks upon new/updated events and tool output

Platform-server integration:
- Verify Socket.IO payloads emitted after write operations and after receiving tracing-server callbacks are identical to current shapes
- Verify read proxies return payloads as-is

### Acceptance criteria
- `packages/tracing-server` implemented with minimal Prisma schema + migrations (no full DB clone)
- platform-server writes/reads go through tracing-server; REST + sockets unchanged for the UI
- OTLP ingest supports upsert and running spans
- Internal callbacks implemented so no REST polling is needed
- docker-compose integrates tracing-ingest (4318) + tracing-api (4100) + tracing DB
- Tests cover ingestion, projection, callbacks, and platform-server bridging

### Single-PR migration checklist
1) Create `packages/tracing-server` (Nest + Prisma) with `TRACING_DATABASE_URL`
2) Add migrations for `spans` + minimal projection tables
3) Implement internal/query API on 4100 returning identical payloads
4) Implement `POST /v1/traces` on 4318 (JSON + protobuf), upsert rules
5) Implement projection pipeline for `$schema` spans → run_events and trigger platform-server callbacks
6) Platform-server: add HTTP adapter, swap write/read paths to tracing-server; implement callback endpoints and emit Socket.IO based on responses/callbacks
7) Compose + env docs
8) Tests (tracing-server + platform-server integration)

Extract tracing into separate tracing-server with OTLP upsert + platform-server integration (one-shot) #1291
