Extract tracing into separate tracing-server with OTLP upsert + platform-server integration (one-shot) #1291

@rowan-stein

User request

We have our own custom tracing in the app (LLM calls, tool calls, etc.). It currently lives in the monolith and shares the DB with the main app. We want to:

  • Extract tracing into a separate service
  • Use an isolated database
  • Make the ingestion API backward compatible with OpenTelemetry, with upsert semantics and support for running spans (no end_time)

Constraints:

  • Keep existing UX: real-time visibility of running events and rich views (LLM details, shell streaming output). OpenTelemetry doesn’t allow updating submitted events, so our API must allow upserts and missing end times.
  • Support custom views via $schema attributes (e.g., agyn.io/tools/shell, agyn.io/llm_call) and include parameters to fetch additional UI data (e.g., context_id, stream_url).
  • Do not start implementation without approval. Implement in one PR with full tests for tracing-server and platform-server integration.

Technical specification

Goals / constraints

  1. Extract tracing into a dedicated tracing-server with its own Postgres DB.
  2. Provide an OTLP/HTTP ingestion API with upsert semantics and support for running spans (missing end_time).
  3. Preserve current UI and realtime behavior without changing platform-ui. Socket.IO remains the realtime mechanism, owned by platform-server.
  4. Keep tracing data minimal; avoid duplication. Context items remain owned by the agent/LLM loop module (platform-server); tracing holds references only and does not request/read platform data.
  5. Operate ingestion and internal/query APIs on separate ports so they can be registered as separate services in the mesh.

Current references in platform-server

  • Tracing CRUD + serialization: packages/platform-server/src/events/run-events.service.ts
  • In-memory bus + Socket.IO gateway: packages/platform-server/src/events/events-bus.service.ts, packages/platform-server/src/gateway/graph.socket.gateway.ts
  • REST endpoints consumed by UI: packages/platform-server/src/agents/threads.controller.ts and packages/platform-server/src/agents/contextItems.controller.ts
  • DB schema for tracing-related tables: packages/platform-server/prisma/schema.prisma

Target architecture

  • New package: packages/tracing-server (NestJS + Prisma + Postgres) using TRACING_DATABASE_URL.
  • Two servers on separate ports (same image/codebase):
    • OTLP Ingestion Server on port 4318 (default OTLP/HTTP) — handles /v1/traces (JSON + protobuf)
    • Internal/Query API Server on port 4100 — handles domain endpoints and callbacks
  • platform-server acts strictly as a client/bridge for tracing data:
    • Writes and reads tracing data via tracing-server’s internal/query API
    • Continues to emit Socket.IO to the UI; payload shapes remain unchanged
    • Does not enrich tracing payloads with platform-owned data

OTLP/HTTP ingestion (OpenTelemetry-compatible)

  • Endpoint: POST /v1/traces on port 4318
    • Content types: application/json (OTLP JSON), application/x-protobuf (OTLP protobuf)
    • No authentication for now
  • Upsert key: (trace_id, span_id) with merge rules:
    • start time: first-write-wins
    • end time: set when provided (running → completed)
    • attributes: upsert/overwrite by key (no deletions)
    • status: last-write-wins
    • events: append-only with dedupe by (timeUnixNano, name, attributesHash)
    • links: append-only with dedupe
    • resource attrs + scope: upsert by key; last-write-wins
    • track lastSeenAt/lastUpdatedAt
  • Custom views via $schema attribute (agyn.io/llm_call, agyn.io/tools/shell, etc.) and routing attrs:
    • required: run_id, thread_id
    • optional: node_id, context_id, stream_url

Tracing-server internal domain API (for platform-server)

  • Served on port 4100; no authentication for now
  • Endpoints (responses match current platform-server serializers):
    • POST /internal/runs/:runId/events/llm/start → { eventId, timelineEvent }
    • PATCH /internal/events/:eventId/llm/complete → { timelineEvent }
    • POST /internal/runs/:runId/events/tools/start → { eventId, timelineEvent }
    • PATCH /internal/events/:eventId/tools/complete → { timelineEvent }
    • POST /internal/events/:eventId/tool-output/chunk → ToolOutputChunkPayload
    • PUT /internal/events/:eventId/tool-output/terminal → ToolOutputTerminalPayload
    • GET /internal/runs/:runId/summary → RunTimelineSummary
    • GET /internal/runs/:runId/events → RunTimelineEventsResult (supports types, statuses, cursorTs, cursorId, limit, order)
    • GET /internal/runs/:runId/events/:eventId/output → ToolOutputSnapshot (supports sinceSeq, limit, order)
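
One plausible reading of the cursorTs/cursorId/limit/order parameters is keyset pagination over (timestamp, id). The sketch below works on an in-memory array to show the ordering and cursor logic only; the `TimelineRow` and `Page` shapes and the `pageEvents` name are assumptions, and the actual RunTimelineEventsResult shape is defined by the existing platform-server serializers.

```typescript
// Hypothetical row and page shapes, for illustration only.
interface TimelineRow {
  id: string;
  ts: number; // event timestamp in milliseconds
}

interface PageQuery {
  cursorTs?: number;
  cursorId?: string;
  limit: number;
  order: "asc" | "desc";
}

interface Page<T> {
  items: T[];
  nextCursor: { cursorTs: number; cursorId: string } | null;
}

function pageEvents<T extends TimelineRow>(rows: T[], q: PageQuery): Page<T> {
  const dir = q.order === "asc" ? 1 : -1;
  // Order by timestamp, then by id as a tie-breaker, so the cursor is unambiguous.
  const cmp = (a: TimelineRow, b: TimelineRow): number => {
    const byTs = a.ts === b.ts ? 0 : a.ts < b.ts ? -1 : 1;
    const byId = a.id === b.id ? 0 : a.id < b.id ? -1 : 1;
    return (byTs !== 0 ? byTs : byId) * dir;
  };

  const sorted = [...rows].sort(cmp);
  const cursor =
    q.cursorTs !== undefined && q.cursorId !== undefined
      ? { ts: q.cursorTs, id: q.cursorId }
      : null;

  // Keep only rows strictly after the cursor in the requested order.
  const after: T[] = [];
  for (const row of sorted) {
    if (cursor === null || cmp(row, cursor) > 0) after.push(row);
  }

  const items = after.slice(0, q.limit);
  const last = items[items.length - 1];
  const hasMore = after.length > q.limit;
  return {
    items,
    nextCursor: hasMore && last ? { cursorTs: last.ts, cursorId: last.id } : null,
  };
}
```

In a database-backed version the same comparison would become a WHERE clause on (ts, id) plus an ORDER BY, which keeps page boundaries stable while running events are still being appended.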

Realtime behavior (Socket.IO only; no REST polling)

  • Socket.IO remains the only realtime mechanism and is owned by platform-server
  • To avoid REST polling and any cross-service data lookups, tracing-server will invoke internal callback endpoints on platform-server whenever a run event is appended/updated or tool output occurs. Platform-server will forward these payloads over Socket.IO without modification
  • Callback endpoints (platform-server, no auth for now):
    • POST /internal/tracing/run-events with { runId, threadId, mutation: 'append'|'update', event: RunTimelineEvent }
    • POST /internal/tracing/tool-output/chunk with ToolOutputChunkPayload
    • POST /internal/tracing/tool-output/terminal with ToolOutputTerminalPayload
  • Later, we will extract the realtime gateway out of platform-server; for now, this is the only cross-service path
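
A framework-free sketch of the run-events callback handler follows. It validates the envelope and forwards the body untouched. The `Emitter` interface stands in for the Socket.IO gateway, and the room name `run:<runId>` and event name `run_event` are placeholders; the real names are whatever GraphSocketGateway already uses.

```typescript
// Envelope posted by tracing-server to POST /internal/tracing/run-events.
interface RunEventCallback {
  runId: string;
  threadId: string;
  mutation: "append" | "update";
  event: unknown; // RunTimelineEvent; treated as opaque by the bridge
}

// Minimal stand-in for the Socket.IO gateway (assumption).
interface Emitter {
  emit(room: string, eventName: string, payload: unknown): void;
}

function isRunEventCallback(body: unknown): body is RunEventCallback {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return (
    typeof b.runId === "string" &&
    typeof b.threadId === "string" &&
    (b.mutation === "append" || b.mutation === "update") &&
    "event" in b
  );
}

// Forwards the callback payload as received: no enrichment, no lookups.
function handleRunEventCallback(body: unknown, emitter: Emitter): { ok: boolean } {
  if (!isRunEventCallback(body)) return { ok: false };
  // Placeholder room and event names; keep the existing ones in practice.
  emitter.emit(`run:${body.runId}`, "run_event", body);
  return { ok: true };
}
```

The tool-output chunk and terminal callbacks would follow the same pattern with their own payload guards.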

Tracing-server data model (Prisma) — minimal projection, no full DB clone

  • Canonical OTLP spans table with routing fields (runId, threadId, nodeId, schema) denormalized for query
  • Minimal projection tables scoped to tracing needs (no foreign keys to platform-server):
    • run_events (type/status/timestamps/nodeId/sourceKind/sourceSpanId/metadata/error fields/idempotencyKey)
    • llm_calls (provider/model/temperature/topP/stopReason/responseText/rawResponse/usage/toolCalls, references eventId)
    • tool_executions (toolName/toolCallId/input/output/execStatus/errorMessage/raw, references eventId)
    • tool_output_chunks and tool_output_terminals (to support shell streaming snapshots)
    • event_attachments (optional; only if needed for payload richness)
    • injections and summarizations (only if they originate from tracing)
  • Context items: tracing-server stores only references (ids, isNew, order). No content or role duplication. UI will retrieve context item details from platform-server via existing endpoints

Platform-server integration

  • Add an HTTP client adapter replacing direct Prisma writes/reads at RunEventsService call sites, preserving method signatures
  • After write operations, emit Socket.IO using returned payloads from tracing-server (no mutation)
  • Read endpoints (threads.controller.ts) proxy to tracing-server internal API and return payloads as-is (no enrichment). UI will fetch any additional data (e.g., context items) from platform-server’s dedicated endpoints
  • Implement the internal callback endpoints to receive tracing-server notifications and emit Socket.IO without polling
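
As a rough sketch of the read side of the HTTP adapter, the helper below builds the request URL for the run events endpoint and returns the response body as-is. The function names, the `FetchLike` type, and the comma-joined encoding of `types` and `statuses` are assumptions; the existing query format of threads.controller.ts should take precedence.

```typescript
interface RunEventsQuery {
  types?: string[];
  statuses?: string[];
  cursorTs?: string;
  cursorId?: string;
  limit?: number;
  order?: "asc" | "desc";
}

// Builds the tracing-server URL for GET /internal/runs/:runId/events.
function buildRunEventsUrl(baseUrl: string, runId: string, q: RunEventsQuery = {}): string {
  const url = new URL(`/internal/runs/${encodeURIComponent(runId)}/events`, baseUrl);
  // List parameters are comma-joined here (assumption).
  if (q.types && q.types.length > 0) url.searchParams.set("types", q.types.join(","));
  if (q.statuses && q.statuses.length > 0) url.searchParams.set("statuses", q.statuses.join(","));
  if (q.cursorTs !== undefined) url.searchParams.set("cursorTs", q.cursorTs);
  if (q.cursorId !== undefined) url.searchParams.set("cursorId", q.cursorId);
  if (q.limit !== undefined) url.searchParams.set("limit", String(q.limit));
  if (q.order !== undefined) url.searchParams.set("order", q.order);
  return url.toString();
}

// Narrow fetch-like type so the adapter can be exercised without a network.
type FetchLike = (
  url: string,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Proxies the read and returns the payload without enrichment.
async function listRunEvents(
  fetchImpl: FetchLike,
  baseUrl: string,
  runId: string,
  q: RunEventsQuery = {},
): Promise<unknown> {
  const res = await fetchImpl(buildRunEventsUrl(baseUrl, runId, q));
  if (!res.ok) throw new Error(`tracing-server responded with ${res.status}`);
  return res.json();
}
```

Here `baseUrl` would come from TRACING_API_BASE_URL. Injecting `fetchImpl` keeps RunEventsService call sites testable without a running tracing-server.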

Backward-compatibility / UX guarantees

  • platform-ui REST responses unchanged in shape and endpoint surface (RunTimelineSummary, RunTimelineEventsResult, ToolOutputSnapshot)
  • Socket.IO event names and payload shapes unchanged and validated by existing schemas in GraphSocketGateway
  • Running spans supported via missing endTimeUnixNano (status=running, endedAt=null)
  • $schema + routing attrs required for projection into the run timeline; spans without them are stored only in the spans table
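
The projection rule and the running-span mapping can be illustrated as follows. This is a sketch under assumptions: the `SpanLike` and `ProjectedRunEvent` shapes and the `projectSpan` name are hypothetical, and the attribute keys `$schema`, `run_id`, `thread_id`, and `node_id` are taken from the routing attributes listed in this spec.

```typescript
type Attrs = Record<string, unknown>;

interface SpanLike {
  spanId: string;
  startTimeUnixNano: string;
  endTimeUnixNano?: string; // missing while the span is running
  attributes: Attrs;
}

// Hypothetical projection target; the real run_events columns may differ.
interface ProjectedRunEvent {
  sourceSpanId: string;
  schema: string;
  runId: string;
  threadId: string;
  nodeId: string | null;
  status: "running" | "completed";
  startedAt: string;
  endedAt: string | null;
}

function stringAttr(attrs: Attrs, key: string): string | null {
  const value = attrs[key];
  return typeof value === "string" && value.length > 0 ? value : null;
}

// Converts a nanosecond timestamp string to ISO-8601 at millisecond precision.
function nanosToIso(nanos: string): string {
  const millis = Number(nanos.slice(0, -6)); // drop the last 6 digits (ns -> ms)
  return new Date(millis).toISOString();
}

// Returns null when the span lacks $schema or required routing attributes;
// such spans stay in the spans table only.
function projectSpan(span: SpanLike): ProjectedRunEvent | null {
  const schema = stringAttr(span.attributes, "$schema");
  const runId = stringAttr(span.attributes, "run_id");
  const threadId = stringAttr(span.attributes, "thread_id");
  if (schema === null || runId === null || threadId === null) return null;

  const ended = span.endTimeUnixNano !== undefined;
  return {
    sourceSpanId: span.spanId,
    schema,
    runId,
    threadId,
    nodeId: stringAttr(span.attributes, "node_id"),
    status: ended ? "completed" : "running",
    startedAt: nanosToIso(span.startTimeUnixNano),
    endedAt: span.endTimeUnixNano !== undefined ? nanosToIso(span.endTimeUnixNano) : null,
  };
}
```

Mapping an error status to a failed event is left out here; it would depend on how the existing serializers represent failures.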

Configuration & deployment

  • Ports:
    • Tracing Ingest (OTLP HTTP): 4318
    • Tracing Internal/Query API: 4100
  • platform-server env: TRACING_API_BASE_URL (e.g., http://tracing-api:4100)
  • tracing-server env: TRACING_DATABASE_URL; no auth vars for now
  • docker-compose:
    • Add tracing-db (Postgres)
    • Add two services backed by the same image (or one container exposing two ports):
      • tracing-ingest on 4318
      • tracing-api on 4100
    • Wire platform-server to tracing-api
  • Health: /healthz, /readyz on the internal/query API
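
A small configuration loader for tracing-server might look like the sketch below. TRACING_DATABASE_URL and the default ports come from this spec; the port override variables `TRACING_INGEST_PORT` and `TRACING_API_PORT` are hypothetical and only illustrate how the two listeners could be configured independently.

```typescript
interface TracingServerConfig {
  databaseUrl: string;
  ingestPort: number; // OTLP/HTTP ingestion
  apiPort: number; // internal/query API
}

function parsePort(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid port: ${raw}`);
  }
  return port;
}

// Takes the environment as a plain record so it can be tested without process.env.
function loadTracingServerConfig(
  env: Record<string, string | undefined>,
): TracingServerConfig {
  const databaseUrl = env.TRACING_DATABASE_URL;
  if (!databaseUrl) throw new Error("TRACING_DATABASE_URL is required");
  return {
    databaseUrl,
    // Hypothetical override variables; the defaults are the ports named in this spec.
    ingestPort: parsePort(env.TRACING_INGEST_PORT, 4318),
    apiPort: parsePort(env.TRACING_API_PORT, 4100),
  };
}
```

Failing fast on a missing database URL keeps a misconfigured container from passing /readyz with no storage behind it.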

Tests

Tracing-server:

  • OTLP JSON & protobuf ingestion with upsert semantics; running → completed; attribute/event merging
  • $schema projection (LLM/tool) → correct RunTimelineEvent
  • Tool output chunk/terminal storage + snapshot logic
  • Emission of platform-server callbacks upon new/updated events and tool output

Platform-server integration:

  • Verify Socket.IO payloads emitted after write operations and after receiving tracing-server callbacks are identical to current shapes
  • Verify read proxies return payloads as-is

Acceptance criteria

  • packages/tracing-server implemented with minimal Prisma schema + migrations (no full DB clone)
  • platform-server writes/reads go through tracing-server; REST + sockets unchanged for the UI
  • OTLP ingest supports upsert and running spans
  • Internal callbacks implemented so no REST polling is needed
  • docker-compose integrates tracing-ingest (4318) + tracing-api (4100) + tracing DB
  • Tests cover ingestion, projection, callbacks, and platform-server bridging

Single-PR migration checklist

  1. Create packages/tracing-server (Nest + Prisma) with TRACING_DATABASE_URL
  2. Add migrations for spans + minimal projection tables
  3. Implement internal/query API on 4100 returning identical payloads
  4. Implement POST /v1/traces on 4318 (JSON + protobuf), upsert rules
  5. Implement projection pipeline for $schema spans → run_events and trigger platform-server callbacks
  6. Platform-server: add HTTP adapter, swap write/read paths to tracing-server; implement callback endpoints and emit Socket.IO based on responses/callbacks
  7. Compose + env docs
  8. Tests (tracing-server + platform-server integration)
