Open
Description
Extract tracing into separate tracing-server with OTLP upsert + platform-server integration (one-shot)
User request
We have our own custom tracing in the app (LLM calls, tool calls, etc.). It currently lives in the monolith and shares the DB with the main app. We want to:
- Extract tracing into a separate service
- Use an isolated database
- Make the ingestion API backward compatible with OpenTelemetry, with upsert semantics and support for running spans (no end_time)
Constraints:
- Keep existing UX: real-time visibility of running events and rich views (LLM details, shell streaming output). OpenTelemetry doesn’t allow updating submitted events, so our API must allow upserts and missing end times.
- Support custom views via `$schema` attributes (e.g., `agyn.io/tools/shell`, `agyn.io/llm_call`) and include parameters to fetch additional UI data (e.g., `context_id`, `stream_url`).
- Do not start implementation without approval. Implement in one PR with full tests for tracing-server and platform-server integration.
Technical specification
Goals / constraints
- Extract tracing into a dedicated `tracing-server` with its own Postgres DB.
- Provide an OTLP/HTTP ingestion API with upsert semantics and support for running spans (missing end_time).
- Preserve current UI and realtime behavior without changing platform-ui. Socket.IO remains the realtime mechanism, owned by platform-server.
- Keep tracing data minimal; avoid duplication. Context items remain owned by the agent/LLM loop module (platform-server); tracing holds references only and does not request/read platform data.
- Operate ingestion and internal/query APIs on separate ports so they can be registered as separate services in the mesh.
Current references in platform-server
- Tracing CRUD + serialization: `packages/platform-server/src/events/run-events.service.ts`
- In-memory bus + Socket.IO gateway: `packages/platform-server/src/events/events-bus.service.ts`, `packages/platform-server/src/gateway/graph.socket.gateway.ts`
- REST endpoints consumed by UI: `packages/platform-server/src/agents/threads.controller.ts` and `packages/platform-server/src/agents/contextItems.controller.ts`
- DB schema for tracing-related tables: `packages/platform-server/prisma/schema.prisma`
Target architecture
- New package: `packages/tracing-server` (NestJS + Prisma + Postgres) using `TRACING_DATABASE_URL`.
- Two ports/process endpoints (same image/codebase):
  - OTLP Ingestion Server on port 4318 (default OTLP/HTTP): handles `/v1/traces` (JSON + protobuf)
  - Internal/Query API Server on port 4100: handles domain endpoints and callbacks
- platform-server acts strictly as a client/bridge for tracing data:
  - Writes and reads tracing data via tracing-server’s internal/query API
  - Continues to emit Socket.IO to the UI; payload shapes remain unchanged
  - Does not enrich tracing payloads with platform-owned data
OTLP/HTTP ingestion (OpenTelemetry-compatible)
- Endpoint: `POST /v1/traces` on port 4318
  - Content types: `application/json` (OTLP JSON), `application/x-protobuf` (OTLP protobuf)
  - No authentication for now
- Upsert key: `(trace_id, span_id)` with merge rules:
  - start time: first-write-wins
  - end time: set when provided (running → completed)
  - attributes: upsert/overwrite by key (no deletions)
  - status: last-write-wins
  - events: append-only with dedupe by `(timeUnixNano, name, attributesHash)`
  - links: append-only with dedupe
  - resource attrs + scope: upsert by key; last-write-wins
  - track `lastSeenAt`/`lastUpdatedAt`
- Custom views via `$schema` attribute (`agyn.io/llm_call`, `agyn.io/tools/shell`, etc.) and routing attrs:
  - required: `run_id`, `thread_id`
  - optional: `node_id`, `context_id`, `stream_url`
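To make the merge rules above concrete, here is a minimal in-memory sketch. It is not the proposed implementation: `StoredSpan`, `SpanEvent`, `eventKey`, and `mergeSpan` are hypothetical names, links and resource/scope attributes are omitted for brevity, and the dedupe key is a plain JSON string where a real implementation would probably store a hash. It also assumes that an update carrying no status leaves the stored status untouched.

```typescript
// Hypothetical in-memory shapes; field names follow OTLP JSON where possible.
type Attrs = Record<string, string | number | boolean>;

interface SpanEvent {
  timeUnixNano: string;
  name: string;
  attributes: Attrs;
}

interface StoredSpan {
  traceId: string;
  spanId: string;
  startTimeUnixNano: string;
  endTimeUnixNano?: string; // absent while the span is still running
  attributes: Attrs;
  statusCode?: number;
  events: SpanEvent[];
}

// Dedupe key for an event: (timeUnixNano, name, attributes).
// Sorting the attribute keys makes the key independent of attribute order.
function eventKey(e: SpanEvent): string {
  const sorted = Object.keys(e.attributes)
    .sort()
    .map((k) => [k, e.attributes[k]]);
  return JSON.stringify([e.timeUnixNano, e.name, sorted]);
}

// Merge an incoming write for the same (traceId, spanId) into the stored span.
function mergeSpan(existing: StoredSpan | undefined, incoming: StoredSpan): StoredSpan {
  if (!existing) {
    return { ...incoming };
  }
  const seen = new Set(existing.events.map(eventKey));
  const newEvents = incoming.events.filter((e) => {
    const key = eventKey(e);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  return {
    traceId: existing.traceId,
    spanId: existing.spanId,
    // start time: first-write-wins
    startTimeUnixNano: existing.startTimeUnixNano,
    // end time: set when provided (running -> completed)
    endTimeUnixNano: incoming.endTimeUnixNano ?? existing.endTimeUnixNano,
    // attributes: overwrite by key, never delete
    attributes: { ...existing.attributes, ...incoming.attributes },
    // status: last-write-wins (assumption: a missing status keeps the old one)
    statusCode: incoming.statusCode ?? existing.statusCode,
    // events: append-only, duplicates dropped
    events: [...existing.events, ...newEvents],
  };
}
```

Under these assumptions, re-sending a running span with an `endTimeUnixNano` completes it without touching its start time, and replaying the same event twice leaves a single copy.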
Tracing-server internal domain API (for platform-server)
- Served on port 4100; no authentication for now
- Endpoints (responses match current platform-server serializers):
  - POST `/internal/runs/:runId/events/llm/start` → `{ eventId, timelineEvent }`
  - PATCH `/internal/events/:eventId/llm/complete` → `{ timelineEvent }`
  - POST `/internal/runs/:runId/events/tools/start` → `{ eventId, timelineEvent }`
  - PATCH `/internal/events/:eventId/tools/complete` → `{ timelineEvent }`
  - POST `/internal/events/:eventId/tool-output/chunk` → ToolOutputChunkPayload
  - PUT `/internal/events/:eventId/tool-output/terminal` → ToolOutputTerminalPayload
  - GET `/internal/runs/:runId/summary` → RunTimelineSummary
  - GET `/internal/runs/:runId/events` → RunTimelineEventsResult (supports `types`, `statuses`, `cursorTs`, `cursorId`, `limit`, `order`)
  - GET `/internal/runs/:runId/events/:eventId/output` → ToolOutputSnapshot (supports `sinceSeq`, `limit`, `order`)
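As an illustration of how the platform-server adapter might address the events endpoint listed above, the sketch below builds the request URL from the supported query parameters. The names `EventsQuery` and `buildEventsUrl` are hypothetical, and encoding `types` and `statuses` as comma-separated lists is an assumption; the spec does not say how list parameters are serialized.

```typescript
// Hypothetical query shape for GET /internal/runs/:runId/events.
interface EventsQuery {
  types?: string[];
  statuses?: string[];
  cursorTs?: string;
  cursorId?: string;
  limit?: number;
  order?: "asc" | "desc";
}

function buildEventsUrl(baseUrl: string, runId: string, q: EventsQuery = {}): string {
  const params: string[] = [];
  // Append a parameter only when a value is present.
  const add = (key: string, value: string | number | undefined): void => {
    if (value !== undefined) {
      params.push(`${key}=${encodeURIComponent(String(value))}`);
    }
  };
  // Assumption: list parameters are sent as comma-separated values.
  add("types", q.types && q.types.length > 0 ? q.types.join(",") : undefined);
  add("statuses", q.statuses && q.statuses.length > 0 ? q.statuses.join(",") : undefined);
  add("cursorTs", q.cursorTs);
  add("cursorId", q.cursorId);
  add("limit", q.limit);
  add("order", q.order);

  // Strip trailing slashes so the base URL joins cleanly with the path.
  const path = `${baseUrl.replace(/\/+$/, "")}/internal/runs/${encodeURIComponent(runId)}/events`;
  return params.length > 0 ? `${path}?${params.join("&")}` : path;
}
```

The base URL would come from `TRACING_API_BASE_URL`; everything else here is illustrative.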
Realtime behavior (Socket.IO only; no REST polling)
- Socket.IO remains the only realtime mechanism and is owned by platform-server
- To avoid REST polling and any cross-service data lookups, tracing-server will invoke internal callback endpoints on platform-server whenever a run event is appended/updated or tool output occurs. Platform-server will forward these payloads over Socket.IO without modification
- Callback endpoints (platform-server, no auth for now):
  - POST `/internal/tracing/run-events` with `{ runId, threadId, mutation: 'append'|'update', event: RunTimelineEvent }`
  - POST `/internal/tracing/tool-output/chunk` with ToolOutputChunkPayload
  - POST `/internal/tracing/tool-output/terminal` with ToolOutputTerminalPayload
- Later, we will extract the realtime gateway out of platform-server; for now, this is the only cross-service path
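The forward-without-modification rule for the run-events callback could look roughly like the sketch below. The emitter is injected so the handler stays framework-agnostic; the room name `run:<runId>` and the Socket.IO event name `run_event` are placeholders, not the names `GraphSocketGateway` actually uses, and `RunEventCallback` and `handleRunEventCallback` are hypothetical.

```typescript
// Body of POST /internal/tracing/run-events as described in this spec.
interface RunEventCallback {
  runId: string;
  threadId: string;
  mutation: "append" | "update";
  event: unknown; // RunTimelineEvent, treated as opaque by the bridge
}

// Injected emitter; in platform-server this would wrap the Socket.IO gateway.
type Emit = (room: string, eventName: string, payload: unknown) => void;

function handleRunEventCallback(body: RunEventCallback, emit: Emit): void {
  // The body arrives over HTTP, so check the discriminator at runtime too.
  if (body.mutation !== "append" && body.mutation !== "update") {
    throw new Error(`unsupported mutation: ${String(body.mutation)}`);
  }
  // Forward the payload untouched: no enrichment, no reshaping.
  // Room and event names below are placeholders (assumptions).
  emit(`run:${body.runId}`, "run_event", body);
}
```

Keeping the event opaque (`unknown`) in the bridge is one way to make sure platform-server cannot accidentally enrich or reshape what tracing-server sent.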
Tracing-server data model (Prisma) — minimal projection, no full DB clone
- Canonical OTLP `spans` table with routing fields (runId, threadId, nodeId, schema) denormalized for query
- Minimal projection tables scoped to tracing needs (no foreign keys to platform-server):
  - `run_events` (type/status/timestamps/nodeId/sourceKind/sourceSpanId/metadata/error fields/idempotencyKey)
  - `llm_calls` (provider/model/temperature/topP/stopReason/responseText/rawResponse/usage/toolCalls, references eventId)
  - `tool_executions` (toolName/toolCallId/input/output/execStatus/errorMessage/raw, references eventId)
  - `tool_output_chunks` and `tool_output_terminals` (to support shell streaming snapshots)
  - `event_attachments` (optional; only if needed for payload richness)
  - `injections` and `summarizations` (only if they originate from tracing)
- Context items: tracing-server stores only references (ids, isNew, order). No content or role duplication. UI will retrieve context item details from platform-server via existing endpoints
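To show how `tool_output_chunks` could back the shell streaming snapshot, here is a small sketch of the chunk selection step. The row shape `OutputChunk`, the function `snapshotChunks`, the default limit of 100, and the choice to treat `sinceSeq` as exclusive are all assumptions for illustration; the existing serializer in platform-server defines the real behavior.

```typescript
// Hypothetical row shape for tool_output_chunks (only the fields used here).
interface OutputChunk {
  seq: number; // monotonically increasing per event
  data: string;
}

interface SnapshotQuery {
  sinceSeq?: number;
  limit?: number;
  order?: "asc" | "desc";
}

// Select the chunks a snapshot request should return.
function snapshotChunks(chunks: OutputChunk[], q: SnapshotQuery = {}): OutputChunk[] {
  const sinceSeq = q.sinceSeq ?? 0; // assumption: exclusive lower bound
  const limit = q.limit ?? 100; // assumption: default page size
  const direction = q.order === "desc" ? -1 : 1;
  return chunks
    .filter((c) => c.seq > sinceSeq) // filter() copies, so the input is not mutated
    .sort((a, b) => direction * (a.seq - b.seq))
    .slice(0, limit);
}
```

A client that has already rendered chunks up to `seq` 2 would pass `sinceSeq: 2` and receive only the later chunks.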
Platform-server integration
- Add an HTTP client adapter replacing direct Prisma writes/reads at `RunEventsService` call sites, preserving method signatures
- After write operations, emit Socket.IO using returned payloads from tracing-server (no mutation)
- Read endpoints (`threads.controller.ts`) proxy to tracing-server internal API and return payloads as-is (no enrichment). UI will fetch any additional data (e.g., context items) from platform-server’s dedicated endpoints
- Implement the internal callback endpoints to receive tracing-server notifications and emit Socket.IO without polling
Backward-compatibility / UX guarantees
- platform-ui REST responses unchanged in shape and endpoint surface (`RunTimelineSummary`, `RunTimelineEventsResult`, `ToolOutputSnapshot`)
- Socket.IO event names and payload shapes unchanged and validated by existing schemas in `GraphSocketGateway`
- Running spans supported via missing `endTimeUnixNano` (status=running, endedAt=null)
- `$schema` + routing attrs required for projection into the run timeline; otherwise spans remain only in `spans`
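The last two guarantees can be sketched as a projection function: a span is projected only when it carries `$schema`, `run_id`, and `thread_id`, and a missing `endTimeUnixNano` yields a running event with `endedAt` set to null. `IngestedSpan`, `ProjectedEvent`, `nanosToIso`, and `projectSpan` are hypothetical names, the `completed` status value is a placeholder, and the real `RunTimelineEvent` carries more fields than shown.

```typescript
// Hypothetical, trimmed-down span shape (attributes flattened to strings).
interface IngestedSpan {
  spanId: string;
  startTimeUnixNano: string;
  endTimeUnixNano?: string; // missing means the span is still running
  attributes: Record<string, string>;
}

// Hypothetical subset of a timeline event.
interface ProjectedEvent {
  sourceSpanId: string;
  schema: string;
  runId: string;
  threadId: string;
  nodeId: string | null;
  status: "running" | "completed";
  startedAt: string;
  endedAt: string | null;
}

// Convert an OTLP nanosecond timestamp string to ISO 8601 (millisecond precision).
function nanosToIso(nanos: string): string {
  const millis = nanos.length > 6 ? Number(nanos.slice(0, -6)) : 0;
  return new Date(millis).toISOString();
}

// Returns null when the span lacks $schema or the required routing attributes;
// such spans stay in the spans table only.
function projectSpan(span: IngestedSpan): ProjectedEvent | null {
  const schema = span.attributes["$schema"];
  const runId = span.attributes["run_id"];
  const threadId = span.attributes["thread_id"];
  if (!schema || !runId || !threadId) {
    return null;
  }
  const end = span.endTimeUnixNano;
  return {
    sourceSpanId: span.spanId,
    schema,
    runId,
    threadId,
    nodeId: span.attributes["node_id"] ?? null,
    status: end === undefined ? "running" : "completed",
    startedAt: nanosToIso(span.startTimeUnixNano),
    endedAt: end === undefined ? null : nanosToIso(end),
  };
}
```

A span tagged `agyn.io/tools/shell` with no end time would therefore appear in the timeline as running, while a span without routing attributes would be stored but not projected.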
Configuration & deployment
- Ports:
  - Tracing Ingest (OTLP HTTP): 4318
  - Tracing Internal/Query API: 4100
- platform-server env: `TRACING_API_BASE_URL` (e.g., http://tracing-api:4100)
- tracing-server env: `TRACING_DATABASE_URL`; no auth vars for now
- docker-compose:
  - Add `tracing-db` (Postgres)
  - Add two services backed by the same image (or one container exposing two ports):
    - `tracing-ingest` on 4318
    - `tracing-api` on 4100
  - Wire platform-server to `tracing-api`
- Health: `/healthz`, `/readyz` on the internal/query API
Tests
Tracing-server:
- OTLP JSON & protobuf ingestion with upsert semantics; running → completed; attribute/event merging
- `$schema` projection (LLM/tool) → correct `RunTimelineEvent`
- Tool output chunk/terminal storage + snapshot logic
- Emission of platform-server callbacks upon new/updated events and tool output
Platform-server integration:
- Verify Socket.IO payloads emitted after write operations and after receiving tracing-server callbacks are identical to current shapes
- Verify read proxies return payloads as-is
Acceptance criteria
- `packages/tracing-server` implemented with minimal Prisma schema + migrations (no full DB clone)
- platform-server writes/reads go through tracing-server; REST + sockets unchanged for the UI
- OTLP ingest supports upsert and running spans
- Internal callbacks implemented so no REST polling is needed
- docker-compose integrates tracing-ingest (4318) + tracing-api (4100) + tracing DB
- Tests cover ingestion, projection, callbacks, and platform-server bridging
Single-PR migration checklist
- Create `packages/tracing-server` (Nest + Prisma) with `TRACING_DATABASE_URL`
- Add migrations for `spans` + minimal projection tables
- Implement internal/query API on 4100 returning identical payloads
- Implement `POST /v1/traces` on 4318 (JSON + protobuf), upsert rules
- Implement projection pipeline for `$schema` spans → run_events and trigger platform-server callbacks
- Platform-server: add HTTP adapter, swap write/read paths to tracing-server; implement callback endpoints and emit Socket.IO based on responses/callbacks
- Compose + env docs
- Tests (tracing-server + platform-server integration)