Skip to content

Metrics aggregation and alerting for ML experiments—multi-backend export (Prometheus, InfluxDB, Datadog, OpenTelemetry), advanced aggregations (percentiles, histograms, moving averages), threshold-based alerting with anomaly detection (z-score, IQR), and time-series storage. Research-grade observability for the NSAI ecosystem.

License

Notifications You must be signed in to change notification settings

North-Shore-AI/metrics_ex

Repository files navigation

MetricsEx

MetricsEx

CI Status Hex.pm Documentation Elixir License

Metrics aggregation with multiple exporters and intelligent alerting


Purpose

MetricsEx provides comprehensive metrics collection, aggregation, and querying for:

  • Experiment results (Crucible)
  • Model performance (CNS agents)
  • System health (Work jobs, services)
  • Training progress (Tinkex)

Features

  • Multiple metric types: counters, gauges, histograms
  • Fast in-memory storage: ETS-based with configurable retention
  • Flexible aggregations: mean, sum, count, min, max, percentiles
  • Time series support: Fixed interval buckets (minute, hour, day)
  • Telemetry integration: Auto-attach to telemetry events
  • Export formats: JSON API, Prometheus text format
  • Real-time streaming: Phoenix PubSub integration
  • Dashboard-ready: Pre-computed rollups for UI consumption

Installation

Add metrics_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:metrics_ex, "~> 0.2.0"}
  ]
end

Configuration

# config/config.exs
config :metrics_ex,
  retention_hours: 24,
  pubsub: MyApp.PubSub  # Optional: for real-time streaming

Quick Start

Recording Metrics

# Record experiment results
MetricsEx.record(:experiment_result, %{
  experiment_id: "exp_123",
  metric: :entailment_score,
  value: 0.75,
  tags: %{model: "llama-3.1", dataset: "scifact", trace_id: "trace-123", work_id: "work-456"}
})

# Increment counters
MetricsEx.increment(:jobs_completed, tags: %{tenant: "cns"})
MetricsEx.increment(:requests_total, 5, tags: %{endpoint: "/api"})

# Record gauge values (point-in-time)
MetricsEx.gauge(:queue_depth, 42, tags: %{queue: "sno_validation"})

# Record histogram values (distributions)
MetricsEx.histogram(:response_time, 123.45, tags: %{endpoint: "/api"})

# Measure execution time
result = MetricsEx.measure(:database_query, fn ->
  # expensive operation
  MyApp.run_query()
end, tags: %{query: "SELECT"})

Querying Metrics

# Get raw metrics
MetricsEx.get_metrics(name: :jobs_completed, limit: 100)
# => %{metrics: [...], count: 100, timestamp: "2025-12-06T..."}

# Aggregate with grouping
MetricsEx.query(:experiment_result,
  group_by: [:model],
  aggregation: :mean,
  window: :last_24h
)
# => [
#   %{model: "llama-3.1", mean: 0.72},
#   %{model: "qwen", mean: 0.68}
# ]

# Time series data
MetricsEx.time_series(:jobs_completed,
  interval: :hour,
  aggregation: :count,
  window: :last_24h
)
# => [
#   %{timestamp: ~U[2025-12-06 00:00:00Z], count: 45},
#   %{timestamp: ~U[2025-12-06 01:00:00Z], count: 52},
#   ...
# ]

# Pre-computed rollups for dashboards
MetricsEx.rollup(:experiment_result,
  group_by: [:model, :dataset],
  aggregations: [:mean, :count, :p95],
  window: :last_24h
)
# => %{
#   "llama-3.1/scifact" => %{mean: 0.72, count: 150, p95: 0.89},
#   "qwen/fever" => %{mean: 0.68, count: 200, p95: 0.85}
# }

Telemetry Integration

# Attach to telemetry events
MetricsEx.attach_telemetry([
  {[:work, :job, :completed], :counter},
  {[:work, :job, :duration], :histogram},
  {[:crucible, :experiment, :completed], :histogram},
  {[:queue, :depth], :gauge}
])

# Now telemetry events are automatically recorded
:telemetry.execute(
  [:work, :job, :completed],
  %{count: 1},
  %{
    tenant: "cns",
    work_id: "work-123",
    trace_id: "trace-456",
    plan_id: "plan-789",
    step_id: "step-abc"
  }
)

Prometheus Export

# Export all metrics in Prometheus format
prometheus_text = MetricsEx.Storage.Prometheus.export()

# Export specific metric
prometheus_text = MetricsEx.Storage.Prometheus.export(:jobs_completed)

# Use in Phoenix controller
defmodule MyAppWeb.MetricsController do
  use MyAppWeb, :controller

  def prometheus(conn, _params) do
    metrics = MetricsEx.Storage.Prometheus.export()

    conn
    |> put_resp_content_type(MetricsEx.Storage.Prometheus.content_type())
    |> send_resp(200, metrics)
  end
end

Standard Dimensions

MetricsEx standardizes on these dimensions for cross-system lineage and RunIndex correlation:

  • work_id
  • trace_id
  • plan_id
  • step_id

When present in telemetry metadata or in MetricsEx.record/2 payloads, these values are promoted into metric tags so they can be filtered and grouped consistently.

Architecture

Supervision Tree

MetricsEx.Supervisor
├── Phoenix.PubSub (optional)
├── MetricsEx.Storage.ETS (storage backend)
└── MetricsEx.Recorder (recording coordinator)

Components

1. Metric Types (MetricsEx.Metric)

Defines three core metric types:

  • Counter: Monotonically increasing (e.g., request count, jobs completed)
  • Gauge: Point-in-time value (e.g., queue depth, memory usage)
  • Histogram: Distribution of values (e.g., response times, scores)

2. Storage Backend (MetricsEx.Storage.ETS)

  • Fast in-memory ETS storage
  • Configurable retention (default: 24 hours)
  • Automatic cleanup of old metrics
  • Concurrent reads/writes

3. Recorder (MetricsEx.Recorder)

  • GenServer for recording metrics
  • Real-time PubSub broadcasting
  • Type inference based on metric name/value

4. Aggregator (MetricsEx.Aggregator)

  • Flexible aggregation functions: mean, sum, count, min, max, percentiles
  • Group by tags
  • Time series generation
  • Pre-computed rollups

5. Telemetry Handler (MetricsEx.TelemetryHandler)

  • Auto-attach to telemetry events
  • Converts telemetry measurements to metrics
  • Extracts tags from metadata

6. API (MetricsEx.API)

  • JSON-compatible data structures
  • Dashboard-ready endpoints
  • Time window helpers

Aggregation Functions

Function Description Example
:count Number of metrics count: 150
:sum Total sum of values sum: 1234
:mean Average value mean: 0.72
:min Minimum value min: 0.45
:max Maximum value max: 0.95
:p50 50th percentile (median) p50: 0.70
:p95 95th percentile p95: 0.89
:p99 99th percentile p99: 0.93

Time Windows

Window Description
:last_hour Last 60 minutes
:last_24h Last 24 hours
:last_7d Last 7 days
:last_30d Last 30 days

Time Intervals

Interval Description
:minute 1-minute buckets
:hour 1-hour buckets
:day 1-day buckets

System Statistics

MetricsEx.get_stats()
# => %{
#   storage: %{
#     total_metrics: 12345,
#     metrics_stored: 12345,
#     metrics_pruned: 567,
#     retention_hours: 24,
#     memory_bytes: 1048576
#   },
#   recorder: %{
#     metrics_recorded: 12345
#   },
#   timestamp: "2025-12-06T12:00:00Z"
# }

Integration Examples

With Crucible Experiments

# Record experiment metrics
MetricsEx.record(:crucible_experiment, %{
  value: entailment_score,
  tags: %{
    experiment_id: experiment.id,
    model: "llama-3.1",
    dataset: "scifact",
    stage: "validation"
  }
})

# Query experiment results
MetricsEx.query(:crucible_experiment,
  group_by: [:model, :dataset],
  aggregation: :mean,
  window: :last_7d
)

With CNS Agents

# Record agent performance
MetricsEx.record(:cns_agent_metric, %{
  value: beta1_score,
  tags: %{
    agent: "antagonist",
    sno_id: sno.id,
    iteration: 3
  }
})

# Track agent iterations
MetricsEx.time_series(:cns_agent_metric,
  interval: :hour,
  aggregation: :mean,
  tags: %{agent: "synthesizer"},
  window: :last_24h
)

With Work Job System

# Auto-record via telemetry
:telemetry.execute(
  [:work, :job, :completed],
  %{duration: job_duration_ms},
  %{tenant: tenant, queue: queue_name}
)

# Monitor job throughput
MetricsEx.rollup(:work_job_completed,
  group_by: [:tenant, :queue],
  aggregations: [:count, :mean, :p95],
  window: :last_hour
)

Testing

# Run all tests
mix test

# Run with coverage
mix test --cover

# Run specific test file
mix test test/metrics_ex/aggregator_test.exs

Performance

  • Write throughput: 100K+ metrics/sec (async casts to GenServer)
  • Query latency: <10ms for typical aggregations (in-memory ETS)
  • Memory footprint: ~100 bytes per metric + overhead
  • Retention cleanup: Every 5 minutes (configurable)

Roadmap

  • Persistent storage backend (PostgreSQL, ClickHouse)
  • Advanced percentile algorithms (t-digest)
  • Alert rules and notifications
  • Metric cardinality limits
  • Query result caching
  • Grafana integration
  • OpenTelemetry compatibility

Contributing

This project is part of the North-Shore-AI monorepo. See the main repository for contribution guidelines.

License

MIT License - see LICENSE for details.

Related Projects

About

Metrics aggregation and alerting for ML experiments—multi-backend export (Prometheus, InfluxDB, Datadog, OpenTelemetry), advanced aggregations (percentiles, histograms, moving averages), threshold-based alerting with anomaly detection (z-score, IQR), and time-series storage. Research-grade observability for the NSAI ecosystem.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages