- Observability, OpenTelemetry and
clj-otel
- Supported telemetry backends
- Using OpenTelemetry
- Instrumenting libraries and applications
- OpenTelemetry data model
- Semantic conventions
- Context
- Instrumenting asynchronous Clojure code
- Trace sampling
- Exporters
- OpenTelemetry Protocol - OTLP
- Using the OpenTelemetry SDK
- OpenTelemetry distros
- OpenTelemetry Collector
- Alternative Clojure telemetry projects
Observability is the ability to measure a system’s current state based on the telemetry data it generates, such as traces, metrics, logs and events.
An observable system may be queried to describe its composition and behaviour. As systems are often highly distributed in nature, observability addresses the need to understand the behaviour with context across services, technologies and environments.
Monitoring is a complementary activity to observability, where typically a dashboard is configured to report predetermined metrics and alerts are set to trigger on specific conditions in the system. Monitoring is oriented to the detection of known types of issues, as the monitored metrics or conditions are determined in advance.
Observability enables detection and exploration of issues as they occur and assists with root cause analysis. By querying an observable system, unexpected types of issue can be detected, characterised and monitored as they arise, addressing the problem of "unknown unknowns". Questions on the actual behaviour of an observable system may be more readily answered:
-
"What caused this failure?"
-
"What caused this change in behaviour?"
-
"Why do a particular user’s requests fail?"
-
"Is my latest canary deployment stable?"
-
"Should I roll back the latest deployment?"
-
"Are my intended performance improvements in the latest deployment working as expected?"
-
"What uses my service, and what are the services it uses?"
-
"What mix of request types are issued to the system? Is that mix consistent across users?"
-
"Which areas should I work on to improve experience for most users?"
-
"How well is the system behaving for my premium users?"
-
"What user segment suffers high response times and why?"
-
"What is the biggest contributor to the system’s poor performance?"
-
"Why is the system’s error rate worse than last month? Are there new error cases?"
-
"What is the cause of my service level indicators (SLIs) not meeting the service level objectives (SLOs)?"
Through enhanced visibility of system behaviour, observability improves understanding of the impact of changes made by developers and operators.
OpenTelemetry is a CNCF incubating project formed after the merger of OpenTracing and OpenCensus.
OpenTelemetry’s Mission: to enable effective observability by making high-quality, portable telemetry ubiquitous[1].
OpenTelemetry is a set of APIs, SDKs, tooling, integrations and semantic conventions for creating and managing telemetry data. OpenTelemetry provides a vendor-agnostic implementation per language (Java, Go, JavaScript, C++ and others) that exports telemetry data to backends of your choice. The Java implementation supports automatic instrumentation of many libraries, frameworks and application servers and manual instrumentation of library or application source code. OpenTelemetry supports various backends, including OSS projects (Jaeger, Zipkin, Prometheus, Grafana Tempo, SigNoz) and commercial products (Honeycomb, ServiceNow, Aspecto, Dynatrace, New Relic, Datadog and more).
The OpenTelemetry project also provides the OpenTelemetry Collector, a vendor-agnostic deployable process which receives, processes and exports telemetry data.
Work on the OpenTelemetry specification and implementations is in progress by a large community of contributors. Support for traces is the most mature, followed by metrics and logs. See a summary of OpenTelemetry status.
There is a growing consensus in the commercial monitoring and APM tools market to adopt OpenTelemetry as a standard for telemetry in cloud-native software. Today many commercial products accept OpenTelemetry trace data. Product support for metrics and logs is growing. Several vendors have created distributions of the OpenTelemetry SDK and OpenTelemetry Collector, which feature customisations tailored for use with their products.
clj-otel
extends the reach of OpenTelemetry to Clojure libraries and applications by providing:
-
A small idiomatic Clojure API that wraps the OpenTelemetry implementation for Java. This API enables manual instrumentation of Clojure libraries and applications using pure Clojure.
-
Ring middleware and Pedestal interceptors for server span support.
-
Support for creating spans around asynchronous Clojure code.
-
A Clojure wrapper for programmatic configuration of the OpenTelemetry SDK.
clj-otel
is an umbrella project for several Clojure modules clj-otel-*
.
They depend on the OpenTelemetry implementation for Java opentelemetry-java
and the OpenTelemetry instrumentation agent provided by opentelemetry-java-instrumentation
.
OpenTelemetry exports telemetry data to a variety of telemetry backends. The choice of backend(s) is applied when configuring system components for deployment.
Telemetry backends have varied roles and capabilities. Telemetry visualization backends provide interactive tools for query and display of telemetry data. Other backends may provide telemetry data storage and indexing for later retrieval.
The following sections are incomplete selections of open-source software (OSS) and commercial telemetry visualization backends that accept telemetry data from OpenTelemetry. See here for an extended list including other types of telemetry backend.
ℹ️
|
Some commercial telemetry backends have a free version with a reduced capacity or feature set. |
The general workflow for using OpenTelemetry with your library or application is:
-
Add instrumentation to your library or application such that it exports telemetry data.
-
Configure system components to control how the telemetry data are processed and exported, either directly to telemetry backends or via OpenTelemetry Collector instance(s).
-
Use telemetry backend features to explore system behaviour described by the telemetry data.
Instrumenting a library or application involves adding behaviour such that it exports telemetry data as it runs.
Automatic instrumentation dynamically alters the library or application at runtime to export telemetry data. For the Java platform, automatic instrumentation is performed by the OpenTelemetry instrumentation agent, a Java agent that runs with the application. Many libraries, frameworks and application servers are supported by the agent out of the box. For example, the agent will create server spans for requests received by a Jetty server, and client spans for requests issued by an Apache HttpClient instance.
💡
|
If possible, use automatic instrumentation for your application, as this is a quick way to get high quality telemetry with almost no effort. |
Manual instrumentation is the process of adding program code to the library or application at design time to export telemetry data using the OpenTelemetry API.
The clj-otel-api
module in this project wraps the OpenTelemetry API for Java in an idiomatic Clojure facade.
❗
|
Manual instrumentation program code depends on the OpenTelemetry API, never the OpenTelemetry SDK. |
Any combination of automatic and manual instrumentation may be used:
-
Use solely automatic instrumentation to quickly add telemetry without changing any program code.
-
Use solely manual instrumentation if it is not possible to use the instrumentation agent, or the instrumented application does not use a library or framework supported by the agent.
-
Combine automatic and manual instrumentation for enriched telemetry. For example, to enrich a span produced by automatic instrumentation, attributes and events may be added using manual instrumentation.
In observability terms, telemetry data comes from four sources: traces, metrics, logs and events. In the OpenTelemetry data model, data sources are traces, metrics and logs. Events are treated as a specific type of log or captured as part of a trace.
A trace represents the flow of a single transaction throughout the system. A trace comprises a tree of spans, where a span represents a unit of work in a service. Parent-child relationships between spans describe dependencies between them. The root span of a trace typically describes the entire transaction. The other spans represent units of work performed as part of the transaction. Traces provide context for system activity performed in spans.
Span data may include a span kind, name, attributes, start/end timestamps, links to other spans, a list of events and a status.
-
The span kind indicates the relationship between the span and its parent and children in the trace. The span kind is one of:
-
CLIENT
: Covers the client side of issuing a synchronous request, where the client side waits until a response is received. -
SERVER
: Covers the server side of handling a synchronous request, where the remote client waits for a response. -
PRODUCER
: Covers initiation of an asynchronous request, where the corresponding consumer span may start after the producer span ends. -
CONSUMER
: Covers processing of an asynchronous producer request. -
INTERNAL
: An internal operation within the local application or service.
-
-
The span name should identify a class of spans and not include data.
-
The events in a span are timestamped records that may include attributes. Exceptions thrown in a span’s scope are captured as events.
-
The span status has a code
Ok
orError
, and in the case ofError
may also have a string description.
A metric is a numerical measurement over a period of time. Metrics may be used to indicate quantitative aspects of system health, such as resource (memory, disk, compute, network) usage, error rate, message queue length, and request response time. They may also be used to record statistics on application usage, performance and internal operation.
An instrument is used to record measurements. Instruments may support synchronous (invoke) or asynchronous (callback) recording of measurements, depending on the type of instrument. Measurements have a value of type long or double, along with context and attributes.
The available types of instrument are:
-
Counter (sync or async)
-
Up-down counter (sync or async)
-
Histogram (sync only)
-
Gauge (sync or async)
A service log is made of lines of text (possibly structured e.g. in JSON format) written when certain points in the service code are executed. Logs are well suited to ad-hoc debugging and capture of low-level details.
Events are captured as either a specific type of log or as a span event. Events are records that describe actions taken by the system or significant environmental changes, such as a service deployment or change in configuration.
Attributes may be attached to some telemetry data such as spans and resources.
Attributes are a map where each entry has a string key.
Each entry value is a boolean, long, double, string or an array of one of those types.
Entries with a nil
value are dropped.
OpenTelemetry recommends using namespaced attribute names to prevent clashes. See the specification for attributes and attribute naming.
A resource captures information about the entity for which telemetry data is recorded. For example, information on the host and JVM version may be part of a resource. Resources form part of the telemetry data.
The OpenTelemetry SDK contains resource implementations which capture host and process information.
Baggage is a mechanism for propagating telemetry metadata and is represented as a simple map.
It is a means to add contextual information at a point in a transaction, read by a downstream service later in the same transaction and then used as an element of telemetry data, e.g. an attribute.
For example, a user identifier is put in the baggage to indicate the principal of a request and subsequent spans in the trace include a principal
attribute.
OpenTelemetry defines a rich set of conventions for telemetry data. This semantic unification across vendors and technologies promotes analysis of telemetry data created in heterogeneous, polyglot systems. In particular, semantic attributes for spans and metrics are defined for base technologies like HTTP, database, RPC, messaging, FaaS (Function as a Service) and others. See OpenTelemetry semantic conventions documentation.
clj-otel
follows the semantic conventions for areas such as span exception events and manually created HTTP client and server spans.
A context is an immutable map that holds values transmitted across API boundaries and threads.
A context may contain a span, baggage and possibly other values.
A new context is created by adding a key-value association to an existing context.
A context is automatically created for each new span.
opentelemetry-java
implements a context as a io.opentelemetry.context.Context
object instance.
The current context is a context referenced by a specific thread local variable, defined by opentelemetry-java
.
It is the default context value for many functions in clj-otel
and opentelemetry-java
.
The current context should be used when manually instrumenting synchronous code.
🔥
|
Do not use the current context when manually instrumenting asynchronous code. See Instrumenting asynchronous Clojure code. |
The bound context is a context referenced by a binding of a specific Clojure dynamic var, declared by clj-otel
.
Its use is optional for libraries and applications using clj-otel
.
It is provided as a convenience when manually instrumenting asynchronous code.
If used, it overrides the default (current) context used for clj-otel
functions.
ℹ️
|
Binding conveyance is a Clojure feature that conveys dynamic var bindings between blocks of asynchronous code.
See also clojure.core functions binding , bound-fn and bound-fn* .
|
Context propagation is the mechanism used to transmit context values across API boundaries and threads. Context propagation enables traces to become distributed traces, joining clients to servers and producers to consumers. In practice HTTP request header values are injected and extracted using a text map propagator.
OpenTelemetry provides text map propagators for the following protocols:
The W3C Trace Context and W3C baggage header propagation protocols are the most commonly used.
When manually instrumenting asynchronous Clojure code with this library clj-otel
, it is not possible to use the current context.
This is because async Clojure function evaluations share threads, but each evaluation is associated with a distinct context.
There are two general approaches for correct handling of context in asynchronous code:
-
The async function maintains a reference to the associated context during evaluation.
-
The context is passed as an explicit
:context
or:parent
value toclj-otel
functions.
Sampling is the selection of some elements from a set and deriving observations on the complete set based on analysis of those selected elements. Sampling is needed when the volume of raw data is too high to analyse cost-effectively.
Trace sampling may occur at any number of points between the instrumented application and the telemetry backend. OpenTelemetry provides sampler implementations which may be applied in the application and/or the Collector. Some telemetry backends may also apply sampling to trace data they receive, either automatically or with some developer intervention.
Exporters emit telemetry data to consumers, such as the Collector and telemetry backends. Exporters can be push or pull based.
OpenTelemetry Protocol (OTLP) is the OpenTelemetry native protocol for encoding, transport and delivery of telemetry data. OTLP is currently implemented over gRPC and HTTP transports.
Almost all telemetry backends that integrate with OpenTelemetry accept telemetry data in OTLP format. An application or OpenTelemetry Collector exports data to these backends using an OTLP exporter.
The OpenTelemetry SDK implements the creation, sampling, batching and export of telemetry data. The SDK acts as an implementation of the OpenTelemetry API. For an application to export telemetry data, the SDK and its dependencies should be present and configured at runtime.
The SDK and its dependencies are added to an application in one of the following ways:
-
OpenTelemetry instrumentation agent
In this option the application is run with the OpenTelemetry instrumentation agent. The SDK and its dependencies are built into the agent. They are loaded at runtime but do not appear on the application classpath. Also, autoconfiguration is used for configuring the agent and SDK.
-
Autoconfigure SDK extension
In this option, an SDK extension is used alongside the SDK to enable autoconfiguration of the SDK.
clj-otel-sdk-extension-autoconfigure
is a Clojure wrapper for theopentelemetry-sdk-extension-autoconfigure
Java library. The relevant optional Java SDK libraries (exporters, extensions, etc.) also need to be added as runtime dependencies. -
Programmatic configuration
This option is for programmatic configuration of the SDK. This project
clj-otel
provides a moduleclj-otel-sdk
for configuring the SDK in Clojure, as well as other support modulesclj-otel-exporter-*
,clj-otel-extension-*
andclj-otel-sdk-extension-*
for programmatic access to various optional components. The relevant optional SDK libraries also need to be added as compile-time dependencies.
If the SDK is not present at application runtime, all OpenTelemetry API calls default to a no-op implementation where no telemetry data is created.
Autoconfiguration of the OpenTelemetry SDK refers to configuration using system properties or environment variables. Configuration of the OpenTelemetry instrumentation agent uses the same mechanism.
See documentation for SDK autoconfiguration and instrumentation agent configuration.
OpenTelemetry
is a Java interface that acts as an entrypoint to all telemetry functionality at runtime.
Most applications will configure and initialise a single OpenTelemetry
instance.
The instance should be initialised early in the application’s run lifetime, before any telemetry data is produced.
The global OpenTelemetry
is a singleton reference declared and used by Java OpenTelemetry.
The reference is nil
until set, and may be set once only.
The global OpenTelemetry
reference is always automatically set when running an application with the OpenTelemetry instrumentation agent.
|
It is discouraged to set the global OpenTelemetry reference when using the autoconfigure SDK extension or programmatically configuring the SDK.
|
The default OpenTelemetry
is a singleton reference declared and used by clj-otel
.
The reference is nil
until set.
Many clj-otel
functions that take an OpenTelemetry
instance as a parameter will use the default OpenTelemetry
by default.
If the default OpenTelemetry
has not been set, the global OpenTelemetry
is used as a fallback.
❗
|
It is recommended to always set the default OpenTelemetry reference when using the autoconfigure SDK extension or programmatically configuring the SDK.
|
An OpenTelemetry distro (or "distribution") supplied by a vendor is a repackaging of reference OpenTelemetry software, customised for ease of use with the vendor’s products. They are not forks in that they do not extend or change the OpenTelemetry API.
It is not a requirement to use a vendor’s distro since it should always be possible to use the reference OpenTelemetry software and configure it appropriately. The obvious advantage to using a distro is the ease of use. However, a disadvantage is that sometimes the distro version lags behind the reference OpenTelemetry version.
The OpenTelemetry Collector is a vendor-agnostic deployable process to manage telemetry data as it flows from instrumented applications to telemetry backends. The Collector can transform telemetry data by, for example, inserting or filtering attributes. It removes the need to run multiple vendor-specific agents and collectors when working with several telemetry data formats and telemetry backends.
It is not required to use the OpenTelemetry Collector, though it simplifies telemetry data management in large systems with many instrumented services. Some exporters provided by OpenTelemetry have default options set to target a Collector instance running on the same host.
The following are alternative telemetry projects in the Clojure ecosystem, concerned with telemetry data creation or processing.
-
μ/log : Micro-logging library that logs events and data, not words
-
Telemere : Structured telemetry library for Clojure/Script, with native OpenTelemetry support
-
ken : Observability library to instrument Clojure code
-
clojure.log4j2 : Sugar for using Log4j2 from clojure, including
MapMessage
support -
timbre-json-appender : Structured log appender for Timbre using jsonista
-
cartus : Structured logging abstraction with multiple backends
-
Cambium : Structured logging for Clojure
-
clj-journal : Structured logging to systemd journal using native systemd libraries and JNA (Java Native Access)
-
μ/trace : Micro distributed tracing library with the focus on tracking data with custom attributes; a subsystem of μ/log
-
ken tracing support
-
opencensus-clojure : Clojure wrapper for opencensus-java
-
opentracing-clj : OpenTracing API support for Clojure, built on top of opentracing-java
-
omni-trace : Clojure(Script) tracing for debugging
-
Riemann : Network event stream processing system, in Clojure
-
micrometer-clj : Clojure wrapper for Java Micrometer library
-
metrics-clojure : Clojure facade around Dropwizard Metrics library
-
TRACKit! : Clojure wrapper for Dropwizard Metrics library
-
aws-metrics-collector : Clojure AWS Cloudwatch metric collector