Grafana Agent is a telemetry collector for sending metrics, logs, and trace data to the opinionated Grafana observability stack. It works best with:
- Grafana Cloud
- Grafana Enterprise Stack
- OSS deployments of Grafana Loki, Prometheus, Cortex, and Grafana Tempo
The Agent supports collecting telemetry data by utilizing the same battle-tested code from the official platforms. It uses Prometheus for metrics collection, Grafana Loki for log collection, and OpenTelemetry Collector for trace collection.
Unlike Prometheus, the Grafana Agent targets only `remote_write`, so some
Prometheus features, such as querying, local storage, recording rules, and
alerts, aren't present. `remote_write`, service discovery, and relabeling rules
are included.
The Grafana Agent has a concept of an "instance", each of which acts as its own
mini Prometheus agent with its own `scrape_configs` section and `remote_write`
rules. More than one instance is useful when you want completely separate
configs that write to two different locations without needing to worry about
advanced metric relabeling rules. Multiple instances also come into play for the
Scraping Service Mode.
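As a rough illustration of the instance concept, the fragment below sketches two instances that scrape different targets and write to different endpoints. It assumes the `prometheus` top-level key and `configs` list used by Agent releases that ship the `loki` and `tempo` subsystems described in this document. The instance names, job names, targets, and URLs are placeholders, and key names may differ in your version, so check the Configuration Reference before relying on it.

```yaml
# Hypothetical sketch: two independent instances in one Agent process.
# All names, targets, and URLs below are placeholders.
prometheus:
  wal_directory: /tmp/agent/wal        # local buffer for samples awaiting remote_write
  global:
    scrape_interval: 15s
  configs:
    - name: team-a                     # first instance
      scrape_configs:
        - job_name: team-a-app
          static_configs:
            - targets: ['app-a.example.internal:8080']
      remote_write:
        - url: https://metrics-a.example.com/api/prom/push
    - name: team-b                     # second instance, with a fully separate config
      scrape_configs:
        - job_name: team-b-app
          static_configs:
            - targets: ['app-b.example.internal:8080']
      remote_write:
        - url: https://metrics-b.example.com/api/prom/push
```

Because each instance carries its own `remote_write` section, neither needs relabeling rules to keep its series out of the other's destination.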
The Grafana Agent can be deployed in three modes:
- Prometheus `remote_write` drop-in
- Host Filtering mode
- Scraping Service Mode
The default deployment mode of the Grafana Agent is the drop-in replacement for
Prometheus `remote_write`. The Agent will act similarly to a single-process
Prometheus, doing service discovery, scraping, and remote writing.
Host Filtering mode is achieved by setting a `host_filter` flag on a specific
instance inside the Agent's configuration file. When this flag is set, the
instance will only scrape metrics from targets that are running on the same
machine as the instance itself. This is extremely useful when migrating to
sharded Prometheus instances in a Kubernetes cluster, where the Agent can be
deployed as a DaemonSet to distribute memory requirements across multiple nodes.
Note that with Host Filtering mode and sharded instances, if an Agent's metrics are being sent to an alerting system, alerts for that Agent might never fire when the entire node has problems. This changes the semantics of failure detection: alerts have to be configured to catch agents that stop reporting in.
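A minimal sketch of enabling the flag on one instance might look like the following. It assumes the same `prometheus` configuration layout as the rest of the Agent config and uses Kubernetes pod discovery purely as an example. The instance name, job name, and URL are placeholders.

```yaml
# Hypothetical sketch: one instance with host filtering enabled.
# Names and the URL below are placeholders.
prometheus:
  wal_directory: /tmp/agent/wal
  configs:
    - name: node-local
      host_filter: true                # keep only targets on this Agent's machine
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod                # discovers every pod; the filter drops non-local ones
      remote_write:
        - url: https://metrics.example.com/api/prom/push
```

Deployed as a DaemonSet with this config, each Agent discovers the same set of pods but scrapes only those scheduled on its own node.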
The final mode, Scraping Service Mode, clusters a subset of agents. It acts as
the in-between of the drop-in mode (which does no automatic sharding) and
`host_filter` mode (which forces sharding by node). The Scraping Service Mode
clusters a set of agents with a set of shared configs and distributes the scrape
load automatically between them. For more information, please read the dedicated
Scraping Service Mode documentation.
Host Filtering configures Agents to scrape targets that are running on the same machine as the Grafana Agent process. It does the following:
- Gets the hostname of the agent from the `HOSTNAME` environment variable or, if that is not set, through the system default.
- Checks if the hostname of the agent matches the label value for `__address__` or service-discovery-specific node labels on the discovered target.
If the filter passes, the target is allowed to be scraped. Otherwise, the target will be silently ignored and not scraped.
For detailed information on the host filtering mode, refer to the operation guide.
Grafana Agent supports collecting logs and sending them to Loki using its `loki`
subsystem. This is done by utilizing the upstream Promtail client, which is the
official first-party log collection client created by the Loki developer team.
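Since the subsystem wraps Promtail, its configuration follows Promtail's shape. The fragment below is a sketch only: it assumes a release in which the `loki` block holds a `configs` list of named entries, and the entry name, file paths, job name, and Loki URL are all placeholders. Older or newer releases may lay this block out differently.

```yaml
# Hypothetical sketch: tail local log files and push them to Loki.
# Paths, names, and the URL below are placeholders.
loki:
  configs:
    - name: default
      positions:
        filename: /tmp/positions.yaml  # records how far each file has been read
      clients:
        - url: http://loki.example.internal:3100/loki/api/v1/push
      scrape_configs:
        - job_name: varlogs
          static_configs:
            - targets: [localhost]
              labels:
                job: varlogs
                __path__: /var/log/*.log   # glob of files to tail
```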
Grafana Agent supports collecting traces and sending them to Tempo using its
`tempo` subsystem. This is done by utilizing the upstream OpenTelemetry Collector.
The agent is capable of ingesting OpenTelemetry, OpenCensus, Jaeger, Zipkin or Kafka spans.
See documentation on how to configure receivers.
The agent is capable of exporting to any OpenTelemetry gRPC-compatible system.
Grafana Agent is optimized for Grafana Cloud, but can also be used with an
on-prem `remote_write`-compatible Prometheus API and an on-prem Loki. Unlike
alternatives, Grafana Agent extends the official code with extra functionality.
This allows the Agent to give an experience closest to its official counterparts,
whereas alternatives may try to reimplement everything from scratch.
Telegraf is a fantastic project and was actually considered as an alternative to building our own agent. It could work, but ultimately it was not chosen due to lacking service discovery and metadata label propagation. While these features could theoretically be added to Telegraf as OSS contributions, there would be a lot of forced hacks involved due to its current design.
Additionally, Telegraf is a much larger project with its own goals for its community, so any changes need to fit the general use cases it was designed for.
With the Grafana Agent as its own project, we can deliver a more curated agent
specifically designed to work seamlessly with Grafana Cloud and other
`remote_write`-compatible Prometheus endpoints, as well as Loki for logs and
Tempo for traces, all-in-one.
For more information on installing and running the agent, see Getting started, or see the Configuration Reference for details on the configuration file.