METRICS.md

This document describes the metrics and tags emitted by gostatsd, along with their types and interpretation. All internal metrics are snapshotted after a flush, then queued internally to be sent in the next flush. This means that internal metrics lag regular metrics by one flush interval. See below for notes on how channels are monitored.

Metric types:

| Type               | Description |
|--------------------|-------------|
| gauge (flush)      | A value sent as a gauge with the value reset / calculated / sampled every flush interval |
| gauge (time)       | A single duration measured in milliseconds and sent as a gauge |
| gauge (cumulative) | An internal counter sent as a gauge with the value never resetting |
| counter            | An internal counter, reset on flush |
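
Because a `gauge (cumulative)` value never resets on flush, a per-interval figure has to be derived by the consumer. The following is a minimal sketch of that calculation, not part of gostatsd itself. The `Delta` function and the sample values are hypothetical, and the handling of a decreasing value (treated as a process restart) is an assumption rather than documented behaviour.

```go
package main

import "fmt"

// Delta returns how much a cumulative gauge grew between two consecutive flushes.
// Assumption: a value lower than the previous one means the process restarted,
// so the current value is treated as the growth since that restart.
func Delta(prev, curr float64) float64 {
	if curr < prev {
		return curr
	}
	return curr - prev
}

func main() {
	// Hypothetical values of a cumulative gauge such as backend.dropped,
	// one reading per flush interval.
	readings := []float64{10, 10, 14, 3}
	for i := 1; i < len(readings); i++ {
		fmt.Printf("flush %d: +%v\n", i, Delta(readings[i-1], readings[i]))
	}
}
```

A `counter`, by contrast, is already reset on flush, so its value can be used directly as the per-interval count.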

Metrics:

| Name | Type | Tags | Description |
|------|------|------|-------------|
| aggregator.metrics_received | gauge (flush) | aggregator_id | The number of datapoints received during the flush interval |
| aggregator.metricmaps_received | gauge (flush) | aggregator_id | The number of datapoint batches received during the flush interval |
| aggregator.aggregation_time | gauge (time) | aggregator_id | The time taken (in ms) to aggregate all counter and timer datapoints in this flush interval |
| aggregator.process_time | gauge (time) | aggregator_id | The time taken to process all synchronous flush actions |
| aggregator.reset_time | gauge (time) | aggregator_id | The time taken to reset the aggregator after flush |
| parser.bad_lines_seen | gauge (cumulative) | | The number of unparseable lines |
| parser.events_received | gauge (cumulative) | | The number of events parsed |
| parser.metrics_received | gauge (cumulative) | | The number of metrics parsed |
| receiver.datagrams_received | gauge (cumulative) | | The number of datagrams received |
| receiver.avg_datagrams_in_batch | gauge (flush) | | The average number of datagrams per batch (up to `receive-batch-size`). This can be used to tweak `receive-batch-size` if necessary to reduce memory usage. |
| channel.avg | gauge (flush) | channel | The average of all samples in the flush interval |
| channel.min | gauge (flush) | channel | The minimum sample seen |
| channel.max | gauge (flush) | channel | The maximum sample seen |
| channel.last | gauge (flush) | channel | The last sample seen |
| channel.capacity | gauge (flush) | channel | The capacity of the channel |
| channel.samples | gauge (flush) | channel | The number of samples seen (guaranteed to be at least 1) |
| internal_dropped | gauge (cumulative) | | The number of internal metrics which have been dropped |
| heartbeat | gauge (flush) | version, commit | The value 1, tagged by the version (git tag) and short commit hash |
| flusher.total_time | gauge (time) | | Time taken to flush all metrics to all backends for the flush interval |
| backend.created | gauge (cumulative) | backend | Lifetime number of metric batches generated by the backend |
| backend.retried | gauge (cumulative) | backend | Lifetime number of metric batches retried by the backend |
| backend.dropped | gauge (cumulative) | backend | Lifetime number of metric batches dropped by the backend (DATALOSS!) |
| backend.sent | gauge (cumulative) | backend | Lifetime number of metric batches successfully transmitted |
| cloudprovider.aws.describeinstancecount | gauge (cumulative) | | The cumulative number of times DescribeInstancesPages has been called |
| cloudprovider.aws.describeinstanceinstances | gauge (cumulative) | | The cumulative number of instances which have been fed in to DescribeInstancesPages |
| cloudprovider.aws.describeinstancepages | gauge (cumulative) | | The cumulative number of pages from DescribeInstancesPages |
| cloudprovider.aws.describeinstanceerrors | gauge (cumulative) | | The cumulative number of errors seen from DescribeInstancesPages |
| cloudprovider.aws.describeinstancefound | gauge (cumulative) | | The cumulative number of instances successfully found via DescribeInstances |
| cloudprovider.cache_positive | gauge (flush) | | The absolute number of positive entries in the cache |
| cloudprovider.cache_negative | gauge (flush) | | The absolute number of negative entries in the cache |
| cloudprovider.cache_refresh_positive | gauge (cumulative) | | The cumulative number of positive refreshes |
| cloudprovider.cache_refresh_negative | gauge (cumulative) | | The cumulative number of refreshes which had an error refreshing and used old data |
| cloudprovider.cache_hit | gauge (cumulative) | | The cumulative number of cache hits (host was in the cache) |
| cloudprovider.cache_late_hit | gauge (cumulative) | | The cumulative number of late cache hits (host was not in the cache, but had a lookup in progress which completed) |
| cloudprovider.cache_miss | gauge (cumulative) | | The cumulative number of cache misses |
| cloudprovider.hosts_queued | gauge (flush) | type | The absolute number of hosts waiting to be looked up |
| cloudprovider.items_queued | gauge (flush) | type | The absolute number of metrics or events waiting for a host lookup to complete |
| http.forwarder.invalid | counter | | The number of failures to prepare a batch of metrics to forward |
| http.forwarder.created | counter | | The number of batches prepared for forwarding |
| http.forwarder.sent | counter | | The number of batches successfully forwarded |
| http.forwarder.retried | counter | | The number of retries sending a batch |
| http.forwarder.dropped | counter | | The number of batches dropped due to inability to forward upstream |
| http.incoming | counter | server-name, result, failure | The number of batches forwarded to the server, and the results of processing them |
| http.incoming.metrics | counter | server-name | The number of metrics received over http |

Tags:

| Tag | Description |
|-----|-------------|
| aggregator_id | The index of an aggregator; the number of aggregators corresponds to the `--max-workers` flag |
| channel | The name of an internal channel |
| version | The git tag of the build |
| commit | The short git commit of the build |
| backend | The backend sending a particular metric |
| type | Either metric or event |
| result | Success indicates a batch of metrics was successfully processed; failure indicates a batch of metrics was not processed (an additional failure tag gives the reason) |
| failure | The reason a batch of metrics was not processed |
| server-name | The name of an http-server as specified in the config file |

A number of channels are tracked internally and emit metrics under the channel.* namespace. They all have a channel tag, and may have the additional tags specified below. Channels are sampled at a regular interval. After a flush, basic stats about the sampled data are sent (internal metrics lag regular metrics by one flush interval) and the samples are reset.

| Channel name | Additional tags | Description |
|--------------|-----------------|-------------|
| dispatch_aggregator | aggregator_id | Channel to dispatch metrics to a specific aggregator. |
| backend_events_sem | | Semaphore limiting the number of events in flight at once. Corresponds to the `--max-concurrent-events` flag. |
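
To make the channel.* metrics concrete, here is a minimal sketch of how a set of samples from one flush interval could be reduced to the avg, min, max, last, capacity, and samples values. It is an illustration only, not gostatsd's implementation. The `ChannelStats` type and `Summarize` function are hypothetical names, and the sketch assumes each sample is the channel's length (`len(ch)`) at sampling time.

```go
package main

import (
	"errors"
	"fmt"
)

// ChannelStats mirrors the channel.* metrics listed in the metrics table.
type ChannelStats struct {
	Avg      float64
	Min      int
	Max      int
	Last     int
	Capacity int
	Samples  int
}

// Summarize reduces the samples taken during one flush interval.
// It requires at least one sample, matching the documented guarantee
// for channel.samples.
func Summarize(samples []int, capacity int) (ChannelStats, error) {
	if len(samples) == 0 {
		return ChannelStats{}, errors.New("at least one sample is required")
	}
	stats := ChannelStats{
		Min:      samples[0],
		Max:      samples[0],
		Last:     samples[len(samples)-1],
		Capacity: capacity,
		Samples:  len(samples),
	}
	sum := 0
	for _, v := range samples {
		if v < stats.Min {
			stats.Min = v
		}
		if v > stats.Max {
			stats.Max = v
		}
		sum += v
	}
	stats.Avg = float64(sum) / float64(len(samples))
	return stats, nil
}

func main() {
	// Hypothetical samples from a channel whose capacity is 100.
	stats, err := Summarize([]int{0, 5, 20, 3}, 100)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", stats)
}
```

Comparing the max value against capacity is one way to spot a channel that is close to filling up during an interval.
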
- If both `--internal-namespace` and `--namespace` are specified, and metrics are dispatched internally, the resulting metric will be `namespace.internal_namespace.metric`.
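
The naming rule in the note above can be sketched as a small helper. `QualifiedName` is a hypothetical function, not a gostatsd API, and the values `prod` and `statsd` are made up for illustration. Skipping empty parts when only one namespace is set is an assumption; the note only describes the case where both are specified.

```go
package main

import (
	"fmt"
	"strings"
)

// QualifiedName joins the parts in the documented order:
// namespace, then internal namespace, then the metric name.
// Assumption: an empty part is skipped rather than leaving a stray dot.
func QualifiedName(namespace, internalNamespace, metric string) string {
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, internalNamespace, metric} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, ".")
}

func main() {
	// Both namespaces set: prints "prod.statsd.heartbeat".
	fmt.Println(QualifiedName("prod", "statsd", "heartbeat"))
}
```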