LSM-based aggregation (elastic#11117)
Replace existing aggregation implementation with apm-aggregation, the LSM-based aggregator.
carsonip authored and bmorelli25 committed Sep 5, 2023
1 parent 334976d commit 0ca71c2
Showing 34 changed files with 4,275 additions and 7,617 deletions.
5,720 changes: 3,848 additions & 1,872 deletions NOTICE.txt

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions changelogs/head.asciidoc
@@ -21,3 +21,9 @@ https://github.com/elastic/apm-server/compare/8.9\...main[View commits]
- Add a self-instrumentation transaction to the agent config periodic refresh action. {pull}11129[11129]
- Stop dropping metadata fields from span documents. {pull}11089[11089]
- Add permissions to reroute events in the integration package. {pull}11168[11168]

[float]
==== Aggregation improvements
- Replace aggregation with LSM-based aggregator which has a lower memory footprint {pull}11117[11117]
- Add `service.language.name` to service destination metrics {pull}11117[11117]
- Modify per-service transaction groups limit to consider more than service.name; Add per-service service destination groups limit and per-service service transaction groups limit {pull}11117[11117]
20 changes: 7 additions & 13 deletions dev_docs/trace_metrics.md
@@ -15,26 +15,19 @@ As transactions are observed by APM Server, it groups them according to various
attributes such as `service.name`, `transaction.name`, and `kubernetes.pod.name`.
The latency is then recorded in an [HDRHistogram](http://hdrhistogram.org/) for
that group. Transaction group latency histograms are periodically indexed (every
minute by default), with configurable precision (defaults to 2 significant figures).
minute by default), with a fixed precision of 2 significant figures.

To protect against memory exhaustion due to high-cardinality transaction names
(or other attributes), at any given time, APM Server places a limit on the number
of services tracked, the number of transaction groups tracked, as well as the number
of groups tracked per service.

By default, the limits are 1,000 services per GB of memory, 5,000 transaction groups
per GB of memory. When transaction group latency histograms are indexed, the groups
are reset, enabling a different set of groups to be recorded.
The per-service limit is 10% of the global limit. For example, for a 2GB APM Server,
the limits are 2,000 services, 10,000 transaction groups, and for each service,
there can be a maximum of 1,000 unique transaction groups.
of groups tracked per service. See [docs](https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html#_aggregated_metrics_limits_and_overflows) for limits.

## Service transaction metrics

Service transaction metrics are similar to Transaction metrics, but with fewer
dimensions. For example, `transaction.name` is no longer considered during aggregation.

A limit of 1,000 unique service transaction groups per GB of memory is enforced.
See [docs](https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html#_aggregated_metrics_limits_and_overflows) for limits.

## Service destination metrics

@@ -43,15 +36,16 @@ from one service to another. This works much the same as transaction metrics
aggregation: span events describing an operation that involves another service
are grouped by the originating and target services, and the span latency is
accumulated. For these metrics we record only a count and sum, enabling calculation
of throughput and average latency. A default limit of 10,000 groups is
imposed.
of throughput and average latency.

See [docs](https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html#_aggregated_metrics_limits_and_overflows) for limits.

## Service summary metrics

Service summary metrics consider transaction, error, log, and metric events and
produce a summary of all services sending events.

A limit of 1,000 unique service summary groups per GB of memory is enforced.
See [docs](https://www.elastic.co/guide/en/apm/guide/current/data-model-metrics.html#_aggregated_metrics_limits_and_overflows) for limits.

## Dealing with sampling

18 changes: 13 additions & 5 deletions docs/data-model.asciidoc
@@ -461,6 +461,7 @@ You can filter and group by these dimensions:
* `metricset.interval`: A string with the aggregation interval the metricset represents.
* `numeric_labels`: Key-value object containing numeric labels set globally by the APM agents.
* `service.environment`: The environment of the service that made the request
* `service.language.name`: The language name of the service that served the transaction, for example `Go`
* `service.name`: The name of the service that made the request
* `service.target.name`: The target service name, for example `customer_db`
* `service.target.type`: The target service type, for example `mysql`
@@ -540,12 +541,19 @@ there are limits on the number of unique groups tracked at any given time.

Note that all the below limits may change in the future with further improvements.

* For transaction metrics, the limits are 1000 services per GB of APM Server, and 5000 transaction
groups per GB of APM Server. Additionally, each service may only consume up to 10% of the transaction groups,
* All of the following metrics share a limit of 1000 services per GB of APM Server.
** For transaction metrics, there is an additional limit of 5000 total transaction groups per GB of APM Server,
and each service may only consume up to 10% of the transaction groups,
which is 500 transaction groups per service per GB of APM Server.
* For service-transaction metrics, the limit is 1000 service transaction groups per GB of APM Server.
* For service-destination metrics, the limit is a constant of 10000 service destination groups.
* For service-summary metrics, the limit is 1000 service summary groups per GB of APM Server.
** For service-transaction metrics, there is an additional limit of 1000 total service transaction groups per GB of APM Server,
and each service may only consume up to 10% of the service transaction groups,
which is 100 service transaction groups per service per GB of APM Server.
** For service-destination metrics, there is an additional limit of a constant 10000 total service destination groups,
and each service may only consume up to 10% of the service destination groups,
which is 1000 service destination groups per service.
** For service-summary metrics, there is no additional limit.

In the above, a service is defined as a combination of `service.name`, `service.environment`, `service.language.name` and `agent.name`.
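
As a worked example of the limits listed above, the following Go sketch computes them for a given memory size. It is illustrative only: the `groupLimits` type, its field names, and `limitsFor` are hypothetical and not part of APM Server, and the sketch assumes memory is expressed in whole GB and that limits scale linearly as documented.

```go
package main

import "fmt"

// groupLimits holds the documented limits; the field names are hypothetical.
type groupLimits struct {
	Services                       int
	TransactionGroups              int
	TransactionGroupsPerService    int
	ServiceTransactionGroups       int
	ServiceTxnGroupsPerService     int
	ServiceDestinationGroups       int
	ServiceDestGroupsPerService    int
}

// limitsFor applies the documented per-GB limits to memoryGB whole gigabytes.
// Each per-service limit is 10% of the corresponding total.
func limitsFor(memoryGB int) groupLimits {
	txnGroups := 5000 * memoryGB
	svcTxnGroups := 1000 * memoryGB
	const svcDestGroups = 10000 // constant, independent of memory

	return groupLimits{
		Services:                    1000 * memoryGB,
		TransactionGroups:           txnGroups,
		TransactionGroupsPerService: txnGroups / 10,
		ServiceTransactionGroups:    svcTxnGroups,
		ServiceTxnGroupsPerService:  svcTxnGroups / 10,
		ServiceDestinationGroups:    svcDestGroups,
		ServiceDestGroupsPerService: svcDestGroups / 10,
	}
}

func main() {
	// Print the limits for a 2GB APM Server.
	fmt.Printf("%+v\n", limitsFor(2))
}
```

For a 2GB APM Server this yields 2000 services, 10000 transaction groups (1000 per service), 2000 service transaction groups (200 per service), and 10000 service destination groups (1000 per service).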

[float]
===== Overflows
3 changes: 3 additions & 0 deletions docs/data/elasticsearch/service_destination_metric.json
@@ -32,6 +32,9 @@
},
"service": {
"environment": "production",
"language": {
"name": "ruby"
},
"name": "opbeans-ruby",
"target": {
"type": "postgresql"
23 changes: 19 additions & 4 deletions go.mod
@@ -3,10 +3,10 @@ module github.com/elastic/apm-server
go 1.20

require (
github.com/axiomhq/hyperloglog v0.0.0-20230201085229-3ddf4bad03dc
github.com/cespare/xxhash/v2 v2.2.0
github.com/dgraph-io/badger/v2 v2.2007.3-0.20201012072640-f5a7e0a1c83b
github.com/dustin/go-humanize v1.0.1
github.com/elastic/apm-aggregation v0.0.0-20230807142825-c82b2b7e590c
github.com/elastic/apm-data v0.1.1-0.20230803060036-9180b59d7888
github.com/elastic/beats/v7 v7.0.0-alpha2.0.20230808073125-1fe462c68f7d
github.com/elastic/elastic-agent-client/v7 v7.2.0
@@ -15,7 +15,6 @@ require (
github.com/elastic/gmux v0.2.0
github.com/elastic/go-docappender v0.2.1-0.20230724080315-b714d6181871
github.com/elastic/go-elasticsearch/v8 v8.9.0
github.com/elastic/go-hdrhistogram v0.1.0
github.com/elastic/go-sysinfo v1.11.0
github.com/elastic/go-ucfg v0.8.6
github.com/go-sourcemap/sourcemap v2.1.3+incompatible
@@ -59,13 +58,21 @@ require (
)

require (
github.com/DataDog/zstd v1.4.4 // indirect
github.com/DataDog/zstd v1.4.5 // indirect
github.com/Microsoft/go-winio v0.6.1 // indirect
github.com/OneOfOne/xxhash v1.2.8 // indirect
github.com/Shopify/sarama v1.38.1 // indirect
github.com/apache/thrift v0.18.1 // indirect
github.com/armon/go-radix v1.0.0 // indirect
github.com/axiomhq/hyperloglog v0.0.0-20230201085229-3ddf4bad03dc // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash v1.1.0 // indirect
github.com/cockroachdb/errors v1.8.1 // indirect
github.com/cockroachdb/logtags v0.0.0-20190617123548-eb05cc24525f // indirect
github.com/cockroachdb/pebble v0.0.0-20230627193317-c807f60529a3 // indirect
github.com/cockroachdb/redact v1.0.8 // indirect
github.com/cockroachdb/sentry-go v0.6.1-cockroachdb.2 // indirect
github.com/cockroachdb/tokenbucket v0.0.0-20230613231145-182959a1fad6 // indirect
github.com/containerd/containerd v1.7.1 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/dgraph-io/ristretto v0.1.1 // indirect
@@ -106,10 +113,13 @@ require (
github.com/joeshaw/multierror v0.0.0-20140124173710-69b34d4ec901 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/klauspost/compress v1.16.7 // indirect
github.com/kr/pretty v0.3.1 // indirect
github.com/kr/text v0.2.0 // indirect
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 // indirect
github.com/magefile/mage v1.15.0 // indirect
github.com/mattn/go-colorable v0.1.13 // indirect
github.com/mattn/go-isatty v0.0.17 // indirect
github.com/matttproud/golang_protobuf_extensions v1.0.4 // indirect
github.com/mitchellh/hashstructure v1.1.0 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal v0.81.0 // indirect
@@ -119,8 +129,12 @@ require (
github.com/pierrec/lz4/v4 v4.1.17 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c // indirect
github.com/prometheus/client_golang v1.16.0 // indirect
github.com/prometheus/client_model v0.4.0 // indirect
github.com/prometheus/common v0.44.0 // indirect
github.com/prometheus/procfs v0.10.1 // indirect
github.com/rcrowley/go-metrics v0.0.0-20201227073835-cf1acfcdf475 // indirect
github.com/rogpeppe/go-internal v1.10.0 // indirect
github.com/shirou/gopsutil v3.21.11+incompatible // indirect
github.com/shirou/gopsutil/v3 v3.23.5 // indirect
github.com/shoenig/go-m1cpu v0.1.6 // indirect
@@ -141,7 +155,8 @@ require (
go.uber.org/atomic v1.11.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
golang.org/x/crypto v0.11.0 // indirect
golang.org/x/mod v0.10.0 // indirect
golang.org/x/exp v0.0.0-20230713183714-613f0c0eb8a1 // indirect
golang.org/x/mod v0.11.0 // indirect
golang.org/x/sys v0.10.0 // indirect
golang.org/x/text v0.11.0 // indirect
golang.org/x/tools v0.9.3 // indirect