LSM-based aggregation #11117

Merged: 50 commits, merged on Aug 9, 2023

Commits
0768052 LSM-based aggregation PoC (carsonip, Jun 30, 2023)
e9710bf Fix pebble dir removed bug; Add TODO on log config (carsonip, Jul 3, 2023)
e6c90ae Update apm-aggregation (carsonip, Jul 7, 2023)
9236cfe What have I done? (carsonip, Jul 10, 2023)
a6ba5ce Merge branch 'main' into lsm-poc (carsonip, Jul 10, 2023)
9fd5152 Merge main (carsonip, Jul 10, 2023)
6ecf778 Respect apm-server log config (carsonip, Jul 11, 2023)
92a3c21 Use apm-server limits (carsonip, Jul 11, 2023)
f136127 Update apm-aggregation (carsonip, Jul 12, 2023)
1d4067c Use data storage directory (carsonip, Jul 12, 2023)
a5ab3d2 Use apm-server tracer to record metrics (carsonip, Jul 12, 2023)
fe51baa Merge branch 'main' into lsm-poc (carsonip, Jul 13, 2023)
1838ca1 Use otel global tracer and meter provider (carsonip, Jul 13, 2023)
2b37ad3 Update apm-aggregation (carsonip, Jul 13, 2023)
5260b0c Merge branch 'main' into lsm-poc (carsonip, Jul 20, 2023)
2c25e6b Update apm-aggregation (carsonip, Jul 20, 2023)
54d1c33 Use in-memory option (carsonip, Jul 25, 2023)
913bd84 test: Remove aggregation from TestMonitoring (carsonip, Jul 25, 2023)
28b3025 test: Remove aggregation from TestMonitoring again (carsonip, Jul 25, 2023)
257369f Fix NOTICE.txt (carsonip, Jul 25, 2023)
569b7f5 systemtest: Add service.language.name to service destination aggregation (carsonip, Jul 25, 2023)
e0f14ac Fix newProcessors make cap (carsonip, Jul 25, 2023)
c6b74a1 Remove storage config as in-memory is used (carsonip, Jul 25, 2023)
6bf63a2 Merge branch 'main' into lsm-poc (carsonip, Jul 25, 2023)
23ee273 Merge branch 'main' into lsm-poc (carsonip, Jul 28, 2023)
a554a96 Update apm-aggregation (carsonip, Jul 28, 2023)
162b7ed Update apm-aggregation (carsonip, Jul 31, 2023)
68522f6 Update apm-aggregation (carsonip, Jul 31, 2023)
55b345f Merge branch 'main' into lsm-poc (carsonip, Jul 31, 2023)
b23e104 Update apm-aggregation (carsonip, Aug 1, 2023)
1f8acaf Update apm-aggregation (carsonip, Aug 2, 2023)
731d189 Remove global labels from RUM events; Re-enable tests (carsonip, Aug 3, 2023)
0c8f54f test: Fix service summary overflow test (carsonip, Aug 3, 2023)
13870ce Refactor config (carsonip, Aug 3, 2023)
2174f6c Remove old aggregation code (carsonip, Aug 3, 2023)
16c3721 Update apm-aggregation (carsonip, Aug 4, 2023)
03a55a7 Completely remove Transaction.MaxServices (carsonip, Aug 4, 2023)
bdcb0dd Fix typo (carsonip, Aug 7, 2023)
3c666e1 Merge branch 'main' into lsm-poc (carsonip, Aug 7, 2023)
618dcbd Address review comments (carsonip, Aug 8, 2023)
77334e6 Handle ErrAggregatorClosed (carsonip, Aug 8, 2023)
0b85ddb Update docs about service destination service.language.name (carsonip, Aug 8, 2023)
4652eda Add changelog (carsonip, Aug 8, 2023)
67149d2 Update NOTICE.txt (carsonip, Aug 8, 2023)
e028a15 Update changelog (carsonip, Aug 8, 2023)
eb4cc88 Update changelog (carsonip, Aug 8, 2023)
6f740c6 Merge branch 'main' into lsm-poc (carsonip, Aug 8, 2023)
e454ed5 Update docs on limits (carsonip, Aug 8, 2023)
f8a1843 Link to docs (carsonip, Aug 9, 2023)
f6bdcc9 Merge branch 'main' into lsm-poc (carsonip, Aug 9, 2023)

5,720 changes: 3,848 additions & 1,872 deletions NOTICE.txt

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions changelogs/head.asciidoc
@@ -21,3 +21,9 @@ https://github.com/elastic/apm-server/compare/8.9\...main[View commits]
 - Add a self-instrumentation transaction to the agent config periodic refresh action. {pull}11129[11129]
 - Stop dropping metadata fields from span documents. {pull}11089[11089]
 - Add permissions to reroute events in the integration package. {pull}11168[11168]
+
+[float]
+==== Aggregation improvements
+- Replace aggregation with LSM-based aggregator which has a lower memory footprint {pull}11117[11117]
+- Add `service.language.name` to service destination metrics {pull}11117[11117]
+- Modify per-service transaction groups limit to consider more than service.name; Add per-service service destination groups limit and per-service service transaction groups limit {pull}11117[11117]
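
The last changelog entry says per-service limits now consider more than `service.name`. The data model docs changed in this PR define a service as the combination of `service.name`, `service.environment`, `service.language.name` and `agent.name`. The Go sketch below illustrates that idea under stated assumptions. `serviceKey`, `groupLimiter`, `newGroupLimiter` and `admit` are hypothetical names, not the apm-aggregation API. The sketch also refuses a group outright instead of modelling the aggregator's real overflow handling, which this diff does not show.

```go
package main

import "fmt"

// serviceKey is a hypothetical composite key built from the four fields the
// data model docs use to define a service.
type serviceKey struct {
	Name         string // service.name
	Environment  string // service.environment
	LanguageName string // service.language.name
	AgentName    string // agent.name
}

// groupLimiter is an illustrative bounded tracker (not apm-aggregation's API).
type groupLimiter struct {
	maxServices, maxGroups, maxPerService int
	total                                 int
	perService                            map[serviceKey]map[string]struct{}
}

func newGroupLimiter(maxServices, maxGroups, maxPerService int) *groupLimiter {
	return &groupLimiter{
		maxServices:   maxServices,
		maxGroups:     maxGroups,
		maxPerService: maxPerService,
		perService:    make(map[serviceKey]map[string]struct{}),
	}
}

// admit reports whether group can be tracked for svc. Groups already tracked
// are always admitted; a new group is refused if it would exceed the total
// group limit, the per-service limit, or (for a new service) the service limit.
func (l *groupLimiter) admit(svc serviceKey, group string) bool {
	groups := l.perService[svc] // nil for a service not seen yet
	if _, ok := groups[group]; ok {
		return true
	}
	if l.total >= l.maxGroups || len(groups) >= l.maxPerService {
		return false
	}
	if groups == nil {
		if len(l.perService) >= l.maxServices {
			return false
		}
		groups = make(map[string]struct{})
		l.perService[svc] = groups
	}
	groups[group] = struct{}{}
	l.total++
	return true
}

func main() {
	// Each service may use 10% of the 10 groups, i.e. 1 group.
	l := newGroupLimiter(2, 10, 1)
	goSvc := serviceKey{"checkout", "production", "Go", "go"}
	rubySvc := serviceKey{"checkout", "production", "ruby", "ruby"}

	fmt.Println(l.admit(goSvc, "GET /cart"))   // true
	fmt.Println(l.admit(goSvc, "POST /pay"))   // false: per-service limit reached
	fmt.Println(l.admit(rubySvc, "GET /cart")) // true: same name, different language and agent
}
```

Because `service.language.name` and `agent.name` are part of the key, two services that share a `service.name` but are written in different languages each get their own per-service budget.
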
20 changes: 7 additions & 13 deletions dev_docs/trace_metrics.md
@@ -15,26 +15,19 @@ As transactions are observed by APM Server, it groups them according to various
 attributes such as `service.name`, `transaction.name`, and `kubernetes.pod.name`.
 The latency is then recorded in an [HDRHistogram](http://hdrhistogram.org/) for
 that group. Transaction group latency histograms are periodically indexed (every
-minute by default), with configurable precision (defaults to 2 significant figures).
+minute by default), with a fixed precision of 2 significant figures.
 
 To protect against memory exhaustion due to high-cardinality transaction names
 (or other attributes), at any given time, APM Server places a limit on the number
 of services tracked, the number of transaction groups tracked, as well as number
-of groups tracked per service.
-
-By default, the limits are 1,000 services per GB of memory, 5,000 transaction groups
-per GB of memory. When transaction group latency histograms are indexed, the groups
-are reset, enabling a different set of groups to be recorded.
-The per-service limit is 10% of the global limit. For example, for a 2GB APM Server,
-the limits are 2,000 services, 10,000 transaction groups, and for each service,
-there can be a maximum of 1,000 unique transaction groups.
+of groups tracked per service. See docs for limits.
 
 ## Service transaction metrics
 
 Service transaction metrics are similar to Transaction metrics, but with fewer
 dimensions. For example, `transaction.name` is no longer considered during aggregation.
 
-A limit of 1,000 unique service transaction groups per GB of memory is enforced.
+See docs for limits.
 
 ## Service destination metrics
 
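
The hunk above fixes the HDRHistogram precision at 2 significant figures. The sketch below re-derives HDRHistogram's bucket arithmetic for that setting to show what the precision means in practice. It assumes a lowest discernible value of 1 and non-negative integer latencies. With 2 significant figures there are 256 sub-buckets, so values below 256 are stored exactly. Larger values share a bucket whose width is at most 1/128 of its lower bound. `equivalentRange` is a hypothetical helper, not part of go-hdrhistogram or apm-aggregation.

```go
package main

import (
	"fmt"
	"math/bits"
)

// For 2 significant figures HDRHistogram needs unit resolution up to
// 2*10^2 = 200, rounded up to a power of two: 256 sub-buckets (8 bits).
const subBucketBits = 8

// equivalentRange returns the lowest value and the width of the bucket that
// v is recorded in. All values in [lowest, lowest+width) are indistinguishable
// once recorded. Assumes a lowest discernible value of 1.
func equivalentRange(v uint64) (lowest, width uint64) {
	// Bits below the top 8 significant bits are dropped; values below 256
	// keep every bit (shift == 0).
	shift := bits.Len64(v|(1<<subBucketBits-1)) - subBucketBits
	width = 1 << shift
	lowest = v >> shift << shift
	return lowest, width
}

func main() {
	for _, v := range []uint64{100, 1000, 12345} {
		lo, w := equivalentRange(v)
		fmt.Printf("%d -> [%d, %d]\n", v, lo, lo+w-1)
	}
}
```

Running it prints `100 -> [100, 100]`, `1000 -> [1000, 1003]` and `12345 -> [12288, 12351]`. The relative error therefore stays below 1% regardless of magnitude.
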
@@ -43,15 +36,16 @@ from one service to another. This works much the same as transaction metrics
 aggregation: span events describing an operation that involves another service
 are grouped by the originating and target services, and the span latency is
 accumulated. For these metrics we record only a count and sum, enabling calculation
-of throughput and average latency. A default limit of 10,000 groups is
-imposed.
+of throughput and average latency.
+
+See docs for limits.
 
 ## Service summary metrics
 
 Service summary metrics consider transaction, error, log, and metric events and
 basically produce a summary of all services sending events.
 
-A limit of 1,000 unique service summary groups per GB of memory is enforced.
+See docs for limits.
 
 ## Dealing with sampling
 
18 changes: 13 additions & 5 deletions docs/data-model.asciidoc
@@ -461,6 +461,7 @@ You can filter and group by these dimensions:
 * `metricset.interval`: A string with the aggregation interval the metricset represents.
 * `numeric_labels`: Key-value object containing numeric labels set globally by the APM agents.
 * `service.environment`: The environment of the service that made the request
+* `service.language.name`: The language name of the service that served the transaction, for example `Go`
 * `service.name`: The name of the service that made the request
 * `service.target.name`: The target service name, for example `customer_db`
 * `service.target.type`: The target service type, for example `mysql`
@@ -540,12 +541,19 @@ there are limits on the number of unique groups tracked at any given time.
 
 Note that all the below limits may change in the future with further improvements.
 
-* For transaction metrics, the limits are 1000 services per GB of APM Server, and 5000 transaction
-groups per GB of APM Server. Additionally, each service may only consume up to 10% of the transaction groups,
+* For all the following metrics, they share a limit of 1000 services per GB of APM Server.
+** For transaction metrics, there is an additional limit of 5000 total transaction groups per GB of APM Server,
+and each service may only consume up to 10% of the transaction groups,
 which is 500 transaction groups per service per GB of APM Server.
-* For service-transaction metrics, the limit is 1000 service transaction groups per GB of APM Server.
-* For service-destination metrics, the limit is a constant of 10000 service destination groups.
-* For service-summary metrics, the limit is 1000 service summary groups per GB of APM Server.
+** For service-transaction metrics, there is an additional limit of 1000 total service transaction groups per GB of APM Server,
+and each service may only consume up to 10% of the service transaction groups,
+which is 100 service transaction groups per service per GB of APM Server.
+** For service-destination metrics, there is an additional limit of a constant 10000 total service destination groups,
+and each service may only consume up to 10% of the service destination groups,
+which is 1000 service destination groups per service.
+** For service-summary metrics, there is no additional limit.
 
+In the above, a service is defined as a combination of `service.name`, `service.environment`, `service.language.name` and `agent.name`.
+
 [float]
 ===== Overflows
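
The limits in the data model docs scale with APM Server memory, except for the service destination total, which is constant. The sketch below turns those figures into concrete numbers for a given size. `limitsForGB` and the `limits` field names are hypothetical. Truncating fractional results to `int` is an assumption, since the docs do not say how non-integer sizes are rounded.

```go
package main

import "fmt"

// limits mirrors the per-metric limits listed in the data model docs.
// Service summary metrics have no limit beyond the shared Services limit.
type limits struct {
	Services                int // shared by all the aggregated metrics
	TransactionGroups       int
	TransactionGroupsPerSvc int
	ServiceTxGroups         int
	ServiceTxGroupsPerSvc   int
	ServiceDestGroups       int
	ServiceDestGroupsPerSvc int
}

// limitsForGB applies the documented per-GB figures. The function and the
// truncation of fractional results to int are assumptions for illustration.
func limitsForGB(gb float64) limits {
	perGB := func(n float64) int { return int(n * gb) }
	l := limits{
		Services:          perGB(1000),
		TransactionGroups: perGB(5000),
		ServiceTxGroups:   perGB(1000),
		ServiceDestGroups: 10000, // constant, independent of memory
	}
	// Each service may consume at most 10% of each group limit.
	l.TransactionGroupsPerSvc = l.TransactionGroups / 10
	l.ServiceTxGroupsPerSvc = l.ServiceTxGroups / 10
	l.ServiceDestGroupsPerSvc = l.ServiceDestGroups / 10
	return l
}

func main() {
	fmt.Printf("%+v\n", limitsForGB(2))
}
```

For a 2 GB server, this yields 2,000 services, 10,000 transaction groups and 1,000 transaction groups per service. These match the example removed from `dev_docs/trace_metrics.md` in this PR.
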
3 changes: 3 additions & 0 deletions docs/data/elasticsearch/service_destination_metric.json
@@ -32,6 +32,9 @@
 },
 "service": {
 "environment": "production",
+"language": {
+"name": "ruby"
+},
 "name": "opbeans-ruby",
 "target": {
 "type": "postgresql"
23 changes: 19 additions & 4 deletions go.mod
@@ -3,10 +3,10 @@ module github.com/elastic/apm-server
go 1.20

require (
github.com/axiomhq/hyperloglog v0.0.0-20230201085229-3ddf4bad03dc
github.com/cespare/xxhash/v2 v2.2.0
github.com/dgraph-io/badger/v2 v2.2007.3-0.20201012072640-f5a7e0a1c83b
github.com/dustin/go-humanize v1.0.1
github.com/elastic/apm-aggregation v0.0.0-20230807142825-c82b2b7e590c
github.com/elastic/apm-data v0.1.1-0.20230803060036-9180b59d7888
github.com/elastic/beats/v7 v7.0.0-alpha2.0.20230808073125-1fe462c68f7d
github.com/elastic/elastic-agent-client/v7 v7.2.0
@@ -15,7 +15,6 @@ require (
github.com/elastic/gmux v0.2.0
github.com/elastic/go-docappender v0.2.1-0.20230724080315-b714d6181871
github.com/elastic/go-elasticsearch/v8 v8.9.0
github.com/elastic/go-hdrhistogram v0.1.0
github.com/elastic/go-sysinfo v1.11.0
github.com/elastic/go-ucfg v0.8.6
github.com/go-sourcemap/sourcemap v2.1.3+incompatible
@@ -59,13 +58,21 @@ require (
)

require (
github.com/DataDog/zstd v1.4.4 // indirect
github.com/DataDog/zstd v1.4.5 // indirect
github.com/Microsoft/go-winio v0.6.1 // indirect
github.com/OneOfOne/xxhash v1.2.8 // indirect
github.com/Shopify/sarama v1.38.1 // indirect
github.com/apache/thrift v0.18.1 // indirect
github.com/armon/go-radix v1.0.0 // indirect
github.com/axiomhq/hyperloglog v0.0.0-20230201085229-3ddf4bad03dc // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash v1.1.0 // indirect
github.com/cockroachdb/errors v1.8.1 // indirect
github.com/cockroachdb/logtags v0.0.0-20190617123548-eb05cc24525f // indirect
github.com/cockroachdb/pebble v0.0.0-20230627193317-c807f60529a3 // indirect
github.com/cockroachdb/redact v1.0.8 // indirect
github.com/cockroachdb/sentry-go v0.6.1-cockroachdb.2 // indirect
github.com/cockroachdb/tokenbucket v0.0.0-20230613231145-182959a1fad6 // indirect
github.com/containerd/containerd v1.7.1 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/dgraph-io/ristretto v0.1.1 // indirect
@@ -106,10 +113,13 @@ require (
github.com/joeshaw/multierror v0.0.0-20140124173710-69b34d4ec901 // indirect
github.com/json-iterator/go v1.1.12 // indirect
github.com/klauspost/compress v1.16.7 // indirect
github.com/kr/pretty v0.3.1 // indirect
github.com/kr/text v0.2.0 // indirect
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 // indirect
github.com/magefile/mage v1.15.0 // indirect
github.com/mattn/go-colorable v0.1.13 // indirect
github.com/mattn/go-isatty v0.0.17 // indirect
github.com/matttproud/golang_protobuf_extensions v1.0.4 // indirect
github.com/mitchellh/hashstructure v1.1.0 // indirect
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal v0.81.0 // indirect
@@ -119,8 +129,12 @@ require (
github.com/pierrec/lz4/v4 v4.1.17 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c // indirect
github.com/prometheus/client_golang v1.16.0 // indirect
github.com/prometheus/client_model v0.4.0 // indirect
github.com/prometheus/common v0.44.0 // indirect
github.com/prometheus/procfs v0.10.1 // indirect
github.com/rcrowley/go-metrics v0.0.0-20201227073835-cf1acfcdf475 // indirect
github.com/rogpeppe/go-internal v1.10.0 // indirect
github.com/shirou/gopsutil v3.21.11+incompatible // indirect
github.com/shirou/gopsutil/v3 v3.23.5 // indirect
github.com/shoenig/go-m1cpu v0.1.6 // indirect
@@ -141,7 +155,8 @@ require (
go.uber.org/atomic v1.11.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
golang.org/x/crypto v0.11.0 // indirect
golang.org/x/mod v0.10.0 // indirect
golang.org/x/exp v0.0.0-20230713183714-613f0c0eb8a1 // indirect
golang.org/x/mod v0.11.0 // indirect
golang.org/x/sys v0.10.0 // indirect
golang.org/x/text v0.11.0 // indirect
golang.org/x/tools v0.9.3 // indirect