LSM-based aggregation #11117

carsonip · 2023-06-30T10:28:00Z

Motivation/summary

Replace existing aggregation with apm-aggregation, the LSM-based aggregator.

Tasks:

Pass configured limits to apm-aggregation
Fix various test failures
~~Storage configuration~~ In-memory FS used by pebble
Benchmarks
Remove old aggregation code
~~aggregator Close should log if there's an error (x-pack processor stop error is not logged #11355)~~

Checklist

~~Update CHANGELOG.asciidoc (no user-facing changes)~~
~~Update package changelog.yml (only if changes to apmpackage have been made)~~
~~Documentation has been updated~~

For functional changes, consider:

Is it observable through the addition of either logging or metrics?
Is its use being published in telemetry to enable product improvement?
Have system tests been added to avoid regression?

How to test these changes

Send some events and ensure that APM UI works as usual
Ensure that APM with instrumentation enabled emits pebble metrics as well as aggregation overflow metrics
Ensure that APM Server logs when overflows occur

Related issues

mergify · 2023-06-30T10:28:33Z

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-7.x is the label to automatically backport to the 7.x branch.
backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

apmmachine · 2023-06-30T10:31:04Z

💚 Build Succeeded

Build stats

Start Time: 2023-07-25T16:56:34.083+0000
Duration: 6 min 49 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate and publish the docker images.
/test windows : Build & tests on Windows.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify · 2023-07-05T05:58:43Z

This pull request is now in conflicts. Could you fix it @carsonip? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lsm-poc upstream/lsm-poc
git merge upstream/main
git push upstream lsm-poc

carsonip · 2023-07-10T22:00:09Z

Progress update

I've been working on understanding and optimizing the performance of the LSM PoC. Performance improvements are WIP in the apm-aggregation repo, but here's a glimpse of how the LSM PoC fares against main:

                           │ /home/carson/projects/apm-server/apmbench-out/bench-main-2-merge.txt │ /home/carson/projects/apm-server/apmbench-out/bench-after-alloc-after-vtproto-2-merge.txt │
                           │                                sec/op                                │                              sec/op                                vs base                │
1000TransactionsSerial-512                                                           13.45m ± ∞ ¹                                                        21.76m ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentAll-512                                                                         393.2m ± ∞ ¹                                                        418.0m ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentGo-512                                                                          104.2m ± ∞ ¹                                                        103.7m ± ∞ ¹       ~ (p=0.667 n=2) ²
AgentNodeJS-512                                                                      51.13m ± ∞ ¹                                                        50.23m ± ∞ ¹       ~ (p=1.000 n=2) ²
AgentPython-512                                                                      162.8m ± ∞ ¹                                                        139.4m ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentRuby-512                                                                        89.96m ± ∞ ¹                                                        88.98m ± ∞ ¹       ~ (p=0.667 n=2) ²
geomean                                                                              86.29m                                                              91.51m        +6.05%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                           │ /home/carson/projects/apm-server/apmbench-out/bench-main-2-merge.txt │ /home/carson/projects/apm-server/apmbench-out/bench-after-alloc-after-vtproto-2-merge.txt │
                           │                              events/sec                              │                            events/sec                              vs base                │
1000TransactionsSerial-512                                                           74.44k ± ∞ ¹                                                        45.95k ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentAll-512                                                                         45.84k ± ∞ ¹                                                        43.21k ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentGo-512                                                                          50.22k ± ∞ ¹                                                        50.50k ± ∞ ¹       ~ (p=0.667 n=2) ²
AgentNodeJS-512                                                                      39.47k ± ∞ ¹                                                        40.19k ± ∞ ¹       ~ (p=1.000 n=2) ²
AgentPython-512                                                                      43.19k ± ∞ ¹                                                        50.38k ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentRuby-512                                                                        41.73k ± ∞ ¹                                                        42.21k ± ∞ ¹       ~ (p=0.667 n=2) ²
geomean                                                                              47.97k                                                              45.24k        -5.70%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                           │ /home/carson/projects/apm-server/apmbench-out/bench-main-2-merge.txt │ /home/carson/projects/apm-server/apmbench-out/bench-after-alloc-after-vtproto-2-merge.txt │
                           │                                 B/op                                 │                              B/op                                vs base                  │
1000TransactionsSerial-512                                                          5.667Mi ± ∞ ¹                                                    20.012Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentAll-512                                                                        251.7Mi ± ∞ ¹                                                     492.3Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentGo-512                                                                         63.21Mi ± ∞ ¹                                                    120.73Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentNodeJS-512                                                                     30.88Mi ± ∞ ¹                                                     59.81Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentPython-512                                                                     101.3Mi ± ∞ ¹                                                     200.2Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentRuby-512                                                                       59.99Mi ± ∞ ¹                                                    107.15Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
geomean                                                                             50.67Mi                                                           107.3Mi        +111.75%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                           │ /home/carson/projects/apm-server/apmbench-out/bench-main-2-merge.txt │ /home/carson/projects/apm-server/apmbench-out/bench-after-alloc-after-vtproto-2-merge.txt │
                           │                              allocs/op                               │                            allocs/op                              vs base                 │
1000TransactionsSerial-512                                                           73.25k ± ∞ ¹                                                      162.97k ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentAll-512                                                                         3.964M ± ∞ ¹                                                       5.090M ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentGo-512                                                                          854.1k ± ∞ ¹                                                      1151.5k ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentNodeJS-512                                                                      515.0k ± ∞ ¹                                                       659.8k ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentPython-512                                                                      1.700M ± ∞ ¹                                                       2.103M ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentRuby-512                                                                        940.8k ± ∞ ¹                                                      1182.0k ± ∞ ¹        ~ (p=0.333 n=2) ²
geomean                                                                              767.4k                                                             1.078M        +40.42%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

The numbers are quite unstable since I ran the benchmarks on my laptop. But other than 1000TransactionsSerial (similar to 1000Transactions but runs serially instead of in parallel), all the agent benchmarks seem fairly similar in sec/op and events/sec.

mergify · 2023-07-12T08:08:43Z

This pull request is now in conflicts. Could you fix it @carsonip? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lsm-poc upstream/lsm-poc
git merge upstream/main
git push upstream lsm-poc

apmmachine · 2023-07-13T16:04:33Z

📚 Go benchmark report

Diff with the main branch

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-server/internal/agentcfg
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                                  │ build/main/bench.out │             bench.out             │
                                  │        sec/op        │   sec/op     vs base              │
geomean                                      68.22n        68.21n       -0.01%

                                  │ build/main/bench.out │             bench.out              │
                                  │         B/op         │    B/op     vs base                │
geomean                                                ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                  │ build/main/bench.out │             bench.out              │
                                  │      allocs/op       │ allocs/op   vs base                │
geomean                                                ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/beater/request
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op     vs base               │
ContextReset/Remote_Addr_ipv6-12                       627.4n ± 32%   758.5n ± 11%  +20.91% (p=0.041 n=6)
ContextReset/Forwarded_ipv4-12                         774.1n ± 32%   579.9n ± 36%  -25.09% (p=0.015 n=6)
ContextResetContentEncoding/empty-12                   135.4n ±  3%   134.5n ±  0%   -0.66% (p=0.002 n=6)
geomean                                                957.0n         921.3n         -3.74%

                                             │ build/main/bench.out │              bench.out               │
                                             │         B/op         │     B/op      vs base                │
geomean                                                           ²                 +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                             │ build/main/bench.out │             bench.out              │
                                             │      allocs/op       │ allocs/op   vs base                │
geomean                                                           ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/publish
             │ build/main/bench.out │          bench.out          │
             │        sec/op        │   sec/op    vs base         │

             │ build/main/bench.out │           bench.out            │
             │         B/op         │     B/op       vs base         │

             │ build/main/bench.out │           bench.out           │
             │      allocs/op       │  allocs/op    vs base         │

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics
                 │ build/main/bench.out │          bench.out           │
                 │        sec/op        │   sec/op     vs base         │

                 │ build/main/bench.out │            bench.out            │
                 │         B/op         │     B/op      vs base           │
¹ all samples are equal

                 │ build/main/bench.out │           bench.out           │
                 │      allocs/op       │ allocs/op   vs base           │
¹ all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics
                        │ build/main/bench.out │          bench.out           │
                        │        sec/op        │   sec/op     vs base         │

                        │ build/main/bench.out │           bench.out           │
                        │         B/op         │    B/op     vs base           │
¹ all samples are equal

                        │ build/main/bench.out │           bench.out           │
                        │      allocs/op       │ allocs/op   vs base           │
¹ all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
               │ build/main/bench.out │             bench.out              │
               │        sec/op        │    sec/op     vs base              │
geomean                  419.4n         408.9n        -2.51%

               │ build/main/bench.out │              bench.out               │
               │         B/op         │     B/op      vs base                │
geomean                    382.7          382.0       -0.19%
¹ all samples are equal

               │ build/main/bench.out │             bench.out              │
               │      allocs/op       │ allocs/op   vs base                │
geomean                    3.742        3.742       +0.00%
¹ all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op      vs base              │
geomean                                               13.05µ          12.78µ         -2.07%

                                             │ build/main/bench.out │               bench.out               │
                                             │         B/op         │     B/op       vs base                │
ReadEvents/proto_codec_big_tx/1000_events-12          2.826Mi ±  0%   2.825Mi ±  0%  -0.04% (p=0.048 n=6)
geomean                                               11.30Ki         11.31Ki        +0.09%
¹ all samples are equal

                                             │ build/main/bench.out │              bench.out              │
                                             │      allocs/op       │  allocs/op   vs base                │
geomean                                                  123.0         123.0       +0.00%
¹ all samples are equal

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

mergify · 2023-07-17T00:16:03Z

This pull request is now in conflicts. Could you fix it @carsonip? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lsm-poc upstream/lsm-poc
git merge upstream/main
git push upstream lsm-poc

axw

LGTM! Just a few minor things.

systemtest/aggregation_test.go

x-pack/apm-server/aggregation/lsm.go

x-pack/apm-server/main.go

x-pack/apm-server/aggregation/lsm.go

carsonip · 2023-08-08T16:29:02Z

changelogs/head.asciidoc

+==== Aggregation improvements
+- Replace aggregation implementation {pull}11117[11117]
+- Add `service.language.name` to service destination metrics {pull}11117[11117]
+- Modify per-service transaction groups limit to consider more than service.name; Add per-service service destination groups limit and per-service service transaction groups limit {pull}11117[11117]


@simitt WDYT about the changelog?

my 2c: I think we should improve the first message to briefly mention what it means for the user. At the very least we should mention the significantly reduced memory cost of aggregation from moving from the in-memory approach to the on-disk, LSM-based approach.

I wanted to do so, but struggled to word it properly 🤦

dev_docs/trace_metrics.md

kruskall · 2023-08-31T12:13:57Z

Tested and everything seems to be working.

I spotted an issue with the way we are using the vt pool: some objects are not returned to the pool. It shouldn't be a regression, since we were not using the pool at all in older versions. It is something that we can iterate on.

carsonip · 2023-08-31T12:45:25Z

@kruskall do you mind leaving a note about where we did not return vt objects, or alternatively create an issue to do so? I'm very interested to know where we missed it.

kruskall · 2023-08-31T14:14:34Z

Yes, I plan to create follow-up issues and open PRs after making sure that they are correct.

Replace existing aggregation implementation with apm-aggregation, the LSM-based aggregator.

LSM-based aggregation PoC

0768052

mergify bot added the backport-skip label Jun 30, 2023

Fix pebble dir removed bug; Add TODO on log config

e9710bf

carsonip force-pushed the lsm-poc branch from 20bf677 to e9710bf Compare July 3, 2023 14:34

carsonip added 4 commits July 7, 2023 01:18

Update apm-aggregation

e6c90ae

What have I done?

9236cfe

Merge branch 'main' into lsm-poc

a6ba5ce

Merge main

9fd5152

carsonip added 2 commits July 11, 2023 16:39

Respect apm-server log config

6ecf778

Use apm-server limits

92a3c21

Update apm-aggregation

f136127

carsonip force-pushed the lsm-poc branch from ddee2da to f136127 Compare July 12, 2023 13:15

carsonip added 5 commits July 12, 2023 14:15

Use data storage directory

1d4067c

Use apm-server tracer to record metrics

a5ab3d2

Merge branch 'main' into lsm-poc

fe51baa

Use otel global tracer and meter provider

1838ca1

Update apm-aggregation

2b37ad3

carsonip added 6 commits July 20, 2023 19:50

Merge branch 'main' into lsm-poc

5260b0c

Update apm-aggregation

2c25e6b

Use in-memory option

54d1c33

test: Remove aggregation from TestMonitoring

913bd84

test: Remove aggregation from TestMonitoring again

28b3025

Fix NOTICE.txt

257369f

carsonip marked this pull request as draft August 7, 2023 19:09

carsonip marked this pull request as ready for review August 7, 2023 19:10

axw approved these changes Aug 8, 2023

carsonip added 2 commits August 8, 2023 10:14

Address review comments

618dcbd

Handle ErrAggregatorClosed

77334e6

carsonip mentioned this pull request Aug 8, 2023

Log when aggregation overflow happens #11362

Closed

carsonip added 2 commits August 8, 2023 16:32

Update docs about service destination service.language.name

0b85ddb

Add changelog

4652eda

carsonip mentioned this pull request Aug 8, 2023

Overflows should emit logs and metrics elastic/apm-aggregation#75

Closed

Update NOTICE.txt

67149d2

carsonip commented Aug 8, 2023

kruskall approved these changes Aug 8, 2023

carsonip added 4 commits August 8, 2023 17:46

Update changelog

e028a15

Update changelog

eb4cc88

Merge branch 'main' into lsm-poc

6f740c6

Update docs on limits

e454ed5

carsonip commented Aug 8, 2023

dev_docs/trace_metrics.md

axw reviewed Aug 9, 2023

dev_docs/trace_metrics.md

carsonip added 2 commits August 9, 2023 11:43

Link to docs

f8a1843

Merge branch 'main' into lsm-poc

f6bdcc9

carsonip merged commit 020bc63 into elastic:main Aug 9, 2023
8 checks passed

carsonip mentioned this pull request Aug 15, 2023

386 build failing after LSM PR merge #11394

Closed

lahsivjar added the test-plan and v8.10.0 labels Aug 22, 2023

kruskall self-assigned this Aug 31, 2023

kruskall added the test-plan-ok label Aug 31, 2023

bmorelli25 pushed a commit to bmorelli25/apm-server that referenced this pull request Sep 5, 2023

LSM-based aggregation (elastic#11117)

0ca71c2

Replace existing aggregation implementation with apm-aggregation, the LSM-based aggregator.
