LSM-based aggregation #11117

Merged
merged 50 commits into from
Aug 9, 2023

Conversation

Member

@carsonip carsonip commented Jun 30, 2023

Motivation/summary

Replace existing aggregation with apm-aggregation, the LSM-based aggregator.

Tasks:

  • Pass configured limits to apm-aggregation
  • Fix various test failures
  • Storage configuration In-memory FS used by pebble
  • Benchmarks
  • Remove old aggregation code
    - [ ] aggregator Close should log if there's an error (x-pack processor stop error is not logged #11355)

Checklist

(no user facing changes)
- [ ] Update CHANGELOG.asciidoc
- [ ] Update package changelog.yml (only if changes to apmpackage have been made)
- [ ] Documentation has been updated

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

  • Send some events and ensure that APM UI works as usual
  • Ensure that APM with instrumentation enabled emits pebble metrics as well as aggregation overflow metrics
  • Ensure that apm server logs on overflows
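
To make the overflow checks above concrete, here is a minimal sketch of a group limit with an overflow bucket. It is not the apm-aggregation implementation: the type `limitedCounter`, its fields, and the `_other` bucket name are assumptions made for illustration only.

```go
package main

import "fmt"

// overflowKey is an assumed name for the catch-all bucket.
const overflowKey = "_other"

// limitedCounter (hypothetical) counts events per group. Once maxGroups
// distinct groups exist, events for new groups are counted as overflow.
type limitedCounter struct {
	maxGroups int
	counts    map[string]int
	overflow  int // events that could not get their own group
}

func newLimitedCounter(maxGroups int) *limitedCounter {
	return &limitedCounter{maxGroups: maxGroups, counts: make(map[string]int)}
}

// add records one event for group and reports whether it overflowed.
func (c *limitedCounter) add(group string) bool {
	if _, ok := c.counts[group]; ok {
		c.counts[group]++
		return false
	}
	if len(c.counts) < c.maxGroups {
		c.counts[group] = 1
		return false
	}
	c.overflow++
	return true
}

func main() {
	c := newLimitedCounter(2)
	for _, g := range []string{"GET /a", "GET /b", "GET /c", "GET /a", "GET /d"} {
		if c.add(g) {
			// A real server would log and increment an overflow metric here.
			fmt.Printf("group limit reached: %q counted under %q\n", g, overflowKey)
		}
	}
	fmt.Println(c.counts, c.overflow)
}
```

With a limit of 2, the third and fifth events overflow, which is the kind of situation the log and metric checks are meant to surface.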

Related issues

@mergify
Contributor

mergify bot commented Jun 30, 2023

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.x is the label to automatically backport to the 7.x branch.
  • backport-7./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip label Jun 30, 2023
@apmmachine
Contributor

apmmachine commented Jun 30, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-07-25T16:56:34.083+0000

  • Duration: 6 min 49 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate and publish the docker images.

  • /test windows : Build & tests on Windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify bot commented Jul 5, 2023

This pull request is now in conflicts. Could you fix it @carsonip? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lsm-poc upstream/lsm-poc
git merge upstream/main
git push upstream lsm-poc

@carsonip
Member Author

Progress update

I've been working on understanding and optimizing the performance of the LSM PoC. Performance improvements are WIP in the apm-aggregation repo, but here's a glimpse of how the LSM PoC fares against main:

                           │ /home/carson/projects/apm-server/apmbench-out/bench-main-2-merge.txt │ /home/carson/projects/apm-server/apmbench-out/bench-after-alloc-after-vtproto-2-merge.txt │
                           │                                sec/op                                │                              sec/op                                vs base                │
1000TransactionsSerial-512                                                           13.45m ± ∞ ¹                                                        21.76m ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentAll-512                                                                         393.2m ± ∞ ¹                                                        418.0m ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentGo-512                                                                          104.2m ± ∞ ¹                                                        103.7m ± ∞ ¹       ~ (p=0.667 n=2) ²
AgentNodeJS-512                                                                      51.13m ± ∞ ¹                                                        50.23m ± ∞ ¹       ~ (p=1.000 n=2) ²
AgentPython-512                                                                      162.8m ± ∞ ¹                                                        139.4m ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentRuby-512                                                                        89.96m ± ∞ ¹                                                        88.98m ± ∞ ¹       ~ (p=0.667 n=2) ²
geomean                                                                              86.29m                                                              91.51m        +6.05%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                           │ /home/carson/projects/apm-server/apmbench-out/bench-main-2-merge.txt │ /home/carson/projects/apm-server/apmbench-out/bench-after-alloc-after-vtproto-2-merge.txt │
                           │                              events/sec                              │                            events/sec                              vs base                │
1000TransactionsSerial-512                                                           74.44k ± ∞ ¹                                                        45.95k ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentAll-512                                                                         45.84k ± ∞ ¹                                                        43.21k ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentGo-512                                                                          50.22k ± ∞ ¹                                                        50.50k ± ∞ ¹       ~ (p=0.667 n=2) ²
AgentNodeJS-512                                                                      39.47k ± ∞ ¹                                                        40.19k ± ∞ ¹       ~ (p=1.000 n=2) ²
AgentPython-512                                                                      43.19k ± ∞ ¹                                                        50.38k ± ∞ ¹       ~ (p=0.333 n=2) ²
AgentRuby-512                                                                        41.73k ± ∞ ¹                                                        42.21k ± ∞ ¹       ~ (p=0.667 n=2) ²
geomean                                                                              47.97k                                                              45.24k        -5.70%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                           │ /home/carson/projects/apm-server/apmbench-out/bench-main-2-merge.txt │ /home/carson/projects/apm-server/apmbench-out/bench-after-alloc-after-vtproto-2-merge.txt │
                           │                                 B/op                                 │                              B/op                                vs base                  │
1000TransactionsSerial-512                                                          5.667Mi ± ∞ ¹                                                    20.012Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentAll-512                                                                        251.7Mi ± ∞ ¹                                                     492.3Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentGo-512                                                                         63.21Mi ± ∞ ¹                                                    120.73Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentNodeJS-512                                                                     30.88Mi ± ∞ ¹                                                     59.81Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentPython-512                                                                     101.3Mi ± ∞ ¹                                                     200.2Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
AgentRuby-512                                                                       59.99Mi ± ∞ ¹                                                    107.15Mi ± ∞ ¹         ~ (p=0.333 n=2) ²
geomean                                                                             50.67Mi                                                           107.3Mi        +111.75%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                           │ /home/carson/projects/apm-server/apmbench-out/bench-main-2-merge.txt │ /home/carson/projects/apm-server/apmbench-out/bench-after-alloc-after-vtproto-2-merge.txt │
                           │                              allocs/op                               │                            allocs/op                              vs base                 │
1000TransactionsSerial-512                                                           73.25k ± ∞ ¹                                                      162.97k ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentAll-512                                                                         3.964M ± ∞ ¹                                                       5.090M ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentGo-512                                                                          854.1k ± ∞ ¹                                                      1151.5k ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentNodeJS-512                                                                      515.0k ± ∞ ¹                                                       659.8k ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentPython-512                                                                      1.700M ± ∞ ¹                                                       2.103M ± ∞ ¹        ~ (p=0.333 n=2) ²
AgentRuby-512                                                                        940.8k ± ∞ ¹                                                      1182.0k ± ∞ ¹        ~ (p=0.333 n=2) ²
geomean                                                                              767.4k                                                             1.078M        +40.42%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

The numbers are quite unstable since I ran the benchmarks on my laptop. But other than 1000TransactionsSerial (similar to 1000Transactions but runs in serial instead of in parallel), all the agent benchmarks seem to be fairly similar.
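
For readers unfamiliar with the serial versus parallel distinction, the following is a minimal sketch using Go's standard `testing` package. It is not the apmbench code: `handleEvent` and both benchmark functions are hypothetical stand-ins, and the only point is the difference between a plain loop and `b.RunParallel`.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// handleEvent is a hypothetical stand-in for sending one event.
func handleEvent(counter *atomic.Int64) { counter.Add(1) }

// benchmarkSerial issues all b.N events from a single goroutine.
func benchmarkSerial(b *testing.B) {
	var n atomic.Int64
	for i := 0; i < b.N; i++ {
		handleEvent(&n)
	}
}

// benchmarkParallel spreads the b.N events across GOMAXPROCS goroutines.
func benchmarkParallel(b *testing.B) {
	var n atomic.Int64
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			handleEvent(&n)
		}
	})
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	fmt.Println("serial:  ", testing.Benchmark(benchmarkSerial))
	fmt.Println("parallel:", testing.Benchmark(benchmarkParallel))
}
```

A serial benchmark exposes per-request latency in the hot path directly, whereas a parallel one can hide it behind concurrency, which may be one reason the serial variant shows the largest gap.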

@mergify
Contributor

mergify bot commented Jul 12, 2023

This pull request is now in conflicts. Could you fix it @carsonip? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lsm-poc upstream/lsm-poc
git merge upstream/main
git push upstream lsm-poc

@apmmachine
Contributor

apmmachine commented Jul 13, 2023

📚 Go benchmark report

Diff with the main branch

goos: linux
goarch: amd64
pkg: github.com/elastic/apm-server/internal/agentcfg
cpu: 12th Gen Intel(R) Core(TM) i5-12500
                                  │ build/main/bench.out │             bench.out             │
                                  │        sec/op        │   sec/op     vs base              │
geomean                                      68.22n        68.21n       -0.01%

                                  │ build/main/bench.out │             bench.out              │
                                  │         B/op         │    B/op     vs base                │
geomean                                                ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                  │ build/main/bench.out │             bench.out              │
                                  │      allocs/op       │ allocs/op   vs base                │
geomean                                                ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/beater/request
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op     vs base               │
ContextReset/Remote_Addr_ipv6-12                       627.4n ± 32%   758.5n ± 11%  +20.91% (p=0.041 n=6)
ContextReset/Forwarded_ipv4-12                         774.1n ± 32%   579.9n ± 36%  -25.09% (p=0.015 n=6)
ContextResetContentEncoding/empty-12                   135.4n ±  3%   134.5n ±  0%   -0.66% (p=0.002 n=6)
geomean                                                957.0n         921.3n         -3.74%

                                             │ build/main/bench.out │              bench.out               │
                                             │         B/op         │     B/op      vs base                │
geomean                                                           ²                 +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                             │ build/main/bench.out │             bench.out              │
                                             │      allocs/op       │ allocs/op   vs base                │
geomean                                                           ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

pkg: github.com/elastic/apm-server/internal/publish
             │ build/main/bench.out │          bench.out          │
             │        sec/op        │   sec/op    vs base         │

             │ build/main/bench.out │           bench.out            │
             │         B/op         │     B/op       vs base         │

             │ build/main/bench.out │           bench.out           │
             │      allocs/op       │  allocs/op    vs base         │

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics
                 │ build/main/bench.out │          bench.out           │
                 │        sec/op        │   sec/op     vs base         │

                 │ build/main/bench.out │            bench.out            │
                 │         B/op         │     B/op      vs base           │
¹ all samples are equal

                 │ build/main/bench.out │           bench.out           │
                 │      allocs/op       │ allocs/op   vs base           │
¹ all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics
                        │ build/main/bench.out │          bench.out           │
                        │        sec/op        │   sec/op     vs base         │

                        │ build/main/bench.out │           bench.out           │
                        │         B/op         │    B/op     vs base           │
¹ all samples are equal

                        │ build/main/bench.out │           bench.out           │
                        │      allocs/op       │ allocs/op   vs base           │
¹ all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
               │ build/main/bench.out │             bench.out              │
               │        sec/op        │    sec/op     vs base              │
geomean                  419.4n         408.9n        -2.51%

               │ build/main/bench.out │              bench.out               │
               │         B/op         │     B/op      vs base                │
geomean                    382.7          382.0       -0.19%
¹ all samples are equal

               │ build/main/bench.out │             bench.out              │
               │      allocs/op       │ allocs/op   vs base                │
geomean                    3.742        3.742       +0.00%
¹ all samples are equal

pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage
                                             │ build/main/bench.out │              bench.out              │
                                             │        sec/op        │    sec/op      vs base              │
geomean                                               13.05µ          12.78µ         -2.07%

                                             │ build/main/bench.out │               bench.out               │
                                             │         B/op         │     B/op       vs base                │
ReadEvents/proto_codec_big_tx/1000_events-12          2.826Mi ±  0%   2.825Mi ±  0%  -0.04% (p=0.048 n=6)
geomean                                               11.30Ki         11.31Ki        +0.09%
¹ all samples are equal

                                             │ build/main/bench.out │              bench.out              │
                                             │      allocs/op       │  allocs/op   vs base                │
geomean                                                  123.0         123.0       +0.00%
¹ all samples are equal

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

@mergify
Contributor

mergify bot commented Jul 17, 2023

This pull request is now in conflicts. Could you fix it @carsonip? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lsm-poc upstream/lsm-poc
git merge upstream/main
git push upstream lsm-poc

@carsonip carsonip marked this pull request as draft August 7, 2023 19:09
@carsonip carsonip marked this pull request as ready for review August 7, 2023 19:10
Member

@axw axw left a comment

LGTM! Just a few minor things.

systemtest/aggregation_test.go (resolved)
x-pack/apm-server/aggregation/lsm.go (outdated, resolved)
x-pack/apm-server/aggregation/lsm.go (outdated, resolved)
x-pack/apm-server/main.go (outdated, resolved)
x-pack/apm-server/main.go (outdated, resolved)
x-pack/apm-server/aggregation/lsm.go (outdated, resolved)
Comment on lines 26 to 29
==== Aggregation improvements
- Replace aggregation implementation {pull}11117[11117]
- Add `service.language.name` to service destination metrics {pull}11117[11117]
- Modify per-service transaction groups limit to consider more than service.name; Add per-service service destination groups limit and per-service service transaction groups limit {pull}11117[11117]
Member Author

@simitt WDYT about the changelog?

Member

my 2c: I think we should improve the first message to briefly mention what it means for the user. At the very least we should mention the significantly reduced memory cost for aggregation by moving from in-memory to the on-disk LSM based approach.

Member Author

I wanted to do so, but struggled to word it properly 🤦

Member Author

Updated

dev_docs/trace_metrics.md (outdated, resolved)
dev_docs/trace_metrics.md (outdated, resolved)
@carsonip carsonip merged commit 020bc63 into elastic:main Aug 9, 2023
8 checks passed
@kruskall
Member

Tested and everything seems to be working.

I spotted an issue with how we use the vt pool: some objects are not returned to the pool. It shouldn't be a regression, though, since older versions did not use the pool at all. It is something that we can iterate on.
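
As a rough illustration of what "not returning objects to the pool" means, here is a minimal sketch built on the standard `sync.Pool`. It does not reproduce the vtprotobuf-generated code or any apm-server call site: the `metric` type and the helpers `metricFromPool` and `returnToPool` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// metric is a hypothetical pooled message type.
type metric struct {
	name   string
	values []float64
}

// reset clears the message but keeps the slice capacity for reuse.
func (m *metric) reset() {
	m.name = ""
	m.values = m.values[:0]
}

var metricPool = sync.Pool{New: func() any { return new(metric) }}

// metricFromPool fetches a reusable message, allocating only if the pool is empty.
func metricFromPool() *metric { return metricPool.Get().(*metric) }

// returnToPool resets the message and hands it back for reuse.
func (m *metric) returnToPool() {
	m.reset()
	metricPool.Put(m)
}

// encode borrows a message, fills it, and returns it to the pool when done.
func encode(name string, vs ...float64) string {
	m := metricFromPool()
	// Omitting this call is the "not returned" case: the code still works,
	// but every call may allocate a fresh message.
	defer m.returnToPool()
	m.name = name
	m.values = append(m.values, vs...)
	return fmt.Sprintf("%s=%v", m.name, m.values)
}

func main() {
	fmt.Println(encode("latency", 1.5, 2.5))
}
```

Forgetting the return is harmless for correctness, which is consistent with the comment above that it is not a regression, only a missed allocation saving.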

@kruskall kruskall self-assigned this Aug 31, 2023
@carsonip
Member Author

@kruskall do you mind leaving a note about where we did not return vt objects, or alternatively create an issue to do so? I'm very interested to know where we missed it.

@kruskall
Member

Yes, I plan to create follow-up issues and open PRs after making sure that they are correct.

bmorelli25 pushed a commit to bmorelli25/apm-server that referenced this pull request Sep 5, 2023
Replace existing aggregation implementation with apm-aggregation, the LSM-based aggregator.
Labels
backport-skip, test-plan, test-plan-ok, v8.10.0
5 participants