
[8.15] output: Retry document-level 429s by default (backport #13620) #13621

Merged
merged 3 commits into 8.15 from mergify/bp/8.15/pr-13620 on Jul 15, 2024

Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Jul 8, 2024

Motivation/summary

Updates the APM Server to automatically retry document-level 429s from Elasticsearch to avoid dropping data. It can be configured/overwritten by output.elasticsearch.max_retries, and defaults to 3.

It uses the default backoff configuration, which could wait up to 1m if enough retries are configured, but can be overwritten as well.
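For illustration, a minimal sketch of how a doubling backoff capped at 1m plays out per retry attempt; the 1s initial delay and 60s cap are assumed libbeat-style defaults, not something this PR pins down:

```go
package main

import (
	"fmt"
	"time"
)

// backoffFor returns the wait before retry attempt n (1-based): it doubles
// from the initial delay and caps at the maximum, so the 1m ceiling is only
// reached when enough retries are configured.
func backoffFor(attempt int, initial, max time.Duration) time.Duration {
	d := initial << (attempt - 1)
	if d <= 0 || d > max {
		return max
	}
	return d
}

func main() {
	const maxRetries = 3 // mirrors the output.elasticsearch.max_retries default
	for attempt := 1; attempt <= maxRetries; attempt++ {
		fmt.Printf("retry %d waits %s\n", attempt, backoffFor(attempt, time.Second, time.Minute))
	}
}
```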

This reduces the chances of data loss whenever Elasticsearch is overwhelmed and pushing back on ingestion.

Checklist

How to test these changes

This may need a bit of setup. It's thoroughly tested in the go-docappender, but we may want to verify that retries happen when document-level 429s occur.

I'd suggest writing a small "proxy" that returns 429s randomly and test that things are indexed (as duplicates).
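As a starting point, here's a rough sketch of such a proxy (the addresses and the 20% rejection rate are placeholders). Note it rejects whole _bulk requests with 429; exercising true document-level 429s would require rewriting per-item statuses inside a 200 _bulk response, which go-docappender already covers in its own tests:

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Placeholder Elasticsearch address; point the proxy wherever ES runs.
	es, err := url.Parse("http://localhost:9200")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(es)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Randomly reject ~20% of bulk requests to force the client to retry.
		if strings.HasSuffix(r.URL.Path, "/_bulk") && rand.Float64() < 0.2 {
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	// Point output.elasticsearch.hosts at this proxy, e.g. http://localhost:9201.
	log.Fatal(http.ListenAndServe(":9201", nil))
}
```

If retries work, the documents rejected by the proxy should still end up indexed (as duplicates, since IDs are not deterministic).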

Related issues

Closes #13619


This is an automatic backport of pull request #13620 done by Mergify.

Updates the APM Server to automatically retry document-level `429`s from
Elasticsearch to avoid dropping data. It can be configured/overwritten
by `output.elasticsearch.max_retries`, and defaults to `3`.

It uses the default backoff configuration, which could wait up to 1m if
enough retries are configured, but can be overwritten as well.

---------

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
Co-authored-by: Carson Ip <carsonip@users.noreply.github.com>
(cherry picked from commit 28436dc)

# Conflicts:
#	changelogs/head.asciidoc
@mergify mergify bot requested a review from a team as a code owner July 8, 2024 08:57
@mergify mergify bot added the backport and conflicts ("There is a conflict in the backported pull request") labels Jul 8, 2024
@mergify mergify bot assigned marclop Jul 8, 2024
Contributor Author

mergify bot commented Jul 8, 2024

Cherry-pick of 28436dc has failed:

On branch mergify/bp/8.15/pr-13620
Your branch is up to date with 'origin/8.15'.

You are currently cherry-picking commit 28436dc03.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   internal/beater/beater.go
	modified:   internal/beater/beater_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   changelogs/head.asciidoc

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
@marclop marclop removed the conflicts ("There is a conflict in the backported pull request") label Jul 9, 2024
@inge4pres
Contributor

Unfortunately I haven't had the chance to test this out today, nor to investigate further how the change could be validated.

One good thing was mentioned in our weekly by @marclop: the daily benchmarks would give us decent feedback. So one interesting way to see whether the change affects overall performance would be to run the benchmarks on this PR's branch?

// Default to 1mib flushes, which is the default for go-docappender.
esConfig.FlushBytes = "1 mib"
esConfig.FlushInterval = time.Second
esConfig.Config = elasticsearch.DefaultConfig()
esConfig.MaxIdleConnsPerHost = 10
Member


nit: why not move MaxIdleConnsPerHost into the struct initialization like the other fields?

Contributor


Because it's part of Elasticsearch.Config, which is already set with elasticsearch.DefaultConfig().
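To spell out the point, a hypothetical sketch (types simplified, not the real apm-server structs): because the elasticsearch config is embedded and assigned as a whole by DefaultConfig(), its promoted fields such as MaxIdleConnsPerHost can't be listed in the outer composite literal and have to be set afterwards:

```go
package main

import (
	"fmt"
	"time"
)

// elasticsearchConfig stands in for elasticsearch.Config in this sketch.
type elasticsearchConfig struct {
	MaxIdleConnsPerHost int
}

// defaultElasticsearchConfig stands in for elasticsearch.DefaultConfig().
func defaultElasticsearchConfig() *elasticsearchConfig {
	return &elasticsearchConfig{}
}

// outputConfig stands in for the output config struct; the nested
// elasticsearch config is embedded, so its fields are promoted.
type outputConfig struct {
	FlushBytes    string
	FlushInterval time.Duration
	*elasticsearchConfig
}

func main() {
	esConfig := outputConfig{
		FlushBytes:    "1 mib",
		FlushInterval: time.Second,
		// MaxIdleConnsPerHost can't be listed here: promoted fields of an
		// embedded struct aren't valid keys in the outer composite literal.
	}
	esConfig.elasticsearchConfig = defaultElasticsearchConfig()
	esConfig.MaxIdleConnsPerHost = 10 // promoted field, set after the embed exists
	fmt.Println(esConfig.MaxIdleConnsPerHost)
}
```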

@inge4pres
Contributor

I tried running the benchmarks against the smallest instance (1GB) in https://github.com/elastic/apm-server/actions/runs/9872740538

@inge4pres
Contributor

I tried running the benchmarks against the smallest instance

The benchmarks are broken due to Kibana (Kyung Eun had mentioned that in the standup, and I've only now recalled it 😓)

@inge4pres
Contributor

Here's a methodology that I think can get us some feedback.

  1. Create an ESS deployment with the smallest hardware profile (1GB, 1 zone)
  2. Run the current 8.15 BC (or latest SNAPSHOT) without the changes in this PR in the integrations server, and use apmbench to ingest a number of events that will force ES to return 429s; 5-10 k events/sec should likely be enough
  3. Install the apmserver from this PR into the integrations server
  4. Run the same load
  5. Compare the results: APM Server metrics from Cloud dashboards, ES responses and status metrics

The goal is to detect whether ES has higher load or whether the number of total ingested events differs between the 2 versions.
If no significant changes are detected (+/- 0.5 sigma), IMO we can consider the change good to be merged.

Thoughts?

@inge4pres
Copy link
Contributor

Testing methodology:

  • deploy ES to 2 GB, 1 zone / APM to 2 GB
  • run apmbench for roughly 30 minutes
./apmbench -rewrite-ids -rewrite-timestamps -rewrite-span-names -server=https://<id>.apm.us-west2.gcp.elastic-cloud.com:443 -event-rate=5000/s -api-key="<redacted>" -benchtime=4m
  • compare the results with the metrics in the ESS dashboards (APM and ES)

Results

8.15 BC 1

[screenshots: APM Server and ES metrics from the ESS dashboards]

8.15 + this PR

[screenshots: APM Server and ES metrics from the ESS dashboards]

Notes: in the ES screenshots for the 8.15 + this PR case, double the load is being ingested into APM Server.
Inexplicably, I was not able to replicate 429s with the same setup across 2 separate ESS deployments, so I aimed at increasing the load. The APM Server screenshots refer to the same load applied in case 1.

Conclusions: given the double load on ES, the fact that the peak of 429s also doubled is somewhat expected.
Latency responses from ES are tightly correlated with its write thread pool queue: when write threads are queued we get higher latency and ES starts returning 429s.
There was not a significant increase in CPU or memory usage in ES.

APM Server did see an increase in memory usage, with the average going from ~350 MiB to ~500 MiB. I can't explain why this happens.
CPU usage also looks similar, with the exception of the presence of CPU throttling for one data point.

With the data presented, I think there is no significant overhead imposed on APM Server or ES by this PR, so I am inclined to merge it.
Thoughts?

@marclop
Contributor

marclop commented Jul 12, 2024

Looking at the apm-server RSS before vs after, I do not see cause for concern, particularly since the load was doubled in the after benchmark. @inge4pres correct me if I'm wrong, but that seems to be the case.

Before

[screenshot: apm-server RSS before the change]

After

[screenshot: apm-server RSS after the change]

@inge4pres
Contributor

I do not see cause for concern, particularly since the load was doubled in the after benchmark. @inge4pres correct me if I'm wrong, but that seems to be the case.

I agree 👍🏼
We did see an increase in total memory usage, but not in the presence of 429s: the branch with the backport (after) has higher allocated memory overall, but when the 429s are present, there is no significant visible effect on APM Server.

@mergify mergify bot merged commit 008b12a into 8.15 Jul 15, 2024
16 checks passed
@mergify mergify bot deleted the mergify/bp/8.15/pr-13620 branch July 15, 2024 00:10