Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix staleness during target handover #703

Merged
merged 10 commits into from
May 13, 2024

Conversation

thampiotr
Copy link
Contributor

@thampiotr thampiotr commented Apr 29, 2024

PR Description

As described in the required upstream prometheus PR:
grafana/prometheus#34

In certain scenarios, the injection of staleness markers at the end of a scrape loop run is undesired. For example: if a project runs scrape in a cluster where targets are sharded across a number of instances, when a new instance joins a cluster, some portion of targets will be handed over to that new instance. The previous instance should not inject staleness markers in this case or else there may be gaps in metrics.

In this PR, we:

  • update to use the prometheus with the change from above
  • change distributed targets to be able to track which targets have moved from local instance to another remote instance in a cluster
  • disable staleness markers injection for the moved targets
  • add tests to reproduce the issue, test the distributed targets and adding some test helpers

Which issue(s) this PR fixes

Fixes #249
(at least one way in which this symptom can happen)

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@thampiotr thampiotr force-pushed the thampiotr/fix-staleness-during-target-handover branch from 1e65243 to b8c6837 Compare May 3, 2024 12:02
@thampiotr thampiotr changed the title WIP: Fix staleness during target handover Fix staleness during target handover May 3, 2024
@thampiotr thampiotr force-pushed the thampiotr/fix-staleness-during-target-handover branch 2 times, most recently from c931b8d to 27e1b00 Compare May 3, 2024 13:09
@thampiotr thampiotr marked this pull request as ready for review May 7, 2024 11:27
@thampiotr
Copy link
Contributor Author

I propose we use our fork with this change until we have a decision on the issue with the upstream.

Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I have a few questions, but nothing major that would prevent this from moving forward shortly. Nice work with the testing framework!

CHANGELOG.md Outdated Show resolved Hide resolved
internal/component/discovery/distributed_targets.go Outdated Show resolved Hide resolved
internal/util/assertmetrics/assert_metrics.go Show resolved Hide resolved
@thampiotr thampiotr force-pushed the thampiotr/fix-staleness-during-target-handover branch from 5132633 to c596503 Compare May 13, 2024 09:52
@thampiotr thampiotr merged commit 12fb273 into main May 13, 2024
18 checks passed
@thampiotr thampiotr deleted the thampiotr/fix-staleness-during-target-handover branch May 13, 2024 10:10
@@ -0,0 +1,45 @@
package assertmetrics
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this file move to a _test.go file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can depend on _test.go files in tests of other packages... Similar to testappender package.

@rfratto rfratto added the backport-to-agent PR should be backported to the agent repo. label May 14, 2024
erikbaranowski pushed a commit that referenced this pull request May 21, 2024
* Refactor out DistributedTargets and clean up.

* Add MovedAway method to DistributedTargets and rename existing methods.

* distributed_targets.go tests and benchmarks and refactor

* tests and refactor

* fix race condition in pyroscope test

* workaround for data race in prometheus

* changelog

* fix changelog

* feedback

* use a better named prom fork branch

(cherry picked from commit 12fb273)
rfratto added a commit to rfratto/alloy that referenced this pull request May 30, 2024
This includes all bugfixes found in main to date except for grafana#703, which
is a more involved change that should probably wait for a minor release.
rfratto added a commit that referenced this pull request May 30, 2024
This includes all bugfixes found in main to date except for #703, which
is a more involved change that should probably wait for a minor release.
rfratto added a commit to rfratto/alloy that referenced this pull request May 30, 2024
This includes all bugfixes found in main to date except for grafana#703, which
is a more involved change that should probably wait for a minor release.

(cherry picked from commit 3bceb1a)
rfratto added a commit that referenced this pull request May 30, 2024
* Fix panic when component ID contains `/` in `otelcomponent.MustNewType(ID)` (#858)

Signed-off-by: Weifeng Wang <qclaogui@gmail.com>
(cherry picked from commit 7bae89c)

* No error when http fails (#841)

* Fail if we see the port is not available

* Update changelog

* cleanup message

* Update CHANGELOG.md

Co-authored-by: Erik Baranowski <39704712+erikbaranowski@users.noreply.github.com>

---------

Co-authored-by: Erik Baranowski <39704712+erikbaranowski@users.noreply.github.com>
(cherry picked from commit 4ca3f84)

* fix panic loki source docker (#875)

(cherry picked from commit 4fb1df9)

* clustering: fix ipv6 advertise addresses (#869)

Signed-off-by: Matthew Penner <me@matthewp.io>
(cherry picked from commit 3df2cd0)

* otelcol: decouple otel/alloy component IDs (#882)

Signed-off-by: Paschalis Tsilias <paschalist0@gmail.com>
(cherry picked from commit d018e6e)

* updates with latest snowflake prometheus exporter (fixes null issues) (#939)

* updates with latest snowflake prometheus exporter (fixes null issues)

* Update CHANGELOG.md

Co-authored-by: William Dumont <william.dumont@grafana.com>

---------

Co-authored-by: William Dumont <william.dumont@grafana.com>
(cherry picked from commit 551d407)

* feat(vcs): bubble up SSH key conversion error for better debugging experience (#943)

* feat(vcs): bubble up SSH key conversion error for better debugging experience

Signed-off-by: hainenber <dotronghai96@gmail.com>

* chore: refactor code to be more succinct

Signed-off-by: hainenber <dotronghai96@gmail.com>

---------

Signed-off-by: hainenber <dotronghai96@gmail.com>
(cherry picked from commit 037893f)

* prepare changelog for 1.1.1 (#958)

This includes all bugfixes found in main to date except for #703, which
is a more involved change that should probably wait for a minor release.

(cherry picked from commit 3bceb1a)

---------

Co-authored-by: Weifeng Wang <qclaogui@gmail.com>
Co-authored-by: mattdurham <mattdurham@ppog.org>
Co-authored-by: William Dumont <william.dumont@grafana.com>
Co-authored-by: Matthew Penner <me@matthewp.io>
Co-authored-by: Paschalis Tsilias <tpaschalis@users.noreply.github.com>
Co-authored-by: Stefan Kurek <stefan.kurek@observiq.com>
Co-authored-by: Đỗ Trọng Hải <41283691+hainenber@users.noreply.github.com>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport-to-agent PR should be backported to the agent repo. frozen-due-to-age
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clustering: deal with gaps during cluster changes
3 participants