
Confirm span merge span name drop strategy #54

Closed
carsonip opened this issue Jul 31, 2023 · 7 comments · Fixed by #88

@carsonip
Member

carsonip commented Jul 31, 2023

With the existing code in https://github.com/elastic/apm-aggregation/blob/main/aggregators/merger.go#L263, we drop the span name even if the global limit is reached first. Check whether this is desirable.

carsonip changed the title from "Confirm span merge drop name strategy" to "Confirm span merge span name drop strategy" on Jul 31, 2023
@simitt
Contributor

simitt commented Aug 9, 2023

@axw is this something that needs to be fixed before releasing LSM-based aggregation in apm-server?

@axw
Member

axw commented Aug 10, 2023

The behaviour has changed, so yes I think we should make a decision about whether to keep it before releasing 8.10. (I think the new behaviour is probably fine though.)

@carsonip
Member Author

Old behavior in apm-server 8.9 (code)

  • there was only a global limit, no per-service span limit
  • when the number of span groups reaches 50% of the global limit, the span name is dropped
  • for cardinality estimation: the hash of the key with the span name dropped is recorded

Existing apm-aggregation behavior (code)

  • there is a global limit and a per-service span limit
  • when the number of span groups reaches 50% of the per-service limit, the span name is dropped; span name dropping never considers the global limit (see the sketch after this list)
  • for per-service cardinality estimation: the hash of the key with the span name dropped is recorded
  • for overflow-service cardinality estimation: a mix of keys with and without the span name is recorded
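
For illustration, a minimal Go sketch of the drop decision as it stands today. The names and limit values (spanAggregationKey, serviceMetrics, maxSpanGroupsPerService, shouldDropSpanName) are hypothetical placeholders, not the actual identifiers in merger.go; the only point is that the check consults the per-service count and never the global one:

```go
package main

import "fmt"

// Illustrative limits only.
const (
	maxSpanGroups           = 10000 // global limit (never consulted below)
	maxSpanGroupsPerService = 1000  // per-service limit
)

type spanAggregationKey struct {
	SpanName string
	// other dimensions (outcome, target, ...) omitted
}

type serviceMetrics struct {
	spanGroups map[spanAggregationKey]struct{}
}

// shouldDropSpanName mirrors the current behavior: the span name is dropped
// once the service holds half of its per-service limit, regardless of how
// close the global count is to maxSpanGroups.
func shouldDropSpanName(svc serviceMetrics) bool {
	return len(svc.spanGroups) >= maxSpanGroupsPerService/2
}

func main() {
	svc := serviceMetrics{spanGroups: make(map[spanAggregationKey]struct{})}
	key := spanAggregationKey{SpanName: "SELECT FROM users"}
	if shouldDropSpanName(svc) {
		key.SpanName = "" // aggregate into the service's nameless span group
	}
	svc.spanGroups[key] = struct{}{}
	fmt.Println("span groups for this service:", len(svc.spanGroups))
}
```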

Observations:

  • The change in behavior is unavoidable as we introduce the hierarchy and per-service limits. There is now an edge case: if the user has many services, each with span groups that stay well within the per-service limit, those groups can quickly add up and hit the global limit. The span name drop strategy does not guard against this kind of use.
  • Somewhat related to the above point, the overflow-service cardinality estimation may now contain hashes of keys from the above case, while also containing keys without span names from cases where both the global and per-service limits are hit, causing some inconsistency.

Question:

  • Are the above observations concerning? I understand that these are just edge cases.
  • Should we also drop span names when half of the global limit is reached?

@axw

@axw
Member

axw commented Aug 11, 2023

The change in behavior is unavoidable as we introduce the hierarchy and per-service limits. There is now an edge case: if the user has many services, each with span groups that stay well within the per-service limit, those groups can quickly add up and hit the global limit. The span name drop strategy does not guard against this kind of use.

The span name drop strategy (which I regret; I should have introduced a new metric instead) is meant to protect against high-cardinality span names produced by a single service, not against cardinality across all services.

I think the change is fine, and we're still protecting against that. If we were to take the global limit into account, then services producing high-cardinality span names would penalise other services by causing them to drop the span name. I don't think that is desirable.

for per-service cardinality estimation: the hash of the key with the span name dropped is recorded

This seems off. Where is that code? I think we should probably be counting with the span name, so we can see which services are producing high-cardinality span names.

@carsonip
Member Author

This seems off. Where is that code?

It is this line here, which happens only after we modify fromSpan.Key here. It is nothing new, since this apparently comes from the similar-looking code in the 8.9 apm-server aggregation.

I think we should probably be counting with the span name

I agree, that's why I raised the question in the first place.
What we do now in apm-aggregation:

  • keep recording new span aggregation groups with different span names until half the per-service limit is reached
  • after that, when a span comes in with only a new span name, it is recorded in a group with an empty span name; no further new groups are created
  • new span groups are created only when dimensions other than the span name are new
  • then, on overflow, a hash is computed from the span key without the span name

I believe we want to change the last point to always estimate cardinality with the full, unmodified span key (see the sketch below).
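
A minimal sketch of that proposed change, under assumptions: spanKey, estimator, hashKey, and recordOverflow are hypothetical stand-ins (the real code has its own key type, hashing scheme, and cardinality sketch); the only point is that the hash is taken from the unmodified key before the span name is cleared:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type spanKey struct {
	ServiceName string
	SpanName    string
}

// estimator is a simple stand-in for the overflow cardinality estimator.
type estimator map[uint64]struct{}

func (e estimator) InsertHash(h uint64) { e[h] = struct{}{} }
func (e estimator) Estimate() int       { return len(e) }

// hashKey hashes a span aggregation key; stdlib FNV-1a stands in for the
// real hashing scheme.
func hashKey(k spanKey) uint64 {
	h := fnv.New64a()
	h.Write([]byte(k.ServiceName))
	h.Write([]byte{0})
	h.Write([]byte(k.SpanName))
	return h.Sum64()
}

// recordOverflow shows the proposed ordering: hash the full, unmodified key
// into the estimator before the span name is cleared for aggregation, so the
// overflow estimate reflects the cardinality the service actually produced.
func recordOverflow(est estimator, original spanKey) spanKey {
	est.InsertHash(hashKey(original)) // proposed: always use the unmodified key
	original.SpanName = ""            // the key may still be mutated afterwards
	return original
}

func main() {
	est := make(estimator)
	recordOverflow(est, spanKey{ServiceName: "frontend", SpanName: "GET /a"})
	recordOverflow(est, spanKey{ServiceName: "frontend", SpanName: "GET /b"})
	fmt.Println("estimated overflow cardinality:", est.Estimate()) // 2, not 1
}
```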

@axw
Member

axw commented Aug 14, 2023

I believe we want to change the last point to always estimate cardinality with the full, unmodified span key.

@carsonip 👍 agreed. @lahsivjar any concerns with that?

@lahsivjar
Contributor

No concerns. SGTM!
