
Confirm span merge span name drop strategy #54

Closed
carsonip opened this issue Jul 31, 2023 · 7 comments · Fixed by #88

@carsonip
Member

carsonip commented Jul 31, 2023

With the existing code in https://github.com/elastic/apm-aggregation/blob/main/aggregators/merger.go#L263, we drop the span name even if the global limit is reached first. Check whether this is desirable.

carsonip changed the title from "Confirm span merge drop name strategy" to "Confirm span merge span name drop strategy" on Jul 31, 2023
@simitt
Contributor

simitt commented Aug 9, 2023

@axw is this something that needs to be fixed before releasing LSM-based aggregation in apm-server?

@axw
Member

axw commented Aug 10, 2023

The behaviour has changed, so yes I think we should make a decision about whether to keep it before releasing 8.10. (I think the new behaviour is probably fine though.)

@carsonip
Member Author

Old behavior in apm-server 8.9 (code)

  • there was only a global limit, no per-service span limit
  • when the number of span groups reaches 50% of the global limit, the span name is dropped
  • for cardinality estimation: the hash of the key with the span name dropped is recorded

Existing apm-aggregation behavior (code)

  • there is a global limit and a per-service span limit
  • when the number of span groups reaches 50% of the per-service limit, the span name is dropped; span name dropping never considers the global limit (see the sketch after this list)
  • for per-service cardinality estimation: the hash of the key with the span name dropped is recorded
  • for overflow-service cardinality estimation: a mix of keys with and without the span name is recorded
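
For illustration, a minimal Go sketch of the drop decision as it stands today. The names and limit values (spanAggregationKey, serviceMetrics, maxSpanGroupsPerService, shouldDropSpanName) are hypothetical placeholders, not the actual identifiers in merger.go; the only point is that the check consults the per-service count and never the global one:

```go
package main

import "fmt"

// Illustrative limits only.
const (
	maxSpanGroups           = 10000 // global limit (never consulted below)
	maxSpanGroupsPerService = 1000  // per-service limit
)

type spanAggregationKey struct {
	SpanName string
	// other dimensions (outcome, target, ...) omitted
}

type serviceMetrics struct {
	spanGroups map[spanAggregationKey]struct{}
}

// shouldDropSpanName mirrors the current behavior: the span name is dropped
// once the service holds half of its per-service limit, regardless of how
// close the global count is to maxSpanGroups.
func shouldDropSpanName(svc serviceMetrics) bool {
	return len(svc.spanGroups) >= maxSpanGroupsPerService/2
}

func main() {
	svc := serviceMetrics{spanGroups: make(map[spanAggregationKey]struct{})}
	key := spanAggregationKey{SpanName: "SELECT FROM users"}
	if shouldDropSpanName(svc) {
		key.SpanName = "" // aggregate into the service's nameless span group
	}
	svc.spanGroups[key] = struct{}{}
	fmt.Println("span groups for this service:", len(svc.spanGroups))
}
```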

Observations:

  • The change in behavior is unavoidable as we introduce the hierarchy and per-service limits. There is now an edge case: if the user has many services, each with span groups that stay well within the per-service limit, those groups can quickly add up and hit the global limit. The span name drop strategy does not guard against this kind of use.
  • Somewhat related to the above point, the overflow-service cardinality estimation may now contain hashes of keys from the above case, while also containing keys without span names from cases where both the global and per-service limits are hit, causing some inconsistency.

Question:

  • Are the above observations concerning? I understand that these are just edge cases.
  • Should we also drop span names when half of the global limit is reached?

@axw

@axw
Member

axw commented Aug 11, 2023

The change in behavior is unavoidable as we introduce the hierarchy and per-service limits. There is now an edge case: if the user has many services, each with span groups that stay well within the per-service limit, those groups can quickly add up and hit the global limit. The span name drop strategy does not guard against this kind of use.

The span name drop strategy (which I regret; I should have introduced a new metric instead) is meant to protect against high-cardinality span names produced by a single service, not against cardinality across all services.

I think the change is fine, and we're still protecting against that. If we were to take the global limit into account, then services producing high-cardinality span names would penalise other services by causing them to drop the span name. I don't think that is desirable.

for per-service cardinality estimation: the hash of the key with the span name dropped is recorded

This seems off. Where is that code? I think we should probably be counting with the span name, so we can see which services are producing high-cardinality span names.

@carsonip
Member Author

This seems off. Where is that code?

It is this line here, which happens only after we modify fromSpan.Key here. It is nothing new, since this apparently comes from the similar-looking code in the 8.9 apm-server aggregation.

I think we should probably be counting with the span name

I agree, that's why I raised the question in the first place.
What we do now in apm-aggregation:

  • keep recording new span aggregation groups with different span names until half the per-service limit is reached
  • after that, when a span comes in with only a new span name, it is recorded in a group with an empty span name; no further new groups are created
  • new span groups are created only when dimensions other than the span name are new
  • then, on overflow, a hash is computed from the span key without the span name

I believe we want to change the last point to always estimate cardinality with the full, unmodified span key (see the sketch below).
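
A minimal sketch of that proposed change, under assumptions: spanKey, estimator, hashKey, and recordOverflow are hypothetical stand-ins (the real code has its own key type, hashing scheme, and cardinality sketch); the only point is that the hash is taken from the unmodified key before the span name is cleared:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type spanKey struct {
	ServiceName string
	SpanName    string
}

// estimator is a simple stand-in for the overflow cardinality estimator.
type estimator map[uint64]struct{}

func (e estimator) InsertHash(h uint64) { e[h] = struct{}{} }
func (e estimator) Estimate() int       { return len(e) }

// hashKey hashes a span aggregation key; stdlib FNV-1a stands in for the
// real hashing scheme.
func hashKey(k spanKey) uint64 {
	h := fnv.New64a()
	h.Write([]byte(k.ServiceName))
	h.Write([]byte{0})
	h.Write([]byte(k.SpanName))
	return h.Sum64()
}

// recordOverflow shows the proposed ordering: hash the full, unmodified key
// into the estimator before the span name is cleared for aggregation, so the
// overflow estimate reflects the cardinality the service actually produced.
func recordOverflow(est estimator, original spanKey) spanKey {
	est.InsertHash(hashKey(original)) // proposed: always use the unmodified key
	original.SpanName = ""            // the key may still be mutated afterwards
	return original
}

func main() {
	est := make(estimator)
	recordOverflow(est, spanKey{ServiceName: "frontend", SpanName: "GET /a"})
	recordOverflow(est, spanKey{ServiceName: "frontend", SpanName: "GET /b"})
	fmt.Println("estimated overflow cardinality:", est.Estimate()) // 2, not 1
}
```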

@axw
Member

axw commented Aug 14, 2023

I believe we want to change the last point to always estimate cardinality with the full, unmodified span key.

@carsonip 👍 agreed. @lahsivjar any concerns with that?

@lahsivjar
Contributor

No concerns. SGTM!
