Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[processor/tailsampling] add low-frequency spans sampling policy #36487

Closed

Conversation

larsn777
Copy link
Contributor

Description

Hello,
We occasionally encounter the need to sample low-frequency spans. However, as I understand, this type of sampling has not yet been implemented in the tail sampling module.
This PR implements a new sampling policy that will allow us to sample low-frequency spans with a specified level of accuracy.

Main approaches and implementation details:

  1. Span Identifier. To identify the uniqueness or frequency of a specific span within a given set, we need to select a span identifier. I chose the pair of service name + span name as the identifier. In the future, the choice of fields for this identifier could be made configurable within the sampling policy.
  2. Count-Min Sketch (CMS). To count occurrences of unique spans in a data stream, we need to select an appropriate data structure. A naive solution might be a simple hash table, using the span identifier as the key and an occurrence counter as the value. However, with a high variability of span identifiers, such a table could become too large. As an alternative, I chose the Count-Min Sketch data structure, which generally reduces the memory required for counting span identifiers. Additionally, users can adjust the configuration parameters of this policy to find a trade-off between counting accuracy and resource usage.
  3. Sliding CMS Window. Counting low-frequency spans should happen within a limited time frame (e.g., an hour, half-hour, or minute), as otherwise, we would consume significant resources and likely lose accuracy with CMS over time. A simple approach would be to create a new CMS structure at the beginning of each interval, populate it with data, and then start over with a fresh CMS at the beginning of the next interval. However, this would reset counters at each interval start, leading to a surge of sampled spans at the beginning of each interval. To avoid this, we can use a sliding CMS window. The entire interval (window) is divided into N equal sub-intervals, each with its own CMS instance. As the window slides forward by one sub-interval, we delete the CMS instance for the earliest sub-interval and create a new one. To get the count of span occurrences, we sum occurrences across all sub-intervals and add the new occurrence to the CMS for the latest sub-interval. This approach significantly smooths out the number of spans sampled at interval boundaries.
  4. Limits on Processing and Sampling. As mentioned in the previous point, the sliding window smooths out spikes in sampled spans at interval boundaries, but it doesn’t completely eliminate them, as a large portion of unique spans will likely fall into the first sub-interval, causing minor spikes at sub-interval boundaries. To manage boundary values for sub-intervals, we can set a limit on the number of spans processed and sampled per second.

Testing

Almost all the new code is covered by unit tests.

Documentation

A brief description of the new policy has been added to the module's README. Additionally, an example configuration has been included in the README.

@larsn777 larsn777 requested review from jpkrohling and a team as code owners November 21, 2024 15:21
@github-actions github-actions bot added the processor/tailsampling Tail sampling processor label Nov 21, 2024
Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really cool, thank you! I wish this was discussed before and the PR was broken down in smaller pieces, but I had fun reviewing it nonetheless.

Besides the comments I left on specific places, I think this needs more documentation than usual, as it's more complex than the other samplers we have.

I made the following diagram to help me understand how it all works. Perhaps you can take a look and incorporate changes to it, and use it in the documentation for this sampler?

Source: https://excalidraw.com/#json=kPPPacYzLGagKGg-LR2mM,jZgRCMwz1mYkemLAj4dHJw

image


// RareSpansCfg configuration for the rare spans sampler.
type RareSpansCfg struct {
// ErrorProbability error probability (δ, delta in the turms of Count-min sketch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this explanation, but perhaps it can say something like: "A higher value has better performance, a lower value has better accuracy.", and then proceed with the detailed explanation.

It's also good to provide some recommendation for users to get started. Like: "A good initial value for this is 0.01".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should also document the acceptable range: I believe it should be less or equal to 1, any value above that would crash this with the current code. Anything below .0000000001 would also crash.

// accurate estimate;
// - the larger this value, the more memory will be needed to calculate
// the estimate.
TotalFreq float64 `mapstructure:"total_frequency"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above, this needs some guiding values to users. How do I decide what's a good number? Is it based on the throughput I have? How much memory do I have for, say, 1000 (the value used in the example)?

// The lower the value of MaxErrValue, the more accurate the estimate of the
// frequency of each unique span will be. On the other hand, the smaller the
// value, the more memory will be allocated for CMS data structure.
MaxErrValue float64 `mapstructure:"max_error_value"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above: what's a good number, and how should I decide what's best for my use-case?

)

// RingBufferQueue is the ring buffer data structure implementation
type RingBufferQueue[T any] struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be stated in the godoc that the implementation is not thread-safe and concurrent access should be controlled by users.

edit: it should also have more documentation, especially as it deviates from what we'd understand as a regular ring buffer in some aspects (queue/dequeue).

h.startPoint = h.startPoint.Add(h.bucketInterval * (tp.Sub(h.startPoint) / h.bucketInterval))
}

_ = h.cmsBuckets.TailMoveForward()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, the only error that can be returned is when a buffer is full, which should never happen, given the code in the preceding lines. However, if the implementation changes, you probably want to handle or bubble up the error.

}

buckets, err := NewRingBufferQueue[CountMinSketch](emptyCmsBuckets)
_ = buckets.TailMoveForward()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about handling errors

for k := 0; k < rss.Len(); k++ {
keyLen := len(svcName.Str()) + 1 + len(rss.At(k).Name())
if keyLen > spanUniqIDBufferSize {
r.logger.Error("too long span key", zap.Int("key_len", keyLen))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the potential to spam the logs -- I'd add it as debug with both the service name and operation name (for... debugging), and a counter for usage in production.

spsProcessingLimit int64
// spsPrecessed the number of already processed spans for the current
// second.
spsPrecessed int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
spsPrecessed int64
spsProcessed int64

// ShouldBeSampled returns a decision about whether the span should be sampled
// based on its name and the service name.
func (r *RareSpansSampler) ShouldBeSampled(svcName, operationName string) bool {
r.idBuff = r.idBuff[:len(svcName)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"key" is used elsewhere instead of "id" -- I think this could be consistent with that, and be named "keyBuff"

}

func NewSlidingCMSWithStartPoint(bucketsCfg BucketsCfg, cmsCfg CountMinSketchCfg, startTm time.Time) (*SlidingCms, error) {
err := bucketsCfg.Validate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need a Validate for the cmsCfg as well, to check the boundaries for the error parameters. As it is, any value above 1 for the ErrorProbability would cause a panic.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 20, 2024
@mx-psi mx-psi assigned jpkrohling and unassigned mx-psi Dec 20, 2024
@github-actions github-actions bot removed the Stale label Dec 21, 2024
Copy link
Contributor

github-actions bot commented Jan 4, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 4, 2025
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jan 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
processor/tailsampling Tail sampling processor Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants