[processor/tailsampling] add low-frequency spans sampling policy #36487

larsn777 · 2024-11-21T15:21:24Z

Description

Hello,
We occasionally need to sample low-frequency spans. As far as I understand, the tail sampling processor does not implement this kind of sampling yet.
This PR implements a new sampling policy that will allow us to sample low-frequency spans with a specified level of accuracy.

Main approaches and implementation details:

Span Identifier. To identify the uniqueness or frequency of a specific span within a given set, we need to select a span identifier. I chose the pair of service name + span name as the identifier. In the future, the choice of fields for this identifier could be made configurable within the sampling policy.
Count-Min Sketch (CMS). To count occurrences of unique spans in a data stream, we need to select an appropriate data structure. A naive solution might be a simple hash table, using the span identifier as the key and an occurrence counter as the value. However, with a high variability of span identifiers, such a table could become too large. As an alternative, I chose the Count-Min Sketch data structure, which generally reduces the memory required for counting span identifiers. Additionally, users can adjust the configuration parameters of this policy to find a trade-off between counting accuracy and resource usage.
Sliding CMS Window. Counting low-frequency spans should happen within a limited time frame (e.g., an hour, half-hour, or minute), as otherwise, we would consume significant resources and likely lose accuracy with CMS over time. A simple approach would be to create a new CMS structure at the beginning of each interval, populate it with data, and then start over with a fresh CMS at the beginning of the next interval. However, this would reset counters at each interval start, leading to a surge of sampled spans at the beginning of each interval. To avoid this, we can use a sliding CMS window. The entire interval (window) is divided into N equal sub-intervals, each with its own CMS instance. As the window slides forward by one sub-interval, we delete the CMS instance for the earliest sub-interval and create a new one. To get the count of span occurrences, we sum occurrences across all sub-intervals and add the new occurrence to the CMS for the latest sub-interval. This approach significantly smooths out the number of spans sampled at interval boundaries.
Limits on Processing and Sampling. As mentioned in the previous point, the sliding window smooths out spikes in sampled spans at interval boundaries, but it doesn’t eliminate them completely: a large portion of unique spans will likely fall into the first sub-interval, causing minor spikes at sub-interval boundaries. To cap these spikes, we can set a limit on the number of spans processed and sampled per second.
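
To make the Count-Min Sketch and sliding-window points above concrete, here is a minimal sketch of the idea. It is not the code from this PR: the type and function names (`countMinSketch`, `slidingSketch`, `observe`, `slide`) are hypothetical, the FNV-based row hashing is an assumption chosen for determinism, and time-based sliding and rate limiting are left out.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// countMinSketch is a minimal Count-Min Sketch: depth rows of width counters,
// with one salted hash per row. All names here are illustrative.
type countMinSketch struct {
	rows [][]uint32
}

func newCountMinSketch(depth, width int) *countMinSketch {
	c := &countMinSketch{rows: make([][]uint32, depth)}
	for i := range c.rows {
		c.rows[i] = make([]uint32, width)
	}
	return c
}

// index hashes the key with the row number as a salt (an assumption; any
// family of pairwise-independent hashes would do).
func (c *countMinSketch) index(row int, key string) uint64 {
	h := fnv.New64a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(key))
	return h.Sum64() % uint64(len(c.rows[row]))
}

func (c *countMinSketch) add(key string) {
	for i := range c.rows {
		c.rows[i][c.index(i, key)]++
	}
}

// estimate returns the minimum counter across rows. It can overcount on
// hash collisions but never undercounts.
func (c *countMinSketch) estimate(key string) uint32 {
	est := c.rows[0][c.index(0, key)]
	for i := 1; i < len(c.rows); i++ {
		if v := c.rows[i][c.index(i, key)]; v < est {
			est = v
		}
	}
	return est
}

// slidingSketch keeps one sketch per sub-interval; the last bucket is the newest.
type slidingSketch struct {
	buckets      []*countMinSketch
	depth, width int
}

func newSlidingSketch(n, depth, width int) *slidingSketch {
	s := &slidingSketch{depth: depth, width: width}
	for i := 0; i < n; i++ {
		s.buckets = append(s.buckets, newCountMinSketch(depth, width))
	}
	return s
}

// slide drops the oldest sub-interval and opens a fresh one.
func (s *slidingSketch) slide() {
	copy(s.buckets, s.buckets[1:])
	s.buckets[len(s.buckets)-1] = newCountMinSketch(s.depth, s.width)
}

// observe records one occurrence in the newest bucket and returns the
// estimated count over the whole window, including this occurrence.
func (s *slidingSketch) observe(key string) uint32 {
	s.buckets[len(s.buckets)-1].add(key)
	var total uint32
	for _, b := range s.buckets {
		total += b.estimate(key)
	}
	return total
}

func main() {
	s := newSlidingSketch(3, 4, 1024)
	key := "checkout/GET /cart" // service name + span name
	fmt.Println(s.observe(key)) // 1
	fmt.Println(s.observe(key)) // 2
	s.slide()
	fmt.Println(s.observe(key)) // 3: older sub-intervals still count
}
```

A policy built on this would sample a span while `observe` returns a value at or below some rarity threshold. Because only the oldest sub-interval is discarded on each slide, the counters never reset all at once.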

Testing

Almost all the new code is covered by unit tests.

Documentation

A brief description of the new policy has been added to the module's README. Additionally, an example configuration has been included in the README.

jpkrohling

This looks really cool, thank you! I wish this had been discussed beforehand and the PR had been broken down into smaller pieces, but I had fun reviewing it nonetheless.

Besides the comments I left on specific places, I think this needs more documentation than usual, as it's more complex than the other samplers we have.

I made the following diagram to help me understand how it all works. Perhaps you can take a look and incorporate changes to it, and use it in the documentation for this sampler?

Source: https://excalidraw.com/#json=kPPPacYzLGagKGg-LR2mM,jZgRCMwz1mYkemLAj4dHJw

jpkrohling · 2024-12-05T10:51:08Z

processor/tailsamplingprocessor/config.go

+
+// RareSpansCfg configuration for the rare spans sampler.
+type RareSpansCfg struct {
+	// ErrorProbability error probability (δ, delta in the terms of Count-min sketch)


I appreciate this explanation, but perhaps it can say something like: "A higher value has better performance, a lower value has better accuracy.", and then proceed with the detailed explanation.

It's also good to provide some recommendation for users to get started. Like: "A good initial value for this is 0.01".

It should also document the acceptable range: I believe it should be less than or equal to 1, as any value above that would crash with the current code. Anything below .0000000001 would also crash.

jpkrohling · 2024-12-05T10:54:40Z

processor/tailsamplingprocessor/config.go

+	//    accurate estimate;
+	//  - the larger this value, the more memory will be needed to calculate
+	//    the estimate.
+	TotalFreq float64 `mapstructure:"total_frequency"`


Similar to the above, this needs some guiding values for users. How do I decide what's a good number? Is it based on the throughput I have? How much memory do I need for, say, 1000 (the value used in the example)?

jpkrohling · 2024-12-05T10:55:35Z

processor/tailsamplingprocessor/config.go

+	// The lower the value of MaxErrValue, the more accurate the estimate of the
+	// frequency of each unique span will be. On the other hand, the smaller the
+	// value, the more memory will be allocated for CMS data structure.
+	MaxErrValue float64 `mapstructure:"max_error_value"`


Same as above: what's a good number, and how should I decide what's best for my use-case?
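
One way to answer the sizing questions above is to compute the sketch dimensions from the settings. The snippet below is a rough sketch using the textbook Count-Min Sketch bounds (width = ceil(e/ε), depth = ceil(ln(1/δ))). Two things are assumptions, not confirmed by this PR: that ε is derived as `max_error_value / total_frequency`, and that each counter takes 4 bytes. The `max_error_value` of 1 is also just an illustrative value.

```go
package main

import (
	"fmt"
	"math"
)

// cmsDimensions returns the textbook Count-Min Sketch size for a relative
// error epsilon and an error probability delta.
func cmsDimensions(epsilon, delta float64) (width, depth int) {
	width = int(math.Ceil(math.E / epsilon))
	depth = int(math.Ceil(math.Log(1 / delta)))
	return width, depth
}

func main() {
	// Assumption: epsilon = max_error_value / total_frequency.
	totalFreq, maxErr, delta := 1000.0, 1.0, 0.01
	w, d := cmsDimensions(maxErr/totalFreq, delta)

	const counterBytes = 4 // assumption: uint32 counters
	fmt.Printf("width=%d depth=%d ~%d bytes per sketch\n", w, d, w*d*counterBytes)
	// prints: width=2719 depth=5 ~54380 bytes per sketch
}
```

With a sliding window, the total would be this figure multiplied by the number of sub-interval buckets.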

jpkrohling · 2024-12-05T12:29:22Z

processor/tailsamplingprocessor/internal/cms/ring_buffer.go

+)
+
+// RingBufferQueue is the ring buffer data structure implementation
+type RingBufferQueue[T any] struct {


It should be stated in the godoc that the implementation is not thread-safe and concurrent access should be controlled by users.

edit: it should also have more documentation, especially as it deviates from what we'd understand as a regular ring buffer in some aspects (queue/dequeue).
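
As an illustration of the kind of godoc meant here, the following is a hypothetical ring buffer (not the PR's `RingBufferQueue`; `ring`, `push`, `pop` and `guardedRing` are made-up names) whose doc comment states the concurrency contract, plus a caller-side wrapper that serializes access.

```go
package main

import (
	"fmt"
	"sync"
)

// ring is a fixed-capacity FIFO ring buffer.
//
// A ring is not safe for concurrent use. Callers must serialize access,
// for example with a sync.Mutex.
type ring[T any] struct {
	items      []T
	head, size int
}

func newRing[T any](capacity int) *ring[T] {
	return &ring[T]{items: make([]T, capacity)}
}

// push appends v at the tail and reports false when the ring is full.
func (r *ring[T]) push(v T) bool {
	if r.size == len(r.items) {
		return false
	}
	r.items[(r.head+r.size)%len(r.items)] = v
	r.size++
	return true
}

// pop removes and returns the oldest element and reports false when empty.
func (r *ring[T]) pop() (T, bool) {
	var zero T
	if r.size == 0 {
		return zero, false
	}
	v := r.items[r.head]
	r.items[r.head] = zero // release the reference held by the slot
	r.head = (r.head + 1) % len(r.items)
	r.size--
	return v, true
}

// guardedRing shows caller-side locking around the unsynchronized ring.
type guardedRing[T any] struct {
	mu sync.Mutex
	r  *ring[T]
}

func (g *guardedRing[T]) push(v T) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.r.push(v)
}

func main() {
	g := &guardedRing[int]{r: newRing[int](2)}
	fmt.Println(g.push(1), g.push(2), g.push(3)) // true true false
}
```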

jpkrohling · 2024-12-05T12:32:09Z

processor/tailsamplingprocessor/internal/cms/sliding_cms.go

+		h.startPoint = h.startPoint.Add(h.bucketInterval * (tp.Sub(h.startPoint) / h.bucketInterval))
+	}
+
+	_ = h.cmsBuckets.TailMoveForward()


Right now, the only error that can be returned is when a buffer is full, which should never happen, given the code in the preceding lines. However, if the implementation changes, you probably want to handle or bubble up the error.
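
For example, the error could be wrapped and returned instead of discarded. The types below are stand-ins, not the PR's: `buckets`, `tailMoveForward`, `advance` and `errBufferFull` are hypothetical names used only to show the pattern.

```go
package main

import (
	"errors"
	"fmt"
)

// errBufferFull is a hypothetical sentinel for the "buffer is full" case.
var errBufferFull = errors.New("ring buffer is full")

// buckets stands in for the ring buffer of CMS buckets.
type buckets struct{ used, capacity int }

func (b *buckets) tailMoveForward() error {
	if b.used == b.capacity {
		return errBufferFull
	}
	b.used++
	return nil
}

// advance propagates the error with context instead of assigning it to _.
func advance(b *buckets) error {
	if err := b.tailMoveForward(); err != nil {
		return fmt.Errorf("advance sliding window: %w", err)
	}
	return nil
}

func main() {
	b := &buckets{capacity: 1}
	fmt.Println(advance(b)) // <nil>
	fmt.Println(advance(b)) // advance sliding window: ring buffer is full
}
```

Wrapping with `%w` keeps the sentinel reachable through `errors.Is`, so callers can still distinguish the full-buffer case if the implementation later returns other errors.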

jpkrohling · 2024-12-05T13:03:32Z

processor/tailsamplingprocessor/internal/cms/sliding_cms.go

+	}
+
+	buckets, err := NewRingBufferQueue[CountMinSketch](emptyCmsBuckets)
+	_ = buckets.TailMoveForward()


same comment about handling errors

jpkrohling · 2024-12-05T14:11:42Z

processor/tailsamplingprocessor/internal/sampling/rare_spans.go

+			for k := 0; k < rss.Len(); k++ {
+				keyLen := len(svcName.Str()) + 1 + len(rss.At(k).Name())
+				if keyLen > spanUniqIDBufferSize {
+					r.logger.Error("too long span key", zap.Int("key_len", keyLen))


This has the potential to spam the logs -- I'd add it as debug with both the service name and operation name (for... debugging), and a counter for usage in production.
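
A rough sketch of that suggestion, with stand-ins so it stays self-contained: the standard library's `log/slog` replaces the zap logger used in the PR, and a plain atomic counter replaces a real telemetry metric. `keyChecker`, `fits` and the 256-byte limit are hypothetical.

```go
package main

import (
	"fmt"
	"log/slog"
	"os"
	"strings"
	"sync/atomic"
)

// spanKeyBufferSize is a hypothetical limit; the PR's value is not shown here.
const spanKeyBufferSize = 256

type keyChecker struct {
	logger      *slog.Logger
	tooLongKeys atomic.Int64 // stand-in for a production metric
}

func newKeyChecker() *keyChecker {
	// Info level: debug records are dropped unless the level is lowered.
	h := slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelInfo})
	return &keyChecker{logger: slog.New(h)}
}

// fits reports whether the "service + separator + span name" key fits in the
// buffer. Oversized keys are counted and logged at debug level only.
func (c *keyChecker) fits(svcName, spanName string) bool {
	keyLen := len(svcName) + 1 + len(spanName)
	if keyLen <= spanKeyBufferSize {
		return true
	}
	c.tooLongKeys.Add(1)
	c.logger.Debug("span key too long",
		"service", svcName, "span", spanName, "key_len", keyLen)
	return false
}

func main() {
	c := newKeyChecker()
	c.fits("checkout", "GET /cart")
	c.fits("checkout", strings.Repeat("x", 300))
	fmt.Println("too long keys:", c.tooLongKeys.Load()) // too long keys: 1
}
```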

jpkrohling · 2024-12-05T14:12:15Z

processor/tailsamplingprocessor/internal/sampling/rare_spans.go

+	spsProcessingLimit int64
+	// spsPrecessed the number of already processed spans for the current
+	// second.
+	spsPrecessed int64


Suggested change

-	spsPrecessed int64
+	spsProcessed int64

jpkrohling · 2024-12-05T14:13:39Z

processor/tailsamplingprocessor/internal/sampling/rare_spans.go

+// ShouldBeSampled returns a decision about whether the span should be sampled
+// based on its name and the service name.
+func (r *RareSpansSampler) ShouldBeSampled(svcName, operationName string) bool {
+	r.idBuff = r.idBuff[:len(svcName)]


"key" is used elsewhere instead of "id" -- I think this could be consistent with that, and be named "keyBuff"

jpkrohling · 2024-12-05T15:05:35Z

processor/tailsamplingprocessor/internal/cms/sliding_cms.go

+}
+
+func NewSlidingCMSWithStartPoint(bucketsCfg BucketsCfg, cmsCfg CountMinSketchCfg, startTm time.Time) (*SlidingCms, error) {
+	err := bucketsCfg.Validate()


You need a Validate for the cmsCfg as well, to check the boundaries for the error parameters. As it is, any value above 1 for the ErrorProbability would cause a panic.
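
Something along these lines, as a sketch only: `cmsCfg` is a hypothetical stand-in for `CountMinSketchCfg` (its real fields are not shown in this thread), and the bounds are the ones mentioned in the earlier review comment, not values verified against the code.

```go
package main

import "fmt"

// Bounds taken from the review discussion; treat them as assumptions.
const (
	minErrorProbability = 1e-10
	maxErrorProbability = 1.0
)

// cmsCfg is a hypothetical stand-in for the PR's CountMinSketchCfg.
type cmsCfg struct {
	ErrorProbability float64
}

// Validate rejects out-of-range values up front so that construction
// cannot panic later. The negated form also rejects NaN.
func (c cmsCfg) Validate() error {
	p := c.ErrorProbability
	if !(p >= minErrorProbability && p <= maxErrorProbability) {
		return fmt.Errorf("error probability must be in [%g, %g], got %g",
			minErrorProbability, maxErrorProbability, p)
	}
	return nil
}

func main() {
	fmt.Println(cmsCfg{ErrorProbability: 0.01}.Validate()) // <nil>
	fmt.Println(cmsCfg{ErrorProbability: 1.5}.Validate())
}
```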

github-actions · 2024-12-20T05:20:52Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2025-01-04T05:21:00Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2025-01-19T05:20:27Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

[processor/tailsampling] add low-frequency spans sampling policy

8e0beb6

larsn777 requested review from jpkrohling and a team as code owners November 21, 2024 15:21

github-actions bot assigned mx-psi Nov 21, 2024

github-actions bot added the processor/tailsampling Tail sampling processor label Nov 21, 2024

larsn777 added 2 commits November 22, 2024 00:21

[processor/tailsampling] low-frequency spans policy: fix linting

1bb273c

[processor/tailsampling] low-frequency spans policy: fix changelog

1ccb5a6

jpkrohling reviewed Dec 5, 2024

github-actions bot added the Stale label Dec 20, 2024

mx-psi assigned jpkrohling and unassigned mx-psi Dec 20, 2024

github-actions bot removed the Stale label Dec 21, 2024

github-actions bot added the Stale label Jan 4, 2025

github-actions bot closed this Jan 19, 2025
