-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[processor/tailsampling] add low-frequency spans sampling policy #36487
[processor/tailsampling] add low-frequency spans sampling policy #36487
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really cool, thank you! I wish this was discussed before and the PR was broken down in smaller pieces, but I had fun reviewing it nonetheless.
Besides the comments I left on specific places, I think this needs more documentation than usual, as it's more complex than the other samplers we have.
I made the following diagram to help me understand how it all works. Perhaps you can take a look and incorporate changes to it, and use it in the documentation for this sampler?
Source: https://excalidraw.com/#json=kPPPacYzLGagKGg-LR2mM,jZgRCMwz1mYkemLAj4dHJw
|
||
// RareSpansCfg configuration for the rare spans sampler. | ||
type RareSpansCfg struct { | ||
// ErrorProbability error probability (δ, delta in the turms of Count-min sketch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate this explanation, but perhaps it can say something like: "A higher value has better performance, a lower value has better accuracy.", and then proceed with the detailed explanation.
It's also good to provide some recommendation for users to get started. Like: "A good initial value for this is 0.01".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should also document the acceptable range: I believe it should be less or equal to 1, any value above that would crash this with the current code. Anything below .0000000001 would also crash.
// accurate estimate; | ||
// - the larger this value, the more memory will be needed to calculate | ||
// the estimate. | ||
TotalFreq float64 `mapstructure:"total_frequency"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the above, this needs some guiding values to users. How do I decide what's a good number? Is it based on the throughput I have? How much memory do I have for, say, 1000 (the value used in the example)?
// The lower the value of MaxErrValue, the more accurate the estimate of the | ||
// frequency of each unique span will be. On the other hand, the smaller the | ||
// value, the more memory will be allocated for CMS data structure. | ||
MaxErrValue float64 `mapstructure:"max_error_value"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above: what's a good number, and how should I decide what's best for my use-case?
) | ||
|
||
// RingBufferQueue is the ring buffer data structure implementation | ||
type RingBufferQueue[T any] struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be stated in the godoc that the implementation is not thread-safe and concurrent access should be controlled by users.
edit: it should also have more documentation, especially as it deviates from what we'd understand as a regular ring buffer in some aspects (queue/dequeue).
h.startPoint = h.startPoint.Add(h.bucketInterval * (tp.Sub(h.startPoint) / h.bucketInterval)) | ||
} | ||
|
||
_ = h.cmsBuckets.TailMoveForward() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, the only error that can be returned is when a buffer is full, which should never happen, given the code in the preceding lines. However, if the implementation changes, you probably want to handle or bubble up the error.
} | ||
|
||
buckets, err := NewRingBufferQueue[CountMinSketch](emptyCmsBuckets) | ||
_ = buckets.TailMoveForward() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about handling errors
for k := 0; k < rss.Len(); k++ { | ||
keyLen := len(svcName.Str()) + 1 + len(rss.At(k).Name()) | ||
if keyLen > spanUniqIDBufferSize { | ||
r.logger.Error("too long span key", zap.Int("key_len", keyLen)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has the potential to spam the logs -- I'd add it as debug with both the service name and operation name (for... debugging), and a counter for usage in production.
spsProcessingLimit int64 | ||
// spsPrecessed the number of already processed spans for the current | ||
// second. | ||
spsPrecessed int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spsPrecessed int64 | |
spsProcessed int64 |
// ShouldBeSampled returns a decision about whether the span should be sampled | ||
// based on its name and the service name. | ||
func (r *RareSpansSampler) ShouldBeSampled(svcName, operationName string) bool { | ||
r.idBuff = r.idBuff[:len(svcName)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"key" is used elsewhere instead of "id" -- I think this could be consistent with that, and be named "keyBuff"
} | ||
|
||
func NewSlidingCMSWithStartPoint(bucketsCfg BucketsCfg, cmsCfg CountMinSketchCfg, startTm time.Time) (*SlidingCms, error) { | ||
err := bucketsCfg.Validate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need a Validate for the cmsCfg as well, to check the boundaries for the error parameters. As it is, any value above 1 for the ErrorProbability would cause a panic.
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Description
Hello,
We occasionally encounter the need to sample low-frequency spans. However, as I understand, this type of sampling has not yet been implemented in the tail sampling module.
This PR implements a new sampling policy that will allow us to sample low-frequency spans with a specified level of accuracy.
Main approaches and implementation details:
Testing
Almost all the new code is covered by unit tests.
Documentation
A brief description of the new policy has been added to the module's README. Additionally, an example configuration has been included in the README.