PoC Time series #27
Conversation
docs/metrics-data-model.md (outdated)
```go
type Metric struct {
	Name     string     `json:"name"`
	Type     MetricType `json:"type"`
	Contains ValueType  `json:"contains"`
```
---
```go
Contains ValueType `json:"contains"`
```

Do we need `ValueType`? It seems used only from the summary. Should we move the mapping directly there? If not, then I think we should extend the current JS API to support all the possible values (currently, it supports only `Time`).
---
🤔 Good point, but I am not sure how we can remove it. The other state in `Metric` will be moved to the `MetricsEngine`, since only it works with them, but the value type is passed from inside the JS scripts and we have to save it somewhere.

Also, what do you mean by "all the possible values"?
---
I agree, and even in the eventual case where we would refactor, it's better to keep any discussion in a dedicated topic; I think for now it's fine how it is.

> Also, what do you mean by "all the possible values"?

In the current Trend API, you can't define a custom metric with the value type assigned to `Data`; only `Time` is supported. If I'm getting the current API right.
docs/metrics-data-model.md (outdated)
```go
	Name     string     `json:"name"`
	Type     MetricType `json:"type"`
	Contains ValueType  `json:"contains"`
}
```
---
Suggested change:

```diff
+	Observed bool
 }
```
Do we need it with the time series model? Potentially, when the thresholds are parsed we can just initialize a time series, and it will be available even before any Sample has been generated.
---
Thresholds (and the end-of-test summary) can span multiple time series, see my comment at the top. For example, you might have a threshold on `http_req_duration{status:200}`, while your actual time series are `http_req_duration{status:200,url:http://foo,...}`, `http_req_duration{status:200,url:http://bar,...}`, `http_req_duration{status:200,url:http://baz,...}`.

We might not need `Observed`, I don't know, but it doesn't really matter, it will be something local for the `MetricsEngine`.
---
> Thresholds (and the end-of-test summary) can span multiple time series, see my comment at the top. For example, you might have a threshold on `http_req_duration{status:200}`, while your actual time series are `http_req_duration{status:200,url:http://foo,...}`, `http_req_duration{status:200,url:http://bar,...}`, `http_req_duration{status:200,url:http://baz,...}`.
My idea was that the summary and the thresholds should use centralized per-time-series sinks and aggregate them when required. This should be fine for the summary, where we print it just at the end, but in the case of thresholds it will be heavy if we do it for each generated sample.

What is your idea about threshold integration? Should each `Threshold` keep its `Sink` autonomously and make the evaluation on top of it?
> We might not need `Observed`, I don't know, but it doesn't really matter, it will be something local for the `MetricsEngine`.
Ok, so I'm not including it in this document.
---
> What is your idea about threshold integration? Should each `Threshold` keep its `Sink` autonomously and make the evaluation on top of it?
Yeah, this is one of the things that I meant with my earlier comment that time series are orthogonal (i.e. independent, unrelated, etc.) to aggregation. Determining the "group" ("time series") of a metric `Sample` based on its metric name and tag values is one thing. Aggregating groups of metric samples is quite another, and the group won't always be related to the time series.

In the Prometheus outputs (whether remote-write or not), the mapping will probably be 1:1 - you'd have a `Sink` for every time series, and every metric `Sample` will be sent to exactly one `Sink`, right?

However, in the thresholds and the end-of-test summary, the mapping would be n:m. Each threshold definition can span multiple time series, since it is defined only by partial (non-exhaustive) constraints. The example I gave with `http_req_duration{status:200}` can match a request to any URL and method, which may all be in different time series. But you can also have another threshold in the same test that also matches some of the same Samples, e.g. `http_req_duration{method:GET}`, and a third threshold `http_req_duration{status:200,method:GET}` with both constraints 😅 So one metric `Sample` will potentially need to be sent to multiple `Sink`s. And you would not necessarily be able to combine individual per-time-series Sinks back into a single per-threshold sink. In any case, even if you could, it would be quite inefficient to do so every few seconds... 😞 Not to speak about potential future improvements to thresholds which are also going to make things more complicated, e.g. grafana/k6#2379 😅
That said, even if we can't necessarily reuse or strongly tie together time series and metric sample aggregation/sinks, there are still some benefits 🎉 For example:

- We probably could reuse the same `Sink` implementations here, in other similar outputs, and in the `MetricsEngine` (thresholds / end-of-test summary). The grouping will just need to be done differently in both places and decoupled from the aggregation 🤷♂️ In my distributed execution PoC ([WIP] PoC for native distributed execution k6#2438) I used circonusllhist just for expedience, and we still need to do more research and testing about the various tradeoffs of such algorithms (Use HDR histograms for calculating percentiles in thresholds and summary stats k6#763 (comment)), but we probably won't need 2 different ones 🤷♂️ 🤞 🙏 😅
- If every metric `Sample` has a link to its time series, this part of the `MetricsEngine` can be implemented much more efficiently 🎉 Right now it does nested string comparison for every `SubMetric` definition and on every `Sample`... 😞 With time series, the `MetricsEngine` can create a mapping for every new time series it sees to the matching sub-metrics, and reuse that mapping when new samples with the same sub-metric arrive (sketched right after this list).
docs/metrics-data-model.md (outdated)
```markdown
## Known issues

* Name and URL tags: `k6` is tagging HTTP requests with the URL. It will create a high cardinality issue for the time series data model.
```
---
The simplest and most backwards-compatible solution to me seems to be making the `url` tag disabled by default.
---
@na-- or even better, once we have the metadata concept we can just move it there.
---
Yes, that's what I'd like, though there will be some... complications 😞 The `iter`, `vu` (and maybe `ip`) tags should probably also be metadata, not tags, but it's a bit too early to see how all of this will work out.
---
These are mostly my comments on the data model document. I still haven't been able to actually read through and grok the code, sorry. I decided to submit these comments first so we have something to talk about until I review the rest.
docs/metrics-data-model.md (outdated)
```markdown
# k6 Time Series

We would migrate to a time series data model for getting the benefits of aggregated data so just allocating samples one time and pass around the aggregated time series value (or values - e.g. in case of Trend). The Outputs will switch from fetching the buffered samples to fetching a snapshot of the time series (the same concept adopted by Prometheus' scraping operation).
```
---
I am not sure I understand exactly what you mean here, but if I do, I am not sure I agree. I think you might be conflating two different concepts that shouldn't necessarily go together.

To me, grouping metric samples with the same metric name and tags (i.e. a "time series") is somewhat orthogonal to actually aggregating the data of every metric sample (i.e. measurement) that is part of the time series. Grouping measurements in time series is somewhat of a prerequisite to aggregating them efficiently, but they are two different things. And we mostly want the first everywhere in the k6 core, so we can implement the second in some places and in potentially different "flavors".

For example, this output will benefit from both, since Prometheus deals with aggregated data. However, the JSON and CSV outputs certainly won't. The most they will benefit from the grouping of metric samples in time series is to maybe cache some things a bit better. The end-of-test summary will benefit somewhat, but note that its aggregation spans multiple time series.
---
Ok, I will try to rewrite this part. If we exclude the aggregation from the time series concept, what are the other benefits we get?

I see:

- Efficient comparison - we can quickly compare whether a time series matches another just by checking the hash.
- Memory storage - we can reference the hash instead of passing around the entire name+tags, where the tags would end up being copied.
- Post-run analysis - we can count the number of generated time series, so we know the unique combinations.
- Do we have more?

This should also be a quick summary of what 1831 is already documenting.

> For example, this output will benefit from both, since Prometheus deals with aggregated data. However, the JSON and CSV outputs certainly won't. The most they will benefit from the grouping of metric samples in time series is to maybe cache some things a bit better. The end-of-test summary will benefit somewhat, but note that its aggregation spans multiple time series.

Does it mean we will continue to flush all the samples to the outputs, and then each output will sink/aggregate autonomously?
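On the "efficient comparison" point above, a sketch of how such a hash-based time series ID could be computed. The choice of `hash/fnv` from the standard library is just for illustration; the PoC may well use a different hash function:

```go
package metrics

import (
	"hash/fnv"
	"sort"
)

type Tag struct{ Key, Value string }

// timeSeriesID hashes the metric name plus the sorted tag pairs: two samples
// belong to the same time series iff their IDs match, so the comparison is
// O(1) after the one-time hashing cost.
func timeSeriesID(metricName string, tags []Tag) uint64 {
	sort.Slice(tags, func(i, j int) bool { return tags[i].Key < tags[j].Key })
	h := fnv.New64a()
	h.Write([]byte(metricName))
	for _, t := range tags {
		h.Write([]byte{0xff}) // separators avoid ambiguous concatenations
		h.Write([]byte(t.Key))
		h.Write([]byte{0xfe})
		h.Write([]byte(t.Value))
	}
	return h.Sum64()
}
```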
---
> Do we have more?

Mentioned some side-benefits here: #27 (comment)

> Does it mean we will continue to flush all the samples to the outputs then each output will sink/aggregate autonomously?

Yeah, I think so... 🤔 When every metric `Sample` is attached to a time series, at least the aggregation will be quite a bit easier and computationally cheaper. Maybe we can also figure out how to expose Sinks, or some other common aggregation API from the k6 core, so work is not duplicated, but I am not sure if the benefits of that would outweigh the complexity 🤔 In any case, that definitely feels like a v2 kind of thing, we should probably start more simply without it.
---
I have a feeling that we are converging on the idea that we don't need time series, we just want `SampleTags` to be comparable in O(1) time? If you want to call this time series, I am fine with that as well.

And then we will be aggregating in each output? With the possibility of there being some in-core place where we also aggregate them in some manner... possibly in a future version?
---
Kind of... 🤔 If `SampleTags` are comparable in constant time, then they can be used as an index. So that gives us the ability to efficiently build a time series for every unique group of metric tags we want 🤷♂️ But we don't always need the exact same groupings, and we don't always need aggregation. So, by necessity, these things would have to be decoupled.
---
> If SampleTags are comparable in constant time

Just to verify my alignment on this: in the solution in the document we can't really compare `SampleTags` in O(1), we can compare them in O(N). What we can compare in O(1) is the name+`SampleTags`, which in the end will most of the time be the same thing.

But there are exceptions. For example, in the Ingester, the first time it sees a new time series it should create a tree of all the parent time series to know the full set of Sinks to invoke, and we have to check all the single tags in the `SampleTags` to get the right tree. Right?

Based on that, and given that we want distributed sinks, do we really need to involve the name in the hash for now? That way, we could avoid refactoring the signature here https://github.com/grafana/k6/blob/556747fbe585c981507904d634e705c68891e663/metrics/sample.go#L292-L318 mentioned in the other comment. WDYT? (I have the feeling I'm missing the potential limits of this proposal regarding future features.)

In the case of sinks per time series, we could have a nested index (e.g. `map[metricName]map[SampleTagHash]*Sink`).
---
> Just to verify my alignment on this, in the solution in the document we can't really compare SampleTags in O(1), we can compare it in O(N). What we can compare in O(1) is the name+SampleTags that in the end most of the time will be the same thing.

Yeah, sorry, that's what I meant. If the `metric name + tag set`, i.e. the "time series", is comparable in constant time, we can use it as an index.

> But there are exceptions, for example, in the Ingester the first time it sees a new time series it should create a tree of all the parent's time series to know the full set of Sinks to invoke and we have to check all the single tags in the SampleTags for getting the right tree. Right?

Kind of. When the `MetricsEngine`'s ingester sees a `Sample` with a new (for it) time series, it will need to see which of its own sinks for the `Sample`'s `Metric` match the tags of the new time series. And yeah, that operation is not constant, it's actually O(m*n), where m is the number of sub-metrics (sub-metric thresholds) and n is the length of the constraints for each. However, the improvement here is that this matching `Sink` list (not a tree, just a simple slice) can be cached for the next time a metric from the same time series is encountered! Whereas now that comparison is done for every metric `Sample`.

> Based on it and that we want distributed sinks, for now, do we really need to involve the name in the hash? In this way, we could avoid refactoring the signature here https://github.com/grafana/k6/blob/556747fbe585c981507904d634e705c68891e663/metrics/sample.go#L292-L318 mentioned in the other comment. WDYT? (I have the feeling I'm missing the potential limits of this proposal regarding the future features.)
>
> In the case of sinks per time series, we could have a nested index (e.g. `map[metricName]map[SampleTagHash]*Sink`)

Hmm, interesting idea, I am not sure which will be better... 🤔

On the one hand, if we don't refactor it, we might introduce more complexity everywhere else 🤔 We'd need either these nested maps or composite indexes with `(metric_name, tag_set_hash)` tuples everywhere, right? Even if we ignore potential comparison and indexing inefficiencies, that will add development complexity, I think?

On the other hand, if we don't include the metric name in the hash, we'd potentially be able to more easily correlate Samples from different metrics but with the same tags 🤔 Not sure if that would be useful for anything, since we already package related samples in SampleContainers, but 🤷♂️

I think the first (including the metric name in the hash) is the better way to go here, but I am far from sure.
go.mod (outdated)

```diff
@@ -44,6 +44,7 @@ require (
 	github.com/prometheus/client_model v0.2.0 // indirect
 	github.com/prometheus/common/sigv4 v0.1.0 // indirect
 	github.com/prometheus/procfs v0.6.0 // indirect
+	github.com/spenczar/tdigest v2.1.0+incompatible // indirect
```
---
See my comment here: grafana/k6#763 (comment)

I still haven't had time to read the paper, but the YouTube presentation is worth watching. IIRC, there are better implementations than t-digest out there.
---
I already experimented with https://openhistogram.io (circonusllhist). It works better; the Go implementation and the quantile math seem better. The only con I see at the moment is that we can't efficiently get the buckets and their relative counts - we would have to use the function for byte/string serialization of the entire histogram. In any case, I see it as fixable.
---
Again, I am not sure this is the best choice, I just picked it because I was in a hurry. I saw some issues with it during my PoC distributed execution experiments. More R&D is definitely needed... 😅
docs/metrics-data-model.md (outdated)
```markdown
#### Open questions:

* `TimeseriesSample` knows the time series concept and it has the reference to a time series. The Sample is raw data and there isn't yet an identified/matched time series to reference. In this way, the collecting phase doesn't require knowledge from a time series database. It could speed up the implementation and introduction of the first stage, where hopefully we don't touch the sample generation (e.g. all the `k6/js/modules`).
```
---
I think this might not be the best way to go about it 🤔 It would force the calculation of hashes to run in just a single thread. If that was done as soon as a metric is measured in the `js/` package (or elsewhere), that relatively CPU-intensive task would be spread across multiple cores... 🤔

However, if the association with a `TimeSeries` is done there, it will require locking, whereas if it was done centrally from a single thread, it will not require any locking, just a map lookup (or insertion, if the time series didn't exist).

WDYT?
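A sketch of the centralized variant described above: one goroutine exclusively owns the series map, so resolution is a plain lock-free map lookup. Types are simplified; `Tag` and `timeSeriesID` are reused from the hashing sketch earlier:

```go
type TimeSeries struct {
	ID         uint64
	MetricName string
	Tags       []Tag
}

type Sample struct {
	MetricName string
	Tags       []Tag
	Value      float64
	TimeSeries *TimeSeries // nil until resolved centrally
}

// resolver is driven by a single goroutine, so no mutex is needed; the
// tradeoff is that the hashing work is not spread across cores.
type resolver struct {
	series map[uint64]*TimeSeries
}

func (r *resolver) run(in <-chan *Sample, out chan<- *Sample) {
	for s := range in {
		id := timeSeriesID(s.MetricName, s.Tags) // from the earlier sketch
		ts, ok := r.series[id]
		if !ok {
			ts = &TimeSeries{ID: id, MetricName: s.MetricName, Tags: s.Tags}
			r.series[id] = ts
		}
		s.TimeSeries = ts
		out <- s
	}
}
```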
---
@na-- I see the central solution as a more straightforward step for introducing this new model in the core. We can extend the solution with the distributed part in the next step.

I collapsed the two structures into the current Sample, just adding a TimeSeries pointer that I expect the metrics engine or the ingester will resolve. In this way, we can resolve the submetrics/contains issue in the first step, and in a later PR we can drop the Tags field from the Sample structure by distributing the time series generation to the components on the edge.

Do you see any limits to it?
---
> @na-- I see the central solution as a more straightforward step for introducing this new model in the core. We can extend the solution with the distributed part in the next step.

Hmm... maybe? 😅 Isn't the "distributed" part of the time series determination just calculating the hash when the metric sample is assembled? 🤔 That seems like it would be easier to get done at the start, rather than refactoring it later? If it depended only on the tags, it could probably have been done by just modifying things here, but since it also depends on the metric name, it's going to need some more refactoring 😞

edit: or just passing the metric name to those functions, even if it's not saved?
docs/metrics-data-model.md (outdated)
```markdown
#### Thresholds

Thresholds should act as an output, it should work on top of a time series data layer. They get the latest value for the time series and check if the value triggers the threshold based on the defined condition.
```
---
This is how it is already done after my latest refactoring, see how the `MetricsEngine` receives metric samples by just pretending to be another normal `Output`:

https://github.com/grafana/k6/blob/b725f03be5873b9b0d933b94a83924ff7c1f15e4/metrics/engine/ingester.go#L14-L16
https://github.com/grafana/k6/blob/b725f03be5873b9b0d933b94a83924ff7c1f15e4/core/engine.go#L98-L103

But it can't directly work on the aggregated data from time series, since a single threshold spans multiple time series.
I simplified the model by removing the TagSet concept. If we are going to use the TimeSeries, then it should be enough in terms of memory storage, and we can instead benefit from relaxing the system to avoid the synchronization it would introduce.
```diff
 // This approach also allows to avoid hard to replicate issues with duplicate timestamps.

-labels, err := tagsToLabels(sample.Tags, o.config)
+tset, err := o.tagSetFromTags(sample.Tags.CloneTags())
```
---
FWIW, this is a bit inefficient, since it will make a copy of the `map[string]string`. Ideally, all of these things should be done with the minimum amount of copying and allocation required, to reduce GC thrashing. That's only possible on the k6 side, by refactoring the types there directly, so just making a note.
```go
index := sort.Search(len(*set), func(i int) bool {
	return (*set)[i].Key > t.Key
})

if len(*set) == index {
	*set = append(*set, t)
	return
}

*set = append((*set)[:index], append(make([]*Tag, 1), (*set)[index:]...)...)
(*set)[index] = t
```
---
Isn't this going to be much less efficient than adding all of the tags unsorted and then sorting the whole slice once at the end? This will just do a binary search on every insertion and then copy (potentially big) chunks of the slice every time an element is added... 😕
---
Yeah, you're right. TagSet was mostly an experiment that I removed from the document in the end, because I don't think we really need a second layer of aggregation.
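For reference, a sketch of the append-then-sort-once alternative suggested above (names are illustrative):

```go
package metrics

import "sort"

type Tag struct{ Key, Value string }

// buildTagSet collects all tags first and sorts the slice a single time,
// avoiding a binary search plus a potentially large copy on every insertion.
func buildTagSet(tags map[string]string) []Tag {
	set := make([]Tag, 0, len(tags))
	for k, v := range tags {
		set = append(set, Tag{Key: k, Value: v})
	}
	sort.Slice(set, func(i, j int) bool { return set[i].Key < set[j].Key })
	return set
}
```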
docs/metrics-data-model.md (outdated)
```go
type Tag struct {
	Key, Value string
}

type TimeSeries struct {
	ID         uint64 // hash(metric+tags)
	MetricName string
	Tags       []Tag
}

type Sample struct {
	TimeSeries *TimeSeries
	Timestamp  uint64
	Value      float64
}
```
---
The basic underlying structure here looks correct, i.e. every element now has the right constituent parts. However, I am not sure the implementation should use these structs precisely.

For one thing, we should probably try to make a lot of these things read-only once they have been created. Public properties are convenient, but we probably don't want any output or JS extension to be able to modify the internals of a `TimeSeries`, right? So constructors that create objects with unexported properties and read-only getter methods might be the way to go?

Another thing that I noticed is that the tags are a slice. Having them in a sorted slice is necessary to calculate the hash of the TimeSeries, but we should probably also have some index (i.e. a `map`) of their keys, so the rest of the codebase could efficiently look up the value of specific tags?
---
> For one thing, we should probably try to make a lot of these things read-only once they have been created.

A common discussion in the Go world. 😄 I'm not convinced; some questions:

- We will create an inconsistent situation between Metric and TimeSeries. For example, why is it relevant to keep TimeSeries protected but not to protect the Metric from a name change? If you are suggesting doing the same for Metric, then the impact on the source code seems big.
- How should the Get methods work? I expect they have to make a copy of the internal structure. That sounds suboptimal in terms of efficiency.

> but we should probably also have some index

I agree.

> so the rest of the codebase could efficiently lookup the value of specific tags?

Out of curiosity, when do you expect to use it? I expect most of the time `TimeSeries.Tags` access will be used for mapping the entire slice, because the TimeSeries has already been resolved by matching the ID. In that case, the slice iteration would be ok.

I see the index as useful for the `Contains` operation, where we need the pair `key+val` as keys; otherwise we have to access the map by key and make a string comparison for the value. Right?
---
> For example, why is it relevant to keep TimeSeries protected and to not protect the Metric for a name change? If you are suggesting doing the same for Metric then the impact in the source code seems big.

Yes, I want to fix `Metric` too. I don't think the impact will be that big, it should be a fairly simple substitution. Most extensions don't use the metric name directly, they use a pointer to the `Metric` object. And besides, we make no promises about backwards compatibility of the Go APIs.

> How should the Get methods work? I expect they have to make a copy of the internal structure. It sounds not optimal in terms of efficiency.

I guess it depends on the specific thing. For example, for `metrics.Metric`, I expect getter methods that make copies of its individual properties will be enough: `m.GetName()` instead of just `m.Name`, etc. (sketched at the end of this comment). Besides, strings in Go are immutable, so I don't think it even makes an actual copy, it just returns a read-only reference to the original string.

The problem here is that we always get a pointer to a `Metric`; we can't have copies of the whole struct, since the whole point after the recent refactorings is that we only have a single `Metric` object with the same name. So its properties are mutable 😞
> Out of curiosity, when do you expect to use it? I expect most of the time `TimeSeries.Tags` access should be used for mapping the entire slice because a TimeSeries has been already resolved by matching the ID. In that case, the slice iteration would be ok.
>
> I see the index as useful for the `Contains` operation where we need the pair `key+val` as keys otherwise we need to access by key the map but make a string comparison for the value. Right?
Yeah, mostly for `Contains()`, since that is needed by the `MetricsEngine` for sub-metric sinks in the end-of-test summary and thresholds. Though that can probably be implemented internally 🤔 Still, I can imagine some output wanting the value of a specific tag (e.g. `url`, `method`, etc.) to do some pre- or post-processing only on specific raw metric samples 🤷♂️ In general though, it should be possible to efficiently inspect a `TimeSeries` and know what its characteristics are.
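A sketch of the getter approach mentioned above, with an unexported-fields `Metric`. The constructor name and the reduced field set are illustrative, not the final k6 API:

```go
package metrics

type MetricType int
type ValueType int

// Metric keeps its fields unexported so outputs and JS extensions cannot
// mutate them after construction.
type Metric struct {
	name     string
	mType    MetricType
	contains ValueType
}

func NewMetric(name string, t MetricType, c ValueType) *Metric {
	return &Metric{name: name, mType: t, contains: c}
}

// GetName is cheap: Go strings are immutable, so no copy is made.
func (m *Metric) GetName() string        { return m.name }
func (m *Metric) GetType() MetricType    { return m.mType }
func (m *Metric) GetContains() ValueType { return m.contains }
```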
---
> I guess depends on the specific thing.

I was thinking more about `TimeSeries.tags`, where `GetTags() []Tag` should make a copy before returning the slice in order to be "immutable". In this way it seems similar to the current code: we lose the memory-efficiency benefit of having the TimeSeries, where we would allocate the Tags slice once.
---
It's fine to have a `GetTags()` function, since sometimes you just need a copy of the tags 🤷♂️ As long as we don't have to make these copies in the normal flow of metric samples, it should be fine. But you shouldn't need to make a copy of the whole tag set in order to get the value of a specific tag.
---
> since sometimes you just need a copy of the tags

Maybe I'm not considering something, but it's not sometimes, it's every time a `flushMetrics` method is invoked for outputs. We need to map the tags, and we don't know any specific tag; we just have the Sample and the pointer to the TimeSeries. For example, this line would be replaced with the `GetTags()` method, right?
---
Hmm maybe, though it's probably not worth optimizing that particular line too much 🤷♂️ But if we really needed to optimize it, we could cache the CSV representation of these tags for every `TimeSeries` the first time we see it, and reuse that every other time 🙂
docs/metrics-data-model.md (outdated)
```go
	id         uint64 // hash(metric+tags)
	metricName string
	tags       []Tag
	keys       map[string]struct{}
```
---
Suggested change:

```diff
-	keys map[string]struct{}
+	tagKeys map[string]int
```

or something like that, to allow for direct lookup? 🤔
---
Mmm... I'm thinking whether we should do one more step at this point. My idea for the tags map was something like `map[keySepValue]struct{}` for maximum efficiency on `Contains`, but it wouldn't support fetching by `Tag.Key`.

So we could consider this data structure for supporting a single `GetTag(key)`:

```go
tags map[string]struct {
	value   string
	valhash uint64
}
```

In this way, we could support most of the operations with one data structure. The only missing operation is returning the sorted `[]Tag`, for example for `GetTags()`. I see two potential solutions:

- store just the keys in a sorted slice
- add another field like `order`/`index` in the `struct`, so we know in which specific place to set the `Tag` item:

```go
func (ts TimeSeries) GetTags() []Tag {
	s := make([]Tag, len(ts.tags))
	for k, data := range ts.tags {
		s[data.index] = Tag{Key: k, Value: data.value}
	}
	return s
}
```
The list of operations we potentially have to consider in comparisons:

```go
func (TimeSeries) Contains(subset []Tag) bool
func (TimeSeries) HasKey(string) bool
func (TimeSeries) GetTag(string) Tag
func (TimeSeries) GetTags() []Tag
```

@na-- Thoughts?
---
Thinking more about this, it doesn't really work, because on `Contains` we don't have the relative tag's `valhash` to compare. So I think we can just support `GetTag` via binary search 🤔
---
I am not sure I understand all of your concerns, or maybe I'm missing something 😕 Please correct me if I'm wrong, but we basically need to be able to efficiently access the tags both in sorted sequential order and in random-access order (i.e. lookup by their key). So, the basic implementation choices are these:

1. Store the tag keys and values in a sorted list (`[]Tag`) for efficient sequential access, and have an index (`map[string]int`) by key for efficient random access.
2. Store the tag keys and values in a map (`map[string]string`) for efficient random access, and have a sorted list of keys as a `[]string` slice for the sequential access.
3. Something weirder.

The first approach seems preferable to me, since the second one involves an extra map access per tag when iterating across all tags sequentially, and anything else seems like unnecessary over-engineering at this early point (a sketch of the first option follows below) 🤷♂️

Given that these time series objects should be measured in thousands at most, and not millions, I don't see why we'd want anything more complicated than this? You probably won't save all that much memory if you drop the index and use binary search, for example, since I think the strings for the keys will be reused between the slice and the map 🤷♂️

Besides, we might be verging into bike-shedding territory here 🙂 The actual implementation should be internal to the `TimeSeries` object, right? So I don't think it matters very much which implementation you pick; we should be able to change it at any time in the future without anything breaking, right?
This PoC can be considered closed; we achieved our goal of drafting a hypothesis for a good data model. The next implementation should directly use the new types defined in
The code is a PoC based on docs/metrics-data-model.md; more research and polish will be applied once consensus on the design document is reached.
It contains a very basic implementation of tdunning's TDigest based on a 3rd-party library. The math from the library doesn't work, so I will replace it soon.