Custom metric threshold calculation using wrong statistics #2390

Closed
timterrill opened this issue Feb 15, 2022 · 7 comments · Fixed by #2426

@timterrill

Brief summary

When experimenting with the differences between custom metric thresholds and thresholds on tagged requests, I discovered that the custom metric thresholds were not using the custom metric data; they were using the overall statistics from the entire test (http_req_duration).

This caused thresholds for the custom metric to fail when the tagged requests passed. In my test example I have 2 requests, each of which has a unique custom metric and a unique tag:

  • homepage request:
    • custom metric = trend_homepage
    • request tag = name:tag_homepage
  • pi request:
    • custom metric = trend_pi
    • request tag = name:tag_pi

The metrics for the tag and the custom metric should be the same, so the thresholds for each request should both pass or both fail.
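
Roughly, the script looks like the sketch below (simplified; the exact URLs, threshold expressions, and values are my reconstruction from the summary output further down, the full script is in the gist linked under the reproduction steps):

import http from 'k6/http';
import { Trend } from 'k6/metrics';

// Custom Trend metrics, one per request type.
const homepageTrend = new Trend('trend_homepage', true);
const piTrend = new Trend('trend_pi', true);

export const options = {
  thresholds: {
    // Thresholds on the tagged requests...
    'http_req_duration{name:tag_homepage}': ['p(99.9) < 100'],
    'http_req_duration{name:tag_pi}': ['p(99.9) < 100'],
    // ...and on the custom metrics (these are the rows that misbehave).
    'http_req_duration{trend_homepage}': ['p(99.9) < 100'],
    'http_req_duration{trend_pi}': ['p(99.9) < 24'],
  },
};

export default function () {
  const home = http.get('https://test.k6.io/', { tags: { name: 'tag_homepage' } });
  homepageTrend.add(home.timings.duration);

  const pi = http.get('https://test.k6.io/pi.php?decimals=2', { tags: { name: 'tag_pi' } });
  piTrend.add(pi.timings.duration);
}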

k6 version

k6 v0.34.1 ((devel), go1.17, darwin/amd64)

OS

macOS 12.1

Docker version and image (if applicable)

No response

Steps to reproduce the problem

To reproduce this issue,

  1. Download https://gist.github.com/timterrill/e91c36a579916b697cb7af061bbcc432
  2. Run with the following command:

$> k6 run k6-threshold-bug.js

NOTE: Depending on the actual response times of the 2 endpoints, the threshold may not actually fail, but repeated runs should reproduce it.

Expected behaviour

The output I expect from this run is as follows (I modified the 2 rows for { trend_XXXX } to be what I expected):

...

   ✓ http_req_duration..............: count=217 min=18.19ms avg=21.26ms  p(98)=26.92ms  p(99.9)=31.21ms  max=31.69ms
       { expected_response:true }...: count=217 min=18.19ms avg=21.26ms  p(98)=26.92ms  p(99.9)=31.21ms  max=31.69ms
     ✓ { name:tag_homepage }........: count=214 min=18.19ms avg=21.25ms  p(98)=26.96ms  p(99.9)=31.22ms  max=31.69ms
     ✓ { name:tag_pi }..............: count=3   min=20.92ms avg=21.41ms  p(98)=22.09ms  p(99.9)=22.12ms  max=22.12ms
     ✓ { trend_homepage }...........: count=214 min=18.19ms avg=21.25ms  p(98)=26.96ms  p(99.9)=31.22ms  max=31.69ms
     ✓ { trend_pi }.................: count=3   min=20.92ms avg=21.41ms  p(98)=22.09ms  p(99.9)=22.12ms  max=22.12ms

...

     trend_homepage.................: count=214 min=18.19ms avg=21.25ms  p(98)=26.96ms  p(99.9)=31.22ms  max=31.69ms
     trend_pi.......................: count=3   min=20.92ms avg=21.41ms  p(98)=22.09ms  p(99.9)=22.12ms  max=22.12ms

Notice how trend_pi and the threshold rows for trend_pi and name:tag_pi all had 3 requests and the same metrics as the custom metric. The http_req_duration row had 217 requests (which represents all requests for the test run).

Also the name:tag_pi and custom metric trend_pi thresholds BOTH passed (as expected).

Actual behaviour

This is the actual output of my run:

[screenshot: 2022-02-15 k6-bug]

And the part of the summary output to compare against the above section:

ISSUE: The threshold for { trend_pi } failed and both { trend_XXX } rows had the wrong data.

...

   ✓ http_req_duration..............: count=217 min=18.19ms avg=21.26ms  p(98)=26.92ms  p(99.9)=31.21ms  max=31.69ms
       { expected_response:true }...: count=217 min=18.19ms avg=21.26ms  p(98)=26.92ms  p(99.9)=31.21ms  max=31.69ms
     ✓ { name:tag_homepage }........: count=214 min=18.19ms avg=21.25ms  p(98)=26.96ms  p(99.9)=31.22ms  max=31.69ms
     ✓ { name:tag_pi }..............: count=3   min=20.92ms avg=21.41ms  p(98)=22.09ms  p(99.9)=22.12ms  max=22.12ms
     ✓ { trend_homepage }...........: count=217 min=18.19ms avg=21.26ms  p(98)=26.92ms  p(99.9)=31.21ms  max=31.69ms
     ✗ { trend_pi }.................: count=217 min=18.19ms avg=21.26ms  p(98)=26.92ms  p(99.9)=31.21ms  max=31.69ms

...

     trend_homepage.................: count=214 min=18.19ms avg=21.25ms  p(98)=26.96ms  p(99.9)=31.22ms  max=31.69ms
     trend_pi.......................: count=3   min=20.92ms avg=21.41ms  p(98)=22.09ms  p(99.9)=22.12ms  max=22.12ms

Notice how trend_pi had 3 requests, and the threshold row for name:tag_pi had 3 requests and the same metrics as the custom metric did, but the threshold calculation row for trend_pi had 217 requests and the same metrics as http_req_duration had for the entire test run (THIS IS THE ISSUE).

Also, the name:tag_pi threshold passed while the custom metric trend_pi threshold did not. That should be impossible, because the simple code makes a single HTTP request tagged with name:tag_pi, and the custom metric trend_pi just tracks the http_req_duration of that request.

@timterrill timterrill added the bug label Feb 15, 2022
@efdknittlfrank

efdknittlfrank commented Feb 16, 2022

To define a threshold on a custom metric, you have to reference the custom metric itself. A custom metric is a standalone metric, independent from the http_req_duration metric. Filtering a metric on a valueless tag that does not exist seems to be a no-op.

You actually want to define your threshold as:

export const options = {
  thresholds: {
    'trend_homepage': [
      { threshold: 'p(99.99) < 100', abortOnFail: false },
    ],
    'trend_pi': [
      { threshold: 'p(99.99) < 24', abortOnFail: false },
    ],
  },
};

@efdknittlfrank

efdknittlfrank commented Feb 16, 2022

Also note that the tags you apply to your custom metric are rather redundant; they always contain the same value.

import http from 'k6/http';
import { Trend } from 'k6/metrics';

const piTrend = new Trend('trend_pi', true);
function piRequest(data, metric) {
  const name = 'tag_pi';
  const params = { tags: { name } };
  const res = http.request('GET', 'https://test.k6.io/pi.php?decimals=2', {}, params);
  metric.add(res.timings.duration, { requestName: name, metric: 'timings.duration' });
}

Filtering on the tags will output identical metric values for all combinations:

options = {
  thresholds: {
    'trend_pi': [ 'p(99.99) < 100' ],
    'trend_pi{requestName:tag_pi}': [ 'p(99.99) < 100' ],
    'trend_pi{metric:timings.duration}': [ 'p(99.99) < 100' ],
    'trend_pi{requestName:tag_pi,metric:timings.duration}': [ 'p(99.99) < 100' ],
  }
}

@na--
Member

na-- commented Feb 16, 2022

Hmm @efdknittlfrank, while you are correct, I think @timterrill found a k6 bug here.

Notice how trend_pi had 3 requests, and the threshold row for name:tag_pi had 3 requests and the same metrics as the custom metric did, but the threshold calculation row for trend_pi had 217 requests and the same metrics as http_req_duration had for the entire test run (THIS IS THE ISSUE).

Yes, this looks like a problem, though I created a simpler example, since the current one has a lot of (probably unnecessary) complexity. Consider this much simpler test run:

import http from 'k6/http';

export const options = {
  thresholds: {
    'http_reqs': ['count == 2'],
    'http_reqs{key}': ['count == 1'],
    'http_reqs{key:value}': ['count == 1'],
    'http_reqs{nonexistent_key}': ['count == 0'],
  }
}

export default function (data) {
  http.get('https://httpbin.test.k6.io/anything?r=1', { tags: {} })
  http.get('https://httpbin.test.k6.io/anything?r=2', { tags: { 'key': 'value' } })
}

It will result in the following summary (cropped to only the relevant parts):

...
   ✓ http_reqs......................: 2      2.507944/s
     ✓ { key:value }................: 1      1.253972/s
     ✓ { key }......................: 1      1.253972/s
     ✗ { nonexistent_key }..........: 2      2.507944/s
...

My suspicion is that since metric tags are meant to be key:value, having a threshold on http_reqs{nonexistent_key} is the same as having a threshold on http_reqs{nonexistent_key:''}. And since Go maps return an empty string when you access a key that doesn't exist (unless you explicitly check whether the key is missing), this might be the problem, i.e. k6 will basically treat it as if it was a threshold on the whole metric, since that tag is never used... 😞
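
To illustrate the suspicion (a hypothetical Go sketch, not the actual k6 code): a sub-metric match that compares map lookups directly can't distinguish a missing tag from an empty-string value, so every sample ends up matching {nonexistent_key:''}:

package main

import "fmt"

// matches reports whether a sample's tags satisfy a sub-metric's tag filter.
// A lookup on a missing key returns the zero value "", so a filter entry of
// nonexistent_key: "" is satisfied by samples that don't have the tag at all.
func matches(sampleTags, filterTags map[string]string) bool {
    for key, want := range filterTags {
        if sampleTags[key] != want { // missing key yields "", and "" == "" passes
            return false
        }
    }
    return true
}

func main() {
    sample := map[string]string{"key": "value"}
    fmt.Println(matches(sample, map[string]string{"key": "value"}))        // true
    fmt.Println(matches(sample, map[string]string{"nonexistent_key": ""})) // true: the suspected bug
}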

We are currently in the midst of refactoring how the thresholds are evaluated (#2356) by @oleiade, so this might already be fixed, or easier to fix after that PR. In any case, it seems like a valid bug, thank you for reporting it! 🙇

@na-- na-- added the high prio label Feb 16, 2022
@na-- na-- added this to the v0.37.0 milestone Feb 16, 2022
@na--
Member

na-- commented Feb 16, 2022

There is also a similar bug (or bugs) to this in k6 cloud... 😞 Data crunching for thresholds is done somewhat differently there, since potentially multiple instances can generate data. My original example above with the thresholds defined like this:

{
  'http_reqs': ['count == 2'],
  'http_reqs{key}': ['count == 1'],
  'http_reqs{key:value}': ['count == 1'],
  'http_reqs{nonexistent_key}': ['count == 0'],
}

Resulted in the cloud backend not parsing the ones with no key values and reporting just these two thresholds:

[screenshot: t1]

However, when I added : to these thresholds, like this:

thresholds: {
  'http_reqs': ['count == 2'],
  'http_reqs{key:}': ['count == 1'],
  'http_reqs{key:value}': ['count == 1'],
  'http_reqs{nonexistent_key:}': ['count == 0'],
}

I got this, which includes the k6 bug, but also includes a bug for the http_reqs{key:} threshold:

[screenshot: t2]

So yeah, this is probably going to be somewhat complicated to fix fully...

@na-- na-- added the cloud label Feb 16, 2022
@mstoykov
Contributor

@na-- this likely should be (and in that PR already is) treated as a "syntax error"

@na--
Member

na-- commented Feb 16, 2022

@mstoykov, maybe, partially, though that would be kind of a breaking change and deserves its own separate evaluation...

But again, the problem persists even if I use the proper syntax for thresholds that want to check if a metric tag exists but has an empty value... These should be valid thresholds, right?

import http from 'k6/http';

export const options = {
  thresholds: {
    'http_reqs': ['count == 2'],
    'http_reqs{key:""}': ['count == 1'],
    'http_reqs{key:value}': ['count == 1'],
    'http_reqs{nonexistent_key:""}': ['count == 0'],
  }
}

export default function (data) {
  http.get('https://httpbin.test.k6.io/anything?r=1', { tags: {} })
  http.get('https://httpbin.test.k6.io/anything?r=2', { tags: { 'key': 'value' } })
}

And yet I still get the same wrong results:

   ✓ http_reqs......................: 2      0.962599/s
     ✓ { key:"" }...................: 1      0.481299/s
     ✓ { key:value }................: 1      0.481299/s
     ✗ { nonexistent_key:"" }.......: 2      0.962599/s

FWIW, this is the part where threshold definitions like http_reqs{key} are basically parsed as the equivalent of http_reqs{key:}:

k6/stats/stats.go, lines 506 to 515 in c4b88f5:

parts := strings.SplitN(kv, ":", 2)
key := strings.TrimSpace(strings.Trim(parts[0], `"'`))
if len(parts) != 2 {
    tags[key] = ""
    continue
}
value := strings.TrimSpace(strings.Trim(parts[1], `"'`))
tags[key] = value

And IMO that part is somewhat fine... If we ignore the fact that the UX is slightly confusing, I am not sure it should be treated as a syntax error 🤷‍♂️ The cloud treats threshold definitions like http_reqs{key} a bit differently, but that's a somewhat separate issue.

The bigger problem on the k6 OSS side is with the data crunching, when determining whether a certain metric sample is part of a specific threshold sub-metric. That one is apparently completely broken, since it somehow thinks there are 2 HTTP requests with a nonexistent_key:"" tag...

@na-- na-- modified the milestones: v0.37.0, v0.38.0 Feb 16, 2022
@timterrill
Author

Thanks @efdknittlfrank and @na-- for the immediate action on this!

FWIW, I was just testing out the two techniques for setting thresholds and was not going to use both in a given test; I knew it was redundant, but I wondered which way I'd like best.

Thanks for pointing out that I was using the wrong syntax for the trend_XXX custom metric thresholds, and I have fixed and tested that on my end.

I'm glad though that it did identify a bug ;)

My $0.02 on a fix: maybe ignore empty tag values, since I can't imagine how they're useful? If that happens, someone likely made a mistake (like I did), so possibly even generate an error message like "Can't set threshold on an undefined tag-value".
