@@ -303,7 +303,7 @@ spec:
       destination_workload:{{ target }},
       !response_code:404
     }.as_count()
-    /
+    /
     sum:istio.mesh.request.count{
       reporter:destination,
       destination_workload_namespace:{{ namespace }},
@@ -326,6 +326,61 @@ Reference the template in the canary analysis:
       interval: 1m
 ```
 
+### Common Pitfalls for Datadog
+
+The following examples use an ingress-nginx ReplicaSet of three web servers, and a client that constantly
+performs approximately 5 requests per second. Each of the three nginx web servers reports its metrics every 15 seconds.
333+
334+ # ### Pitfall 1: Converting metrics to rates (using `.as_rate()`) can have high sampling noise
335+
336+ Example query `sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate()`
337+ for the past 5 minutes.
338+
339+ 
340+
+Datadog does an automatic rollup (up/downsampling) of a time series, and the time resolution is based on the
+requested interval: the longer the interval, the coarser the time resolution, and vice versa. This means that
+for short intervals, the time resolution of the query response can be higher than the reporting rate of the app,
+leading to a spiky rate graph that oscillates erratically and is nowhere near the real rate.
+
+This is amplified even more when applying e.g. `default_zero()`, which makes Datadog insert zeros for every
+empty time interval in the response.
+
+Example query `default_zero(sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate())`
+for the past 5 minutes.
+
+![spiky rate graph with zeros]()
+
+To overcome this, manually apply a `rollup()` to your query, aggregating over at least one complete reporting
+interval of your application (in this case: 15 seconds).
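+
+As an illustration, a minimal sketch of a `MetricTemplate` that applies such a rollup, reusing the metric and
+tags from the example queries above. The template name, namespace, and the `datadog` secret are illustrative
+assumptions, not taken from the examples:
+
+```yaml
+apiVersion: flagger.app/v1beta1
+kind: MetricTemplate
+metadata:
+  # illustrative name and namespace
+  name: request-rate
+  namespace: ingress-nginx
+spec:
+  provider:
+    type: datadog
+    secretRef:
+      # assumed secret holding the Datadog API and application keys
+      name: datadog
+  # the rollup aggregates over one complete 15s reporting interval,
+  # which avoids the sampling noise described above
+  query: |
+    sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress}.as_rate().rollup(15)
+```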
+
+#### Pitfall 2: Datadog metrics tend to return incomplete (thus usually too small) values for the most recent time intervals
+
+Example query: `sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate().rollup(15)`
+
+![barchart with incomplete last bar]()
+
+The rightmost bar displays a smaller value, because not all targets contributing to the metric have reported
+the most recent time interval yet. In extreme cases, the value will be zero. As time goes by, this bar fills up,
+but the most recent bar(s) are almost always incomplete. Sometimes the Datadog UI shades the last bucket in the
+example as incomplete, but this "incomplete data" information is not part of the returned time series, so Flagger
+cannot know which samples to trust.
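+
+A worked illustration with the numbers from above: a response for the past 2 minutes contains eight 15-second
+buckets of roughly 5 requests/second each, but the rightmost bucket may show, say, only a third of that value,
+simply because just one of the three nginx servers has reported that interval so far.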
+
+#### Recommendations on Datadog metrics evaluations
+
+Flagger queries Datadog for the interval between `analysis.metrics.interval` ago and `now`, and
+then (since release (TODO: unreleased)) takes the **first** sample of the result set. It cannot take the
+last one, because recent samples might be incomplete. So, for an interval of e.g. `2m`, Flagger evaluates
+the value from 2 minutes ago.
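+
+With these numbers: for `interval: 2m` and a `rollup(15)` query, the window spans 120 seconds, i.e. eight
+15-second rollup buckets, and Flagger evaluates the oldest one, a bucket that all three nginx servers have
+long since reported completely.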
+
+- In order to get a result that is not oscillating, apply a rollup of at least the reporting interval of
+  the observed target.
+- In order to get a recent result, use a small interval, but...
+- In order to get a complete result, choose a query interval that contains at least one full rollup window.
+  This should be the case if the interval is at least two times the rollup window.
+- In order to always get a metric result, you can apply functions like `default_zero()`, but you must
+  make sure that receiving a zero does not fail your evaluation (see the sketch after this list).
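+
+Putting these recommendations together, a sketch of a canary metric that follows them, referencing the
+hypothetical `request-rate` template from the Pitfall 1 sketch above (names and the threshold are illustrative):
+
+```yaml
+  analysis:
+    metrics:
+    - name: request-rate
+      templateRef:
+        name: request-rate
+        namespace: ingress-nginx
+      thresholdRange:
+        # no min is set: if the query applied default_zero(), a zero
+        # sample must not fail the evaluation
+        max: 100
+      # 1m covers four complete 15s rollup windows, at least the
+      # recommended two, while still evaluating a recent value
+      interval: 1m
+```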
+
 ## Amazon CloudWatch
 
 You can create custom metric checks using the CloudWatch metrics provider.
@@ -438,11 +493,11 @@ spec:
     secretRef:
       name: newrelic
   query: |
-    SELECT
-      filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
+    SELECT
+      filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
       sum(nginx_ingress_controller_requests) * 100
-    FROM Metric
-    WHERE metricName = 'nginx_ingress_controller_requests'
+    FROM Metric
+    WHERE metricName = 'nginx_ingress_controller_requests'
     AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'
 ```
 
@@ -538,7 +593,7 @@ spec:
 ## Google Cloud Monitoring (Stackdriver)
 
 Enable Workload Identity on your cluster, create a service account key that has read access to the
-Cloud Monitoring API and then create an IAM policy binding between the GCP service account and the Flagger
+Cloud Monitoring API and then create an IAM policy binding between the GCP service account and the Flagger
 service account on Kubernetes. You can take a look at this [guide](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity)
 
 Annotate the flagger service account
@@ -557,7 +612,7 @@ your [service account json](https://cloud.google.com/docs/authentication/product
 kubectl create secret generic gcloud-sa --from-literal=project=<project-id>
 ```
 
-Then reference the secret in the metric template.
+Then reference the secret in the metric template.
 Note: The particular MQL query used here works if [Istio is installed on GKE](https://cloud.google.com/istio/docs/istio-on-gke/installing).
 ```yaml
 apiVersion: flagger.app/v1beta1
@@ -568,7 +623,7 @@ metadata:
 spec:
   provider:
     type: stackdriver
-    secretRef:
+    secretRef:
       name: gcloud-sa
   query: |
     fetch k8s_container
@@ -725,7 +780,7 @@ This will usually be set to the same value as the analysis interval of a `Canary
 Only relevant if the `type` is set to `analysis`.
 * **arguments (optional)**: Arguments to be passed to an `Analysis`.
 Arguments are passed as a list of key value pairs, separated by `;` characters,
-e.g. `foo=bar;bar=foo`.
+e.g. `foo=bar;bar=foo`.
 Only relevant if the `type` is set to `analysis`.
 
 For the type `analysis`, the value returned by the provider is either `0`