Skip to content

Commit

Permalink
Merge pull request #3607 from hhunter-ms/issue_2874
Browse files Browse the repository at this point in the history
Observability docs refresh
  • Loading branch information
hhunter-ms authored Jul 19, 2023
2 parents 375f32b + 221fe8b commit f1fb452
Show file tree
Hide file tree
Showing 8 changed files with 303 additions and 196 deletions.
48 changes: 37 additions & 11 deletions daprdocs/content/en/concepts/observability-concept.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,42 +7,68 @@ description: >
Observe applications through tracing, metrics, logs and health
---

When building an application, understanding how the system is behaving is an important part of operating it - this includes having the ability to observe the internal calls of an application, gauging its performance and becoming aware of problems as soon as they occur. This is challenging for any system, but even more so for a distributed system comprised of multiple microservices where a flow, made of several calls, may start in one microservice but continue in another. Observability is critical in production environments, but also useful during development to understand bottlenecks, improve performance and perform basic debugging across the span of microservices.
When building an application, understanding the system behavior is an important, yet challenging part of operating it, such as:
- Observing the internal calls of an application
- Gauging its performance
- Becoming aware of problems as soon as they occur

While some data points about an application can be gathered from the underlying infrastructure (for example memory consumption, CPU usage), other meaningful information must be collected from an "application-aware" layer–one that can show how an important series of calls is executed across microservices. This usually means a developer must add some code to instrument an application for this purpose. Often, instrumentation code is simply meant to send collected data such as traces and metrics to observability tools or services that can help store, visualize and analyze all this information.
This can be particularly challenging for a distributed system comprised of multiple microservices, where a flow made of several calls may start in one microservice and continue in another.

Having to maintain this code, which is not part of the core logic of the application, is a burden on the developer, sometimes requiring understanding the observability tools' APIs, using additional SDKs etc. This instrumentation may also add to the portability challenges of an application, which may require different instrumentation depending on where the application is deployed. For example, different cloud providers offer different observability tools and an on-premises deployment might require a self-hosted solution.
Observability into your application is critical in production environments, and can be useful during development to:
- Understand bottlenecks
- Improve performance
- Perform basic debugging across the span of microservices

While some data points about an application can be gathered from the underlying infrastructure (memory consumption, CPU usage), other meaningful information must be collected from an "application-aware" layer – one that can show how an important series of calls is executed across microservices. Typically, you'd add some code to instrument an application, which simply sends collected data (such as traces and metrics) to observability tools or services that can help store, visualize, and analyze all this information.

Maintaining this instrumentation code, which is not part of the core logic of the application, requires understanding the observability tools' APIs, using additional SDKs, etc. This instrumentation may also present portability challenges for your application, requiring different instrumentation depending on where the application is deployed. For example:
- Different cloud providers offer different observability tools
- An on-premises deployment might require a self-hosted solution

## Observability for your application with Dapr

When building an application which leverages Dapr API building blocks to perform service-to-service calls and pub/sub messaging, Dapr offers an advantage with respect to [distributed tracing]({{<ref tracing>}}). Because this inter-service communication flows through the Dapr runtime (or "sidecar"), Dapr is in a unique position to offload the burden of application-level instrumentation.
When you leverage Dapr API building blocks to perform service-to-service calls and pub/sub messaging, Dapr offers an advantage with respect to [distributed tracing]({{< ref develop-tracing >}}). Since this inter-service communication flows through the Dapr runtime (or "sidecar"), Dapr is in a unique position to offload the burden of application-level instrumentation.

### Distributed tracing

Dapr can be [configured to emit tracing data]({{<ref setup-tracing.md>}}), and because Dapr does so using the widely adopted protocols of [Open Telemetry (OTEL)](https://opentelemetry.io/) and [Zipkin](https://zipkin.io), it can be easily integrated with multiple observability tools.
Dapr can be [configured to emit tracing data]({{< ref setup-tracing.md >}}) using the widely adopted protocols of [Open Telemetry (OTEL)](https://opentelemetry.io/) and [Zipkin](https://zipkin.io). This makes it easily integrated with multiple observability tools.

<img src="/images/observability-tracing.png" width=1000 alt="Distributed tracing with Dapr">

### Automatic tracing context generation

Dapr uses [W3C tracing]({{<ref w3c-tracing-overview>}}) specification for tracing context, included as part Open Telemetry (OTEL), to generate and propagate the context header for the application or propagate user-provided context headers. This means that you get tracing by default with Dapr.
Dapr uses [W3C tracing]({{< ref w3c-tracing-overview >}}) specification for tracing context, included as part Open Telemetry (OTEL), to generate and propagate the context header for the application or propagate user-provided context headers. This means that you get tracing by default with Dapr.

## Observability for the Dapr sidecar and control plane

You also want to be able to observe Dapr itself, by collecting metrics on performance, throughput and latency and logs emitted by the Dapr sidecar, as well as the Dapr control plane services. Dapr sidecars have a health endpoint that can be probed to indicate their health status.
You can also observe Dapr itself, by:
- Generating logs emitted by the Dapr sidecar and the Dapr control plane services
- Collecting metrics on performance, throughput, and latency
- Using health endpoints probes to indicate the Dapr sidecar health status

<img src="/images/observability-sidecar.png" width=1000 alt="Dapr sidecar metrics, logs and health checks">

### Logging

Dapr generates [logs]({{<ref "logs.md">}}) to provide visibility into sidecar operation and to help users identify issues and perform debugging. Log events contain warning, error, info, and debug messages produced by Dapr system services. Dapr can also be configured to send logs to collectors such as [Fluentd]({{< ref fluentd.md >}}), [Azure Monitor]({{< ref azure-monitor.md >}}), and other observability tools, so that logs can be searched and analyzed to provide insights.
Dapr generates [logs]({{< ref logs.md >}}) to:
- Provide visibility into sidecar operation
- Help users identify issues and perform debugging

Log events contain warning, error, info, and debug messages produced by Dapr system services. You can also configure Dapr to send logs to collectors, such as Open Telemetry Collector, [Fluentd]({{< ref fluentd.md >}}), [New Relic]({{< ref "operations/monitoring/logging/newrelic.md" >}}), [Azure Monitor]({{< ref azure-monitor.md >}}), and other observability tools, so that logs can be searched and analyzed to provide insights.

### Metrics

Metrics are the series of measured values and counts that are collected and stored over time. [Dapr metrics]({{<ref "metrics">}}) provide monitoring capabilities to understand the behavior of the Dapr sidecar and control plane. For example, the metrics between a Dapr sidecar and the user application show call latency, traffic failures, error rates of requests, etc. Dapr [control plane metrics](https://github.com/dapr/dapr/blob/master/docs/development/dapr-metrics.md) show sidecar injection failures and the health of control plane services, including CPU usage, number of actor placements made, etc.
Metrics are a series of measured values and counts collected and stored over time. [Dapr metrics]({{< ref metrics >}}) provide monitoring capabilities to understand the behavior of the Dapr sidecar and control plane. For example, the metrics between a Dapr sidecar and the user application show call latency, traffic failures, error rates of requests, etc.

Dapr [control plane metrics](https://github.com/dapr/dapr/blob/master/docs/development/dapr-metrics.md) show sidecar injection failures and the health of control plane services, including CPU usage, number of actor placements made, etc.

### Health checks

The Dapr sidecar exposes an HTTP endpoint for [health checks]({{<ref sidecar-health.md>}}). With this API, user code or hosting environments can probe the Dapr sidecar to determine its status and identify issues with sidecar readiness.
The Dapr sidecar exposes an HTTP endpoint for [health checks]({{< ref sidecar-health.md >}}). With this API, user code or hosting environments can probe the Dapr sidecar to determine its status and identify issues with sidecar readiness.

Conversely, Dapr can be configured to probe for the [health of your application]({{< ref app-health.md >}}), and react to changes in the app's health, including stopping pub/sub subscriptions and short-circuiting service invocation calls.

## Next steps

Conversely, Dapr can be configured to probe for the [health of your application]({{<ref app-health.md >}}), and react to changes in the app's health, including stopping pub/sub subscriptions and short-circuiting service invocation calls.
- [Learn more about observability in developing with Dapr]({{< ref develop-tracing >}})
- [Learn more about observability in operating with Dapr]({{< ref tracing >}})
Original file line number Diff line number Diff line change
Expand Up @@ -2,41 +2,47 @@
type: docs
title: "App health checks"
linkTitle: "App health checks"
weight: 300
weight: 100
description: Reacting to apps' health status changes
---

App health checks is a feature that allows probing for the health of your application and reacting to status changes.
The app health checks feature allows probing for the health of your application and reacting to status changes.

Applications can become unresponsive for a variety of reasons: for example, they could be too busy to accept new work, could have crashed, or be in a deadlock state. Sometimes the condition can be transitory, for example if the app is just busy (and will eventually be able to resume accepting new work), or if the application is being restarted for whatever reason and is in its initialization phase.
Applications can become unresponsive for a variety of reasons. For example, your application:
- Could be too busy to accept new work;
- Could have crashed; or
- Could be in a deadlock state.

When app health checks are enabled, the Dapr *runtime* (sidecar) periodically polls your application via HTTP or gRPC calls.
Sometimes the condition can be transitory, for example:
- If the app is just busy and will resume accepting new work eventually
- If the application is being restarted for whatever reason and is in its initialization phase

When it detects a failure in the app's health, Dapr stops accepting new work on behalf of the application by:
App health checks are disabled by default. Once you enable app health checks, the Dapr runtime (sidecar) periodically polls your application via HTTP or gRPC calls. When it detects a failure in the app's health, Dapr stops accepting new work on behalf of the application by:

- Unsubscribing from all pub/sub subscriptions
- Stopping all input bindings
- Short-circuiting all service-invocation requests, which terminate in the Dapr runtime and are not forwarded to the application

These changes are meant to be temporary, and Dapr resumes normal operations once it detects that the application is responsive again.

App health checks are disabled by default.

<img src="/images/observability-app-health.webp" width="800" alt="Diagram showing the app health feature. Running Dapr with app health enabled causes Dapr to periodically probe the app for its health.">

### App health checks vs platform-level health checks
## App health checks vs platform-level health checks

App health checks in Dapr are meant to be complementary to, and not replace, any platform-level health checks, like [liveness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) when running on Kubernetes.

Platform-level health checks (or liveness probes) generally ensure that the application is running, and cause the platform to restart the application in case of failures.

Unlike platform-level health checks, Dapr's app health checks focus on pausing work to an application that is currently unable to accept it, but is expected to be able to resume accepting work *eventually*. Goals include:

- Not bringing more load to an application that is already overloaded.
- Do the "polite" thing by not taking messages from queues, bindings, or pub/sub brokers when Dapr knows the application won't be able to process them.

In this regard, Dapr's app health checks are "softer", waiting for an application to be able to process work, rather than terminating the running process in a "hard" way.

> Note that for Kubernetes, a failing App Health check won't remove a pod from service discovery: this remains the responsibility of the Kubernetes liveness probe, _not_ Dapr.
{{% alert title="Note" color="primary" %}}
For Kubernetes, a failing app health check won't remove a pod from service discovery: this remains the responsibility of the Kubernetes liveness probe, _not_ Dapr.
{{% /alert %}}

## Configuring app health checks

Expand All @@ -52,34 +58,46 @@ The full list of options are listed in this table:
| CLI flags | Kubernetes deployment annotation | Description | Default value |
| ----------------------------- | ----------------------------------- | ----------- | ------------- |
| `--enable-app-health-check` | `dapr.io/enable-app-health-check` | Boolean that enables the health checks | Disabled |
| `--app-health-check-path` | `dapr.io/app-health-check-path` | Path that Dapr invokes for health probes when the app channel is HTTP (this value is ignored if the app channel is using gRPC) | `/healthz` |
| `--app-health-probe-interval` | `dapr.io/app-health-probe-interval` | Number of *seconds* between each health probe | `5` |
| `--app-health-probe-timeout` | `dapr.io/app-health-probe-timeout` | Timeout in *milliseconds* for health probe requests | `500` |
| `--app-health-threshold` | `dapr.io/app-health-threshold` | Max number of consecutive failures before the app is considered unhealthy | `3` |
| [`--app-health-check-path`]({{< ref "app-health.md#health-check-paths" >}}) | `dapr.io/app-health-check-path` | Path that Dapr invokes for health probes when the app channel is HTTP (this value is ignored if the app channel is using gRPC) | `/healthz` |
| [`--app-health-probe-interval`]({{< ref "app-health.md#intervals-timeouts-and-thresholds" >}}) | `dapr.io/app-health-probe-interval` | Number of *seconds* between each health probe | `5` |
| [`--app-health-probe-timeout`]({{< ref "app-health.md#intervals-timeouts-and-thresholds" >}}) | `dapr.io/app-health-probe-timeout` | Timeout in *milliseconds* for health probe requests | `500` |
| [`--app-health-threshold`]({{< ref "app-health.md#intervals-timeouts-and-thresholds" >}}) | `dapr.io/app-health-threshold` | Max number of consecutive failures before the app is considered unhealthy | `3` |

> See the [full Dapr arguments and annotations reference]({{<ref arguments-annotations-overview>}}) for all options and how to enable them.
> See the [full Dapr arguments and annotations reference]({{< ref arguments-annotations-overview >}}) for all options and how to enable them.
Additionally, app health checks are impacted by the protocol used for the app channel, which is configured with the `--app-protocol` flag (self-hosted) or the `dapr.io/app-protocol` annotation (Kubernetes); supported values are `http` (default), `grpc`, `https`, `grpcs`, and `h2c` (HTTP/2 Cleartext).
Additionally, app health checks are impacted by the protocol used for the app channel, which is configured with the following flag or annotation:

| CLI flag | Kubernetes deployment annotation | Description | Default value |
| ----------------------------- | ----------------------------------- | ----------- | ------------- |
| [`--app-protocol`]({{< ref "app-health.md#health-check-paths" >}}) | `dapr.io/app-protocol` | Protocol used for the app channel. supported values are `http`, `grpc`, `https`, `grpcs`, and `h2c` (HTTP/2 Cleartext). | `http` |

### Health check paths

#### HTTP
When using HTTP (including `http`, `https`, and `h2c`) for `app-protocol`, Dapr performs health probes by making an HTTP call to the path specified in `app-health-check-path`, which is `/health` by default.

For your app to be considered healthy, the response must have an HTTP status code in the 200-299 range. Any other status code is considered a failure. Dapr is only concerned with the status code of the response, and ignores any response header or body.

#### gRPC
When using gRPC for the app channel (`app-protocol` set to `grpc` or `grpcs`), Dapr invokes the method `/dapr.proto.runtime.v1.AppCallbackHealthCheck/HealthCheck` in your application. Most likely, you will use a Dapr SDK to implement the handler for this method.

While responding to a health probe request, your app *may* decide to perform additional internal health checks to determine if it's ready to process work from the Dapr runtime. However, this is not required; it's a choice that depends on your application's needs.

### Intervals, timeouts, and thresholds

When app health checks are enabled, by default Dapr probes your application every 5 seconds. You can configure the interval, in seconds, with `app-health-probe-interval`. These probes happen regularly, regardless of whether your application is healthy or not.
#### Intervals
By default, when app health checks are enabled, Dapr probes your application every 5 seconds. You can configure the interval, in seconds, with `app-health-probe-interval`. These probes happen regularly, regardless of whether your application is healthy or not.

#### Timeouts
When the Dapr runtime (sidecar) is initially started, Dapr waits for a successful health probe before considering the app healthy. This means that pub/sub subscriptions, input bindings, and service invocation requests won't be enabled for your application until this first health check is complete and successful.

Health probe requests are considered successful if the application sends a successful response (as explained above) within the timeout configured in `app-health-probe-timeout`. The default value is 500, corresponding to 500 milliseconds (i.e. half a second).
Health probe requests are considered successful if the application sends a successful response (as explained above) within the timeout configured in `app-health-probe-timeout`. The default value is 500, corresponding to 500 milliseconds (half a second).

#### Thresholds
Before Dapr considers an app to have entered an unhealthy state, it will wait for `app-health-threshold` consecutive failures, whose default value is 3. This default value means that your application must fail health probes 3 times *in a row* to be considered unhealthy.

If you set the threshold to 1, any failure causes Dapr to assume your app is unhealthy and will stop delivering work to it.

A threshold greater than 1 can help exclude transient failures due to external circumstances. The right value for your application depends on your requirements.

Thresholds only apply to failures. A single successful response is enough for Dapr to consider your app to be healthy and resume normal operations.
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
---
type: docs
title: "Tracing"
linkTitle: "Tracing"
weight: 300
description: Learn more about tracing scenarios and how to use tracing for visibility in your application
---
Loading

0 comments on commit f1fb452

Please sign in to comment.