Commit 1524c32

Add component status reporting document
1 parent aa31b27 commit 1524c32

4 files changed

+84
-0
lines changed

docs/component-status.md

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
# Component Status Reporting

Component status reporting is a collector feature that allows components to report their status (also known as health) to extensions via status events. For an extension to receive these events, it must implement the [StatusWatcher interface](https://github.com/open-telemetry/opentelemetry-collector/blob/f05f556780632d12ef7dbf0656534d771210aa1f/extension/extension.go#L54-L63).
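
As a rough illustration, an extension that wants to receive status events could look like the sketch below. `InstanceID`, `StatusEvent`, and `StatusWatcher` are simplified local stand-ins (assumptions made so the snippet compiles on its own); the method name and signature are modeled on the linked interface and should be checked against it. The `healthExtension` type is hypothetical.

```go
package main

import "fmt"

// InstanceID and StatusEvent are simplified stand-ins for the collector's
// types; they exist only so this sketch is self-contained.
type InstanceID struct{ Name string }

type StatusEvent struct{ Status string }

// StatusWatcher is a local stand-in modeled on the linked interface: a single
// callback that receives the reporting component's ID and its status event.
type StatusWatcher interface {
	ComponentStatusChanged(source *InstanceID, event *StatusEvent)
}

// healthExtension is a hypothetical extension that remembers the latest
// status it has seen for each component.
type healthExtension struct {
	latest map[string]string
}

func newHealthExtension() *healthExtension {
	return &healthExtension{latest: make(map[string]string)}
}

// ComponentStatusChanged records the most recent status per component name.
func (h *healthExtension) ComponentStatusChanged(source *InstanceID, event *StatusEvent) {
	h.latest[source.Name] = event.Status
}

// Compile-time check that healthExtension satisfies StatusWatcher.
var _ StatusWatcher = (*healthExtension)(nil)

func main() {
	ext := newHealthExtension()
	ext.ComponentStatusChanged(&InstanceID{Name: "otlp"}, &StatusEvent{Status: "Starting"})
	ext.ComponentStatusChanged(&InstanceID{Name: "otlp"}, &StatusEvent{Status: "OK"})
	fmt.Println(ext.latest["otlp"])
}
```
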

### Status Definitions

The system defines six statuses, listed in the table below:

| Status           | Meaning |
| ---------------- | ------- |
| Starting         | The component is starting. |
| OK               | The component is running without issue. |
| RecoverableError | The component has experienced a transient error and may recover. |
| PermanentError   | The component has detected a condition at runtime that will need human intervention to fix. The collector will continue to run in a degraded mode. |
| Stopping         | The component is in the process of shutting down. |
| Stopped          | The component has completed shutdown. |

Statuses can be categorized into two groups: lifecycle and runtime.

**Lifecycle Statuses**
- Starting
- Stopping
- Stopped

**Runtime Statuses**
- OK
- RecoverableError
- PermanentError
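
The grouping above can be expressed in code roughly as follows. This is an illustrative enumeration only: the `Status` type, its constant names, and the `IsLifecycle`/`IsRuntime` helpers are assumptions for this sketch, not the collector's actual definitions.

```go
package main

import "fmt"

// Status is an illustrative enumeration of the six statuses in the table.
type Status int

const (
	StatusStarting Status = iota
	StatusOK
	StatusRecoverableError
	StatusPermanentError
	StatusStopping
	StatusStopped
)

// String returns the name used in the table above.
func (s Status) String() string {
	names := [...]string{"Starting", "OK", "RecoverableError", "PermanentError", "Stopping", "Stopped"}
	if s < 0 || int(s) >= len(names) {
		return "Unknown"
	}
	return names[s]
}

// IsLifecycle reports whether s belongs to the lifecycle group.
func (s Status) IsLifecycle() bool {
	switch s {
	case StatusStarting, StatusStopping, StatusStopped:
		return true
	}
	return false
}

// IsRuntime reports whether s belongs to the runtime group.
func (s Status) IsRuntime() bool {
	switch s {
	case StatusOK, StatusRecoverableError, StatusPermanentError:
		return true
	}
	return false
}

func main() {
	for s := StatusStarting; s <= StatusStopped; s++ {
		fmt.Printf("%-16s lifecycle=%t runtime=%t\n", s, s.IsLifecycle(), s.IsRuntime())
	}
}
```
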

### Transitioning Between Statuses

A finite state machine underlies the status reporting API and governs the allowable state transitions. See the state diagram below:

![State Diagram](img/component-status-state-diagram.png)

The finite state machine ensures that components progress through the lifecycle properly, and it manages transitions through runtime states so that components do not need to track their state internally. Only changes in status generate new events; repeat reports of the same status are ignored. PermanentError is a permanent runtime state: a component in this state can transition to Stopping, but not to OK or RecoverableError.

![Status Event Generation](img/component-status-event-generation.png)
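
A minimal sketch of such a state machine is shown below. The transition table is an assumption inferred from the prose in this document (the state diagram is the authoritative source), and the `None` initial state, the `fsm` type, and `newFSM` are hypothetical names for this example.

```go
package main

import "fmt"

type Status string

const (
	None             Status = "None" // assumed initial state before any report
	Starting         Status = "Starting"
	OK               Status = "OK"
	RecoverableError Status = "RecoverableError"
	PermanentError   Status = "PermanentError"
	Stopping         Status = "Stopping"
	Stopped          Status = "Stopped"
)

// transitions lists the allowed "from -> to" moves. It is inferred from the
// text of this document and may not match the real diagram edge for edge.
var transitions = map[Status]map[Status]bool{
	None:             {Starting: true},
	Starting:         {OK: true, RecoverableError: true, PermanentError: true, Stopping: true},
	OK:               {RecoverableError: true, PermanentError: true, Stopping: true},
	RecoverableError: {OK: true, PermanentError: true, Stopping: true},
	PermanentError:   {Stopping: true},
	Stopping:         {PermanentError: true, Stopped: true},
	Stopped:          {},
}

// fsm tracks the current status and emits an event only on a valid change.
type fsm struct {
	current Status
	onEvent func(Status)
}

func newFSM(onEvent func(Status)) *fsm {
	return &fsm{current: None, onEvent: onEvent}
}

// Report attempts a transition. Repeat reports and invalid transitions are
// ignored; the return value says whether an event was emitted.
func (f *fsm) Report(next Status) bool {
	if next == f.current || !transitions[f.current][next] {
		return false
	}
	f.current = next
	f.onEvent(next)
	return true
}

func main() {
	f := newFSM(func(s Status) { fmt.Println("event:", s) })
	reports := []Status{Starting, OK, OK, RecoverableError, OK, PermanentError, OK, Stopping, Stopped}
	for _, s := range reports {
		f.Report(s)
	}
}
```

In this sketch the second `OK` report and the `OK` report that follows `PermanentError` produce no events, which matches the behavior described above.
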
40+
41+
### Automation
42+
43+
The collector is responsible for starting and stopping components. Since it knows when these events occur and their outcomes, it can automate status reporting of lifecycle events for components.
44+
45+
**Start**
46+
47+
The collector will report a Starting event when starting a component. If an error is returned from Start, the collector will report a PermanentError event. If start returns without an error and the collector hasn't reported status itself, the collector will report an OK event.
48+
49+
**Shutdown**
50+
51+
The collector will report a Stopping event when shutting down a component. If Shutdown returns an error, the collector will report a PermanentError event. If Shutdown completes without an error, the collector will report a Stopped event.
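
The automated reporting described above can be sketched as a pair of wrappers around Start and Shutdown. Everything here is a simplified assumption for illustration: the `Component` interface omits the context and host arguments that real components receive, and `reporter`, `startComponent`, `shutdownComponent`, and `fakeComponent` are hypothetical names, not collector APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// Component is a pared-down stand-in for a collector component.
type Component interface {
	Start() error
	Shutdown() error
}

// reporter is a hypothetical stand-in that records status changes.
type reporter struct {
	current string
	events  []string
}

// report records a status only when it differs from the current one.
func (r *reporter) report(status string) {
	if status == r.current {
		return
	}
	r.current = status
	r.events = append(r.events, status)
}

// reportOKIfStarting reports OK only if the component has not already
// reported a status of its own while starting.
func (r *reporter) reportOKIfStarting() {
	if r.current == "Starting" {
		r.report("OK")
	}
}

func startComponent(c Component, r *reporter) error {
	r.report("Starting")
	if err := c.Start(); err != nil {
		r.report("PermanentError")
		return err
	}
	r.reportOKIfStarting()
	return nil
}

func shutdownComponent(c Component, r *reporter) error {
	r.report("Stopping")
	if err := c.Shutdown(); err != nil {
		r.report("PermanentError")
		return err
	}
	r.report("Stopped")
	return nil
}

// fakeComponent returns preconfigured errors from Start and Shutdown.
type fakeComponent struct{ startErr, shutdownErr error }

func (f fakeComponent) Start() error    { return f.startErr }
func (f fakeComponent) Shutdown() error { return f.shutdownErr }

var errBoom = errors.New("port in use")

func main() {
	healthy := &reporter{}
	_ = startComponent(fakeComponent{}, healthy)
	_ = shutdownComponent(fakeComponent{}, healthy)
	fmt.Println(healthy.events)

	broken := &reporter{}
	_ = startComponent(fakeComponent{startErr: errBoom}, broken)
	fmt.Println(broken.events)
}
```
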
52+
53+
### Best Practices
54+
55+
**Start**
56+
57+
Under most circumstances, a component does not need to report explicit status during component.Start. An exception to this rule is components that start async work (e.g. spawn a go routine). This is because async work may or may not complete before start returns and timing can vary between executions. A component can halt startup by returning an error from start. If start returns an error, automated status reporting will report a PermanentError on behalf of the component. If start returns without an error automated status reporting will report OK, so long has the component hasn't already reported for itself.
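
The async exception might look like the following sketch, in which Start returns immediately and the background work reports its own outcome. The `reportFunc` type, the `asyncReceiver` type, and its `dial` field are hypothetical stand-ins; a real component would use the ReportStatus function it receives from the collector and the real Start signature.

```go
package main

import (
	"errors"
	"fmt"
)

// reportFunc is a stand-in for the status reporting function handed to a
// component; the signature is simplified for this sketch.
type reportFunc func(status string, err error)

// asyncReceiver is a hypothetical component whose connection attempt runs in
// a goroutine started by Start.
type asyncReceiver struct {
	report reportFunc
	dial   func() error  // hypothetical connection attempt
	done   chan struct{} // closed when the background work finishes
}

// Start returns right away. Because the outcome of dial is not known when
// Start returns, the goroutine reports status explicitly.
func (r *asyncReceiver) Start() error {
	r.done = make(chan struct{})
	go func() {
		defer close(r.done)
		if err := r.dial(); err != nil {
			r.report("RecoverableError", err)
			return
		}
		r.report("OK", nil)
	}()
	return nil
}

var errDial = errors.New("backend unreachable")

func main() {
	r := &asyncReceiver{
		report: func(status string, err error) { fmt.Println(status, err) },
		dial:   func() error { return errDial },
	}
	_ = r.Start()
	<-r.done // wait for the background work so the program does not exit early
}
```

Whether the goroutine's report lands before or after Start returns, the outcome is consistent: repeat OK reports are ignored, and an error reported early prevents the automated OK.
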

**Runtime**

![Runtime State Diagram](img/component-status-runtime-states.png)

During runtime, a component should not have to keep track of its state. A component should report status as operations succeed or fail, and the finite state machine will handle the rest. Changes in status result in new status events being emitted. Repeat reports of the same status are no-ops. Similarly, attempts to make an invalid state transition, such as PermanentError to OK, have no effect.

We intend to define guidelines to help component authors distinguish between recoverable and permanent errors on a per-component-type basis, and we'll update this document as we make decisions. See [this issue](https://github.com/open-telemetry/opentelemetry-collector/issues/9957) for current thoughts and discussions.

**Shutdown**

A component should never have to report explicit status during shutdown. Automated status reporting should handle all cases. To recap, the collector will report Stopping before Shutdown is called. If a component returns an error from Shutdown, the collector will report a PermanentError; if Shutdown returns without an error, it will report Stopped.
69+
70+
### In the Weeds
71+
72+
There are a couple of implementation details that are worth discussing for those who work on or wish to understand the collector internals.
73+
74+
**component.TelemetrySettings**
75+
76+
The API for components to report status is the ReportStatus method on the component.TelemetrySettings instance that is part of the CreateSettings passed to a component's factory during creation. It takes a single argument, a status event. The StatusWatcher interface takes both a component instance ID and a status event. The ReportStatus function is customized for each component and passes along the instance ID with each event. A component doesn't know its instance ID, but its ReportStatus method does.
77+
78+
**servicetelemetry.TelemetrySettings**
79+
80+
The service gets a slightly different TelemetrySettings object, a servicetelemetry.TelemetrySettings, which references the ReportStatus method on a status.Reporter. Unlike the ReportStatus method on component.TelemetrySettings, this version takes two arguments, a component instance ID and a status event. The service uses this function to report status on behalf of the components it manages. This is what the collector uses for the automated status reporting of lifecycle events.
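
One way to picture the relationship between the two ReportStatus variants is a closure that captures the instance ID, as in the sketch below. The types and the `bindInstanceID` helper are hypothetical and simplified; this is not necessarily how the collector wires the two together.

```go
package main

import "fmt"

// Simplified stand-ins for the collector's instance ID and status event.
type InstanceID struct{ Name string }

type StatusEvent struct{ Status string }

// serviceReportFunc is the two-argument form used by the service.
type serviceReportFunc func(id *InstanceID, ev *StatusEvent)

// componentReportFunc is the one-argument form handed to a component.
type componentReportFunc func(ev *StatusEvent)

// bindInstanceID (a hypothetical helper) returns a one-argument function that
// forwards each event to the service-level function along with a fixed ID.
func bindInstanceID(id *InstanceID, report serviceReportFunc) componentReportFunc {
	return func(ev *StatusEvent) { report(id, ev) }
}

func main() {
	serviceReport := func(id *InstanceID, ev *StatusEvent) {
		fmt.Printf("%s -> %s\n", id.Name, ev.Status)
	}
	// The component only ever sees reportStatus; it never handles its own ID.
	reportStatus := bindInstanceID(&InstanceID{Name: "exporter/otlp"}, serviceReport)
	reportStatus(&StatusEvent{Status: "RecoverableError"})
}
```
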

**sharedcomponent**

The collector has the concept of a shared component. A shared component is represented as a single component to the collector, but represents multiple logical components elsewhere. The most common usage of this is the OTLP receiver, where a single shared component represents a logical instance for each signal: traces, metrics, and logs (although this can vary based on configuration). When a shared component reports status, it must report an event for each of the logical instances it represents. In the current implementation, a shared component reports status for all its logical instances during [Start](https://github.com/open-telemetry/opentelemetry-collector/blob/31ac3336d956d93abede6db76453730613e1f076/internal/sharedcomponent/sharedcomponent.go#L89-L98) and [Shutdown](https://github.com/open-telemetry/opentelemetry-collector/blob/31ac3336d956d93abede6db76453730613e1f076/internal/sharedcomponent/sharedcomponent.go#L105-L117). It also [modifies the ReportStatus method](https://github.com/open-telemetry/opentelemetry-collector/blob/31ac3336d956d93abede6db76453730613e1f076/internal/sharedcomponent/sharedcomponent.go#L34-L44) on component.TelemetrySettings to report status for each logical instance when called.
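
The fan-out behavior can be sketched as follows. This is a simplified illustration under assumed names (`sharedComponent`, `register`, `reportFunc`), not a copy of the linked implementation.

```go
package main

import "fmt"

// StatusEvent is a simplified stand-in for the collector's status event.
type StatusEvent struct{ Status string }

// reportFunc is the one-argument status function for one logical instance.
type reportFunc func(ev *StatusEvent)

// sharedComponent is a hypothetical shared component that holds one report
// function per logical instance (for example traces, metrics, and logs).
type sharedComponent struct {
	reporters []reportFunc
}

// register adds the report function belonging to one logical instance.
func (s *sharedComponent) register(r reportFunc) {
	s.reporters = append(s.reporters, r)
}

// ReportStatus fans a single event out to every registered logical instance.
func (s *sharedComponent) ReportStatus(ev *StatusEvent) {
	for _, r := range s.reporters {
		r(ev)
	}
}

func main() {
	shared := &sharedComponent{}
	for _, signal := range []string{"traces", "metrics", "logs"} {
		name := signal // capture a per-iteration copy for the closure
		shared.register(func(ev *StatusEvent) {
			fmt.Printf("otlp/%s -> %s\n", name, ev.Status)
		})
	}
	shared.ReportStatus(&StatusEvent{Status: "OK"})
}
```
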