Disable top level resilience options by default
Add diagram to readme for better resilience visualization
Fiery-Fenix committed Nov 21, 2024
1 parent 3296a9c commit d8031ca
Showing 4 changed files with 97 additions and 31 deletions.
34 changes: 26 additions & 8 deletions exporter/loadbalancingexporter/README.md
@@ -48,13 +48,35 @@ This also supports service name based exporting for traces. If you have two or m

## Resilience and scaling considerations

-The `loadbalancingexporter` will, irrespective of the chosen resolver (`static`, `dns`, `k8s`), create one exporter per endpoint. Each level of exporters, `loadbalancingexporter` itself and all sub-exporters (one per each endpoint), have it's own queue, timeout and retry mechanisms. Importantly, the `loadbalancingexporter` will attempt to re-route data to a healthy endpoint on delivery failure because in-memory queue, retry and timeout setting are enabled by default ([more details on queuing, retry and timeout default settings](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md)).
+The `loadbalancingexporter` will, irrespective of the chosen resolver (`static`, `dns`, `k8s`), create one `otlp` exporter per endpoint. Each level of exporters, the `loadbalancingexporter` itself and all sub-exporters (one per endpoint), has its own queue, timeout and retry mechanisms. Importantly, the `loadbalancingexporter` will, by default, NOT attempt to re-route data to a healthy endpoint on delivery failure, because its in-memory queue, retry and timeout settings are disabled by default ([more details on queuing, retry and timeout default settings](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md)).

-Unfortunately, data loss is still possible if all of the exporter's targets remains unavailable once redelivery is exhausted. Due consideration needs to be given to the exporter queue and retry configuration when running in a highly elastic environment.
+```
+                                        +------------------+          +---------------+
+ resiliency options 1                   |                  |          |               |
+                                     -- otlp exporter 1    ------------  backend 1    |
+          |                       ---/  |                  |          |               |
+          |                   ---/      +----|-------------+          +---------------+
+          |               ---/               |
+  +-----------------+ ---/                   |
+  |                 --/                      |
+  |  loadbalancing  |              resiliency options 2
+  |    exporter     |                        |
+  |                 --\                      |
+  +-----------------+ ----\                  |
+                           ----\        +----|-------------+          +---------------+
+                               ----\    |                  |          |               |
+                                    --- otlp exporter N    ------------  backend N    |
+                                        |                  |          |               |
+                                        +------------------+          +---------------+
+```

-* For all types of resolvers (`static`, `dns`, `k8s`) - if one of endpoints is unavailable - first works queue, retry and timeout settings defined for sub-exporters (under `otlp` property). Once redelivery is exhausted on sub-exporter level, telemetry data returns to `loadbalancingexporter` itself and data redelivery happens according to exporter level queue, retry and timeout settings.
+* For all types of resolvers (`static`, `dns`, `k8s`), if one of the endpoints is unavailable, the queue, retry and timeout settings defined for the sub-exporters (under the `otlp` property) apply first. Once redelivery is exhausted at the sub-exporter level, and resiliency options 1 are enabled, telemetry data returns to the `loadbalancingexporter` itself and redelivery happens according to the exporter-level queue, retry and timeout settings.
* When using the `static` resolver and all targets are unavailable, all load-balanced telemetry will fail to be delivered until either one or all targets are restored or a valid target is added to the static list. The same principle applies to the `dns` and `k8s` resolvers, except that the endpoint list is updated automatically.
* When using `k8s`, `dns`, and likely future resolvers, topology changes are eventually reflected in the `loadbalancingexporter`. The `k8s` resolver will update more quickly than `dns`, but a window of time in which the true topology doesn't match the view of the `loadbalancingexporter` remains.
+* Resiliency options 1 (`timeout`, `retry_on_failure` and `sending_queue` settings in the `loadbalancing` section) are useful for highly elastic environments (like k8s), where the list of resolved endpoints changes frequently due to deployments, scale-up or scale-down events. When the list of resolved endpoints changes permanently, these options make it possible to re-route data to the new set of healthy backends. Disabled by default.
+* Resiliency options 2 (`timeout`, `retry_on_failure` and `sending_queue` settings in the `otlp` section) are useful for temporary problems with a specific backend, such as network flukes. A persistent queue is NOT supported here, because all sub-exporters share the same `sending_queue` configuration, including `storage`. Enabled by default.

+Unfortunately, data loss is still possible if all of the exporter's targets remain unavailable once redelivery is exhausted. Due consideration needs to be given to the exporter queue and retry configuration when running in a highly elastic environment.
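
The two levels of redelivery described in this section can be illustrated with a small, dependency-free Go simulation. This is only a sketch under stated assumptions, not the exporter's implementation: `subExporter`, `loadBalancer`, `refresh`, `simulate` and the toy length-based routing are all hypothetical names and simplifications introduced here. It shows that a sub-exporter's own retries cannot help while its backend stays down, and that re-routing only happens when the top-level retry count is non-zero.

```go
package main

import (
	"errors"
	"fmt"
)

// subExporter is a hypothetical stand-in for one per-endpoint otlp exporter.
type subExporter struct {
	endpoint string
	healthy  bool
	attempts int // models "resiliency options 2": tries against this one backend
}

// send fails once all attempts against an unhealthy backend are used up.
func (s *subExporter) send() error {
	for i := 0; i < s.attempts; i++ {
		if s.healthy {
			return nil
		}
	}
	return errors.New("redelivery exhausted for " + s.endpoint)
}

// loadBalancer is a hypothetical stand-in for the top-level exporter.
type loadBalancer struct {
	endpoints []*subExporter
	retries   int // models "resiliency options 1": 0 means disabled
	// refresh simulates a resolver update between top-level attempts.
	refresh func([]*subExporter) []*subExporter
}

// route is a toy routing function (key length modulo endpoint count).
func (lb *loadBalancer) route(key string) *subExporter {
	return lb.endpoints[len(key)%len(lb.endpoints)]
}

// consume delivers one item and returns the endpoint that accepted it.
func (lb *loadBalancer) consume(key string) (string, error) {
	var err error
	for attempt := 0; attempt <= lb.retries; attempt++ {
		if len(lb.endpoints) == 0 {
			return "", errors.New("no endpoints available")
		}
		target := lb.route(key)
		if err = target.send(); err == nil {
			return target.endpoint, nil
		}
		// The endpoint list may have changed by the time we retry.
		lb.endpoints = lb.refresh(lb.endpoints)
	}
	return "", err
}

// simulate runs one delivery with the given number of top-level retries.
func simulate(topLevelRetries int) (string, error) {
	down := &subExporter{endpoint: "backend-1", healthy: false, attempts: 2}
	up := &subExporter{endpoint: "backend-2", healthy: true, attempts: 2}
	lb := &loadBalancer{
		endpoints: []*subExporter{down, up},
		retries:   topLevelRetries,
		refresh: func(current []*subExporter) []*subExporter {
			var alive []*subExporter
			for _, e := range current {
				if e.healthy {
					alive = append(alive, e)
				}
			}
			return alive
		},
	}
	return lb.consume("ab") // "ab" initially routes to backend-1
}

func main() {
	fmt.Println(simulate(0)) // top-level retries disabled
	fmt.Println(simulate(1)) // top-level retries enabled
}
```

With zero top-level retries the item is lost as soon as the sub-exporter gives up; with one retry it is routed again against the refreshed endpoint list.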

## Configuration

@@ -93,7 +115,7 @@ Refer to [config.yaml](./testdata/config.yaml) for detailed examples on using th
* `traceID`: Routes spans based on their `traceID`. Invalid for metrics.
* `metric`: Routes metrics based on their metric name. Invalid for spans.
* `streamID`: Routes metrics based on their datapoint streamID. That's the unique hash of all its attributes, plus the attributes and identifying information of its resource, scope, and metric data
-* loadbalancing exporter supports set of standard [queuing, batching, retry and timeout settings](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md)
+* The loadbalancing exporter supports a set of standard [queuing, retry and timeout settings](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md), but they are disabled by default to maintain compatibility
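
As a rough illustration of what a routing key does, the sketch below derives an identifier from a record and maps it to an endpoint. Everything in it is an assumption made for illustration: the `record` type, `routingID` and `pickEndpoint` are hypothetical, `streamID` is left out, and the plain FNV-1a hash with modulo is a simplification that should not be read as the exporter's actual selection logic.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// record is a hypothetical, simplified telemetry item.
type record struct {
	service    string
	traceID    string
	metricName string
}

// routingID returns the identifier selected by the given routing key.
func routingID(routingKey string, r record) (string, error) {
	switch routingKey {
	case "service":
		return r.service, nil
	case "traceID":
		return r.traceID, nil
	case "metric":
		return r.metricName, nil
	default:
		return "", fmt.Errorf("unsupported routing key %q", routingKey)
	}
}

// pickEndpoint maps an identifier to one endpoint with FNV-1a and modulo.
// endpoints must be non-empty.
func pickEndpoint(id string, endpoints []string) string {
	h := fnv.New32a()
	h.Write([]byte(id))
	return endpoints[h.Sum32()%uint32(len(endpoints))]
}

func main() {
	endpoints := []string{"backend-1:4317", "backend-2:4317", "backend-3:4317"}
	r := record{service: "checkout", traceID: "4bf92f3577b34da6", metricName: "http.requests"}

	for _, key := range []string{"service", "traceID", "metric"} {
		id, err := routingID(key, r)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s -> %s\n", key, pickEndpoint(id, endpoints))
	}
}
```

The property that matters is that the same identifier always lands on the same endpoint for a fixed endpoint list, so all spans of one trace, or all points of one metric, reach the same backend.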

Simple example

@@ -161,8 +183,6 @@ exporters:
       max_interval: 30s
       max_elapsed_time: 300s
     sending_queue:
-      # please take a note that otlp.sending_queue will be
-      # disabled automatically in this case to avoid data loss
       enabled: true
       num_consumers: 2
       queue_size: 1000
@@ -173,8 +193,6 @@ exporters:
         # all options from the OTLP exporter are supported
         # except the endpoint
         timeout: 1s
-        # doesn't take any effect because loadbalancing.sending_queue
-        # is enabled
         sending_queue:
           enabled: true
     resolver:
59 changes: 37 additions & 22 deletions exporter/loadbalancingexporter/factory.go
@@ -10,7 +10,6 @@ import (
 	"fmt"

 	"go.opentelemetry.io/collector/component"
-	"go.opentelemetry.io/collector/config/configretry"
 	"go.opentelemetry.io/collector/exporter"
 	"go.opentelemetry.io/collector/exporter/exporterhelper"
 	"go.opentelemetry.io/collector/exporter/otlpexporter"
@@ -40,9 +39,8 @@ func createDefaultConfig() component.Config {
 	otlpDefaultCfg.Endpoint = "placeholder:4317"

 	return &Config{
-		TimeoutSettings: exporterhelper.NewDefaultTimeoutConfig(),
-		QueueSettings:   exporterhelper.NewDefaultQueueConfig(),
-		BackOffConfig:   configretry.NewDefaultBackOffConfig(),
+		// By default we disable resilience options on loadbalancing exporter level
+		// to maintain compatibility with workflow in previous versions
 		Protocol: Protocol{
 			OTLP: *otlpDefaultCfg,
 		},
@@ -69,24 +67,39 @@ func buildExporterSettings(params exporter.Settings, endpoint string) exporter.S
 	return params
 }

+func buildExporterResilienceOptions(options []exporterhelper.Option, cfg *Config) []exporterhelper.Option {
+	if cfg.TimeoutSettings.Timeout > 0 {
+		options = append(options, exporterhelper.WithTimeout(cfg.TimeoutSettings))
+	}
+	if cfg.QueueSettings.Enabled {
+		options = append(options, exporterhelper.WithQueue(cfg.QueueSettings))
+	}
+	if cfg.BackOffConfig.Enabled {
+		options = append(options, exporterhelper.WithRetry(cfg.BackOffConfig))
+	}
+
+	return options
+}
+
 func createTracesExporter(ctx context.Context, params exporter.Settings, cfg component.Config) (exporter.Traces, error) {
 	c := cfg.(*Config)
 	exporter, err := newTracesExporter(params, cfg)
 	if err != nil {
 		return nil, fmt.Errorf("cannot configure loadbalancing traces exporter: %w", err)
 	}

+	options := []exporterhelper.Option{
+		exporterhelper.WithStart(exporter.Start),
+		exporterhelper.WithShutdown(exporter.Shutdown),
+		exporterhelper.WithCapabilities(exporter.Capabilities()),
+	}
+
 	return exporterhelper.NewTraces(
 		ctx,
 		params,
 		cfg,
 		exporter.ConsumeTraces,
-		exporterhelper.WithStart(exporter.Start),
-		exporterhelper.WithShutdown(exporter.Shutdown),
-		exporterhelper.WithCapabilities(exporter.Capabilities()),
-		exporterhelper.WithTimeout(c.TimeoutSettings),
-		exporterhelper.WithQueue(c.QueueSettings),
-		exporterhelper.WithRetry(c.BackOffConfig),
+		buildExporterResilienceOptions(options, c)...,
 	)
 }

@@ -97,17 +110,18 @@ func createLogsExporter(ctx context.Context, params exporter.Settings, cfg compo
 		return nil, fmt.Errorf("cannot configure loadbalancing logs exporter: %w", err)
 	}

+	options := []exporterhelper.Option{
+		exporterhelper.WithStart(exporter.Start),
+		exporterhelper.WithShutdown(exporter.Shutdown),
+		exporterhelper.WithCapabilities(exporter.Capabilities()),
+	}
+
 	return exporterhelper.NewLogs(
 		ctx,
 		params,
 		cfg,
 		exporter.ConsumeLogs,
-		exporterhelper.WithStart(exporter.Start),
-		exporterhelper.WithShutdown(exporter.Shutdown),
-		exporterhelper.WithCapabilities(exporter.Capabilities()),
-		exporterhelper.WithTimeout(c.TimeoutSettings),
-		exporterhelper.WithQueue(c.QueueSettings),
-		exporterhelper.WithRetry(c.BackOffConfig),
+		buildExporterResilienceOptions(options, c)...,
 	)
 }

@@ -118,16 +132,17 @@ func createMetricsExporter(ctx context.Context, params exporter.Settings, cfg co
 		return nil, fmt.Errorf("cannot configure loadbalancing metrics exporter: %w", err)
 	}

+	options := []exporterhelper.Option{
+		exporterhelper.WithStart(exporter.Start),
+		exporterhelper.WithShutdown(exporter.Shutdown),
+		exporterhelper.WithCapabilities(exporter.Capabilities()),
+	}
+
 	return exporterhelper.NewMetrics(
 		ctx,
 		params,
 		cfg,
 		exporter.ConsumeMetrics,
-		exporterhelper.WithStart(exporter.Start),
-		exporterhelper.WithShutdown(exporter.Shutdown),
-		exporterhelper.WithCapabilities(exporter.Capabilities()),
-		exporterhelper.WithTimeout(c.TimeoutSettings),
-		exporterhelper.WithQueue(c.QueueSettings),
-		exporterhelper.WithRetry(c.BackOffConfig),
+		buildExporterResilienceOptions(options, c)...,
 	)
 }
34 changes: 34 additions & 0 deletions exporter/loadbalancingexporter/factory_test.go
@@ -12,6 +12,8 @@ import (
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 	"go.opentelemetry.io/collector/component"
+	"go.opentelemetry.io/collector/config/configretry"
+	"go.opentelemetry.io/collector/exporter/exporterhelper"
 	"go.opentelemetry.io/collector/exporter/exportertest"
 	"go.opentelemetry.io/collector/exporter/otlpexporter"
 	"go.opentelemetry.io/collector/otelcol/otelcoltest"
@@ -123,3 +125,35 @@ func TestBuildExporterSettings(t *testing.T) {
 		zap.String(zapEndpointKey, testEndpoint),
 	)
 }
+
+func TestBuildExporterResilienceOptions(t *testing.T) {
+	t.Run("Shouldn't have resilience options by default", func(t *testing.T) {
+		o := []exporterhelper.Option{}
+		cfg := createDefaultConfig().(*Config)
+		assert.Empty(t, buildExporterResilienceOptions(o, cfg))
+	})
+	t.Run("Should have timeout option if defined", func(t *testing.T) {
+		o := []exporterhelper.Option{}
+		cfg := createDefaultConfig().(*Config)
+		cfg.TimeoutSettings = exporterhelper.NewDefaultTimeoutConfig()
+
+		assert.Len(t, buildExporterResilienceOptions(o, cfg), 1)
+	})
+	t.Run("Should have timeout and queue options if defined", func(t *testing.T) {
+		o := []exporterhelper.Option{}
+		cfg := createDefaultConfig().(*Config)
+		cfg.TimeoutSettings = exporterhelper.NewDefaultTimeoutConfig()
+		cfg.QueueSettings = exporterhelper.NewDefaultQueueConfig()
+
+		assert.Len(t, buildExporterResilienceOptions(o, cfg), 2)
+	})
+	t.Run("Should have all resilience options if defined", func(t *testing.T) {
+		o := []exporterhelper.Option{}
+		cfg := createDefaultConfig().(*Config)
+		cfg.TimeoutSettings = exporterhelper.NewDefaultTimeoutConfig()
+		cfg.QueueSettings = exporterhelper.NewDefaultQueueConfig()
+		cfg.BackOffConfig = configretry.NewDefaultBackOffConfig()
+
+		assert.Len(t, buildExporterResilienceOptions(o, cfg), 3)
+	})
+}
1 change: 0 additions & 1 deletion exporter/loadbalancingexporter/go.mod
@@ -114,7 +114,6 @@ require (
 	go.opentelemetry.io/collector/config/configgrpc v0.114.0 // indirect
 	go.opentelemetry.io/collector/config/confignet v1.20.0 // indirect
 	go.opentelemetry.io/collector/config/configopaque v1.20.0 // indirect
-	go.opentelemetry.io/collector/config/configretry v1.20.0 // indirect
 	go.opentelemetry.io/collector/config/configtls v1.20.0 // indirect
 	go.opentelemetry.io/collector/config/internal v0.114.0 // indirect
 	go.opentelemetry.io/collector/confmap/provider/envprovider v1.20.0 // indirect
