
[exporter/elasticsearch] Add sanitization utils for datastream fields #35494

Merged
22 commits
f01217c
add sanitization utils for datastream fields
rubvs Sep 30, 2024
f30ab50
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Sep 30, 2024
e973b09
test: add sanitization check for exporting datastream
rubvs Sep 30, 2024
fe77f13
Merge branch 'elastic-sanitize-datastream-fields' of github.com:rubvs…
rubvs Sep 30, 2024
bf248b3
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Sep 30, 2024
4383f67
improve sanitization func to leverage inlining
rubvs Oct 1, 2024
8bbbc8c
Merge branch 'elastic-sanitize-datastream-fields' of github.com:rubvs…
rubvs Oct 1, 2024
bc3bd6b
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Oct 1, 2024
86cc2ae
minor change to datastream sanitization func
rubvs Oct 1, 2024
d69cd4a
Merge branch 'elastic-sanitize-datastream-fields' of github.com:rubvs…
rubvs Oct 1, 2024
188300c
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Oct 1, 2024
71ffaca
add changelog
rubvs Oct 1, 2024
5d771ce
Merge branch 'elastic-sanitize-datastream-fields' of github.com:rubvs…
rubvs Oct 1, 2024
43e55d2
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Oct 1, 2024
9e3564e
fix linting issue in comment
rubvs Oct 1, 2024
8bdea23
Merge branch 'elastic-sanitize-datastream-fields' of github.com:rubvs…
rubvs Oct 1, 2024
4020fc7
minor changes
rubvs Oct 3, 2024
f85b3a5
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Oct 3, 2024
a868f0a
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Oct 3, 2024
d89ed94
doc: add reference to data stream field restrictions
rubvs Oct 8, 2024
4bafdcf
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Oct 11, 2024
ce54d20
Merge branch 'main' into elastic-sanitize-datastream-fields
rubvs Oct 15, 2024
28 changes: 28 additions & 0 deletions .chloggen/elasticsearchexporter_sanitize-datastream-fields.yaml
@@ -0,0 +1,28 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: bug_fix

# The name of the component, or a single word describing the area of concern, (e.g. filelogreceiver)
component: elasticsearchexporter

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Sanitize datastream routing fields

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [34285]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:
Sanitize the dataset and namespace fields according to https://www.elastic.co/guide/en/ecs/current/ecs-data_stream.html.

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: [user]
8 changes: 4 additions & 4 deletions exporter/elasticsearchexporter/README.md
@@ -121,21 +121,21 @@ This can be customised through the following settings:

- `logs_dynamic_index` (optional): uses resource, scope, or log record attributes to dynamically construct index name.
- `enabled`(default=false): Enable/Disable dynamic index for log records. If `data_stream.dataset` or `data_stream.namespace` exist in attributes (precedence: log record attribute > scope attribute > resource attribute), they will be used to dynamically construct index name in the form `logs-${data_stream.dataset}-${data_stream.namespace}`. Otherwise, if
`elasticsearch.index.prefix` or `elasticsearch.index.suffix` exist in attributes (precedence: resource attribute > scope attribute > log record attribute), they will be used to dynamically construct index name in the form `${elasticsearch.index.prefix}${logs_index}${elasticsearch.index.suffix}`. Otherwise, if scope name matches regex `/receiver/(\w*receiver)`, `data_stream.dataset` will be capture group #1. Otherwise, the index name falls back to `logs-generic-default`, and `logs_index` config will be ignored. Except for prefix/suffix attribute presence, the resulting docs will contain the corresponding `data_stream.*` fields.
`elasticsearch.index.prefix` or `elasticsearch.index.suffix` exist in attributes (precedence: resource attribute > scope attribute > log record attribute), they will be used to dynamically construct index name in the form `${elasticsearch.index.prefix}${logs_index}${elasticsearch.index.suffix}`. Otherwise, if scope name matches regex `/receiver/(\w*receiver)`, `data_stream.dataset` will be capture group #1. Otherwise, the index name falls back to `logs-generic-default`, and `logs_index` config will be ignored. Except for prefix/suffix attribute presence, the resulting docs will contain the corresponding `data_stream.*` fields, see restrictions applied to [Data Stream Fields](https://www.elastic.co/guide/en/ecs/current/ecs-data_stream.html).

- `metrics_index` (optional): The [index] or [data stream] name to publish metrics to. The default value is `metrics-generic-default`.
⚠️ Note that metrics support is currently in development.

- `metrics_dynamic_index` (optional): uses resource, scope or data point attributes to dynamically construct index name.
⚠️ Note that metrics support is currently in development.
- `enabled`(default=true): Enable/disable dynamic index for metrics. If `data_stream.dataset` or `data_stream.namespace` exist in attributes (precedence: data point attribute > scope attribute > resource attribute), they will be used to dynamically construct index name in the form `metrics-${data_stream.dataset}-${data_stream.namespace}`. Otherwise, if
`elasticsearch.index.prefix` or `elasticsearch.index.suffix` exist in attributes (precedence: resource attribute > scope attribute > data point attribute), they will be used to dynamically construct index name in the form `${elasticsearch.index.prefix}${metrics_index}${elasticsearch.index.suffix}`. Otherwise, if scope name matches regex `/receiver/(\w*receiver)`, `data_stream.dataset` will be capture group #1. Otherwise, the index name falls back to `metrics-generic-default`, and `metrics_index` config will be ignored. Except for prefix/suffix attribute presence, the resulting docs will contain the corresponding `data_stream.*` fields.
`elasticsearch.index.prefix` or `elasticsearch.index.suffix` exist in attributes (precedence: resource attribute > scope attribute > data point attribute), they will be used to dynamically construct index name in the form `${elasticsearch.index.prefix}${metrics_index}${elasticsearch.index.suffix}`. Otherwise, if scope name matches regex `/receiver/(\w*receiver)`, `data_stream.dataset` will be capture group #1. Otherwise, the index name falls back to `metrics-generic-default`, and `metrics_index` config will be ignored. Except for prefix/suffix attribute presence, the resulting docs will contain the corresponding `data_stream.*` fields, see restrictions applied to [Data Stream Fields](https://www.elastic.co/guide/en/ecs/current/ecs-data_stream.html).

- `traces_index`: The [index] or [data stream] name to publish traces to. The default value is `traces-generic-default`.

- `traces_dynamic_index` (optional): uses resource, scope, or span attributes to dynamically construct index name.
- `enabled`(default=false): Enable/Disable dynamic index for trace spans. If `data_stream.dataset` or `data_stream.namespace` exist in attributes (precedence: span attribute > scope attribute > resource attribute), they will be used to dynamically construct index name in the form `traces-${data_stream.dataset}-${data_stream.namespace}`. Otherwise, if
`elasticsearch.index.prefix` or `elasticsearch.index.suffix` exist in attributes (precedence: resource attribute > scope attribute > span attribute), they will be used to dynamically construct index name in the form `${elasticsearch.index.prefix}${traces_index}${elasticsearch.index.suffix}`. Otherwise, if scope name matches regex `/receiver/(\w*receiver)`, `data_stream.dataset` will be capture group #1. Otherwise, the index name falls back to `traces-generic-default`, and `traces_index` config will be ignored. Except for prefix/suffix attribute presence, the resulting docs will contain the corresponding `data_stream.*` fields. There is an exception for span events under OTel mapping mode (`mapping::mode: otel`), where span event attributes instead of span attributes are considered, and `data_stream.type` is always `logs` instead of `traces` such that documents are routed to `logs-${data_stream.dataset}-${data_stream.namespace}`.
`elasticsearch.index.prefix` or `elasticsearch.index.suffix` exist in attributes (precedence: resource attribute > scope attribute > span attribute), they will be used to dynamically construct index name in the form `${elasticsearch.index.prefix}${traces_index}${elasticsearch.index.suffix}`. Otherwise, if scope name matches regex `/receiver/(\w*receiver)`, `data_stream.dataset` will be capture group #1. Otherwise, the index name falls back to `traces-generic-default`, and `traces_index` config will be ignored. Except for prefix/suffix attribute presence, the resulting docs will contain the corresponding `data_stream.*` fields, see restrictions applied to [Data Stream Fields](https://www.elastic.co/guide/en/ecs/current/ecs-data_stream.html). There is an exception for span events under OTel mapping mode (`mapping::mode: otel`), where span event attributes instead of span attributes are considered, and `data_stream.type` is always `logs` instead of `traces` such that documents are routed to `logs-${data_stream.dataset}-${data_stream.namespace}`.

- `logstash_format` (optional): Logstash format compatibility. Logs, metrics and traces can be written into an index in Logstash format.
- `enabled`(default=false): Enable/disable Logstash format compatibility. When `logstash_format.enabled` is `true`, the index name is composed using `(logs|metrics|traces)_index` or `(logs|metrics|traces)_dynamic_index` as prefix and the date as suffix,
@@ -347,4 +347,4 @@ When sending high traffic of metrics to a TSDB metrics data stream, e.g. using O

This will be fixed in a future version of Elasticsearch. A possible workaround would be to use a transform processor to truncate the timestamp, but this will cause duplicate data to be dropped silently.

However, if `@timestamp` precision is not the problem, check your metrics pipeline setup for misconfiguration that causes an actual violation of the [single writer principle](https://opentelemetry.io/docs/specs/otel/metrics/data-model/#single-writer).
36 changes: 34 additions & 2 deletions exporter/elasticsearchexporter/data_stream_router.go
@@ -6,12 +6,39 @@ package elasticsearchexporter // import "github.com/open-telemetry/opentelemetry
import (
"fmt"
"regexp"
"strings"
"unicode"

"go.opentelemetry.io/collector/pdata/pcommon"
)

var receiverRegex = regexp.MustCompile(`/receiver/(\w*receiver)`)

const (
maxDataStreamBytes = 100
disallowedNamespaceRunes = "\\/*?\"<>| ,#:"
disallowedDatasetRunes = "-\\/*?\"<>| ,#:"
)

// Sanitize the datastream fields (dataset, namespace) to apply restrictions
// as outlined in https://www.elastic.co/guide/en/ecs/current/ecs-data_stream.html
// The suffix will be appended after truncation of max bytes.
func sanitizeDataStreamField(field, disallowed, appendSuffix string) string {
field = strings.Map(func(r rune) rune {
if strings.ContainsRune(disallowed, r) {
return '_'
}
return unicode.ToLower(r)
}, field)

if len(field) > maxDataStreamBytes-len(appendSuffix) {
field = field[:maxDataStreamBytes-len(appendSuffix)]
}
field += appendSuffix

return field
}

func routeWithDefaults(defaultDSType string) func(
pcommon.Map,
pcommon.Map,
@@ -53,15 +80,20 @@ func routeWithDefaults(defaultDSType string) func(
dataset = receiverName
}

// The naming convention for datastream is expected to be "logs-[dataset].otel-[namespace]".
// For dataset, the naming convention for datastream is expected to be "logs-[dataset].otel-[namespace]".
// This is in order to match the built-in logs-*.otel-* index template.
var datasetSuffix string
if otel {
dataset += ".otel"
datasetSuffix += ".otel"
}

dataset = sanitizeDataStreamField(dataset, disallowedDatasetRunes, datasetSuffix)
namespace = sanitizeDataStreamField(namespace, disallowedNamespaceRunes, "")

recordAttr.PutStr(dataStreamDataset, dataset)
recordAttr.PutStr(dataStreamNamespace, namespace)
recordAttr.PutStr(dataStreamType, defaultDSType)

return fmt.Sprintf("%s-%s-%s", defaultDSType, dataset, namespace)
}
}
17 changes: 9 additions & 8 deletions exporter/elasticsearchexporter/exporter_test.go
@@ -215,7 +215,8 @@ func TestExporterLogs(t *testing.T) {
server := newESTestServer(t, func(docs []itemRequest) ([]itemResponse, error) {
rec.Record(docs)

assert.Equal(t, "logs-record.dataset-resource.namespace", actionJSONToIndex(t, docs[0].Action))
expected := "logs-record.dataset.____________-resource.namespace.-____________"
assert.Equal(t, expected, actionJSONToIndex(t, docs[0].Action))

return itemsAllOK(docs)
})
@@ -225,12 +226,12 @@
})
logs := newLogsWithAttributes(
map[string]any{
dataStreamDataset: "record.dataset",
dataStreamDataset: "record.dataset.\\/*?\"<>| ,#:",
},
nil,
map[string]any{
dataStreamDataset: "resource.dataset",
dataStreamNamespace: "resource.namespace",
dataStreamNamespace: "resource.namespace.-\\/*?\"<>| ,#:",
},
)
logs.ResourceLogs().At(0).ScopeLogs().At(0).LogRecords().At(0).Body().SetStr("hello world")
@@ -647,7 +648,7 @@ func TestExporterMetrics(t *testing.T) {
server := newESTestServer(t, func(docs []itemRequest) ([]itemResponse, error) {
rec.Record(docs)

expected := "metrics-resource.dataset-data.point.namespace"
expected := "metrics-resource.dataset.____________-data.point.namespace.-____________"
assert.Equal(t, expected, actionJSONToIndex(t, docs[0].Action))

return itemsAllOK(docs)
@@ -659,11 +660,11 @@
})
metrics := newMetricsWithAttributes(
map[string]any{
dataStreamNamespace: "data.point.namespace",
dataStreamNamespace: "data.point.namespace.-\\/*?\"<>| ,#:",
},
nil,
map[string]any{
dataStreamDataset: "resource.dataset",
dataStreamDataset: "resource.dataset.\\/*?\"<>| ,#:",
dataStreamNamespace: "resource.namespace",
},
)
@@ -1287,7 +1288,7 @@ func TestExporterTraces(t *testing.T) {
server := newESTestServer(t, func(docs []itemRequest) ([]itemResponse, error) {
rec.Record(docs)

expected := "traces-span.dataset-default"
expected := "traces-span.dataset.____________-default"
assert.Equal(t, expected, actionJSONToIndex(t, docs[0].Action))

return itemsAllOK(docs)
@@ -1299,7 +1300,7 @@

mustSendTraces(t, exporter, newTracesWithAttributes(
map[string]any{
dataStreamDataset: "span.dataset",
dataStreamDataset: "span.dataset.\\/*?\"<>| ,#:",
},
nil,
map[string]any{
17 changes: 17 additions & 0 deletions exporter/elasticsearchexporter/model_test.go
@@ -960,6 +960,9 @@ func decodeOTelID(data []byte) ([]byte, error) {
}

func TestEncodeLogOtelMode(t *testing.T) {
randomString := strings.Repeat("abcdefghijklmnopqrstuvwxyz0123456789", 10)
maxLenNamespace := maxDataStreamBytes - len(disallowedNamespaceRunes)
maxLenDataset := maxDataStreamBytes - len(disallowedDatasetRunes) - len(".otel")

tests := []struct {
name string
@@ -1044,6 +1047,20 @@
return assignDatastreamData(or, "", "third.otel")
},
},
{
name: "sanitize dataset/namespace",
rec: buildOTelRecordTestData(t, func(or OTelRecord) OTelRecord {
or.Attributes["data_stream.dataset"] = disallowedDatasetRunes + randomString
or.Attributes["data_stream.namespace"] = disallowedNamespaceRunes + randomString
return or
}),
wantFn: func(or OTelRecord) OTelRecord {
deleteDatasetAttributes(or)
ds := strings.Repeat("_", len(disallowedDatasetRunes)) + randomString[:maxLenDataset] + ".otel"
ns := strings.Repeat("_", len(disallowedNamespaceRunes)) + randomString[:maxLenNamespace]
return assignDatastreamData(or, "", ds, ns)
},
},
}

m := encodeModel{