Validate and sanitization of dataset and namespace field #3946

ruflin · 2023-12-21T09:08:28Z

The fields for the data stream naming scheme have rules around what chars are valid. One of the goals it so make sure a pattern like logs-*-* always applies to the right dataset and namespace. Recently, some configs in helm charts were found that look as following:

        streams:
          - id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}
            data_stream:
              dataset: ${kubernetes.labels.app.kubernetes.io/name|'container_logs'}.log
              type: logs

This is a valid configuration. But as it turns out, the name above can be for example foo-bar. This means the value for data_stream.dataset becomes foo-bar. But for processors like reroute that match on the data stream name with the pattern logs-*-*, foo is assumed to be the dataset, bar and everything afterwards the namespace.

In the context of the reroute processor, these fields are sanitized. Unfortunately Elastic Agent (Beats) doesn't do this today but should.

Unfortunately turning on sanitization directly could be considered a breaking change as now suddenly foo-bar would become foo_bar. At the same time, it is important for the system to function that the values are always the same.

Some ideas for migration users is to at least start logging errors (once) and inform users if this happens and later turn it on by default (with an opt out option).

@axw This might also be relevant in the context of ingesting otel data.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2023-12-21T15:08:57Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz · 2023-12-21T15:53:02Z

Users aren't going to be naming their pods while keeping the data stream naming convention in mind, so this is unfortunate. I wonder how common this is, since - is extremely common as the separator in k8s resource names. Perhaps quite common.

Similar to #3934 (comment) we could use feature flags to deal with the breaking nature of this problem:

elastic-agent/elastic-agent.reference.yml

Lines 194 to 197 in a76f4ba

    
           # This section enables or disables feature flags supported by Agent and its components. 
        
           #agent.features: 
        
           #  fqdn: 
        
           #    enabled: false

This feels more like something that we really should be doing by default though if it is going to cause problems interacting with other parts of the stack. So perhaps in this case we default to sanitization with the option to disable it.

The other option is for things like the reroute processor to just deal with the fact that not everything is going to adhere perfectly to our conventions since this is already out there (and you can of course write to ES without using Elastic agent).

Given logs-foo-bar-default rather than logs-*-* parsing this with the dataset as foo and the namespace as bar-default could we consider parsing only the first and last - characters as the separators? This would allow - in the dataset name and you'd correctly parse foo-bar as the dataset. You'd still have problems if someone put - in the namespace with this approach though (and I suspect someone has or will use a k8s namespace as as datastream namespace).

joshdover · 2023-12-27T13:56:16Z

This feels more like something that we really should be doing by default though if it is going to cause problems interacting with other parts of the stack. So perhaps in this case we default to sanitization with the option to disable it.

+1 - we're generally ok to make breaking changes that are actually bug fixes. We can run this by the breaking change committee if we're still nervous about it but I think the fix warrants the potential negative impact, which should only effect advanced users who have handcrafted a config like this.

Given logs-foo-bar-default rather than logs-*-* parsing this with the dataset as foo and the namespace as bar-default could we consider parsing only the first and last - characters as the separators? This would allow - in the dataset name and you'd correctly parse foo-bar as the dataset. You'd still have problems if someone put - in the namespace with this approach though (and I suspect someone has or will use a k8s namespace as as datastream namespace).

I think changing the parsing rules here on the ES side is likely to be a much more impactful breaking change, even if scoped only to the DSNS. It's also not exactly clear which piece we should prefer supporting dashes in. IMO the current behavior may actually be preferable to ensure that you can send logs to logs-kubernetes.container_logs-my-k8s-namespace and get to reuse the existing index template for the dataset.

I think a better solution may be to focus on is getting away from shippers specifying the dataset and namespace via the index name and instead sending data to logs-generic-default (or even better, just logs) and leveraging the structured data_stream.* fields to route data via the reroute processor. This allows us to side-step any parsing prioritization decisions and allows the user to do anything they need to do on the collection side, and have the backend make sure the data is always valid for the Stack.

This almost works today OOTB with the default templates, but requires that a reroute processor is added to the logs@custom pipeline:

PUT _ingest/pipeline/logs@custom
{
  "processors": [
    {
      "reroute": {
        "ignore_failure": true
      }
    }
  ]
}

POST /logs-generic-default/_bulk
{ "create": {} }
{ "data_stream": { "dataset": "bar", "namespace": "test-me" }, "message": "foo 3"}
{ "create": {} }
{ "data_stream": { "dataset": "with-dashes", "namespace": "test-me" }, "message": "foo 2"}
{ "create": {} }
{ "data_stream": { "dataset": "foo" }, "message": "foo 1"}

ruflin · 2024-01-15T07:42:19Z

Taking what Josh has above, the data stream naming scheme would basically be extended with these 2 rules:

If dataset value is not set, auto detection of the dataset value is happening with the logs-*-* pattern and will be set
If the dataset is set, this is the value to be used. So it can contain - and other chars.

What I think this all comes back to is that whenever the data stream naming scheme is used, the data_stream.* fields are set so anything down stream can use these values to make decisions (trying to find the Github issue for linking).

felixbarny · 2024-01-15T07:56:16Z

I believe this is the issue you're looking for:

[Indexing Strategy] Prevent wrong dataset / namespace values elasticsearch#64991

leehinman · 2024-09-04T18:05:53Z

If we want to keep existing data stream naming I think we could do the following.

Have fleet report an error if any integration is configured to send to a data stream name that doesn't comply with the rules
Have an option to the variable expansion so that it will encode - as something else, for example you could use the HTML entity -

That should catch static config errors (fleet check) and dynamic config errors (variable expansion).

ycombinator · 2024-09-05T01:08:43Z

Have an option to the variable expansion so that it will encode - as something else, for example you could use the HTML entity -

We should probably be consistent with the reroute processor and "encode" the - as a _?

In the context of the reroute processor, these fields are sanitized. Unfortunately Elastic Agent (Beats) doesn't do this today but should.

axw · 2024-09-05T02:32:33Z

@ycombinator +1. FWIW APM Server and MIS has logic to sanitise data stream names, where we also use "_" as the replacement. May be worth copying for consistency: https://github.com/elastic/apm-data/blob/c5848755c65e676325c80780c24a14a941e5f176/model/modelprocessor/datastream.go#L197C6-L209

leehinman · 2024-09-05T14:23:31Z

@ycombinator +1. FWIW APM Server and MIS has logic to sanitise data stream names, where we also use "_" as the replacement. May be worth copying for consistency: https://github.com/elastic/apm-data/blob/c5848755c65e676325c80780c24a14a941e5f176/model/modelprocessor/datastream.go#L197C6-L209

consistency is important, but I'd like to point out that this is a sub-optimal way to "sanitize". You can't go backwards from this, and you could have accidental name collisions. For example:

a-b_c => a_b_c
and
a_b-c => a_b_c

depending on variable substitution you could end up writing very different things to the same data_stream, and you will be making it harder for users to determine where the data_stream name came from.

cmacknz added the Team:Elastic-Agent Label for the Agent team label Dec 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validate and sanitization of dataset and namespace field #3946

Validate and sanitization of dataset and namespace field #3946

ruflin commented Dec 21, 2023

elasticmachine commented Dec 21, 2023

cmacknz commented Dec 21, 2023

joshdover commented Dec 27, 2023

ruflin commented Jan 15, 2024

felixbarny commented Jan 15, 2024

leehinman commented Sep 4, 2024

ycombinator commented Sep 5, 2024

axw commented Sep 5, 2024

leehinman commented Sep 5, 2024

Validate and sanitization of dataset and namespace field #3946

Validate and sanitization of dataset and namespace field #3946

Comments

ruflin commented Dec 21, 2023

elasticmachine commented Dec 21, 2023

cmacknz commented Dec 21, 2023

joshdover commented Dec 27, 2023

ruflin commented Jan 15, 2024

felixbarny commented Jan 15, 2024

leehinman commented Sep 4, 2024

ycombinator commented Sep 5, 2024

axw commented Sep 5, 2024

leehinman commented Sep 5, 2024