Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate and sanitization of dataset and namespace field #3946

Open
ruflin opened this issue Dec 21, 2023 · 9 comments
Open

Validate and sanitization of dataset and namespace field #3946

ruflin opened this issue Dec 21, 2023 · 9 comments
Labels
Team:Elastic-Agent Label for the Agent team

Comments

@ruflin
Copy link
Contributor

ruflin commented Dec 21, 2023

The fields for the data stream naming scheme have rules around what chars are valid. One of the goals it so make sure a pattern like logs-*-* always applies to the right dataset and namespace. Recently, some configs in helm charts were found that look as following:

        streams:
          - id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}
            data_stream:
              dataset: ${kubernetes.labels.app.kubernetes.io/name|'container_logs'}.log
              type: logs

This is a valid configuration. But as it turns out, the name above can be for example foo-bar. This means the value for data_stream.dataset becomes foo-bar. But for processors like reroute that match on the data stream name with the pattern logs-*-*, foo is assumed to be the dataset, bar and everything afterwards the namespace.

In the context of the reroute processor, these fields are sanitized. Unfortunately Elastic Agent (Beats) doesn't do this today but should.

Unfortunately turning on sanitization directly could be considered a breaking change as now suddenly foo-bar would become foo_bar. At the same time, it is important for the system to function that the values are always the same.

Some ideas for migration users is to at least start logging errors (once) and inform users if this happens and later turn it on by default (with an opt out option).

@axw This might also be relevant in the context of ingesting otel data.

@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Dec 21, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Copy link
Member

cmacknz commented Dec 21, 2023

Users aren't going to be naming their pods while keeping the data stream naming convention in mind, so this is unfortunate. I wonder how common this is, since - is extremely common as the separator in k8s resource names. Perhaps quite common.

Similar to #3934 (comment) we could use feature flags to deal with the breaking nature of this problem:

# This section enables or disables feature flags supported by Agent and its components.
#agent.features:
# fqdn:
# enabled: false

This feels more like something that we really should be doing by default though if it is going to cause problems interacting with other parts of the stack. So perhaps in this case we default to sanitization with the option to disable it.

The other option is for things like the reroute processor to just deal with the fact that not everything is going to adhere perfectly to our conventions since this is already out there (and you can of course write to ES without using Elastic agent).

Given logs-foo-bar-default rather than logs-*-* parsing this with the dataset as foo and the namespace as bar-default could we consider parsing only the first and last - characters as the separators? This would allow - in the dataset name and you'd correctly parse foo-bar as the dataset. You'd still have problems if someone put - in the namespace with this approach though (and I suspect someone has or will use a k8s namespace as as datastream namespace).

@joshdover
Copy link
Contributor

This feels more like something that we really should be doing by default though if it is going to cause problems interacting with other parts of the stack. So perhaps in this case we default to sanitization with the option to disable it.

+1 - we're generally ok to make breaking changes that are actually bug fixes. We can run this by the breaking change committee if we're still nervous about it but I think the fix warrants the potential negative impact, which should only effect advanced users who have handcrafted a config like this.

Given logs-foo-bar-default rather than logs-*-* parsing this with the dataset as foo and the namespace as bar-default could we consider parsing only the first and last - characters as the separators? This would allow - in the dataset name and you'd correctly parse foo-bar as the dataset. You'd still have problems if someone put - in the namespace with this approach though (and I suspect someone has or will use a k8s namespace as as datastream namespace).

I think changing the parsing rules here on the ES side is likely to be a much more impactful breaking change, even if scoped only to the DSNS. It's also not exactly clear which piece we should prefer supporting dashes in. IMO the current behavior may actually be preferable to ensure that you can send logs to logs-kubernetes.container_logs-my-k8s-namespace and get to reuse the existing index template for the dataset.

I think a better solution may be to focus on is getting away from shippers specifying the dataset and namespace via the index name and instead sending data to logs-generic-default (or even better, just logs) and leveraging the structured data_stream.* fields to route data via the reroute processor. This allows us to side-step any parsing prioritization decisions and allows the user to do anything they need to do on the collection side, and have the backend make sure the data is always valid for the Stack.

This almost works today OOTB with the default templates, but requires that a reroute processor is added to the logs@custom pipeline:

PUT _ingest/pipeline/logs@custom
{
  "processors": [
    {
      "reroute": {
        "ignore_failure": true
      }
    }
  ]
}
POST /logs-generic-default/_bulk
{ "create": {} }
{ "data_stream": { "dataset": "bar", "namespace": "test-me" }, "message": "foo 3"}
{ "create": {} }
{ "data_stream": { "dataset": "with-dashes", "namespace": "test-me" }, "message": "foo 2"}
{ "create": {} }
{ "data_stream": { "dataset": "foo" }, "message": "foo 1"}
image

@ruflin
Copy link
Contributor Author

ruflin commented Jan 15, 2024

Taking what Josh has above, the data stream naming scheme would basically be extended with these 2 rules:

  • If dataset value is not set, auto detection of the dataset value is happening with the logs-*-* pattern and will be set
  • If the dataset is set, this is the value to be used. So it can contain - and other chars.

What I think this all comes back to is that whenever the data stream naming scheme is used, the data_stream.* fields are set so anything down stream can use these values to make decisions (trying to find the Github issue for linking).

@felixbarny
Copy link
Member

I believe this is the issue you're looking for:

@leehinman
Copy link
Contributor

If we want to keep existing data stream naming I think we could do the following.

  1. Have fleet report an error if any integration is configured to send to a data stream name that doesn't comply with the rules
  2. Have an option to the variable expansion so that it will encode - as something else, for example you could use the HTML entity -

That should catch static config errors (fleet check) and dynamic config errors (variable expansion).

@ycombinator
Copy link
Contributor

Have an option to the variable expansion so that it will encode - as something else, for example you could use the HTML entity -

We should probably be consistent with the reroute processor and "encode" the - as a _?

In the context of the reroute processor, these fields are sanitized. Unfortunately Elastic Agent (Beats) doesn't do this today but should.

@axw
Copy link
Member

axw commented Sep 5, 2024

@ycombinator +1. FWIW APM Server and MIS has logic to sanitise data stream names, where we also use "_" as the replacement. May be worth copying for consistency: https://github.com/elastic/apm-data/blob/c5848755c65e676325c80780c24a14a941e5f176/model/modelprocessor/datastream.go#L197C6-L209

@leehinman
Copy link
Contributor

@ycombinator +1. FWIW APM Server and MIS has logic to sanitise data stream names, where we also use "_" as the replacement. May be worth copying for consistency: https://github.com/elastic/apm-data/blob/c5848755c65e676325c80780c24a14a941e5f176/model/modelprocessor/datastream.go#L197C6-L209

consistency is important, but I'd like to point out that this is a sub-optimal way to "sanitize". You can't go backwards from this, and you could have accidental name collisions. For example:

a-b_c => a_b_c
and
a_b-c => a_b_c

depending on variable substitution you could end up writing very different things to the same data_stream, and you will be making it harder for users to determine where the data_stream name came from.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

No branches or pull requests

8 participants