Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reroute processor sanitization inconsistent with spec #112936

Open
carsonip opened this issue Sep 16, 2024 · 6 comments
Open

reroute processor sanitization inconsistent with spec #112936

carsonip opened this issue Sep 16, 2024 · 6 comments
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team

Comments

@carsonip
Copy link
Member

carsonip commented Sep 16, 2024

Elasticsearch Version

confirmed on 8.15.1. Affects current main.

Installed Plugins

No response

Java Version

bundled

OS Version

Linux carson-elastic 6.8.0-44-generic #44-Ubuntu SMP PREEMPT_DYNAMIC Tue Aug 13 13:35:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

2 issues:

  • reroute processor allows - in namespace while it is said to be forbidden in ecs docs, but not in reroute docs
  • substring usage in code gets the first 100 chars (as in bytes, not in unicode chars /unicode code points). This may result in invalid unicode after sanitization. Fleet docs says it should be 100 bytes, not 100 unicode chars, while in reroute doc it just says "100 characters". Clarify whether "100 characters" refers to unicode characters. If so, the code needs to be updated to be consistent with doc.

Steps to Reproduce

PUT _index_template/logs-foo
{
  "data_stream": {},
  "index_patterns": ["logs-foo-*"], 
  "composed_of": ["logs-foo@mappings"],
  "ignore_missing_component_templates": ["logs-foo@mappings"],
  "template": {
    "settings": {
      "index.default_pipeline": "logs-foo@pipeline"
    }
  }, 
  "priority": 250
}

PUT _component_template/logs-foo@mappings
{
 "template": {
   "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        }
      }
    }
 } 
}

PUT _ingest/pipeline/logs-foo@pipeline
{
  "processors": [
    {"reroute": {
      "namespace": ["my-new-namespace"]
    }}
  ]
}

POST logs-foo-default/_doc
{
  "@timestamp": "1970-01-01T00:00:00.000Z"
}

GET logs-foo-*/_search

Resulting doc:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": ".ds-logs-foo-my-new-namespace-2024.09.16-000001",
        "_id": "kfs8-5EB0otqeM1lzk3-",
        "_score": 1,
        "_source": {
          "data_stream": {
            "namespace": "my-new-namespace",
            "type": "logs",
            "dataset": "foo"
          },
          "@timestamp": "1970-01-01T00:00:00.000Z"
        }
      }
    ]
  }
}

Logs (if relevant)

No response

@carsonip carsonip added >bug needs:triage Requires assignment of a team area label labels Sep 16, 2024
@pxsalehi pxsalehi added :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP and removed needs:triage Requires assignment of a team area label labels Sep 17, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Sep 17, 2024
@carsonip
Copy link
Member Author

substring usage in code gets the first 100 chars (as in bytes, not in unicode chars /unicode code points). This may result in invalid unicode after sanitization. Fleet docs says it should be 100 bytes, not 100 unicode chars, while in reroute doc it just says "100 characters". Clarify whether "100 characters" refers to unicode characters. If so, the code needs to be updated to be consistent with doc.

elastic/ecs#2384 now clarifies that it should be 100 bytes, not 100 unicode characters. There is no problem with the existing ES implementation in truncating to 100 bytes.

@felixbarny
Copy link
Member

Seems like there's a discrepancy between the allowed characters for namespace in the data streams RFC (allows -) and the ECS data_stream fields documentation (disallows -). I was basing my implementation of the reroute processor on the RFC. The reason why - is (potentially) allowed in the namespace is that it doesn't cause ambiguity. You can parse the type and namespace portion by splitting the target string by - and extracting the first and second element. The rest is the namespace.

@ebeahan do you remember why there's a discrepancy about whether or not the - character in data_stream.namespace is allowed? Are there any open discussions about this or is something that happened by accident?

@ebeahan
Copy link
Member

ebeahan commented Sep 23, 2024

@ebeahan do you remember why there's a discrepancy about whether or not the - character in data_stream.namespace is allowed? Are there any open discussions about this or is something that happened by accident?

I don't know of any discussion and suspect it's an oversight. The discrepancy appeared in the earliest version of the eventual ECS field descriptions here.

@felixbarny
Copy link
Member

@ebeahan Since it would be a breaking change for the reroute processor to disallow namespaces that have been configured contained a dash, do you think we could change the ECS definition to allow dashes in the namespace?

@ebeahan
Copy link
Member

ebeahan commented Sep 25, 2024

@felixbarny I opened elastic/ecs#2388. Since removing the restriction makes the field more permissive, I don't see it as a breaking ECS change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP Team:Data Management Meta label for data/management team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants