Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect validation of documents with arrays of strings in multi-fields #1971

Closed
2 tasks
jsoriano opened this issue Jul 12, 2024 · 18 comments · Fixed by #2035
Closed
2 tasks

Incorrect validation of documents with arrays of strings in multi-fields #1971

jsoriano opened this issue Jul 12, 2024 · 18 comments · Fixed by #2035
Assignees
Labels
Team:Ecosystem Label for the Packages Ecosystem team

Comments

@jsoriano
Copy link
Member

jsoriano commented Jul 12, 2024

In elastic/integrations#10353 (comment) we found documents that were causing issues about unexpected arrays of objects, but the value is a valid array of strings. An array of strings should be considered as valid if defined as a string type.

The issue seems to happen with multi-fields and synthetic source, what seems to be frequent when logsdb mode is used.

For example:

          "user_agent.name.text": [
            "Firefox"
          ],

Seems to produce this error:

[1] field "user_agent.name.text" is used as array of objects, expected explicit definition with type group or nested

Workaround in the meantime

As workaround in the meantime, package developers need to import the definition for these fields from ECS using external: ecs.

Tasks

Tasks (see #1971 (comment))

@kpollich kpollich added the Team:Ecosystem Label for the Packages Ecosystem team label Jul 12, 2024
@felixbarny
Copy link
Member

          "user_agent.name.text": [
            "Firefox"
          ],

Is that a snippet from _source or fields?

@jsoriano
Copy link
Member Author

When synthetic is enabled elastic-package gets the fields from fields.

@felixbarny
Copy link
Member

IINM, fields always returns an array, even if the source document was sent with a scalar value.
Do we have to adjust the assertions accordingly when fetching from fields rather than _source?

Also, maybe we should consistently have the same behavior regardless of whether synthetic _source is used and always fetch from fields.

@jsoriano
Copy link
Member Author

The error is a bit misleading. It really comes from user_agent.name.text not having a definition. I wonder if logsdb is always adding the multifield text to keywords?

@felixbarny
Copy link
Member

No, this isn't coming from LogsDB. I suspect this is coming from the ecs@mappings component template.

@jsoriano
Copy link
Member Author

Ah yes, could be, another difference on the failing build is that all packages are tested against 8.15.0-SNAPSHOT, and then they use ecs@mappings.

@jsoriano
Copy link
Member Author

jsoriano commented Aug 8, 2024

Yes, I can confirm the problem is in the ecs@mappings component template.

It defines a number of keyword fields with the .text multifield. But not all these multifields are defined in ECS. For example user_agent.name will match the following rule in ecs@mappings, that will add the .text field:

        {
          "ecs_path_match_keyword_and_match_only_text": {
            "mapping": {
              "fields": {
                "text": {
                  "type": "match_only_text"
                }
              },
              "type": "keyword"
            },
            "path_match": [
              "*.title",
              "*.executable",
              "*.name",
              "*.working_directory",
              "*.full_name",
              "*file.path",
              "*file.target_path",
              "*os.full",
              "email.subject",
              "vulnerability.description",
              "user_agent.original"
            ],
            "unmatch_mapping_type": "object"
          }
        },

But the multifield is not defined in ECS:

user_agent.name:
  dashed_name: user-agent-name
  description: Name of the user agent.
  example: Safari
  flat_name: user_agent.name
  ignore_above: 1024
  level: extended
  name: name
  normalize: []
  short: Name of the user agent.
  type: keyword

So I think we should:

  • Split the rule in ecs@mappings so the fields without multifield don't get the .text multifield.
  • Add a test that loops over all fields defined in ECS to check that ecs@mappings produces equivalent mappings.
  • As workaround in the meantime, package developers need to import the definition for these fields from ECS using external: ecs.

@jsoriano
Copy link
Member Author

jsoriano commented Aug 8, 2024

Updated description

@jsoriano
Copy link
Member Author

jsoriano commented Aug 9, 2024

We have found that some cases are caused by an issue on how we handle definitions of nested fields: elastic/package-spec#784

@mrodm
Copy link
Contributor

mrodm commented Aug 9, 2024

There are some errors that look like they come from when validating fields in transforms. Currently, that validation always runs with normalization:

fieldsValidator, err := fields.CreateValidatorForDirectory(transformRootPath,
fields.WithSpecVersion(pkgManifest.SpecVersion),
fields.WithNumericKeywordFields(config.NumericKeywordFields),
fields.WithEnabledImportAllECSSChema(true),
)

Example of errors found in ti_anomali package:

test case failed: one or more errors found in documents stored in logs-ti_anomali.threatstream-77107 data stream: [0] field "event.category" is not normalized as expected: expected array, found "threat" (string)
[1] field "event.type" is not normalized as expected: expected array, found "indicator" (string)
[2] field "labels.is_ioc_transform_source" is undefined

Those error validating for event.type and event.category fields are raising when checking transforms. If checkTransforms method is commented out in the system tester, validation for those fields run successfully.

As a note, labels.is_ioc_transform_source field still fails even with this fix. Probably, some missing mapping/field definition in transform? Why is this field labels.is_ioc_transform_source not failing with logsDB disabled?

Created PR to disable normalization when synthetics is enabled when validating fields in transforms: #2011

@felixbarny
Copy link
Member

For the ecs@mappings component template, we've made a conscious choice to include a path match for *.name that adds a .text subfield. IIRC, the reasons included making the component template smaller, more consistent, more generic, and more future-proof, in the sense that we don't need to constantly add new field definitions for new *.name fields.

OTOH, one could argue that we're adding unnecessary overhead by adding a text multi-field to fields that don't require one. And deciding whether or not the tradeoff to add a subfield is worth should be made in the ECS repo.

Maybe, what we should have done differently, is to actually propose changing ECS itself to be more consistent so that all *.name fields have a .text sub-field. This would prevent questions like this where ecs@mappings is seemingly not ECS compliant. cc @eyalkoren @ruflin wdyt?

Add a test that loops over all fields defined in ECS to check that ecs@mappings produces equivalent mappings.

Such a test already exists: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/stack/src/javaRestTest/java/org/elasticsearch/xpack/stack/EcsDynamicTemplatesIT.java

However, that test doesn't fail if there are additional multi-fields compared to what's defined in ECS: https://github.com/elastic/elasticsearch/blob/b7b1872dfac518a36fad97cf824b2348ed815ff7/x-pack/plugin/stack/src/javaRestTest/java/org/elasticsearch/xpack/stack/EcsDynamicTemplatesIT.java#L454-L456

As workaround in the meantime, package developers need to import the definition for these fields from ECS using external: ecs.

Could we instead have a similar logic in the tests that allows additional .text subfields without failing the test?

@mrodm
Copy link
Contributor

mrodm commented Aug 12, 2024

As a note, labels.is_ioc_transform_source field still fails even with this fix. Probably, some missing mapping/field definition in transform? Why is this field labels.is_ioc_transform_source not failing with logsDB disabled?

Tested to run the same Elastic stack (8.15.0-SNAPSHOT) with and without enabling LogsDB , and validation for this field just fails with logsdb when checking the documents from the transform.

Checking the documents returned in elastic-package by the preview api, that field just appears when enabling logsdb. So according to that, it is expected, that this is not failing when logsdb is disabled (that field is not present in the doc).

I don't know the cause those fields are just written with logsdb enabled. But it looks like packages should add that field definition in the transform fields folder.

@jsoriano
Copy link
Member Author

Maybe, what we should have done differently, is to actually propose changing ECS itself to be more consistent so that all *.name fields have a .text sub-field. This would prevent questions like this where ecs@mappings is seemingly not ECS compliant. cc @eyalkoren @ruflin wdyt?

+1, I think we should have consistency on this. Otherwise this question will re-appear from time to time.

Such a test already exists

Ah, it looks great. As it was not detecting this case I thought there was not such a test.

As workaround in the meantime, package developers need to import the definition for these fields from ECS using external: ecs.

Could we instead have a similar logic in the tests that allows additional .text subfields without failing the test?

We could, though I still think we should try to find consistency between ECS and ecs@mappings.

@felixbarny
Copy link
Member

We could, though I still think we should try to find consistency between ECS and ecs@mappings.

I've created an ECS issue to discuss how we should best resolve the discrepancy: elastic/ecs#2353

@eyalkoren
Copy link

The way I see it, a mapping is "ECS-compliant" if all fields that it contains that are also defined in ECS get the correct mapping according to ECS. So mapping user_agent.name to keyword makes it compliant for this field.
However, a mapping containing fields that are not defined in ECS doesn't make it non-ECS-compliant. Since subfields are essentially just additional fields, this should apply to user_agent.name.text as well.
I believe this was the logic that allowed us to generalize the *.name fields and what stands behind this comment in the test.

Therefore, what I think is:

  1. generalizing ECS for all *.name definitely makes sense, but it's not a must
  2. tests that validate ECS-compliance should not fail due to having fields not defined in ECS, including subfields

@mrodm
Copy link
Contributor

mrodm commented Aug 13, 2024

As a note, labels.is_ioc_transform_source field still fails even with this fix. Probably, some missing mapping/field definition in transform? Why is this field labels.is_ioc_transform_source not failing with logsDB disabled?

Tested to run the same Elastic stack (8.15.0-SNAPSHOT) with and without enabling LogsDB , and validation for this field just fails with logsdb when checking the documents from the transform.

Checking the documents returned in elastic-package by the preview api, that field just appears when enabling logsdb. So according to that, it is expected, that this is not failing when logsdb is disabled (that field is not present in the doc).

I don't know the cause those fields are just written with logsdb enabled. But it looks like packages should add that field definition in the transform fields folder.

Doing a local test, that labels.is_ioc_transform_source field appears just under fields and not under _source in the Elasticsearch documents (probably because it is a constant_keyword?).

If synthetics is not enabled during system tests, that explains why it does not fail in that scenario because elastic-package uses in those cases the _source to retrieve the fields to run the validation checks (and that field is not present there).

For scenarios with synthetic enabled, that validation would fail (that field is present in the documents) if the field definition is not present in the package (data streams and transforms).

@felixbarny
Copy link
Member

I want to emphasize again that I really think we should always perform the same validation logic, based on fields whether or not synthetic _source is enabled. This simplifies the code paths and we have a consistent validation logic which fewer surprises when changing to synthetic _source.

@jsoriano
Copy link
Member Author

I want to emphasize again that I really think we should always perform the same validation logic, based on fields whether or not synthetic _source is enabled. This simplifies the code paths and we have a consistent validation logic which fewer surprises when changing to synthetic _source.

The problem is that we cannot perform the same validation if presence of fields in _source or fields is different depending on the configurations. In any case we have been discussing about this and created an issue in this line #2016.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Ecosystem Label for the Packages Ecosystem team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants