[Pause] Identify integration fields lists that rely on ordering and duplication #10900

Closed
qcorporation opened this issue Aug 27, 2024 · 5 comments
Labels
Integration:All Applies to all integrations [Integration not found in source] Team:Cloud-Monitoring Label for the Cloud Monitoring team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team [elastic/elastic-agent-data-plane] Team:obs-ds-hosted-services Label for the Observability Hosted Services team [elastic/obs-ds-hosted-services] Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team [elastic/obs-infraobs-integrations] Team:Security-Deployment and Devices Deployment and Devices Security team [elastic/sec-deployment-and-devices] Team:Security-Linux Platform Linux Platform Security team [elastic/sec-linux-platform] Team:Security-Service Integrations Security Service Integrations Team [elastic/security-service-integrations] Team:Stack Monitoring Stack Monitoring team [elastic/stack-monitoring]

Comments

@qcorporation

qcorporation commented Aug 27, 2024

Description

With the upcoming LogsDB release, data is stored without the original _source, and this means that arrays can be reordered and de-duplicated. There are fields where order matters to end-users, such as process.args. Some fields do not function if de-duplication occurs within the array.
A similar issue has been created for standard ECS fields to identify and flag fields that might be affected by relying on ordering or deduplication.
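
To make the concern concrete, here is a minimal Python sketch of what can happen to an array when it is rebuilt from indexed values rather than read back from the stored _source. It assumes, purely for illustration, that the rebuilt array comes back sorted with duplicates removed. `rebuild_without_source` is a hypothetical helper and the argument values are made up; neither is an Elasticsearch API or real data.

```python
def rebuild_without_source(values):
    """Model an array rebuilt as a sorted, de-duplicated set (an assumption)."""
    return sorted(set(values))


# process.args-style list: both order and repetition carry meaning.
args = ["tar", "-x", "-f", "backup.tar", "-C", "/tmp", "-v", "-v"]
rebuilt = rebuild_without_source(args)

order_preserved = rebuilt == args
duplicates_preserved = len(rebuilt) == len(args)

# A tag-style list is unaffected as long as it is only read as a set.
tags = ["forwarded", "aws"]
same_as_set = set(rebuild_without_source(tags)) == set(tags)

print(rebuilt)
print(order_preserved, duplicates_preserved, same_as_set)
```

The command line can no longer be reconstructed from `rebuilt`, while the tag list still answers every membership question it could before.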

Our goal is to identify integration fields that rely on ordering or duplication. The integrations that could potentially be affected by this limitation within LogsDB are listed below. Code owners have been added as a checklist; if you are listed below, please:

  1. Update the tracker for your team and confirm that you have reviewed the fields under each of your integrations and signed off that they are not affected by the LogsDB release.
  • Note: Column A within the tracker has a drop-down for CODEOWNERS to mark as complete.
  2. If the lists/arrays are affected by the LogsDB release, work with the integration teams to set array normalization for the affected fields.

@elastic/security-service-integrations

  • abnormal_security
  • akamai
  • anomaly
  • auth0
  • aws
  • azure
  • bbot
  • bitdefender
  • box
  • canva
  • carbon_black_cloud
  • cisco_meraki
  • cisco.secure_endpoint
  • cisco.umbrella
  • cloudflare
  • crowdstrike
  • cybereason
  • darktrace
  • entityanalytics_ad
  • eset
  • f5_big
  • falco
  • forgerock
  • gcp
  • github
  • google_scc
  • google_workspace
  • infoblox
  • jamf_compliance_reporter
  • jumpcloud
  • lastpass
  • m365_defender
  • microsoft_defender_cloud
  • mimecast
  • netskope
  • o365
  • ocsf
  • okta
  • opencti
  • otx
  • panw_cortex
  • ping_one
  • prisma_cloud
  • proofpoint_on_demand
  • proofpoint_tap
  • qualys_vmdr
  • rapid7
  • recordedfuture
  • sentinel_one
  • ses
  • slack
  • snyk
  • spycloud
  • tanium
  • teleport
  • tenable_io
  • tenable_sc
  • threatq
  • ti_crowdstrike
  • trellix_edr
  • trellix_epo
  • trend_micro_vision
  • vectra_detect
  • wiz
  • zeronetworks
  • zscaler_zia
  • zscaler_zpa

@elastic/sec-deployment-and-devices

  • checkpoint
  • cisco.ftd
  • cisco_ise
  • fortinet_fortimail
  • hashicorp_vault
  • iptables
  • modsec
  • netflow
  • panw
  • pfsense
  • rsa
  • sophos
  • stormshield
  • suricata
  • watchguard
  • zeek

@elastic/sec-linux-platform

  • auditd
  • dhcpv4
  • dns
  • memcache
  • rpc
  • sip

@elastic/obs-infraobs-integrations

  • apache.access
  • aws
  • ceph
  • cockroachdb
  • golang
  • haproxy
  • ibmmq
  • influxdb
  • mongodb_atlas
  • nginx.access
  • salesforce

@elastic/obs-ds-hosted-services

  • aws

@elastic/obs-cloudnative-monitoring

  • docker
  • istio
  • kubernetes
  • nginx_ingress_controller

@elastic/stack-monitoring

  • elasticsearch
  • kibana
  • logstash

@elastic/elastic-agent-data-plane

  • log

@elastic/ecosystem

  • package_registry
@mjwolf
Contributor

mjwolf commented Aug 27, 2024

In my work on this so far, I've seen that fields can be grouped into four categories according to whether order and duplication need to be maintained in lists.

  • "None" -- No dependence on order or duplication in lists, traditional "sets".
  • "Strong" dependence -- For fields where the meaning will be lost if order/duplication is changed, for example, process cmdline will be meaningless if the argument order is rearranged.
  • "Weak" dependence -- The order/duplication is important to the actual implementation related to these fields, but is less important to logging. For example, DNS answer records. The order of records can affect how the actual implementation uses the records, but it's less likely to affect logging. Still, log users might implicitly expect that order will match the implementation order. This could probably be addressed with documentation stating not to expect order to be maintained in the field.
  • "Correlated events" -- Some integrations pack multiple events/records into separate lists. For example:
    events:
      id:
        - 1
        - 2
      name:
        - one
        - two
      desc:
        - ABC
        - def
    
    instead of
    events:
      - event:
          id: 1
          name: one
          desc: ABC
    
      - event:
          id: 2
          name: two
          desc: def
    
    Order needs to be maintained so that event contents are not rearranged and meaning is lost. Integrations could be updated to avoid this and remove the requirement that order is maintained in these fields.
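
As a rough illustration of the "Correlated events" case, the sketch below regroups the parallel lists from the example above into one object per event, and then shows how records drift apart if each list is reordered on its own. `regroup` is a hypothetical helper, independent per-list sorting is an assumed worst case, and the `drifting` values are made up so that sorting changes only one column.

```python
def regroup(packed):
    """Turn {"id": [1, 2], "name": [...]} into [{"id": 1, "name": ...}, ...]."""
    keys = list(packed)
    if len({len(packed[k]) for k in keys}) > 1:
        raise ValueError("parallel lists differ in length; cannot align records")
    return [dict(zip(keys, row)) for row in zip(*(packed[k] for k in keys))]


# The packed form from the example above.
packed = {"id": [1, 2], "name": ["one", "two"], "desc": ["ABC", "def"]}
records = regroup(packed)
print(records)

# Hypothetical values: sorting leaves "id" alone but swaps "desc".
drifting = {"id": [1, 2], "desc": ["zeta", "alpha"]}
reordered = {key: sorted(values) for key, values in drifting.items()}
misaligned = regroup(reordered) != regroup(drifting)
print(misaligned)
```

Once each event carries its own id, name, and desc, the meaning no longer depends on list positions lining up across fields.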

I've updated the spreadsheet to have these categories for "Order/dup important" instead of just true/false.

@consulthys
Contributor

If this inventory effort is about LogsDB, does it make sense to investigate integrations that return metrics, which would NOT be stored in LogsDB (but more likely in TSDB)?

Looking at the tracker spreadsheet for @elastic/stack-monitoring, I'm wondering specifically about the following:

  • .elasticsearch.cluster.stats.nodes.versions[]
  • .elasticsearch.cluster.stats.state.nodes.Unf_fjGESWun1vbtRzJK9w.roles[]
  • .elasticsearch.node.roles[]
  • .kibana.task_manager_metrics.metrics.task_claim.value.duration.counts[]
  • .kibana.task_manager_metrics.metrics.task_claim.value.duration.values[]
  • .logstash.node.stats.logstash.pipelines[]

Thanks for your input

@jvalente-salemstate

M365 Defender ( m365_defender.{alert,incident} ) sort of relies on the order, though there are already issues.

The original event contains a list of JSON objects that dot_expander flattens into arrays of values under m365_defender.incident.alert.evidence.* Order must be preserved to determine which evidence item a given value in an array belongs to.

See Alert Evidence under #9050 for some examples. I think this is one of the cases where changing the pipeline would be better. I planned to work on that myself this month but things did not work out.

@qcorporation qcorporation changed the title Identify integration fields lists that rely on ordering and duplication [Pause] Identify integration fields lists that rely on ordering and duplication Sep 4, 2024
@qcorporation
Author

qcorporation commented Sep 4, 2024

@consulthys @jvalente-salemstate, thank you for the feedback and questions on your respective code bases. We've put this work on hold because logsdb development has potentially changed its approach from opt-out to opt-in. By default it will now focus on adoption and minimize breakages, rather than the reverse, which would require more effort from all integration teams to assess ordering and deduplication dependencies.

The tracker will eventually be updated to reflect the decision made, and this issue can potentially be closed if deemed unnecessary.
cc'ing @andrewkroh

@andrewkroh
Member

The approach has changed, and now logsdb will store _source for arrays by default. To get the optimization for array fields that are treated as unordered sets, we can opt-in by setting synthetic_source_keep: "none" in the mapping. Adding this option prevents the array field from being stored in _source.
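
For reference, a minimal sketch of what such an opt-in could look like in a raw index mapping body, written as a Python dict and printed as JSON. The field name `tags`, the `keyword` type, and the `unordered_keyword` helper are assumptions for illustration. Only the `synthetic_source_keep: "none"` setting comes from the comment above, and this does not show how the option would be expressed in an integration's fields.yml.

```python
import json


def unordered_keyword(name):
    """Mapping fragment for a keyword field whose arrays are treated as sets."""
    return {name: {"type": "keyword", "synthetic_source_keep": "none"}}


# "tags" is a hypothetical field name used only for illustration.
mapping = {"mappings": {"properties": unordered_keyword("tags")}}
print(json.dumps(mapping, indent=2))
```

Fields such as process.args would simply not receive this setting, so their arrays keep the default behavior described above.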

Support for this mapping parameter is in progress in Elasticsearch and will first be available in 8.16. But before integrations can use it, package-spec and Fleet need to be updated to allow it in fields.yml. Until those things happen, there won't be anything to do, so let's close this. We'll revisit making the optimizations once support is added in the stack.
