[ES|QL] Return data streams and indices that have data for a query #122122

dgieselaar · 2025-02-08T11:52:01Z

Description

What:

An API endpoint, or a command that can be appended to any query, that returns all the data streams and indices that have data for a given query (+ DSL filter), as fast as possible.

Why:

In many places in Kibana, we have "has data" calls: requests where we only care whether there is any data from a specific source in a given time frame, with a set of filters, and so on. These calls often drive UX choices, like showing an onboarding screen. We usually use terminate_after: 1 and/or timeout: 1ms, which works reasonably well in this case. However, ES|QL (to my knowledge) does not have an equivalent, so we cannot do this for ES|QL queries.
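For reference, the Query DSL pattern described above can be sketched as follows. This is a minimal illustration, not Kibana's actual code: the helper names `buildHasDataRequest` and `hasData` are made up for this example, and only `size`, `terminate_after`, `timeout` and `track_total_hits` are real search parameters.

```typescript
// Sketch of a "has data" search request. Helper names are hypothetical.
type QueryDsl = Record<string, unknown>;

interface HasDataRequest {
  index: string;
  size: number;
  terminate_after: number;
  timeout?: string;
  track_total_hits: number;
  query: QueryDsl;
}

interface SearchResponseLike {
  // hits.total is an object by default, or a number with rest_total_hits_as_int
  hits: { total: number | { value: number; relation: string } };
}

function buildHasDataRequest(index: string, query: QueryDsl, timeout?: string): HasDataRequest {
  const request: HasDataRequest = {
    index,
    size: 0, // we only need to know whether anything matches, not the hits
    terminate_after: 1, // stop collecting on each shard after the first match
    track_total_hits: 1, // counting beyond one hit is unnecessary
    query,
  };
  if (timeout !== undefined) {
    request.timeout = timeout; // e.g. '1ms', as mentioned above
  }
  return request;
}

function hasData(response: SearchResponseLike): boolean {
  const total = response.hits.total;
  return (typeof total === 'number' ? total : total.value) > 0;
}
```

An ES|QL equivalent would need the same "stop at the first match" behaviour, which is what this issue asks for.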

Another reason to do this is to determine the relevance of things like queries, visualizations, rules and dashboards to a subset of the data. As an example, we can get all the panels that have data for a specific host by extracting queries from an asset (like a visualization), and combining them with a filter like { "terms": { "host.name": ["my-host"] } }. We can then execute this combined query, and if there is any data, we consider the asset to be relevant to the given filter (that is, to the entity it describes).
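The combination step could look roughly like this for Query DSL. The function name `withEntityFilter` is hypothetical; the only assumption about Elasticsearch itself is the standard `bool` and `terms` query syntax, where `terms` expects an array of values.

```typescript
// Sketch: scope an asset's query to an entity by wrapping both in a bool filter.
type QueryDsl = Record<string, unknown>;

function withEntityFilter(assetQuery: QueryDsl, field: string, values: string[]): QueryDsl {
  return {
    bool: {
      // filter clauses do not affect scoring, which is irrelevant for "has data" checks
      filter: [assetQuery, { terms: { [field]: values } }],
    },
  };
}

// Example: scope a visualization's query to a single host.
const scoped = withEntityFilter(
  { term: { 'service.name': 'checkout' } },
  'host.name',
  ['my-host']
);
```

If the scoped query returns any data, the asset is treated as relevant to the entity.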

It is also useful to get the actual indices or data streams that match the query: for instance, to give better autocomplete suggestions, or to tell the users what data sources are available for a given entity.

Projects that we expect to need this feature:

Streams (to suggest visualizations and dashboards to attach to a Stream)
RCA (to suggest visualizations, dashboards, queries, etc. for a given entity)

How:

The ideal outcome would be an endpoint that takes a set of queries and, for each query, returns the data streams and indices that have data for it. This allows us to get this data for a large number of queries, and Elasticsearch can optimize the operation, for instance by sharing field caps calls.
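To make the request concrete, here is one possible shape for such an endpoint from the client's point of view. Everything in this sketch is hypothetical: no such endpoint exists in Elasticsearch today, and the field names (`queries`, `id`, `data_streams`, `indices`) are placeholders rather than a proposed contract.

```typescript
// HYPOTHETICAL request/response shapes for a batch "which data sources have data" endpoint.
interface DataSourcesRequest {
  queries: Array<{
    id: string; // caller-chosen id, e.g. a visualization or rule id
    query: string; // the ES|QL query
    filter?: Record<string, unknown>; // DSL filter applied alongside the query
  }>;
}

interface DataSourcesResponse {
  results: Array<{
    id: string;
    data_streams: string[];
    indices: string[];
  }>;
}

// Returns the ids of the queries that matched data in at least one source.
function idsWithData(response: DataSourcesResponse): string[] {
  return response.results
    .filter((result) => result.data_streams.length > 0 || result.indices.length > 0)
    .map((result) => result.id);
}
```

A caller such as Streams or RCA would send one entry per asset query and keep only the assets whose ids come back from `idsWithData`.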

An alternative would be to have a command that can be attached to any query that returns just the data sources, and not the actual result of the query. This probably feels a little weird but might be useful as a user-facing feature.

Note: this does not need to be perfect. E.g. for the following query:

FROM traces-apm*
	| STATS root_transaction_name = TOP(transaction.name, 1, "DESC"), has_slow_spans = COUNT() WHERE span.duration.us > 1000000 BY trace.id
	| WHERE has_slow_spans > 0
	| STATS BY root_transaction_name

ES potentially needs to evaluate all the data to determine whether there are hits. In this case, it's fine to exit early and just assume there is data. Ideally, it would understand that for this to match, span.duration.us > 1000000 needs to be true for at least one document, but I can imagine things get very complicated at that point.


elasticsearchmachine · 2025-02-08T11:52:25Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

dgieselaar added the :Analytics/ES|QL, >enhancement and needs:triage labels on Feb 8, 2025

elasticsearchmachine added the Team:Analytics label and removed the needs:triage label on Feb 8, 2025
