[FEATURE] Support load CSV in PPL (inputlookup or search) #638
Comments
+1 on (Preferred) Upload CSV files to external URL.
I agree with @penghuo that this is a possible security concern. I would propose a different approach:
I hate to be that guy, but I know community members who would want to load the CSV into their index, as well as those who would want to load it into cloud storage. From a priority perspective, the index should come first, as it is the easiest (assuming the analyst has write access to the cluster); dealing with cloud storage introduces permissions friction.
A straightforward solution is allowing the
Yes, I got the priorities. We have the
Support the functionality of loading data from a CSV file.
File location
There are two options for where a CSV file can be stored:

1. Upload CSV files to a Spark local directory, set by the `SPARK_LOCAL_DIRS` environment variable or the `spark.local.dir` config, for example `$SPARK_LOCAL_DIRS/<some_identities>/lookups/test.csv`. But uploading to a local dir could introduce potential security issues, especially if the Spark application runs on a cloud service.
2. (Preferred) Upload CSV files to an external URL, for example `s3://<bucket>/foo/bar/test.csv` or `file:///foo/bar/test.csv`.

PPL syntax
There are also two options to support this feature:
A. Introduce a new command `inputlookup` or `input`:

Usage:
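A hedged sketch of what the new command could look like (the exact syntax and the S3 path are illustrative assumptions, not a settled design; `flights.csv` and `FlightDelay` come from the example in this issue):

```
| inputlookup "s3://<bucket>/lookups/flights.csv"
| where FlightDelay > 500
```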
The `FlightDelay > 500` only works when flights.csv contains a CSV header.

B. Modify the current `search` command to support a file source:

Usage:
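A possible shape for the extended command (hypothetical; the `file =` parameter name and the path are assumptions for illustration):

```
search file = "s3://<bucket>/lookups/flights.csv" FlightDelay > 500
```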
PS: the current `search` command syntax is

Both options A and B could be used in a sub-search: