Support for data normalization in arrays #4291

guodonglai · 2024-03-15T22:40:49Z

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. It would be nice to have [...]

Hi, I have a question about ingestion pipeline: is there a standard way to process arrays of data in ingestion processor that is recommended by data prepper? for example, if I have a {‘category’: [‘a’, ‘b’, ‘c’]}, how can I break elements in the array down?

Describe the solution you'd like
A clear and concise description of what you want to happen.

in the example {‘category’: [‘a’, ‘b’, ‘c’]}, ideally it can be broken down like {‘category’: ‘a’,‘category’: ‘b’,‘category’: ‘c’ }?

Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

common use case is the transition from NoSql database arrays to Sql database to support table joins in data warehousing.

oeyh · 2024-03-19T19:58:05Z

A few clarification questions:

{‘category’: ‘a’, ‘category’: ‘b’, ‘category’: ‘c’ } is not a valid json object or a valid map. You cannot have multiple keys with the same name. Do you want an array like this instead?

[{"category": "a"}, {"category": "b"}, {"category": "c"}]

Can you expand more about the use case for my understanding?

common use case is the transition from NoSql database arrays to Sql database to support table joins in data warehousing.

Curious to know how you use Data Prepper in this case?

oeyh · 2024-03-19T20:00:21Z

We do support converting from:

[{"category": "a"}, {"category": "b"}, {"category": "c"}]

to

{"category": ["a", "b", "c"]}

with list_to_map processor.

We currently don't support the opposite. But if that is what you are looking for, we can probably enhance map_to_list processor to do that.

guodonglai · 2024-03-19T21:34:28Z

A few clarification questions:

{‘category’: ‘a’, ‘category’: ‘b’, ‘category’: ‘c’ } is not a valid json object or a valid map. You cannot have multiple keys with the same name. Do you want an array like this instead?
[{"category": "a"}, {"category": "b"}, {"category": "c"}]
Can you expand more about the use case for my understanding?

common use case is the transition from NoSql database arrays to Sql database to support table joins in data warehousing.

Curious to know how you use Data Prepper in this case?

Yes that's correct

guodonglai · 2024-03-19T21:38:20Z

We do support converting from:
[{"category": "a"}, {"category": "b"}, {"category": "c"}]
to
{"category": ["a", "b", "c"]}
with list_to_map processor.

We currently don't support the opposite. But if that is what you are looking for, we can probably enhance map_to_list processor to do that.

Yes that's exactly what I am looking for. This use case is more for streaming NoSQL arrays into row based SQL databases for data warehousing. In the end, once we got

[{"category": "a"}, {"category": "b"}, {"category": "c"}]

we will be likely be consuming them as separate rows in a SQL database i.e,

itemId category
A.         a
A          b
A          c

Is there a way in data prepper to perform the for_each function or similar? What is your best recommendation for our use case?

oeyh · 2024-03-20T15:01:52Z

we will be likely be consuming them as separate rows in a SQL database i.e,
itemId category
A a
A b
A c

Thanks for clarifying! This make me think maybe what you need is to split the original event with data {"category": ["a", "b", "c"]} to separated events: {"category": "a"}, {"category": "b"}, and {"category": "c"}. Processors like this would likely work:

  processor:
    - add_entries:
        entries:
          - key: category
            value_expression: join(/category)
            overwrite_if_key_exists: true
    - split_event:
        field: category
        delimiter: ","

join function will turn [a, b, c] into a string "a,b,c" and then split_event will separate them into 3 events. join and split_event will both be available in the upcoming 2.7 release.

This use case is more for streaming NoSQL arrays into row based SQL databases for data warehousing

Still wondering how you plan to use Data Prepper for it. What's the pipeline source and sink here?

guodonglai · 2024-03-20T17:28:47Z

Source will be DynamoDB, sink will be S3 bucket for datalake. What is the output of your proposed processor in the json format? 3 rows in S3 file eventually like the following?

{"category": "a"}
{"category": "b"}
{"category": "c"}

or a single row

[{"category": "a"}, {"category": "b"}, {"category": "c"}]

oeyh · 2024-03-20T18:22:51Z

What is the output of your proposed processor in the json format?

Both are possible. That will depend on the codec used on the s3 sink (ndjson vs json). See this: https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/#codec

It's probably clearer if I include another field in the event, for example, assume the input event has this data:

{"category": ["a", "b", "c"], "itemId": "A"}

With the potential map_to_list method, you will get one output event:

{"category_list": [{"category": "a"}, {"category": "b"}, {"category": "c"}], "itemId": "A"}

With the split_event method, you will get three events instead:

{"category": "a", "itemId": "A"}
{"category": "b", "itemId": "A"}
{"category": "c", "itemId": "A"}

The latter seems to make more sense in your case. What do you think?

guodonglai · 2024-03-20T18:52:48Z

Okay, split_event seems more applicable to my use case, in your example
{"category": "a", "itemId": "A"}
{"category": "b", "itemId": "A"}
{"category": "c", "itemId": "A"},
is it with ndjson or json?

Will split_event and map_to_list all be released in upcoming 2.7? When will it be? I am more than happy to try them out.

oeyh · 2024-03-21T01:32:33Z

In that form it's ndjson. 2.7 release will be coming in 1-2 weeks.

The map_to_list method I mentioned will require code changes, no ETA for that.

guodonglai · 2024-03-21T01:54:06Z

Cool thanks, will give split_event in version 2.7 a try and come back with result. Do you happen to know when this change will be adopted in AWS open search ingestion?

guodonglai added the untriaged label Mar 15, 2024

dlvenable added question Further information is requested and removed untriaged labels Mar 19, 2024

dlvenable added the enhancement New feature or request label Mar 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for data normalization in arrays #4291

Support for data normalization in arrays #4291

guodonglai commented Mar 15, 2024

oeyh commented Mar 19, 2024

oeyh commented Mar 19, 2024 •

edited

Loading

guodonglai commented Mar 19, 2024

guodonglai commented Mar 19, 2024 •

edited

Loading

oeyh commented Mar 20, 2024 •

edited

Loading

guodonglai commented Mar 20, 2024

oeyh commented Mar 20, 2024

guodonglai commented Mar 20, 2024 •

edited

Loading

oeyh commented Mar 21, 2024

guodonglai commented Mar 21, 2024

Support for data normalization in arrays #4291

Support for data normalization in arrays #4291

Comments

guodonglai commented Mar 15, 2024

oeyh commented Mar 19, 2024

oeyh commented Mar 19, 2024 • edited Loading

guodonglai commented Mar 19, 2024

guodonglai commented Mar 19, 2024 • edited Loading

oeyh commented Mar 20, 2024 • edited Loading

guodonglai commented Mar 20, 2024

oeyh commented Mar 20, 2024

guodonglai commented Mar 20, 2024 • edited Loading

oeyh commented Mar 21, 2024

guodonglai commented Mar 21, 2024

oeyh commented Mar 19, 2024 •

edited

Loading

guodonglai commented Mar 19, 2024 •

edited

Loading

oeyh commented Mar 20, 2024 •

edited

Loading

guodonglai commented Mar 20, 2024 •

edited

Loading