Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add include_keys and exclude_keys to S3 sink. #3102

Closed
wants to merge 2 commits into from

Conversation

daixba
Copy link
Contributor

@daixba daixba commented Aug 2, 2023

Description

Add include_keys and exclude_keys to S3 sink (via ndjson codec), Also some inprovement and document updates as per PR comments.

Issues Resolved

Resolve #3080

Resolves #2975

Check List

  • New functionality includes testing.
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Aiden Dai <daixb@amazon.com>
Copy link
Member

@dlvenable dlvenable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @daixba for adding the include/exclude keys to S3. In #2989, you made this feature on sinks in general. So the s3 sink now includes include_keys and exclude_keys.

However, as you see in the implementation, the Codecs must implement them.

In this PR we currently have the possibility of:

sink:
- s3:
     include_keys: ["a", "b"]
     codec:
        newline:
          include_keys: ["c", "d"]

We should either:

  • Remove the include_keys/exclude_keys from the core sink model and let sinks handle them according to how they handle the model; or
  • Update this PR to use the include_keys/exclude_keys from the sink model.

@kkondaka
Copy link
Collaborator

kkondaka commented Aug 2, 2023

I was looking at a sink codec recently and I think we have an issue here. Not just "include_keys/exclude_keys" but also with "tagsTargetKey" as well. We do not want both codec AND the sink to do this include/exclude_keys or tags resulting in duplicate data. Which means there should be one config (at the sink level) and a codec must handle these options. IF codec is not configured, only then Sink should take this configuration.

@daixba
Copy link
Contributor Author

daixba commented Aug 2, 2023

When I first submitted the PR for adding include_keys and exclude_keys, most of the output codec haven't been implemented yet, it's easy to change the codec definition to use the sink context. But right now, all output codec have its own implementation of exclude_keys via config. Based on the current situation, what I submitted is the minmal change.

@daixba
Copy link
Contributor Author

daixba commented Aug 3, 2023

I raised the same at first in #2989 (comment) which passed the codec the whole sink context via something like

void start(OutputStream outputStream, SinkContext sinkContext) throws IOException;

So each codec don't need to add those duplicated configurations again. But right now I can't tell if this is still the right approach. I am not sure if we plan to add the support of both included_keys and excluded_keys in every codec or not, so far I saw many of them has support of excluded_keys.

Based on the current situation, the minimal change is as per this PR and then raise another one to move the include_keys and exclude_keys out from sinkcontext and add them into the index configuration (similar to document root key) for OpenSearch sink, in this setting, most of the output codecs don't have to change.

Please let me know the decision here. Thanks.

@dlvenable
Copy link
Member

I think we should aim for a configuration which looks like the following:

sink:
- s3:
     include_keys: ["a", "b"]
     codec:
        newline:

Not just "include_keys/exclude_keys" but also with "tagsTargetKey" as well. We do not want both codec AND the sink to do this include/exclude_keys or tags resulting in duplicate data.

I think the sinks which use codecs should determine the values here, but not serialize them. This goes in line with the idea of an OutputCodecContext interface.

So the sink would be responsible for deciding:

  • Which keys are included or excluded
  • Which tags are added

The codec has to be responsible for serialization. But, it would just do as the sink tells it.

Would this address your concern @kkondaka ?

@daixba,

I do think in that PR we were discussing the idea of a context or options. We just need to nail the name down. But, the concept is applicable regardless.

interface OutputCodecContext {
  List<String> getIncludeKeys();
  List<String> getExcludeKeys();
  List<String> getTagsTargetKey();

Then on codec:

void writeEvent(Event event, OutputStream outputStream, OutputCodecContext outputCodecContext) throws IOException;

My point in the PR was that the codec should not be bound to the SinkContext. So the concept is similar, but it should have its own class. These may vary independently of each other.

@dlvenable
Copy link
Member

@daixba , To elaborate on the reason for a new context some.

The SinkContext allows Data Prepper core to provide some context to sinks.

The OutputCodecContext would be a mechanism to allow sinks to provide additional context to codecs.

Again, the context passed between these two may not always be the same. It might be the case that the sink needs to tell the codec more. We wouldn't want to add that to the SinkContext. Similarly, there is information in the SinkContext which is not important to the codecs, such as the routes.

@daixba
Copy link
Contributor Author

daixba commented Aug 9, 2023

I have submitted a new PR #3122

Please ignore this one, once that one is resolved, this one will be closed.

Thanks.

@dlvenable
Copy link
Member

@daixba , Are you good to close this PR?

@daixba daixba closed this Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PR feedback improvements from #2989 Add include_keys and exclude_keys options under the sink
3 participants