Add include_keys and exclude_keys to S3 sink. #3102
Signed-off-by: Aiden Dai <daixb@amazon.com>
Thanks @daixba for adding the include/exclude keys to S3. In #2989, you made this feature available on sinks in general, so the s3 sink now includes include_keys and exclude_keys.
However, as you see in the implementation, the Codecs must implement them.
In this PR we currently have the possibility of:
    sink:
      - s3:
          include_keys: ["a", "b"]
          codec:
            newline:
              include_keys: ["c", "d"]
We should either:
- Remove the include_keys/exclude_keys from the core sink model and let sinks handle them according to how they handle the model; or
- Update this PR to use the include_keys/exclude_keys from the sink model.
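To make the ambiguity concrete, here is a minimal sketch of what could happen if both levels of configuration were applied in sequence. The `KeyFilter` class and its `filter` method are hypothetical and not part of Data Prepper, and the filtering semantics (a non-empty include list keeps only the listed keys, an exclude list drops them) are an assumption for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Data Prepper class.
public class KeyFilter {
    // Assumed semantics: a non-empty include list keeps only those keys;
    // any key that appears in the exclude list is dropped.
    public static Map<String, Object> filter(Map<String, Object> event,
                                             List<String> includeKeys,
                                             List<String> excludeKeys) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : event.entrySet()) {
            boolean included = includeKeys.isEmpty() || includeKeys.contains(entry.getKey());
            boolean excluded = excludeKeys.contains(entry.getKey());
            if (included && !excluded) {
                out.put(entry.getKey(), entry.getValue());
            }
        }
        return out;
    }
}
```

Under these assumptions, filtering an event with keys a, b, c, d by the sink-level list ["a", "b"] and then by the codec-level list ["c", "d"] leaves an empty event, which is why the two settings should not coexist.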
I was looking at a sink codec recently and I think we have an issue here, not just with "include_keys/exclude_keys" but with "tagsTargetKey" as well. We do not want both the codec AND the sink to apply include/exclude_keys or tags, because that results in duplicate data. That means there should be one config (at the sink level), and a codec must handle these options. Only if a codec is not configured should the sink apply this configuration itself.
When I first submitted the PR for adding include_keys and exclude_keys, most of the output codecs hadn't been implemented yet, so it was easy to change the codec definition to use the sink context. But right now, every output codec has its own implementation of exclude_keys via config. Given the current situation, what I submitted is the minimal change.
I raised the same at first in #2989 (comment) which passed the codec the whole sink context via something like
So each codec doesn't need to add those duplicated configurations again. But right now I can't tell if this is still the right approach. I am not sure whether we plan to support both include_keys and exclude_keys in every codec; so far I have seen that many of them support exclude_keys. Given the current situation, the minimal change is this PR, followed by another PR that moves include_keys and exclude_keys out of the sink context and into the index configuration (similar to the document root key) for the OpenSearch sink. With that setup, most of the output codecs don't have to change. Please let me know the decision here. Thanks.
I think we should aim for a configuration which looks like the following:
I think the sinks which use codecs should determine the values here, but not serialize them. This goes in line with the idea of an So the sink would be responsible for deciding:
The codec has to be responsible for serialization, but it would just do as the sink tells it. Would this address your concern, @kkondaka? I do think in that PR we were discussing the idea of a context or options. We just need to nail the name down, but the concept is applicable regardless.
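The split of responsibilities described here could be sketched roughly as follows. Every name in it (`CodecContextSketch`, `CodecContext`, `OutputCodec`, `NewlineCodec`) is hypothetical and chosen only for illustration, since the actual name is still to be decided. The JSON writing is deliberately naive (string values only, no escaping) to keep the example self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class CodecContextSketch {
    // Hypothetical options object: the sink decides the values and hands them to the codec.
    static final class CodecContext {
        final List<String> includeKeys;
        final List<String> excludeKeys;

        CodecContext(List<String> includeKeys, List<String> excludeKeys) {
            this.includeKeys = includeKeys;
            this.excludeKeys = excludeKeys;
        }
    }

    // Hypothetical codec contract: no key settings of its own, only what the sink passes in.
    interface OutputCodec {
        String encode(Map<String, String> event, CodecContext context);
    }

    // Writes one JSON object per line; simplified to string values without escaping.
    static final class NewlineCodec implements OutputCodec {
        @Override
        public String encode(Map<String, String> event, CodecContext context) {
            StringJoiner json = new StringJoiner(",", "{", "}\n");
            for (Map.Entry<String, String> entry : event.entrySet()) {
                boolean included = context.includeKeys.isEmpty()
                        || context.includeKeys.contains(entry.getKey());
                if (included && !context.excludeKeys.contains(entry.getKey())) {
                    json.add("\"" + entry.getKey() + "\":\"" + entry.getValue() + "\"");
                }
            }
            return json.toString();
        }
    }
}
```

In this arrangement the user configures include_keys and exclude_keys once, on the sink. The sink builds the context and the codec only serializes what it is told to.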
Then on codec:
My point in the PR was that the codec should not be bound to the
@daixba, to elaborate on the reason for a new context some. The The Again, the context passed between these two may not always be the same. It might be the case that the sink needs to tell the codec more. We wouldn't want to add that to the
I have submitted a new PR, #3122. Please ignore this one; once that one is resolved, this one will be closed. Thanks.
@daixba, are you good to close this PR?
Description
Add include_keys and exclude_keys to the S3 sink (via the ndjson codec). Also includes some improvements and documentation updates as per PR comments.
Issues Resolved
Resolves #3080
Resolves #2975
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.