From f87b93cf6072e1383e873ee504c71b815c250302 Mon Sep 17 00:00:00 2001
From: David Venable
Date: Tue, 22 Aug 2023 15:37:34 -0500
Subject: [PATCH] Updates documentation for the Avro codec and S3 sink.
 Resolves #3162.

Signed-off-by: David Venable
---
 data-prepper-plugins/avro-codecs/README.md | 87 +++-------------------
 data-prepper-plugins/s3-sink/README.md     | 78 ++-----------------
 2 files changed, 14 insertions(+), 151 deletions(-)

diff --git a/data-prepper-plugins/avro-codecs/README.md b/data-prepper-plugins/avro-codecs/README.md
index 7ed68e95d6..3bf678ffe7 100644
--- a/data-prepper-plugins/avro-codecs/README.md
+++ b/data-prepper-plugins/avro-codecs/README.md
@@ -1,89 +1,20 @@
-# Avro Sink/Output Codec
+# Avro codecs
 
-This is an implementation of Avro Sink Codec that parses the Data Prepper Events into Avro records and writes them into the underlying OutputStream.
+This project provides [Apache Avro](https://avro.apache.org/) support for Data Prepper. It includes an input codec, an output codec, and common libraries which can be used by other projects using Avro.
 
-## Usages
+## Usage
 
-Avro Output Codec can be configured with sink plugins (e.g. S3 Sink) in the Pipeline file.
+For usage information, see the Data Prepper documentation:
 
-## Configuration Options
+* [S3 source](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/)
+* [S3 sink](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/)
 
-```
-pipeline:
-  ...
-  sink:
-    - s3:
-        aws:
-          region: us-east-1
-          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
-          sts_header_overrides:
-        max_retries: 5
-        bucket: bucket_name
-        object_key:
-          path_prefix: vpc-flow-logs/%{yyyy}/%{MM}/%{dd}/
-        threshold:
-          event_count: 2000
-          maximum_size: 50mb
-          event_collect_timeout: 15s
-        codec:
-          avro:
-            schema: >
-              {
-                "type" : "record",
-                "namespace" : "org.opensearch.dataprepper.examples",
-                "name" : "VpcFlowLog",
-                "fields" : [
-                  { "name" : "version", "type" : ["null", "string"]},
-                  { "name" : "srcport", "type": ["null", "int"]},
-                  { "name" : "dstport", "type": ["null", "int"]},
-                  { "name" : "accountId", "type" : ["null", "string"]},
-                  { "name" : "interfaceId", "type" : ["null", "string"]},
-                  { "name" : "srcaddr", "type" : ["null", "string"]},
-                  { "name" : "dstaddr", "type" : ["null", "string"]},
-                  { "name" : "start", "type": ["null", "int"]},
-                  { "name" : "end", "type": ["null", "int"]},
-                  { "name" : "protocol", "type": ["null", "int"]},
-                  { "name" : "packets", "type": ["null", "int"]},
-                  { "name" : "bytes", "type": ["null", "int"]},
-                  { "name" : "action", "type": ["null", "string"]},
-                  { "name" : "logStatus", "type" : ["null", "string"]}
-                ]
-              }
-            exclude_keys:
-              - s3
-        buffer_type: in_memory
-```
-
-## AWS Configuration
-
-### Codec Configuration:
-
-1) `schema`: A json string that user can provide in the yaml file itself. The codec parses schema object from this schema string.
-2) `exclude_keys`: Those keys of the events that the user wants to exclude while converting them to avro records.
-
-### Note:
-
-1) User can provide only one schema at a time i.e. through either of the ways provided in codec config.
-2) If the user wants the tags to be a part of the resultant Avro Data and has given `tagsTargetKey` in the config file, the user also has to modify the schema to accommodate the tags.
-Another field has to be provided in the `schema.json` file:
-
-   `{
-        "name": "yourTagsTargetKey",
-        "type": { "type": "array",
-        "items": "string"
-   }`
-3) If the user doesn't provide any schema, the codec will auto-generate schema from the first event in the buffer.
 
 ## Developer Guide
 
-This plugin is compatible with Java 11. See below
-
-- [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md)
-- [monitoring](https://github.com/opensearch-project/data-prepper/blob/main/docs/monitoring.md)
+See the [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md) guide for general information on contributions.
 
 The integration tests for this plugin do not run as part of the Data Prepper build.
+They are included only with the S3 source or S3 sink for now.
 
-The following command runs the integration tests:
-
-```
-./gradlew :data-prepper-plugins:s3-sink:integrationTest -Dtests.s3sink.region= -Dtests.s3sink.bucket=
-```
+See the README files for those projects for information on running those tests.
diff --git a/data-prepper-plugins/s3-sink/README.md b/data-prepper-plugins/s3-sink/README.md
index 92463d1610..e7e5bd1f53 100644
--- a/data-prepper-plugins/s3-sink/README.md
+++ b/data-prepper-plugins/s3-sink/README.md
@@ -1,83 +1,15 @@
-# S3 Sink
+# S3 sink
 
-This is the Data Prepper S3 sink plugin that sends records to an S3 bucket via S3Client.
+The `s3` sink saves batches of events to Amazon Simple Storage Service (Amazon S3) objects.
 
-## Usages
+## Usage
 
-The s3 sink should be configured as part of Data Prepper pipeline yaml file.
-
-## Configuration Options
-
-```
-pipeline:
-  ...
-  sink:
-    - s3:
-        aws:
-          region: us-east-1
-          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
-          sts_header_overrides:
-        max_retries: 5
-        bucket: bucket_name
-        object_key:
-          path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/
-        threshold:
-          event_count: 2000
-          maximum_size: 50mb
-          event_collect_timeout: 15s
-        codec:
-          ndjson:
-        buffer_type: in_memory
-```
-
-## AWS Configuration
-
-- `region` (Optional) : The AWS region to use for credentials. Defaults to [standard SDK behavior to determine the region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
-
-- `sts_role_arn` (Optional) : The AWS STS role to assume for requests to S3. which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
-
-- `sts_external_id` (Optional) : The external ID to attach to AssumeRole requests.
-
-- `max_retries` (Optional) : An integer value indicates the maximum number of times that single request should be retired in-order to ingest data to amazon s3. Defaults to `5`.
-
-- `bucket` (Required) : The name of the S3 bucket to write to.
-
-- `object_key` (Optional) : It contains `path_prefix` and `file_pattern`. Defaults to s3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` inside bucket root directory.
-
-- `path_prefix` (Optional) : path_prefix nothing but directory structure inside bucket in-order to store objects. Defaults to `none`.
-
-## Threshold Configuration
-
-- `event_count` (Required) : An integer value indicates the maximum number of events required to ingest into s3-bucket as part of threshold.
-
-- `maximum_size` (Optional) : A String representing the count or size of bytes required to ingest into s3-bucket as part of threshold. Defaults to `50mb`.
-
-- `event_collect_timeout` (Required) : A String representing how long events should be collected before ingest into s3-bucket as part of threshold. All Duration values are a string that represents a duration. They support ISO_8601 notation string ("PT20.345S", "PT15M", etc.) as well as simple notation Strings for seconds ("60s") and milliseconds ("1500ms").
-
-## Buffer Type Configuration
-
-- `buffer_type` (Optional) : Records stored temporary before flushing into s3 bucket. Possible values are `local_file` and `in_memory`. Defaults to `in_memory`.
-
-## Metrics
-
-### Counters
-
-* `s3SinkObjectsSucceeded` - The number of S3 objects that the S3 sink has successfully written to S3.
-* `s3SinkObjectsFailed` - The number of S3 objects that the S3 sink failed to write to S3.
-* `s3SinkObjectsEventsSucceeded` - The number of records that the S3 sink has successfully written to S3.
-* `s3SinkObjectsEventsFailed` - The number of records that the S3 sink has failed to write to S3.
-
-### Distribution Summaries
-
-* `s3SinkObjectSizeBytes` - Measures the distribution of the S3 request's payload size in bytes.
+For information on usage, see the [s3 sink documentation](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/).
 
 ## Developer Guide
 
-This plugin is compatible with Java 11. See below
-
-- [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md)
-- [monitoring](https://github.com/opensearch-project/data-prepper/blob/main/docs/monitoring.md)
+See the [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md) guide for general information on contributions.
 
 The integration tests for this plugin do not run as part of the Data Prepper build.