
Updates documentation for the Avro codec and S3 sink. #3211

Merged · 1 commit · Aug 22, 2023
87 changes: 9 additions & 78 deletions data-prepper-plugins/avro-codecs/README.md
@@ -1,89 +1,20 @@
# Avro Sink/Output Codec
# Avro codecs

This is an implementation of the Avro sink codec, which parses Data Prepper events into Avro records and writes them into the underlying OutputStream.
This project provides [Apache Avro](https://avro.apache.org/) support for Data Prepper. It includes an input codec, an output codec, and common libraries that other projects using Avro can use.

## Usages
## Usage

The Avro output codec can be configured with sink plugins (e.g. the S3 sink) in the pipeline file.
For usage information, see the Data Prepper documentation:

## Configuration Options
* [S3 source](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/)
* [S3 sink](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/)

```
pipeline:
  ...
  sink:
    - s3:
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
          sts_header_overrides:
        max_retries: 5
        bucket: bucket_name
        object_key:
          path_prefix: vpc-flow-logs/%{yyyy}/%{MM}/%{dd}/
        threshold:
          event_count: 2000
          maximum_size: 50mb
          event_collect_timeout: 15s
        codec:
          avro:
            schema: >
              {
                "type" : "record",
                "namespace" : "org.opensearch.dataprepper.examples",
                "name" : "VpcFlowLog",
                "fields" : [
                  { "name" : "version", "type" : ["null", "string"]},
                  { "name" : "srcport", "type": ["null", "int"]},
                  { "name" : "dstport", "type": ["null", "int"]},
                  { "name" : "accountId", "type" : ["null", "string"]},
                  { "name" : "interfaceId", "type" : ["null", "string"]},
                  { "name" : "srcaddr", "type" : ["null", "string"]},
                  { "name" : "dstaddr", "type" : ["null", "string"]},
                  { "name" : "start", "type": ["null", "int"]},
                  { "name" : "end", "type": ["null", "int"]},
                  { "name" : "protocol", "type": ["null", "int"]},
                  { "name" : "packets", "type": ["null", "int"]},
                  { "name" : "bytes", "type": ["null", "int"]},
                  { "name" : "action", "type": ["null", "string"]},
                  { "name" : "logStatus", "type" : ["null", "string"]}
                ]
              }
            exclude_keys:
              - s3
        buffer_type: in_memory
```

## AWS Configuration

### Codec Configuration:

1) `schema`: A JSON string that the user can provide in the pipeline YAML file itself. The codec parses the schema object from this string.
2) `exclude_keys`: The keys of the events that the user wants to exclude when converting them to Avro records (see the sketch after this list).
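
As a quick illustration of `exclude_keys`, a minimal sketch; the incoming event and its keys here are hypothetical:

```
# Hypothetical incoming event:
#   { "version": "2", "srcport": 443, "s3": { "bucket": "logs" } }
codec:
  avro:
    exclude_keys:
      - s3
# With this configuration the resulting Avro record keeps "version" and
# "srcport" but omits the excluded "s3" key.
```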

### Note:

1) The user can provide only one schema at a time, through either of the ways provided in the codec configuration.
2) If the user wants the tags to be part of the resulting Avro data and has set `tagsTargetKey` in the config file, the user also has to modify the schema to accommodate the tags. Another field has to be provided in the `schema.json` file:

```
{
  "name": "yourTagsTargetKey",
  "type": {
    "type": "array",
    "items": "string"
  }
}
```
3) If the user doesn't provide any schema, the codec will auto-generate a schema from the first event in the buffer (see the sketch after this list).
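
To make the auto-generation note concrete, a sketch under stated assumptions: the event below is hypothetical, and the exact record name, namespace, and type mapping the codec chooses are implementation details not specified here.

```
# Hypothetical first event in the buffer:
#   { "version": "2", "srcport": 443 }
# An auto-generated schema could map each key to a nullable field,
# along these lines:
schema: >
  {
    "type" : "record",
    "name" : "Event",
    "fields" : [
      { "name" : "version", "type" : ["null", "string"]},
      { "name" : "srcport", "type": ["null", "int"]}
    ]
  }
```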

## Developer Guide

This plugin is compatible with Java 11. See the following:

- [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md)
- [monitoring](https://github.com/opensearch-project/data-prepper/blob/main/docs/monitoring.md)
See the [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md) guide for general information on contributions.

The integration tests for this plugin do not run as part of the Data Prepper build.
They are included only with the S3 source or S3 sink for now.

The following command runs the integration tests:

```
./gradlew :data-prepper-plugins:s3-sink:integrationTest -Dtests.s3sink.region=<your-aws-region> -Dtests.s3sink.bucket=<your-bucket>
```
See the README files for those projects for information on running those tests.
78 changes: 5 additions & 73 deletions data-prepper-plugins/s3-sink/README.md
@@ -1,83 +1,15 @@
# S3 Sink
# S3 sink

This is the Data Prepper S3 sink plugin, which sends records to an S3 bucket via the S3Client.
The `s3` sink saves batches of events to Amazon Simple Storage Service (Amazon S3) objects.

## Usages
## Usage

The `s3` sink should be configured as part of the Data Prepper pipeline YAML file.

## Configuration Options

```
pipeline:
  ...
  sink:
    - s3:
        aws:
          region: us-east-1
          sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
          sts_header_overrides:
        max_retries: 5
        bucket: bucket_name
        object_key:
          path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/
        threshold:
          event_count: 2000
          maximum_size: 50mb
          event_collect_timeout: 15s
        codec:
          ndjson:
        buffer_type: in_memory
```

## AWS Configuration

- `region` (Optional) : The AWS region to use for credentials. Defaults to [standard SDK behavior to determine the region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).

- `sts_role_arn` (Optional) : The AWS STS role to assume for requests to S3. If not provided, the sink uses the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).

- `sts_external_id` (Optional) : The external ID to attach to AssumeRole requests.

- `max_retries` (Optional) : An integer indicating the maximum number of times a single request to ingest data into Amazon S3 should be retried. Defaults to `5`.

- `bucket` (Required) : The name of the S3 bucket to write to.

- `object_key` (Optional) : Contains `path_prefix` and `file_pattern`. Defaults to the S3 object key `events-%{yyyy-MM-dd'T'hh-mm-ss}` in the bucket root directory.

- `path_prefix` (Optional) : The directory structure within the bucket in which objects are stored. Defaults to no prefix, i.e. the bucket root. A sketch of the resulting object key follows this list.
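
To make the `object_key` behavior concrete, a minimal sketch; the prefix is taken from the configuration example above, and the date and time in the resulting key are hypothetical:

```
object_key:
  path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/
# Combined with the default file pattern events-%{yyyy-MM-dd'T'hh-mm-ss},
# an object written on (hypothetically) August 22, 2023 would get a key like:
#   my-elb/2023/08/22/events-2023-08-22T10-30-00
```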

## Threshold Configuration

- `event_count` (Required) : An integer indicating the number of events to collect before writing an object to the S3 bucket.

- `maximum_size` (Optional) : A string indicating the number of bytes to collect before writing an object to the S3 bucket. Defaults to `50mb`.

- `event_collect_timeout` (Required) : A string indicating how long events should be collected before writing an object to the S3 bucket. All duration values are strings that represent a duration. They support ISO-8601 notation ("PT20.345S", "PT15M", etc.) as well as simple notation for seconds ("60s") and milliseconds ("1500ms"). A sketch follows this list.
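
For illustration, the threshold block from the configuration example above, with the timeout rewritten in ISO-8601 notation; the assumption (not stated explicitly above) is that reaching any one of the configured limits triggers a write:

```
threshold:
  event_count: 2000
  maximum_size: 50mb
  event_collect_timeout: PT15S   # ISO-8601 form; "15s" is the equivalent simple notation
```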

## Buffer Type Configuration

- `buffer_type` (Optional) : Where records are stored temporarily before being flushed to the S3 bucket. Possible values are `local_file` and `in_memory`. Defaults to `in_memory`.

## Metrics

### Counters

* `s3SinkObjectsSucceeded` - The number of S3 objects that the S3 sink has successfully written to S3.
* `s3SinkObjectsFailed` - The number of S3 objects that the S3 sink failed to write to S3.
* `s3SinkObjectsEventsSucceeded` - The number of records that the S3 sink has successfully written to S3.
* `s3SinkObjectsEventsFailed` - The number of records that the S3 sink has failed to write to S3.

### Distribution Summaries

* `s3SinkObjectSizeBytes` - Measures the distribution of the S3 request's payload size in bytes.
For information on usage, see the [s3 sink documentation](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/).


## Developer Guide

This plugin is compatible with Java 11. See the following:

- [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md)
- [monitoring](https://github.com/opensearch-project/data-prepper/blob/main/docs/monitoring.md)
See the [CONTRIBUTING](https://github.com/opensearch-project/data-prepper/blob/main/CONTRIBUTING.md) guide for general information on contributions.

The integration tests for this plugin do not run as part of the Data Prepper build.
