Add new s3 sink documentation for Data Prepper 2.8
Signed-off-by: Taylor Gray <tylgry@amazon.com>
graytaylor0 committed May 15, 2024
1 parent 6f8261b commit d85aa07
Showing 1 changed file with 48 additions and 11 deletions.
_data-prepper/pipelines/configuration/sinks/s3.md
To use the `s3` sink, configure AWS Identity and Access Management (IAM) with permission to write to your S3 bucket.
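
A minimal permissions policy granting that write access might look like the following sketch; the bucket name `your-s3-bucket` is a placeholder for your own bucket:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3SinkAccess",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::your-s3-bucket/*"
    }
  ]
}
```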

## Cross-account S3 access<a name="s3_bucket_ownership"></a>

When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the
[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html).
By default, Data Prepper expects an S3 bucket to be owned by the same account that owns the correlating Amazon Simple Queue Service (SQS) queue.
When no SQS queue is provided, Data Prepper uses the role Amazon Resource Name (ARN) in the `aws` configuration to determine the expected bucket owner.

If you plan to ingest data from multiple S3 buckets and each bucket is associated with a different AWS account, you need to configure Data Prepper to check for cross-account S3 access according to the following conditions:

- If all of the S3 buckets from which you want data belong to an account other than that of the SQS queue, set `default_bucket_owner` to the account ID of the bucket owner.
- If your S3 buckets are in multiple accounts, use a `bucket_owners` map.

In the following example, the sink writes to two S3 buckets: `my-bucket-01`, which is owned by account `123456789012`, and `my-bucket-02`, which is owned by account `999999999999`.
The `bucket_owners` map lists each bucket with its owner's account ID, and `default_bucket_owner` sets the expected owner (`111111111111`) for any bucket that does not appear in the map, as shown in the following configuration:

```
sink:
  - s3:
      default_bucket_owner: 111111111111
      bucket_owners:
        my-bucket-01: 123456789012
        my-bucket-02: 999999999999
```

You can use both `bucket_owners` and `default_bucket_owner` together.

## Configuration

Use the following options when customizing the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. Supports sending to dynamic buckets using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket cannot be accessed, the object is sent to the `default_bucket`, if one is configured. Otherwise, the object data is dropped.
`default_bucket` | No | String | The static name of the bucket to send to when a dynamic bucket in `bucket` cannot be accessed.
`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#s3_bucket_ownership).
`default_bucket_owner` | No | String | The AWS account ID of the owner of any S3 bucket not listed in `bucket_owners`. For more information, see [Cross-account S3 access](#s3_bucket_ownership).
`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object.
`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
`aggregate_threshold` | No | [Aggregate threshold](#aggregate-threshold-configuration) | Configures when and how to start flushing objects when a dynamic `path_prefix` creates many groups in memory.
`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` of the object in S3. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found in the root directory of the bucket.
`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Default is `5`.
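
To show how these options fit together, the following is a minimal sketch of an `s3` sink entry; the Region, role ARN, bucket names, codec choice, and the `bucket_id` event field are illustrative placeholders rather than values from this documentation:

```
sink:
  - s3:
      aws:
        region: us-east-1
        sts_role_arn: arn:aws:iam::123456789012:role/data-prepper-s3-sink
      # Resolved per event; falls back to default_bucket when inaccessible.
      bucket: test-${/bucket_id}
      default_bucket: my-fallback-bucket
      object_key:
        path_prefix: logs/%{yyyy}/%{MM}/%{dd}/
      threshold:
        event_collect_timeout: 60s
        maximum_size: 50mb
      codec:
        ndjson:
      compression: gzip
```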

## aws

## Threshold configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`.
`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`.
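
For example, a sketch of a `threshold` block that writes an object after 30 MB accumulates or after two and a half minutes, whichever comes first (values are illustrative):

```
threshold:
  maximum_size: 30mb
  # ISO-8601 duration; simple notation such as 150s also works.
  event_collect_timeout: PT2M30S
```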

## Aggregate threshold configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`flush_capacity_ratio` | No | Float | The percentage of groups to be force flushed when the `aggregate_threshold` `maximum_size` is reached. Default is `0.5`.
`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`.
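
For example, a sketch of an `aggregate_threshold` that allows groups to accumulate up to 128 MB in total and force flushes half of them once that limit is reached (values are illustrative):

```
aggregate_threshold:
  maximum_size: 128mb
  # Flush 50% of the groups when maximum_size is reached.
  flush_capacity_ratio: 0.5
```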


## Buffer type

## Object key configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the `my_partition_key` value. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.
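
As an illustration, a sketch of an `object_key` block that partitions objects by a `my_partition_key` event field and then by hour (the field name is illustrative):

```
object_key:
  # Trailing slash keeps each object inside the generated prefix.
  path_prefix: ${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/
```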


## codec