Add new s3 sink documentation for Data Prepper 2.8
Signed-off-by: Taylor Gray <tylgry@amazon.com>
graytaylor0 committed May 15, 2024
1 parent aae9fc6 commit 60f66ea
59 changes: 48 additions & 11 deletions _data-prepper/pipelines/configuration/sinks/s3.md
@@ -70,20 +70,50 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM
}
```

## Cross-account S3 access<a name="s3_bucket_ownership"></a>

When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the
[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html).
By default, Data Prepper expects an S3 bucket to be owned by the same account that owns the corresponding SQS queue.
When no SQS queue is provided, Data Prepper uses the role Amazon Resource Name (ARN) in the `aws` configuration to determine the expected bucket owner.

If you plan to ingest data from multiple S3 buckets but each bucket belongs to a different AWS account, configure Data Prepper to check for cross-account S3 access according to the following conditions:

- If all of the S3 buckets from which you want data belong to an account other than the one that owns the SQS queue, set `default_bucket_owner` to the account ID of the bucket owner.
- If your S3 buckets are in multiple accounts, use a `bucket_owners` map.

In the following example, the SQS queue is owned by account `000000000000`. The SQS queue receives data from two S3 buckets: `my-bucket-01` and `my-bucket-02`.
Because `my-bucket-01` is owned by `123456789012` and `my-bucket-02` is owned by `999999999999`, the `bucket_owners` map lists both buckets with their owners' account IDs, as shown in the following configuration:

```
sink:
  - s3:
      default_bucket_owner: 111111111111
      bucket_owners:
        my-bucket-01: 123456789012
        my-bucket-02: 999999999999
```

You can use both `bucket_owners` and `default_bucket_owner` together.

## Configuration

Use the following options when customizing the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. Supports dynamic bucket naming using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamically named bucket cannot be accessed, the data is sent to the `default_bucket`, if one is configured. Otherwise, the object data is dropped.
`default_bucket` | No | String | The static name of the bucket to which data is sent when a dynamically named bucket in `bucket` cannot be accessed.
`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#s3_bucket_ownership).
`default_bucket_owner` | No | String | The AWS account ID of the owner of an S3 bucket. For more information, see [Cross-account S3 access](#s3_bucket_ownership).
`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object.
`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
`aggregate_threshold` | No | [Aggregate threshold](#aggregate-threshold-configuration) | Configures when and how to start flushing objects when a dynamic `path_prefix` creates many object groups in memory.
`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` and the `file_pattern` of the object store. The file pattern is always `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, objects are written to the root directory of the bucket. The `path_prefix` is configurable.
`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.
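
To see how these options fit together, the following is a minimal sketch of an `s3` sink that writes newline-delimited JSON to a dynamically named bucket with a static fallback. The values are illustrative assumptions: the fallback bucket name, Region, and role ARN are hypothetical, the option names under `aws` follow Data Prepper's usual `region` and `sts_role_arn` convention, and `ndjson` is assumed to be an available codec.

```
sink:
  - s3:
      bucket: test-${/bucket_id}           # resolved per event using a Data Prepper expression
      default_bucket: my-fallback-bucket   # hypothetical static fallback bucket
      compression: gzip
      aws:
        region: us-east-1                  # assumed AWS Region
        sts_role_arn: arn:aws:iam::123456789012:role/s3-sink-role   # hypothetical role ARN
      threshold:
        maximum_size: 50mb
        event_collect_timeout: 60s
      codec:
        ndjson: {}                         # assumed codec name
```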

## aws

@@ -106,6 +136,13 @@ Option | Required | Type | Description
`maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`.
`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`.
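
As a brief sketch, a threshold that writes an object once 50 MB accumulates or two minutes elapse, whichever occurs first, might look like the following; only the two options shown above are used:

```
threshold:
  maximum_size: 50mb            # flush after 50 MB accumulates
  event_collect_timeout: PT2M   # or after two minutes, whichever comes first
```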

## Aggregate threshold configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`flush_capacity_ratio` | No | Float | The proportion of groups to force flush when the `aggregate_threshold` `maximum_size` is reached. Default is `0.5`.
`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`.
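
For illustration, the following sketch force flushes half of the in-memory groups once the accumulated size across groups reaches 128 MB; the `0.5` ratio is the documented default and is shown here only for clarity:

```
aggregate_threshold:
  maximum_size: 128mb         # force flush once 128 MB accumulates across groups
  flush_capacity_ratio: 0.5   # flush 50% of groups (the default)
```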


## Buffer type

@@ -119,7 +156,7 @@ Option | Required | Type | Description

Option | Required | Type | Description
:--- | :--- | :--- | :---
`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the `my_partition_key` value. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.
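
As an example of a dynamic prefix, the following sketch partitions objects by a hypothetical `my_partition_key` event field and then by hour, mirroring the expression shown above:

```
object_key:
  path_prefix: /${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/   # ends with "/" per the guidance above
```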



## codec
