Add new s3 sink documentation for Data Prepper 2.8
Signed-off-by: Taylor Gray <tylgry@amazon.com>
graytaylor0 committed May 15, 2024
1 parent 6f8261b commit d85aa07
Showing 1 changed file with 48 additions and 11 deletions.
_data-prepper/pipelines/configuration/sinks/s3.md
To use the `s3` sink, configure AWS Identity and Access Management (IAM) with permission to write to your S3 bucket.
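
A minimal permissions policy granting that write access might look like the following sketch; the bucket name `your-s3-bucket` is a placeholder for your own bucket:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3SinkAccess",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::your-s3-bucket/*"
    }
  ]
}
```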

## Cross-account S3 access<a name="s3_bucket_ownership"></a>

When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the
[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html).
By default, Data Prepper expects an S3 bucket to be owned by the same account that owns the correlating Amazon Simple Queue Service (SQS) queue.
When no SQS queue is provided, Data Prepper uses the role Amazon Resource Name (ARN) in the `aws` configuration to determine the expected bucket owner.

If you plan to ingest data from multiple S3 buckets and each bucket is associated with a different AWS account, you need to configure Data Prepper to check for cross-account S3 access according to the following conditions:

- If all of the S3 buckets from which you want data belong to an account other than that of the SQS queue, set `default_bucket_owner` to the account ID of the bucket owner.
- If your S3 buckets are in multiple accounts, use a `bucket_owners` map.

In the following example, the sink writes to two S3 buckets: `my-bucket-01`, which is owned by account `123456789012`, and `my-bucket-02`, which is owned by account `999999999999`.
The `bucket_owners` map lists each bucket with its owner's account ID, and `default_bucket_owner` sets the expected owner (`111111111111`) for any bucket that does not appear in the map, as shown in the following configuration:

```
sink:
  - s3:
      default_bucket_owner: 111111111111
      bucket_owners:
        my-bucket-01: 123456789012
        my-bucket-02: 999999999999
```

You can use both `bucket_owners` and `default_bucket_owner` together.

## Configuration

Use the following options when customizing the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. Supports sending to dynamic buckets using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket cannot be accessed, the object is sent to the `default_bucket`, if one is configured. Otherwise, the object data is dropped.
`default_bucket` | No | String | The static name of the bucket to send to when a dynamic bucket in `bucket` cannot be accessed.
`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#s3_bucket_ownership).
`default_bucket_owner` | No | String | The AWS account ID of the owner of any S3 bucket not listed in `bucket_owners`. For more information, see [Cross-account S3 access](#s3_bucket_ownership).
`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object.
`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
`aggregate_threshold` | No | [Aggregate threshold](#aggregate-threshold-configuration) | Configures when and how to start flushing objects when a dynamic `path_prefix` creates many groups in memory.
`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` of the object in S3. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found in the root directory of the bucket.
`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Default is `5`.
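
To show how these options fit together, the following is a minimal sketch of an `s3` sink entry; the Region, role ARN, bucket names, codec choice, and the `bucket_id` event field are illustrative placeholders rather than values from this documentation:

```
sink:
  - s3:
      aws:
        region: us-east-1
        sts_role_arn: arn:aws:iam::123456789012:role/data-prepper-s3-sink
      # Resolved per event; falls back to default_bucket when inaccessible.
      bucket: test-${/bucket_id}
      default_bucket: my-fallback-bucket
      object_key:
        path_prefix: logs/%{yyyy}/%{MM}/%{dd}/
      threshold:
        event_collect_timeout: 60s
        maximum_size: 50mb
      codec:
        ndjson:
      compression: gzip
```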

## aws

## Threshold configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`.
`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`.
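
For example, a sketch of a `threshold` block that writes an object after 30 MB accumulates or after two and a half minutes, whichever comes first (values are illustrative):

```
threshold:
  maximum_size: 30mb
  # ISO-8601 duration; simple notation such as 150s also works.
  event_collect_timeout: PT2M30S
```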

## Aggregate threshold configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`flush_capacity_ratio` | No | Float | The percentage of groups to be force flushed when the `aggregate_threshold` `maximum_size` is reached. Default is `0.5`.
`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`.
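
For example, a sketch of an `aggregate_threshold` that allows groups to accumulate up to 128 MB in total and force flushes half of them once that limit is reached (values are illustrative):

```
aggregate_threshold:
  maximum_size: 128mb
  # Flush 50% of the groups when maximum_size is reached.
  flush_capacity_ratio: 0.5
```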


## Buffer type

## Object key configuration

Option | Required | Type | Description
:--- | :--- | :--- | :---
`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the `my_partition_key` value. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.
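
As an illustration, a sketch of an `object_key` block that partitions objects by a `my_partition_key` event field and then by hour (the field name is illustrative):

```
object_key:
  # Trailing slash keeps each object inside the generated prefix.
  path_prefix: ${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/
```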


## codec