diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md
index c752bf6b3db..188e17d91f1 100644
--- a/_data-prepper/pipelines/configuration/sinks/s3.md
+++ b/_data-prepper/pipelines/configuration/sinks/s3.md
@@ -70,20 +70,50 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM
 }
 ```
 
+## Cross-account S3 access
+
+When Data Prepper fetches data from an S3 bucket, it verifies the ownership of the bucket using the
+[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html).
+By default, Data Prepper expects an S3 bucket to be owned by the same account that owns the corresponding SQS queue.
+When no SQS queue is provided, Data Prepper uses the role Amazon Resource Name (ARN) in the `aws` configuration.
+
+If you plan to ingest data from multiple S3 buckets and each bucket belongs to a different AWS account, configure Data Prepper for cross-account S3 access according to the following conditions:
+
+- If all of the S3 buckets belong to an account other than that of the SQS queue, set `default_bucket_owner` to the account ID of the bucket owner.
+- If your S3 buckets are in multiple accounts, use a `bucket_owners` map.
+
+In the following example, the SQS queue is owned by account `000000000000`. The SQS queue contains data from two S3 buckets: `my-bucket-01` and `my-bucket-02`.
+Because `my-bucket-01` is owned by `123456789012` and `my-bucket-02` is owned by `999999999999`, the `bucket_owners` map lists both buckets with their owners' account IDs. Any bucket not listed in the map is expected to be owned by the `default_bucket_owner` account `111111111111`, as shown in the following configuration:
+
+```
+sink:
+  - s3:
+      default_bucket_owner: 111111111111
+      bucket_owners:
+        my-bucket-01: 123456789012
+        my-bucket-02: 999999999999
+```
+
+You can use both `bucket_owners` and `default_bucket_owner` together.
+
 ## Configuration
 
 Use the following options when customizing the `s3` sink.
 
-Option | Required | Type | Description
-:--- | :--- | :--- | :---
-`bucket` | Yes | String | The name of the S3 bucket to which the sink writes.
-`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object.
-`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
-`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
-`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` of the object in S3. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found in the root directory of the bucket.
-`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
-`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
-`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. Supports dynamic bucket naming using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamically named bucket cannot be accessed, the object is sent to the `default_bucket`, if one is configured; otherwise, the object data is dropped.
+`default_bucket` | No | String | The static name of the bucket to which objects are sent when a dynamically named bucket specified in `bucket` cannot be accessed.
+`bucket_owners` | No | Map | A map of bucket names that includes the IDs of the accounts that own the buckets. For more information, see [Cross-account S3 access](#cross-account-s3-access).
+`default_bucket_owner` | No | String | The AWS account ID of the owner of an S3 bucket. For more information, see [Cross-account S3 access](#cross-account-s3-access).
+`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object.
+`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
+`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
+`aggregate_threshold` | No | [Aggregate threshold](#aggregate-threshold-configuration) | Configures when and how to flush objects when a dynamic `path_prefix` creates many groups in memory.
+`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` of the object in S3. Defaults to the S3 object `events-%{yyyy-MM-dd'T'hh-mm-ss}` found in the root directory of the bucket.
+`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
+`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
+`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.
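+
+As an illustration, the following abbreviated sink configuration sketches how dynamic bucket naming, a fallback bucket, and an aggregate threshold might be combined. The `aws` settings, bucket names, codec, and partition key are placeholders; replace them with values from your own environment:
+
+```
+sink:
+  - s3:
+      aws:
+        region: us-east-1                                          # placeholder AWS Region
+        sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper  # placeholder role ARN
+      bucket: test-${/bucket_id}          # bucket name resolved from each event
+      default_bucket: my-fallback-bucket  # used when the dynamic bucket is inaccessible
+      object_key:
+        path_prefix: ${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/
+      aggregate_threshold:
+        maximum_size: 128mb        # force flush when all groups reach 128 MB in total
+        flush_capacity_ratio: 0.5  # flush 50% of the groups when the threshold is reached
+      threshold:
+        event_collect_timeout: 60s
+      codec:
+        ndjson:                    # placeholder codec
+```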
 
 ## aws
 
@@ -106,6 +136,13 @@ Option | Required | Type | Description
 `maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`.
 `event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`.
 
+## Aggregate threshold configuration
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`flush_capacity_ratio` | No | Float | The proportion of groups to force flush when the `aggregate_threshold` `maximum_size` is reached. Default is `0.5`.
+`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`.
+
 
 ## Buffer type
 
@@ -119,7 +156,7 @@ Option | Required | Type | Description
 
 Option | Required | Type | Description
 :--- | :--- | :--- | :---
-`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.
+`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the `my_partition_key` value. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.
 
 ## codec