From 60f66eaa3c655729d399e01e154001b2ba4956c8 Mon Sep 17 00:00:00 2001
From: Taylor Gray
Date: Wed, 15 May 2024 09:41:05 -0500
Subject: [PATCH] Add new s3 sink documentation for Data Prepper 2.8

Signed-off-by: Taylor Gray

---
 .../pipelines/configuration/sinks/s3.md | 59 +++++++++++++++----
 1 file changed, 48 insertions(+), 11 deletions(-)

diff --git a/_data-prepper/pipelines/configuration/sinks/s3.md b/_data-prepper/pipelines/configuration/sinks/s3.md
index 71cb7b1f70c..1e7dff1b15f 100644
--- a/_data-prepper/pipelines/configuration/sinks/s3.md
+++ b/_data-prepper/pipelines/configuration/sinks/s3.md
@@ -70,20 +70,50 @@ In order to use the `s3` sink, configure AWS Identity and Access Management (IAM
 }
 ```
 
+## Cross-account S3 access
+
+When Data Prepper writes an object to an S3 bucket, it verifies the ownership of the bucket using the
+[bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html).
+By default, Data Prepper expects an S3 bucket to be owned by the same account that owns the role specified by the Amazon Resource Name (ARN) in the `aws` configuration.
+
+If you plan to write data to multiple S3 buckets that are owned by different AWS accounts, configure Data Prepper for cross-account S3 access according to the following conditions:
+
+- If all of the S3 buckets that you want to write to belong to one account other than the account of the role in the `aws` configuration, set `default_bucket_owner` to the ID of the account that owns the buckets.
+- If the S3 buckets are owned by multiple accounts, use a `bucket_owners` map.
+
+In the following example, the pipeline writes to two S3 buckets: `my-bucket-01`, which is owned by account `123456789012`, and `my-bucket-02`, which is owned by account `999999999999`. The `bucket_owners` map pairs each bucket with the ID of the account that owns it, and `default_bucket_owner` sets account `111111111111` as the expected owner of any bucket not listed in the map:
+
+```
+sink:
+  - s3:
+      default_bucket_owner: 111111111111
+      bucket_owners:
+        my-bucket-01: 123456789012
+        my-bucket-02: 999999999999
+```
+
+You can use both `bucket_owners` and `default_bucket_owner` together.
 
 ## Configuration
 
 Use the following options when customizing the `s3` sink.
 
-Option | Required | Type | Description
-:--- | :--- | :--- | :---
-`bucket` | Yes | String | The name of the S3 bucket to which objects are stored. The `name` must match the name of your object store.
-`codec` | Yes | [Codec](#codec) | The codec determining the format of output data.
-`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
-`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
-`object_key` | No | Sets the `path_prefix` and the `file_pattern` of the object store. The file pattern is always `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable.
-`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
-`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
-`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`bucket` | Yes | String | The name of the S3 bucket to which the sink writes. Supports dynamic bucket naming using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamically named bucket cannot be accessed, the object is sent to the bucket named in `default_bucket`, if one is configured; otherwise, the object data is dropped.
+`default_bucket` | No | String | The static name of the bucket to which objects are written when a dynamically named bucket in `bucket` cannot be accessed.
+`bucket_owners` | No | Map | A map of bucket names to the IDs of the accounts that own them. For more information, see [Cross-account S3 access](#cross-account-s3-access).
+`default_bucket_owner` | No | String | The ID of the AWS account that owns any bucket not listed in `bucket_owners`. For more information, see [Cross-account S3 access](#cross-account-s3-access).
+`codec` | Yes | [Codec](#codec) | The codec that determines how the data is serialized in the S3 object.
+`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
+`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
+`aggregate_threshold` | No | [Aggregate threshold](#aggregate-threshold-configuration) | Configures when and how to start flushing objects when a dynamic `path_prefix` creates many object groups in memory.
+`object_key` | No | [Object key](#object-key-configuration) | Sets the `path_prefix` and the `file_pattern` of the object store. The file pattern is always `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable.
+`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
+`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
+`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Default is `5`.
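+
+The following abbreviated example shows how these options might fit together in a pipeline. The bucket names, partition key, AWS Region, role ARN, and threshold values are illustrative placeholders, and the `ndjson` codec is only one of the options described in [codec](#codec):
+
+```
+sink:
+  - s3:
+      # Dynamic bucket name resolved per event; falls back to default_bucket
+      # if the resolved bucket cannot be accessed
+      bucket: app-logs-${/application_id}
+      default_bucket: app-logs-default
+      compression: gzip
+      aws:
+        region: us-east-1
+        sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper-S3-Sink-Role
+      threshold:
+        # Write an object once any one of these limits is reached
+        event_count: 10000
+        maximum_size: 50mb
+        event_collect_timeout: 60s
+      codec:
+        ndjson:
+```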
 
 ## aws
 
@@ -106,6 +136,13 @@ Option | Required | Type | Description
 `maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`.
 `event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`.
 
+## Aggregate threshold configuration
+
+Option | Required | Type | Description
+:--- | :--- | :--- | :---
+`flush_capacity_ratio` | No | Float | The ratio of object groups to force flush when the `aggregate_threshold` `maximum_size` is reached, expressed as a number between `0` and `1`. Default is `0.5`.
+`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force flushing objects. For example, `128mb`.
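+
+As a sketch of how object grouping behaves, the following abbreviated configuration pairs `aggregate_threshold` with a dynamic `path_prefix`, which is described in [Object key configuration](#object-key-configuration). The bucket name and the `tenant_id` partition key are hypothetical:
+
+```
+sink:
+  - s3:
+      bucket: my-analytics-bucket
+      object_key:
+        # Each distinct tenant_id and date combination forms its own object group in memory
+        path_prefix: ${/tenant_id}/%{yyyy}/%{MM}/%{dd}/
+      aggregate_threshold:
+        # When the groups together reach 128mb, force flush half of them
+        maximum_size: 128mb
+        flush_capacity_ratio: 0.5
+```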
 
 ## Buffer type
 
@@ -119,7 +156,7 @@ Option | Required | Type | Description
 
 Option | Required | Type | Description
 :--- | :--- | :--- | :---
-`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.
+`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the value of `my_partition_key`. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.
 
 ## codec