
Add new s3 sink documentation for Data Prepper 2.8 #7163

Merged
merged 49 commits into from
Jun 25, 2024
Commits
49 commits
1c94268
Add new s3 sink documentation for Data Prepper 2.8
graytaylor0 May 15, 2024
34f3d34
Apply suggestions from code review
graytaylor0 May 21, 2024
756dca8
Merge branch 'main' into main
vagimeli May 24, 2024
f8ef118
Update s3.md
vagimeli May 24, 2024
d498755
Merge branch 'main' into main
vagimeli Jun 6, 2024
e608d5e
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
6b472fa
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
bd2fbba
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
426baed
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
64f2193
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
b19ec96
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
339cb2c
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
bb8ae3f
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
7330f63
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
6ff8db9
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 6, 2024
52d63f1
Update s3.md
vagimeli Jun 6, 2024
8fa750b
Merge branch 'main' into main
vagimeli Jun 6, 2024
95d332e
Merge branch 'main' into main
vagimeli Jun 21, 2024
4e30d2d
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 21, 2024
eb83359
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 21, 2024
3dc6c55
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 21, 2024
074f1da
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 21, 2024
0b1e89f
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 21, 2024
518d112
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
4745f64
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
9700cc7
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
f65bfb0
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
67ebd63
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
340ea25
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
d51418f
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
8afd19d
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
7729a67
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
521ac93
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
cf65a37
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
b9c58fe
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
3e23dc4
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
f4e6e2b
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
dffe881
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
e2fd6a2
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
559d0a1
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
a0b8d94
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
11a2612
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
cb19c6b
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
b7b1d93
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
34c0439
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
27ef838
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
8ef550e
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
f2ab4ae
Update _data-prepper/pipelines/configuration/sinks/s3.md
vagimeli Jun 25, 2024
71b8ba7
Merge branch 'main' into main
vagimeli Jun 25, 2024
149 changes: 85 additions & 64 deletions _data-prepper/pipelines/configuration/sinks/s3.md
@@ -15,19 +15,20 @@
```
${pathPrefix}events-%{yyyy-MM-dd'T'HH-mm-ss'Z'}-${currentTimeInNanos}-${uniquenessId}.${codecSuppliedExtension}
```
{% include copy-curl.html %}

When a batch of objects is written to S3, the objects are formatted similarly to the following:
When a batch of objects is written to Amazon S3, the objects are formatted similarly to the following:

```
my-logs/2023/06/09/06/events-2023-06-09T06-00-01-1686290401871214927-ae15b8fa-512a-59c2-b917-295a0eff97c8.json
```
{% include copy-curl.html %}


For more information about how to configure an object, see the [Object key](#object-key-configuration) section.
For more information about how to configure an object, refer to [Object key](#object-key-configuration).

## Usage

The following example creates a pipeline configured with an s3 sink. It contains additional options for customizing the event and size thresholds for which the pipeline sends record events and sets the codec type `ndjson`:
The following example creates a pipeline configured with an `s3` sink. It contains additional options for customizing the event and size thresholds for the pipeline and sets the codec type as `ndjson`:

```
pipeline:
@@ -49,10 +50,11 @@
ndjson:
buffer_type: in_memory
```
{% include copy-curl.html %}

## IAM permissions

In order to use the `s3` sink, configure AWS Identity and Access Management (IAM) to grant Data Prepper permissions to write to Amazon S3. You can use a configuration similar to the following JSON configuration:
To use the `s3` sink, configure AWS Identity and Access Management (IAM) to grant Data Prepper permissions to write to Amazon S3. You can use a configuration similar to the following JSON configuration:

```json
{
@@ -69,121 +71,140 @@
]
}
```
{% include copy-curl.html %}

## Cross-account S3 access<a name="s3_bucket_ownership"></a>

When Data Prepper fetches data from an S3 bucket, it verifies bucket ownership using a [bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html).

By default, the S3 sink does not require `bucket_owners`. If `bucket_owners` is configured and a bucket is not included in one of the mapped configurations, `default_bucket_owner` defaults to the account ID in `aws.sts_role_arn`. You can configure both `bucket_owners` and `default_bucket_owner` and apply the settings together.

When ingesting data from multiple S3 buckets with different account associations, configure Data Prepper for cross-account S3 access based on the following conditions:

- For S3 buckets belonging to the same account, set `default_bucket_owner` to that account's ID.
- For S3 buckets belonging to multiple accounts, use a `bucket_owners` map.

A `bucket_owners` map specifies account IDs for buckets belonging to multiple accounts. For example, in the following configuration, `my-bucket-01` is owned by `123456789012` and `my-bucket-02` is owned by `999999999999`:

```
sink:
- s3:
default_bucket_owner: 111111111111
bucket_owners:
my-bucket-01: 123456789012
my-bucket-02: 999999999999
```
{% include copy-curl.html %}

## Configuration

Use the following options when customizing the `s3` sink.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`bucket` | Yes | String | The name of the S3 bucket to which objects are stored. The `name` must match the name of your object store.
`codec` | Yes | [Codec](#codec) | The codec determining the format of output data.
`aws` | Yes | AWS | The AWS configuration. See [aws](#aws) for more information.
`threshold` | Yes | [Threshold](#threshold-configuration) | Configures when to write an object to S3.
`object_key` | No | Sets the `path_prefix` and the `file_pattern` of the object store. The file pattern is always `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, those objects are found inside the root directory of the bucket. The `path_prefix` is configurable.
`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
`buffer_type` | No | [Buffer type](#buffer-type) | Determines the buffer type.
`max_retries` | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to `5`.

## aws
Option | Required | Type | Description
:--- |:---------|:------------------------------------------------| :---
`bucket` | Yes | String | Specifies the sink's S3 bucket name. Supports dynamic bucket naming using [Data Prepper expressions]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/), for example, `test-${/bucket_id}`. If a dynamic bucket is inaccessible and no `default_bucket` is configured, then the object data is dropped.

`default_bucket` | No | String | A static bucket for inaccessible dynamic buckets in `bucket`.
`bucket_owners` | No | Map | A map of bucket names and their account owner IDs for cross-account access. Refer to [Cross-account S3 access](#s3_bucket_ownership).
`default_bucket_owner` | No | String | The AWS account ID for an S3 bucket owner. Refer to [Cross-account S3 access](#s3_bucket_ownership).
`codec` | Yes | [Codec](#codec) | Serializes data in S3 objects.
`aws` | Yes | AWS | The AWS configuration. Refer to [aws](#aws).

`threshold` | Yes | [Threshold](#threshold-configuration) | Condition for writing objects to S3.
`aggregate_threshold` | No | [Aggregate threshold](#threshold-configuration) | A condition for flushing objects with a dynamic `path_prefix`.
`object_key` | No | [Object key](#object-key-configuration) | Sets `path_prefix` and `file_pattern` for object storage. The file pattern is `events-%{yyyy-MM-dd'T'hh-mm-ss}`. By default, these objects are found in the bucket's root directory. `path_prefix` is configurable.
`compression` | No | String | The compression algorithm to apply: `none`, `gzip`, or `snappy`. Default is `none`.
`buffer_type` | No | [Buffer type](#buffer-type) | The buffer type configuration.
`max_retries` | No | Integer | The maximum number of retries for S3 ingestion requests. Default is `5`.
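
For example, the following sink sketch combines dynamic bucket naming with a fallback bucket. The bucket names and the `/bucket_id` event field are illustrative only:

```
sink:
  - s3:
      bucket: test-${/bucket_id}
      default_bucket: my-fallback-bucket
      compression: gzip
      max_retries: 5
```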

## `aws`

Option | Required | Type | Description
:--- | :--- | :--- | :---
`region` | No | String | The AWS Region to use for credentials. Defaults to [standard SDK behavior to determine the Region](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/region-selection.html).
`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon SQS and Amazon S3. Defaults to `null`, which will use the [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
`sts_role_arn` | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon Simple Queue Service (Amazon SQS) and Amazon S3. Defaults to `null`, which uses [standard SDK behavior for credentials](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html).
`sts_header_overrides` | No | Map | A map of header overrides that the IAM role assumes for the sink plugin.
`sts_external_id` | No | String | An STS external ID used when Data Prepper assumes the role. For more information, see the `ExternalId` documentation in the [STS AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API reference.


`sts_external_id` | No | String | An AWS STS external ID used when Data Prepper assumes the role. For more information, refer to the `ExternalId` section under [AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) in the AWS STS API reference.
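
The following is a sketch of an `aws` configuration. The Region, account ID, role name, and external ID are placeholders:

```
aws:
  region: us-east-1
  sts_role_arn: arn:aws:iam::123456789012:role/data-prepper-s3-sink-role
  sts_external_id: my-external-id
```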

## Threshold configuration

Use the following options to set ingestion thresholds for the `s3` sink. When any of these conditions are met, Data Prepper will write events to an S3 object.
Use the following options to set ingestion thresholds for the `s3` sink. Data Prepper writes events to an S3 object when any of these conditions occur.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`event_count` | Yes | Integer | The number of Data Prepper events to accumulate before writing an object to S3.
`maximum_size` | No | String | The maximum number of bytes to accumulate before writing an object to S3. Default is `50mb`.
`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`.
`event_collect_timeout` | Yes | String | The maximum amount of time before Data Prepper writes an event to S3. The value should be either an ISO-8601 duration, such as `PT2M30S`, or a simple notation, such as `60s` or `1500ms`.
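
For example, the following threshold sketch, with arbitrary values, writes an object when 10,000 events accumulate, the buffered data reaches 50 MB, or 2.5 minutes elapse, whichever occurs first:

```
threshold:
  event_count: 10000
  maximum_size: 50mb
  event_collect_timeout: PT2M30S
```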

## Aggregate threshold configuration

Use the following options to set rules or limits that trigger certain actions or behavior when an aggregated value crosses a defined threshold.

Option | Required | Type | Description
:--- |:-----------------------------------|:-------| :---
`flush_capacity_ratio` | No | Float | The percentage of groups to be force flushed when the `aggregate_threshold` `maximum_size` is reached. Expressed as a value between `0.0` and `1.0`. Default is `0.5`.
`maximum_size` | Yes | String | The maximum number of bytes to accumulate before force-flushing objects. For example, `128mb`.
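
For example, the following `aggregate_threshold` sketch, with illustrative values, force flushes half of the open groups once the data accumulated across all groups reaches 128 MB:

```
aggregate_threshold:
  maximum_size: 128mb
  flush_capacity_ratio: 0.5
```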

## Buffer type

`buffer_type` is an optional configuration that determines how Data Prepper temporarily stores data before writing an object to S3. The default value is `in_memory`. Use one of the following options:
`buffer_type` is an optional configuration that determines how Data Prepper temporarily stores data before writing an object to S3. The default value is `in_memory`.

Use one of the following options:

- `in_memory`: Stores the record in memory.
- `local_file`: Flushes the record into a file on your local machine. This uses your machine's temporary directory.
- `local_file`: Flushes the record into a file on your local machine. This option uses your machine's temporary directory.
- `multipart`: Writes using the [S3 multipart upload](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html). Every 10 MB is written as a part.
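
For example, a minimal sketch that selects the multipart buffer for writing large objects:

```
sink:
  - s3:
      buffer_type: multipart
```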

## Object key configuration

Use the following options to define how object keys are constructed for objects stored in S3.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting. For example, you can use `%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3. The prefix path should end with `/`. By default, Data Prepper writes objects to the root of the S3 bucket.

`path_prefix` | No | String | The S3 key prefix path to use for objects written to S3. Accepts date-time formatting and dynamic injection of values using [Data Prepper expressions](https://opensearch.org/docs/latest/data-prepper/pipelines/expression-syntax/). For example, you can use `/${/my_partition_key}/%{yyyy}/%{MM}/%{dd}/%{HH}/` to create hourly folders in S3 based on the `my_partition_key` value. The prefix path should end with `/`. By default, Data Prepper writes objects to the S3 bucket root.
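
For example, the following `object_key` sketch creates hourly folders partitioned by a hypothetical `/application` event field:

```
object_key:
  path_prefix: logs-${/application}/%{yyyy}/%{MM}/%{dd}/%{HH}/
```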

## codec
## `codec`

The `codec` determines how the `s3` sink formats data written to each S3 object.

### avro codec
### `avro` codec

The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document.
The `avro` codec writes an event as an [Apache Avro](https://avro.apache.org/) document. Because Avro requires a schema, you may either define the schema or have Data Prepper automatically generate it. Defining your own schema is recommended, as this will allow it to be tailored to your particular use case.

Because Avro requires a schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema.
In general, you should define your own schema because it will most accurately reflect your needs.
When you provide your own Avro schema, that schema defines the final structure of your data. Any extra values inside any incoming events that are not mapped in the Avro schema will not be included in the final destination. Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema. This restriction avoids confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations.

We recommend that you make your Avro fields use a null [union](https://avro.apache.org/docs/current/specification/#unions).
Without the null union, each field must be present or the data will fail to write to the sink.
If you can be certain that each event has a given field, you can make it non-nullable.
In cases where your data is uniform, you may be able to automatically generate a schema. Auto-generated schemas are based on the first event that the codec receives. The schema will only contain keys from this event, and all keys must be present in all events in order to automatically generate a working schema. Auto-generated schemas make all fields nullable. Use the `include_keys` and `exclude_keys` sink configurations to control which data is included in the auto-generated schema.

When you provide your own Avro schema, that schema defines the final structure of your data.
Therefore, any extra values inside any incoming events that are not mapped in the Avro schema will not be included in the final destination.
To avoid confusion between a custom Avro schema and the `include_keys` or `exclude_keys` sink configurations, Data Prepper does not allow the use of `include_keys` or `exclude_keys` with a custom schema.

In cases where your data is uniform, you may be able to automatically generate a schema.
Automatically generated schemas are based on the first event received by the codec.
The schema will only contain keys from this event.
Therefore, you must have all keys present in all events in order for the automatically generated schema to produce a working schema.
Automatically generated schemas make all fields nullable.
Use the sink's `include_keys` and `exclude_keys` configurations to control what data is included in the auto-generated schema.
Avro fields should use a null [union](https://avro.apache.org/docs/current/specification/#unions), as this will allow missing values. Otherwise, all required fields must be present for each event. Use non-nullable fields only when you are certain they exist.

Use the following options to configure the codec.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`schema` | Yes | String | The Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration). Not required if `auto_schema` is set to true.
`auto_schema` | No | Boolean | When set to `true`, automatically generates the Avro [schema declaration](https://avro.apache.org/docs/current/specification/#schema-declaration) from the first event.
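
The following is a sketch of an `avro` codec configuration with a custom schema. The record and field names are hypothetical, and each field uses a null union so that a missing value does not cause the write to fail:

```
codec:
  avro:
    schema: >
      {
        "type": "record",
        "name": "LogEvent",
        "fields": [
          { "name": "message", "type": ["null", "string"], "default": null },
          { "name": "status_code", "type": ["null", "int"], "default": null }
        ]
      }
```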


### ndjson codec

The `ndjson` codec writes each line as a JSON object.
### `ndjson` codec

The `ndjson` codec does not take any configurations.
The `ndjson` codec writes each event as a JSON object on its own line. This codec does not take any configurations.

### `json` codec

### json codec

The `json` codec writes events in a single large JSON file.
Each event is written into an object within a JSON array.
The `json` codec writes events in a single large JSON file. Each event is written into an object within a JSON array.

Use the following options to configure the codec.

Option | Required | Type | Description
:--- | :--- | :--- | :---
`key_name` | No | String | The name of the key for the JSON array. By default this is `events`.
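
For example, a minimal sketch that overrides the JSON array key name, where `log-events` is an arbitrary choice:

```
codec:
  json:
    key_name: log-events
```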

### `parquet` codec

### parquet codec

The `parquet` codec writes events into a Parquet file.
When using the Parquet codec, set the `buffer_type` to `in_memory`.
The `parquet` codec writes events into a Parquet file. When using the codec, set `buffer_type` to `in_memory`.

The Parquet codec writes data using the Avro schema.
Because Parquet requires an Avro schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema.
However, we generally recommend that you define your own schema so that it can best meet your needs.
The `parquet` codec writes data using the Avro schema. Because Parquet requires an Avro schema, you may either define the schema yourself or have Data Prepper automatically generate it. Defining your own schema is recommended, as this will allow it to be tailored to your particular use case.

For details on the Avro schema and recommendations, see the [Avro codec](#avro-codec) documentation.
For details on the Avro schema and recommendations, refer to [Avro codec](#avro-codec).

Use the following options to configure the codec.

Option | Required | Type | Description
:--- | :--- | :--- | :---
@@ -192,7 +213,7 @@

### Setting a schema with Parquet

The following example shows you how to configure the `s3` sink to write Parquet data into a Parquet file using a schema for [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records):
The following example pipeline shows how to configure the `s3` sink to write events to a Parquet file using a schema for [VPC Flow Logs](https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#flow-log-records):

```
pipeline:
@@ -235,4 +256,4 @@
event_collect_timeout: PT15M
buffer_type: in_memory
```

{% include copy-curl.html %}