A Data Prepper instance requires 2 configuration files to run, and allows an optional 3rd Log4j 2 configuration file (see Logging).
- A YAML file which describes the data pipelines to run (including sources, processors, and sinks)
- A YAML file containing Data Prepper server settings, primarily for interacting with the exposed Data Prepper server APIs
- An optional Log4j 2 configuration file (can be JSON, YAML, XML, or .properties)
For Data Prepper before version 2.0, the `.jar` file expects the pipeline configuration file path followed by the server configuration file path. Example:
java -jar data-prepper-core-$VERSION.jar pipelines.yaml data-prepper-config.yaml
Optionally add `-Dlog4j.configurationFile=config/log4j2.properties` to the command if you would like to pass a custom Log4j 2 configuration file. If no properties file is provided, Data Prepper will default to the `log4j2.properties` file in the shared-config directory.
For Data Prepper 2.0 or above, Data Prepper is launched through the `data-prepper` script, with no additional command-line arguments needed:
bin/data-prepper
Configuration files are read from specific subdirectories in the application's home directory:
- `pipelines/`: for pipeline configurations; pipeline configurations can be written in one or more YAML files
- `config/data-prepper-config.yaml`: for the Data Prepper server configuration
You can continue to supply your own pipeline configuration file path followed by the server configuration file path, but the support for this method will be dropped in a future release. Example:
bin/data-prepper pipelines.yaml data-prepper-config.yaml
Additionally, the Log4j 2 configuration file is read from `config/log4j2.properties` in the application's home directory.
Example Pipeline configuration file (pipelines.yaml):
version: 2
entry-pipeline:
  workers: 4
  delay: "100"
  source:
    otel_trace_source:
      ssl: false
  sink:
    - pipeline:
        name: "raw-pipeline"
    - pipeline:
        name: "service-map-pipeline"
raw-pipeline:
  workers: 4
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - otel_trace_raw:
  sink:
    - stdout:
service-map-pipeline:
  workers: 4
  delay: "100"
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - service_map_stateful:
  sink:
    - stdout:
This sample pipeline creates a source to receive trace data and outputs transformed data to stdout.
- `delay` (Optional): An `int` representing the maximum duration in milliseconds to retrieve records from the buffer. If the buffer's specified batch_size has not been reached before this duration is exceeded, a partial batch is used. If this value is set to 0, all available records up to the batch size will be immediately returned. If the buffer is empty, the buffer will block for up to 5 milliseconds to wait for records. Default value is `3000`.
- `workers` (Optional): An `int` representing the number of ProcessWorker threads for the pipeline. Default value is `1`.
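The `delay` behavior described above can be sketched in Python as a bounded wait on a queue. This is an illustration only, not Data Prepper's implementation: `read_batch` and its parameters are hypothetical names, a standard `queue.Queue` stands in for the pipeline buffer, and the 5 millisecond wait on an empty buffer is not modeled.

```python
import queue
import time


def read_batch(buffer, batch_size, delay_ms):
    """Collect up to batch_size records, waiting at most delay_ms in total.

    Hypothetical helper: 'buffer' is a queue.Queue standing in for the
    pipeline buffer described in the text.
    """
    deadline = time.monotonic() + delay_ms / 1000.0
    batch = []
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        try:
            if remaining > 0:
                # Still inside the delay window: wait for the next record.
                batch.append(buffer.get(timeout=remaining))
            else:
                # Window elapsed (or delay is 0): take only what is ready now.
                batch.append(buffer.get_nowait())
        except queue.Empty:
            break  # Nothing more available: return a partial batch.
    return batch
```

With `delay_ms=0` the function returns whatever is already queued, up to `batch_size`; with a positive delay it returns early as soon as the batch is full.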
The pipeline configuration file now supports an optional `version` attribute. This helps users ensure that the pipeline configuration in use is compatible with the running Data Prepper version. At start time, Data Prepper compares the version supplied in the configuration against its own version and throws an exception if the version in the pipeline configuration is greater than the running Data Prepper version.
This attribute can be specified in a shorthand format with only the major version (e.g. `2`) or with the major and minor version (e.g. `2.1`).
Data Prepper Version | Pipeline Configuration Version | Compatible |
---|---|---|
2.1 | 2 | true |
2.1 | 2.1 | true |
2.1 | 2.0 | true |
2.1 | null | true |
2.1 | 1.5 | true |
2.1 | 1 | true |
2.1 | 3.0 | false |
2.1 | 3 | false |
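
The rule behind the table can be sketched as a small Python function. This is a hedged illustration of the comparison described above, not Data Prepper's actual code: `parse_version` and `is_compatible` are hypothetical names, and treating a major-only version as matching any minor version of that major is an assumption inferred from the table rows.

```python
def parse_version(text):
    """Split '2' or '2.1' into (major, minor); minor is None for shorthand."""
    parts = str(text).split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else None
    return major, minor


def is_compatible(running_version, pipeline_version):
    """Return True if the pipeline version is not newer than the running one."""
    if pipeline_version is None:
        return True  # No version attribute supplied: always accepted.
    run_major, run_minor = parse_version(running_version)
    cfg_major, cfg_minor = parse_version(pipeline_version)
    if cfg_major != run_major:
        return cfg_major < run_major
    if cfg_minor is None:
        return True  # Assumption: major-only shorthand matches any minor.
    return cfg_minor <= (run_minor or 0)
```

For example, `is_compatible("2.1", "1.5")` is accepted because the pipeline version is older, while `is_compatible("2.1", "3")` is rejected.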
Data Prepper allows the following properties to be configured:
- `ssl`: boolean indicating TLS should be used for server APIs. Defaults to `true`.
- `key_store_file_path`: string path to a .jks or .p12 keystore file. Required if `ssl` is `true`.
  - alias for this property is `keyStoreFilePath`, which is deprecated and planned for removal
- `key_store_password`: string password for keystore. Optional, defaults to empty string.
  - alias for this property is `keyStorePassword`, which is deprecated and planned for removal
- `private_key_password`: string password for private key within keystore. Optional, defaults to empty string.
  - alias for this property is `privateKeyPassword`, which is deprecated and planned for removal
- `server_port`: integer port number to use for server APIs. Defaults to `4900`.
  - alias for this property is `serverPort`, which is deprecated and planned for removal
- `metric_registries`: list of metrics registries for publishing the generated metrics. Defaults to Prometheus; Prometheus and CloudWatch are currently supported.
  - alias for this property is `metricRegistries`, which is deprecated and planned for removal
- `metric_tags`: map of metric tag key-value pairs applied as common metric tags to meter registries. Defaults to empty map. The maximum number of pairs is limited to 3. Note that `serviceName` is a reserved tag key with `DataPrepper` as default tag value. Its value can also be set through the environment variable `DATAPREPPER_SERVICE_NAME`. If `serviceName` is defined in `metric_tags`, that value overrides any value set through the above mechanisms.
  - alias for this property is `metricTags`, which is deprecated and planned for removal
- `metric_tag_filters`: list of pattern and tags. For each metric, the patterns are evaluated in the configured order, and only the tags from the first pattern that matches the metric name are added to the metric. If none of the patterns match, the tags from `metric_tags` are applied. Defaults to empty list.
  - `pattern`: A string representing the Ant-style pattern of the metrics to match. The path separator for the Ant path is ".", which is the separator used in all the metrics. See the Ant-style path pattern documentation for details.
  - `tags`: A map of key-value pairs applied to metrics that match `pattern`. The maximum number of pairs is limited to 3. Note that `serviceName` is a reserved tag key with `DataPrepper` as default tag value. Its value can also be set through the environment variable `DATAPREPPER_SERVICE_NAME`. If `serviceName` is defined in `metric_tags`, that value overrides any value set through the above mechanisms.

Example Data Prepper configuration file (data-prepper-config.yaml) with SSL enabled:
ssl: true
key_store_file_path: "/usr/share/data-prepper/keystore.p12"
key_store_password: "password"
private_key_password: "password"
server_port: 4900
metric_registries: [Prometheus]
metric_tags:
  custom_key: custom_value
metric_tag_filters:
  - pattern: "test-pipeline.grok.**"
    tags:
      custom_key: custom_value
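
The first-match behavior of `metric_tag_filters` can be sketched in Python as follows. This is a simplified illustration, not Data Prepper's implementation: `ant_match` and `tags_for_metric` are hypothetical names, the matcher covers only the common Ant-style wildcards (`?`, `*`, `**`) with "." as the separator, and the metric name `test-pipeline.grok.recordsIn` is made up for the example.

```python
import re


def _segment_matches(pattern_seg, name_seg):
    # '*' matches any run of characters and '?' exactly one, within a segment.
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern_seg
    )
    return re.fullmatch(regex, name_seg) is not None


def _match(pats, segs):
    if not pats:
        return not segs
    if pats[0] == "**":
        # '**' may absorb zero or more whole segments.
        return any(_match(pats[1:], segs[i:]) for i in range(len(segs) + 1))
    if not segs:
        return False
    return _segment_matches(pats[0], segs[0]) and _match(pats[1:], segs[1:])


def ant_match(pattern, name, sep="."):
    """Simplified Ant-style match of a metric name against a pattern."""
    return _match(pattern.split(sep), name.split(sep))


def tags_for_metric(metric_name, filters, default_tags):
    """Return tags from the first matching filter, else the default tags."""
    for entry in filters:  # Evaluated in the configured order.
        if ant_match(entry["pattern"], metric_name):
            return dict(entry["tags"])
    return dict(default_tags)
```

Here `filters` mirrors the `metric_tag_filters` list and `default_tags` mirrors `metric_tags` from the configuration file above.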
The Data Prepper Docker image runs with SSL enabled using a default self-signed certificate. For more robust security, you should generate your own private key and certificate. You can generate the certificate using existing tools such as OpenSSL. If you'd like a short primer, you can mimic the steps used to create the default certificate and change them to suit your needs. Please note that for PKCS12 files (.p12), you should use the same password for the keystore and private key.
To run the Data Prepper Docker image with the default `data-prepper-config.yaml`, the command should look like this:
For Data Prepper 2.0 or above:
docker run \
--name data-prepper-test \
-p 4900:4900 \
--expose 21890 \
-v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines/pipelines.yaml \
data-prepper/data-prepper:latest
For Data Prepper before 2.0:
docker run \
--name data-prepper-test \
-p 4900:4900 \
--expose 21890 \
-v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml \
data-prepper/data-prepper:latest
To disable SSL, create a `data-prepper-config.yaml` with the following configuration:
ssl: false
To pass your own `data-prepper-config.yaml`, mount it as a volume in the Docker container by adding the argument below to `docker run`. Note that the config must be mounted to the proper path inside the container:
For Data Prepper 2.0 or above:
-v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/config/data-prepper-config.yaml
For Data Prepper before 2.0:
-v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml
If your `data-prepper-config.yaml` has SSL enabled and you are using your own keystore, the keystore must be mounted as a Docker volume as well. Note that the mount path should correspond with the `key_store_file_path` field (or its deprecated alias `keyStoreFilePath`) in your `data-prepper-config.yaml`. It is recommended to mount to `/usr/share/data-prepper/config/data-prepper-config.yaml` (for Data Prepper 2.0 or above) or `/usr/share/data-prepper/data-prepper-config.yaml` (for Data Prepper before 2.0) to ensure that the path exists in the Docker image.
To do so, add the argument below to the `docker run` command.
-v /full/path/to/keystore.p12:/usr/share/data-prepper/keystore.p12
Data Prepper supports circuit breakers which will interrupt adding objects to the buffer when certain conditions are met.
Heap circuit breaker: When the JVM heap usage reaches a configurable size, Data Prepper stops accepting requests to buffers.
Configuration
circuit_breakers:
  heap:
    usage: 6.5gb
    reset: 2s
- `usage` - float - The absolute value of JVM memory which will trip the circuit breaker. This can be defined with bytes (`b`), kilobytes (`kb`), megabytes (`mb`), or gigabytes (`gb`).
- `reset` - Duration - The time between when the circuit is tripped and the next attempt to validate will occur. Defaults to 1s.
- `check_interval` - Duration - The time between checks of the heap usage. Defaults to 500ms.
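
To make the `usage` threshold concrete, the following Python sketch converts a value such as `6.5gb` to bytes and checks it against a heap reading. It is an illustration under stated assumptions, not Data Prepper's code: `parse_byte_size` and `is_tripped` are hypothetical names, units are assumed to be binary (1 kb = 1024 bytes), and the breaker is assumed to trip when usage is greater than or equal to the threshold.

```python
import re

# Assumption: binary multiples. The source does not state the exact factor.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_byte_size(text):
    """Convert a size string such as '6.5gb' or '512mb' to a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(b|kb|mb|gb)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognized byte size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def is_tripped(heap_used_bytes, usage_setting):
    """Return True when heap usage has reached the configured threshold."""
    return heap_used_bytes >= parse_byte_size(usage_setting)
```

In a running system, a check like `is_tripped` would be evaluated once per `check_interval`, and a tripped breaker would be re-evaluated after `reset`.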
Starting in Data Prepper 1.3.0, Prepper plugins were renamed to Processors. The use of the prepper or processor name in pipeline configuration files is still supported. However, the use of both processor and prepper in the same configuration file is not supported.
Starting in Data Prepper 2.0, the use of the prepper name in pipeline configuration files is no longer supported.
An example of a deprecated prepper pipeline configuration file (pipelines.yaml):
grok-pipeline:
  source:
    http:
  prepper:
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - stdout:
To continue to use the same configuration in Data Prepper 2.0 or above, rename `prepper` to `processor`:
grok-pipeline:
  source:
    http:
  processor:
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG}" ]
  sink:
    - stdout: