Configuration Properties

Configuration properties are used to fine-tune Spark Structured Streaming applications.

You can set them for a SparkSession when it is created using the config method.

import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder
  .config("spark.sql.streaming.metricsEnabled", true)
  .getOrCreate
Tip
Read up on SparkSession in The Internals of Spark SQL book.
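
Properties can also be read or changed on an existing session through spark.conf. A minimal sketch (the values shown are illustrative):

// Current value of a property (its default when it has not been set explicitly)
spark.conf.get("spark.sql.streaming.numRecentProgressUpdates")  // "100"

// Change a property on the running session
spark.conf.set("spark.sql.streaming.numRecentProgressUpdates", "200")
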
Table 1. Structured Streaming’s Properties
Name Description

spark.sql.streaming.aggregation.stateFormatVersion

(internal) Version of the state format

Default: 2

Supported values:

  • 1

  • 2

Used when the StatefulAggregationStrategy execution planning strategy is executed and plans a streaming query with an aggregate (which boils down to creating a StateStoreRestoreExec with the proper implementation version of StreamingAggregationStateManager)

One of the checkpointed properties that are not supposed to be overridden after a streaming query has been started (and can later recover from a checkpoint after being restarted)

spark.sql.streaming.checkpointLocation

Default checkpoint directory for storing checkpoint data

Default: (empty)
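
When set, the property acts as a base directory under which every started query gets its own checkpoint subdirectory; a per-query checkpointLocation option still takes precedence. A minimal sketch (the paths and the rate/console formats are only for illustration):

// Session-wide base directory for checkpoint data (hypothetical path)
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/spark-checkpoints")

// This query overrides the session default with its own checkpoint location
val query = spark.readStream
  .format("rate")
  .load
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/spark-checkpoints/my-query") // hypothetical path
  .start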

spark.sql.streaming.continuous.executorQueueSize

(internal) The size (measured in number of rows) of the queue used in continuous execution to buffer the results of a ContinuousDataReader.

Default: 1024

spark.sql.streaming.continuous.executorPollIntervalMs

(internal) The interval (in millis) at which continuous execution readers will poll to check whether the epoch has advanced on the driver.

Default: 100 (ms)

spark.sql.streaming.disabledV2MicroBatchReaders

(internal) A comma-separated list of fully-qualified class names of data source providers for which MicroBatchReadSupport is disabled. Reads from these sources will fall back to the V1 Sources.

Default: (empty)

Use SQLConf.disabledV2StreamingMicroBatchReaders to get the current value.

spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion

(internal) State format version used by flatMapGroupsWithState operation in a streaming query.

Default: 2

Supported values:

  • 1

  • 2

spark.sql.streaming.maxBatchesToRetainInMemory

(internal) The maximum number of batches which will be retained in memory to avoid loading from files.

Default: 2

Maximum count of versions a State Store implementation should retain in memory.

The value adjusts the trade-off between memory usage and cache misses:

  • 2 covers both success and direct failure cases

  • 1 covers only the success case

  • 0 or a negative value disables the cache to maximize the memory available to executors

Used exclusively when HDFSBackedStateStoreProvider is requested to initialize.

spark.sql.streaming.metricsEnabled

Flag to control whether Dropwizard (CodaHale) metrics are reported for active streaming queries

Default: false

Use SQLConf.streamingMetricsEnabled to get the current value
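
Besides the config call at session creation shown above, the flag can be flipped at runtime before a query is started. A minimal sketch:

// Enable Dropwizard metrics reporting for streaming queries started afterwards
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")

// Double-check the effective value
assert(spark.conf.get("spark.sql.streaming.metricsEnabled") == "true")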

spark.sql.streaming.minBatchesToRetain

(internal) The minimum number of entries to retain for failure recovery

Default: 100

Use SQLConf.minBatchesToRetain to get the current value

spark.sql.streaming.multipleWatermarkPolicy

Global watermark policy, i.e. how to calculate the global watermark value when there are multiple watermark operators in a streaming query

Default: min

Supported values:

  • min - chooses the minimum watermark reported across multiple operators

  • max - chooses the maximum watermark reported across multiple operators

Cannot be changed between query restarts from the same checkpoint location.
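
A minimal sketch of a query with two watermarked inputs where the policy matters (rate sources and the console sink are used only for illustration); the global watermark of the union then follows the configured policy:

spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")

val left = spark.readStream.format("rate").load
  .withWatermark("timestamp", "10 seconds")
val right = spark.readStream.format("rate").load
  .withWatermark("timestamp", "1 minute")

// Two watermark operators in one query; the global watermark follows the policy above
val query = left.union(right)
  .writeStream
  .format("console")
  .start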

spark.sql.streaming.noDataProgressEventInterval

(internal) How long (in millis) to wait between two progress events when there is no data, used when ProgressReporter is requested to finish a trigger

Default: 10000 (ms)

Use SQLConf.streamingNoDataProgressEventInterval to get the current value

spark.sql.streaming.numRecentProgressUpdates

Number of progress updates to retain for a streaming query

Default: 100
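
The retained updates are the ones exposed through StreamingQuery.recentProgress. A short sketch, assuming an already-started StreamingQuery named query:

// The most recent progress update (null if no progress has been reported yet)
val last = query.lastProgress

// Up to spark.sql.streaming.numRecentProgressUpdates of the latest updates
val recent = query.recentProgress
recent.foreach(p => println(p.prettyJson))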

spark.sql.streaming.pollingDelay

(internal) Time delay (in ms) before StreamExecution polls for new data when no data was available in a batch.

Default: 10

spark.sql.streaming.stateStore.maintenanceInterval

The initial delay and how often to execute StateStore’s maintenance task.

Default: 60s

spark.sql.streaming.stateStore.providerClass

(internal) The fully-qualified class name of the StateStoreProvider implementation that manages state data in stateful streaming queries. This class must have a zero-arg constructor.

Use SQLConf.stateStoreProviderClass to get the current value.

spark.sql.streaming.unsupportedOperationCheck

(internal) When enabled (true), StreamingQueryManager makes sure that the logical plan of a streaming query uses supported operations only.

Default: true