KafkaRDD

The KafkaRDD class represents an RDD of records from Apache Kafka. It uses KafkaRDDPartition for its partitions; each partition knows its preferred location, which is the host of the Kafka broker serving the topic partition (the host only, not the port!). A KafkaRDD partition thus maps one-to-one to a Kafka partition.
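
The gist of that preferred-location logic, as a self-contained sketch (PartitionSketch is a hypothetical stand-in for KafkaRDDPartition, for illustration only):

[source, scala]
----
// Hypothetical stand-in for KafkaRDDPartition, for illustration only.
final case class PartitionSketch(topic: String, partition: Int, host: String, port: Int)

// The preferred location is the broker's host alone -- the port is ignored.
def preferredLocation(part: PartitionSketch): Seq[String] = Seq(part.host)
----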

Tip
Studying the KafkaRDD class can greatly improve your understanding of Spark Core in general, i.e. how RDDs are used for distributed computations.

KafkaRDD overrides methods of the RDD class to base them on offsetRanges, i.e. the partitions.

You can create a KafkaRDD using KafkaUtils.createRDD(sc: SparkContext, kafkaParams: Map[String, String], offsetRanges: Array[OffsetRange]).
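
A minimal sketch, assuming a broker at localhost:9092 and a topic my-topic with at least 100 messages in partition 0 (both hypothetical):

[source, scala]
----
import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val sc = new SparkContext("local[*]", "KafkaRDD-demo")

// Broker address and topic name are assumptions for this sketch.
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// Offsets 0 (inclusive) until 100 (exclusive) of partition 0 of topic "my-topic"
val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 100L))

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

rdd.collect().foreach { case (k, v) => println(s"$k -> $v") }
----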

Tip

Enable INFO logging level for the org.apache.spark.streaming.kafka.KafkaRDD logger to see what happens inside KafkaRDD.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.streaming.kafka.KafkaRDD=INFO

Computing Partitions

To compute a partition, KafkaRDD checks the validity of the beginning and ending offsets (so that the range covers at least one element) and returns an (internal) KafkaRDDIterator.
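
The shape of that check, as a simplified sketch (computeSketch is hypothetical and not the actual Spark source; the real method returns a KafkaRDDIterator over the fetched messages):

[source, scala]
----
// Simplified sketch of the control flow described above.
def computeSketch(topic: String, partition: Int,
                  fromOffset: Long, untilOffset: Long): Iterator[String] = {
  assert(fromOffset <= untilOffset,
    s"Beginning offset $fromOffset is after the ending offset $untilOffset " +
    s"for $topic $partition")
  if (fromOffset == untilOffset) Iterator.empty  // empty range: nothing to fetch
  else Iterator(s"KafkaRDDIterator over $topic/$partition [$fromOffset, $untilOffset)")
}
----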

You should see the following INFO message in the logs:

INFO KafkaRDD: Computing topic [topic], partition [partition] offsets [fromOffset] -> [toOffset]

Every time it is called, it creates a new KafkaCluster as well as kafka.serializer.Decoder instances for the key and the value (using the constructor that accepts kafka.utils.VerifiableProperties).
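
For illustration, a decoder following that convention can be instantiated reflectively as sketched below (StringDecoder serves as an example decoder here):

[source, scala]
----
import java.util.Properties
import kafka.serializer.{Decoder, StringDecoder}
import kafka.utils.VerifiableProperties

// Instantiate a decoder via the constructor that accepts VerifiableProperties.
val props = new VerifiableProperties(new Properties())
val keyDecoder: Decoder[String] =
  classOf[StringDecoder]
    .getConstructor(classOf[VerifiableProperties])
    .newInstance(props)

println(keyDecoder.fromBytes("hello".getBytes("UTF-8")))  // hello
----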

It fetches batches of up to kc.config.fetchMessageMaxBytes bytes per topic, partition, and offset (using the kafka.consumer.SimpleConsumer.fetch(kafka.api.FetchRequest) method).
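
For illustration, such a fetch looks roughly like this with the (legacy) SimpleConsumer API; host, port, client id, topic, and fetch size are all assumptions for this sketch:

[source, scala]
----
import kafka.api.FetchRequestBuilder
import kafka.consumer.SimpleConsumer

// Connection settings below are assumptions for this example.
val consumer = new SimpleConsumer("localhost", 9092, 30000, 64 * 1024, "kafka-rdd-sketch")

val fetchSizeBytes = 1024 * 1024  // stands in for kc.config.fetchMessageMaxBytes

val request = new FetchRequestBuilder()
  .addFetch("my-topic", 0, 0L, fetchSizeBytes)  // topic, partition, offset, max bytes
  .build()

val response = consumer.fetch(request)
response.messageSet("my-topic", 0).foreach { mo =>
  println(s"offset ${mo.offset}: ${mo.message}")
}
consumer.close()
----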

Caution
FIXME Review