KafkaRDD

The KafkaRDD class represents an RDD of records from Apache Kafka. It uses KafkaRDDPartition for its partitions; each partition knows its preferred location, which is the host of the Kafka broker serving the topic partition (the host only, not the port!). A KafkaRDD partition thus maps one-to-one to a Kafka partition.
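
The gist of that preferred-location logic, as a self-contained sketch (PartitionSketch is a hypothetical stand-in for KafkaRDDPartition, for illustration only):

[source, scala]
----
// Hypothetical stand-in for KafkaRDDPartition, for illustration only.
final case class PartitionSketch(topic: String, partition: Int, host: String, port: Int)

// The preferred location is the broker's host alone -- the port is ignored.
def preferredLocation(part: PartitionSketch): Seq[String] = Seq(part.host)
----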

Tip
Studying the KafkaRDD class can greatly improve your understanding of Spark Core in general, i.e. how RDDs are used for distributed computations.

KafkaRDD overrides methods of the RDD class to base them on offsetRanges, i.e. the partitions.

You can create a KafkaRDD using KafkaUtils.createRDD(sc: SparkContext, kafkaParams: Map[String, String], offsetRanges: Array[OffsetRange]).
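
A minimal sketch, assuming a broker at localhost:9092 and a topic my-topic with at least 100 messages in partition 0 (both hypothetical):

[source, scala]
----
import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val sc = new SparkContext("local[*]", "KafkaRDD-demo")

// Broker address and topic name are assumptions for this sketch.
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// Offsets 0 (inclusive) until 100 (exclusive) of partition 0 of topic "my-topic"
val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 100L))

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

rdd.collect().foreach { case (k, v) => println(s"$k -> $v") }
----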

Tip

Enable INFO logging level for the org.apache.spark.streaming.kafka.KafkaRDD logger to see what happens inside KafkaRDD.

Add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.streaming.kafka.KafkaRDD=INFO

Computing Partitions

To compute a partition, KafkaRDD checks the validity of the beginning and ending offsets (so that the range covers at least one element) and returns an (internal) KafkaRDDIterator.
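
The shape of that check, as a simplified sketch (computeSketch is hypothetical and not the actual Spark source; the real method returns a KafkaRDDIterator over the fetched messages):

[source, scala]
----
// Simplified sketch of the control flow described above.
def computeSketch(topic: String, partition: Int,
                  fromOffset: Long, untilOffset: Long): Iterator[String] = {
  assert(fromOffset <= untilOffset,
    s"Beginning offset $fromOffset is after the ending offset $untilOffset " +
    s"for $topic $partition")
  if (fromOffset == untilOffset) Iterator.empty  // empty range: nothing to fetch
  else Iterator(s"KafkaRDDIterator over $topic/$partition [$fromOffset, $untilOffset)")
}
----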

You should see the following INFO message in the logs:

INFO KafkaRDD: Computing topic [topic], partition [partition] offsets [fromOffset] -> [toOffset]

Every time it is called, it creates a new KafkaCluster as well as kafka.serializer.Decoder instances for the key and the value (using the constructor that accepts kafka.utils.VerifiableProperties).
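
For illustration, a decoder following that convention can be instantiated reflectively as sketched below (StringDecoder serves as an example decoder here):

[source, scala]
----
import java.util.Properties
import kafka.serializer.{Decoder, StringDecoder}
import kafka.utils.VerifiableProperties

// Instantiate a decoder via the constructor that accepts VerifiableProperties.
val props = new VerifiableProperties(new Properties())
val keyDecoder: Decoder[String] =
  classOf[StringDecoder]
    .getConstructor(classOf[VerifiableProperties])
    .newInstance(props)

println(keyDecoder.fromBytes("hello".getBytes("UTF-8")))  // hello
----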

It fetches batches of up to kc.config.fetchMessageMaxBytes bytes per topic, partition, and offset (using the kafka.consumer.SimpleConsumer.fetch(kafka.api.FetchRequest) method).
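
For illustration, such a fetch looks roughly like this with the (legacy) SimpleConsumer API; host, port, client id, topic, and fetch size are all assumptions for this sketch:

[source, scala]
----
import kafka.api.FetchRequestBuilder
import kafka.consumer.SimpleConsumer

// Connection settings below are assumptions for this example.
val consumer = new SimpleConsumer("localhost", 9092, 30000, 64 * 1024, "kafka-rdd-sketch")

val fetchSizeBytes = 1024 * 1024  // stands in for kc.config.fetchMessageMaxBytes

val request = new FetchRequestBuilder()
  .addFetch("my-topic", 0, 0L, fetchSizeBytes)  // topic, partition, offset, max bytes
  .build()

val response = consumer.fetch(request)
response.messageSet("my-topic", 0).foreach { mo =>
  println(s"offset ${mo.offset}: ${mo.message}")
}
consumer.close()
----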

Caution
FIXME Review