KafkaRDD
KafkaRDD class represents an RDD of data from Apache Kafka. It uses KafkaRDDPartition for partitions, which know their preferred locations as the hosts of the topic (the port, however, is not part of a preferred location). Each RDD partition thus maps neatly to a single Kafka partition.
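The partition-to-host mapping can be sketched with a simplified, hypothetical stand-in for KafkaRDDPartition (SimpleKafkaPartition and its fields are illustrative only, not Spark's actual API):

```scala
// Hypothetical, simplified model of KafkaRDDPartition: an RDD partition
// that remembers its Kafka topic, partition number, offset range, and the
// leader's host.
case class SimpleKafkaPartition(
  index: Int,        // Spark partition index
  topic: String,     // Kafka topic
  partition: Int,    // Kafka partition number
  fromOffset: Long,  // inclusive starting offset
  untilOffset: Long, // exclusive ending offset
  host: String)      // leader host (no port)

object KafkaPartitionSketch {
  // The preferred location is just the host -- the port is not included.
  def preferredLocations(p: SimpleKafkaPartition): Seq[String] = Seq(p.host)
}
```

Returning only the host lets Spark schedule the task on any executor running on that machine, regardless of which port the broker listens on.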
Tip: Studying the KafkaRDD class can greatly improve your understanding of Spark (core) in general, i.e. how RDDs are used for distributed computations.
KafkaRDD overrides methods of the RDD class to base them on offsetRanges, i.e. its partitions.
You can create a KafkaRDD using KafkaUtils.createRDD(sc: SparkContext, kafkaParams: Map[String, String], offsetRanges: Array[OffsetRange]).
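What createRDD does with the offset ranges can be sketched as below, assuming a minimal hypothetical OffsetRange and RangePartition (simplified models, not the real classes from the Kafka integration module): one RDD partition is created per offset range, indexed by its position in the array.

```scala
// Hypothetical, minimal OffsetRange: a topic, a partition, and a
// half-open offset interval [fromOffset, untilOffset).
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

// A simplified RDD partition wrapping one offset range.
case class RangePartition(index: Int, range: OffsetRange)

object CreateRddSketch {
  // Simplified partitioning step: one RDD partition per offset range,
  // indexed by its position in the input array.
  def toPartitions(ranges: Array[OffsetRange]): Array[RangePartition] =
    ranges.zipWithIndex.map { case (r, i) => RangePartition(i, r) }
}
```

For example, two offset ranges (one per Kafka partition of a topic) produce a two-partition RDD.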
Tip: Enable INFO logging for KafkaRDD to see the log messages described below.
To compute a partition, KafkaRDD checks the validity of the beginning and ending offsets (so they range over at least one element) and returns an (internal) KafkaRDDIterator.
You should see the following INFO message in the logs:
INFO KafkaRDD: Computing topic [topic], partition [partition] offsets [fromOffset] -> [toOffset]
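The validity check can be sketched as follows (assumed logic modeled on the description above, not Spark's actual code: the beginning offset must not exceed the ending one, and an equal pair denotes an empty range):

```scala
// Sketch of the offset validity check performed before computing a
// partition (assumed logic for illustration).
object OffsetCheck {
  // Returns an error message when the range is invalid, None otherwise.
  def validate(fromOffset: Long, untilOffset: Long): Option[String] =
    if (fromOffset > untilOffset)
      Some(s"Beginning offset $fromOffset is after the ending offset $untilOffset")
    else None

  // An empty range (fromOffset == untilOffset) yields no messages.
  def isEmpty(fromOffset: Long, untilOffset: Long): Boolean =
    fromOffset == untilOffset
}
```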
On every call it creates a new KafkaCluster, as well as kafka.serializer.Decoder instances for the key and the value (decoders must provide a constructor that accepts kafka.utils.VerifiableProperties).
It fetches batches of up to kc.config.fetchMessageMaxBytes bytes per topic, partition, and offset (using the kafka.consumer.SimpleConsumer.fetch(kafka.api.FetchRequest) method).
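How the batched fetching could proceed can be sketched as below. The fixed messageSize is illustrative only (real Kafka messages vary in size; fetchMessageMaxBytes caps the bytes returned per fetch request), so this is a rough model, not the consumer's actual behavior.

```scala
// Sketch of batched fetching (illustrative assumption: fixed-size
// messages): starting at fromOffset, pull messages in chunks whose size
// is capped by a byte budget akin to fetchMessageMaxBytes, advancing the
// offset until untilOffset is reached.
object FetchSketch {
  // Returns the sequence of half-open offset intervals fetched per batch.
  def batches(fromOffset: Long, untilOffset: Long,
              maxBytes: Long, messageSize: Long): Seq[(Long, Long)] = {
    val perBatch = math.max(1L, maxBytes / messageSize)
    Iterator.iterate(fromOffset)(_ + perBatch)
      .takeWhile(_ < untilOffset)
      .map(start => (start, math.min(start + perBatch, untilOffset)))
      .toSeq
  }
}
```

For instance, a range of 10 messages with a budget of 2 messages per fetch yields 5 consecutive fetch requests.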
Caution: FIXME Review