
spark-http-rdd


Installation

Add it to your build.sbt:

Spark 3

Compiled for Scala 2.12:

libraryDependencies += "com.github.fsanaulla" %% "spark3-http-rdd" % <version>

Spark 2

Cross-compiled for Scala 2.11 and 2.12:

libraryDependencies += "com.github.fsanaulla" %% "spark2-http-rdd" % <version>

Usage

Let's define our source URI:

val baseUri: URI = ???
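
For instance, a minimal sketch, assuming a hypothetical local endpoint that serves line-separated rows:

import java.net.URI

// Hypothetical endpoint; replace with your actual HTTP source
val baseUri: URI = new URI("http://localhost:8080/api/items")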

We will build our partitions on top of it using an array of URIModifier, which looks like this:

val uriPartitioner: Array[URIModifier] = Array(
  URIModifier.fromFunction { uri =>
    // uri modification logic, 
    // for example appending path, adding query params etc
  },
  ...
)

Important: the number of URIModifiers should equal the desired number of partitions. Each modified URI is used as the base URI of a separate partition. A sketch of such an array follows below.
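
As a sketch, assuming the endpoint accepts a hypothetical page query parameter and that URIModifier is imported from the library, four partitions could be built like this:

import java.net.URI

// One URIModifier per desired partition; here each partition fetches its own page
val uriPartitioner: Array[URIModifier] = Array.tabulate(4) { i =>
  URIModifier.fromFunction { uri =>
    // the `page` parameter is an assumption about the endpoint
    URI.create(s"${uri.toString}?page=$i")
  }
}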

Then we define how HTTP endpoint responses are handled. By default, the endpoint is expected to return newline-separated rows, and each row is processed as a separate entity during response mapping:

val mapping: String => T = ??? 
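
For example, a sketch assuming each line is a comma-separated row containing an id and a name:

// Hypothetical row type; adjust to your payload
final case class Item(id: Long, name: String)

// Each newline-separated row arrives as a single String
val mapping: String => Item = { line =>
  val Array(id, name) = line.split(",", 2)
  Item(id.trim.toLong, name.trim)
}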

Then you can create the RDD:

val rdd: RDD[T] =
  HttpRDD.create(
    sc,
    baseUri,
    uriPartitioner,
    mapping
  )
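
The result behaves like any other RDD; for example, assuming the Item mapping sketched above:

// Triggers the HTTP requests, one per partition, and prints a few rows
rdd.take(10).foreach(println)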

More details are available in the source code. The integration tests can also serve as usage examples.

About

RDD primitive for fetching data from an HTTP source
