= Learning Spark internals using groupBy (to cause shuffle)

Execute the following operation in spark-shell and, using the web UI, explain the transformations, actions, jobs, stages, tasks, and partitions involved.

[source,scala]
----
sc.parallelize(0 to 999, 50).zipWithIndex.groupBy(_._1 / 10).collect
----
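Before triggering the action, it can help to call `toDebugString` on the RDD in spark-shell to see the shuffle boundary that `groupBy` introduces in the lineage. The grouping logic itself can be sketched with plain Scala collections (a local approximation only: on an RDD, `zipWithIndex` yields `Long` indices and the data is spread across the 50 partitions):

[source,scala]
----
// Local sketch of the grouping logic, without Spark.
val pairs  = (0 to 999).zipWithIndex        // (value, index) pairs
val groups = pairs.groupBy(_._1 / 10)       // bucket by value / 10, keys 0 to 99
assert(groups.size == 100)                  // 100 groups of 10 pairs each
assert(groups(42).map(_._1) == (420 to 429)) // group 42 holds values 420..429
----

When you run the actual `collect`, the web UI should show a single job (one action) split into two stages, because the `groupBy` shuffle marks a stage boundary.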

You may also make the exercise a bit heavier by explaining data distribution across the cluster and going over the concepts of drivers, masters, workers, and executors.