Goal: Improve Spark’s performance where feasible.
measure performance bottlenecks using new metrics, including blocked-time analysis
a live demo of a new performance analysis tool
CPU — not I/O (network) — is often a critical bottleneck
community dogma = network and disk I/O are major bottlenecks
a TPC-DS workload at two scales: a 20-machine cluster with 850 GB of data, and a 60-machine cluster with 2.5 TB of data.
network is almost irrelevant for performance of these workloads
network optimization could only reduce job completion time by, at most, 2%
10Gbps networking hardware is likely not necessary
serialized compressed data
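The "at most 2%" style of estimate comes from asking how much sooner a job would finish if its tasks never blocked on a given resource. Below is a minimal sketch of that idea, not the tool from the talk. The `Task` fields, the greedy slot scheduler, and all the numbers are assumptions made up for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    runtime_s: float          # observed task duration (hypothetical measurement)
    network_blocked_s: float  # part of that duration spent blocked on the network


def simulate_job_time(durations, slots):
    """Replay tasks in order; each task goes to the slot that frees up first."""
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


def max_improvement(tasks, slots):
    """Best-case fractional reduction in job time if network blocking were zero."""
    original = simulate_job_time([t.runtime_s for t in tasks], slots)
    if original == 0:
        return 0.0
    without_network = simulate_job_time(
        [t.runtime_s - t.network_blocked_s for t in tasks], slots
    )
    return (original - without_network) / original


# Made-up sample: four tasks sharing two slots.
tasks = [Task(10.0, 0.2), Task(12.0, 0.1), Task(8.0, 0.3), Task(11.0, 0.2)]
print(f"best-case improvement: {max_improvement(tasks, slots=2):.1%}")
```

The result is an upper bound: it assumes the blocked time disappears entirely, which no real network optimization would achieve.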
From Making Sense of Spark Performance - Kay Ousterhout (UC Berkeley) at Spark Summit 2015:
is better -
mind serialization time
impacts both CPU (time to serialize) and network (time to send the data over the wire)
Tungsten - recent initiative from Databricks - aims at reducing CPU time
jobs become more bottlenecked by I/O
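To get a feel for the serialization trade-off noted above, here is a small sketch that times serialization and compression and compares the byte counts. It is only an illustration: Spark serializes on the JVM (Java or Kryo serialization), while this uses Python's standard `pickle` and `zlib` as stand-ins. The `measure` helper and the sample records are hypothetical.

```python
import pickle
import time
import zlib


def measure(records, level=6):
    """Return CPU-side timings and byte counts for serializing and compressing."""
    t0 = time.perf_counter()
    raw = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
    serialize_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    compressed = zlib.compress(raw, level)
    compress_s = time.perf_counter() - t0

    return {
        "serialize_s": serialize_s,        # CPU cost of serializing
        "compress_s": compress_s,          # extra CPU cost of compressing
        "raw_bytes": len(raw),             # bytes on the wire without compression
        "compressed_bytes": len(compressed),
        "payload": compressed,
    }


# Made-up records with repetitive fields, so they compress well.
records = [(i, f"user-{i % 100}", i * 0.5) for i in range(100_000)]
stats = measure(records)
print(
    f"serialize {stats['serialize_s']:.4f}s, compress {stats['compress_s']:.4f}s, "
    f"{stats['raw_bytes']} -> {stats['compressed_bytes']} bytes"
)
```

Compression spends CPU time to save bytes on the wire, so whether it pays off depends on which of the two is the scarcer resource for a given job.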