Skip to content

Latest commit

 

History

History
19 lines (11 loc) · 1.38 KB

spark-sql-tungsten.adoc

File metadata and controls

19 lines (11 loc) · 1.38 KB

Project Tungsten

One of the main motivations of Project Tungsten is to greatly reduce the usage of Java objects to minimum by introducing its own memory management. Tungsten uses a compact storage format for data representation that also reduces memory footprint. With a known schema for datasets, the proper data layout is possible immediately with the data being already serialized (that further reduces or completely avoids serialization between JVM object representation and Spark’s internal one).

Project Tungsten uses sun.misc.unsafe API for direct memory access to bypass the JVM in order to avoid garbage collection.

The optimizations provided by the project Tungsten:

  1. Memory Management using Binary In-Memory Data Representation aka Tungsten row format.

  2. Cache-Aware Computations with Cache-Aware Layout for high cache hit rates

  3. Whole-Stage Code Generation

Tungsten does code generation, i.e. generates JVM bytecode on the fly, to access Tungsten-managed memory structures that gives a very fast access.

Tungsten also introduces cache-aware data structures that are aware of the physical machine caches at different levels - L1, L2, L3.