Project Tungsten

One of the main goals of Project Tungsten is to reduce the use of Java objects to a minimum by introducing its own memory management. Tungsten stores data in a compact binary format, which also reduces the memory footprint. Because the schema of a dataset is known, data can be laid out in that format immediately and kept in already-serialized form, which reduces or completely avoids conversion between the JVM object representation and Spark’s internal one.
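To make the idea of a schema-driven binary layout concrete, here is a minimal sketch in Java. It is not Spark's implementation: the class name BinaryRowSketch and its methods are hypothetical, it supports only fixed-width fields, and it assumes at most 64 fields so that a single 8-byte word can serve as the null bitmap. Each field then occupies one 8-byte slot at an offset computed from its ordinal, so no per-field Java object is needed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical, simplified row format (an assumption for illustration):
// [8-byte null bitmap][8-byte slot for field 0][8-byte slot for field 1]...
final class BinaryRowSketch {
    private static final int WORD = 8;
    private final ByteBuffer buf;
    private final int numFields;

    BinaryRowSketch(int numFields) {
        if (numFields < 0 || numFields > 64) {
            throw new IllegalArgumentException("this sketch supports 0..64 fields");
        }
        this.numFields = numFields;
        this.buf = ByteBuffer.allocate(WORD + numFields * WORD).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Byte offset of a field's slot, derived purely from the schema position.
    private int slot(int ordinal) {
        if (ordinal < 0 || ordinal >= numFields) {
            throw new IndexOutOfBoundsException("ordinal " + ordinal);
        }
        return WORD + ordinal * WORD;
    }

    void setLong(int ordinal, long value) {
        int offset = slot(ordinal);
        buf.putLong(0, buf.getLong(0) & ~(1L << ordinal)); // clear the null bit
        buf.putLong(offset, value);
    }

    void setDouble(int ordinal, double value) {
        int offset = slot(ordinal);
        buf.putLong(0, buf.getLong(0) & ~(1L << ordinal)); // clear the null bit
        buf.putDouble(offset, value);
    }

    void setNull(int ordinal) {
        int offset = slot(ordinal);
        buf.putLong(0, buf.getLong(0) | (1L << ordinal)); // set the null bit
        buf.putLong(offset, 0L);                          // zero the slot
    }

    boolean isNull(int ordinal) {
        slot(ordinal); // bounds check only
        return (buf.getLong(0) & (1L << ordinal)) != 0;
    }

    long getLong(int ordinal) { return buf.getLong(slot(ordinal)); }

    double getDouble(int ordinal) { return buf.getDouble(slot(ordinal)); }

    int sizeInBytes() { return buf.capacity(); }

    public static void main(String[] args) {
        BinaryRowSketch row = new BinaryRowSketch(3);
        row.setLong(0, 42L);
        row.setDouble(1, 1.5);
        row.setNull(2);
        System.out.println(row.getLong(0) + " " + row.getDouble(1) + " " + row.isNull(2));
    }
}
```

A three-field row in this sketch takes 32 bytes in one contiguous buffer, whereas the same values held as boxed Java objects would each carry an object header and a reference.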

Project Tungsten uses the sun.misc.Unsafe API for direct memory access. Data stored this way bypasses the JVM's object management, so it does not add to garbage collection overhead.
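The following sketch shows what direct memory access through sun.misc.Unsafe can look like. It is an illustration only, not code from Spark: the class OffHeapSketch and the method sumOffHeap are hypothetical names. The memory allocated here lives outside the Java heap, so the garbage collector never scans it, and it has to be freed explicitly. Recent JDK releases deprecate these Unsafe memory-access methods and may print warnings when they are compiled or used.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical illustration of off-heap reads and writes via sun.misc.Unsafe.
final class OffHeapSketch {
    private static final Unsafe UNSAFE = loadUnsafe();

    // Unsafe has no public constructor; the usual workaround is to read
    // its private static singleton field through reflection.
    private static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    // Copies the values into off-heap memory, then sums them back from raw addresses.
    static long sumOffHeap(long[] values) {
        if (values.length == 0) {
            return 0L;
        }
        long base = UNSAFE.allocateMemory(values.length * 8L);
        try {
            for (int i = 0; i < values.length; i++) {
                UNSAFE.putLong(base + i * 8L, values[i]);
            }
            long sum = 0L;
            for (int i = 0; i < values.length; i++) {
                sum += UNSAFE.getLong(base + i * 8L);
            }
            return sum;
        } finally {
            UNSAFE.freeMemory(base); // off-heap memory is never reclaimed by the GC
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOffHeap(new long[] {1L, 2L, 3L}));
    }
}
```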

Project Tungsten provides the following optimizations:

Memory Management using a Binary In-Memory Data Representation, aka the Tungsten row format
Cache-Aware Computations with a Cache-Aware Layout for high cache hit rates
Whole-Stage Code Generation

Tungsten performs code generation: it generates JVM bytecode on the fly to access Tungsten-managed memory structures, which makes access very fast.

Tungsten also introduces cache-aware data structures that take into account the machine's physical CPU caches at different levels (L1, L2, and L3).
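One way to picture a cache-aware layout is a sort that keeps each sort key next to the reference to its record, so that comparisons read a dense primitive array instead of following pointers to scattered objects. The sketch below is an assumption-laden illustration of that general idea, not Spark's sorter: the class PrefixSortSketch and the method sortedOrder are hypothetical, and the keys are assumed to be plain int values that fit entirely in the packed word.

```java
import java.util.Arrays;

// Hypothetical illustration: sort record indices by key using one dense long[].
final class PrefixSortSketch {

    // Returns the record indices in ascending key order (ties keep index order).
    static int[] sortedOrder(int[] keys) {
        long[] packed = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            // Upper 32 bits: the signed key. Lower 32 bits: the record index.
            packed[i] = ((long) keys[i] << 32) | (i & 0xFFFFFFFFL);
        }
        // Comparisons touch only this contiguous array, which is friendly to CPU caches.
        Arrays.sort(packed);
        int[] order = new int[keys.length];
        for (int i = 0; i < packed.length; i++) {
            order[i] = (int) packed[i]; // recover the record index from the low bits
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedOrder(new int[] {30, -5, 12, 12})));
    }
}
```

For the keys 30, -5, 12, 12 the resulting order is the index sequence 1, 2, 3, 0.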
