One of the main motivations of Project Tungsten is to greatly reduce the usage of Java objects to minimum by introducing its own memory management. Tungsten uses a compact storage format for data representation that also reduces memory footprint. With a known schema for datasets, the proper data layout is possible immediately with the data being already serialized (that further reduces or completely avoids serialization between JVM object representation and Spark’s internal one).
Project Tungsten uses sun.misc.unsafe
API for direct memory access to bypass the JVM in order to avoid garbage collection.
The optimizations provided by the project Tungsten:
-
Memory Management using Binary In-Memory Data Representation aka Tungsten row format.
-
Cache-Aware Computations with Cache-Aware Layout for high cache hit rates
Tungsten does code generation, i.e. generates JVM bytecode on the fly, to access Tungsten-managed memory structures that gives a very fast access.
Tungsten also introduces cache-aware data structures that are aware of the physical machine caches at different levels - L1, L2, L3.