
What Really Matters to Performance?


After the benchmark results were released, people were surprised by the big difference, and some even questioned the numbers. Both the Scala edition and the Rust edition do the same job on an in-memory tuplespace; the code is open source and anyone can verify it. The numbers are real. If a persistence layer is added one day, I bet the Rust edition would outperform by even more.

You may ask why the Rust edition is so fast. In fact, from what I see, it is not that fast. If you look at the second case, it took 3393 ms to run 100,000 rounds, that is about 0.0339 ms per round on a 2.8 GHz CPU, which means a single iteration takes roughly 100,000 CPU cycles. That is not fast compared with other languages.
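The arithmetic behind that estimate, for reference:

$$
\frac{3393\ \text{ms}}{100{,}000\ \text{rounds}} \approx 0.0339\ \text{ms/round},
\qquad
0.0339\ \text{ms} \times 2.8\times10^{9}\ \text{cycles/s} \approx 9.5\times10^{4}\ \text{cycles}.
$$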

So, the question should be: Why is the Scala RNode slow?

The benchmark test repeats the same snippets of code to ensure the JVM compiles them into native machine code just-in-time. Both the Rust and Scala editions therefore feed native machine code to the CPU, so why is the Scala RNode slow? Lock contention in the tuplespace? Redundant code in ADT sorting? Garbage collection?

Besides these reasons, a major one is usually left out.

In the context of performance, there are two kinds of work: CPU-bound and IO-bound. For an application like a Rholang interpreter, most of the work (e.g. normalization and reduction) is CPU-bound.

Light Speed is Limited, Memory is Slow

A modern CPU typically runs at a frequency of about 3 GHz, that is 3×10⁹ cycles per second if we simply consider one core. The speed of light is about 3×10⁸ m/s, which means that within one CPU cycle light can only travel about a decimeter. That is the limit of signal velocity.
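As a quick check:

$$
\frac{c}{f} = \frac{3\times10^{8}\ \text{m/s}}{3\times10^{9}\ \text{cycles/s}} = 0.1\ \text{m per cycle}.
$$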

Suppose a CPU instruction had to load a value from some main memory location within a single CPU cycle. The memory cell would have to be connected by a metallic path no longer than 5 cm, since the signal makes a round trip, and every gate on the path would have to switch in near-zero time, which is impractical to manufacture. In reality, main memory access time is about 50 ns, so the CPU has to wait on the order of 100 cycles or more to access main memory. As a result, processor designers added internal caches (L1/L2/L3) to reduce the penalty of main memory access. Don't be surprised if one day an L4 cache is added as well.
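At 3 GHz, that 50 ns latency comes to:

$$
50\ \text{ns} \times 3\times10^{9}\ \text{cycles/s} = 150\ \text{cycles}.
$$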


When an instruction loads a memory location, the cache line containing that location must first be loaded into the CPU cache. In other words, all memory must be mapped into the CPU cache before it can be accessed. If the CPU tries to load a memory location that is missing from the cache, the instruction stalls and has to wait until the data is fetched into the cache.

[Infographic: approximate CPU cycles required by each kind of operation]

The infographic above shows the CPU cycles required by each kind of operation. The exact numbers do not matter much; pay attention to the order of magnitude. Suppose the CPU is executing an ADD instruction: if the referenced memory misses both the registers and the cache, the cost differs by two orders of magnitude from the case where it hits a register, that is, about 100 times slower!
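Here is a minimal sketch (not taken from either RNode edition) that makes the penalty visible: the same number of additions over the same buffer, once walking sequentially and once jumping far apart so that almost every access misses the cache.

```rust
use std::time::Instant;

fn main() {
    const N: usize = 16 * 1024 * 1024; // 16 Mi u32 values (~64 MB)
    let data = vec![1u32; N];

    // Sequential walk: each 64-byte cache line is fully used once it is loaded.
    let t = Instant::now();
    let mut sum = 0u64;
    for &x in &data {
        sum += x as u64;
    }
    println!("sequential: sum={} in {:?}", sum, t.elapsed());

    // Strided walk: the same number of additions, but each access lands ~16 KB
    // away from the previous one, so the cache and prefetcher can barely help.
    const STRIDE: usize = 4096;
    let t = Instant::now();
    let mut sum2 = 0u64;
    for start in 0..STRIDE {
        let mut i = start;
        while i < N {
            sum2 += data[i] as u64;
            i += STRIDE;
        }
    }
    println!("strided:    sum={} in {:?}", sum2, t.elapsed());
}
```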

Thus it is critically important to leverage the CPU cache for CPU-bound tasks.

CPU Cache Locality

The CPU predicts which memory to cache based on locality, and there are two crucial types of locality.

  • Temporal Locality (a.k.a Time Locality) If at one point a particular memory location is referenced, then it is likely that the same location will be referenced again in the near future.

  • Spatial Locality (a.k.a Space Locality) If a particular storage location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future.

Temporal Locality

Let's analyze temporal locality first. RNode is written in Scala in a functional programming style. Functional programming makes the code tidy, stateless, and free of side effects, and it also looks beautiful. There is nothing wrong with functional programming itself. However, functional programming is usually paired with immutable data structures.

Whenever an immutable data structure needs to change, the old data is dropped and new data is allocated. Take the ParSortMatcher as an example: every time it is called, new Seq instances are allocated, so it is hard to keep hitting the same memory locations. Even worse, the CPU cache is very small (e.g. the L1 data cache is just 32 KB). Remember that all memory must be cached before it is accessed, so whenever a new memory reference needs to be accessed, the CPU has to evict some old cache lines. From the CPU's point of view there is no JVM; the CPU only sees that the original immutable data was accessed recently. Even if those references have already been garbage-collected from the JVM's point of view, the CPU is unlikely to evict them. Instead, it evicts other cache lines that might still be useful. Thus, immutable data structures defeat the CPU's cache prediction! If you look at the same part of the Rust edition, it performs in-place operations instead.
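A hypothetical sketch of the difference (the names below are illustrative, not the actual code of either edition): the immutable style allocates a fresh buffer on every call, while the in-place style keeps reusing memory that is already hot in the cache.

```rust
#[derive(Clone)]
struct Par {
    key: u64,
    // ... other fields elided
}

// Immutable style: every call builds a brand-new Vec, so repeated calls keep
// touching freshly allocated memory while the old copies linger in the cache.
fn sort_pars_immutable(pars: &[Par]) -> Vec<Par> {
    let mut sorted: Vec<Par> = pars.to_vec(); // new allocation on every call
    sorted.sort_by_key(|p| p.key);
    sorted
}

// In-place style: the same buffer is reused, so repeated calls hit memory
// locations that were accessed just before.
fn sort_pars_in_place(pars: &mut [Par]) {
    pars.sort_by_key(|p| p.key);
}

fn main() {
    let mut pars: Vec<Par> = (0..4u64).rev().map(|key| Par { key }).collect();
    let _fresh = sort_pars_immutable(&pars); // allocates a new Vec
    sort_pars_in_place(&mut pars);           // mutates the existing buffer
    assert!(pars.windows(2).all(|w| w[0].key <= w[1].key));
}
```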

Spatial Locality

There are two kinds of allocation: stack and heap.

The stack is a contiguous memory region framed by stack pointers (e.g. the ESP/EBP registers in x86). When execution enters a scope (usually a function call), the stack pointer moves to reserve space for local storage; when the scope exits, the stack pointer moves back. Stack memory works very well with CPU cache prediction because the data accessed within a scope is located close together on the stack.

The heap is a much larger region managed by the programming language runtime to store everything allocated dynamically. Heap management is invisible to the CPU; from the CPU's point of view, heap accesses are scattered rather than contiguous. Obviously, heap memory has poor spatial locality.

To allocate on the stack, the data size must be known at compile time, and the allocated data must not escape the scope of execution.
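A small Rust sketch of those two conditions (illustrative only):

```rust
fn main() {
    // Size known at compile time, and the value never leaves this scope:
    // the whole array lives on the stack.
    let on_stack = [0u8; 64];

    // Size only known at run time: the buffer has to be allocated on the heap,
    // even though the Vec handle itself (pointer, length, capacity) sits on the stack.
    let n = std::env::args().count();
    let on_heap = vec![0u8; n];

    println!("{} bytes on the stack, {} bytes on the heap", on_stack.len(), on_heap.len());
}
```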

In Scala, the JVM uses escape analysis to optimize allocation and will allocate on the stack when possible, but in practice its effect is very limited, since in most cases the data escapes the scope or its size is unknown at compile time (e.g. the lengths of Seq/String are unknown at compile time). For this reason, the vast majority of data on the JVM is allocated on the heap.

In Rust, data is allocated on the stack by default, except for a few types (e.g. Box<T>, Vec<T>, etc.), and the Rust edition tries to avoid using them where possible. For example, Box<dyn Trait> is avoided by adding a few more lines of code for compile-time polymorphism, and Vec<T> is replaced with SmallVec<T> to keep arrays on the stack when possible. There are still more places that could be optimized in the Rust edition, but in general the generated machine code is friendly to CPU cache prediction. I also have to mention that heap allocation is not cheap: no matter how efficiently the heap is managed, it is far slower than stack allocation.
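A minimal sketch of those two techniques, assuming the smallvec crate as a dependency; the trait and function names are illustrative, not the Rust edition's actual code.

```rust
use smallvec::SmallVec; // assumed dependency: the `smallvec` crate

trait Matcher {
    fn matches(&self, value: u64) -> bool;
}

struct ExactMatcher(u64);

impl Matcher for ExactMatcher {
    fn matches(&self, value: u64) -> bool {
        value == self.0
    }
}

// Dynamic dispatch: the matcher lives behind a heap pointer and every call
// goes through a vtable.
fn count_dyn(items: &[u64], m: Box<dyn Matcher>) -> usize {
    items.iter().filter(|&&v| m.matches(v)).count()
}

// Compile-time polymorphism: the concrete type is known, the call can be
// inlined, and no heap allocation is needed for the matcher itself.
fn count_static<M: Matcher>(items: &[u64], m: &M) -> usize {
    items.iter().filter(|&&v| m.matches(v)).count()
}

fn main() {
    // SmallVec keeps up to 8 elements inline; it only spills to the heap when
    // it grows beyond that.
    let mut items: SmallVec<[u64; 8]> = SmallVec::new();
    for i in 0..6 {
        items.push(i);
    }
    assert!(!items.spilled()); // still entirely inline, no heap allocation

    let m = ExactMatcher(3);
    println!("dyn:    {}", count_dyn(&items, Box::new(ExactMatcher(3))));
    println!("static: {}", count_static(&items, &m));
}
```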

Conclusion

The JVM is not the right choice for implementing performance-critical, CPU-bound tasks. Scala makes the situation worse with immutable data structures. The machine code generated from Scala has poor temporal and spatial locality.

About Me

I am a software developer with nearly 30 years of coding experience in many different programming languages, going back to my childhood. For the last decade I have worked as a software architect designing high-performance systems. That's why the scalability of the rho-calculus is so fascinating to me. But if RhoVM is two orders of magnitude slower, it makes no sense to scale it.