When processing an array, we can split it up into contiguous blocks and assign each block to a thread. However, if each thread is repeatedly writing to the block, false sharing could occur at the block boundaries. To avoid this, we instruct each thread to write to its own buffer before copying it back to the original array. The question is, does this make a difference?
We try a few different approaches to characterize the performance of our system:
- In the naive approach, each thread directly modifies the values in its block across multiple iterations. This is most likely to be subject to false sharing at block boundaries, provided that the blocks are not inadvertently aligned to cache lines.
- In the spaced approach, we create a common buffer with spacing between blocks equal to the cache-line size (128 bytes on Apple Silicon, 64 bytes on x86-64). This should avoid any false sharing, since threads can never write to the same cache line.
- In the aligned approach, we create a common buffer where each block is aligned to the next cache-line boundary. This should also avoid false sharing, while additionally testing the effect of aligned memory access.
- In the new-aligned approach, each thread allocates a buffer that is aligned to the cache line size and modifies the values in that thread-specific buffer. This is similar to the aligned approach but also checks for potential optimizations when the compiler knows that memory is local to a thread.
- In the new-vector approach, each thread allocates a temporary `std::vector` and modifies the values in that thread-specific vector. This hopes to mitigate false sharing by allowing heap memory to be allocated in thread-specific arenas. A sketch contrasting this with the naive approach follows the list.
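To make the contrast concrete, here is a minimal sketch of the naive and new-vector variants. This is my own illustration rather than the benchmark source: the constants, the `work` function, and the overall structure are assumptions, and `data` is assumed to hold at least `kThreads * kBlockSize` elements.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative parameters; the real benchmark takes these from -n and -t.
constexpr std::size_t kBlockSize = 1000;
constexpr std::size_t kThreads   = 4;
constexpr int kIterations        = 100;

int work(int x) { return x + 1; }  // stand-in for the per-element update

// Naive: every iteration writes straight into the shared array, so blocks
// that straddle a cache line invite false sharing at the boundaries.
void run_naive(std::vector<int>& data) {
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < kThreads; ++t) {
        pool.emplace_back([&data, t] {
            const std::size_t begin = t * kBlockSize;
            for (int iter = 0; iter < kIterations; ++iter)
                for (std::size_t i = begin; i < begin + kBlockSize; ++i)
                    data[i] = work(data[i]);
        });
    }
    for (auto& th : pool) th.join();
}

// New-vector: each thread iterates over a local std::vector and copies the
// result back once, so repeated writes never touch a shared cache line.
void run_new_vector(std::vector<int>& data) {
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < kThreads; ++t) {
        pool.emplace_back([&data, t] {
            const std::size_t begin = t * kBlockSize;
            std::vector<int> local(data.begin() + begin,
                                   data.begin() + begin + kBlockSize);
            for (int iter = 0; iter < kIterations; ++iter)
                for (auto& v : local) v = work(v);
            std::copy(local.begin(), local.end(), data.begin() + begin);
        });
    }
    for (auto& th : pool) th.join();
}
```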
To compile:

```sh
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build
```
We can now test all approaches with different block sizes (i.e., jobs per thread) as well as varying numbers of threads.
```sh
# block size of 1000 with 4 threads
./build/sharing -n 1000 -t 4
```
We run the test binary with `./build/sharing -n <BLOCK SIZE> -t 4`, using block sizes that aren't a power of 2. This avoids unintended alignment to cache lines in the naive approach, which would eliminate false sharing.
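For example, assuming 4-byte elements and a 64-byte cache line, a block size of 16 would place every block boundary exactly on a cache-line boundary (16 × 4 = 64), so adjacent threads would never share a line; a block size of 11 leaves boundaries mid-line (11 × 4 = 44), so neighbouring blocks can overlap on a cache line.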
The time per operation, in milliseconds, is shown for each approach in the tables below.
First, on an Apple M2 running Sonoma 14.6.1 with Apple clang 15.0.0:
block size | naive | spaced | aligned | new-aligned | new-vector |
---|---|---|---|---|---|
11 | 8 | 5 | 5 | 4 | 4 |
21 | 11 | 5 | 5 | 5 | 5 |
51 | 17 | 9 | 7 | 7 | 7 |
101 | 26 | 14 | 12 | 12 | 12 |
201 | 42 | 31 | 23 | 24 | 23 |
501 | 136 | 95 | 51 | 51 | 52 |
1001 | 292 | 173 | 105 | 103 | 105 |
2001 | 388 | 327 | 207 | 207 | 209 |
Then on an Intel i7 running Ubuntu 22.04 with GCC 11.4.0:
block size | naive | spaced | aligned | new-aligned | new-vector |
---|---|---|---|---|---|
11 | 23 | 3 | 3 | 4 | 3 |
21 | 26 | 4 | 4 | 5 | 4 |
51 | 45 | 9 | 9 | 9 | 8 |
101 | 72 | 18 | 15 | 17 | 13 |
201 | 100 | 37 | 35 | 31 | 25 |
501 | 126 | 68 | 66 | 78 | 64 |
1001 | 252 | 130 | 150 | 151 | 125 |
2001 | 358 | 277 | 252 | 315 | 257 |
These results show that false sharing is responsible for a considerable performance loss in the naive approach compared to the spaced approach.
On the Apple M2, some additional performance is extracted by aligning to the cache line boundaries, possibly from aligned SIMD loads.
However, this is not consistently observed in my Intel runs - something happens at block sizes of 1000 to 1005 that causes the aligned approach to be slower!
The new-aligned approach is also slower than the other non-naive methods on Intel.
This particular drop seems to be specific to the alignment parameter of `new`, as creating a `std::vector` and aligning it with `std::align` recovers performance.
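As a rough sketch of that workaround (my reconstruction, not the benchmark code; the 64-byte line size and the function name are assumptions), one can over-allocate a `std::vector` and let `std::align` advance a pointer to the first cache-line boundary inside its storage:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed line size on the Intel machine

// Returns a pointer into `storage` that is aligned to a cache-line boundary
// and has room for `count` ints; the vector is resized to guarantee the slack.
int* aligned_block(std::vector<int>& storage, std::size_t count) {
    storage.resize(count + kCacheLine / sizeof(int));
    void* ptr = storage.data();
    std::size_t space = storage.size() * sizeof(int);
    // std::align bumps `ptr` forward to the requested alignment, or returns
    // nullptr if the remaining space is too small.
    if (std::align(kCacheLine, count * sizeof(int), ptr, space) == nullptr)
        return nullptr;
    return static_cast<int*>(ptr);
}
```

Each thread would then write through the returned pointer instead of a buffer obtained from aligned `new`.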
Interestingly, the new-vector approach performs well in all scenarios despite offering no guarantees with respect to false sharing or alignment. This is pleasing as it implies that performance can be achieved with idiomatic C++ (namely, the creation of a temporary vector in each thread's scope). It is also a reminder not to be overly prescriptive, as this can cause premature pessimization on complex CPU architectures.