Investigating C++ optimization techniques while building Game of Life.
(The above gif is the benchmark scenario, but because of video capture it's slow as heck; click it for the video.)
make
make test
make benchmark
Benchmark hardware:
- i5-7600K @ 4.4GHz
- GTX 1080
Benchmark scenario:
- The benchmark is "time to compute 2000 generations on a wrapping 2560x1440 board"
- The initial board is always the same, and contains
- A breeder on the top half
- Seeded random cells in the bottom left
- A chess grid of 8x8 alive/dead cells in the bottom right
Finalised the benchmark board.
9227c6a55ede200a1b6fe827c93010963e704f3d
Switching from a 2D vector std::vector<std::vector<bool>> to a 1D array std::unique_ptr<bool[]> (see the sketch below).
9284f489604e814bb362fe5aa0c5e41ec5158edc
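As a rough sketch of the layout change (the struct and member names here are illustrative, not the commit's actual ones): the win is one contiguous allocation indexed with y * width + x, instead of a pointer chase through per-row vectors.

#include <memory>

struct Board {
    unsigned int width;
    unsigned int height;
    std::unique_ptr<bool[]> cells; // one contiguous, row-major block

    Board(unsigned int w, unsigned int h)
        : width(w), height(h), cells(new bool[w * h]()) {}

    // A single multiply + add replaces the double indirection
    // (and bit-proxy overhead) of std::vector<std::vector<bool>>.
    bool get(unsigned int x, unsigned int y) const { return cells[y * width + x]; }
    void set(unsigned int x, unsigned int y, bool v) { cells[y * width + x] = v; }
};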
Merging and inlining the "finding neighbour positions" and "counting neighbours" logic.
f64f4adbf7c6f956d2edcff397483de44ccaabec
Rendering directly from the board state, rather than first creating a list of pixels to render.
cda613a50c4314d087152fb26cf222c76d5ea92d
Removing the shared_ptr wrappers, speed > safety.
73e6448fe2a021d9063bf50323b4b99cd551ee15
Reduced the number of neighbour checks from 8 (all) to 3.
Achieved by keeping a "sliding window" of the neighbour counts for the "left", "middle" and "right" neighbouring columns. When the next cell computes its neighbours, it shifts the window over (so "middle" becomes "left") and only has to generate the new "right" column. A sketch follows.
01cce885bd57f1a5aa4d113a32f4413a61f826a5
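In code, the idea looks roughly like this (a sketch that ignores board edges; cellAt and columnSum are hypothetical helpers, where columnSum(x, y) = cellAt(x, y - 1) + cellAt(x, y) + cellAt(x, y + 1)):

unsigned int left = columnSum(0, y);
unsigned int middle = columnSum(1, y);
for (unsigned int x = 1; x + 1 < width; ++x) {
    unsigned int right = columnSum(x + 1, y); // the only new column per cell
    // The 3x3 sum includes the cell itself, so subtract it back out.
    unsigned int neighbours = left + middle + right - cellAt(x, y);
    // ... derive the new state of (x, y) from neighbours ...
    left = middle;  // slide the window: middle becomes left,
    middle = right; // right becomes middle
}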
Only compute the y levels for the above and below cells once per row.
9e80c59830588b05ee29c754530fabf50f51a9e0
Segmented the rendering and computation into separate threads. At this stage the rendering takes less time than the computation, so effectively I've reduced render time to 0.
2619b4164bd2fca3fd8b30dca22cd83a75fe43b9
Finally gave in to the lazy performance improvement, throwing more threads at the problem.
2f68783cc8a3d9bd7590e617b71a5a0df010af56
When converting values of 0/maxInt to 0/1, we now use the mathematical expression 1 - (state + 1) instead of a ternary.
21bd48749570ac9c05663c661f60ce5ede1223ff
Insane free performance improvement by passing the optimization flag -Ofast.
21761f93075a7aa26c314ae675b92df07c5f5892
Enabling the compiler flag -Wsign-conversion to find and remove any unnecessary signed-to-unsigned int conversions.
86b3634383c6bd2d2f4104c441b2a2328f2bf6f8
For now I am going to focus on the logic performance instead of the main loop performance. The first step towards that is removing the render-centric logic, which was storing cells as "pixel friendly" values.
If at some point I can figure out how to tell SDL to render 0/1 byte values, I will gladly do so, and the render performance will be restored.
b075a2f68ded65e3828d18e972e9a0fd0a24d596
This speed increase is likely because the computation can now consume CPU time that would otherwise be used by the render thread.
46f00cb633e6186ad78cf3f679dd82dd0414f3c3
This speed increase is primarily from changing the render limit to 30fps, but also from changing the thread count to 12.
I'm not entirely sure why this thread count is more performant, but it's pretty evident in the results: benchmark_full.txt
ff5d78d8e77fc307229be5448c1370b9ab70bb19
Moved to SDL's surface-based rendering instead of texture rendering.
Inspiration for the change: https://github.com/eabrash/game-of-life
a81d839cc3d99b3d7293040a3e36cba7a46a33fe
Added padding to the game board, removing the need for any modulus operations while computing x and y positions (see the sketch below).
Inspiration for the change: https://www.youtube.com/watch?v=ndAfWKmKF34
90e4b104e9ea92222a562a63d9183fc3df6919f8
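A sketch of the wrapping step, assuming a flat cells array of (width + 2) * (height + 2) entries where the outer ring is padding (names and layout are my own): the opposite edges are copied into the padding once per generation, so the hot loop never computes x % width or y % height.

const unsigned int paddedWidth = width + 2;
const unsigned int paddedHeight = height + 2;

// Top/bottom padding rows mirror the opposite interior rows.
for (unsigned int x = 1; x <= width; ++x) {
    cells[0 * paddedWidth + x] = cells[height * paddedWidth + x];
    cells[(height + 1) * paddedWidth + x] = cells[1 * paddedWidth + x];
}
// Left/right padding columns mirror the opposite interior columns.
// Running over the padded rows too fills the four corners correctly.
for (unsigned int y = 0; y < paddedHeight; ++y) {
    cells[y * paddedWidth + 0] = cells[y * paddedWidth + width];
    cells[y * paddedWidth + (width + 1)] = cells[y * paddedWidth + 1];
}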
Moved to a smaller data type to allow better CPU caching.
I did not realise how much of a difference it would make.
644b7a9f91951d203f999374105ce20069a32c4b
By leveraging the ability to read 8 cell states in one read cycle using *reinterpret_cast<uint64_t*>(cells), we can perform some smarter skip logic. If we read the 8 cells above, adjacent and below, and find 0 total alive cells, we can skip ahead 6 positions (sketched below).
67762ed7fbb419eea90d763bd8cbe0fe4596f5c1
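In spirit, the skip check looks something like this (a sketch assuming one byte per cell; rowAbove, row and rowBelow are hypothetical pointers to the three relevant rows, x is the loop counter, and the usual alignment/strict-aliasing caveats of reinterpret_cast apply):

uint64_t above = *reinterpret_cast<const uint64_t *>(rowAbove + x);
uint64_t mid = *reinterpret_cast<const uint64_t *>(row + x);
uint64_t below = *reinterpret_cast<const uint64_t *>(rowBelow + x);
// All 24 cells in the 8-wide window are dead: the 6 interior cells of
// the window cannot come alive, so jump past them. (The 2 edge cells
// may still have live neighbours just outside the window.)
if ((above | mid | below) == 0) {
    x += 6;
    continue;
}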
By replacing the system-level mutex (std::mutex) with a simpler "request worker pause" spin lock. We don't need a full mutex when the communication is only between the main thread and a single worker thread. A sketch of the idea follows.
ed7b50d50eb512cd558e1c7c6ac22e311d0018c5
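Something like the following captures the mechanism (a minimal sketch with two std::atomic flags; the commit's actual code may differ, and computeNextGeneration is a stand-in for the real per-generation work):

#include <atomic>

std::atomic<bool> pauseRequested{false};
std::atomic<bool> workerPaused{false};

void computeNextGeneration(); // stand-in for the real work

// Worker thread: spin in place while the main thread wants us paused.
void workerLoop() {
    while (true) {
        while (pauseRequested.load(std::memory_order_acquire))
            workerPaused.store(true, std::memory_order_release);
        workerPaused.store(false, std::memory_order_release);
        computeNextGeneration();
    }
}

// Main thread: request a pause, then spin until the worker acknowledges.
void pauseWorker() {
    pauseRequested.store(true, std::memory_order_release);
    while (!workerPaused.load(std::memory_order_acquire)) { /* spin */ }
}

void resumeWorker() {
    pauseRequested.store(false, std::memory_order_release);
}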
By replacing the primitive multi-branching cell state calculation with a lookup table (sketched below).
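As a sketch (the table layout is my own guess at the scheme): the current state and the neighbour count index straight into a precomputed table of Conway's rules, replacing the branchy if/else chain in the hot loop.

#include <cstdint>

// newState[state][neighbours] for state in {0, 1} and 0..8 neighbours.
static const uint8_t newState[2][9] = {
    {0, 0, 0, 1, 0, 0, 0, 0, 0}, // dead: born only with exactly 3 neighbours
    {0, 0, 1, 1, 0, 0, 0, 0, 0}, // alive: survives with 2 or 3 neighbours
};

// In the hot loop, the whole rule evaluation becomes one indexed read:
// output[i] = newState[currentStateBool][totalNeighbours];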
By using the Intel compiler.
Intel(R) C++ Compiler Classic 2021.5.0
b25f071659e718dfce5e8ad0a832a44267a8970d
By moving to the new Intel compiler.
Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0
ef55d32d9d77edd43c91bff7eba516ba7fa1e4a1
By creating a job pool, so that faster threads that previously sat idle can now do more actual work (a sketch follows).
d350d8bf8b49191952a72b7cb74fa694b5c289c3
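A minimal sketch of the job pool idea using an atomic row counter (the real implementation may differ; computeRow and the chunk size are assumptions): threads pull the next unclaimed chunk of rows instead of being pre-assigned a fixed slice, so a fast thread never sits idle waiting for a slow one.

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

void computeRow(unsigned int y); // stand-in for computing one row

std::atomic<unsigned int> nextRow{0};
const unsigned int rowsPerJob = 16; // chunk size; a tunable guess

void worker(unsigned int height) {
    while (true) {
        // Atomically claim the next chunk of rows.
        unsigned int row = nextRow.fetch_add(rowsPerJob);
        if (row >= height)
            break;
        unsigned int end = std::min(row + rowsPerJob, height);
        for (unsigned int y = row; y < end; ++y)
            computeRow(y);
    }
}

void computeGeneration(unsigned int height, unsigned int threadCount) {
    nextRow.store(0);
    std::vector<std::thread> threads;
    for (unsigned int t = 0; t < threadCount; ++t)
        threads.emplace_back(worker, height);
    for (auto &t : threads)
        t.join();
}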
By fitting 2 "skip" bits into a single byte (sketched below).
It turns out that packing in more than 2 (i.e. 4 or 8) actually decreased performance.
With 8 skip bits in 1 byte I was only getting 1.43s, which was actually slower than 1 bit per byte (1.37s).
07ca8c52e0397edf7dd60e864a479e1ebb10b060
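For illustration only (the commit's actual encoding may differ), packing two skip flags into one byte can look like this, keeping reads as a single byte load:

#include <cstdint>

// Two skip flags share one byte: bit 0 and bit 1, selected by slot.
inline void setSkip(uint8_t &flags, unsigned int slot) { flags |= (1u << slot); }
inline void clearSkip(uint8_t &flags, unsigned int slot) { flags &= ~(1u << slot); }
inline bool canSkip(uint8_t flags, unsigned int slot) { return (flags >> slot) & 1u; }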
Bear in mind these findings were made on my computer (details of which are here), and may not be applicable to your machine or code.
Commit here.
Context:
- To calculate a new cell state, you need to count the number of alive neighbours
- To count the number of alive neighbours, you need to retrieve their states from memory
- The memory is stored in a flat array of size width*height
Base code:
- Keeps track of 3 offsets to use on the input array
// Per-row base offsets into the flat input array, updated as y advances.
unsigned int yAboveBase = 0;
unsigned int yBelowBase = 0;
unsigned int yBase = 0;

// Summing the three cells of the incoming "right" column.
neighbours[2] = input[yBelowBase + nextX] + input[yBase + nextX] + input[yAboveBase + nextX];
10% faster code:
- Creates 3 pointers which have their base offsets baked into the 0 index
unsigned int *neighboursBelow = &input[lastYBase];
unsigned int *neighboursMiddle = &input[middleYBase];
unsigned int *neighboursAbove = &input[nextYBase];
neighbours[2] = neighboursBelow[nextX] + neighboursMiddle[nextX] + neighboursAbove[nextX];
I assume the increase in performance comes from memory now being accessed sequentially.
Commit here.
I had initially enabled the "fastest" compiler optimization flag in this commit, with the assumption that tangible improvements in performance could only be made by making changes to the code (in this case by "optimizing" it).
What I have seen is that some "optimizations" in code may actually negatively impact performance, because they limit the CPU's ability to perform its own runtime optimizations. My assumption is that, in the case of this code base, leaving in the extra code as "hints" to the CPU allows it to improve its internal branch prediction and caching strategies.
Here are my results for each optimization flag provided by gcc:
-Ofast : 4.42ms (2nd)
-O3 : 4.54ms (4th)
-O2 : 4.52ms (3rd)
-O1 : 4.28ms (the best)
-O0 : 8.33ms (6th)
-Os : 4.65ms (5th)
Code at the time of this experiment here.
#include <cstdint> // for UINT32_MAX

// the best
inline auto math(unsigned int n) {
    return (1 - (n + 1));
}

// 3.3% slower
inline auto mask(unsigned int n) {
    return n & 1;
}

// 3.3% slower
inline auto shift(unsigned int n) {
    return n >> 31;
}

// 30% slower
inline auto toBool(unsigned int n) {
    return n == UINT32_MAX;
}
After taking a look at the assembly, it turns out the first function compiles to a neg instruction, as opposed to the and or shift instructions used by the others. I guess neg is just an easier computation.
Code at the time of this experiment here.
// 4.51ms: the best
if (currentStateBool && (totalNeighbours < 2 || totalNeighbours > 3))
output[i] = DEAD;
else if (!currentStateBool && totalNeighbours == 3)
output[i] = ALIVE;
else
output[i] = currentStateBool;
// 4.89ms: 8.5% slower
if (totalNeighbours == 3)
output[i] = ALIVE;
else if (totalNeighbours < 2 || totalNeighbours > 3)
output[i] = DEAD;
else
output[i] = currentStateBool;
// 5.07ms: 12.5% slower
if (totalNeighbours < 2 || totalNeighbours > 3)
output[i] = DEAD;
else if (totalNeighbours == 3)
output[i] = ALIVE;
else
output[i] = currentStateBool;
I had assumed that the 3rd case would actually be the fastest, because most cells are dead and have fewer than 2 neighbours.
My only vague idea about how the first solution could be faster is that the CPU is able to branch-predict it better.
Commit on which experiment was conducted: 39d49374ee69aba70c7ab093397e131ecbb80665
This:
output[i] = DEAD;
output[i + 1] = DEAD;
output[i + 2] = DEAD;
output[i + 3] = DEAD;
output[i + 4] = DEAD;
output[i + 5] = DEAD;
BM_Main/iterations:1/repeats:3/process_time/real_time 2.17 s 5.98 s 1
BM_Main/iterations:1/repeats:3/process_time/real_time 2.14 s 5.96 s 1
BM_Main/iterations:1/repeats:3/process_time/real_time 2.18 s 5.99 s 1
BM_Main/iterations:1/repeats:3/process_time/real_time_mean 2.16 s 5.98 s 3
BM_Main/iterations:1/repeats:3/process_time/real_time_median 2.17 s 5.98 s 3
BM_Main/iterations:1/repeats:3/process_time/real_time_stddev 0.019 s 0.014 s 3
is way faster than this:
for (unsigned int j = 0; j < 6; j++) {
    output[i + j] = DEAD;
}
BM_Main/iterations:1/repeats:3/process_time/real_time 2.37 s 6.74 s 1
BM_Main/iterations:1/repeats:3/process_time/real_time 2.41 s 6.77 s 1
BM_Main/iterations:1/repeats:3/process_time/real_time 2.41 s 6.76 s 1
BM_Main/iterations:1/repeats:3/process_time/real_time_mean 2.40 s 6.76 s 3
BM_Main/iterations:1/repeats:3/process_time/real_time_median 2.41 s 6.76 s 3
BM_Main/iterations:1/repeats:3/process_time/real_time_stddev 0.019 s 0.017 s 3
So, I had this extra pointer renderRaw in my Board struct which was legacy from previous render strategies. I deleted the pointer and its associated malloc in this commit, only to find that it caused a 5% decrease in the overall performance of the app.
Very good utility for visualising assembly instructions of your program: https://godbolt.org/
Great util for generating binary image files: https://www.dcode.fr/binary-image
Reference for alternative computation/render strategies: https://github.com/eabrash/game-of-life