Note: This model, wolf-sheep-predation, was adapted from NetLogo for the purpose of applying the high-performance computing and code-optimization techniques I was taught at university.
Once you have downloaded the code, you can adjust the model parameters in `declare_env.cpp`, build with `make`, and run the simulation with `./main` in the terminal. Use `mpiexec -n 4 ./main` to run the simulation across 4 nodes. To visualize the simulation, adjust the plot size in `visualize_simulation.py` to match the grid size in `declare_env.cpp` before running the Python script. You can always check out the `vector_non_mpi` branch to run the simulation with the base (non-MPI) code for performance comparisons.
Please comment out calls to the `saveSimulationState()` function when testing performance, as it is only used for visualization and can significantly slow down the runtime.
Watch the full visualization on YouTube here.
The aim of this wolf-sheep-predation model is to simulate stable predator-prey dynamics between wolves and sheep.
The system is called unstable if it tends to result in the extinction of one or more of the species involved. In contrast, a system is stable if it tends to maintain itself over time despite fluctuations in population sizes.
The simulation environment is a 2D grid of patches, where each patch object contains a color (green or brown) and a countdown for grass regrowth. A green patch turns brown when eaten and needs to “regrow”.
Animals in the simulation have energy and a position. They can move around, consume resources, and reproduce. Sheep eat grass, while wolves eat sheep. Animals lose energy when they move and die if their energy drops below 0.
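Below is a minimal sketch of these two entities; the field names are illustrative and may differ from those actually used in `declare_env.cpp`.

```cpp
// Minimal sketch of the patch and animal entities described above.
// Field names are illustrative, not necessarily those in declare_env.cpp.
struct Patch {
    bool green;            // green = grass available, brown = already eaten
    int  regrowCountdown;  // ticks left until a brown patch turns green again
};

struct Animal {
    int    x, y;      // position on the 2D grid of patches
    double energy;    // decreases on every move; the animal dies below 0
    bool   isWolf;    // wolves eat sheep, sheep eat grass
};
```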
The simulation ends when either of the following stopping conditions is met (sketched after the list):
- All wolves and sheep die.
- All the wolves die and the number of sheep exceeds a certain “max-sheep” threshold: “The sheep have inherited the earth”.
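A minimal sketch of these conditions, assuming hypothetical names `numWolves`, `numSheep` and `maxSheep`:

```cpp
// Sketch of the stopping conditions above; the identifiers are assumptions,
// not necessarily those used in the repository.
bool shouldStop(int numWolves, int numSheep, int maxSheep) {
    if (numWolves == 0 && numSheep == 0) return true;        // everything died out
    if (numWolves == 0 && numSheep > maxSheep) return true;   // "The sheep have inherited the earth"
    return false;
}
```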
Important Model Parameters (a hypothetical declaration sketch follows the list):
- MODEL-VERSION: Whether sheep, wolves, and grass are modelled (“sheep-wolves-grass”) or only sheep and wolves are modelled (“sheep-wolves”).
- INITIAL-NUMBER-SHEEP, INITIAL-NUMBER-WOLVES: Initial size of sheep and wolf population respectively.
- SHEEP-REPRODUCE, WOLF-REPRODUCE: The probability of a sheep or a wolf reproducing at each time step, respectively.
- GRASS-REGROWTH-TIME: Time taken for grass to regrow.
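A hypothetical excerpt of how these parameters might be declared in `declare_env.cpp`; the actual identifiers and default values in the repository may differ:

```cpp
// Hypothetical parameter declarations; names and values are illustrative only.
const char*  MODEL_VERSION         = "sheep-wolves-grass";  // or "sheep-wolves"
const int    INITIAL_NUMBER_SHEEP  = 100;
const int    INITIAL_NUMBER_WOLVES = 50;
const double SHEEP_REPRODUCE       = 0.04;  // reproduction probability per tick
const double WOLF_REPRODUCE        = 0.05;
const int    GRASS_REGROWTH_TIME   = 30;    // ticks for a brown patch to regrow
```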
I benchmarked the base code for 2 worlds:
- 5000 ticks with a 100x100 grid (initially, 100 sheep & 50 wolves)
- 10 ticks with a 30000x30000 grid (initially, 30000 sheep & 15000 wolves)
Observations:
As seen above, performance degrades as the problem size increases, even with far fewer ticks.
I used MPI to split the simulation world, i.e., the 2D grid of patches, across processes, transferring animals between processes when they move beyond their respective grid boundaries each tick. I visualized the MPI code in the same manner as the base code and verified that animals move between nodes/processes correctly. The visualization setup uses 2 nodes on a 100x100 grid; the yellow rows indicate rows 49 and 50 (row 0 of the second node). Animals that move between rows 49 and 50 are actually sent between the nodes via MPI.
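A sketch of what this per-tick boundary exchange can look like with `MPI_Sendrecv`; the `Animal` struct and function names here are assumptions, not the repository's actual code:

```cpp
#include <mpi.h>
#include <vector>

// Plain-old-data animal so it can be shipped as raw bytes between ranks.
struct Animal { int x, y; double energy; bool isWolf; };

// Exchange animals that crossed this rank's row boundary with a neighbouring rank.
void exchangeWithNeighbour(std::vector<Animal>& outgoing,
                           std::vector<Animal>& local,
                           int neighbourRank) {
    int sendCount = static_cast<int>(outgoing.size());
    int recvCount = 0;

    // First agree on how many animals travel in each direction.
    MPI_Sendrecv(&sendCount, 1, MPI_INT, neighbourRank, 0,
                 &recvCount, 1, MPI_INT, neighbourRank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    std::vector<Animal> incoming(recvCount);

    // Then ship the structs themselves (valid because Animal is trivially copyable).
    MPI_Sendrecv(outgoing.data(), sendCount * (int)sizeof(Animal), MPI_BYTE, neighbourRank, 1,
                 incoming.data(), recvCount * (int)sizeof(Animal), MPI_BYTE, neighbourRank, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Immigrants join this rank's local population; the emigrants have left.
    local.insert(local.end(), incoming.begin(), incoming.end());
    outgoing.clear();
}
```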
I benchmarked the MPI code on the same 2 worlds as the base code:
- 5000 ticks with a 100x100 grid (initially, 100 sheep & 50 wolves)
- 10 ticks with a 30000x30000 grid (initially, 30000 sheep & 15000 wolves)
The world with 10 ticks and a 30000x30000 grid performed ~4 times better than the base code when splitting the workload across 4 nodes:
But the world with 5000 ticks and a 100x100 grid performed ~3x worse with 4 nodes:
The results are in alignment with MPI theory. MPI shines when we have larger grid worlds and less messaging overhead relative to computation. In smaller worlds, the workload per process is relatively light and the MPI communication overhead dominates, since we're sending animals between processes every tick; this is why we see a performance degradation in the smaller world. This is in alignment with Gustafson's law, since the larger workload benefited from the extra processing power. Since MPI communication costs grow with the amount of data exchanged, I wonder whether the performance improvements would be retained in the larger world as the number of ticks continues to grow...
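For reference, Gustafson's law gives the scaled speedup for $N$ processes when the problem size grows with the process count, with $p$ the parallel fraction of the work:

$$S(N) = N - (1 - p)\,(N - 1)$$

which is why the larger grid, where nearly all per-tick work is parallel, scales close to the number of nodes.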
The results are also in alignment with Amdahl's law, since the go() function accounts for at least ~90% of the runtime that can be parallelised, and we see a ~4x performance improvement with 4 nodes.
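For reference, Amdahl's law bounds the speedup $S$ achievable with $N$ processes when a fraction $p$ of the runtime is parallelisable:

$$S(N) = \frac{1}{(1 - p) + \frac{p}{N}}$$

With $p \approx 0.9$ and $N = 4$ the bound is roughly 3.1x; reaching the observed ~4x suggests the parallel fraction in the large-grid run is effectively even closer to 1.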
I used gprof and gprof2dot to profile my code and generate an image that can help me easily identify any additional performance bottlenecks.
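A minimal pipeline for this (assuming the binary is compiled and linked with `-pg`, so that running `./main` writes `gmon.out`) is `gprof ./main gmon.out | gprof2dot | dot -Tpng -o profile.png`, which turns the flat gprof output into a call-graph image.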
We find that the performance bottlenecks were from the following functions:
- go(): grass() & growGrass()
- setup()
I applied the following optimization techniques to further improve performance (a combined sketch follows the list):
- OpenMP for threading
- Loop unrolling
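A combined sketch of both techniques applied to a grass-regrowth pass; `Patch`, `width` and `height` are assumed names, and the code would be compiled with `-fopenmp`:

```cpp
#include <omp.h>
#include <vector>

// Same illustrative Patch layout as sketched earlier.
struct Patch { bool green; int regrowCountdown; };

void growGrass(std::vector<Patch>& patches, int width, int height) {
    // OpenMP: rows are independent, so threads can each take a block of rows.
    #pragma omp parallel for
    for (int row = 0; row < height; ++row) {
        Patch* p = &patches[row * width];
        int col = 0;
        // Manual 4-way loop unrolling across the columns of a row.
        for (; col + 3 < width; col += 4) {
            if (!p[col].green     && --p[col].regrowCountdown     <= 0) p[col].green     = true;
            if (!p[col + 1].green && --p[col + 1].regrowCountdown <= 0) p[col + 1].green = true;
            if (!p[col + 2].green && --p[col + 2].regrowCountdown <= 0) p[col + 2].green = true;
            if (!p[col + 3].green && --p[col + 3].regrowCountdown <= 0) p[col + 3].green = true;
        }
        // Remainder loop for widths that are not a multiple of 4.
        for (; col < width; ++col)
            if (!p[col].green && --p[col].regrowCountdown <= 0) p[col].green = true;
    }
}
```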
I also tried blocking/tiling to improve cache performance, but this model doesn't seem to be a good fit for it: the results were unreliable and it didn't meaningfully improve performance.
The before and after performance was benchmarked on a 30000x30000 grid world (initially, 30000 sheep & 15000 wolves) for 3 ticks with 2 MPI nodes. This shaved an additional ~7 seconds off the runtime.
Before:
After: