Parallelism and concurrency overview #957
Comments
re: the second part, happy to have these documented on GitHub and available for posterity :) Feel free to keep opening issues!
Your understanding is spot on.
With regard to threading, inference is always performed synchronously from the point of view of the selfplay thread. Because we run Minigo selfplay at scale by playing multiple games concurrently, we don't need asynchronous inference to achieve good GPU utilization. Even on very small models, the engine runs at >95% utilization on a V100, and close to 100% for a full-sized model. It has never been the goal of the project to write the fastest tournament engine (I doubt we'll ever enter one), so as you guessed this means the engine is somewhat slower when playing a single game than something like Leela.
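To illustrate the idea of synchronous inference shared across several concurrent games, here is a minimal sketch. It is not Minigo's code: `ToyGame`, `fake_inference` and `run_selfplay_thread` are hypothetical names, the "leaves" are placeholder tuples, and the assumption that every game contributes a fixed number of leaves per step is a simplification.

```python
class ToyGame:
    """Hypothetical stand-in for a selfplay game; not Minigo's SelfplayGame."""

    def __init__(self, game_id, num_moves):
        self.game_id = game_id
        self.moves_left = num_moves

    def done(self):
        return self.moves_left == 0

    def select_leaves(self, virtual_losses):
        # A real engine would walk its search tree; here a leaf is just a tag.
        return [(self.game_id, i) for i in range(virtual_losses)]

    def apply_results(self, results):
        # Pretend the search step finished and one move was played.
        self.moves_left -= 1


def fake_inference(batch):
    # Stand-in for a blocking model call: one (policy, value) pair per leaf.
    return [(None, 0.0) for _ in batch]


def run_selfplay_thread(games, virtual_losses, infer=fake_inference):
    """Advance all games in lockstep; return the size of each inference batch."""
    batch_sizes = []
    while any(not g.done() for g in games):
        active = [g for g in games if not g.done()]
        # 1. CPU phase: every active game contributes its leaves.
        per_game = [g.select_leaves(virtual_losses) for g in active]
        batch = [leaf for leaves in per_game for leaf in leaves]
        # 2. Inference phase: one synchronous call; the caller blocks here.
        results = infer(batch)
        batch_sizes.append(len(batch))
        # 3. CPU phase: hand each game its slice of the results.
        offset = 0
        for game, leaves in zip(active, per_game):
            game.apply_results(results[offset:offset + len(leaves)])
            offset += len(leaves)
    return batch_sizes
```

In this sketch, four games with eight leaves each produce batches of 32 positions per blocking call, so a single thread can hand the accelerator a reasonably large batch without any asynchronous machinery.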
What is the ratio of number of Just to confirm my understanding (sorry for the silly questions): assuming the ratio is 1:1, and considering synchronous CPU/GPU execution, after the GPU compute finishes there is a small gap in GPU utilisation while the CPU does its thing to advance the games and prepare the next batch. But because in Go the neural network is fairly large (GPU compute takes a long time) and the game logic is comparatively quick, the GPU utilisation gap is small and thus a non-issue? Thanks!
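One rough way to reason about the gap described in this question, assuming a single selfplay thread per GPU whose CPU and GPU phases strictly alternate (a simplification; the symbols and numbers are illustrative and not measurements from Minigo):

$$
U = \frac{T_{\text{gpu}}}{T_{\text{gpu}} + T_{\text{cpu}}}
$$

Here $T_{\text{gpu}}$ is the time spent on one inference batch and $T_{\text{cpu}}$ is the time the CPU needs to advance the games and prepare the next batch. With hypothetical values $T_{\text{gpu}} = 19\,\text{ms}$ and $T_{\text{cpu}} = 1\,\text{ms}$, this gives $U = 19/20 = 95\%$. Under this model the idle fraction shrinks as the network gets larger relative to the game logic.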
Gentlemen, could you confirm my further analysis of
Then, in each
As for parameters in
Ahh, I think I see it now. As long as Presumably Does above seem right? |
Your analysis of the threading parameters is correct. Their values were all chosen to get >95% utilization on the VM that I'm using to test the MLPerf benchmark: it has 48 physical cores (96 hyperthreads) running at 2 GHz and 8 V100 GPUs. Optimal values will be different depending on the relative performance of your CPUs and GPUs.
If you're interested in how these parameters affect performance, I recommend reading the |
Hi By "double-buffer the inference", do you mean simply running multiple independent models on the same GPU (with the obvious memory penalty)? Or is there something more going on? Assuming it's just independent models, is there any explicit mechanism to make sure the models execute non-overlapping stages of the GPU pipeline (transfer, evaluate, transfer back)? I kind of expect that simply running 2-3 models per GPU would sort itself out on its own in this use case, but I just want to confirm. Also, sorry for repeating, but could you explicitly confirm it's 1x I will definitely look into profiling after I manage to set up a sacrificial CUDA dev box for compilation purposes. Thanks again for all the help, this was super useful! I think this wraps up my questions for now!
Aha! I had a sneaking suspicion that having Off topic: what do you think about a slightly different approach: having multiple game threads push eval requests to an async queue, and then having neural-net thread(s) pick them up to form batches and execute them on the GPU? Basically implementing a multiple-producer-multiple-consumer pattern. It seems to me that the Minigo approach of having multiple concurrent games per thread is a strong benefit for keeping the total number of threads low. What's your opinion on other possible pros/cons of both approaches?
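For readers unfamiliar with the producer/consumer pattern described in this comment, here is a minimal sketch using only the Python standard library. All names (`inference_worker`, `game_thread`, `run`) are hypothetical, the "model" just doubles its inputs, and a single consumer thread is assumed; it is not how Minigo implements anything.

```python
import queue
import threading


def inference_worker(requests, max_batch, infer):
    """Consumer: drain up to max_batch requests, run one batched call, reply."""
    while True:
        first = requests.get()              # block until there is work
        if first is None:                   # sentinel: shut down
            return
        batch = [first]
        while len(batch) < max_batch:
            try:
                item = requests.get_nowait()
            except queue.Empty:
                break                       # run with a partial batch
            if item is None:
                requests.put(None)          # re-queue sentinel, finish this batch
                break
            batch.append(item)
        outputs = infer([features for features, _ in batch])
        for (_, reply), out in zip(batch, outputs):
            reply.put(out)                  # wake the waiting game thread


def game_thread(requests, positions, results, index):
    """Producer: one thread per game, blocking on each evaluation."""
    reply = queue.Queue(maxsize=1)
    outputs = []
    for features in positions:
        requests.put((features, reply))
        outputs.append(reply.get())         # synchronous from the game's view
    results[index] = outputs


def run(num_games, positions_per_game, max_batch):
    requests = queue.Queue()
    batch_sizes = []

    def infer(batch):                       # toy model: doubles each input
        batch_sizes.append(len(batch))
        return [x * 2 for x in batch]

    worker = threading.Thread(
        target=inference_worker, args=(requests, max_batch, infer))
    worker.start()
    results = [None] * num_games
    games = [
        threading.Thread(
            target=game_thread,
            args=(requests,
                  [g * 100 + i for i in range(positions_per_game)],
                  results, g))
        for g in range(num_games)
    ]
    for t in games:
        t.start()
    for t in games:
        t.join()
    requests.put(None)                      # tell the worker to stop
    worker.join()
    return results, batch_sizes
```

Note that in this sketch each game thread has at most one request in flight, so the batch size can never exceed the number of game threads. That is one way to see why a small model may need many more threads than cores to keep an accelerator busy under this design.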
The threading model you describe is actually how Minigo selfplay used to be set up: we ran one selfplay thread for each game, and their inference requests were batched up and executed on separate inference threads. This was absolutely fine for the full-sized Minigo run: we'd have maybe 8 games playing in parallel on a VM with 48 physical cores. However, the model used for MLPerf is much smaller, and we had to run significantly more selfplay threads than there were CPU cores to generate enough work for the GPU. This resulted in a large context-switching overhead and reduced maximum GPU utilization. The large number of threads and the context-switching overhead also made profiling the CPU code difficult. Once we switched to the current threading model, the simpler CPU traces showed there were some surprising hotspots in the code (e.g. calling Line 28 in e44f412
Line 258 in df51963
Line 114 in 64d5410
We also found that it was measurably faster to have the tree search thread call |
But if I understand correctly, in your current model tree search thread offloads leaf selection to thread pool in |
Yep, optimizing a multithreaded system is hard :) The original implementation of the It turned out to be a net win for the |
Quick note: if you found the hard-coded |
@tommadams I think what you say makes sense, but I need to think more about the implications. @amj Yeah, that's exactly what I thought. The |
I'd caution against reading too much into what I wrote: these are specific optimizations I made for our architecture and hardware setup. Bottlenecks will vary based on model size, CPU & GPU compute speed, board size, code architecture, etc. The most important takeaway should be: make sure it's easy to profile your code :)
Hi
I'm reading further into the source code and want to check that my understanding of the concurrency/parallelism implementation is correct.
Inference options:
- FakeDualNet: returns fixed policy/value
- LiteDualNet: TFLite integer inference, why? faster CPU inference? play against Minigo on mobile?
- RandomDualNet: returns random policy/value
- TFDualNet: standard TensorFlow CPU/GPU inference
- TPUDualNet: TPU inference
- WaitingModel: for testing
- ModelBatcher: request asynchronous mini-batch inference from BufferedModel
- BufferedModel: runs in its own thread, does inference possibly combining mini-batches from multiple ModelBatcher
Concurrency/parallelism sources:
- Selfplayer manages multiple threads
- SelfplayGame
- virtual_losses * concurrent_games_per_thread
- ModelBatcher/BufferedModel)

Reasons for the above architecture are as follows:
Is the above correct? In particular, are there any other reasons for this setup that I'm missing?
Also, I don't seem to see any way to search a single tree in parallel for a tournament game (e.g. against a human champion)?
Also 2, I will possibly have more questions. What would be the preferred communication channel for things that aren't issues? Should I keep creating GitHub issues?
Thanks again for your time