A hands-on tutorial on the new parallelism features in OCaml 5. This tutorial was run on the 19th of May 2022 at the Tarides retreat. Currently, the alpha version of OCaml 5 has been released, and the full version is set for release in September 2022.
This tutorial works on x86-64 and Arm64 architectures on Linux and macOS.
Before we move on to the instructions, check your version of opam with opam --version
, then follow the instructions below for your version. You can also quickly update to the latest version of opam (currently 2.1.2) by running:
bash -c "sh <(curl -fsSL https://raw.githubusercontent.com/ocaml/opam/master/shell/install.sh)"
With opam version >= 2.1:
opam update
opam switch create 5.1.0
opam install . --deps-only
eval $(opam env)
Since we will be doing performance measurements, it is recommended that you also
install hyperfine
.
OCaml 5 distinguishes concurrency and parallelism. Concurrency is overlapped execution of concurrent tasks. Parallelism is simultaneous execution of tasks. OCaml 5 provides effect handlers for concurrency and domains for parallelism.
We will focus on the parallelism features in this tutorial.
Domains are units of parallel computation. New domains can be spawned using
Domain.spawn
primitive:
$ ocaml
# Domain.spawn;;
- : (unit -> 'a) -> 'a Domain.t = <fun>
# Domain.spawn (fun _ -> print_endline "I ran in parallel");;
I ran in parallel
- : unit Domain.t = <abstr>
Use Ctrl+D
to exit.
(If you get the error "Cannot find file topfind," run opam install ocamlfind
, part of the findlib
package.)
The same example is also in src/par.ml:
$ cat src/par.ml
Domain.spawn (fun _ -> print_endline "I ran in parallel")
The dune
command compiles the native version of the above program and runs it:
$ dune exec src/par.exe
I ran in parallel
In this section of the tutorial, we will be running parallel programs. The results observed will be dependent on the number of cores that you have on your machine. I am writing this tutorial on an 2.3 GHz Quad-Core Intel Core i7 MacBook Pro with 4 cores and 8 hardware threads. It is reasonable to expect a speedup of 4x on embarrassingly parallel programs (and a little more if Hyper-Threading gods are kind to us).
We shall use the program to compute the nth Fibonacci number as the running example. The program is in src/fib.ml.
let n = try int_of_string Sys.argv.(1) with _ -> 40
let rec fib n = if n < 2 then 1 else fib (n - 1) + fib (n - 2)
let main () =
let r = fib n in
Printf.printf "fib(%d) = %d\n%!" n r
let _ = main ()
The program is a vanilla implementation of the Fibonacci function.
$ dune build src/fib.exe
$ hyperfine 'dune exec src/fib.exe 40'
Benchmark 1: dune exec src/fib.exe 40
Time (mean ± σ): 498.5 ms ± 4.0 ms [User: 477.8 ms, System: 14.1 ms]
Range (min … max): 493.0 ms … 507.5 ms 10 runs
On my machine, it takes 500ms to compute the 40th Fibonacci number.
Spawned domains can be joined to get their results. The program src/fib_twice.ml computes the nth Fibonacci number twice in parallel.
let n = try int_of_string Sys.argv.(1) with _ -> 40
let rec fib n = if n < 2 then 1 else fib (n - 1) + fib (n - 2)
let main () =
let d1 = Domain.spawn (fun _ -> fib n) in
let d2 = Domain.spawn (fun _ -> fib n) in
let r1 = Domain.join d1 in
Printf.printf "fib(%d) = %d\n%!" n r1;
let r2 = Domain.join d2 in
Printf.printf "fib(%d) = %d\n%!" n r2
let _ = main ()
The program spawns two domains which compute the nth Fibonacci number.
Domain.spawn
returns a Domain.t
value which can be joined to get the result
of the parallel computation. Domain.join
blocks until the computation runs to
completion.
$ dune build src/fib_twice.exe
$ hyperfine 'dune exec src/fib_twice.exe 40'
Benchmark 1: dune exec src/fib_twice.exe 40
Time (mean ± σ): 499.7 ms ± 0.9 ms [User: 940.1 ms, System: 15.5 ms]
Range (min … max): 498.7 ms … 501.6 ms 10 runs
You can see that computing the nth Fibonacci number twice almost took the same time as computing it once thanks to parallelism.
Domains are heavy-weight entities. Each domain directly maps to an operating system thread. Hence, they are relatively expensive to create and tear down. Moreover, each domain brings its own runtime state local to the domain. In particular, each domain has its own minor heap area and major heap pools. Due to the overhead of domains, the recommendation is that you spawn exactly one domain per available core.
OCaml 5 GC is designed to be a low-latency garbage collector with short stop-the-world pauses. Whenever a domain exhausts its minor heap arena, it calls for a stop-the-world, parallel minor GC, where all the domains collect their minor heaps. The domains also perform concurrent (not stop-the-world) collection of the major heap. The major collection cycle involves a number of very short stop-the-world pauses.
Overall, the behaviour of OCaml 5 GC should match that of the OCaml 4 GC for sequential programs, and remains scalable and low-latency for parallel programs. For more information, please have a look at the ICFP 2020 paper and talk on "Retrofitting Parallelism onto OCaml".
Compute the nth Fibonacci number in parallel by parallelising recursive calls. For this exercise, only spawn new domains for the top two recursive calls. You program will only spawn two additional domains. The skeleton is in the file src/fib_par.ml:
let n = try int_of_string Sys.argv.(1) with _ -> 40
let rec fib n = if n < 2 then 1 else fib (n - 1) + fib (n - 2)
let fib_par n =
if n > 20 then begin
(* Only use parallelism when problem size is large enough *)
failwith "not implemented"
end else fib n
let main () =
let r = fib_par n in
Printf.printf "fib(%d) = %d\n%!" n r
let _ = main ()
When you finish the exercise, you will notice that with 2 cores, the speed up is nowhere close to 2x.
% hyperfine 'dune exec src/fib.exe 42'
Benchmark 1: dune exec src/fib.exe 42
Time (mean ± σ): 1.251 s ± 0.014 s [User: 1.223 s, System: 0.016 s]
Range (min … max): 1.236 s … 1.285 s 10 runs
% hyperfine 'dune exec solutions/fib_par.exe 42'
Benchmark 1: dune exec solutions/fib_par.exe 42
Time (mean ± σ): 1.140 s ± 0.053 s [User: 1.625 s, System: 0.021 s]
Range (min … max): 1.072 s … 1.191 s 10 runs
This is because of the fact that the work is not balanced between the two recursive calls of the Fibonacci function.
fib(n) = fib(n-1) + fib(n-2)
fib(n) = (fib(n-2) + fib(n-3)) + fib(n-2)
The left recursive call does more work than the right branch. We shall get to 2x speedup eventually. First, we need to take a detour.
Domain.join
is a way to synchronize with the domain. OCaml 5 also provides
other features for inter-domain communication.
OCaml has mutable reference cells and arrays. Can we share ref cells and arrays between multiple domains and access them in parallel? The answer is yes. But the value that may be returned by a read may not be the latest one written to that memory location due to the influence of compiler and hardware optimizations. The description of the exact value returned by such racy accesses is beyond the scope of the tutorial. For more information on this, you should refer to the PLDI 2018 paper on "Bounding Data Races in Space and Time".
OCaml reference cells and arrays are known as non-atomic data structures. Whenever two domains race to access a non-atomic memory location, and one of the access is a write, then we say that there is a data race. When your program does not have a data race, then the behaviours observed are sequentially consistent -- the observed behaviour can simply be understood as the interleaved execution of different domains. This guarantee is known as data-race-freedom sequential-consistency (DRF-SC).
An important aspect of the OCaml 5 memory model is that, even if your program has data races, your program will not crash (memory safety). The recommendation for the OCaml user is that avoid data races for ease of reasoning.
How do we avoid races? One option is to use the
Atomic
module
which provides low-level atomic mutable references. Importantly, races on atomic
references are not data races, and hence, the programmer will observe
sequentially consistent behaviour.
The program src/incr.ml increments a counter 1M times twice in parallel. As you can see, the non-atomic increment under counts:
% dune exec src/incr.exe
Non-atomic ref count: 1101799
Atomic ref count: 2000000
Atomic module is used for low-level inter-domain communication. They are used
for implementing lock-free data structures. For example, the program
src/msg_passing.ml shows an implementation of message
passing between domains. The program uses get
and set
on the atomic
reference r
for communication. Although the domains race on the access to r
,
since r
is an atomic variable, it is not a data race.
% dune exec src/msg_passing.exe
Hello
Atomic module also has compare_and_set
primitive. compare_and_set r old new
atomically compares the current value of the atomic reference r
with the old
value and replaces that with the new
value. The program
src/incr_cas.ml shows how to implement atomic increment
(inefficiently) using compare_and_set
:
let rec incr r =
let curr = Atomic.get r in
if Atomic.compare_and_set r curr (curr + 1) then ()
else begin
Domain.cpu_relax ();
incr r
end
% dune exec src/incr_cas.exe
Atomic ref count: 2000000
Complete the implementation of the non-blocking atomic stack. The skeleton file
is src/prod_cons_nb.ml. Remember that
compare_and_set
uses physical equality. The old
value provided must
physically match the current value of the atomic reference for the comparison to
succeed.
The only primitive that we have seen so far that blocks a domain is
Domain.join
. OCaml 5 also provides blocking synchronization through
Mutex
,
Condition
and
Semaphore
modules. These are the same modules that are present in OCaml 4 to synchronize
between Threads
. These modules have been lifted up to the level of domains.
In the last exercise src/prod_cons_nb.ml, the pop
operation on the atomic stack returns None
if the stack is empty. In this
exercise, you will complete the implementation of a blocking variant of the
stack where the pop
operation blocks until a matching push
appears. The
skeleton file is src/prod_cons_b.ml.
This exercise may be hard if you have not programmed with mutex and condition variables previously. Fret not. In the next section, we shall look at a higher-level API for parallel programming built on these low-level constructs.
The primitives that we have seen so far are all that OCaml 5 expresses for parallelism. It turns out that these primitives are almost sufficient to implement efficient nested data-parallel programs such as the parallel recursive Fibonacci program.
The missing piece is that we also need an efficient way to suspend the current computation and resume it later, which effect handlers provide. We shall keep the focus of this tutorial on the parallelism primitives. Hence, if you are keen to learn about effect handlers, please do check out the effect handlers tutorial in the OCaml 5 manual.
Domainslib is a library that
provides support for nested-parallel programming, which is epitomized by
the parallelism available in the recursive Fibonacci computation. At its core,
domainslib
has an efficient implementation of work-stealing queue in order to
efficiently share tasks with other domains.
Let's first install domainslib
:
% opam install domainslib
At its core, domainslib
provides an
async/await
mechanism for spawning parallel tasks and waiting on their results. On top of
this mechanism, domainslib
provides parallel
iterators.
Let us now parallelise Fibonacci using domainslib. The program is in the file src/fib_domainslib.ml:
module T = Domainslib.Task
let num_domains = try int_of_string Sys.argv.(1) with _ -> 1
let n = try int_of_string Sys.argv.(2) with _ -> 40
let rec fib n = if n < 2 then 1 else fib (n - 1) + fib (n - 2)
let rec fib_par pool n =
if n > 20 then begin
let a = T.async pool (fun _ -> fib_par pool (n-1)) in
let b = T.async pool (fun _ -> fib_par pool (n-2)) in
T.await pool a + T.await pool b
end else fib n
let main () =
let pool = T.setup_pool ~num_additional_domains:(num_domains - 1) () in
let res = T.run pool (fun _ -> fib_par pool n) in
T.teardown_pool pool;
Printf.printf "fib(%d) = %d\n" n res
let _ = main ()
The program takes the number of domains to use as the first argument and the input as the second argument.
Let's start with the main function. The first
thing to do in order to use domainslib is to set up a pool of domains on which
the nested parallel tasks will run. The domain invoking the run
function will
also participate in executing the tasks submitted to the pool. We invoke the
parallel Fibonacci function fib_par
in the run
function. Finally, we
teardown the pool and print the result.
For sufficiently large inputs (n > 20
), the fib_par
function spawns the left
and the right recursive calls asynchronously in the pool using async
function.
async
function returns a promise for the result. The result of an async
is
obtained by await
ing on the promise, which may block if the promise is not
resolved.
For small inputs, the function simply calls the sequential Fibonacci function. It is important to switch to sequential mode for small problem sizes. If not, the cost of parallelisation will outweigh the work available.
Let's see how this program scales compared to our earlier implementations.
% hyperfine 'dune exec src/fib.exe 42'
Benchmark 1: dune exec src/fib.exe 42
Time (mean ± σ): 1.251 s ± 0.014 s [User: 1.223 s, System: 0.016 s]
Range (min … max): 1.236 s … 1.285 s 10 runs
% hyperfine 'dune exec solutions/fib_par.exe 42'
Benchmark 1: dune exec solutions/fib_par.exe 42
Time (mean ± σ): 1.140 s ± 0.053 s [User: 1.625 s, System: 0.021 s]
Range (min … max): 1.072 s … 1.191 s 10 runs
% hyperfine 'dune exec src/fib_domainslib.exe 2 42'
Benchmark 1: dune exec src/fib_domainslib.exe 2 42
Time (mean ± σ): 666.6 ms ± 9.2 ms [User: 1264.1 ms, System: 18.1 ms]
Range (min … max): 662.0 ms … 692.1 ms 10 runs
The domainslib version scales extremely well. This holds true even as the core
count increases. On a machine with 24 cores, for fib(48)
,
Cores | Time (Seconds) | Vs Serial | Vs Self |
---|---|---|---|
1 | 37.787 | 0.98 | 1 |
2 | 19.034 | 1.94 | 1.99 |
4 | 9.723 | 3.8 | 3.89 |
8 | 5.023 | 7.36 | 7.52 |
16 | 2.914 | 12.68 | 12.97 |
24 | 2.201 | 16.79 | 17.17 |
Implement parallel version of tak
function:
let rec tak x y z =
if x > y then
tak (tak (x-1) y z) (tak (y-1) z x) (tak (z-1) x y)
else z
The skeleton file is in src/tak_par.ml. Calculating the time
complexity of tak
function turns out to be tricky. Use x < 20 && y < 20
as
the sequential cutoff -- if the condition holds, call the sequential version of
tak
.
% hyperfine 'dune exec src/tak.exe 36 24 12' 'dune exec solutions/tak_par.exe 2 36 24 12' 'dune exec solutions/tak_par.exe 4 36 24 12'
Benchmark 1: dune exec src/tak.exe 36 24 12
Time (mean ± σ): 7.259 s ± 0.191 s [User: 7.162 s, System: 0.049 s]
Range (min … max): 6.921 s … 7.540 s 10 runs
Benchmark 2: dune exec solutions/tak_par.exe 2 36 24 12
Time (mean ± σ): 3.112 s ± 0.063 s [User: 6.082 s, System: 0.046 s]
Range (min … max): 3.020 s … 3.188 s 10 runs
Benchmark 3: dune exec solutions/tak_par.exe 4 36 24 12
Time (mean ± σ): 1.793 s ± 0.039 s [User: 6.938 s, System: 0.049 s]
Range (min … max): 1.741 s … 1.871 s 10 runs
Observe that there is super-linear speedup going from the sequential version to the 2 core version! Why?
Implement a parallel version of merge sort. It easy to implement a version that doesn't scale :-) If you use a list for holding the intermediate results, GC impact will kill scalability.
You should use an array for holding the elements to be sorted. The observation is that during the merge step, the length of the merged result is exactly the sum of the input arrays. Hence, one may use an additional array of the same size as the input array to hold the merge results.
Many numerical algorithms use for loops. The parallel for primitive provides a straight-forward way to parallelize such code. Lets take the spectral-norm benchmark from the computer language benchmarks game. The sequential version of the benchmark is available at src/spectralnorm.ml.
We can see that the program has several for loops. How do we know which part of the
program is amenable to parallelism? We can profile the program using perf
to
answer this. perf
only works on Linux.
$ dune build src/spectralnorm.exe
$ perf record --call-graph dwarf ./_build/default/src/spectralnorm.exe
1.274224152
[ perf record: Woken up 115 times to write data ]
[ perf record: Captured and wrote 28.535 MB perf.data (3542 samples) ]
We build the program. The command perf record --call-graph dwarf
informs
perf
to record a trace which includes the call graph information. The report
can be viewed with:
$ perf report
The report shows that the functions eval_A_times_u
and eval_At_times_u
and
their children each take around 50% of the execution time. Those are the useful
ones to parallelise.
The parallel version of the program is in
src/spectralnorm_par.ml. The sequential loop in
eval_A_times_u
:
for i = 0 to n do
let vi = ref 0. in
for j = 0 to n do vi := !vi +. eval_A i j *. u.(j) done;
v.(i) <- !vi
done
becomes:
T.parallel_for pool ~start:0 ~finish:n ~body:(fun i ->
let vi = ref 0. in
for j = 0 to n do vi := !vi +. eval_A i j *. u.(j) done;
v.(i) <- !vi
)
The rest of the code changes is just boilerplate code. The resultant code scales nicely:
% hyperfine 'dune exec src/spectralnorm.exe 4096'
Benchmark 1: dune exec src/spectralnorm.exe 4096
Time (mean ± σ): 2.060 s ± 0.016 s [User: 2.017 s, System: 0.026 s]
Range (min … max): 2.027 s … 2.078 s 10 runs
% hyperfine 'dune exec src/spectralnorm_par.exe 2 4096' 'dune exec src/spectralnorm_par.exe 4 4096'
Benchmark 1: dune exec src/spectralnorm_par.exe 2 4096
Time (mean ± σ): 1.169 s ± 0.053 s [User: 2.201 s, System: 0.030 s]
Range (min … max): 1.109 s … 1.294 s 10 runs
Benchmark 2: dune exec src/spectralnorm_par.exe 4 4096
Time (mean ± σ): 702.3 ms ± 20.7 ms [User: 2599.1 ms, System: 39.5 ms]
Range (min … max): 674.0 ms … 741.4 ms 10 runs
Implement parallel version of Game of Life simulation. The sequential version is in src/game_of_life.ml. The sequential version takes the number of iterations and the board size as the first and second arguments.
You should modify src/game_of_life_par.ml with the parallel version. Currently, this file is the same as the sequential version except that it takes the number of domains as the first argument, the number iterations as the second argument and the board size as the third argument.
Let's parallelise something more tricky -- the sequential version of mandelbrot from the computer language benchmarks game. The sequential version is available in src/mandelbrot.ml.
$ dune exec src/mandelbrot.exe 1024 > output.bmp
The tricky bit here is that the program outputs bytes to stdout
in the body of
the loop. In the parallel version, the order of the output should be preserved.
In the parallel version -- src/mandelbrot_par.ml -- we
use the parallel_for_reduce
primitive. Each parallel iteration accumulates the
output in a Buffer.t
and returns it. parallel_for_reduce
accumulates the
outputs in a list, which is finally output to stdout
.
% hyperfine 'dune exec src/mandelbrot.exe 4096 > output.bmp'
Benchmark 1: dune exec src/mandelbrot.exe 4096 > output.bmp
Time (mean ± σ): 1.755 s ± 0.006 s [User: 1.717 s, System: 0.023 s]
Range (min … max): 1.750 s … 1.771 s 10 runs
% hyperfine 'dune exec src/mandelbrot_par.exe 2 4096 > output.bmp'
Benchmark 1: dune exec src/mandelbrot_par.exe 2 4096 > output.bmp
Time (mean ± σ): 871.9 ms ± 7.2 ms [User: 1662.0 ms, System: 22.6 ms]
Range (min … max): 866.4 ms … 888.9 ms 10 runs
% hyperfine 'dune exec src/mandelbrot_par.exe 4 4096 > output.bmp'
Benchmark 1: dune exec src/mandelbrot_par.exe 4 4096 > output.bmp
Time (mean ± σ): 486.5 ms ± 7.5 ms [User: 1723.0 ms, System: 23.7 ms]
Range (min … max): 474.5 ms … 502.8 ms 10 runs