
Effects: double translation of functions and dynamic switching between direct-style and CPS code #1461

Open. Wants to merge 2 commits into base: master.
Conversation

OlivierNicole
Contributor

@OlivierNicole OlivierNicole commented Apr 28, 2023

Based on an initial suggestion by @lpw25, we generate two versions of each function that may be called from inside an effect handler (according to the existing static analysis): a direct-style version and a CPS version. At runtime, the direct-style versions are used, except when entering an effect handler, in which case only CPS code runs until the outermost effect handler is exited. Such functions are conceptually transformed into pairs of functions. For performance, the CPS closure is stored in the f.cps field of the direct-style function f. This is joint work with @vouillon.
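To make the scheme concrete, here is a hypothetical sketch (all names invented, not the actual generated code): each transformed function exists as a direct-style closure carrying its CPS version in a `cps` field, and a call site dispatches on whether execution is currently inside an effect handler.

```javascript
// Sketch only: invented names, heavily simplified dispatch.
var in_handler = false; // set while inside an effect handler

function mk(direct, cps) { direct.cps = cps; return direct; }

var add1 = mk(
  function (x) { return x + 1; },        // direct-style version
  function (x, k) { return k(x + 1); }   // CPS version (continuation k)
);

// How a call site might dispatch between the two versions:
function call_add1(x, k) {
  if (in_handler) return add1.cps.call(null, x, k);
  return k(add1(x));
}

var r_direct, r_cps;
call_add1(41, function (r) { r_direct = r; }); // direct-style path
in_handler = true;                             // as if entering a handler
call_add1(41, function (r) { r_cps = r; });    // CPS path
```

Both paths compute the same result; only the calling convention differs, which is what lets direct-style code run at full speed outside handlers.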

We encountered a design difficulty: when functions are transformed into pairs of functions, it is unclear how to deal with captured identifiers when the functions are nested. To avoid this problem, functions that must be transformed are lambda-lifted, and thus no longer have any free variables except toplevel ones.
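For readers unfamiliar with lambda lifting, a minimal sketch of the idea (invented example, shown in JavaScript for uniformity with the rest of this thread): a nested function's free variables become explicit parameters, so the function can be moved to toplevel.

```javascript
// Before lifting: g captures the free variable `a`.
function before(a) {
  function g(x) { return x + a; } // free variable: a
  return g(1) + g(2);
}

// After lifting: g is a toplevel function; `a` is passed explicitly,
// so g_lifted has no free variables.
function g_lifted(a, x) { return x + a; }
function after(a) { return g_lifted(a, 1) + g_lifted(a, 2); }
```

After lifting, splitting a function into a direct-style/CPS pair no longer raises the question of which pair member owns the captured environment.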

The transform is rather successful in preserving the performance of small / monomorphic programs.

  • I hypothesize that hamming is slower for the same reason as on current master: it uses lazy values, which are an obstacle for the global flow analysis. Edit: I am not able to reproduce a slowdown on hamming in my latest round of benchmarks.
  • A number of micro-benchmarks are somewhat faster, maybe because the static analysis performed during the CPS transform is better at finding exact-arity calls.
  • I am not sure why fft is slightly slower, however. The generated code looks very similar.

[graph: benchmark results]

The difference becomes negligible on large programs. CAMLboy is actually… 3 % faster with effects enabled (compared to 25 % slower previously): 520 FPS instead of 505 FPS, although the standard deviation is high at ~11 FPS, so it would be fair to say that the difference is not discernible.

ocamlc is not discernibly slower, either (compared to 10 % slower previously).

As some functions must be generated in two versions, the generated code is larger (up to 76 % larger), and a few percent larger when compressed.

[graphs: code size comparison]

Compiling ocamlc is about 70 % slower; the resulting file is 64 % larger when compressed.

@OlivierNicole
Contributor Author

I think this PR is best reviewed as a whole, not commit by commit.

@OlivierNicole
Contributor Author

OlivierNicole commented Apr 28, 2023

It looks like these two effect handler benchmarks are slower with this PR, 18 % and 8 % slower, respectively. I need to spend some time on it to understand why.

            5.2.0     this PR
generators  5.593 s   6.651 s
chameneos   36.6 ms   39.5 ms

@kayceesrk

Chameneos runs are too short. You should increase the input size. It takes the input size as a command line argument. Something that runs for a second is more representative as it eliminates the noise due to JIT and other constant time costs.

@OlivierNicole
Contributor Author

Good point. I find that chameneos is 9.8 % slower with this PR: 3.753 s versus 3.428 s.

My theory is that effect handlers are slightly slower because function calls in CPS code cost an additional field access (applying f.cps instead of just f). So benchmarks that use effect handlers intensively are unfavorable to this PR. However, I expect that programs which mix more usual code with some effect handling (i.e., programs that do not spend all of their time in effect handlers) will see their performance much improved by this PR, like the non-effect-using programs above.

@kayceesrk

I agree with the reasoning and do not expect real programs to behave like generators or chameneos. The performance difference is small enough that I would consider the performance to be good enough for programs that heavily use effect handlers.

@kayceesrk

Btw, the original numbers aren't useful to understand the improvements brought about by this PR. For this, you need 3 variants:

  1. default
  2. --enable=effects on master
  3. --enable=effects on this PR

I'd be interested to see the difference between (2) and (3) in addition to the current numbers which show the difference between (1) and (3).

@vouillon
Member

vouillon commented May 2, 2023

My theory is that effect handlers are slightly slower due to the fact that function calls in CPS code cost an additional field access (applying f.cps instead of just f). So these benchmarks that use effect handlers intensively are unfavorable.

Note that f.cps(x1,...,xn) is a method call, which is somewhat slower than a plain function call. It might be faster to do the following instead: f.cps.call(null,x1,...,xn)

I had to do that in #1397:

```
(* Make sure we are performing a regular call, not a (slower)
   method call *)
match f with
| J.EAccess _ | J.EDot _ ->
    J.call (J.dot f (Utf8_string.of_string_exn "call")) (s_var "null" :: params) J.N
| _ -> J.call f params J.N
```
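The distinction the snippet above draws can be illustrated directly in JavaScript (invented example): `f.cps(...)` is a method call, which binds `this` to `f` and can be slower in some engines, whereas `f.cps.call(null, ...)` is an ordinary call with `this` left unset.

```javascript
// Invented example comparing the two call shapes.
var f = function (x) { return x * 2; };
f.cps = function (x, k) { return k(x * 2); };

var r_method, r_plain;
f.cps(3, function (r) { r_method = r; });            // method call: `this` is f
f.cps.call(null, 3, function (r) { r_plain = r; });  // plain call: `this` is null
```

The two forms are semantically equivalent here since the CPS body never uses `this`; the difference is purely a matter of engine-level call overhead.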

@OlivierNicole
Contributor Author

I believe that the form f.cps.call(null, x1, ..., xn) is already the one used.

Btw, the original numbers aren't useful to understand the improvements brought about by this PR. For this, you need 3 variants:

1. default

2. --enable=effects on `master`

3. --enable=effects on this PR

I'd be interested to see the difference between (2) and (3) in addition to the current numbers which show the difference between (1) and (3).

Here are the graphs showing the difference between --enable=effects on master (revision 5.2.0) and --enable=effects on this PR:

[graphs: run time and code size with --enable=effects, master versus this PR]

@kayceesrk

Thanks. The execution time improvement is smaller than what I would have expected. Is that surprising to you or does it match your expectation?

Also, it would be useful to have all the variants plotted in the same graph with direct as the baseline.

@OlivierNicole
Contributor Author

It more or less matches my expectation. My reasoning is the following: on most of these small, monomorphic benchmarks, the static analysis will eliminate most CPS calls at compile time. Therefore, the dynamic switching will not change the run time a lot and maybe slightly worsen it. On benchmarks that heavily use effect handlers, I also expect the run time to be worse: most of the time is spent in CPS code anyway, the dynamic switching only adds overhead.

I therefore expect the biggest improvements to happen on larger programs, on which the static analysis does not work as well due to higher-order functions and mutability, and which do not spend most of their time in effect handlers.

If my hypothesis is verified, then the question is: is this trade-off acceptable? Keeping in mind that there might be ways to improve this PR to recover more performance.

@OlivierNicole
Contributor Author

Also, it would be useful to have all the variants plotted in the same graph with direct as the baseline.

I have updated the PR message with new graphs showing all the variants.

@OlivierNicole
Contributor Author

I tried to build a benchmark that uses Domainslib, as an example of a more typical effect-using program. But the linker complains that the primitive caml_thread_initialize is missing.

I tried to add it in a new runtime/thread.js file, but it doesn’t seem to be taken into account; I’m not sure what the right way to add a primitive is.

Also, when I build js_of_ocaml I’m getting a lot of primitive-related messages that I don’t really understand:

$ dune exec -- js_of_ocaml
Entering directory '/home/olivier/jsoo/js_of_ocaml'
warning: overriding primitive "caml_call_gen"
  old: /home/olivier/jsoo/js_of_ocaml/_build/default/runtime/stdlib.js:154
  new: /home/olivier/jsoo/js_of_ocaml/_build/default/runtime/stdlib_modern.js:151
warning: overriding primitive "caml_call_gen_cps"
  old: /home/olivier/jsoo/js_of_ocaml/_build/default/runtime/stdlib.js:159
  new: /home/olivier/jsoo/js_of_ocaml/_build/default/runtime/stdlib_modern.js:156

@OlivierNicole
Contributor Author

I tried to build a benchmark that uses Domainslib, as an example of a more typical effect-using program. But the linker complains that the primitive caml_thread_initialize is missing.

I solved it by downgrading to lockfree 0.3.0 as suggested by @jonludlam. But the resulting program never completes. I assume that the mock parallelism provided by the runtime doesn’t suffice for using Domainslib—a “domain” must be stuck forever spinwaiting or something.

@OlivierNicole OlivierNicole force-pushed the optim_effects branch 2 times, most recently from 5362ff4 to 91e352e Compare May 17, 2023 14:41
@OlivierNicole
Contributor Author

I think that this PR is ready for review. The only two problems that prevent the CI from being green are:

  • the Array.fold_left_map function is only available from OCaml 4.13. What is the policy in this case: do we add it to compiler/lib/stdlib.ml or do we avoid using it?
  • a stack overflow when running the testsuite with the using-effects profile, which I have yet to investigate.

@OlivierNicole OlivierNicole marked this pull request as ready for review May 18, 2023 00:27
@hhugo
Member

hhugo commented May 18, 2023

the Array.fold_left_map function being available only from 4.13. What is the policy in this case, do we add it to compiler/lib/stdlib.ml or do we avoid using it?

Just add it to the stdlib module compiler/lib/stdlib.ml

(Review comment on compiler/lib/code.ml: outdated, resolved)
@hhugo
Member

hhugo commented May 22, 2023

Lambda_lifting doesn't seem to be used anymore; is that expected? Should Lambda_lifting_simple replace Lambda_lifting?

@hhugo
Member

hhugo commented May 22, 2023

From a quick look at the PR, the benefit of such a change is not clear. Can you highlight examples where we see clear improvements?

@OlivierNicole
Contributor Author

Lambda_lifting doesn't seem to be used anymore; is that expected? Should Lambda_lifting_simple replace Lambda_lifting?

As I discovered just today, Lambda_lifting is still relevant to avoid generating too deeply nested functions. I just pushed a commit that reinstates the post-CPS-transform Lambda_lifting.f pass. Therefore, Lambda_lifting is now used.

There are now two lambda-lifting passes, for two different reasons. Lambda_lifting and Lambda_lifting_simple do rather different things. The latter simply lifts functions to toplevel, but it takes as a parameter which functions to lift, and returns information about the lifted functions to be used by the subsequent CPS transform; it also handles mutually recursive functions. Lambda_lifting does none of this, as it is not useful for its purpose; however, its lifting threshold and baseline are configurable.

For this reason I am not convinced that there is an interest in merging the two modules.

From a quick look at the PR, the benefit of such a change is not clear. Can you highlight examples where we see clear improvements?

I am convinced that most real-world effect-using programs will benefit from this PR, for the reasons given in my message above; but it’s hard to prove, because we don’t yet have examples of such typical programs that work in JavaScript. Programs using Domainslib don’t work well with js_of_ocaml (and are arguably not really relevant, as JS is not a multicore language). Concurrency libraries like Eio are a more natural fit. I am currently trying to cook up a benchmark using the experimental JS backend for Eio.

@hhugo
Member

hhugo commented May 22, 2023

Given the size impact of this change, it would be nice to be able to disable (or control) this optimization. There are programs that would not benefit from it, and it would be nice not to have to pay the size cost for no benefit.

The two lambda-lifting modules are confusing; we should either merge them (allowing duplicated code to be shared) or at least find better names.

Compiling ocamlc is about 70 % slower

Do you know where this comes from? Does it mostly come from the new lambda-lifting pass, or are other passes heavily affected as well (generate, js_assign, ...)?

(Review comment on compiler/lib/subst.mli: outdated, resolved)
@OlivierNicole
Contributor Author

Thank you for the review and sorry for the response delay, I have been prioritizing another objective in the previous weeks.

One update is that there is no performance gain on programs that use Eio, which is a shame as it is expected to be one of the central uses of effects. More generally, when the program stays almost all the time within at least one level of effect handlers, there is essentially no performance gain, as we run the CPS version of every function. And I expect this programming pattern (installing a topmost effect handler at the beginning of the program) to be the most common with effects.

So it is unclear to me yet if the implementation of double translation can be adapted to accommodate this.

@kayceesrk

I'm trying to understand the results in this PR. Given the current results, the sense I get is that double translation is

  1. slower than default and than enabling effects, with larger code size, on programs that do not use effect handlers: #1461 (comment)
  2. slower than enabling effects on programs that use effect handlers: #1461 (comment)

Is this a correct reading? What is missing (or I am failing to see) are cases where the double translation is faster where it is expected to be. Do you have a (micro-)benchmark that exhibits the best-case scenario for double translation in terms of running time? Such a program would be useful to determine the upper limit of the performance benefit; any real programs will have an improvement that is smaller.

@OlivierNicole
Contributor Author

  1. slower than default and than enabling effects, with larger code size, on programs that do not use effect handlers: #1461 (comment)

I don’t think the data points uniformly in the direction of a slowdown. Microbenchmarks are sometimes faster, sometimes slower. Admittedly, when they are slower, they can be up to 20 % slower, but this only happens on one benchmark. Macroscopic benchmarks that do not use effect handlers show no significant difference in run time. It is true that the code size is larger.

Is this a correct reading? What is missing (or I am failing to see) are cases where the double translation is faster where it is expected to be. Do you have a (micro-)benchmark that exhibits the best-case scenario for double translation in terms of running time?

I expect the improvements to happen on larger programs which do not spend most of their time in effect handlers. Unfortunately, this does not seem to be a typical use case as of now, as e.g. Eio programs tend to install an effect handler at the start of the program, in which case dynamic switching will probably not help and will only increase the code size. I can try to write programs that would benefit from dynamic switching, I just fear they will not represent real-life use cases.

@lpw25

lpw25 commented Dec 5, 2023

I think that in its current state you're probably not going to see much benefit on programs using effects for concurrency. However, you probably would see benefit in, say, a program that used effects to implement generators. Such programs use effects, but don't spend most of their execution time within the effect handler.

The real win with this approach requires some analysis that can tell you when a particular call of a function won't use effects. That would come naturally out of the type-based approaches that we'll be experimenting with at Jane Street, but you could probably also write other static analyses to obtain such information. For instance, you could look at a call to List.map and see that List.map itself doesn't use effects and that none of its arguments contain functions that use effects, and know that this specific call couldn't use effects.

@OlivierNicole
Contributor Author

For instance, you could look at a call to List.map and see that List.map itself doesn't use effects and that none of its arguments contain functions that use effects, and know that this specific call couldn't use effects.

Our current data-flow analysis does that, but it is not context-sensitive: each function is analyzed in a single context which is the join of all possible call contexts. So heavily used higher-order functions like List.map are quickly marked as possibly performing effects. Maybe things could be improved by making the analysis partially context-sensitive.
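The joining behavior can be sketched with a toy taint propagation (entirely invented names; not the actual jsoo analysis): because facts are joined over all call sites, one effect-performing argument is enough to mark a higher-order function as effectful everywhere.

```javascript
// Toy, context-insensitive "may perform effects" propagation.
// Calls observed in a hypothetical program: map is used once with a
// pure argument and once with an effect-performing one.
var calls = [
  { hof: "map", arg: "incr" },  // pure use of map
  { hof: "map", arg: "gen" }    // effect-performing use of map
];
var performsEffects = { incr: false, gen: true, map: false };

// One propagation round: a higher-order function is tainted if ANY of
// its observed arguments may perform effects. After this, EVERY call
// to map is treated as possibly effectful, including the pure one.
for (var i = 0; i < calls.length; i++) {
  if (performsEffects[calls[i].arg]) performsEffects[calls[i].hof] = true;
}
```

A context-sensitive analysis would instead keep the two call sites apart and leave the pure use of `map` untainted.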

@kayceesrk

kayceesrk commented Dec 6, 2023

However, you probably would see benefit in, say, a program that used effects to implement generators.

Would this program make sense? https://github.com/ocaml-multicore/effects-examples/blob/master/generator.ml

The program benchmarks three different versions of iterating over a binary tree: a plain iterator, a generator implemented by hand using CPS, and one using effect handlers. The native-code executable produces the following output:

% dune exec ./generator.exe
Iter: mean = 0.100558, sd = 0.000017
Gen_cps: mean = 0.336419, sd = 0.000009
Gen_eff: mean = 1.506840, sd = 0.000424

I would be curious to see the performance of this program with

  1. CPS but no static analysis (from the partial CPS transform in #1384; but I'm not sure whether the option to enable/disable the analysis is behind a flag)
  2. CPS but with the static analysis (current trunk)
  3. This PR.

@OlivierNicole
Contributor Author

Thank you for suggesting this benchmark.

Here is the output with only the static analysis (current state on master):

Iter: mean = 1.623000, sd = 0.007848
Gen_cps: mean = 0.928100, sd = 0.004282
Gen_eff: mean = 5.990300, sd = 0.046392

Here is the output with static analysis + double translation (this PR):

Iter: mean = 0.269500, sd = 0.001488
Gen_cps: mean = 0.667400, sd = 0.000239
Gen_eff: mean = 6.906500, sd = 0.158372

A first notable thing is that the effect-using code (Gen_eff) is about 17 % slower. A second notable thing is that functions that do not use effects at all (Iter and Gen_cps) are 500 % and 39 % faster, respectively.

How is this possible? It’s because all three benchmarks use a function, namely Tree.iter, which needs to be translated to CPS. Without double translation, Iter and Gen_cps are “contaminated”, as they have to run Tree.iter in CPS. With double translation, non-effect-using code is essentially unaffected.

The consequences of this “contamination” are aggravated if we modify the benchmark to call Tree.iter more often. (This is not so unrealistic: think of code that relies heavily on a function like List.map, which easily gets contaminated.)

```diff
diff --git a/test_generator/generator.ml b/test_generator/generator.ml
index df7f2741f6..49e1a81073 100644
--- a/test_generator/generator.ml
+++ b/test_generator/generator.ml
@@ -17,6 +17,8 @@ module type TREE = sig
   (** [deep n] constructs a tree of depth n, in linear time, where every node at
       level [l] has value [l]. *)
 
+  val iter : ('a -> unit) -> 'a t -> unit
+
   val to_iter : 'a t -> ('a -> unit) -> unit
   (** Iterator function. *)
 
@@ -123,7 +125,12 @@ let t = Tree.deep n
 let iter_fun () = Tree.to_iter t (fun _ -> ())
 let m, sd = benchmark iter_fun 10
 let () = printf "Iter: mean = %f, sd = %f\n%!" m sd
-let rec consume_all f = match f () with None -> () | Some _ -> consume_all f
+let t' = Tree.deep 4
+let total = ref 0
+let rec consume_all f =
+  match f () with
+  | None -> ()
+  | Some n -> Tree.iter (fun m -> total := !total + m + n) t'; consume_all f
 
 let gen_cps_fun () =
   let f = Tree.to_gen_cps t in
```

Output with only the static analysis:

Iter: mean = 1.621400, sd = 0.013772
Gen_cps: mean = 32.865400, sd = 3.180572
Gen_eff: mean = 41.179400, sd = 6.977796

Output with static analysis + double translation:

Iter: mean = 0.254200, sd = 0.002514
Gen_cps: mean = 5.582200, sd = 0.113679
Gen_eff: mean = 13.450000, sd = 0.489052

Here, all three benchmarks are much faster with double translation, even the one using effects.

This is also the reason why double translation would combine very well with Jane Street’s locality modes for effects: as @vouillon’s experiments have shown, they allow many more functions to be kept in direct style, but, paradoxically, this forces switching more often between direct style and CPS, which is costly (another case of contamination). With double translation, this is not the case: all these functions can be called in direct style.

@kayceesrk

Thanks for the benchmark results. The purpose of the PR is now apparent from the performance side.

@OlivierNicole OlivierNicole changed the title Effects: dynamic switching between direct-style and CPS code Effects: double translation of functions and dynamic switching between direct-style and CPS code Jun 5, 2024
OlivierNicole added a commit to OlivierNicole/js_of_ocaml that referenced this pull request Jun 6, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
OlivierNicole added a commit to OlivierNicole/js_of_ocaml that referenced this pull request Jun 10, 2024
... dynamic switching between direct-style and CPS code. (ocsigen#1461)
@OlivierNicole
Contributor Author

This PR is now rebased on master and ready again for review (cc @vouillon).

Of note: as previously, a number of expect tests underwent some variable name substitution, i.e., they are the same up to alpha equivalence. I have not looked closely at the cause; I assume that in the CPS transform, some new variables are now created and not always used. If this is a problem, I can take some time to look at it.

@OlivierNicole
Contributor Author

I pushed a commit which adds a new primitive, caml_assume_no_effects, which makes it possible to guarantee that a function is called in its (faster) direct-style version, for optimization purposes. See the commit message for details.

... dynamic switching between direct-style and CPS code. (ocsigen#1461)
Passing a function [f] as an argument to `caml_assume_no_effects`
guarantees that, when compiling with `--enable doubletranslate`, the
direct-style version of [f] is called, which is faster than the CPS
version. As a consequence, performing an effect in a transitive callee
of [f] will raise `Effect.Unhandled`, regardless of any effect handlers
installed before the call to `caml_assume_no_effects`, unless a new
effect handler was installed in the meantime.

Usage:

```
external assume_no_effects : (unit -> 'a) -> 'a = "caml_assume_no_effects"

... caml_assume_no_effects (fun () -> (* Will be called in direct style... *)) ...
```

When double translation is disabled, `caml_assume_no_effects` simply
acts like `fun f -> f ()`.