Expression Profiling with Sharrow #937

@jpn--

Description

The current expression profiler added in #936 can measure the runtime performance of spec file expressions in legacy mode by observing the elapsed time for evaluating each expression. This is called instrumentation, as we are adding instruments directly to the code being observed, to monitor the runtime of each expression. This is effective because expressions are evaluated in a serial fashion: all the work to evaluate a single expression across all choosers is completed before the work to evaluate the next expression begins. Therefore, the number of time observations needed to complete the profiling increases only linearly with the number of expressions, and is independent of the number of choosers.
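To illustrate, here is a minimal sketch of the instrumentation pattern in legacy mode (the `spec` and `choosers` names and the use of `DataFrame.eval` are illustrative stand-ins, not the actual ActivitySim internals):

```python
import time

import pandas as pd

def profile_legacy(spec: list[str], choosers: pd.DataFrame) -> pd.Series:
    """Time each spec expression, evaluated serially across all choosers."""
    timings = {}
    for expression in spec:
        start = time.perf_counter()
        choosers.eval(expression)  # one expression, all choosers at once
        timings[expression] = time.perf_counter() - start
    return pd.Series(timings, name="runtime_seconds")
```

Note that the number of timing observations equals the number of expressions, no matter how many choosers there are.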

This profiler does not work with sharrow.

Why? Because sharrow evaluates spec files differently: at least from the perspective of the raw code itself, choosers are processed serially, with all expressions evaluated for each chooser before moving on to the next chooser. Even this may not hold exactly after compilation, as the numba compiler may optimize the code by processing data in a different order or by handling several choosers simultaneously. Moreover, if timing code is written into numba-compiled expressions, it is possible that the compiler would optimize away the timing itself (i.e., it could detect that the substantive code is completely independent of the timing code, and “complete” the timing instructions before completing the substantive instructions). Disabling optimization to prevent this could also disable some of the substantive optimizations that allow sharrow to run efficiently, defeating the purpose of the profiling.
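Schematically, the difference in loop order is the crux (pure stand-in code; `spec`, `choosers`, and `evaluate` are placeholders, not real sharrow names):

```python
spec = ["expr_a", "expr_b"]            # stand-in expressions
choosers = ["chooser_1", "chooser_2"]  # stand-in choosers
evaluate = lambda expr, chooser: None  # stand-in for the real work

# Legacy mode: expression-major order.  One pair of timing observations
# per expression brackets all the work for that expression.
for expression in spec:
    for chooser in choosers:
        evaluate(expression, chooser)

# Sharrow (as written, before numba optimization): chooser-major order.
# Timing a single expression would require observations inside the inner
# loop, i.e. on the order of n_expressions * n_choosers observations.
for chooser in choosers:
    for expression in spec:
        evaluate(expression, chooser)
```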

Moreover, even if those code optimizations are disabled to ensure processing is sequential, the number of time observations required increases with the product of the number of expressions and the number of choosers, imposing a potentially massive overhead on profiling that will likely distort the results. This overhead might be mitigated or avoided by using a statistical sampler instead of instrumentation. Profiling via sampling involves having an outside tool peek into the running process at regular (frequent) intervals to see what is happening. This is how the current ActivitySim memory profiling works, although profiling gross memory usage is easier than profiling compiled code. There is a statistical profiling tool for numba called “Profila”, https://github.com/pythonspeed/profila. However, this tool is useful only for single-process (non-parallelized) code, and more importantly, it only runs on Linux. It is primarily supported by a single developer, who appears to be actively working on it and shows some interest in porting the tool to other platforms, but not Windows. Given that most ActivitySim users run on Windows and do not want to work in Linux, this tool seems unsuitable.
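As a concept sketch only, a pure-Python statistical sampler can be built from a background thread that periodically inspects another thread's stack. A sampler like this cannot see inside numba-compiled machine code (each sample just lands on the Python frame that called into the compiled region), which is precisely the gap a tool like Profila fills:

```python
import collections
import sys
import threading
import time

def sample_stacks(target_id, counts, stop, interval=0.01):
    """Record the (filename, line) the target thread is executing."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_id)
        if frame is not None:
            counts[(frame.f_code.co_filename, frame.f_lineno)] += 1
        time.sleep(interval)

counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks, args=(threading.main_thread().ident, counts, stop)
)
sampler.start()
sum(i * i for i in range(10_000_000))  # stand-in workload to be profiled
stop.set()
sampler.join()
print(counts.most_common(5))  # lines where samples landed most often
```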

Design Options for Sharrow Profiler

Given the complexity, the expected overhead, and the likelihood that attempting to directly measure expression runtime from inside a sharrow component will fundamentally change the runtimes being measured, direct internal measurement is not a recommended solution. As an alternative, we propose to measure sharrow expression runtime from outside of sharrow, or more specifically from outside the numba-compiled portions of sharrow. This could be approached in one of two ways:

  1. Compile a separate sharrow flow for each single expression in a spec file, and externally measure the runtime for each flow.
  2. Compile a separate sharrow flow for each marginal expression in a spec file, and externally measure the marginal increase in runtime for each flow.

Either of these approaches would move the runtime measurement outside of the numba compiler into interpreted Python code, which would help solve many of the issues described above. However, each comes with drawbacks.

Profiling by Expression

Profiling the sharrow-compiled spec files by writing separate sharrow-compiled pieces for each individual expression is appealing, because it would seem to give the cleanest view of the runtime for each expression. It also avoids the risk of strange results that might be difficult for less-skilled users to interpret (e.g. negative runtimes). There is some potential complexity in implementing this, as spec files can include temporary values that are cross-referenced in subsequent expressions, and these values would not be internally available to sharrow when running expressions as singletons. It would be possible to feed these values to the sharrow process externally by pre-computing them and adding them to a dictionary of inputs, but this is not exactly the same as having cached temporaries already available inside the compiled code, and the runtime implications of this are unknown.
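A rough sketch of what a singleton harness might look like; `compile_flow` is a hypothetical stand-in for whatever sharrow machinery would build a single-expression flow, and the leading-underscore naming of temporary values is an assumption borrowed from the ActivitySim spec convention:

```python
import time
from typing import Callable

import pandas as pd

def profile_singleton(
    spec_expressions: list[tuple[str, str]],  # (target name, expression)
    choosers: pd.DataFrame,
    compile_flow: Callable,  # hypothetical: (expr, extra_inputs) -> callable
) -> pd.Series:
    """Time each expression in its own compiled flow."""
    precomputed = {}  # temporaries fed externally to later flows
    timings = {}
    for target, expression in spec_expressions:
        flow = compile_flow(expression, extra_inputs=precomputed)
        start = time.perf_counter()
        result = flow(choosers)
        timings[target] = time.perf_counter() - start
        if target.startswith("_"):  # assumed marker for a temporary value
            precomputed[target] = result
    return pd.Series(timings, name="runtime_seconds")
```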

Profiling by Marginal Expression

Under this approach, sharrow spec files would be constructed where each expression is evaluated in a compiled context together with all the expressions that come before it. The marginal runtime would then be observed by comparing against the runtime of the previous test, which included one fewer expression. This approach has a distinct advantage over the one-at-a-time approach, in that it is measuring an overall runtime in a context that is more similar to the ultimate result (i.e. running the complete spec file). This will allow the compiler to employ a wider set of optimizations, more like the optimizations used for the full spec, which is ultimately the question of interest.
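A sketch of the marginal harness under the same assumptions (`compile_flow` again stands in for the sharrow compile step, here taking a list of expressions):

```python
import time
from typing import Callable

import pandas as pd

def profile_marginal(
    spec_expressions: list[str],
    choosers: pd.DataFrame,
    compile_flow: Callable,  # hypothetical: (list of exprs) -> callable
) -> pd.Series:
    """Difference consecutive cumulative runtimes to get marginal times."""
    marginal = {}
    previous = 0.0
    for k in range(1, len(spec_expressions) + 1):
        flow = compile_flow(spec_expressions[:k])  # first k expressions
        start = time.perf_counter()
        flow(choosers)
        total = time.perf_counter() - start
        # Measurement noise can make this difference negative.
        marginal[spec_expressions[k - 1]] = total - previous
        previous = total
    return pd.Series(marginal, name="marginal_runtime_seconds")
```

Since the k-th test evaluates k expressions, the total work summed over all n tests is n(n+1)/2 expression evaluations, which is the quadratic growth discussed below.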

This approach does have some downsides as well. First, because runtimes will be calculated by taking the difference of two separate measurements, there is twice as much opportunity for exogenous factors (other processes, thermal throttling, etc.) to significantly skew the results. Second, overall runtimes always have some small amount of stochasticity, so profiling will only be reliable if undertaken on datasets large enough that the signal overwhelms this noise. Third, the total runtime needed to do the profiling will grow in proportion to the square of the number of expressions, instead of growing linearly. This is probably not a serious problem, as profiling will only be done infrequently, but exceedingly long waits for profiling results can be annoying nonetheless.

Which approach is better?

It is unclear to me at this time which approach is better. I welcome input from the ActivitySim community if anyone strongly believes that one is superior, based on evidence or experience from other projects. Absent such insight, the best way to evaluate these two approaches may be to implement them both and do some testing. Ultimately they are not extremely different, so implementing both will not require anywhere near double the resources of implementing only one.
