Updated Cross correlation C routines for 0.3.0
- Project to implement OpenMP parallelisation for computing cross correlations in EQcorrscan
- Limited by memory (would like to run more templates at once)
- Outer loop over channels
- Within the channel loop there are FFTs and other loops (dot product, normalisation)
- Introduce OpenMP at the outer level (equivalent to current Python multiprocessing parallelisation)
- Introduce OpenMP/threading at the inner level
The first step was to add OpenMP to the outer loop (over channels). Some key points:
- Creating FFTW plans is not thread safe - they must be created before the OpenMP loop and can then be shared across all threads
- The number of threads cannot be more than the number of channels
- We have to allocate an FFT workspace for each thread, so memory increases with the number of threads (see the sketch below)
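The outer-loop structure looks roughly like the following. This is a minimal sketch, not the actual EQcorrscan source: the function, array names, and sizes are placeholders.

```c
#include <fftw3.h>
#include <omp.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the outer (per-channel) loop. The FFTW plan is created once,
 * before the parallel region, because plan creation is not thread safe;
 * executing a shared plan on distinct per-thread arrays is safe. */
void correlate_all_channels(double **signals, int n_channels, int fft_len,
                            int n_threads)
{
    if (n_threads > n_channels)
        n_threads = n_channels;  /* no benefit from more threads than channels */

    /* One FFT workspace per thread: memory grows with the thread count. */
    double **work_in = malloc(n_threads * sizeof(double *));
    fftw_complex **work_out = malloc(n_threads * sizeof(fftw_complex *));
    for (int t = 0; t < n_threads; t++) {
        work_in[t] = fftw_alloc_real(fft_len);
        work_out[t] = fftw_alloc_complex(fft_len / 2 + 1);
    }

    /* Create the plan once; it is shared (read-only) by all threads. */
    fftw_plan plan = fftw_plan_dft_r2c_1d(fft_len, work_in[0], work_out[0],
                                          FFTW_ESTIMATE);

    #pragma omp parallel for num_threads(n_threads)
    for (int ch = 0; ch < n_channels; ch++) {
        int t = omp_get_thread_num();
        memcpy(work_in[t], signals[ch], fft_len * sizeof(double));
        /* New-array execute: safe to call concurrently on distinct arrays. */
        fftw_execute_dft_r2c(plan, work_in[t], work_out[t]);
        /* ... dot product, normalisation, etc. for this channel ... */
    }

    fftw_destroy_plan(plan);
    for (int t = 0; t < n_threads; t++) {
        fftw_free(work_in[t]);
        fftw_free(work_out[t]);
    }
    free(work_in);
    free(work_out);
}
```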
After making these improvements we noticed a significant proportion of time was spent doing some post-processing in Python on the data passed back from C (line profiler was useful here).
- To improve performance we moved this post-processing into C (which is faster than Python) and into the same loop as the computation (fewer passes over the data improves efficiency)
- Reducing the data in place, rather than passing it all back to Python and reducing there, also lowered memory usage
- After moving the reduction into C we had to use OpenMP atomic operations to prevent multiple threads from updating the same array element at the same time (see the sketch below)
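A minimal sketch of that reduction pattern; the array names here are hypothetical, not the EQcorrscan variable names:

```c
/* Reduce per-channel correlations into a shared sum array inside the
 * parallel channel loop. Different channels may write to the same output
 * index, so the update is protected with an OpenMP atomic. */
#pragma omp parallel for
for (int ch = 0; ch < n_channels; ch++) {
    for (int i = 0; i < n_samples; i++) {
        double value = channel_ccc[ch][i];   /* per-channel correlation */
        #pragma omp atomic
        total_ccc[i] += value;               /* shared reduction target */
    }
}
```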
Version | Run time (s) | Peak memory (GB) |
---|---|---|
Multiprocessing | 89.7 ± 0.1 | 135.6 |
OpenMP | 44.7 ± 0.1 | 102.3 |
The table above shows results for the test dataset with 100 templates (out of 1990), running on 9 threads (the number of channels). The OpenMP version is twice as fast and uses less memory. However, we are still limited to relatively small numbers of templates by the high memory requirements.
At the time of writing, this version is included in the latest release and is set as the default method for computing cross correlations.
Further details, including code changes and more results, can be found in the pull request.
After implementing these initial changes, the precision of the calculations was changed from double to single precision (outside the work done here). This gave a good boost in performance and a reduction in memory usage (see the table below).
Version | Run time (s) | Peak memory (GB) |
---|---|---|
Multiprocessing | 89.7 ± 0.1 | 135.6 |
OpenMP (double) | 44.7 ± 0.1 | 102.3 |
OpenMP (float) | 30.7 ± 0.1 | 53.1 |
The next stage of the project introduced parallelisation within the channel loop. The aim here was to improve performance without the cost of increasing memory usage.
First we identified the sections of the code that were taking the most time; these were the FFTs and the dot product and normalisation loops.
FFTW supports multi-threaded FFT execution (see here). This was enabled by adding just a couple of extra lines to the code.
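For reference, enabling FFTW's threading looks roughly like this (the thread-count variable is a placeholder):

```c
/* Enable FFTW's built-in threading before creating any plans. */
fftw_init_threads();
fftw_plan_with_nthreads(n_fft_threads);  /* threads used per FFT */
/* ... create plans and execute as before ... */
```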
A further improvement was made by switching to use MKL instead of FFTW. This required no changes to the source code as MKL comes with an FFTW-compatible interface. In all cases that we tested, MKL performed better than FFTW.
The dot product loop was trivial to parallelise using OpenMP.
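As an illustration (placeholder names, and assuming the dot product is the frequency-domain multiply of the channel and template spectra stored as `fftw_complex`):

```c
/* Multiply the channel spectrum by the complex conjugate of the template
 * spectrum (the frequency-domain "dot product" used for cross correlation).
 * Each output bin is independent, so the loop parallelises trivially. */
#pragma omp parallel for
for (int i = 0; i < n_bins; i++) {
    out[i][0] = template_fft[i][0] * signal_fft[i][0]
              + template_fft[i][1] * signal_fft[i][1];
    out[i][1] = template_fft[i][0] * signal_fft[i][1]
              - template_fft[i][1] * signal_fft[i][0];
}
```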
To parallelise the normalisation loop we had to split it into two loops:
- Compute `mean` and `var` values for each iteration and store them in memory. This loop is run in serial.
- Perform the normalisation, using the values computed in the previous step. This loop is parallelised with OpenMP (see the sketch below).
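A rough sketch of that two-loop split, using a generic normalised cross-correlation formula and placeholder names rather than the actual EQcorrscan code:

```c
#include <math.h>
#include <stdlib.h>

/* Two-pass normalisation of raw cross-correlation sums into normalised
 * cross-correlation coefficients. */
void normalise(double *ccc, const double *signal, int n_windows,
               int template_len, double template_sum, double template_norm)
{
    double *mean = malloc(n_windows * sizeof(double));
    double *var = malloc(n_windows * sizeof(double));

    /* Loop 1 (serial): mean and variance of every sliding window of the
     * continuous data, stored for the second pass. */
    for (int k = 0; k < n_windows; k++) {
        double sum = 0.0, sum_sq = 0.0;
        for (int j = 0; j < template_len; j++) {
            sum += signal[k + j];
            sum_sq += signal[k + j] * signal[k + j];
        }
        mean[k] = sum / template_len;
        var[k] = sum_sq / template_len - mean[k] * mean[k];
    }

    /* Loop 2 (parallelised with OpenMP): iterations are independent now that
     * the window statistics have been precomputed. */
    #pragma omp parallel for
    for (int k = 0; k < n_windows; k++) {
        ccc[k] = (ccc[k] - template_sum * mean[k])
                 / (template_norm * sqrt(var[k] * template_len));
    }

    free(mean);
    free(var);
}
```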
- Performance of inner and outer threading is very similar for the same total number of cores (especially as the number of templates increases)
- More cores can be used with inner-only threading, so absolute performance is better
- Memory usage is significantly lower with inner-only threading and does not increase with core count, so more templates can be run at once
Version | Cores | Run time (s) | Peak memory (GB) |
---|---|---|---|
Multiprocessing | 9 | 89.7 ± 0.1 | 135.6 |
OpenMP (double) | 9 | 44.7 ± 0.1 | 102.3 |
OpenMP (float) | 9 | 30.7 ± 0.1 | 53.1 |
OpenMP (MKL) | 9 | 22.3 ± 0.1 | 53.3 |
OpenMP (inner + MKL) | 24 | 17.3 ± 0.3 | 9.3 |
- MKL gives around a 1.4x boost for 100 templates and 9 outer threads
- Overall speedup after inner threading is ~5.2x
- Memory usage is around 7% of original
The changes have been accepted into the develop branch and will be going into the next release.
Further details can be found in the pull request.
- Inner threading set as the default (best for memory and either best or close to best for performance)
- Much lower memory usage that doesn't scale with core count, allowing more templates and higher numbers of cores to be used
- Performance is good - generally better than outer threading, especially when more cores are used
- A combination of inner and outer threading may be best for performance, if memory is not a concern
- The number of cores can be passed as an argument to the Python method