Updated Cross correlation C routines for 0.3.0
- Project to implement OpenMP parallelisation for computing cross correlations in EQcorrscan
- Limited by memory (would like to run more templates at once)
- Outer loop over channels
- Within the channel loop there are FFTs and other loops (dot product, normalisation)
- Introduce OpenMP at the outer level (equivalent to current Python multiprocessing parallelisation)
- Introduce OpenMP/threading at the inner level
The first step was to add OpenMP to the outer loop (over channels). Some key points:
- Creating FFTW plans is not thread safe - they must be created before the OpenMP loop and can then be shared across all threads
- The number of threads cannot be more than the number of channels
- We have to allocate an FFT workspace for each thread, so memory increases with the number of threads (see the sketch below)
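The outer-loop structure looks roughly like the following. This is a minimal sketch, not the actual EQcorrscan source: the function, array names, and sizes are placeholders.

```c
#include <fftw3.h>
#include <omp.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the outer (per-channel) loop. The FFTW plan is created once,
 * before the parallel region, because plan creation is not thread safe;
 * executing a shared plan on distinct per-thread arrays is safe. */
void correlate_all_channels(double **signals, int n_channels, int fft_len,
                            int n_threads)
{
    if (n_threads > n_channels)
        n_threads = n_channels;  /* no benefit from more threads than channels */

    /* One FFT workspace per thread: memory grows with the thread count. */
    double **work_in = malloc(n_threads * sizeof(double *));
    fftw_complex **work_out = malloc(n_threads * sizeof(fftw_complex *));
    for (int t = 0; t < n_threads; t++) {
        work_in[t] = fftw_alloc_real(fft_len);
        work_out[t] = fftw_alloc_complex(fft_len / 2 + 1);
    }

    /* Create the plan once; it is shared (read-only) by all threads. */
    fftw_plan plan = fftw_plan_dft_r2c_1d(fft_len, work_in[0], work_out[0],
                                          FFTW_ESTIMATE);

    #pragma omp parallel for num_threads(n_threads)
    for (int ch = 0; ch < n_channels; ch++) {
        int t = omp_get_thread_num();
        memcpy(work_in[t], signals[ch], fft_len * sizeof(double));
        /* New-array execute: safe to call concurrently on distinct arrays. */
        fftw_execute_dft_r2c(plan, work_in[t], work_out[t]);
        /* ... dot product, normalisation, etc. for this channel ... */
    }

    fftw_destroy_plan(plan);
    for (int t = 0; t < n_threads; t++) {
        fftw_free(work_in[t]);
        fftw_free(work_out[t]);
    }
    free(work_in);
    free(work_out);
}
```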
After making these improvements we noticed a significant proportion of time was spent doing some post-processing in Python on the data passed back from C (line profiler was useful here).
- To improve performance we moved this post-processing into C (which is faster than Python) and into the same loop as the computation (fewer passes over the data improves efficiency)
- Reducing the data in place, rather than passing it all back to Python and reducing there, also lowered memory usage
- After moving the reduction into C we had to use OpenMP atomic operations to prevent multiple threads from updating the same array element at the same time (see the sketch below)
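A minimal sketch of that reduction pattern; the array names here are hypothetical, not the EQcorrscan variable names:

```c
/* Reduce per-channel correlations into a shared sum array inside the
 * parallel channel loop. Different channels may write to the same output
 * index, so the update is protected with an OpenMP atomic. */
#pragma omp parallel for
for (int ch = 0; ch < n_channels; ch++) {
    for (int i = 0; i < n_samples; i++) {
        double value = channel_ccc[ch][i];   /* per-channel correlation */
        #pragma omp atomic
        total_ccc[i] += value;               /* shared reduction target */
    }
}
```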
Version | Run time (s) | Peak memory (GB) |
---|---|---|
Multiprocessing | 89.7 ± 0.1 | 135.6 |
OpenMP | 44.7 ± 0.1 | 102.3 |
The table above shows results for the test dataset with 100 templates (out of 1990), running on 9 threads (the number of channels). The OpenMP version is twice as fast and uses less memory. However, we are still limited to relatively small numbers of templates by the high memory requirements.
At the time of writing, this version is included in the latest release and is set as the default method for computing cross correlations.
Further details, including code changes and more results, can be found in the pull request.
After implementing these initial changes, the precision of the calculations was changed from double to single precision (outside the work done here). This gave a good boost in performance and a reduction in memory usage (see the table below).
Version | Run time (s) | Peak memory (GB) |
---|---|---|
Multiprocessing | 89.7 ± 0.1 | 135.6 |
OpenMP (double) | 44.7 ± 0.1 | 102.3 |
OpenMP (float) | 30.7 ± 0.1 | 53.1 |
The next stage of the project introduced parallelisation within the channel loop. The aim here was to improve performance without the cost of increasing memory usage.
First we identified the sections of the code that were taking the most time; these were the FFTs and the dot product and normalisation loops.
FFTW supports multi-threaded FFT execution (see here). This was enabled by adding just a couple of extra lines to the code.
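For reference, enabling FFTW's threading looks roughly like this (the thread-count variable is a placeholder):

```c
/* Enable FFTW's built-in threading before creating any plans. */
fftw_init_threads();
fftw_plan_with_nthreads(n_fft_threads);  /* threads used per FFT */
/* ... create plans and execute as before ... */
```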
A further improvement was made by switching to use MKL instead of FFTW. This required no changes to the source code as MKL comes with an FFTW-compatible interface. In all cases that we tested, MKL performed better than FFTW.
The dot product loop was trivial to parallelise using OpenMP.
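As an illustration (placeholder names, and assuming the dot product is the frequency-domain multiply of the channel and template spectra stored as `fftw_complex`):

```c
/* Multiply the channel spectrum by the complex conjugate of the template
 * spectrum (the frequency-domain "dot product" used for cross correlation).
 * Each output bin is independent, so the loop parallelises trivially. */
#pragma omp parallel for
for (int i = 0; i < n_bins; i++) {
    out[i][0] = template_fft[i][0] * signal_fft[i][0]
              + template_fft[i][1] * signal_fft[i][1];
    out[i][1] = template_fft[i][0] * signal_fft[i][1]
              - template_fft[i][1] * signal_fft[i][0];
}
```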
To parallelise the normalisation loop we had to split it into two loops:
- Compute `mean` and `var` values for each iteration and store them in memory. This loop is run in serial.
- Perform the normalisation, using the values computed in the previous step. This loop is parallelised with OpenMP (see the sketch below).
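A rough sketch of that two-loop split, using a generic normalised cross-correlation formula and placeholder names rather than the actual EQcorrscan code:

```c
#include <math.h>
#include <stdlib.h>

/* Two-pass normalisation of raw cross-correlation sums into normalised
 * cross-correlation coefficients. */
void normalise(double *ccc, const double *signal, int n_windows,
               int template_len, double template_sum, double template_norm)
{
    double *mean = malloc(n_windows * sizeof(double));
    double *var = malloc(n_windows * sizeof(double));

    /* Loop 1 (serial): mean and variance of every sliding window of the
     * continuous data, stored for the second pass. */
    for (int k = 0; k < n_windows; k++) {
        double sum = 0.0, sum_sq = 0.0;
        for (int j = 0; j < template_len; j++) {
            sum += signal[k + j];
            sum_sq += signal[k + j] * signal[k + j];
        }
        mean[k] = sum / template_len;
        var[k] = sum_sq / template_len - mean[k] * mean[k];
    }

    /* Loop 2 (parallelised with OpenMP): iterations are independent now that
     * the window statistics have been precomputed. */
    #pragma omp parallel for
    for (int k = 0; k < n_windows; k++) {
        ccc[k] = (ccc[k] - template_sum * mean[k])
                 / (template_norm * sqrt(var[k] * template_len));
    }

    free(mean);
    free(var);
}
```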
- Performance of inner and outer threading is very similar for the same total number of cores (especially as the number of templates increases)
- More cores can be used with inner-only threading, so absolute performance is better
- Memory usage is significantly lower with inner-only threading and does not increase with core count, so more templates can be run at once
Version | Cores | Run time (s) | Peak memory (GB) |
---|---|---|---|
Multiprocessing | 9 | 89.7 ± 0.1 | 135.6 |
OpenMP (double) | 9 | 44.7 ± 0.1 | 102.3 |
OpenMP (float) | 9 | 30.7 ± 0.1 | 53.1 |
OpenMP (MKL) | 9 | 22.3 ± 0.1 | 53.3 |
OpenMP (inner + MKL) | 24 | 17.3 ± 0.3 | 9.3 |
- MKL gives around a 1.4x boost for 100 templates and 9 outer threads
- Overall speedup after inner threading is ~5.2x
- Memory usage is around 7% of original
The changes have been accepted into the develop branch and will be going into the next release.
Further details can be found in the pull request.
- Inner threading set as the default (best for memory and either best or close to best for performance)
- Much lower memory usage that doesn't scale with core count, allowing more templates and higher numbers of cores to be used
- Performance is good - generally better than outer threading, especially when more cores are used
- A combination of inner and outer threading may be best for performance, if memory is not a concern
- The number of cores can be passed as an argument to the Python method