
Parallelism for image stacks #21

Open
IanButterworth opened this issue Oct 1, 2019 · 12 comments
Comments

@IanButterworth
Member

IanButterworth commented Oct 1, 2019

Firstly, thanks again for this package. I'm getting really nice results with it.

Given that I'm registering an image stack relative to its first image, I was wondering whether there are any easy opportunities for further parallelization? I say further because I've seen that qd_translate() does seem to exceed a single thread, which is great, but it only reaches ~250% on my 6-core CPU, which restricts the opportunity for per-image for-loop threading. Also, might there be any easy way to use CuArrays & GPUs?

For instance, the simplest per-image thread parallelization I could imagine would be something like this. Note that if tforms has already been calculated for the previous frame, it's used for initial_tfm, otherwise the default pre-populated (0.0, 0.0) is used (that's the idea at least):

# obviously you wouldn't actually try to register random noise..
imgstack = map(x -> rand(100, 100), 1:100)

# pre-populate with identity translations (map over a range, not an Int)
tforms = map(x -> RegisterQD.Translation(0.0, 0.0), 1:length(imgstack))
mxshift = (10, 10)

Threads.@threads for i in 1:length(imgstack)
    tforms[i], mm = qd_translate(
            imgstack[1],
            imgstack[i],
            mxshift;
            maxevals = 1000,
            crop = true,
            print_interval = typemax(Int),
            # fall back to the pre-populated identity for the first frame
            initial_tfm = i == 1 ? tforms[i] : tforms[i - 1],
        )
end
# Threads.@threads blocks until all iterations finish, so no wait() is needed
@timholy
Member

timholy commented Oct 1, 2019

I think you've gotten the main route to parallelization. We do the same thing internally.

The 250% you see is really due to FFTW's own threading. With the new threading architecture in Julia 1.3 and the updates to FFTW.jl to use it, the two strategies should play together nicely.

@timholy
Member

timholy commented Oct 1, 2019

Actually, one thing you can do is profile to find out how much of the time is spent in best_shift vs _analyze for your images. If it's all best_shift then FFTW is presumably accounting for most of the computational time.

@IanButterworth
Member Author

From the 2nd row up, the left side is _analyze and the right is best_shift, so it seems FFTW isn't the main expense here.

[Screenshot: profile flame graph, Oct 1 2019]

I employed the approach above with nthreads=6 and see about a 2x speed-up, which kind of makes sense given that I have 6 cores and a normal for loop ran at ~250%.

One thing I was initially hesitant about was that I hadn't appreciated how Threads.@threads orders iterations. On first impression it looked like the iterations were assigned randomly, which would have made the majority of iterations use (0.0, 0.0) for initial_tfm, but reading the source for @threads I realized each thread gets a contiguous, in-order chunk. So with nthreads=6 only 6 frames miss the initial_tfm boost, one of which is the first frame anyway, which I think is pretty good!
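That chunking behavior is easy to check directly. A minimal sketch; on recent Julia versions the default schedule has changed, so `:static` is used here to make the contiguous per-thread chunking explicit:

```julia
using Base.Threads

ids = zeros(Int, 12)
@threads :static for i in 1:12
    ids[i] = threadid()  # record which thread handled each iteration
end
# With static scheduling, thread k handles the k-th contiguous chunk of
# indices, so `ids` is non-decreasing, e.g. [1,1,1,2,2,2,...] with 4 threads.
```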

@timholy
Member

timholy commented Oct 1, 2019

Wow, that's interesting. The _analyze part is only responsible for sub-pixel registration and is capped at a single-pixel shift, so it's not exactly the most important part of the whole process.

upper = fill(1.0, ndims(fixed))
lower = -upper
root, x0 = _analyze(f, lower, upper; minwidth=minwidth, print_interval=100, kwargs...)

You could try passing in minwidth (default value = 0.01, i.e., 1/100th of a pixel) if you don't need such high precision.

@IanButterworth
Member Author

I tried minwidth = 0.1, but I think the main issue is that I'm not setting rtol/atol or fvalue, so it runs for the full 1000 evals each and every time.

I'm trying to write something fairly generic, so can't predict the global minimum very well. I think I might just have to keep it set to a fixed number of evals, unless there's some other way to tell if it's converged on a solution?

@timholy
Member

timholy commented Oct 1, 2019

Yep, capping the number of evaluations seems likely to be the best strategy. If you're in 2d and only care about 0.1-pixel accuracy, then maxevals = (1/0.1)^2 = 100 should let it try each option.

@IanButterworth
Member Author

Ah, that makes sense!

@timholy
Member

timholy commented Oct 1, 2019

You could also perhaps pass nquasinewton to prevent QuadDIRECT from switching to "exploit" mode quite so early, forcing it to do more "explore."

https://github.com/timholy/QuadDIRECT.jl/blob/fad16ea9c9fa4c41923df6c068ad585310f314c8/src/algorithm.jl#L529

@IanButterworth
Member Author

IanButterworth commented Oct 1, 2019

Edit: Erroneous results, check next comment

Oh.. this parallel approach is much MUCH faster if I set FFTW.set_num_threads(1).

  • 2m30s before parallel and with FFTW.set_num_threads(4)
  • 1m00s with parallel and with FFTW.set_num_threads(4)
  • 0m16s with parallel and with FFTW.set_num_threads(1)

That's with:

  • nquasinewton = ((N+1)*(N+2)÷2) + 4 (where N=ndims(fixed), so 4(?) more than normal)
  • minwidth = fill(0.1, ndims(fixed))
  • maxevals = (1/0.1)^2
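Pulled together, that setup would look roughly like this. A sketch assuming 2-d images; `FFTW.set_num_threads` is FFTW.jl's real API, shown commented out so the parameter arithmetic stands alone:

```julia
# using FFTW; FFTW.set_num_threads(1)     # keep FFTW single-threaded under @threads

N = 2                                       # ndims(fixed) for 2-d images
nquasinewton = ((N + 1) * (N + 2) ÷ 2) + 4  # 6 + 4 = 10 for N = 2
minwidth = fill(0.1, N)                     # 1/10th-pixel precision per axis
maxevals = (1 / 0.1)^N                      # 100.0 evaluations
```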

@IanButterworth
Member Author

@timholy I've just gone back and re-tested more systematically, and I think I was wrapping two changes into one. The results were a bit exaggerated!

These should be more representative.

Video 1

  • 1m45s before parallel and with FFTW.set_num_threads(4)
  • 0m29s with parallel and with FFTW.set_num_threads(4)
  • 0m27s with parallel and with FFTW.set_num_threads(1)

Video 2

  • 1m19s before parallel and with FFTW.set_num_threads(4)
  • 0m32s with parallel and with FFTW.set_num_threads(4)
  • 0m18s with parallel and with FFTW.set_num_threads(1)

@timholy
Member

timholy commented Oct 2, 2019

I was surprised at the magnitude of the effect; this makes more sense!

We are contemplating whether any of the parameter changes you mention in #21 (comment) deserve to be made official. The maxevals change, perhaps to 10^ndims(img), is the one I'm most tempted by.

@IanButterworth
Member Author

Indeed. The logic seems good, but I guess the default will need a tweak so it doesn't break when minwidth is anisotropic? There may be a more elegant way than this, but something like:

maxevals = *(inv.(minwidth)...)

which would handle cases where minwidth is anisotropic
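For example (the splatted `*` is just a product over the reciprocals, equivalent to `prod(inv.(minwidth))`):

```julia
minwidth = (0.1, 0.2)            # anisotropic: coarser precision along the second axis
maxevals = *(inv.(minwidth)...)  # 10.0 * 5.0 = 50.0
```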
