
Parallelism for image stacks #21

Open
IanButterworth opened this issue Oct 1, 2019 · 12 comments
Comments

@IanButterworth
Member

IanButterworth commented Oct 1, 2019

Firstly, thanks again for this package. I'm getting really nice results with it.

Given that I'm registering an image stack relative to its first image, I was wondering whether there are any easy opportunities for further parallelization? I say further because I've seen that qd_translate() does seem to exceed a single thread, which is great, but it only reaches ~250% on my 6-core CPU, which restricts the opportunity for per-image for-loop threading. Also, might there be any easy way to use CuArrays & GPUs?

For instance, the simplest per-image thread parallelization I could imagine would be something like this. Note that if tforms has already been calculated for the previous frame, it's used for initial_tfm, otherwise the default pre-populated (0.0, 0.0) is used (that's the idea at least):

# obviously you wouldn't actually try to register random noise..
imgstack = map(x -> rand(100, 100), 1:100)

# pre-populate with identity translations (map over a range, not an Int)
tforms = map(x -> RegisterQD.Translation(0.0, 0.0), 1:length(imgstack))
mxshift = (10, 10)

Threads.@threads for i in 1:length(imgstack)
    tforms[i], mm = qd_translate(
            imgstack[1],
            imgstack[i],
            mxshift;
            maxevals = 1000,
            crop = true,
            print_interval = typemax(Int),
            # fall back to the pre-populated identity for the first frame
            initial_tfm = i == 1 ? tforms[i] : tforms[i - 1],
        )
end
# Threads.@threads blocks until all iterations finish, so no wait() is needed
@timholy
Member

timholy commented Oct 1, 2019

I think you've gotten the main route to parallelization. We do the same thing internally.

The 250% you see is really due to FFTW's own threading. With the new threading architecture in Julia 1.3 and the updates to FFTW.jl to use it, the two strategies should play together nicely.

@timholy
Member

timholy commented Oct 1, 2019

Actually, one thing you can do is profile to find out how much of the time is spent in best_shift vs _analyze for your images. If it's all best_shift then FFTW is presumably accounting for most of the computational time.

@IanButterworth
Member Author

From the 2nd row up, the left side is _analyze and the right is best_shift, so it seems FFTW isn't the main expense here.

[Screenshot: profile flame graph, Oct 1 2019]

I employed the approach above with nthreads=6 and see about a 2x speed-up, which kind of makes sense given that I have 6 cores and a normal for loop ran at ~250%.

One thing I was initially hesitant about was that I hadn't appreciated how Threads.@threads orders iterations. On first impression it looked like the iterations were assigned randomly, which would have made the majority of iterations use (0.0, 0.0) for initial_tfm, but reading the source for @threads I realized each thread gets a contiguous, in-order chunk. So with nthreads=6 only 6 frames miss the initial_tfm boost, one of which is the first frame anyway, which I think is pretty good!
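That chunking behavior is easy to check directly. A minimal sketch; on recent Julia versions the default schedule has changed, so `:static` is used here to make the contiguous per-thread chunking explicit:

```julia
using Base.Threads

ids = zeros(Int, 12)
@threads :static for i in 1:12
    ids[i] = threadid()  # record which thread handled each iteration
end
# With static scheduling, thread k handles the k-th contiguous chunk of
# indices, so `ids` is non-decreasing, e.g. [1,1,1,2,2,2,...] with 4 threads.
```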

@timholy
Member

timholy commented Oct 1, 2019

Wow, that's interesting. The _analyze part is only responsible for sub-pixel registration and is capped at a single-pixel shift, so it's not exactly the most important part of the whole process.

upper = fill(1.0, ndims(fixed))
lower = -upper
root, x0 = _analyze(f, lower, upper; minwidth=minwidth, print_interval=100, kwargs...)

You could try passing in minwidth (default value = 0.01, i.e., 1/100th of a pixel) if you don't need such high precision.

@IanButterworth
Member Author

I tried minwidth = 0.1, but I think the main issue is that I'm not setting rtol/atol or fvalue, so it runs for the full 1000 evals each and every time.

I'm trying to write something fairly generic, so can't predict the global minimum very well. I think I might just have to keep it set to a fixed number of evals, unless there's some other way to tell if it's converged on a solution?

@timholy
Member

timholy commented Oct 1, 2019

Yep, capping the number of evaluations seems likely to be the best strategy. If you're in 2d and only care about 0.1-pixel accuracy, then maxevals = (1/0.1)^2 = 100 should let it try each option.

@IanButterworth
Member Author

Ah, that makes sense!

@timholy
Member

timholy commented Oct 1, 2019

You could also perhaps pass nquasinewton to prevent QuadDIRECT from switching to "exploit" mode quite so early, forcing it to do more "explore."

https://github.com/timholy/QuadDIRECT.jl/blob/fad16ea9c9fa4c41923df6c068ad585310f314c8/src/algorithm.jl#L529

@IanButterworth
Member Author

IanButterworth commented Oct 1, 2019

Edit: Erroneous results, check next comment

Oh.. this parallel approach is much MUCH faster if I set FFTW.set_num_threads(1).

  • 2m30s before parallel and with FFTW.set_num_threads(4)
  • 1m00s with parallel and with FFTW.set_num_threads(4)
  • 0m16s with parallel and with FFTW.set_num_threads(1)

That's with:

  • nquasinewton = ((N+1)*(N+2)÷2) + 4 (where N=ndims(fixed), so 4(?) more than normal)
  • minwidth = fill(0.1, ndims(fixed))
  • maxevals = (1/0.1)^2
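Pulled together, that setup would look roughly like this. A sketch assuming 2-d images; `FFTW.set_num_threads` is FFTW.jl's real API, shown commented out so the parameter arithmetic stands alone:

```julia
# using FFTW; FFTW.set_num_threads(1)     # keep FFTW single-threaded under @threads

N = 2                                       # ndims(fixed) for 2-d images
nquasinewton = ((N + 1) * (N + 2) ÷ 2) + 4  # 6 + 4 = 10 for N = 2
minwidth = fill(0.1, N)                     # 1/10th-pixel precision per axis
maxevals = (1 / 0.1)^N                      # 100.0 evaluations
```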

@IanButterworth
Member Author

@timholy I've just gone back and re-tested more systematically, and I think I was wrapping two changes into one. The results were a bit exaggerated!

These should be more representative.

Video 1

  • 1m45s before parallel and with FFTW.set_num_threads(4)
  • 0m29s with parallel and with FFTW.set_num_threads(4)
  • 0m27s with parallel and with FFTW.set_num_threads(1)

Video 2

  • 1m19s before parallel and with FFTW.set_num_threads(4)
  • 0m32s with parallel and with FFTW.set_num_threads(4)
  • 0m18s with parallel and with FFTW.set_num_threads(1)

@timholy
Member

timholy commented Oct 2, 2019

I was surprised at the magnitude of the effect; this makes more sense!

We are contemplating whether any of the parameter changes you mention in #21 (comment) deserve to be made official. The maxevals change, perhaps to 10^ndims(img), is the one I'm most tempted by.

@IanButterworth
Member Author

Indeed. The logic seems good, but I guess the default will need a tweak so it doesn't break when minwidth is anisotropic? There may be a more elegant way than this, but something like:

maxevals = *(inv.(minwidth)...)

which would handle cases where minwidth is anisotropic
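For example (the splatted `*` is just a product over the reciprocals, equivalent to `prod(inv.(minwidth))`):

```julia
minwidth = (0.1, 0.2)            # anisotropic: coarser precision along the second axis
maxevals = *(inv.(minwidth)...)  # 10.0 * 5.0 = 50.0
```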
