
Non-deterministic GPU results #113

Open
marius311 opened this issue Mar 8, 2023 · 9 comments

Comments

@marius311
Contributor

I've got a case where subsequent (adjoint) NFFTs on GPU are giving different results. I don't have an MWE yet, but just to show that it happens:

[Screenshot from 2023-03-07: repeated adjoint NFFT calls returning slightly different values]

Is this something one could expect to happen in any scenario? Perhaps I'm doing something wrong? Any hints based on the little info I provided might help me track down an MWE. In any case, I'm trying to put one together...

@tknopp
Member

tknopp commented Mar 8, 2023

That certainly is not supposed to happen. A minimal working example running in the REPL would be helpful. Furthermore, you could check whether the input vector v has changed.
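
A quick sketch of that check (v and plan here stand for whatever is being passed in; not code from this thread):

v0 = copy(v)       # snapshot the input before the transform
adjoint(plan) * v  # run the adjoint NFFT
v == v0            # true if the adjoint left its input unmodified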

@marius311
Contributor Author

Ok, here's a reproducer:

julia> using CUDA, NFFT, CuNFFT

julia> Ny, Nx = 1024, 2048
(1024, 2048)

julia> x = CUDA.randn(Ny, Nx);

julia> knots = CUDA.rand(2, Ny*Nx) .- 0.5f0;

julia> plan = NFFT.plan_nfft(CuArray{Float32}, knots, size(x));

julia> CUDA.@allowscalar [(adjoint(plan) * complex(x[:]))[1] for i=1:10]
10-element Vector{ComplexF32}:
 -1568.3418f0 - 972.4891f0im
 -1568.3413f0 - 972.489f0im
 -1568.3414f0 - 972.48883f0im
 -1568.3412f0 - 972.4891f0im
 -1568.3416f0 - 972.4893f0im
 -1568.3418f0 - 972.4893f0im
 -1568.3416f0 - 972.4891f0im
 -1568.3413f0 - 972.48944f0im
 -1568.3413f0 - 972.4892f0im
 -1568.3412f0 - 972.4895f0im

Not quite as numerically drastic as my real case, but it's there. I don't see anything like this for FFTs.
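
For comparison, a minimal sketch of the analogous FFT check (assuming fft comes into scope through CUDA.jl's CUFFT integration; sizes as above):

using CUDA.CUFFT  # AbstractFFTs interface for CuArrays

y = complex.(CUDA.randn(1024, 2048));
fft(y) == fft(y)  # comes back true, consistent with the observation above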

  [052768ef] CUDA v3.13.1
  [a9291f20] CuNFFT v0.3.5
  [efe261a4] NFFT v0.13.3

julia> CUDA.versioninfo()
CUDA toolkit 11.4, artifact installation
NVIDIA driver 515.65.1, for CUDA 11.7
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.6.5
- CURAND: 10.2.5
- CUFFT: 10.5.2
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+515.65.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

Environment:
- JULIA_CUDA_VERSION: 11.4
- JULIA_CUDA_USE_BINARYBUILDER: true

1 device:
  0: NVIDIA A100-PCIE-40GB (sm_80, 20.063 GiB / 40.000 GiB available)

@marius311
Contributor Author

Input vector does not appear to change.

@marius311
Contributor Author

Following up on that MWE, it looks like it's the sparse matrix product?

v = complex(vec(x))
tmpVec1, tmpVec2 = similar(plan.tmpVec), similar(plan.tmpVec)
NFFT.convolve_transpose!(plan, v, tmpVec1)
NFFT.convolve_transpose!(plan, v, tmpVec2)
tmpVec1 == tmpVec2 # false

@tknopp
Member

tknopp commented Mar 8, 2023

Well, that's just

function AbstractNFFTs.convolve_transpose!(p::CuNFFTPlan{T,D}, fHat::CuArray{U}, g::CuArray{Complex{T},D}) where {D,T,U}
  mul!(vec(g), p.B, fHat)
  return
end

see https://github.com/JuliaMath/NFFT.jl/blob/master/CuNFFT/src/implementation.jl#L72.

So probably that issue also arises without NFFT.jl, just by using CUDA? In that case it would make sense to generate a similar MWE and discuss/report this with the CUDA devs.

@marius311
Contributor Author

marius311 commented Mar 8, 2023

Yeah, I pinged on Slack. I think one possibility where it's still an NFFT issue is if that matrix somehow accidentally has repeated entries? I think that's technically representable by that data structure. I'm not sure what the expected behavior is in that case, or even how to check.
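
One way such a check could look (a sketch only: it assumes plan.B converts back to a host SparseMatrixCSC, and has_duplicate_entries is a name made up here, not NFFT.jl API):

using SparseArrays

function has_duplicate_entries(B::SparseMatrixCSC)
    rows = rowvals(B)
    for j in 1:size(B, 2)
        r = nzrange(B, j)
        # in a well-formed CSC column the row indices are sorted, so a
        # duplicated (row, col) entry shows up as two equal neighbors
        for k in first(r):(last(r) - 1)
            rows[k] == rows[k + 1] && return true
        end
    end
    return false
end

has_duplicate_entries(SparseMatrixCSC(plan.B))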

@tknopp
Member

tknopp commented Mar 8, 2023

Ah yes, you are right. Indeed we might need to have a look at the sparse matrix construction.

@marius311
Contributor Author

marius311 commented Mar 8, 2023

No, I think it basically "is" a CUDA issue:

using CUDA, SparseArrays

B = CUDA.CUSPARSE.CuSparseMatrixCSC(sprand(Float32, 8388608, 2097152, 1e-5))
v = CUDA.randn(2097152);
w = CUDA.randn(8388608);
B * v == B * v   # false
B' * w == B' * w # true

Although if you make it CSR, the true/false above switch. Which I guess opens up an option, if NFFT wanted to "fix" this problem and also get a speed boost: probably behind an option, store the matrix in both CSR and CSC, and use the appropriate one for the forward vs. adjoint NFFT (sketched below). I don't know if this is something you want to do, but I figured I'd mention it. They're certainly not huge numerical errors (although unfortunately in my real case they propagate to some significant downstream errors, which maybe is a separate thing I should figure out).
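
A minimal sketch of that dual-format idea (illustrative only: DualSparse, forward_mul!, and adjoint_mul! are names invented here, not NFFT.jl API; which layout is deterministic in which direction just mirrors the observation above, presumably because the order of atomic floating-point accumulation differs between the two layouts):

using CUDA, SparseArrays, LinearAlgebra
using CUDA.CUSPARSE: CuSparseMatrixCSC, CuSparseMatrixCSR

# Keep the convolution matrix in both layouts and pick the one that
# was observed to be deterministic for each multiplication direction.
struct DualSparse{T}
    csr::CuSparseMatrixCSR{T}  # deterministic for B * v   (forward)
    csc::CuSparseMatrixCSC{T}  # deterministic for B' * w  (adjoint)
end

DualSparse(B::SparseMatrixCSC{T}) where {T} =
    DualSparse{T}(CuSparseMatrixCSR(B), CuSparseMatrixCSC(B))

forward_mul!(y, D::DualSparse, v) = mul!(y, D.csr, v)
adjoint_mul!(y, D::DualSparse, w) = mul!(y, D.csc', w)

The obvious cost is storing the matrix twice, which is why making it opt-in via a flag seems reasonable.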

@tknopp
Member

tknopp commented Mar 8, 2023

Ok, this is getting interesting. Storing both CSR and CSC seems very clever; I had not thought about that before. We should certainly provide a flag that allows controlling it, and I would be open to integrating that. Please go ahead if you want to give it a go. I will not get to it in the near future.
