
Non-deterministic GPU results #113

Open
marius311 opened this issue Mar 8, 2023 · 9 comments

Comments

@marius311
Contributor

I've got a case where subsequent (adjoint) NFFTs on GPU are giving different results. I don't have an MWE yet, but just to show that it happens:

[Screenshot from 2023-03-07: repeated adjoint NFFT calls returning slightly different values]

Is this something one could expect to happen in any scenario? Perhaps I'm doing something wrong? Any hints based on the little info I provided might help me track down an MWE. In any case, I'm trying to put one together...

@tknopp
Member

tknopp commented Mar 8, 2023

That certainly is not supposed to happen. A minimal working example running in the REPL would be helpful. Furthermore, you could check whether the input vector v has changed.
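
A quick sketch of that check (v and plan here stand for whatever is being passed in; not code from this thread):

v0 = copy(v)       # snapshot the input before the transform
adjoint(plan) * v  # run the adjoint NFFT
v == v0            # true if the adjoint left its input unmodified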

@marius311
Contributor Author

Ok, here's a reproducer:

julia> using CUDA, NFFT, CuNFFT

julia> Ny, Nx = 1024, 2048
(1024, 2048)

julia> x = CUDA.randn(Ny, Nx);

julia> knots = CUDA.rand(2, Ny*Nx) .- 0.5f0;

julia> plan = NFFT.plan_nfft(CuArray{Float32}, knots, size(x));

julia> CUDA.@allowscalar [(adjoint(plan) * complex(x[:]))[1] for i=1:10]
10-element Vector{ComplexF32}:
 -1568.3418f0 - 972.4891f0im
 -1568.3413f0 - 972.489f0im
 -1568.3414f0 - 972.48883f0im
 -1568.3412f0 - 972.4891f0im
 -1568.3416f0 - 972.4893f0im
 -1568.3418f0 - 972.4893f0im
 -1568.3416f0 - 972.4891f0im
 -1568.3413f0 - 972.48944f0im
 -1568.3413f0 - 972.4892f0im
 -1568.3412f0 - 972.4895f0im

Not quite as numerically drastic as my real case, but it's there. I don't see anything like this for FFTs.
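
For comparison, a minimal sketch of the analogous FFT check (assuming fft comes into scope through CUDA.jl's CUFFT integration; sizes as above):

using CUDA.CUFFT  # AbstractFFTs interface for CuArrays

y = complex.(CUDA.randn(1024, 2048));
fft(y) == fft(y)  # comes back true, consistent with the observation above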

  [052768ef] CUDA v3.13.1
  [a9291f20] CuNFFT v0.3.5
  [efe261a4] NFFT v0.13.3

julia> CUDA.versioninfo()
CUDA toolkit 11.4, artifact installation
NVIDIA driver 515.65.1, for CUDA 11.7
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.6.5
- CURAND: 10.2.5
- CUFFT: 10.5.2
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+515.65.1
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

Environment:
- JULIA_CUDA_VERSION: 11.4
- JULIA_CUDA_USE_BINARYBUILDER: true

1 device:
  0: NVIDIA A100-PCIE-40GB (sm_80, 20.063 GiB / 40.000 GiB available)

@marius311
Contributor Author

Input vector does not appear to change.

@marius311
Contributor Author

Following up on that MWE, it looks like it's the sparse matrix product?

v = complex(vec(x))
tmpVec1, tmpVec2 = similar(plan.tmpVec), similar(plan.tmpVec)
NFFT.convolve_transpose!(plan, v, tmpVec1)
NFFT.convolve_transpose!(plan, v, tmpVec2)
tmpVec1 == tmpVec2 # false

@tknopp
Member

tknopp commented Mar 8, 2023

Well, that's just

function AbstractNFFTs.convolve_transpose!(p::CuNFFTPlan{T,D}, fHat::CuArray{U}, g::CuArray{Complex{T},D}) where {D,T,U}
  mul!(vec(g), p.B, fHat)
  return
end

see https://github.com/JuliaMath/NFFT.jl/blob/master/CuNFFT/src/implementation.jl#L72.

So probably that issue also arises without NFFT.jl, just by using CUDA? In that case it would make sense to generate a similar MWE and discuss/report this with the CUDA devs.

@marius311
Contributor Author

marius311 commented Mar 8, 2023

Yeah, I pinged on Slack. I think one possibility where it's still an NFFT issue is if that matrix somehow accidentally has repeated entries? I think that's technically representable by that data structure. I'm not sure what the expected behavior is in that case, or even how to check.
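
One way such a check could look (a sketch only: it assumes plan.B converts back to a host SparseMatrixCSC, and has_duplicate_entries is a name made up here, not NFFT.jl API):

using SparseArrays

function has_duplicate_entries(B::SparseMatrixCSC)
    rows = rowvals(B)
    for j in 1:size(B, 2)
        r = nzrange(B, j)
        # in a well-formed CSC column the row indices are sorted, so a
        # duplicated (row, col) entry shows up as two equal neighbors
        for k in first(r):(last(r) - 1)
            rows[k] == rows[k + 1] && return true
        end
    end
    return false
end

has_duplicate_entries(SparseMatrixCSC(plan.B))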

@tknopp
Member

tknopp commented Mar 8, 2023

Ah yes, you are right. Indeed we might need to have a look at the sparse matrix construction.

@marius311
Contributor Author

marius311 commented Mar 8, 2023

No, I think it basically "is" a CUDA issue:

using CUDA, SparseArrays

B = CUDA.CUSPARSE.CuSparseMatrixCSC(sprand(Float32, 8388608, 2097152, 1e-5))
v = CUDA.randn(2097152);
w = CUDA.randn(8388608);
B * v == B * v   # false
B' * w == B' * w # true

Although if you make it CSR, the true/false above switch. Which I guess opens up an option, if NFFT wanted to "fix" this problem and also get a speed boost: probably behind an option, store the matrix in both CSR and CSC, and use the appropriate one for the forward vs. adjoint NFFT (sketched below). I don't know if this is something you want to do, but I figured I'd mention it. They're certainly not huge numerical errors (although unfortunately in my real case they propagate to some significant downstream errors, which maybe is a separate thing I should figure out).
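
A minimal sketch of that dual-format idea (illustrative only: DualSparse, forward_mul!, and adjoint_mul! are names invented here, not NFFT.jl API; which layout is deterministic in which direction just mirrors the observation above, presumably because the order of atomic floating-point accumulation differs between the two layouts):

using CUDA, SparseArrays, LinearAlgebra
using CUDA.CUSPARSE: CuSparseMatrixCSC, CuSparseMatrixCSR

# Keep the convolution matrix in both layouts and pick the one that
# was observed to be deterministic for each multiplication direction.
struct DualSparse{T}
    csr::CuSparseMatrixCSR{T}  # deterministic for B * v   (forward)
    csc::CuSparseMatrixCSC{T}  # deterministic for B' * w  (adjoint)
end

DualSparse(B::SparseMatrixCSC{T}) where {T} =
    DualSparse{T}(CuSparseMatrixCSR(B), CuSparseMatrixCSC(B))

forward_mul!(y, D::DualSparse, v) = mul!(y, D.csr, v)
adjoint_mul!(y, D::DualSparse, w) = mul!(y, D.csc', w)

The obvious cost is storing the matrix twice, which is why making it opt-in via a flag seems reasonable.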

@tknopp
Member

tknopp commented Mar 8, 2023

Ok, this is getting interesting. Storing both CSR and CSC seems very clever; I had not thought about that before. We should certainly provide a flag that allows controlling it, and I would be open to integrating that. Please go ahead if you want to give it a go. I will not get to it in the near future.
