[WIP] Added skeleton of batch based GPU assignment #2820

Open · wants to merge 1 commit into master

Conversation

@spectre-ns (Contributor) commented Jan 4, 2025

Checklist

DO NOT MERGE

  • The title and commit message(s) are descriptive.
  • Small commits made to fix your PR have been squashed to avoid history pollution.
  • Tests have been added for new features or bug fixes.
  • The API of new functions and classes is documented.

Description

@JohanMabille This is a skeleton for how to move simple operations to the GPU using a strategy similar to the one used with xsimd. I'm curious whether this would be an extensible strategy. I know the code doesn't compile; I have taken many shortcuts to demonstrate the concept.
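
As a rough, hypothetical sketch of what such a batch-based dispatch could look like (none of these names, such as gpu_batch, load_gpu, store_gpu or add_kernel, are taken from this PR), each batch would shadow a block of host memory on the device, and each operation would map to a kernel launch, much like an xsimd batch maps to SIMD registers and instructions:

```cpp
// Hypothetical sketch only: illustrates the xsimd-like "load / operate / store"
// pattern transposed to the GPU. Error handling and freeing of the operand
// allocations are omitted for brevity.
#include <cstddef>
#include <cuda_runtime.h>

// A "batch" that shadows a contiguous block of host memory on the device.
template <class T, std::size_t N>
struct gpu_batch
{
    T* device_ptr = nullptr;

    // Copy N elements from the host into a fresh device allocation.
    static gpu_batch load_gpu(const T* host_src)
    {
        gpu_batch b;
        cudaMalloc(reinterpret_cast<void**>(&b.device_ptr), N * sizeof(T));
        cudaMemcpy(b.device_ptr, host_src, N * sizeof(T), cudaMemcpyHostToDevice);
        return b;
    }

    // Copy the result back to the host and release the device allocation.
    void store_gpu(T* host_dst)
    {
        cudaMemcpy(host_dst, device_ptr, N * sizeof(T), cudaMemcpyDeviceToHost);
        cudaFree(device_ptr);
    }
};

// Element-wise kernel backing operator+ for two batches.
template <class T>
__global__ void add_kernel(const T* lhs, const T* rhs, T* out, std::size_t n)
{
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = lhs[i] + rhs[i];
    }
}

// Each batch operation is its own kernel launch (the overhead noted below).
template <class T, std::size_t N>
gpu_batch<T, N> operator+(const gpu_batch<T, N>& lhs, const gpu_batch<T, N>& rhs)
{
    gpu_batch<T, N> result;
    cudaMalloc(reinterpret_cast<void**>(&result.device_ptr), N * sizeof(T));
    const unsigned int threads = 256;
    const unsigned int blocks = static_cast<unsigned int>((N + threads - 1) / threads);
    add_kernel<<<blocks, threads>>>(lhs.device_ptr, rhs.device_ptr, result.device_ptr, N);
    return result;
}
```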

Points of Concern:

  • Containers are copied multiple times when they are referenced in multiple expressions, rather than being kept as a single immutable shadow copy.
    • GPU memory allocations and host-device transfers are expensive.
  • Expressions are evaluated serially through the expression tree, where multiple streams/threads could be used in a reduction tree.
  • Each batch is essentially a kernel launch, which has overhead, i.e. no kernel fusion. Fusion would require us to generate kernels with template metaprogramming, which would likely mean implementing the assignment operation as a single kernel launch across a thread grid (see the fused-kernel sketch after this list).
  • I am currently proposing we use Thrust or something similar from AMD/Intel, which has a cost as well, but it eliminates the need to worry about launching kernels, streams, and synchronization (see the Thrust sketch after this list).
  • The current method 'dispatches' work from the host to a device in an opaque way. We could also create a gpu_container for the public interface and attempt to implement the assignment as a CUDA kernel.
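
To make the kernel-fusion point concrete, here is a minimal sketch (not code from this PR; the names are made up) of what a fused assignment would look like: the whole expression a = b + c * d is evaluated in a single launch over a thread grid instead of one launch per batch/operation, and no temporary is materialized in global memory.

```cpp
// Hypothetical illustration of kernel fusion for a = b + c * d.
#include <cstddef>

__global__ void fused_assign(float* a, const float* b,
                             const float* c, const float* d, std::size_t n)
{
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        // The intermediate c * d never touches global memory.
        a[i] = b[i] + c[i] * d[i];
    }
}

// "Implementing the assignment operation as a kernel launch across a
// thread grid" amounts to launching one such kernel over the flattened
// container.
void assign(float* a, const float* b, const float* c, const float* d, std::size_t n)
{
    const unsigned int threads = 256;
    const unsigned int blocks = static_cast<unsigned int>((n + threads - 1) / threads);
    fused_assign<<<blocks, threads>>>(a, b, c, d, n);
}
```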
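
And as a rough illustration of the Thrust option (again not code from this PR), an element-wise assignment can be expressed without any explicit kernel launch, stream, or synchronization management; Thrust chooses the launch configuration itself:

```cpp
// Illustrative only: dispatching an element-wise a = b + c through Thrust.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main()
{
    thrust::host_vector<float> b(1024, 1.0f);
    thrust::host_vector<float> c(1024, 2.0f);

    // Host-to-device transfers (the expensive copies noted above).
    thrust::device_vector<float> d_b = b;
    thrust::device_vector<float> d_c = c;
    thrust::device_vector<float> d_a(1024);

    // No hand-written kernel: Thrust handles launch and synchronization.
    thrust::transform(d_b.begin(), d_b.end(), d_c.begin(),
                      d_a.begin(), thrust::plus<float>());

    // Device-to-host transfer of the result.
    thrust::host_vector<float> a = d_a;
    return 0;
}
```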

#192
