Draft
Changes from all commits
108 commits
e327b1e
Start of ExperimentalCUDACodeGen impementation
aydogdub May 8, 2025
e1d75d8
clean-up my workspace
aydogdub May 8, 2025
7a9a6f7
initial complete implementation of GPU_Warp schedule
aydogdub May 12, 2025
9d81a37
small test of GPU_Warp schedule
aydogdub May 12, 2025
f47aba6
fix ThreadBlock scope generation and improve readability
aydogdub May 13, 2025
2d4a8f3
setting up testing environment
aydogdub May 16, 2025
665b217
fix issues and enable launch bounds hint
aydogdub May 16, 2025
797df83
refactoring computation of kernel dimensions
aydogdub May 20, 2025
c575a32
report 1
aydogdub May 23, 2025
0270047
clean-up and finishing small TODO's
aydogdub May 25, 2025
9be949f
Some TODO's and improving _generate_kernel_scope function
aydogdub May 26, 2025
c1ec70a
new configs
aydogdub May 26, 2025
8f2a5da
Refactoring: Getting rid of complicated cuda stream handling and repl…
aydogdub Jun 1, 2025
52a6394
test for out of kernel memory copies
aydogdub Jun 2, 2025
6ceb4e8
provisional memory copy solution
aydogdub Jun 2, 2025
52a89c5
A naive alternative of a cuda stream scheduler as a pass
aydogdub Jun 4, 2025
9f26992
adapt such that the naive GPU scheduler pass works and can be used by…
aydogdub Jun 4, 2025
5a0f479
from now on, write important notes here to help future developers and…
aydogdub Jun 4, 2025
af8faf1
Looking at the effects of NaiveGPUStreamScheduler
aydogdub Jun 4, 2025
8b5020a
Major refactoring. Create several files, implement Strategy Pattern f…
aydogdub Jun 5, 2025
5970391
Synchronization Insertion pass, almost done + notebook examples
aydogdub Jun 12, 2025
50f5431
finish open work on DefaultSharedMemorySync Pass (skipping sequential…
aydogdub Jun 13, 2025
2a560bc
scratch notebooks where I visually checked the passes, deleted unnece…
aydogdub Jun 13, 2025
3079690
const arrays utils alpha impl
ThrudPrimrose Jun 13, 2025
54193dd
filling the strategy provisionally, might need change in future
aydogdub Jun 13, 2025
aa50223
Extend synchronization pass and set scratch + configs.
aydogdub Jun 15, 2025
558a815
Fixed mistakes and integrated the synchronization pass to the codegen…
aydogdub Jun 15, 2025
a545ce8
fix collaborative synchronization- not requred anymore
aydogdub Jun 16, 2025
b0c5e07
Yakups examples, bad example of issues of legacy codegen and testing …
aydogdub Jun 16, 2025
409971c
yapkups sdfg examples, stored for inspection and testing for later
aydogdub Jun 22, 2025
88d29a0
New smem tests, copy past old tests to new testing folder, apply fixe…
aydogdub Jun 22, 2025
2332a63
copying yakups validation adaption (and symbolic.py, which has newcod…
aydogdub Jun 25, 2025
df452ba
User can now choose name current thread blocks variable name, which m…
aydogdub Jun 25, 2025
5aad68a
Fixed and refactored GPU stream sync pass
aydogdub Jun 26, 2025
34aa669
Fixed mistake in the copy strategy
aydogdub Jun 27, 2025
d03c5d0
Add validation check to check for intersatte edge assignments to scal…
ThrudPrimrose Jul 3, 2025
e13eade
Merge branch 'main' into const_array_utils
ThrudPrimrose Jul 3, 2025
974573a
Alpha implementation sketch
ThrudPrimrose Jul 3, 2025
8f7a258
Fix scope mistakes- missing brackets. Leads to errors for certain wei…
aydogdub Jul 3, 2025
fc3cf9b
change default architecture, might be more of a personal issue. Yakup…
aydogdub Jul 3, 2025
b4aa787
Fixed Yakups tb pass, reported MapTiling issue (applied workaround) a…
aydogdub Jul 3, 2025
40f0b31
Add pass to detect const
ThrudPrimrose Jul 3, 2025
8114c06
using union instead |
ThrudPrimrose Jul 3, 2025
8490a64
Using Union instead |
ThrudPrimrose Jul 3, 2025
c932fd2
refactor
ThrudPrimrose Jul 3, 2025
a84021e
Fix name clash
ThrudPrimrose Jul 3, 2025
8a5bb99
Add improved validation test for the interstate_edge_utils
ThrudPrimrose Jul 4, 2025
37b604d
Run precommit hook
ThrudPrimrose Jul 4, 2025
1112913
Typefix
ThrudPrimrose Jul 4, 2025
76d0aa1
Merge branch 'main' into const_array_utils
ThrudPrimrose Jul 4, 2025
498e783
Rename
ThrudPrimrose Jul 4, 2025
3355569
allows to switch bettwen the codegens without local definition of ptr…
aydogdub Jul 7, 2025
9f77a03
Missing Explicit ThreadBlock Maps will now be handled by a pass that …
aydogdub Jul 7, 2025
15b9e1c
small change
aydogdub Jul 7, 2025
db2c874
cleaning up
aydogdub Jul 7, 2025
443292a
Update infer_const_args.py
ThrudPrimrose Jul 8, 2025
c0f4633
stuff to make async memcpy work and AddThreadBlockMap for cpu.py
aydogdub Jul 8, 2025
19972df
Merge branch 'main' into newgpucodegen
aydogdub Jul 8, 2025
84b585b
format files to dace style, using pre-commit run --all
aydogdub Jul 8, 2025
b42f094
Merge branch 'main' into const_array_utils
ThrudPrimrose Jul 8, 2025
e102233
Merge remote-tracking branch 'upstream/const_array_utils' into newgpu…
aydogdub Jul 8, 2025
a7ccb31
Improve API
ThrudPrimrose Jul 8, 2025
d86bf47
Merge remote-tracking branch 'upstream/const_array_utils' into newgpu…
aydogdub Jul 9, 2025
2700706
removing workspace folder for PR and from the repository
aydogdub Jul 10, 2025
bd632d6
provisional implementation for constant checks
aydogdub Jul 14, 2025
09caf2c
ensure correct CUDA backend is selected
aydogdub Jul 14, 2025
31fe8f8
Yakups fixes during Meeting
aydogdub Jul 14, 2025
20b4e09
Update map free symbols
ThrudPrimrose Jul 14, 2025
1838a17
Update
ThrudPrimrose Jul 14, 2025
a71b0ff
Ensure no gpu stream synchronization within Kernels occur
aydogdub Jul 15, 2025
1ec735e
Handle GPU_global, in-kernel defined transients used for backwards co…
aydogdub Jul 28, 2025
1b63608
small refactoring
aydogdub Aug 5, 2025
b8f282b
Experimental way to support Stream objects
aydogdub Aug 5, 2025
8206f3e
streams as opaque types
aydogdub Aug 5, 2025
7245b1b
Revert merge and Implement initial support for dynamic inputs
aydogdub Aug 6, 2025
d74e7dd
Merge branch 'newgpucodegen' of https://github.com/aydogdub/dace into…
aydogdub Aug 6, 2025
217d8c2
New approach for GPU streams- make it explicit
aydogdub Aug 15, 2025
9c65a1f
merge and adapt const-checks
aydogdub Aug 15, 2025
e668f29
Merge branch 'newgpucodegen' of https://github.com/aydogdub/dace into…
aydogdub Aug 15, 2025
f416188
Add support for expanded tasklets using GPU streams. Fix small issues
aydogdub Aug 17, 2025
8b2ece1
finish GPU stream management, fix issues to increase test coverage an…
aydogdub Aug 24, 2025
b979533
small refactoring
aydogdub Aug 26, 2025
3a848db
failing
aydogdub Aug 28, 2025
264191e
various fixes and clean ups, especially regarding GPU stream management
aydogdub Sep 10, 2025
01e462a
Fixing GPU stream management and clean up
aydogdub Sep 10, 2025
983d80d
set back to legacy CUDACodeGen
aydogdub Sep 10, 2025
087f08a
Merge remote-tracking branch 'upstream/main' into newgpucodegen
aydogdub Sep 10, 2025
802f24a
fix
aydogdub Sep 10, 2025
85f2cb1
reset to normal cuda.py file
aydogdub Sep 10, 2025
3d76c3e
start of new pipeline
aydogdub Sep 11, 2025
f149b01
fix
aydogdub Sep 15, 2025
e2ad61c
fixes
aydogdub Sep 15, 2025
db07e4c
trying fix
aydogdub Sep 15, 2025
abe3e7b
fix of yakup
aydogdub Sep 15, 2025
cb7f48c
fix
aydogdub Sep 16, 2025
f46bd46
quick
aydogdub Sep 16, 2025
fa85b76
enable spezialization via location
aydogdub Sep 17, 2025
c286558
missed synchronization now added
aydogdub Sep 17, 2025
15fc13b
fix, clean-up and pre-commit
aydogdub Sep 17, 2025
0f84a4a
fix
aydogdub Sep 17, 2025
7bc1226
fix missed case in default shared memory synchornization
aydogdub Sep 24, 2025
0ffbc83
avoid unnecessary smem sync, and add additional case where stream syn…
aydogdub Oct 7, 2025
7a080cb
add GPU stream pipeline passes and necessary helpers
aydogdub Nov 21, 2025
eb10eac
added tests and adhzstments
aydogdub Dec 19, 2025
69fbb73
run pre-commit
aydogdub Dec 19, 2025
b72aa2c
Merge branch 'main' into new-gpu-stream-passes
aydogdub Dec 22, 2025
17ec218
Merge branch 'main' into new-gpu-codegen-dev
ThrudPrimrose Jan 5, 2026
37db583
Merge remote-tracking branch 'aydogdub/new-gpu-stream-passes' into ne…
ThrudPrimrose Jan 5, 2026
3 changes: 2 additions & 1 deletion dace/codegen/CMakeLists.txt
@@ -58,7 +58,8 @@ foreach(DACE_FILE ${DACE_FILES})
# Make the path absolute
set(DACE_FILE ${DACE_SRC_DIR}/${DACE_FILE})
# Now treat the file according to the deduced target
if(${DACE_FILE_TARGET} STREQUAL "cuda")
# previous: if(${DACE_FILE_TARGET} STREQUAL "cuda"). Needed to work with experimental
if(${DACE_FILE_TARGET} STREQUAL "experimental_cuda" OR ${DACE_FILE_TARGET} STREQUAL "cuda")
if(${DACE_FILE_TARGET_TYPE} MATCHES "hip")
set(DACE_ENABLE_HIP ON)
set(DACE_HIP_FILES ${DACE_HIP_FILES} ${DACE_FILE})
47 changes: 43 additions & 4 deletions dace/codegen/instrumentation/gpu_events.py
@@ -129,7 +129,7 @@ def on_scope_entry(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, n
'GPU_Device map scopes')

idstr = 'b' + self._idstr(cfg, state, node)
stream = getattr(node, '_cuda_stream', -1)
stream = self._get_gpu_stream(state, node)
outer_stream.write(self._record_event(idstr, stream), cfg, state_id, node)

def on_scope_exit(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, node: nodes.ExitNode,
@@ -139,7 +139,7 @@ def on_scope_exit(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, no
s = self._get_sobj(node)
if s.instrument == dtypes.InstrumentationType.GPU_Events:
idstr = 'e' + self._idstr(cfg, state, entry_node)
stream = getattr(node, '_cuda_stream', -1)
stream = self._get_gpu_stream(state, node)
outer_stream.write(self._record_event(idstr, stream), cfg, state_id, node)
outer_stream.write(self._report('%s %s' % (type(s).__name__, s.label), cfg, state, entry_node), cfg,
state_id, node)
@@ -153,7 +153,7 @@ def on_node_begin(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, no
if node.instrument == dtypes.InstrumentationType.GPU_Events:
state_id = state.parent_graph.node_id(state)
idstr = 'b' + self._idstr(cfg, state, node)
stream = getattr(node, '_cuda_stream', -1)
stream = self._get_gpu_stream(state, node)
outer_stream.write(self._record_event(idstr, stream), cfg, state_id, node)

def on_node_end(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, node: nodes.Node,
@@ -165,7 +165,46 @@ def on_node_end(self, sdfg: SDFG, cfg: ControlFlowRegion, state: SDFGState, node
if node.instrument == dtypes.InstrumentationType.GPU_Events:
state_id = state.parent_graph.node_id(state)
idstr = 'e' + self._idstr(cfg, state, node)
stream = getattr(node, '_cuda_stream', -1)
stream = self._get_gpu_stream(state, node)
outer_stream.write(self._record_event(idstr, stream), cfg, state_id, node)
outer_stream.write(self._report('%s %s' % (type(node).__name__, node.label), cfg, state, node), cfg,
state_id, node)

def _get_gpu_stream(self, state: SDFGState, node: nodes.Node) -> int:
"""
Return the GPU stream ID assigned to a given node.

- In the CUDACodeGen, the stream ID is stored as the private attribute
``_cuda_stream`` on the node.
- In the ExperimentalCUDACodeGen, streams are explicitly assigned to tasklets
and GPU_Device-scheduled maps (kernels) via a GPU stream AccessNode. For
other node types, no reliable stream assignment is available.

Parameters
----------
state : SDFGState
The state containing the node.
node : dace.sdfg.nodes.Node
The node for which to query the GPU stream.

Returns
-------
int
The assigned GPU stream ID, or ``-1`` if none could be determined.
"""
if config.Config.get('compiler', 'cuda', 'implementation') == 'legacy':
stream = getattr(node, '_cuda_stream', -1)

else:
stream = -1
for in_edge in state.in_edges(node):
src = in_edge.src
if (isinstance(src, nodes.AccessNode) and src.desc(state).dtype == dtypes.gpuStream_t):
stream = int(in_edge.data.subset)

for out_edge in state.out_edges(node):
dst = out_edge.dst
if (isinstance(dst, nodes.AccessNode) and dst.desc(state).dtype == dtypes.gpuStream_t):
stream = int(out_edge.data.subset)

return stream
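The lookup above can be sketched in isolation: in the legacy path the stream ID lives in the private `_cuda_stream` attribute, while in the experimental path it is the memlet subset of an edge touching a `gpuStream_t` AccessNode. A minimal model using stand-in classes (`FakeAccessNode`, `FakeEdge`, `FakeState` are illustrative, not the DaCe API):

```python
# Stand-alone sketch of _get_gpu_stream. The Fake* classes below mimic just
# enough of the SDFG state/node/edge interfaces to show the lookup logic.

GPU_STREAM_T = "gpuStream_t"  # stand-in for dtypes.gpuStream_t


class FakeAccessNode:
    def __init__(self, dtype: str):
        self.dtype = dtype


class FakeEdge:
    def __init__(self, src, dst, subset: str):
        self.src, self.dst, self.subset = src, dst, subset


class FakeState:
    def __init__(self, in_edges, out_edges):
        self._in, self._out = in_edges, out_edges

    def in_edges(self, node):
        return self._in

    def out_edges(self, node):
        return self._out


def get_gpu_stream(state, node, legacy: bool = False) -> int:
    if legacy:
        # Legacy CUDACodeGen: stream ID stored directly on the node.
        return getattr(node, "_cuda_stream", -1)
    stream = -1
    # Experimental codegen: the stream ID is the subset of an edge that
    # connects the node to a gpuStream_t AccessNode.
    for e in state.in_edges(node):
        if isinstance(e.src, FakeAccessNode) and e.src.dtype == GPU_STREAM_T:
            stream = int(e.subset)
    for e in state.out_edges(node):
        if isinstance(e.dst, FakeAccessNode) and e.dst.dtype == GPU_STREAM_T:
            stream = int(e.subset)
    return stream


# A kernel node fed by the stream AccessNode at index 2:
kernel = object()
state = FakeState([FakeEdge(FakeAccessNode(GPU_STREAM_T), kernel, "2")], [])
print(get_gpu_stream(state, kernel))  # 2
```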
1 change: 1 addition & 0 deletions dace/codegen/targets/__init__.py
@@ -9,3 +9,4 @@
from .mlir.mlir import MLIRCodeGen
from .sve.codegen import SVECodeGen
from .snitch import SnitchCodeGen
from .experimental_cuda import ExperimentalCUDACodeGen
16 changes: 11 additions & 5 deletions dace/codegen/targets/cpp.py
@@ -236,14 +236,22 @@ def memlet_copy_to_absolute_strides(dispatcher: 'TargetDispatcher',

def is_cuda_codegen_in_device(framecode) -> bool:
"""
Check the state of the CUDA code generator, whether it is inside device code.
Check whether the (legacy or experimental) CUDA code generator is currently inside device code.
"""
from dace.codegen.targets.cuda import CUDACodeGen
from dace.codegen.targets.experimental_cuda import ExperimentalCUDACodeGen

cuda_impl = Config.get('compiler', 'cuda', 'implementation')
if cuda_impl == 'legacy':
cudaClass = CUDACodeGen
elif cuda_impl == 'experimental':
cudaClass = ExperimentalCUDACodeGen
else:
raise ValueError(f"Unknown CUDA codegen implementation: {cuda_impl}")

if framecode is None:
cuda_codegen_in_device = False
else:
for codegen in framecode.targets:
if isinstance(codegen, CUDACodeGen):
if isinstance(codegen, cudaClass):
cuda_codegen_in_device = codegen._in_device_code
break
else:
Expand All @@ -266,11 +274,9 @@ def ptr(name: str, desc: data.Data, sdfg: SDFG = None, framecode=None) -> str:
root = name.split('.')[0]
if root in sdfg.arrays and isinstance(sdfg.arrays[root], data.Structure):
name = name.replace('.', '->')

# Special case: If memory is persistent and defined in this SDFG, add state
# struct to name
if (desc.transient and desc.lifetime in (dtypes.AllocationLifetime.Persistent, dtypes.AllocationLifetime.External)):

if desc.storage == dtypes.StorageType.CPU_ThreadLocal: # Use unambiguous name for thread-local arrays
return f'__{sdfg.cfg_id}_{name}'
elif not is_cuda_codegen_in_device(framecode): # GPU kernels cannot access state
@@ -936,7 +942,7 @@ def unparse_tasklet(sdfg, cfg, state_id, dfg, node, function_stream, callsite_st
# set the stream to a local variable.
max_streams = int(Config.get("compiler", "cuda", "max_concurrent_streams"))
if not is_devicelevel_gpu(sdfg, state_dfg, node) and (hasattr(node, "_cuda_stream")
or connected_to_gpu_memory(node, state_dfg, sdfg)):
and connected_to_gpu_memory(node, state_dfg, sdfg)):
if max_streams >= 0:
callsite_stream.write(
'int __dace_current_stream_id = %d;\n%sStream_t __dace_current_stream = __state->gpu_context->streams[__dace_current_stream_id];'
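The `unparse_tasklet` change above tightens the guard from `or` to `and`: a host-side tasklet now receives a local `__dace_current_stream` variable only when it both carries a `_cuda_stream` attribute and is actually connected to GPU memory. A minimal sketch of the corrected predicate (the function name and parameters are illustrative, not the DaCe API):

```python
# Sketch of the guard fixed in unparse_tasklet. With the old `or`, a tasklet
# carrying a stale `_cuda_stream` attribute but no GPU-memory neighbors still
# got a stream variable emitted; with `and`, both conditions must hold.

def needs_stream_var(has_cuda_stream_attr: bool, touches_gpu_memory: bool,
                     in_device_code: bool) -> bool:
    # Device-level code never declares a host-side stream variable.
    if in_device_code:
        return False
    # Both the stream assignment and a GPU-memory connection are required.
    return has_cuda_stream_attr and touches_gpu_memory

print(needs_stream_var(True, False, False))  # False (was True with `or`)
print(needs_stream_var(True, True, False))   # True
```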
20 changes: 20 additions & 0 deletions dace/codegen/targets/cpu.py
@@ -513,6 +513,13 @@ def allocate_array(self,

return
elif (nodedesc.storage == dtypes.StorageType.Register):

if nodedesc.dtype == dtypes.gpuStream_t:
ctype = dtypes.gpuStream_t.ctype
allocation_stream.write(f"{ctype}* {name} = __state->gpu_context->streams;")
define_var(name, DefinedType.Pointer, ctype)
return

ctypedef = dtypes.pointer(nodedesc.dtype).ctype
if nodedesc.start_offset != 0:
raise NotImplementedError('Start offset unsupported for registers')
@@ -588,6 +595,9 @@ def deallocate_array(self, sdfg: SDFG, cfg: ControlFlowRegion, dfg: StateSubgrap

if isinstance(nodedesc, (data.Scalar, data.View, data.Stream, data.Reference)):
return
elif nodedesc.dtype == dtypes.gpuStream_t:
callsite_stream.write(f"{alloc_name} = nullptr;")
return
elif (nodedesc.storage == dtypes.StorageType.CPU_Heap
or (nodedesc.storage == dtypes.StorageType.Register and
(symbolic.issymbolic(arrsize, sdfg.constants) or
@@ -1008,6 +1018,11 @@ def process_out_memlets(self,
dst_edge = dfg.memlet_path(edge)[-1]
dst_node = dst_edge.dst

if isinstance(dst_node, nodes.AccessNode) and dst_node.desc(state).dtype == dtypes.gpuStream_t:
# Special case: GPU streams do not represent data flow - they assign GPU streams to kernels/tasks.
# Thus, nothing needs to be written, and out memlets of this kind should be ignored.
continue

# Target is neither a data nor a tasklet node
if isinstance(node, nodes.AccessNode) and (not isinstance(dst_node, nodes.AccessNode)
and not isinstance(dst_node, nodes.CodeNode)):
@@ -1049,6 +1064,7 @@ def process_out_memlets(self,
# Tasklet -> array with a memlet. Writing to array is emitted only if the memlet is not empty
if isinstance(node, nodes.CodeNode) and not edge.data.is_empty():
if not uconn:
return
raise SyntaxError("Cannot copy memlet without a local connector: {} to {}".format(
str(edge.src), str(edge.dst)))

@@ -1585,6 +1601,10 @@ def define_out_memlet(self, sdfg: SDFG, cfg: ControlFlowRegion, state_dfg: State
cdtype = src_node.out_connectors[edge.src_conn]
if isinstance(sdfg.arrays[edge.data.data], data.Stream):
pass
elif isinstance(dst_node, nodes.AccessNode) and dst_node.desc(state_dfg).dtype == dtypes.gpuStream_t:
# Special case: GPU streams do not represent data flow - they assign GPU streams to kernels/tasks.
# Thus, nothing needs to be written.
pass
elif isinstance(cdtype, dtypes.pointer): # If pointer, also point to output
desc = sdfg.arrays[edge.data.data]

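The `process_out_memlets` and `define_out_memlet` changes in `cpu.py` both apply the same rule: an edge whose endpoint is a `gpuStream_t` AccessNode expresses a stream assignment, not data movement, so no copy code is emitted for it. A standalone sketch of that filtering, using stand-in classes (`Access` and `filter_stream_assignments` are illustrative, not the DaCe API):

```python
# Sketch of the out-memlet filtering added in cpu.py: destinations that are
# gpuStream_t AccessNodes carry stream assignments rather than data flow,
# so they are skipped when emitting write-back code.

GPU_STREAM_T = "gpuStream_t"  # stand-in for dtypes.gpuStream_t


class Access:
    """Minimal stand-in for an AccessNode with a data type."""

    def __init__(self, name: str, dtype: str):
        self.name = name
        self.dtype = dtype


def filter_stream_assignments(dst_nodes):
    """Return only the destinations for which copy code must be emitted."""
    emitted = []
    for dst in dst_nodes:
        if isinstance(dst, Access) and dst.dtype == GPU_STREAM_T:
            continue  # stream-assignment edge: nothing to write
        emitted.append(dst)
    return emitted


outs = [Access("A", "float64"), Access("s0", GPU_STREAM_T)]
print([a.name for a in filter_stream_assignments(outs)])  # ['A']
```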