Reduce device RTL memory footprint #139
base: amd-stg-openmp
Conversation
Congratulations 🎉. DeepCode analyzed your code in 2.266 seconds and we found no issues. Enjoy a moment of no bugs ☀️.
good start, but ...
please don't merge this yet, there are test failures being looked into
I'm not clear how this change reduces the RTL memory footprint. It looks like the static array is constructed in LLVM instead of in the library code, and a compile time constant size is then passed as a runtime variable into the device functions. Is this patch missing some pieces?
CGF.getTarget().getGridValue(llvm::omp::GV_Warp_Slot_Size);
size_t DataSharingMemorySlotSize = WarpSlotSize * 64;

// creating a global array which will be used for data sharing slots
Why construct this in codegen instead of as an array in the devicertl?
Because the deviceRTL would not know the number of kernels present in the device image.
llvm::Type *Ty =
    llvm::ArrayType::get(CGF.CGM.Int8Ty, DataSharingMemorySlotSize);
llvm::GlobalVariable *DataSharingMemorySlot = new llvm::GlobalVariable(
Is this the same array as above?
Yes
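A minimal host-side model of what this codegen step does, as described in the thread: one uniquely named byte array per target region, sized at compile time. This is a sketch, not the actual LLVM API; the 16 KiB warp-slot size and the `ModuleGlobals`/`emitDataSharingSlot` names are assumptions (16 KiB × 64 is chosen because it matches the 1 MB-per-kernel figure quoted later in the discussion).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the module's globals; the real code emits an
// llvm::GlobalVariable of type [DataSharingMemorySlotSize x i8] per kernel.
constexpr size_t WarpSlotSize = 16 * 1024;                    // assumed slot size
constexpr size_t DataSharingMemorySlotSize = WarpSlotSize * 64;

std::map<std::string, std::vector<char>> ModuleGlobals;

// Emit one data-sharing slot array per kernel, keyed by a unique name,
// mirroring "openmp.data.sharing.memory.slot" from the patch.
void emitDataSharingSlot(const std::string &KernelName) {
  ModuleGlobals["openmp.data.sharing.memory.slot." + KernelName] =
      std::vector<char>(DataSharingMemorySlotSize);
}
```

The key design point is that this happens in clang codegen, where the set of kernels in the image is known, which the deviceRTL cannot know on its own.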
@@ -433,8 +433,8 @@ EXTERN void __kmpc_kernel_prepare_parallel(void *WorkFn);
EXTERN bool __kmpc_kernel_parallel(void **WorkFn);
EXTERN void __kmpc_kernel_end_parallel();

EXTERN void __kmpc_data_sharing_init_stack();
EXTERN void __kmpc_data_sharing_init_stack_spmd();
EXTERN void __kmpc_data_sharing_init_stack(char *Data, size_t size);
Probably can't modify these prototypes without also modifying nvptx, can add more functions
Will do. These changes should work with nvptx as well, but I don't know how to test that.
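The intent of the changed prototype can be sketched as follows: the RTL no longer owns a fixed global array, and instead receives a per-kernel buffer and its size at init time. This is a hedged host-side model, not the actual device RTL; the `DataSharingStack` struct and `push_frame` helper are illustrative names, and the real implementation must also handle per-warp slots and concurrency.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the new interface: codegen hands the RTL a
// per-kernel slot array plus its size, rather than the RTL owning one.
struct DataSharingStack {
  char *Base = nullptr;   // start of the kernel's slot array
  size_t Size = 0;        // total bytes available
  size_t Used = 0;        // bytes consumed so far (stack order)
};

static DataSharingStack Stack;

void kmpc_data_sharing_init_stack(char *Data, size_t size) {
  Stack = {Data, size, 0};
}

// Push a call frame in stack order; returns nullptr when the slot array
// is exhausted (the real RTL would fall back to malloc or trap).
char *push_frame(size_t bytes) {
  if (Stack.Used + bytes > Stack.Size)
    return nullptr;
  char *Frame = Stack.Base + Stack.Used;
  Stack.Used += bytes;
  return Frame;
}
```

Passing the size as a runtime argument is what lets a single RTL binary serve arrays whose sizes are decided per kernel at compile (or link) time.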
StringRef DataSharingMemorySlotName = "openmp.data.sharing.memory.slot";
size_t WarpSlotSize =
    CGF.getTarget().getGridValue(llvm::omp::GV_Warp_Slot_Size);
size_t DataSharingMemorySlotSize = WarpSlotSize * 64;
This looks like a constant - I thought the idea was to compute this per-kernel?
This is a bit of a different approach from computing it per kernel. For now, I am allocating 1 MB of memory globally and uniquely per kernel, so the memory footprint is proportional to the number of non-SPMD kernels. This size can be adjusted later during the clang-build-select-link phase.
@JonChesterfield I am first allocating 1 MB of memory per kernel, and each kernel will have its own memory array. So the device memory footprint depends directly on the number of non-SPMD kernels. For example, a program with only one (non-SPMD) kernel would take only 1 MB of memory. The device_state var is now only around ~236 MB, down from the previous ~2.3 GB.
Per kernel or per launch? If per kernel, and this is the same underlying structure, I'd guess that's safe, though neither launch will be able to allocate the full 1 MB, and the second one to start may get none.

Memory footprint would be dependent on the total number of non-SPMD kernels, not on the number that are currently running, right? So we'll have a lot of memory that is unused. It seems that a test case with a lot of kernels would exceed the fixed 2.3 GB we currently use.

If I understand correctly, this memory is used for call frames, i.e. it'll be allocated in stack order. Sometimes we can calculate an upper bound on the state needed, sometimes we can calculate a lower bound. I suspect the best we can do is give each kernel a block of __private memory, sized to that kernel. If we know the upper bound and it's low enough that __private allocation works, we get a kernel that never needs malloc. If the upper bound is unknown or too high, we can do things with malloc/realloc.

It seems desirable to only occupy memory on the GPU for the kernels that are running, as then we can handle multiple instances of the same kernel correctly and minimise the footprint.
Yes, this is per kernel (emitting one global array per kernel). It seems possible to use private memory in cases where memory usage is low.
You are right. These changes do not handle multiple instances of the same kernel launched in parallel. The previous implementation had proper locks in place for the case of a large number of kernels.
No description provided.