This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

Reduce device RTL memory footprint #139

Open. Wants to merge 1 commit into base: amd-stg-openmp.

Conversation

pdhaliwal-amd
Contributor

No description provided.

@ghost

ghost commented Aug 17, 2020

Congratulations 🎉. DeepCode analyzed your code in 2.266 seconds and we found no issues. Enjoy a moment of no bugs ☀️.

👉 View analysis in DeepCode’s Dashboard | Configure the bot

Contributor

@ronlieb ronlieb left a comment


good start, but ...
please don't merge this yet, there are test failures being looked into

Contributor

@JonChesterfield JonChesterfield left a comment


I'm not clear how this change reduces the RTL memory footprint. It looks like the static array is constructed in LLVM instead of in the library code, and a compile time constant size is then passed as a runtime variable into the device functions. Is this patch missing some pieces?

size_t WarpSlotSize =
    CGF.getTarget().getGridValue(llvm::omp::GV_Warp_Slot_Size);
size_t DataSharingMemorySlotSize = WarpSlotSize * 64;
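For context, this arithmetic works out to 1MB per kernel on AMDGCN. A standalone sketch of that computation; the grid values below are assumptions mirroring AMDGCN's 64-lane wavefront and 256-byte per-lane slot, not code from the patch:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for CGF.getTarget().getGridValue(...); these mirror
// the AMDGCN grid values (an assumption for this sketch).
constexpr size_t GV_Warp_Size = 64;   // lanes per wavefront
constexpr size_t GV_Slot_Size = 256;  // bytes of sharing space per lane
constexpr size_t GV_Warp_Slot_Size = GV_Warp_Size * GV_Slot_Size;  // 16 KiB

// Mirrors `WarpSlotSize * 64` from the patch: one warp-sized slot for each
// of up to 64 warps gives a 1 MiB data-sharing array per kernel.
constexpr size_t DataSharingMemorySlotSize = GV_Warp_Slot_Size * 64;

static_assert(DataSharingMemorySlotSize == 1024 * 1024,
              "one 1 MiB slab per non-SPMD kernel");
```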

// creating a global array which will be used for data sharing slots
Contributor


Why construct this in codegen instead of as an array in the devicertl?

Contributor Author


Because the deviceRTL would not know the number of kernels present in the device image.

llvm::Type *Ty = llvm::ArrayType::get(CGF.CGM.Int8Ty,
                                      DataSharingMemorySlotSize);
llvm::GlobalVariable *DataSharingMemorySlot = new llvm::GlobalVariable(
Contributor


Is this the same array as above?

Contributor Author


Yes

@@ -433,8 +433,8 @@ EXTERN void __kmpc_kernel_prepare_parallel(void *WorkFn);
EXTERN bool __kmpc_kernel_parallel(void **WorkFn);
EXTERN void __kmpc_kernel_end_parallel();

-EXTERN void __kmpc_data_sharing_init_stack();
-EXTERN void __kmpc_data_sharing_init_stack_spmd();
+EXTERN void __kmpc_data_sharing_init_stack(char *Data, size_t size);
Contributor


Probably can't modify these prototypes without also modifying nvptx, can add more functions
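A sketch of the "add more functions" option: keep the existing entry point untouched so nvptx keeps linking, and introduce a new, differently named one that carries the slab. The `_ex` name and the stored globals below are invented for illustration, not from the patch:

```cpp
#include <cassert>
#include <cstddef>

#define EXTERN extern "C"

// Hypothetical storage for the per-kernel slab (illustration only).
static char *SharedSlabData = nullptr;
static size_t SharedSlabSize = 0;

// Existing prototype stays untouched, so the nvptx deviceRTL keeps linking.
EXTERN void __kmpc_data_sharing_init_stack() { /* legacy path */ }

// New entry point that carries the per-kernel slab; the `_ex` suffix is an
// invented name for this sketch.
EXTERN void __kmpc_data_sharing_init_stack_ex(char *Data, size_t Size) {
  SharedSlabData = Data;
  SharedSlabSize = Size;
}
```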

Contributor Author


Will do. These changes should work with nvptx as well, but I don't know how to test that.

StringRef DataSharingMemorySlotName = "openmp.data.sharing.memory.slot";
size_t WarpSlotSize =
CGF.getTarget().getGridValue(llvm::omp::GV_Warp_Slot_Size);
size_t DataSharingMemorySlotSize = WarpSlotSize * 64;
Contributor


This looks like a constant - I thought the idea was to compute this per-kernel?

Contributor Author


This is a slightly different approach than computing it per kernel. For now, I am allocating 1MB of memory globally and uniquely per kernel, so the memory footprint is proportional to the number of non-SPMD kernels. This size can be adjusted later during the clang-build-select-link phase.

@pdhaliwal-amd
Contributor Author

@JonChesterfield I am allocating 1MB of memory per kernel, and each kernel gets its own memory array, so the device memory footprint is directly proportional to the number of non-SPMD kernels. For example, a program with only one (non-SPMD) kernel would take only 1MB of memory. The device_state var is now only around ~236MB, down from the previous ~2.3GB.
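The scaling can be sanity-checked with a small model. This is a sketch assuming 1MB per non-SPMD kernel; the kernel counts are illustrative, chosen to reproduce the quoted numbers:

```cpp
#include <cassert>
#include <cstddef>

// With this patch, device_state grows linearly with the number of non-SPMD
// kernels in the image, at roughly 1 MiB per kernel.
constexpr size_t PerKernelSlabBytes = 1024 * 1024;

constexpr size_t deviceFootprintBytes(size_t numNonSpmdKernels) {
  return numNonSpmdKernels * PerKernelSlabBytes;
}

// A program with a single non-SPMD kernel takes only 1 MiB; the quoted
// ~236 MB would correspond to an image with ~236 such kernels.
static_assert(deviceFootprintBytes(1) == 1024 * 1024, "one kernel, 1 MiB");
static_assert(deviceFootprintBytes(236) == 236 * 1024 * 1024, "~236 MiB");
```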

@JonChesterfield
Contributor

JonChesterfield commented Aug 17, 2020

Per kernel or per launch? If per kernel, and this is the same underlying structure, I'd guess that's safe, though neither will be able to allocate the full 1mb and the second one to start may get none.

Memory footprint would be dependent on the total number of non-SPMD kernels, not on the number currently running, right? So we'll have a lot of memory that is unused. It seems a test case with a lot of kernels could exceed the fixed 2.3GB we currently use.

If I understand correctly, this memory is used for call frames. I.e. it'll be allocated in stack order. Sometimes we can calculate an upper bound on the state needed, sometimes we can calculate a lower bound. I suspect the best we can do is give each kernel a block of __private memory, sized to that kernel. If we know the upper bound and it's low enough that __private allocation works, we get a kernel that never needs malloc. If the upper bound is unknown or too high, we can do things with malloc/realloc.
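The stack-ordered usage described here can be pictured as a bump allocator over a fixed per-kernel block, with a null return marking the point where a malloc/realloc fallback would kick in. An illustrative model, not code from the runtime:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal stack-order (bump) allocator over a fixed block, modelling how a
// per-kernel data-sharing slab would hand out call frames.
class FrameStack {
  uint8_t *base;
  size_t capacity;
  size_t top = 0;  // bytes currently in use

public:
  FrameStack(uint8_t *block, size_t size) : base(block), capacity(size) {}

  // Push a frame; returns nullptr when the slab is exhausted, which is
  // where a malloc/realloc fallback would take over.
  void *push(size_t bytes) {
    if (top + bytes > capacity)
      return nullptr;
    void *frame = base + top;
    top += bytes;
    return frame;
  }

  // Frames are released in strict LIFO order.
  void pop(size_t bytes) {
    assert(bytes <= top && "pop without matching push");
    top -= bytes;
  }

  size_t used() const { return top; }
};
```

Because frames are released strictly LIFO, pop is just a pointer decrement, which is what makes an upper-bound-sized __private block attractive whenever the bound is known and small enough.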

It seems desirable to only occupy memory on the gpu for the kernels that are running, as then we can handle multiple instances of the same kernel correctly and minimise the footprint.

@pdhaliwal-amd
Contributor Author

pdhaliwal-amd commented Aug 17, 2020

Yes, this is per kernel (emitting one global array per kernel).

It seems possible to use private memory in cases where memory usage is low.

It seems desirable to only occupy memory on the gpu for the kernels that are running, as then we can handle multiple instances of the same kernel correctly and minimise the footprint.

You are right. These changes do not handle multiple instances of the same kernel launched in parallel. The previous implementation had proper locks in place to handle a large number of kernels.
