This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

Reduce device RTL memory footprint #139

Open. Wants to merge 1 commit into base: amd-stg-openmp.

Conversation

pdhaliwal-amd
Contributor

No description provided.

@ghost

ghost commented Aug 17, 2020

Congratulations 🎉. DeepCode analyzed your code in 2.266 seconds and we found no issues. Enjoy a moment of no bugs ☀️.

👉 View analysis in DeepCode’s Dashboard | Configure the bot

Contributor

@ronlieb ronlieb left a comment


good start, but ...
please don't merge this yet, there are test failures being looked into

Contributor

@JonChesterfield JonChesterfield left a comment


I'm not clear how this change reduces the RTL memory footprint. It looks like the static array is constructed in LLVM instead of in the library code, and a compile time constant size is then passed as a runtime variable into the device functions. Is this patch missing some pieces?

size_t WarpSlotSize =
    CGF.getTarget().getGridValue(llvm::omp::GV_Warp_Slot_Size);
size_t DataSharingMemorySlotSize = WarpSlotSize * 64;
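For context, this arithmetic works out to 1MB per kernel on AMDGCN. A standalone sketch of that computation; the grid values below are assumptions mirroring AMDGCN's 64-lane wavefront and 256-byte per-lane slot, not code from the patch:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for CGF.getTarget().getGridValue(...); these mirror
// the AMDGCN grid values (an assumption for this sketch).
constexpr size_t GV_Warp_Size = 64;   // lanes per wavefront
constexpr size_t GV_Slot_Size = 256;  // bytes of sharing space per lane
constexpr size_t GV_Warp_Slot_Size = GV_Warp_Size * GV_Slot_Size;  // 16 KiB

// Mirrors `WarpSlotSize * 64` from the patch: one warp-sized slot for each
// of up to 64 warps gives a 1 MiB data-sharing array per kernel.
constexpr size_t DataSharingMemorySlotSize = GV_Warp_Slot_Size * 64;

static_assert(DataSharingMemorySlotSize == 1024 * 1024,
              "one 1 MiB slab per non-SPMD kernel");
```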

// creating a global array which will be used for data sharing slots
Contributor


Why construct this in codegen instead of as an array in the devicertl?

Contributor Author


Because the deviceRTL would not know the number of kernels present in the device image.

llvm::Type *Ty = llvm::ArrayType::get(CGF.CGM.Int8Ty,
                                      DataSharingMemorySlotSize);
llvm::GlobalVariable *DataSharingMemorySlot = new llvm::GlobalVariable(
Contributor


Is this the same array as above?

Contributor Author


Yes

@@ -433,8 +433,8 @@ EXTERN void __kmpc_kernel_prepare_parallel(void *WorkFn);
EXTERN bool __kmpc_kernel_parallel(void **WorkFn);
EXTERN void __kmpc_kernel_end_parallel();

-EXTERN void __kmpc_data_sharing_init_stack();
-EXTERN void __kmpc_data_sharing_init_stack_spmd();
+EXTERN void __kmpc_data_sharing_init_stack(char *Data, size_t size);
Contributor


Probably can't modify these prototypes without also modifying nvptx, can add more functions
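A sketch of the "add more functions" option: keep the existing entry point untouched so nvptx keeps linking, and introduce a new, differently named one that carries the slab. The `_ex` name and the stored globals below are invented for illustration, not from the patch:

```cpp
#include <cassert>
#include <cstddef>

#define EXTERN extern "C"

// Hypothetical storage for the per-kernel slab (illustration only).
static char *SharedSlabData = nullptr;
static size_t SharedSlabSize = 0;

// Existing prototype stays untouched, so the nvptx deviceRTL keeps linking.
EXTERN void __kmpc_data_sharing_init_stack() { /* legacy path */ }

// New entry point that carries the per-kernel slab; the `_ex` suffix is an
// invented name for this sketch.
EXTERN void __kmpc_data_sharing_init_stack_ex(char *Data, size_t Size) {
  SharedSlabData = Data;
  SharedSlabSize = Size;
}
```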

Contributor Author


Will do. These changes should work with nvptx as well, but I don't know how to test that.

StringRef DataSharingMemorySlotName = "openmp.data.sharing.memory.slot";
size_t WarpSlotSize =
CGF.getTarget().getGridValue(llvm::omp::GV_Warp_Slot_Size);
size_t DataSharingMemorySlotSize = WarpSlotSize * 64;
Contributor


This looks like a constant - I thought the idea was to compute this per-kernel?

Contributor Author


This is a slightly different approach than computing it per kernel. For now, I am allocating 1MB of memory globally and uniquely per kernel, so the memory footprint is proportional to the number of non-SPMD kernels. This size can be adjusted later during the clang-build-select-link phase.

@pdhaliwal-amd
Contributor Author

@JonChesterfield I am allocating 1MB of memory per kernel, and each kernel gets its own memory array, so the device memory footprint is directly proportional to the number of non-SPMD kernels. For example, a program with only one (non-SPMD) kernel would take only 1MB of memory. The device_state var is now only around ~236MB, down from the previous ~2.3GB.
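The scaling can be sanity-checked with a small model. This is a sketch assuming 1MB per non-SPMD kernel; the kernel counts are illustrative, chosen to reproduce the quoted numbers:

```cpp
#include <cassert>
#include <cstddef>

// With this patch, device_state grows linearly with the number of non-SPMD
// kernels in the image, at roughly 1 MiB per kernel.
constexpr size_t PerKernelSlabBytes = 1024 * 1024;

constexpr size_t deviceFootprintBytes(size_t numNonSpmdKernels) {
  return numNonSpmdKernels * PerKernelSlabBytes;
}

// A program with a single non-SPMD kernel takes only 1 MiB; the quoted
// ~236 MB would correspond to an image with ~236 such kernels.
static_assert(deviceFootprintBytes(1) == 1024 * 1024, "one kernel, 1 MiB");
static_assert(deviceFootprintBytes(236) == 236 * 1024 * 1024, "~236 MiB");
```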

@JonChesterfield
Contributor

JonChesterfield commented Aug 17, 2020

Per kernel or per launch? If per kernel, and this is the same underlying structure, I'd guess that's safe, though neither will be able to allocate the full 1mb and the second one to start may get none.

Memory footprint would be dependent on the total number of non-SPMD kernels, not on the number currently running, right? So we'll have a lot of memory that is unused. It seems a test case with a lot of kernels could exceed the fixed 2.3GB we currently use.

If I understand correctly, this memory is used for call frames. I.e. it'll be allocated in stack order. Sometimes we can calculate an upper bound on the state needed, sometimes we can calculate a lower bound. I suspect the best we can do is give each kernel a block of __private memory, sized to that kernel. If we know the upper bound and it's low enough that __private allocation works, we get a kernel that never needs malloc. If the upper bound is unknown or too high, we can do things with malloc/realloc.
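The stack-ordered usage described here can be pictured as a bump allocator over a fixed per-kernel block, with a null return marking the point where a malloc/realloc fallback would kick in. An illustrative model, not code from the runtime:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal stack-order (bump) allocator over a fixed block, modelling how a
// per-kernel data-sharing slab would hand out call frames.
class FrameStack {
  uint8_t *base;
  size_t capacity;
  size_t top = 0;  // bytes currently in use

public:
  FrameStack(uint8_t *block, size_t size) : base(block), capacity(size) {}

  // Push a frame; returns nullptr when the slab is exhausted, which is
  // where a malloc/realloc fallback would take over.
  void *push(size_t bytes) {
    if (top + bytes > capacity)
      return nullptr;
    void *frame = base + top;
    top += bytes;
    return frame;
  }

  // Frames are released in strict LIFO order.
  void pop(size_t bytes) {
    assert(bytes <= top && "pop without matching push");
    top -= bytes;
  }

  size_t used() const { return top; }
};
```

Because frames are released strictly LIFO, pop is just a pointer decrement, which is what makes an upper-bound-sized __private block attractive whenever the bound is known and small enough.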

It seems desirable to only occupy memory on the gpu for the kernels that are running, as then we can handle multiple instances of the same kernel correctly and minimise the footprint.

@pdhaliwal-amd
Contributor Author

pdhaliwal-amd commented Aug 17, 2020

Yes, this is per kernel (emitting one global array per kernel).

It seems possible to use private memory in cases where memory usage is low.

It seems desirable to only occupy memory on the gpu for the kernels that are running, as then we can handle multiple instances of the same kernel correctly and minimise the footprint.

You are right. These changes do not handle multiple instances of the same kernel launched in parallel. The previous implementation had proper locks in place to handle a large number of kernels.
