
shmem_malloc Interface to Leverage Hierarchical & Heterogeneous Memory Characteristics #258

Closed
manjugv opened this issue Nov 26, 2018 · 12 comments

Comments

@manjugv
Collaborator

manjugv commented Nov 26, 2018

Problem:

A typical node in current HPC systems is composed of a variety of memories, organized into multiple hierarchies and/or with different affinities to the PEs and threads. The OpenSHMEM programming model and its memory allocation routines are oblivious to these variations. As a consequence, it is a challenge for an OpenSHMEM program to leverage memory characteristics and capabilities to achieve higher performance in a portable way.

Proposal:

Introduce a memory allocation interface that can pass usage hints to the OpenSHMEM implementation.
The implementation can then use these hints to optimize the allocation for the stated usage.
For example, if the user specifies that a particular allocation will be used as a pSync array, the implementation can place it in a NUMA memory bank close to the network interface; where memory is available on the network interface itself, it can allocate the pSync array there. This can improve latency characteristics.
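For illustration, an allocation with a hint might look like the sketch below. The routine name and hint constant are placeholders (the concrete interface is defined in the accompanying proposal); the pSync sizing constants are standard OpenSHMEM.

#include <shmem.h>

/* Hypothetical prototype and hint constant; illustrative only. */
void *shmem_malloc_hint(size_t size, int hint);
#define SHMEM_HINT_PSYNC 0x1

int main(void) {
    shmem_init();
    /* Hint that this allocation is a pSync array, so the
     * implementation may place it in network-near memory. */
    long *pSync = shmem_malloc_hint(SHMEM_BCAST_SYNC_SIZE * sizeof(long),
                                    SHMEM_HINT_PSYNC);
    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE;
    shmem_barrier_all(); /* all PEs initialize pSync before first use */
    /* ... use pSync in collectives ... */
    shmem_free(pSync);
    shmem_finalize();
    return 0;
}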

Impact on Users:

These interfaces give the user an opportunity to convey usage information to the implementation, which the implementation can then use to optimize for that usage. Where the implementation does optimize, programs should achieve higher performance and/or scalability.

OpenSHMEM programs that do not use these interfaces, or that use SHMEM_HINT_NONE, are not impacted.

Impact on Implementations:

This gives implementations an opportunity to optimize an allocation for a particular usage.
An implementation that does not support a given optimization is allowed to fall back to the default shmem_malloc behavior.

Useful References

  1. https://software.intel.com/sites/default/files/managed/5f/5e/MCDRAM_Tutorial.pdf
  2. SharP Unified Memory Allocator: https://www.osti.gov/biblio/1468045
  3. mbind: https://www.kernel.org/doc/html/v4.18/admin-guide/mm/numa_memory_policy.html
@naveen-rn
Contributor

How is this different from #195? It looks like we are trying to address the same issue.

@manjugv
Collaborator Author

manjugv commented Nov 26, 2018

@naveen-rn

Naveen, I knew you would ask this. :)

The main difference is that there is only one symmetric heap here (no change to the symmetric address model), and the complexity of memory management is handled by the library. It is a small change (adding only one interface), and we get most of the benefits. The approach in #195 is more explicit, while here it is more implicit. That said, I feel there is value in having both solutions, and they can co-exist.

@jamesaross
Contributor

So this proposal is for something like this?

void* shmem_malloc_hint(size_t size, int hint);

Where hint is something like SHMEM_HINT_IS_PSYNC?

@manjugv
Collaborator Author

manjugv commented Dec 6, 2018

@jamesaross Correct.

@jamesaross
Contributor

I see now that you already added a pull request #259 but I'll continue with the discussion here.

Are these types, as identified in #259, sufficient for all use cases: LOW_LAT_MEM, HIGH_BW_MEM, NEAR_NIC_MEM, DEVICE_GPU_MEM, DEVICE_NIC_MEM? There are probably dozens of device libraries that would have to be linked against or dlopened. The proposal seems to imply runtime device querying/identification for many different device vendors. Can we reasonably expect every OpenSHMEM implementation to have special memory allocators for every device that could be attached to a node? This is a lot of work. It also seems to make the implementation more fragile: it would need to be updated every time a new device/API becomes available.

Why put this significant burden on the OpenSHMEM implementer when it's the application developer that has a specific allocator and/or physical memory location in mind?

What do you think about alternative interfaces like these?

void* shmem_malloc_ptr(size_t size, void* (*ptr_malloc)(size_t));
void shmem_free_ptr(void* ptr, void (*ptr_free)(void*));

Application developers should just say what they want.
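As an illustration, usage might look like the following. Here my_pinned_malloc/my_pinned_free are hypothetical stand-ins for whatever allocator the application prefers (e.g., a wrapper around a device or NUMA allocator); the shmem_malloc_ptr/shmem_free_ptr prototypes are the ones proposed above.

#include <stdlib.h>
#include <shmem.h>

/* Proposed prototypes from above: */
void *shmem_malloc_ptr(size_t size, void *(*ptr_malloc)(size_t));
void  shmem_free_ptr(void *ptr, void (*ptr_free)(void *));

/* Hypothetical application-chosen allocator pair. */
static void *my_pinned_malloc(size_t size) { return malloc(size); }
static void  my_pinned_free(void *ptr)     { free(ptr); }

void example(size_t nelems) {
    /* The application, not the library, decides where the memory
     * comes from by supplying its own allocator. */
    long *buf = shmem_malloc_ptr(nelems * sizeof(long), my_pinned_malloc);
    /* ... use buf ... */
    shmem_free_ptr(buf, my_pinned_free);
}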

@naveen-rn
Contributor

@jamesaross If I understand correctly, you are expecting users to create the heap and pass its address to the OpenSHMEM implementation. If so, this looks more like MPI windows (MPI_Win_create). To me, this would create an unnecessary burden on OpenSHMEM implementations, which would have to maintain these base addresses, register them, and perform lookup operations.

In @manjugv's proposal, these are hints to the library. It is not mandatory for all implementations to support all memory types.

@jamesaross
Contributor

@naveen-rn How does the current proposal get around the OpenSHMEM implementation creating a device heap and maintaining device addresses for every conceivable device? Also, a lookup for the default case is trivial if the implementation is clever about it. The address returned from shmem_malloc_ptr could be padded appropriately with metadata.
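A sketch of that padding trick (illustrative only; alignment handling is simplified):

#include <stdlib.h>

/* Stash the matching free function just before the pointer handed to
 * the user, so the later lookup costs a single header read. */
typedef struct {
    void (*free_fn)(void *);
} alloc_header_t;

static void *malloc_with_header(size_t size,
                                void *(*alloc_fn)(size_t),
                                void (*free_fn)(void *)) {
    alloc_header_t *h = alloc_fn(sizeof(alloc_header_t) + size);
    if (h == NULL) return NULL;
    h->free_fn = free_fn;
    return h + 1; /* user pointer starts just past the header */
}

static void free_with_header(void *ptr) {
    alloc_header_t *h = (alloc_header_t *)ptr - 1;
    h->free_fn(h); /* free with the allocator that created it */
}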

@naveen-rn
Contributor

@jamesaross My understanding of this proposal is: implementations will pin/register a single big chunk of memory as the SHEAP at some point before the actual allocation (maybe at shmem_init()). The total size of the SHEAP is fixed. Let us assume that this memory spans multiple NUMA nodes (maybe an INTERLEAVED mmap). With the hints in this call, implementations can select a particular memory block (if possible) during the actual allocation. There is no new registration during allocation, and only an offset calculation is necessary when we perform RMA/AMO operations on this memory. This is where it differs from #195: there we need to perform both SHEAP identification and offset calculation, since we have multiple SHEAPs.
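To illustrate that last point (names here are illustrative, not from any implementation): with a single registered SHEAP, translating a local symmetric address to a remote one needs only an offset, with no heap-identification step.

#include <stddef.h>

/* One registered symmetric heap per PE. */
typedef struct {
    char  *base;       /* local base of the single SHEAP           */
    char **peer_bases; /* per-PE remote bases, exchanged at startup */
} sheap_t;

/* With one heap there is no heap lookup, only an offset calculation: */
static inline void *remote_addr(const sheap_t *h, const void *local, int pe) {
    ptrdiff_t offset = (const char *)local - h->base;
    return h->peer_bases[pe] + offset;
}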

I haven't thought through possible usages for all the hints mentioned in this proposal, but hints like SHMEM_HINT_PSYNC, SHMEM_HINT_PWORK, and SHMEM_HINT_ATOMICS can be used effectively at the shmem_malloc operation.

@jamesaross
Contributor

HPC application portability is rarely defined by the small burden of replacing an allocator or swapping it with a macro. If it's expected that most implementations won't bother with supporting most hint types and the implementations that do will support a very specific subset of devices, wouldn't it be simpler to have vendor-specific special allocator extensions?

Below is an example of portable code with a vendor-specific special allocator.

#include "shmemx.h"
#if SHMEMX_SPECIAL_ALLOCATOR_AVAILABLE
#define shmem_malloc_special(size) shmemx_malloc_special((size), SHMEM_HINT_IS_PSYNC)
#else
#define shmem_malloc_special(size) shmem_malloc((size))
#endif
// ...
// this is now portable:
size_t sz = log(shmem_n_pes()) + 2;
int* pSync = shmem_malloc_special(sz);

@manjugv
Collaborator Author

manjugv commented Jan 3, 2019

Not sure if I understand your point entirely.

With the interface in this proposal, there is a two-way communication and agreement: (1) the user tells the library that a particular allocation will be used in a specific way; (2) the library uses that information and optimizes for that usage. If the user keeps that promise and the library can optimize, there will be performance benefits for applications.

I disagree that it is a huge burden to implement. The network libraries can already support some of these hints, and today there is no way to pass this information along for the user's benefit. Also, most of these hints are easy to implement with wrappers, without any need for fancy allocators. For example, one could use Memkind to support many of these hints. I'm open to trimming some of these hints if we find one that is terribly difficult to implement and does not provide large benefits. Again, remember that supporting hints is optional.
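For instance, a hypothetical mapping of a bandwidth hint onto memkind might look like this (the SHMEM_HINT_ prefix on the HIGH_BW_MEM name from #259 is my assumption; the mapping itself is only a sketch):

#include <memkind.h>

static void *alloc_with_hint(size_t size, int hint) {
    switch (hint) {
    case SHMEM_HINT_HIGH_BW_MEM: /* assuming this name from PR #259 */
        /* Prefer high-bandwidth (e.g., MCDRAM) memory; memkind falls
         * back to DDR if none is available. */
        return memkind_malloc(MEMKIND_HBW_PREFERRED, size);
    default:
        /* Unsupported hints fall back to the default heap, matching
         * the plain shmem_malloc behavior. */
        return memkind_malloc(MEMKIND_DEFAULT, size);
    }
}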

@jdinan
Collaborator

jdinan commented Jan 31, 2020

@manjugv Was this closed by #259?

@manjugv
Collaborator Author

manjugv commented Jan 31, 2020

Yes @jdinan. Closing it now.

manjugv closed this as completed Jan 31, 2020