GNI provider design doc
This set of wiki pages describes the design of the GNI libfabric provider. This document assumes the reader has some familiarity with the libfabric API.
The primary design goals of the GNI libfabric provider are
- deliver high performance for multi-threaded MPI applications on Intel multi-core processors. It is anticipated that HPC MPI applications will need to employ multiple levels of parallelism to realize a significant percentage of the peak performance of Intel KNL and follow-on processors. Consequently, applications will most likely employ shared-memory techniques on node (OpenMP, pthreads), while using MPI to exchange data between processes on different nodes, with a limited number of MPI processes running on the same KNL processor. The GNI libfabric provider is designed to allow MPI to open multiple communication portals (RDM EPs/SEPs) within a given MPI process, reducing contention for internal libfabric resources when multiple application threads access network hardware resources concurrently. This design goal implies thread safety.
- deliver performance within an MPI implementation comparable to that obtained by using GNI directly in the implementation.
- support the tag matching interface.
- support asynchronous progress. The provider's state engine makes progress without the consumer application needing to make calls into libfabric.
- support RDM and MSG EPs. The latter provides the connectivity model typical of data center applications.
- support access to lower level Cray XC Aries hardware features such as the collective engine and hardware performance counters.
- be compatible with checkpointing methodologies such as those based on BLCR.
The GNI libfabric provider will be supported on the following Cray software/hardware stacks: Cray XC systems running CLE 5.2 UP02 or higher. The GNI libfabric provider will not be supported on Cray XE/XK systems. The provider will work with jobs launched using either ALPS or SLURM. It will also work with the ALPS pDomain mechanism and with CCM.
The current plan is to release the provider code as part of the libfabric distribution, which in turn will be released as an OFED component. The GNI provider is available starting with the libfabric 1.2 release.
Reflecting the object-oriented structure of libfabric, the GNI provider is organized around a set of classes. There are four basic types of classes within the GNI provider:
- Top level classes having a one-to-one mapping to libfabric classes: a fabric class, a domain class, an address vector class, etc. These objects all have `fid` in the structure name; the `gnix_fid_domain` class is an example of such an object (a minimal sketch of this layout follows the list).
- Lower level classes that embed one or more GNI handles. Classes in this category are associated with one or more GNI resources; an example is the `gnix_nic` class. These classes are not directly visible to libfabric consumers.
- Transaction management classes used to manage data motion transactions: puts, gets, send/recv, etc. An upper level transaction class maps one-to-one to libfabric data motion requests, while a lower level transaction class maps one-to-one to a GNI Post or SMSG transaction. Multiple lower level transactions may map to a given upper level transaction object.
- Classes that map indirectly to libfabric classes but are not directly associated with a GNI resource. An example is the message matching engine. These classes are essentially independent of the GNI provider and may eventually move into the `common` section of libfabric.
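The following is a minimal sketch of how a top level class can wrap the corresponding libfabric `fid` structure so that the public handle returned to the consumer can be mapped back to the provider's private object. The struct name, the fields after `domain_fid`, and the helper are illustrative assumptions, not the provider's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Hypothetical sketch: the public libfabric object is embedded as the first
 * member, so the handle handed to the consumer can be converted back to the
 * provider's private structure with container_of-style arithmetic. */
struct gnix_fid_domain_sketch {
	struct fid_domain domain_fid;   /* public libfabric domain object   */
	struct fid_fabric *fabric;      /* parent fabric of this domain     */
	uint32_t rdma_credentials;      /* illustrative provider-only state */
};

static inline struct gnix_fid_domain_sketch *
to_gnix_domain(struct fid_domain *domain)
{
	/* Recover the provider object from the public handle. */
	return (struct gnix_fid_domain_sketch *)
		((char *)domain - offsetof(struct gnix_fid_domain_sketch, domain_fid));
}
```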
The GNI provider must implement a set of top level classes (a consumer-side sketch showing how these objects are instantiated follows the list):
- `gnix_fid_fabric` - GNI provider implementation of the fabric class. Event queues and domain objects can be created from a `gnix_fid_fabric` object. The `ops_open` method for this class can be used to adjust properties that are inherited by subsequently created domain and EP objects, including the number of GNI datagrams (directed and wildcard) that will be allocated for libfabric EPs instantiated using domain objects, as well as datagram timeouts.
- `gnix_fid_domain` - GNI provider implementation of the domain class. A domain object has a single parent fabric object. A `gnix_fid_domain` object can be associated with one or more `gnix_nic` objects. A `gnix_fid_domain` is associated with a given set of GNI RDMA credentials.
- `gnix_fid_ep` - GNI provider implementation of the EP class. Currently only the FI_EP_RDM type is being actively developed. A `gnix_fid_ep` inherits the RDMA credentials of the `gnix_fid_domain` object used to create it. It is associated with a single `gnix_cm_nic`; the GNI device address/cdm_id of this nic defines the address of the endpoint. Likewise, a `gnix_fid_ep` is associated with a single `gnix_nic`. Multiple `gnix_fid_ep` objects may be associated with the same `gnix_nic`. Per the libfabric design, a `gnix_fid_ep` can be associated with a send queue, receive queue, completion queue, and, for FI_EP_RDM EPs, an address vector.
- `gnix_fid_eq` - GNI provider implementation of the event queue class. This will eventually be used for connection management when FI_EP_MSG EPs are implemented. It can also be used for error notifications reported by `GNI_WaitErrorEvents`, as well as for non-blocking address vector insert operations and memory registration.
- `gnix_fid_cq` - GNI provider implementation of the completion queue class.
- `gnix_fid_av` - GNI provider implementation of the address vector class.
- `gnix_fid_mem_desc` - GNI provider implementation of the memory region class. Note that `gnix_fid_mem_desc` is somewhat different from the other top level classes in that it has an embedded GNI handle - a GNI memory handle (memhndl).
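The consumer-side sketch below shows how these top level objects are created through the standard libfabric 1.x API; error handling is abbreviated and the hints shown are illustrative rather than a complete or authoritative configuration.

```c
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

int open_gni_objects(void)
{
	struct fi_info *hints = fi_allocinfo(), *info = NULL;
	struct fid_fabric *fabric;
	struct fid_domain *domain;
	struct fid_ep *ep;
	struct fid_cq *cq;
	struct fid_av *av;
	struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_TAGGED };
	struct fi_av_attr av_attr = { .type = FI_AV_MAP };
	int ret;

	hints->ep_attr->type = FI_EP_RDM;              /* connectionless RDM EPs   */
	hints->fabric_attr->prov_name = strdup("gni"); /* request the GNI provider */

	ret = fi_getinfo(FI_VERSION(1, 2), NULL, NULL, 0, hints, &info);
	if (ret)
		return ret;

	ret = fi_fabric(info->fabric_attr, &fabric, NULL); /* gnix_fid_fabric */
	ret = fi_domain(fabric, info, &domain, NULL);      /* gnix_fid_domain */
	ret = fi_endpoint(domain, info, &ep, NULL);        /* gnix_fid_ep     */
	ret = fi_cq_open(domain, &cq_attr, &cq, NULL);     /* gnix_fid_cq     */
	ret = fi_av_open(domain, &av_attr, &av, NULL);     /* gnix_fid_av     */

	ret = fi_ep_bind(ep, &cq->fid, FI_SEND | FI_RECV);
	ret = fi_ep_bind(ep, &av->fid, 0);
	ret = fi_enable(ep);

	fi_freeinfo(info);
	fi_freeinfo(hints);
	return ret;
}
```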
Additional top level classes that will eventually be implemented include a completion counter class and a wait object class.
The upper level classes use lower level classes to interface to GNI; a structural sketch of these lower level classes follows the list below.
- `gnix_cm_nic` - class that defines the provider address for FI_EP_RDM endpoints. Each `gnix_cm_nic` has a unique address consisting of GNI RDMA credentials/nic address/cdm_id. Objects of this class are also associated with a GNI datagram subsystem used to exchange the data needed to set up GNI SMSG connections. A `gnix_fid_domain` created with `ep_attr` type FI_EP_RDM is associated with a `gnix_cm_nic` object. `gnix_fid_domain`s with the same GNI RDMA credentials may share a `gnix_cm_nic` to conserve hardware resources.
- `gnix_nic` - class that maps to a GNI FMA descriptor. A given `gnix_nic` has a particular set of GNI RDMA credentials. When a `gnix_fid_ep` object is created, it is associated with a `gnix_nic` whose GNI RDMA credentials match those it inherited from the domain object used to create it. If no `gnix_nic` with these RDMA credentials exists, one is created. Each `gnix_fid_ep` is associated with one `gnix_nic`. For bookkeeping purposes, a `gnix_nic` is also associated with one or more domains. Once all domains referencing a given `gnix_nic` are destroyed, the `gnix_nic` is also destroyed. The relation of the `gnix_cm_nic` and `gnix_nic` classes to the upper level `gnix_fid_domain` and `gnix_fid_ep` classes is depicted below.
- `gnix_vc` - class for managing GNI SMSG based messaging. This class embeds a GNI endpoint. For FI_EP_RDM type endpoints, a vector (or hash table) of `gnix_vc` objects is maintained; in this case the SMSG connections are established using the `gnix_cm_nic` associated with the EP. A `gnix_vc` object has an associated send queue for fabric level send requests, a send queue for tagged send requests, and a queue for fenced RMA/atomic ops. The relationship between `gnix_vc` objects and upper level GNI provider objects is illustrated below.
- `gnix_tme` - class for managing fabric-level receive requests, whether tagged or untagged. An object of this class maintains a posted receive queue and an unexpected queue. For FI_EP_MSG EPs, a single `gnix_tme` object is used, while a table of `gnix_tme` objects is associated with an FI_EP_RDM type EP. The relationship of a `gnix_tme` object to a `gnix_fid_ep` object and a table of `gnix_vc` objects is illustrated below.
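The structural sketch below shows how the lower level classes might reference uGNI handles and each other; the field names and layout are illustrative assumptions, not the provider's actual definitions (only the uGNI types from `gni_pub.h` are real).

```c
#include <stdint.h>
#include "gni_pub.h"  /* uGNI types: gni_nic_handle_t, gni_ep_handle_t, gni_cq_handle_t */

/* Illustrative sketch of the lower level classes described above. */
struct gnix_cm_nic_sketch {
	gni_nic_handle_t gni_nic_hndl; /* CDM/NIC used for datagram exchange      */
	uint32_t device_addr;          /* Aries NIC address                       */
	uint32_t cdm_id;               /* with the RDMA credentials: the address  */
};

struct gnix_nic_sketch {
	gni_nic_handle_t gni_nic_hndl; /* GNI NIC handle (FMA descriptor)          */
	gni_cq_handle_t tx_cq;         /* TX CQ for locally initiated transactions */
	gni_cq_handle_t rx_cq;         /* RX CQ for incoming SMSG traffic          */
};

struct gnix_vc_sketch {
	gni_ep_handle_t gni_ep;        /* bound GNI endpoint backing this SMSG connection */
	struct gnix_nic_sketch *nic;   /* gnix_nic carrying this VC's traffic             */
	/* plus queues of pending fabric sends, tagged sends, fenced RMA/atomics          */
};
```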
The wide variety of transactions defined by the libfabric API - RDMA read/write, send/recv, tagged send/recv, and atomic operations - cannot be mapped directly to the underlying `GNI_PostFma`, `GNI_PostRdma`, and `GNI_SmsgSend` primitives. In the general case, a given libfabric transaction may involve a sequence of GNI transactions. As a result, there need to be upper level transaction management classes that map directly to libfabric transactions, and an underlying class which maps to GNI transactions. The `gnix_fab_req` class is used to track libfabric level transactions issued by the libfabric consumer. Individual `GNI_Post(Fma/Rdma)` and `GNI_SmsgSend` requests are tracked using objects of the `gnix_tx_descriptor` class. A free list of `gnix_tx_descriptor` objects is allocated for each `gnix_nic`. The number of these descriptors is limited in order to avoid overrunning the GNI TX CQ associated with the `gnix_nic` object. In contrast, the number of outstanding `gnix_fab_req` objects for a given domain is potentially unbounded, especially if target consumers of the provider are unable to handle a return of `-FI_EAGAIN` from libfabric transaction calls.
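The sketch below illustrates the bounded per-NIC TX descriptor pool versus the potentially unbounded fabric-level requests; the names, the pool size, and the free-list scheme are illustrative assumptions rather than the provider's actual implementation.

```c
#include <stddef.h>

#define TX_DESC_POOL_SIZE 2048  /* illustrative; sized so the GNI TX CQ cannot be overrun */

struct tx_descriptor_sketch {
	struct tx_descriptor_sketch *next; /* free-list linkage                           */
	void *fab_req;                     /* back-pointer to the owning fabric request   */
};

struct nic_sketch {
	struct tx_descriptor_sketch pool[TX_DESC_POOL_SIZE];
	struct tx_descriptor_sketch *free_head;
};

/* A fabric-level request only consumes a TX descriptor when a GNI transaction
 * is actually posted; if the pool is empty the request stays queued. */
static struct tx_descriptor_sketch *alloc_tx_desc(struct nic_sketch *nic)
{
	struct tx_descriptor_sketch *d = nic->free_head;
	if (d)
		nic->free_head = d->next;
	return d; /* NULL => caller re-queues the fabric request and retries later */
}

static void free_tx_desc(struct nic_sketch *nic, struct tx_descriptor_sketch *d)
{
	d->next = nic->free_head; /* returned when the corresponding GNI CQE is reaped */
	nic->free_head = d;
}

static void init_tx_desc_pool(struct nic_sketch *nic)
{
	nic->free_head = NULL;
	for (size_t i = 0; i < TX_DESC_POOL_SIZE; i++)
		free_tx_desc(nic, &nic->pool[i]);
}
```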
The libfabric FI_EP_RDM endpoint type is, from the libfabric consumer's standpoint, connectionless. However, because GNI SMSG is connection oriented, and is the most efficient way for sending modest size messages (up to a few KBs in size), the GNI provider must do connection management internally for this endpoint type.
The GNI SMSG connections required to support the FI_EP_RDM endpoint type are termed virtual connections (VCs) in the implementation. The task of managing VC connection setup is divided among three distinct subsystems of the GNI provider: the VC management code, which implements the logic for initiating connection requests and responses; the `gnix_cm_nic` object, which provides a send/recv protocol used by the VC management code; and an underlying GNI datagram management system which interfaces to `GNI_EpPostData` and related GNI functions.
The GNI datagram subsystem provides a simple interface for posting and dequeuing GNI datagrams. Its methods do not attempt to hide the exchange nature of the kgni GNI protocol. It provides methods for allocating and releasing datagram structures, and methods for posting them either as bound datagrams (those that will be sent via the kgni protocol to a particular peer) or wildcard datagrams. Methods are also provided for packing/unpacking the IN/OUT buffers associated with a datagram. Callback functions can be associated with datagrams. These callbacks are invoked by the datagram subsystem at various points in either posting datagrams to or dequeuing datagrams from kgni.
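A hypothetical sketch of what such a datagram interface could look like follows; the function names, types, and callback signature are illustrative assumptions, not the provider's actual API.

```c
#include <stdint.h>
#include <sys/types.h>

struct gnix_datagram_sketch;

/* Callback invoked by the datagram subsystem at various points while posting
 * to or dequeuing from kgni (all names here are hypothetical). */
typedef int (*dgram_callback_fn)(struct gnix_datagram_sketch *dgram,
                                 void *context, int state);

int dgram_alloc(struct gnix_datagram_sketch **dgram_ptr);
int dgram_free(struct gnix_datagram_sketch *dgram);

/* Post to a specific peer (bound) or as a wildcard that matches any peer. */
int dgram_bound_post(struct gnix_datagram_sketch *dgram,
                     uint32_t peer_addr, uint32_t peer_cdm_id);
int dgram_wildcard_post(struct gnix_datagram_sketch *dgram);

/* Pack data into the IN buffer / unpack data from the OUT buffer. */
ssize_t dgram_pack_buf(struct gnix_datagram_sketch *dgram,
                       const void *data, uint32_t nbytes);
ssize_t dgram_unpack_buf(struct gnix_datagram_sketch *dgram,
                         void *data, uint32_t nbytes);

int dgram_set_callback(struct gnix_datagram_sketch *dgram,
                       dgram_callback_fn fn, void *context);
```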
At a high level, the approach to establishing VCs is straightforward. When a process wants to initiate a send operation to a peer, a GNI datagram is constructed with information about the GNI SMSG mailbox which has been allocated at the process to allow messaging with the peer. This datagram is then posted to the local kgni device driver using the send method of the `gnix_cm_nic` associated with the endpoint's domain. The peer may also be trying to establish a connection with the process, in which case these datagrams with explicit peer addresses will likely be matched. To support asymmetric communication patterns, each process also posts a number of wildcard datagrams with GNI SMSG mailbox information.
The GNI datagram protocol is somewhat difficult to use for generalized communication schemes. `gnix_cm_nic` objects provide a more practical send/recv protocol which is easier to use. A non-blocking send method is provided to send messages to a remote `gnix_cm_nic`. Prior to using a `gnix_cm_nic`, a consumer of its functionality registers a callback function for receiving incoming messages. Out-of-band synchronization is not required when setting up a `gnix_cm_nic`, as incoming datagrams will remain queued in the kernel until matching datagrams are posted via the GNI datagram subsystem. A `gnix_cm_nic` does not use wildcard datagrams to exchange data (such as GNI mailbox information), but rather only to accept unexpected incoming messages from other nodes.
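A hypothetical sketch of this send/recv interface is shown below; the names and signatures are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

struct gnix_cm_nic_sketch;

/* Callback invoked for each message arriving at this cm_nic; registered once
 * before the cm_nic is used (all names here are hypothetical). */
typedef int (*cm_nic_rcv_cb_fn)(struct gnix_cm_nic_sketch *cm_nic,
                                const void *msg, size_t msg_len,
                                uint32_t src_addr, uint32_t src_cdm_id);

int cm_nic_reg_rcv_callback(struct gnix_cm_nic_sketch *cm_nic,
                            cm_nic_rcv_cb_fn cb);

/* Non-blocking send to a remote cm_nic identified by NIC address and cdm_id;
 * may report "try again" when datagram resources are temporarily exhausted. */
int cm_nic_send(struct gnix_cm_nic_sketch *cm_nic,
                const void *msg, size_t msg_len,
                uint32_t dest_addr, uint32_t dest_cdm_id);

/* Drive the underlying datagram exchange so queued sends and receives progress. */
int cm_nic_progress(struct gnix_cm_nic_sketch *cm_nic);
```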
Owing to the need to conserve Aries hardware resources, multiple local FI_EP_RDM endpoints may use a single `gnix_cm_nic` object to exchange messages with remote `gnix_cm_nic` objects. To facilitate this multiplexing, when an endpoint is instantiated and associated with a `gnix_cm_nic`, a pointer to the corresponding `gnix_fid_ep` object is inserted into a small hash table attached to the `gnix_cm_nic` object, with the endpoint's unique address (GNI address/RDMA credentials/cdm id) used as the key. This facilitates processing of incoming connection requests by the VC management code.
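The key used for this demultiplexing could be sketched as follows; the struct layout and hash function are illustrative assumptions.

```c
#include <stdint.h>

/* Illustrative address key used to route incoming connection requests
 * arriving at a shared cm_nic back to the owning endpoint. */
struct ep_addr_key_sketch {
	uint32_t gni_device_addr;  /* Aries NIC address            */
	uint32_t cdm_id;           /* communication domain id      */
	uint32_t rdma_credentials; /* protection tag / credentials */
};

/* Simple illustrative hash over the key; any hash table keyed on these
 * three fields would serve. */
static inline uint64_t ep_addr_hash(const struct ep_addr_key_sketch *key)
{
	uint64_t h = ((uint64_t)key->gni_device_addr << 32) | key->cdm_id;
	return h ^ ((uint64_t)key->rdma_credentials * 0x9e3779b97f4a7c15ULL);
}
```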
The logic to manage the actual setup of VCs is contained in the VC management code. It employs a connection request/response protocol to set up a connection. The VC management code binds a receive callback function to a `gnix_cm_nic` when a `gnix_fid_ep` object is enabled via a `fi_enable` call. When a data transfer operation is initiated to a remote address (`fi_addr_t`), a hash table (or vector) of VCs is queried for an existing VC. If none is found, a VC connection request is constructed and sent, using the `gnix_cm_nic` send method, to the `gnix_cm_nic` associated with the target `gnix_fid_ep`. Depending on the communication pattern, it is possible that the remote `gnix_fid_ep` is simultaneously trying to initiate a connection request to this endpoint. The flowchart below illustrates the steps taken to transition a VC from unconnected to connected, both for the case of matching connection requests and for a request/response pattern.
The procedure followed by the receive callback function for processing incoming connection request/response messages is shown in the flowchart below.
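A rough sketch of the lookup-or-connect step described above is shown below; the state names, helper functions, and data structures are hypothetical, and locking and error handling are omitted.

```c
#include <stddef.h>
#include <rdma/fabric.h> /* fi_addr_t */

/* Hypothetical VC states; the provider's actual state set may differ. */
enum vc_state_sketch { VC_UNCONNECTED, VC_CONNECTING, VC_CONNECTED };

struct vc_sketch {
	enum vc_state_sketch state;
	fi_addr_t peer;
	/* SMSG mailbox info, queues of pending sends, etc. */
};

/* Assumed helpers standing in for provider internals: */
struct vc_sketch *vc_table_lookup(fi_addr_t dest);
struct vc_sketch *vc_alloc(fi_addr_t dest);
int cm_nic_send_conn_req(struct vc_sketch *vc); /* connection request via the cm_nic send method */
int vc_queue_pending_send(struct vc_sketch *vc, const void *buf, size_t len);
int vc_smsg_send(struct vc_sketch *vc, const void *buf, size_t len);

/* Invoked when a send targets dest_addr: reuse a connected VC if one exists,
 * otherwise start connection setup and queue the request for later. */
static int send_via_vc(fi_addr_t dest_addr, const void *buf, size_t len)
{
	struct vc_sketch *vc = vc_table_lookup(dest_addr);

	if (!vc) {
		vc = vc_alloc(dest_addr);
		vc->state = VC_CONNECTING;
		cm_nic_send_conn_req(vc); /* may cross with the peer's own request */
	}

	if (vc->state != VC_CONNECTED)
		return vc_queue_pending_send(vc, buf, len);

	return vc_smsg_send(vc, buf, len);
}
```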
libfabric 1.0 currently supports two progression models - FI_PROGRESS_AUTO and FI_PROGRESS_MANUAL. A distinction is made in libfabric between progression of control and data operations. Providers must support FI_PROGRESS_AUTO for both types of operations, but should also support FI_PROGRESS_MANUAL if requested by the application in a `fi_getinfo` call.
The GNI provider will rely on a combination of Aries network hardware capabilities and helper threads to implement FI_PROGRESS_AUTO. The helper threads will optionally be able to run on the cores reserved by the Cray Linux Environment (CLE) corespec feature. Data operation helper threads will be blocked in either `GNI_CqVectorMonitor` or a `poll` system call using the `fd` returned from a call to `GNI_CompChanCreate`. Control operations for the GNI provider almost exclusively involve connection management operations - either those explicitly requested by the application for FI_EP_MSG endpoint types, or those handled via internal VCs in the case of FI_EP_RDM endpoint types. For the former, connection management will take place in the kernel, assuring progress of connection setup for message endpoint types. For the latter, a progress thread will be allocated for each `gnix_cm_nic` and will be blocked in a `GNI_PostdataProbeWaitById` call.
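A minimal sketch of such a per-cm_nic control-progress helper thread is shown below; `wait_for_incoming_datagram()` and `progress_cm_nic()` are assumed stand-ins for provider internals (the former would block in `GNI_PostdataProbeWaitById`), and the remaining names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

struct gnix_cm_nic_sketch;

/* Assumed helpers (not real provider functions): */
int wait_for_incoming_datagram(struct gnix_cm_nic_sketch *cm_nic); /* blocks in GNI_PostdataProbeWaitById */
int progress_cm_nic(struct gnix_cm_nic_sketch *cm_nic);            /* runs datagram/VC state machines     */

struct cm_nic_progress_ctx {
	struct gnix_cm_nic_sketch *cm_nic;
	volatile bool shutting_down;
};

/* Thread body intended to be started with pthread_create(); it could be
 * pinned to a corespec-reserved core via a CPU affinity mask. */
static void *cm_nic_progress_thread(void *arg)
{
	struct cm_nic_progress_ctx *ctx = arg;

	while (!ctx->shutting_down) {
		/* Block until kgni indicates datagram activity. */
		wait_for_incoming_datagram(ctx->cm_nic);
		/* Drive connection setup without consumer involvement, as
		 * FI_PROGRESS_AUTO requires for control operations. */
		progress_cm_nic(ctx->cm_nic);
	}
	return NULL;
}
```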
According to the fi_domain man page, the GNI provider will need to ensure progression when the application makes calls to read or wait on event queues, completion queues, and counters.
For event queues, the GNI provider will progress control operations. Note that currently, although the EQ object is implemented in the GNI provider, it is not being used - there are no GNI provider bind operations which will currently succeed in binding an EQ to another libfabric object.
GNI data operations will need to be progressed when an application waits on or reads completion queue objects (`gnix_fid_cq`) or counter objects. Counter objects are currently not implemented in the GNI provider. In order to progress data operations, the underlying `gnix_nic`'s TX and RX CQs will need to be polled, and any backlog of queued data control operations will need to be progressed. Since one of the design goals of the GNI provider is to provide good thread-hot support, progression needs to be done in a way that avoids contention for locks within the provider when, for example, multiple threads may be reading from different `gnix_fid_cq`s or counters. The approach currently under consideration for progressing `gnix_fid_cq` objects is to maintain a linked list of all underlying `gnix_nic`s associated with the `gnix_fid_cq`. Elements are added to this list as the `gnix_fid_cq` object is bound to different `gnix_fid_ep`s. As part of the libfabric level CQ read operation, the underlying GNI CQ associated with each of these `gnix_nic` objects is polled using `GNI_CqGetEvent`, and the resulting GNI CQEs are processed. When counter objects are implemented, a similar approach can be taken. With this approach, a thread-hot MPI implementation, for example, can create a separate libfabric level CQ for each FI_EP_RDM endpoint it creates. When different threads poll different `gnix_fid_cq` objects, there should be no contention between threads within the provider. This use case is illustrated in the following diagram:
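Complementing that use case, the sketch below shows a consumer-side thread-hot pattern: each thread owns its own FI_EP_RDM endpoint bound to its own CQ and polls it independently via `fi_cq_read`. Endpoint and CQ creation are as in the earlier example and are omitted here.

```c
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Per-thread resources: one FI_EP_RDM endpoint bound to its own CQ, so that
 * polls from different threads land on different gnix_fid_cq objects and,
 * per the design above, on disjoint provider-internal state. */
struct per_thread_res {
	struct fid_ep *ep; /* created with fi_endpoint() against a shared domain */
	struct fid_cq *cq; /* created with fi_cq_open() and bound to this ep     */
};

/* Thread entry point, e.g. started via pthread_create(). */
static void *poll_thread(void *arg)
{
	struct per_thread_res *res = arg;
	struct fi_cq_tagged_entry comp;
	ssize_t n;

	for (;;) {
		/* fi_cq_read() progresses the gnix_nic(s) linked to this CQ. */
		n = fi_cq_read(res->cq, &comp, 1);
		if (n == 1) {
			/* ... hand the completion to the application/MPI layer ... */
		} else if (n != -FI_EAGAIN) {
			break; /* error or -FI_EAVAIL: read the error entry / report */
		}
	}
	return NULL;
}
```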