GNI Libfabric Provider design

This set of wiki pages describes the design of the GNI libfabric provider. This document assumes the reader has some familiarity with the libfabric API.

Design Goals

The primary design goals of the GNI libfabric provider are

  • deliver high performance for multi-threaded MPI applications on Intel multi-core processors. It is anticipated that HPC MPI applications will need to employ multiple levels of parallelism to realize a significant percentage of the peak performance of Intel KNL and follow-on processors. Consequently, applications will most likely employ shared memory techniques on node (OpenMP, pthreads), while using MPI to exchange data between processes on different nodes, and with a limited number of processes running on the same KNL processor. The GNI libfabric provider will be designed to allow MPI to open multiple communication portals (RDM EPs/SEPs) within a given MPI process, reducing contention for libfabric internal resources when multiple application threads are accessing network hardware resources concurrently. This design goal implies thread safety.
  • deliver performance within an MPI implementation comparable to the performance obtained when the implementation uses GNI directly
  • support the tag matching interface.
  • support asynchronous progress. The provider's state engine makes progress without the consumer application needing to make calls into libfabric.
  • support RDM and MSG EPs. The latter provide the connectivity semantics typical of data center applications.
  • support access to lower level Cray XC Aries HW features like the collective engine and hardware performance counters
  • be compatible with checkpointing methodologies such as those based on BLCR.

Design Constraints

The GNI libfabric provider will be supported on the following Cray software/hardware stacks: Cray XC systems running CLE 5.2 UP02 or higher. The GNI libfabric provider will not be supported on Cray XE/XK systems. The provider will work with jobs launched using either ALPS or SLURM, and will also work with the ALPS pDomain mechanism and with CCM.

Release process

The current plan is to release the provider code as part of the libfabric distribution, which in turn will be released as an OFED component. The GNI provider is available starting with the libfabric 1.2 release.

High Level Design - Classes

Reflecting the object oriented structure of libfabric, the GNI provider is organized around a set of classes. There are four basic types of classes within the GNI provider:

  • The first type comprises top level classes that map one-to-one to libfabric classes: a fabric class, a domain class, an address vector class, etc. These objects all have fid in the structure name; the gnix_fid_domain class is an example of such an object.
  • The second, lower level type comprises classes that embed one or more GNI handles and are therefore associated with one or more GNI resources. An example is the gnix_nic class. These classes are not directly visible to libfabric consumers.
  • The third type is used to manage data motion transactions: puts, gets, send/recv, etc. An upper level transaction class maps one-to-one to libfabric data motion requests, while a lower level transaction class maps one-to-one to a GNI Post or SMSG transaction. Multiple lower level transactions may map to a single upper level transaction object.
  • The fourth type maps indirectly to libfabric classes but is not directly associated with a GNI resource. An example of this type is the message matching engine. These classes are essentially independent of GNI and may eventually move into the common section of libfabric.

Top Level Classes

The GNI provider must implement a set of top level classes:

  • gnix_fid_fabric - GNI provider implementation of the fabric class. Event queues and domain objects can be created from a gnix_fid_fabric object. The ops_open method for this class can be used to adjust properties that are inherited by subsequently created domain and EP objects, including the number of GNI datagrams (directed and wildcard) that will be allocated for libfabric EPs instantiated using domain objects, as well as datagram timeouts.
  • gnix_fid_domain - GNI provider implementation of the domain class. A domain object has a single parent fabric object. A gnix_fid_domain object can be associated with one or more gnix_nic objects. A gnix_fid_domain is associated with a given set of GNI RDMA credentials.
  • gnix_fid_ep - GNI provider implementation of the EP class. Currently only the FI_EP_RDM type is being actively developed. A gnix_fid_ep inherits the RDMA credentials of the gnix_fid_domain object used to create it. It is associated with a single gnix_cm_nic; the GNI device address/cdm_id of this nic defines the address of the endpoint. Likewise, a gnix_fid_ep is associated with a single gnix_nic, and multiple gnix_fid_ep objects may be associated with the same gnix_nic. Per the libfabric design, a gnix_fid_ep can be associated with a send queue, receive queue, completion queue, and, for FI_EP_RDM EPs, an address vector.
  • gnix_fid_eq - GNI provider implementation of the event queue class. This will eventually be used for connection management when FI_EP_MSG EPs are implemented. It can also be used for error notifications reported by GNI_WaitErrorEvents, as well as for non-blocking address vector insert operations and memory registration.
  • gnix_fid_cq - GNI provider implementation of the completion queue class.
  • gnix_fid_av - GNI provider implementation of the address vector class.
  • gnix_fid_mem_desc - GNI provider implementation of the memory region class. Note gnix_fid_mem_desc is somewhat different from the other top level classes in that it has an embedded GNI handle - a GNI memory handle (memhndl).

Additional top level classes that will eventually be implemented include a completion counter class and wait object class.
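
To make the mapping concrete, the following is a minimal consumer-side sketch of bringing up these objects for an FI_EP_RDM endpoint while requesting the GNI provider. The attribute values chosen here are illustrative, and error handling and cleanup are omitted.

```c
/* Minimal consumer-side sketch: create the libfabric objects that map to
 * the GNI provider's top level classes.  Error handling omitted. */
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

int open_gni_rdm_endpoint(void)
{
    struct fi_info *hints, *info;
    struct fid_fabric *fabric;   /* -> gnix_fid_fabric */
    struct fid_domain *domain;   /* -> gnix_fid_domain */
    struct fid_ep *ep;           /* -> gnix_fid_ep     */
    struct fid_cq *cq;           /* -> gnix_fid_cq     */
    struct fid_av *av;           /* -> gnix_fid_av     */
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_TAGGED };
    struct fi_av_attr av_attr = { .type = FI_AV_MAP };
    int ret;

    hints = fi_allocinfo();
    hints->ep_attr->type = FI_EP_RDM;                /* connectionless EP type */
    hints->fabric_attr->prov_name = strdup("gni");   /* request the GNI provider */

    ret = fi_getinfo(FI_VERSION(1, 2), NULL, NULL, 0, hints, &info);
    if (ret)
        return ret;

    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_cq_open(domain, &cq_attr, &cq, NULL);
    fi_av_open(domain, &av_attr, &av, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    /* bind the CQ and AV to the EP, then enable it */
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
    fi_ep_bind(ep, &av->fid, 0);
    fi_enable(ep);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```

Each fid_* object returned above is backed by the corresponding gnix_fid_* class listed in this section.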

Lower Level Classes

The upper level classes use lower level classes to interface to GNI.

  • gnix_cm_nic - class that defines the provider address for FI_EP_RDM endpoints. Each gnix_cm_nic has a unique address consisting of GNI RDMA credentials/nic address/cdm_id. Objects of this class are also associated with a GNI datagram subsystem used to exchange the data needed to set up GNI SMSG connections. A gnix_fid_domain created with ep_attr type FI_EP_RDM is associated with a gnix_cm_nic object. gnix_fid_domains with the same GNI RDMA credentials may share a gnix_cm_nic to conserve hardware resources.
  • gnix_nic - class that maps to a GNI FMA descriptor. A given gnix_nic has a particular set of GNI RDMA credentials. When a gnix_fid_ep object is created, it is associated with a gnix_nic whose GNI RDMA credentials match the RDMA credentials it inherited from the domain object used to create it. If no gnix_nic with these RDMA credentials exists, one is created. Each gnix_fid_ep is associated with one gnix_nic. For purposes of bookkeeping, a gnix_nic is also associated with one or more domains. Once all domains referencing a given gnix_nic are destroyed, the gnix_nic is also destroyed. The relationship of the gnix_cm_nic and gnix_nic classes to the upper level gnix_fid_domain and gnix_fid_ep classes is depicted below.
  • gnix_vc - class for managing GNI SMSG based messaging. This class embeds a GNI endpoint. For FI_EP_RDM type endpoints, a vector (or hash table) of gnix_vc objects is maintained. In this case the SMSG connections are established using the gnix_cm_nic associated with the EP. A gnix_vc object has an associated send queue for fabric level send requests, a send queue for tagged send requests, and a queue for fenced RMA/atomic ops. The relationship between gnix_vc objects and upper level GNI provider objects is illustrated below.
  • gnix_tme - class for managing fabric-level receive requests, whether tagged or untagged. An object of this class maintains a posted receive queue and an unexpected queue. For FI_EP_MSG EPs, a single gnix_tme object is used, while a table of gnix_tme objects is associated with an FI_EP_RDM type EP. The relationship of a gnix_tme object with a gnix_fid_ep object and a table of gnix_vc objects is illustrated below.
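
The structure sketch below illustrates how these lower level classes wrap GNI resources. It is illustrative only; apart from the gni_*_handle_t types defined in gni_pub.h, the names and fields are assumptions rather than the provider's actual definitions.

```c
/* Illustrative sketch only: how the lower level classes wrap GNI resources.
 * Field names are assumptions, not the provider's actual definitions. */
#include <stdint.h>
#include "gni_pub.h"

struct gnix_cm_nic_sketch {        /* provider address for FI_EP_RDM EPs */
    gni_cdm_handle_t cdm_hndl;     /* communication domain (RDMA credentials) */
    gni_nic_handle_t nic_hndl;     /* attached Aries NIC */
    uint32_t device_addr;          /* NIC address */
    uint32_t cdm_id;               /* together with the above: the unique cm_nic address */
    /* GNI datagram subsystem state used for SMSG connection setup lives here */
};

struct gnix_nic_sketch {           /* maps to a GNI FMA descriptor */
    gni_nic_handle_t nic_hndl;
    gni_cq_handle_t tx_cq;         /* polled for locally initiated completions */
    gni_cq_handle_t rx_cq;         /* polled for incoming SMSG traffic */
};

struct gnix_vc_sketch {            /* one SMSG connection to a peer */
    gni_ep_handle_t ep_hndl;       /* embedded GNI endpoint */
    struct gnix_nic_sketch *nic;
    /* queues for pending sends, tagged sends, and fenced RMA/atomic ops */
};
```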

Transaction Management

The wide variety of transactions defined by the libfabric API - RDMA read/write, send/recv, tagged send/recv, and atomic operations - cannot be mapped directly to the underlying GNI_PostFma, GNI_PostRdma, and GNI_SmsgSend primitives. In the general case, a given libfabric transaction may involve a sequence of GNI transactions. As a result, there need to be upper level transaction management classes that map directly to libfabric transactions, and an underlying class that maps to GNI transactions. The gnix_fab_req class is used to track libfabric level transactions issued by the libfabric consumer. Individual GNI_Post(Fma/Rdma) and GNI_SmsgSend requests are tracked using objects of the gnix_tx_descriptor class. A free list of gnix_tx_descriptor objects is allocated for each gnix_nic. The number of these descriptors is limited in order to avoid overrunning the GNI TX CQ associated with the gnix_nic object. In contrast, the number of outstanding gnix_fab_req objects for a given domain is potentially unbounded, especially if target consumers of the provider are unable to handle -FI_EAGAIN being returned from libfabric transaction calls.
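
The two-level bookkeeping can be pictured with the following illustrative structures; the field layouts are assumptions and are not the provider's actual definitions.

```c
/* Illustrative sketch of the two-level transaction bookkeeping described
 * above; field layouts are assumptions, not the provider's definitions. */
#include <stdint.h>
#include <stddef.h>
#include "gni_pub.h"

struct gnix_fab_req_sketch {            /* one per libfabric level operation */
    uint64_t type;                      /* send, tagged send, RMA read/write, atomic, ... */
    void *user_context;                 /* reported back through the bound gnix_fid_cq */
    size_t len;
    /* enough state to re-issue or continue the operation later */
};

struct gnix_tx_descriptor_sketch {      /* one per GNI Post or SmsgSend */
    gni_post_descriptor_t gni_desc;     /* handed to GNI_PostFma/GNI_PostRdma */
    struct gnix_fab_req_sketch *req;    /* parent libfabric level request */
    int id;                             /* slot in the per-gnix_nic free list */
};
```

Bounding the per-nic free list of tx descriptors is what keeps the number of outstanding GNI transactions within what the TX CQ can hold.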

Connection Management for FI_EP_RDM type endpoints

The libfabric FI_EP_RDM endpoint type is, from the libfabric consumer's standpoint, connectionless. However, because GNI SMSG is connection oriented and is the most efficient way to send modest sized messages (up to a few KB), the GNI provider must do connection management internally for this endpoint type.

The GNI SMSG connections required to support the FI_EP_RDM endpoint type are termed virtual connections (VCs) in the implementation. The task of managing VC connection setup is divided among three distinct subsystems of the GNI provider: the VC management code, which implements the logic for initiating connection requests and responses; the gnix_cm_nic object, which provides a send/recv protocol used by the VC management code; and an underlying GNI datagram management system which interfaces to GNI_EpPostData and related GNI functions.

The GNI datagram subsystem provides a simple interface for posting and dequeuing GNI datagrams. Its methods do not attempt to hide the exchange nature of the kgni datagram protocol. It provides methods for allocating and releasing datagram structures, and methods for posting them either as bound datagrams (those that will be sent via the kgni protocol to a particular peer) or as wildcard datagrams. Methods are also provided for packing/unpacking the IN/OUT buffers associated with a datagram. Callback functions can be associated with datagrams; these callbacks are invoked by the datagram subsystem at various points when posting datagrams to or dequeuing datagrams from kgni.
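
A hypothetical sketch of this interface is shown below; every name and signature is illustrative rather than the provider's actual API.

```c
/* Hypothetical interface sketch for the datagram subsystem; every name and
 * signature here is illustrative, not the provider's actual API. */
#include <stddef.h>
#include <stdint.h>

struct gnix_datagram;                       /* opaque datagram object */

/* invoked by the subsystem when a datagram is posted to or dequeued from kgni */
typedef int (*gnix_dgram_callback_fn)(struct gnix_datagram *dgram, void *arg);

int dgram_alloc(struct gnix_datagram **dgram_ptr);
int dgram_free(struct gnix_datagram *dgram);

/* pack/unpack the IN/OUT buffers carried by the kgni datagram exchange */
int dgram_pack_buf(struct gnix_datagram *dgram, const void *data, size_t len);
int dgram_unpack_buf(struct gnix_datagram *dgram, void *data, size_t len);

/* post either to a specific peer (bound) or as a wildcard */
int dgram_bound_post(struct gnix_datagram *dgram, uint32_t peer_nic_addr,
                     uint32_t peer_cdm_id);
int dgram_wildcard_post(struct gnix_datagram *dgram);

int dgram_set_callback(struct gnix_datagram *dgram,
                       gnix_dgram_callback_fn fn, void *arg);
```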

At a high level, the approach to establishing VCs is straightforward. When a process wants to initiate a send operation to a peer, a GNI datagram is constructed with information about the GNI SMSG mailbox which has been allocated at the process to allow for messaging with the peer. This datagram is then posted to the local kgni device driver using the send method of the gnix_cm_nic associated with the endpoint's domain. The peer may also be trying to establish a connection with the process, in which case these datagrams with explicit peer addresses will likely be matched. To support asymmetric communication patterns, each process also posts a number of wildcard datagrams with GNI SMSG mailbox information.

The GNI datagram protocol is somewhat difficult to use for generalized communication schemes. gnix_cm_nic objects provide a more practical send/recv protocol which is easier to use. A non-blocking send method is provided to send messages to a remote gnix_cm_nic. Prior to using a gnix_cm_nic, a consumer of its functionality registers a callback function for receiving incoming messages. Out-of-band synchronization is not required when setting up a gnix_cm_nic, as incoming datagrams will remain queued in the kernel until matching datagrams are posted via the GNI datagram subsystem. A gnix_cm_nic does not use wildcard datagrams to exchange data (such as GNI mailbox information), but rather only to accept unexpected incoming messages from other nodes.
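
A hypothetical sketch of this protocol follows; the names and signatures are illustrative only.

```c
/* Hypothetical sketch of the gnix_cm_nic send/recv protocol described above;
 * names and signatures are illustrative only. */
#include <stddef.h>
#include <stdint.h>

struct gnix_cm_nic;

/* callback registered before the cm_nic is used; invoked once per
 * incoming message delivered by the datagram subsystem */
typedef int (*cm_nic_rcv_cb_fn)(struct gnix_cm_nic *cm_nic,
                                const void *msg, size_t len,
                                uint64_t src_addr);

int cm_nic_register_rcv_cb(struct gnix_cm_nic *cm_nic, cm_nic_rcv_cb_fn fn);

/* non-blocking: may return -FI_EAGAIN if the message cannot be queued yet */
int cm_nic_send(struct gnix_cm_nic *cm_nic, const void *msg, size_t len,
                uint64_t dest_addr);
```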

Owing to the need to conserve Aries hardware resources, multiple local FI_EP_RDM endpoints may use a single gnix_cm_nic object to exchange messages with remote gnix_cm_nic objects. To facilitate this multiplexing, when an endpoint is instantiated and associated with a gnix_cm_nic, a pointer to the corresponding gnix_fid_ep object is inserted into a small hash table attached to the gnix_cm_nic object, with the endpoint's unique address (GNI address/RDMA credentials/cdm_id) used as the key. This facilitates processing of incoming connection requests by the VC management code.
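
The multiplexing can be pictured with the following illustrative sketch of the address key and insertion step; all names here are assumptions.

```c
/* Illustrative only: the composite key used to multiplex endpoints on a
 * shared gnix_cm_nic.  Field and helper names are assumptions. */
#include <stdint.h>

struct gnix_ep_addr_key {
    uint32_t gni_device_addr;   /* Aries NIC address       */
    uint32_t cdm_id;            /* communication domain id */
    uint32_t rdma_credentials;  /* protection tag / cookie */
};

/* stand-ins for the hash table attached to the gnix_cm_nic */
int ep_table_insert(const struct gnix_ep_addr_key *key, void *ep);
void *ep_table_lookup(const struct gnix_ep_addr_key *key);

/* roughly what happens when an endpoint is bound to a shared cm_nic:
 * insert the endpoint keyed by its unique address so that incoming
 * connection requests addressed to it can later be demultiplexed */
static int register_ep_with_cm_nic(void *ep, uint32_t dev_addr,
                                   uint32_t cdm_id, uint32_t credentials)
{
    struct gnix_ep_addr_key key = {
        .gni_device_addr = dev_addr,
        .cdm_id = cdm_id,
        .rdma_credentials = credentials,
    };

    return ep_table_insert(&key, ep);
}
```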

The logic to manage the actual setup of VCs is contained in the VC management code. It employs a connection request/response protocol to set up a connection. The VC management code binds a receive callback function to a gnix_cm_nic when a gnix_fid_ep object is enabled via an fi_enable call. When a data transfer operation is initiated to a remote address (fi_addr_t), a hash table (or vector) of VCs is queried for an existing VC. If none is found, a VC connection request is constructed and sent, using the gnix_cm_nic send method, to the gnix_cm_nic associated with the target gnix_fid_ep. Depending on the communication pattern, it is possible that the remote gnix_fid_ep is simultaneously trying to initiate a connection request to this endpoint. The flowchart below illustrates the steps taken to transition a VC from unconnected to connected, both for the case of matching connection requests and for the request/response pattern.

[Flowchart: VC transition from unconnected to connected, for both matching connection requests and the request/response pattern]
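
A stand-in sketch of the lookup-or-connect step described above is shown below; the types and helper functions are illustrative, not the provider's actual code.

```c
/* Stand-in sketch, consistent with the description above, of the
 * lookup-or-connect step taken when a transfer is initiated. */
#include <rdma/fabric.h>        /* fi_addr_t */

enum vc_state { VC_UNCONNECTED, VC_CONNECTING, VC_CONNECTED };

struct vc_sketch {
    enum vc_state state;
    fi_addr_t peer;
};

/* stand-ins for the per-endpoint VC table and the cm_nic send path */
struct vc_sketch *vc_table_lookup(fi_addr_t dest);
struct vc_sketch *vc_alloc_and_insert(fi_addr_t dest);   /* starts VC_UNCONNECTED */
int send_conn_req(struct vc_sketch *vc);                 /* carries local SMSG mailbox info */

static struct vc_sketch *get_vc_for_peer(fi_addr_t dest)
{
    struct vc_sketch *vc = vc_table_lookup(dest);

    if (vc == NULL) {
        vc = vc_alloc_and_insert(dest);
        send_conn_req(vc);          /* sent via the endpoint's gnix_cm_nic */
        vc->state = VC_CONNECTING;
    }

    /* the caller queues its transfer on the VC; queued work is flushed
     * once the VC transitions to VC_CONNECTED */
    return vc;
}
```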

The procedure followed by the receive callback function for processing incoming connection request/response messages is shown in the flowchart below.

[Flowchart: receive callback processing of incoming connection request/response messages]
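
For orientation, a rough stand-in sketch of the request/response handling performed by the receive callback is shown below; the actual provider logic may differ in detail, and all names are illustrative.

```c
/* Rough stand-in sketch of the receive callback's handling of incoming
 * connection messages, consistent with the description above. */
enum conn_msg_type { CONN_REQ, CONN_RESP };

struct conn_msg {
    enum conn_msg_type type;
    /* peer SMSG mailbox attributes, peer EP address, ... */
};

struct vc_sketch;                                        /* see earlier sketch */
struct vc_sketch *lookup_or_create_vc(const struct conn_msg *msg);
void smsg_bind_peer_mailbox(struct vc_sketch *vc, const struct conn_msg *msg);
void send_conn_resp(struct vc_sketch *vc);
void mark_connected_and_flush(struct vc_sketch *vc);     /* push any queued sends */

int cm_nic_recv_callback(const struct conn_msg *msg)
{
    /* demultiplex to the right local EP/VC via the cm_nic's EP hash table */
    struct vc_sketch *vc = lookup_or_create_vc(msg);

    /* attach the peer's mailbox attributes to the embedded GNI endpoint */
    smsg_bind_peer_mailbox(vc, msg);

    if (msg->type == CONN_REQ)
        send_conn_resp(vc);     /* respond with our SMSG mailbox info; the
                                 * matching-requests case, where both sides
                                 * posted requests, is not shown here */

    mark_connected_and_flush(vc);
    return 0;
}
```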

Transaction Management Examples

Day in the life of a short (eager) fi_send/fi_recv

Day in the life of a long (rendezvous) fi_send/fi_recv

Day in the life of an RMA operation - fi_write

Progress Model Support

libfabric 1.0 currently supports two progression models - FI_PROGRESS_AUTO and FI_PROGRESS_MANUAL. A distinction is made in libfabric between progression of control operations and of data operations. Providers must support FI_PROGRESS_AUTO for both types of operations, but should also support FI_PROGRESS_MANUAL if requested by the application in a fi_getinfo call.

The GNI provider will rely on a combination of Aries network hardware capabilities and helper threads to implement FI_PROGRESS_AUTO. The helper threads will optionally be able to run on the cores reserved by the Cray Linux Environment (CLE) corespec feature. Data operation helper threads will be blocked either in GNI_CqVectorMonitor or in a poll system call using the fd returned from a call to GNI_CompChanCreate. Control operations for the GNI provider almost exclusively involve connection management operations - either those explicitly requested by the application for FI_EP_MSG endpoint types, or via internal VC channels in the case of FI_EP_RDM endpoint types. For the former, connection management will take place in the kernel, assuring progress of connection setup for message endpoint types. For the latter, a progress thread will be allocated for each gnix_cm_nic and will be blocked in a GNI_PostdataProbeWaitById call.
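
As an illustration of the data-operation helper threads, the sketch below shows one plausible structure; it assumes the thread blocks in poll(2) on an fd associated with the GNI completion channel, and the CQ draining step is indicated only in comments.

```c
/* Sketch of a data-operation helper thread for FI_PROGRESS_AUTO.  The fd is
 * assumed to come from the GNI completion channel; obtaining it and draining
 * the CQs are indicated only in comments. */
#include <poll.h>
#include <stdbool.h>

struct progress_thread_args {
    int comp_chan_fd;            /* fd tied to the GNI completion channel */
    volatile bool shutdown;
};

static void *data_progress_thread(void *arg)
{
    struct progress_thread_args *pta = arg;
    struct pollfd pfd = { .fd = pta->comp_chan_fd, .events = POLLIN };

    while (!pta->shutdown) {
        /* sleep until the Aries hardware signals that an attached GNI CQ
         * has events pending */
        if (poll(&pfd, 1, -1) <= 0)
            continue;

        /* here the provider would drain its TX/RX CQs (GNI_CqGetEvent)
         * and advance any queued gnix_fab_req state machines */
    }
    return NULL;
}
```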

According to the fi_domain man page, the GNI provider will need to ensure progression when the application makes calls to read or wait on event queues, completion queues, and counters.

For event queues, the GNI provider will progress control operations. Note that currently, although the EQ object is implemented in the GNI provider, it is not being used - there are no GNI provider bind operations which will currently succeed in binding an EQ to another libfabric object.

GNI data operations will need to be progressed when an application waits on or reads completion queue objects (gnix_fid_cq) or counter objects. Counter objects are currently not implemented in the GNI provider. In order to progress data operations, the underlying gnix_nic TX and RX CQs will need to be polled, and any backlog of queued data operations will need to be progressed. Since one of the design goals of the GNI provider is to provide good thread-hot support, progression needs to be done in a way that avoids contention for locks within the provider when, for example, multiple threads may be reading from different gnix_fid_cqs or counters.

The approach currently under consideration for progressing gnix_fid_cq objects is to maintain a linked list of all underlying gnix_nic objects associated with the gnix_fid_cq. Elements are added to this list as the gnix_fid_cq object is bound to different gnix_fid_ep objects. As part of the libfabric level CQ read operation, the underlying GNI CQs associated with each of these gnix_nic objects are polled (GNI_CqGetEvent) and the resulting GNI CQEs are processed. When counter objects are implemented, a similar approach can be taken. With this approach, a thread-hot MPI implementation, for example, can create a separate libfabric level CQ for each FI_EP_RDM endpoint it creates. When different threads poll different gnix_fid_cq objects, there should be no contention between threads within the provider. This use case is illustrated in the following diagram.
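
The per-CQ progression strategy can be sketched as follows; all types and helpers are stand-ins for the corresponding provider objects.

```c
/* Stand-in sketch of the progression strategy described above: reading a
 * gnix_fid_cq progresses only the gnix_nic objects bound behind that CQ. */
struct gnix_nic;                       /* wraps the GNI TX/RX CQs */

struct nic_list_entry {
    struct gnix_nic *nic;
    struct nic_list_entry *next;
};

struct cq_sketch {
    struct nic_list_entry *nics;       /* grown as the CQ is bound to EPs */
};

/* stand-in: GNI_CqGetEvent on the nic's CQs plus CQE post-processing */
void progress_nic(struct gnix_nic *nic);

/* stand-in: hand completed entries queued on this CQ back to the caller */
int dequeue_completions(struct cq_sketch *cq, void *buf, int count);

int cq_read_sketch(struct cq_sketch *cq, void *buf, int count)
{
    /* progress only the NICs behind this CQ so that threads reading
     * different CQs do not contend on shared provider locks */
    for (struct nic_list_entry *e = cq->nics; e != NULL; e = e->next)
        progress_nic(e->nic);

    return dequeue_completions(cq, buf, count);
}
```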

Diagrams