The Interoperability Challenge

Interoperability of programming systems, such as MPI, OpenMP and PGAS, is a key aspect of exascale computing, as exascale applications are likely to combine programming systems in order to handle different kinds of communication, i.e. inter-node and intra-node communication, efficiently. In EPiGRAM, we implemented a PGAS-based MPI library to fully integrate the message-passing and PGAS programming models, introduced MPI endpoints in this MPI implementation, and improved GPI interoperability with MPI.

EMPI4Re: A PGAS-based MPI Implementation. EPiGRAM integrates and combines the message-passing and PGAS programming models in one MPI implementation. The EPiGRAM MPI library for Research (EMPI4Re) is an MPI-1 library created by EPCC at the University of Edinburgh as a vehicle for research into new MPI functionality. The library adopts the conceptual model of PGAS and assumes hardware support for RDMA operations. This conceptual model enables efficient implementation of remotely accessible double-buffered first-in first-out (FIFO) queues, used for point-to-point operations, and of distributed state control structures, used for collective operations.
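
As an illustration of this design, the following C sketch shows one possible layout for such a remotely accessible double-buffered FIFO queue. The structure names and sizes are hypothetical and chosen for exposition; they are not the actual EMPI4Re data structures, which additionally carry the RDMA registration state required by the underlying network API.

    /* Hypothetical sketch of a remotely accessible double-buffered FIFO
     * queue: a peer fills one buffer with RDMA puts while the owner drains
     * the other, and the roles are swapped when the active buffer is full.
     * Not the actual EMPI4Re implementation. */
    #include <stdint.h>

    #define FIFO_SLOTS 64      /* messages per buffer (assumed)         */
    #define SLOT_BYTES 512     /* eager payload size per slot (assumed) */

    typedef struct {
        volatile uint64_t tail;                 /* advanced remotely via RDMA  */
        uint64_t          head;                 /* advanced locally on receive */
        uint8_t slots[FIFO_SLOTS][SLOT_BYTES];  /* payload written via RDMA    */
    } fifo_buffer_t;

    typedef struct {
        fifo_buffer_t buf[2];  /* double buffering: drain one buffer while */
        int           active;  /* the remote peer fills the other          */
    } remote_fifo_t;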

The code-base for the EMPI4Re library currently consists of 55,495 lines of C code (for comparison, Open MPI version 1.8.6 consists of 933,889 lines). The current implementation of EMPI4Re is based on DMAPP (a Cray one-sided communication API) and there is an ongoing effort in EPiGRAM to replace DMAPP with GPI. Overall, we found in EPiGRAM that the EMPI4Re library is a useful research vehicle for rapidly prototyping and assessing changes to MPI functionality without the complexity of managing a large code-base or a production MPI implementation.

MPI Endpoints. MPI is typically targeted at communication between distributed memory spaces. For a pure MPI programming approach, multi-core nodes require one OS process per core in order to exploit the available compute capability. This means multiple instances of the MPI library per shared-memory node, each with its own communication buffers, topology information and connection resources. Hybrid programming, commonly referred to as MPI+X, where X is a programming model that supports threads, requires only a single instance of the MPI library per shared-memory node and should therefore scale better than pure MPI as per-node core counts increase. However, there are restrictions on how MPI can be used in multi-threaded OS processes that make it difficult to achieve high performance with hybrid programming. In particular, threads cannot be individually identified as the source or target of MPI messages.
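
The sketch below makes this restriction concrete. It is a minimal, illustrative MPI+Pthreads program (assuming at least two MPI ranks): both worker threads on rank 1 share the same MPI rank, so rank 0 cannot address a particular thread and the threads must disambiguate incoming messages themselves, here by message tag.

    /* Minimal illustration of the MPI+X restriction: all threads of an OS
     * process share one MPI rank, so messages must be disambiguated by the
     * threads themselves (here via tags). Run with at least two ranks. */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        int tag = *(int *)arg;   /* thread index reused as message tag */
        int payload;
        /* Both threads receive as rank 1; only the tag tells them apart. */
        MPI_Recv(&payload, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("thread handling tag %d received %d\n", tag, payload);
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {                 /* multi-threaded receiver */
            pthread_t t[2];
            int tags[2] = {0, 1};
            for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &tags[i]);
            for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
        } else if (rank == 0) {          /* sender: can only target rank 1 as a whole */
            for (int tag = 0; tag < 2; tag++) {
                int payload = 100 + tag;
                MPI_Send(&payload, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }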

MPI endpoints have been designed to remove or alleviate threading restrictions in MPI and to facilitate high-performance communication between multi-threaded OS processes. MPI endpoints allow the programmer to create additional ranks at each MPI process. Each endpoint rank can then be distributed to threads in system-level programming models, enabling these threads to act as MPI processes and interoperate with MPI directly.
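
The usage model proposed to the MPI Forum looks roughly as follows. Note that MPI_Comm_create_endpoints is a proposed extension, not part of the MPI standard; the signature shown follows the endpoints proposal and may differ from what EMPI4Re eventually exposes, so this example does not compile against a standard MPI library.

    /* Sketch of the endpoints usage model. MPI_Comm_create_endpoints is a
     * *proposed* MPI extension; name and signature follow the MPI Forum
     * endpoints proposal and are shown here for illustration only. */
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        enum { NUM_EP = 4 };          /* one endpoint rank per thread */
        MPI_Comm ep_comm[NUM_EP];

        /* Each MPI process requests NUM_EP ranks in a new endpoints
         * communicator and receives one communicator handle per endpoint. */
        MPI_Comm_create_endpoints(MPI_COMM_WORLD, NUM_EP, MPI_INFO_NULL, ep_comm);

        #pragma omp parallel num_threads(NUM_EP)
        {
            int ep_rank;
            MPI_Comm my_comm = ep_comm[omp_get_thread_num()];
            MPI_Comm_rank(my_comm, &ep_rank);  /* each thread now has its own rank */
            /* ... the thread can send and receive as an MPI "process" ... */
            MPI_Comm_free(&my_comm);
        }

        MPI_Finalize();
        return 0;
    }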

The EPiGRAM project is implementing MPI endpoints in EMPI4Re. The initial approach taken in the EMPI4Re library is to create each communicator handle as normal: generating a new structure for each one, including a full mapping of all ranks to their associated locations. This is exactly what would happen if each member of the new communicator were an individual MPI process in its own OS process. EMPI4Re is already designed to cope with each member of a communicator using a different context identifier for a particular communicator, so this approach does not cause a conflict. The next step is to de-duplicate the internal data structures so that multiple MPI endpoints in the same OS process share a single copy of the mapping information, as well as a reduced number of matching data structures and communication buffers. In EMPI4Re, various design choices, such as allowing a different context identifier at each MPI process for a communicator and thereby avoiding a distributed agreement algorithm, simplify the addition of new features.
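
The following structures are a purely hypothetical sketch of the de-duplication step: each endpoint keeps its own rank and context identifier, while all endpoints in the same OS process point at a single shared rank-to-location map. The field names are invented for illustration and do not correspond to the EMPI4Re source.

    /* Hypothetical illustration of de-duplicated communicator state for
     * endpoints that share an OS process; not the actual EMPI4Re internals. */
    typedef struct {
        int  size;            /* number of ranks in the communicator */
        int *rank_to_node;    /* shared rank -> location mapping     */
        int  refcount;        /* endpoint handles sharing this map   */
    } shared_comm_map_t;

    typedef struct {
        int                my_rank;     /* this endpoint's rank in the communicator */
        unsigned           context_id;  /* may differ per member in EMPI4Re         */
        shared_comm_map_t *map;         /* single copy per OS process               */
    } endpoint_comm_handle_t;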

GPI Interoperability with MPI. Large parallel applications that have been developed over several years often reach thousands or even millions of lines of code. Moreover, there is a large set of available libraries and tools that run with MPI. For this reason, it is important for GPI to cooperate fully with MPI so that both can be used simultaneously in an efficient way. This interoperability allows an incremental porting of large applications to GPI and an effective use of existing MPI libraries and infrastructure. This tighter support for MPI interoperability was integrated into GPI during the EPiGRAM project (GPI release v1.1.0 in June 2014) by introducing the so-called mixed mode. In this mode, GPI sets up its environment by reusing MPI instead of relying on its own startup mechanism (gaspi_run). The only constraint is that MPI must be initialized (MPI_Init) before GPI (gaspi_proc_init). Since MPI and GPI both follow a Single Program Multiple Data (SPMD) model, there is a direct match between the MPI and GPI ranks in this mode, which simplifies reasoning about a hybrid GPI-MPI application.
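
A minimal mixed-mode program, sketched below, shows the required initialization order and the one-to-one rank match. The function names follow the GPI-2/GASPI C API (gaspi_proc_init, gaspi_proc_rank, gaspi_proc_term); error checking is omitted for brevity.

    /* Minimal GPI-MPI mixed-mode sketch: MPI is initialized first, GPI then
     * reuses the environment set up by MPI. Error handling omitted. */
    #include <mpi.h>
    #include <GASPI.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int mpi_rank;
        gaspi_rank_t gpi_rank;

        MPI_Init(&argc, &argv);          /* must come before gaspi_proc_init */
        gaspi_proc_init(GASPI_BLOCK);    /* mixed mode: reuses the MPI environment */

        MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
        gaspi_proc_rank(&gpi_rank);

        /* In mixed mode the MPI and GPI ranks match one-to-one. */
        printf("MPI rank %d corresponds to GPI rank %u\n", mpi_rank, (unsigned)gpi_rank);

        gaspi_proc_term(GASPI_BLOCK);
        MPI_Finalize();
        return 0;
    }

In mixed mode such a program is launched with the MPI launcher (e.g. mpirun) rather than with gaspi_run.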

In addition, an interface allowing memory-management interoperability has recently been added to the GASPI standard. GASPI handles memory spaces in so-called segments, which are accessible from every thread of every GASPI process. The GASPI standard has been extended to allow users to provide an already existing memory buffer as the memory space of a GASPI segment. This new functionality allows applications to communicate data from memory that is not allocated by the GASPI runtime system but provided to it, e.g. by MPI. If an MPI program calls GPI-based libraries, the communication of each library needs to be isolated, so that it does not interfere with the communication of the main application or of any other library; this is required to guarantee correct results. The GASPI interface has therefore been extended to offer a clear separation: a library is now able to create its own communication queues and thus have an isolated communication channel. For all other resources, e.g. segments, GPI already provides mechanisms to query their usage and to select an unused resource.
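
The sketch below combines the two extensions: an externally allocated buffer (here obtained with MPI_Alloc_mem) is registered as a GASPI segment, and a library-private queue provides an isolated communication channel. It assumes the extended API exposes gaspi_segment_use and gaspi_queue_create with roughly the signatures shown; error handling is omitted.

    /* Hedged sketch of the GASPI extensions described above: binding an
     * existing buffer to a segment and creating a dedicated queue so that
     * a library's communication stays isolated. Exact signatures are
     * assumed from the extended GASPI specification. */
    #include <mpi.h>
    #include <GASPI.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        gaspi_proc_init(GASPI_BLOCK);

        /* Memory owned by the application (or by MPI), not by the GASPI runtime. */
        const gaspi_size_t size = 1 << 20;
        void *buf;
        MPI_Alloc_mem((MPI_Aint)size, MPI_INFO_NULL, &buf);

        /* Register the existing buffer as segment 0, visible to all processes. */
        gaspi_segment_use(0, buf, size, GASPI_GROUP_ALL, GASPI_BLOCK, 0);

        /* A library creates its own queue and thus an isolated channel. */
        gaspi_queue_id_t lib_queue;
        gaspi_queue_create(&lib_queue, GASPI_BLOCK);

        /* ... library communication on lib_queue, using segment 0 ... */

        gaspi_queue_delete(lib_queue);
        gaspi_proc_term(GASPI_BLOCK);
        MPI_Free_mem(buf);
        MPI_Finalize();
        return 0;
    }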