Algorithms and Implementations for Collective Communication

The second issue EPiGRAM addresses is the performance of collective communication operations at exascale. The project tackles it by investigating improved sparse collectives in MPI and by studying non-blocking collectives in GPI.

MPI Isomorphic Sparse Collectives. The MPI specification provides functionality for sparse collective communication, in which processes communicate only with a subset of other processes in a local neighborhood. Sparse neighborhoods can be specified either explicitly, using the general graph topology functionality, or implicitly, using a Cartesian topology. Both approaches have drawbacks. The general graph mechanism is cumbersome to use, and the collective communication and computation needed to create the local neighborhoods, as well as the creation of a new, possibly reordered communicator, can be very expensive. Alternatively, neighborhoods are given implicitly with a Cartesian communicator, on which neighborhood collective communication is possible over the implicit neighborhoods. Although neighborhood collective communication should be oblivious to how neighborhoods are set up, the MPI standard treats explicit and implicit neighborhoods somewhat differently. For instance, non-existing neighbors are possible for Cartesian neighborhoods and buffer space must be accounted for such non-existing neighbors; this is neither possible nor allowed for graph topologies. On the other hand, graph topologies can associate weights with the graph edges (which may reflect communication costs in different ways and thus permit better process mappings), which is not possible for Cartesian topologies.
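To make the contrast concrete, the following C sketch sets up the same four-point stencil neighborhood both ways, once implicitly via a periodic Cartesian communicator and once explicitly via a distributed graph, and performs a neighborhood allgather on each. The grid layout, buffer sizes and the use of MPI_UNWEIGHTED are illustrative choices, not part of the EPiGRAM work itself.

```c
/* Sketch: the same 2D stencil neighborhood expressed with an implicit
 * Cartesian topology and with an explicit distributed-graph topology. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* (a) Implicit neighborhood: periodic 2D Cartesian communicator. */
    int dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Dims_create(size, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    int rank;
    MPI_Comm_rank(cart, &rank);
    double sendval = (double)rank, recvvals[4];
    /* Neighborhood collective over the 4 Cartesian neighbors
     * (order per dimension: negative direction, then positive). */
    MPI_Neighbor_allgather(&sendval, 1, MPI_DOUBLE,
                           recvvals, 1, MPI_DOUBLE, cart);

    /* (b) Explicit neighborhood: the same 4 neighbors as a distributed
     * graph; edge weights could additionally encode communication costs. */
    int nbrs[4];
    MPI_Cart_shift(cart, 0, 1, &nbrs[0], &nbrs[1]);
    MPI_Cart_shift(cart, 1, 1, &nbrs[2], &nbrs[3]);
    MPI_Comm graph;
    MPI_Dist_graph_create_adjacent(cart, 4, nbrs, MPI_UNWEIGHTED,
                                   4, nbrs, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &graph);
    MPI_Neighbor_allgather(&sendval, 1, MPI_DOUBLE,
                           recvvals, 1, MPI_DOUBLE, graph);

    MPI_Comm_free(&graph);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```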

EPiGRAM provides a middle ground between these two approaches: a mechanism for structured, sparse collective communication with much smaller overhead than the general graph topology mechanism but with more flexibility and expressivity than Cartesian neighborhoods. In EPiGRAM, we introduced the concept of isomorphic sparse collective communication: a form of collective communication in which all involved processes communicate within small, identically structured neighborhoods of other processes. Isomorphic sparse collective communication is useful for implementing stencil computations and other regular, sparse distributed computations, where the assumption that all processes behave symmetrically is justified. The concept of isomorphic neighborhoods extends and generalizes what is possible with the limited MPI Cartesian topologies.

In EPiGRAM, a library for isomorphic sparse collective communication has been implemented. The library provides navigation and query functionality, the creation of isomorphic neighborhoods (by attaching the neighborhood information to a Cartesian communicator), functions for using relative neighbor lists to set up MPI graph communicators, and sparse isomorphic collective operations of the allgather, alltoall and reduction types. The performance improvements that can be achieved by using isomorphic sparse collectives are presented in.
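The library's own API is not reproduced here; the sketch below only illustrates, with standard MPI calls, the underlying idea of a relative (isomorphic) neighbor list: every process applies the same offset vectors to its own Cartesian coordinates and turns the result into a distributed-graph communicator. The Moore-neighborhood offsets and the helper name isomorphic_graph_from_cart are hypothetical.

```c
/* Sketch (not the EPiGRAM library API): deriving an MPI graph communicator
 * from an isomorphic, relative neighbor list on a periodic 2D grid. */
#include <mpi.h>

/* 8-point (Moore) neighborhood given as relative (dx,dy) offsets;
 * every process uses the identical list, i.e. the neighborhood is
 * isomorphic across processes. */
static const int rel[8][2] = {
    {-1,-1}, {-1,0}, {-1,1}, {0,-1}, {0,1}, {1,-1}, {1,0}, {1,1}
};

MPI_Comm isomorphic_graph_from_cart(MPI_Comm cart)
{
    int dims[2], periods[2], coords[2];
    MPI_Cart_get(cart, 2, dims, periods, coords);

    int nbrs[8];
    for (int i = 0; i < 8; i++) {
        int c[2] = { coords[0] + rel[i][0], coords[1] + rel[i][1] };
        /* Periodic wrap-around keeps every neighbor valid. */
        c[0] = (c[0] + dims[0]) % dims[0];
        c[1] = (c[1] + dims[1]) % dims[1];
        MPI_Cart_rank(cart, c, &nbrs[i]);
    }

    /* The neighborhood is symmetric, so sources and destinations coincide. */
    MPI_Comm graph;
    MPI_Dist_graph_create_adjacent(cart, 8, nbrs, MPI_UNWEIGHTED,
                                   8, nbrs, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &graph);
    return graph;
}
```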

GPI Non-blocking Collectives. Non-blocking collectives have recently been introduced in MPI-3 as a means to overlap communication and computation during collective operations. In EPiGRAM, we have investigated the development of non-blocking collectives in GPI. Currently, GASPI defines only two collective operations, gaspi_barrier and gaspi_allreduce, and both take a timeout argument that specifies after which period of time the collective operation can be interrupted. This timeout argument can be used as a form of non-blocking call: for instance, when the timeout is set to GASPI_TEST, each GPI process can call a collective function and, whenever the call is interrupted as a consequence of the timeout, perform other work before retrying, effectively implementing a non-blocking collective. In EPiGRAM, we added support for this kind of non-blocking collective and performed performance tests of blocking and non-blocking GPI barriers, as well as a comparison with the MPI blocking barrier, on the SuperMUC iDataPlex supercomputer at the Leibniz Supercomputing Centre, using up to 4,096 nodes (corresponding to a total of 65,536 cores). Panel b of Fig. 2 presents the execution time of the different barrier implementations, showing a reduced execution time for the GPI non-blocking barrier (blue line) with respect to the blocking GPI (green line) and MPI (red line) implementations.
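A minimal sketch of the timeout-based pattern described above, assuming GPI-2's C interface: gaspi_barrier is repeatedly called with GASPI_TEST, and other work (here the placeholder do_local_work) is performed whenever the call returns GASPI_TIMEOUT.

```c
/* Sketch: using the GASPI timeout mechanism as a non-blocking barrier. */
#include <GASPI.h>
#include <stdio.h>

static void do_local_work(void)
{
    /* Placeholder for the computation overlapped with the barrier. */
}

int main(void)
{
    gaspi_proc_init(GASPI_BLOCK);

    gaspi_return_t ret;
    do {
        /* With GASPI_TEST the call returns immediately: GASPI_SUCCESS when
         * the barrier has completed, GASPI_TIMEOUT when it has not yet. */
        ret = gaspi_barrier(GASPI_GROUP_ALL, GASPI_TEST);
        if (ret == GASPI_TIMEOUT)
            do_local_work();   /* overlap computation with the barrier */
    } while (ret == GASPI_TIMEOUT);

    if (ret != GASPI_SUCCESS)
        fprintf(stderr, "gaspi_barrier failed\n");

    gaspi_proc_term(GASPI_BLOCK);
    return 0;
}
```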


Fig. 2. Panel a shows the memory consumption using GPI static all-to-all connections (green bars) and dynamic connections (red bars) for different numbers of computing nodes. Panel b shows the execution time of the MPI blocking (red line), GPI blocking (green line) and GPI non-blocking (blue line) barriers as the number of nodes increases. The tests have been performed on the SuperMUC iDataPlex supercomputer at the Leibniz Supercomputing Centre. (Color figure online)