Memory Footprint and Efficient Memory Usage

The first issue EPiGRAM addresses is the memory consumption of MPI and GPI at exascale. This is done by designing and implementing zero-copy MPI collectives and GPI dynamic connections.

MPI Zero-Copy Collectives. The MPI datatype mechanism allows application programmers to describe the structure of the data to be communicated in a concise way. In particular, non-contiguous data layouts can be described as vector, indexed or structured types, allowing a compact representation of complex layouts. The application programmer describes the layout of the data to be communicated, and the MPI library implementation performs the actual data accesses.
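As a minimal illustration (not taken from the EPiGRAM code), the sketch below describes one column of a row-major matrix with MPI_Type_vector and communicates it directly, leaving all access to the non-contiguous elements to the MPI library; it assumes at least two processes:

```c
#include <mpi.h>

#define N 4   /* rows    */
#define M 5   /* columns */

int main(int argc, char **argv)
{
    double a[N][M];
    MPI_Datatype column;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N; i++)          /* some data on rank 0 */
        for (int j = 0; j < M; j++)
            a[i][j] = (rank == 0) ? i * M + j : 0.0;

    /* N blocks of 1 double each, separated by a stride of M doubles:
     * one column of the row-major matrix a */
    MPI_Type_vector(N, 1, M, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)       /* send column 2 without any explicit packing */
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&a[0][2], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```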

An efficient MPI implementation of datatypes can save memory copy operations. An explicit pack operation, implemented by the application programmer, copies data into some intermediate communication buffer, and the MPI library may then perform yet another copy of this buffer. With datatypes, this extra copy can sometimes be eliminated completely, or in part for large data, where the library may apply pipelining. In particular, the MPI datatype mechanism permits so-called zero-copy implementations, in which no explicit data movements are present in the application and all data access and manipulation are carried out implicitly by the MPI library implementation.
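The difference can be made concrete with two hypothetical helper routines for the same column layout as in the previous sketch: one packs explicitly before sending, the other hands a derived datatype to MPI (to be called between MPI_Init and MPI_Finalize):

```c
#include <mpi.h>
#include <stdlib.h>

/* (a) Explicit pack: the application copies the strided column into a
 *     contiguous scratch buffer; the MPI library may then copy it again
 *     internally before transmission. */
static void send_column_packed(const double *a, int n, int m, int col,
                               int dest, MPI_Comm comm)
{
    double *packed = malloc((size_t)n * sizeof *packed);
    for (int i = 0; i < n; i++)
        packed[i] = a[(size_t)i * m + col];
    MPI_Send(packed, n, MPI_DOUBLE, dest, 0, comm);
    free(packed);
}

/* (b) Zero-copy from the application's perspective: the derived datatype
 *     describes the stride, and all data accesses are left to the MPI
 *     library, which may avoid intermediate copies entirely. */
static void send_column_datatype(const double *a, int n, int m, int col,
                                 int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(n, 1, m, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(a + col, 1, column, dest, 0, comm);
    MPI_Type_free(&column);
}
```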

In EPiGRAM, we studied the design and implementation of different zero-copy collective operations, focusing on obstacles that might prevent efficient implementations of such operations. We have investigated the use of the derived-datatype mechanism of MPI in the implementation of the classic all-to-all communication algorithm of Bruck et al. Through a series of improvements to the canonical implementation of the algorithm, we gradually eliminated the initial and final process-local data reorganizations, culminating in a zero-copy version that contains no explicit, process-local data movement or copy operations. In this case, all necessary data movements are carried out as part of the communication operations. We also showed how the improved algorithm can be used to solve irregular all-to-all communication problems.

In particular, in EPiGRAM we implemented and used three new derived datatypes (bounded_vector, circular_vector, and bucket) that are not part of MPI. On two supercomputers at the Vienna University of Technology, we experimentally compared the algorithmic improvements to the Bruck et al. algorithm when implemented on top of MPI, showing the zero-copy version to perform significantly better than the initial, straightforward implementation. One of our variants has also been implemented inside mvapich, and we showed it to perform better than the mvapich implementation of the Bruck et al. algorithm for the range of processes and problem sizes where it is enabled. Details about this work are provided in.
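To make concrete which process-local operations the zero-copy version removes, the following sketch (not the EPiGRAM implementation) shows the canonical structure of the Bruck et al. all-to-all, with its initial rotation, per-round pack and unpack, and final reordering all done as explicit local copies:

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Canonical Bruck et al. all-to-all for blocks of 'bsize' ints
 * (illustrative sketch; error handling omitted). */
static void bruck_alltoall(const int *sendbuf, int *recvbuf, int bsize,
                           MPI_Comm comm)
{
    int r, p;
    MPI_Comm_rank(comm, &r);
    MPI_Comm_size(comm, &p);

    int *tmp  = malloc((size_t)p * bsize * sizeof *tmp);
    int *pack = malloc((size_t)p * bsize * sizeof *pack);
    int *upk  = malloc((size_t)p * bsize * sizeof *upk);

    /* Initial local rotation: tmp block t <- send block (r + t) mod p */
    for (int t = 0; t < p; t++)
        memcpy(tmp + t * bsize, sendbuf + ((r + t) % p) * bsize,
               bsize * sizeof *tmp);

    /* ceil(log2(p)) rounds: in the round with step k (k = 1, 2, 4, ...),
     * all blocks whose index has bit k set are forwarded by k ranks */
    for (int k = 1; k < p; k <<= 1) {
        int to = (r + k) % p, from = (r - k + p) % p, nb = 0;

        for (int j = 0; j < p; j++)            /* pack outgoing blocks */
            if (j & k)
                memcpy(pack + nb++ * bsize, tmp + j * bsize,
                       bsize * sizeof *pack);

        MPI_Sendrecv(pack, nb * bsize, MPI_INT, to, 0,
                     upk, nb * bsize, MPI_INT, from, 0,
                     comm, MPI_STATUS_IGNORE);

        nb = 0;
        for (int j = 0; j < p; j++)            /* unpack received blocks */
            if (j & k)
                memcpy(tmp + j * bsize, upk + nb++ * bsize,
                       bsize * sizeof *tmp);
    }

    /* Final reordering: the data from rank s sits in tmp block (r - s) mod p */
    for (int s = 0; s < p; s++)
        memcpy(recvbuf + s * bsize, tmp + ((r - s + p) % p) * bsize,
               bsize * sizeof *recvbuf);

    free(tmp); free(pack); free(upk);
}
```

In the zero-copy variant, these local rotations and block selections are instead described by derived datatypes (such as the new circular_vector and bucket types), so that all data movement happens inside the communication operations themselves.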

However, we showed in EPiGRAM that current collective interfaces cannot support zero-copy implementations in all cases. The problem is that the regular collective interfaces use a single receive datatype that is applied to the data of every process, whereas sometimes a different layout is needed for each process. Such cases cannot be accounted for; only for all-to-all communication is the required flexibility provided, in the form of the tedious, non-scalable MPI_Alltoallw operation. In we show a simple, and in many cases backwards-compatible and mostly non-intrusive, solution to these problems in the form of slightly changed collective interfaces (all other communication interfaces would have to be reinterpreted in a similar way). The key to the solution is to separate the number of elements to be communicated from the overall structure of the data: the latter is described by a datatype, the former by an element count. The current MPI specification conflates these two concerns, leading to the problems discussed.
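For reference, the existing interfaces illustrate the mismatch: regular collectives apply one receive datatype to the data of every process, while per-process layouts are available only through MPI_Alltoallw and its process-sized argument arrays (standard MPI prototypes shown for illustration):

```c
/* Regular collective: a single recvtype is applied to the block received
 * from every process; only the counts and displacements may differ. */
int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, const int recvcounts[], const int displs[],
                   MPI_Datatype recvtype, MPI_Comm comm);

/* Per-process datatypes exist only here, at the price of four arrays of
 * length p on every process, which is tedious and non-scalable. */
int MPI_Alltoallw(const void *sendbuf, const int sendcounts[],
                  const int sdispls[], const MPI_Datatype sendtypes[],
                  void *recvbuf, const int recvcounts[],
                  const int rdispls[], const MPI_Datatype recvtypes[],
                  MPI_Comm comm);
```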

GPI Dynamic Connections. We have identified GPI memory consumption as one of the main aspects to be improved for large-scale execution of GPI applications. The memory consumption is strongly related to the management of the communication infrastructure in GPI. The GASPI specification, which GPI implements, defines that the communication infrastructure is either built during initialization or set up explicitly by the application. This is controlled by a configuration parameter whose default value is TRUE, so that by default the communication infrastructure is built at start-up. The details of how the communication infrastructure is initialized are, however, left to the implementation.

Before EPiGRAM, the communication infrastructure was built up statically in GPI: each computing node (a GPI rank) established a connection to all other computing nodes during initialization, resulting in an all-to-all connection topology. While this is acceptable at small scale and in typical executions, the problem becomes evident when running large-scale GPI applications. For this reason, we have extended GPI to allow three modes of topology building: GASPI_TOPOLOGY_NONE, where the application explicitly handles the infrastructure setup; GASPI_TOPOLOGY_STATIC, where, as before, an all-to-all connection is established; and GASPI_TOPOLOGY_DYNAMIC, where connections are established dynamically when the first communication request between two nodes is issued (see the configuration sketch below).

We were able to verify and measure the effects of this GPI extension during the Extreme Scale Workshop on the full SuperMUC iDataPlex supercomputer at the Leibniz Supercomputing Centre, consisting of 3,072 nodes. Establishing connections dynamically results in a much more efficient and scalable resource consumption in terms of memory footprint. Panel a of Fig. 2 presents the memory consumption per rank after initialization using static and dynamic connections. The non-scalable behavior of the GPI all-to-all connection setup is evident; it can now be alleviated using GPI dynamic connections.
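As a reference for the dynamic mode, the following minimal sketch shows how an application might select it before initialization; it assumes the GPI-2 configuration interface (gaspi_config_get/gaspi_config_set) and the build_infrastructure configuration field, so the names should be checked against the installed GPI-2 version:

```c
#include <GASPI.h>

/* Minimal sketch (not EPiGRAM code): request dynamic connection build-up
 * before gaspi_proc_init, assuming the build_infrastructure field of
 * gaspi_config_t accepts the topology modes described above. */
int main(void)
{
    gaspi_config_t config;

    gaspi_config_get(&config);
    /* Do not connect all ranks at start-up; establish each connection
     * lazily on the first communication request between two ranks. */
    config.build_infrastructure = GASPI_TOPOLOGY_DYNAMIC;
    gaspi_config_set(config);

    gaspi_proc_init(GASPI_BLOCK);

    /* ... application communication: connections are created on demand ... */

    gaspi_proc_term(GASPI_BLOCK);
    return 0;
}
```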