The PGAS model is well suited for executing irregular applications on cluster-based systems, due to its efficient support for short, one-sided messages. However, there are currently two major limitations faced by PGAS applications. The first relates to scalability: despite the availability of APIs that support non-blocking operations in special cases, many PGAS operations on remote locations are synchronous by default, which can lead to long-latency stalls and poor scalability. The second relates to productivity: while it is simpler for the developer to express all communications at the fine granularity that is natural to the application, experiments have shown that such a natural expression runs 20x slower than more efficient but less productive code that requires manual message aggregation and termination detection. In this paper, we introduce a new programming system for PGAS applications, in which point-to-point remote operations can be expressed as fine-grained asynchronous actor messages. In this approach, the programmer does not need to worry about the programming complexities of message aggregation and termination detection. Our approach can also be viewed as extending the classical Bulk Synchronous Parallelism model with fine-grained asynchronous communications within a phase or superstep. We believe that our approach offers a desirable point in the productivity-performance space for PGAS applications, with more scalable performance and higher productivity relative to past approaches. Specifically, for seven irregular mini-applications from the Bale benchmark suite executed using 2048 cores in the NERSC Cori system, our approach shows geometric mean performance improvements of ≥20x relative to standard PGAS versions (UPC and OpenSHMEM) while maintaining comparable productivity to those versions.
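To make the idea of automatic message aggregation concrete, the following is a minimal single-process sketch, not the authors' runtime or API. It assumes a hypothetical `AggregatingMailbox` class in which the caller issues fine-grained `send` calls and the mailbox batches them per destination, flushing a batch whenever a buffer fills. The `done()` call stands in for termination detection, which is trivial here because everything runs in one process. The histogram kernel, the function names, and the `packets_sent` counter are all illustrative assumptions.

```python
from collections import defaultdict


class AggregatingMailbox:
    """Toy model of actor-style sends with per-destination buffers (hypothetical)."""

    def __init__(self, handler, buffer_size=4):
        self.handler = handler            # called as handler(dest_pe, payload)
        self.buffer_size = buffer_size
        self.buffers = defaultdict(list)  # dest_pe -> pending payloads
        self.packets_sent = 0             # number of simulated network transfers

    def send(self, dest_pe, payload):
        """Fine-grained send: the caller never sees the buffering."""
        buf = self.buffers[dest_pe]
        buf.append(payload)
        if len(buf) >= self.buffer_size:
            self._flush(dest_pe)

    def _flush(self, dest_pe):
        """Deliver one aggregated batch as a single simulated transfer."""
        batch = self.buffers.pop(dest_pe, [])
        if batch:
            self.packets_sent += 1
            for payload in batch:
                self.handler(dest_pe, payload)

    def done(self):
        """Flush partial buffers; a stand-in for real termination detection."""
        for dest_pe in list(self.buffers):
            self._flush(dest_pe)


def histogram(indices, num_pes, table_size, buffer_size):
    """Increment a table distributed cyclically over num_pes simulated PEs."""
    tables = [[0] * table_size for _ in range(num_pes)]

    def on_message(dest_pe, slot):
        tables[dest_pe][slot] += 1

    mailbox = AggregatingMailbox(on_message, buffer_size)
    for g in indices:
        # Global index g lives on PE (g % num_pes) at local slot (g // num_pes).
        mailbox.send(g % num_pes, g // num_pes)
    mailbox.done()
    return tables, mailbox.packets_sent
```

In this sketch the application loop is identical whether `buffer_size` is 1 (every message is its own transfer) or larger (messages are batched), which is the productivity point the abstract makes: only the number of simulated transfers changes, not the code the programmer writes.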