Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware

Here, we test the performance and scalability of fully asynchronous, best-effort communication on existing, commercially available HPC hardware. A first set of experiments tested whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. At high CPU counts, best-effort communication improved both the number of computational steps executed per unit time and the solution quality achieved within a fixed-duration run window. Under the best-effort model, characterizing the distribution of quality of service across processing components and over time is critical to understanding the actual computation being performed. Additionally, a complete picture of scalability under the best-effort model requires analysis of how such quality of service fares at scale. To answer these questions, we designed and measured a suite of quality of service metrics: simulation update period, message latency, message delivery failure rate, and message delivery coagulation. Under a lower communication-intensivity benchmark parameterization, we found that median values for all quality of service metrics were stable when scaling from 64 to 256 processes. Under maximal communication intensivity, we found only minor -- and, in most cases, nil -- degradation in median quality of service. In an additional set of experiments, we tested the effect of an apparently faulty compute node on performance and quality of service. Despite extreme quality of service degradation among that node and its clique, median performance and quality of service remained stable.
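As a rough illustration of the best-effort idea described above, the following minimal Python sketch models a bounded, non-blocking channel in which a send never waits and a message is simply dropped when the buffer is full. The class `BestEffortChannel`, the function `run_demo`, the buffer capacity, and the 3:1 sender-to-receiver update ratio are all hypothetical choices made for this sketch; they are not the authors' implementation, and the failure-rate calculation is only one plausible reading of "message delivery failure rate", not the paper's definition.

```python
from collections import deque


class BestEffortChannel:
    """Hypothetical bounded channel: sends never block; a full buffer drops the message."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.attempted = 0  # total send attempts
        self.dropped = 0    # sends discarded because the buffer was full

    def try_send(self, message):
        """Attempt a send without waiting; return True if the message was buffered."""
        self.attempted += 1
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
            return False
        self.buffer.append(message)
        return True

    def receive_all(self):
        """Drain and return every message currently buffered."""
        messages = list(self.buffer)
        self.buffer.clear()
        return messages

    def delivery_failure_rate(self):
        """Fraction of attempted sends that were dropped (an assumed definition)."""
        return self.dropped / self.attempted if self.attempted else 0.0


def run_demo():
    """Sender updates three times for each receiver update, so some sends are dropped."""
    channel = BestEffortChannel(capacity=2)
    received = []
    for step in range(12):
        channel.try_send(step)
        if step % 3 == 2:  # the slower receiver drains the buffer every third step
            received.extend(channel.receive_all())
    return received, channel.delivery_failure_rate()


if __name__ == "__main__":
    received, failure_rate = run_demo()
    print("received:", received)
    print("delivery failure rate:", failure_rate)
```

In this toy setup the sender keeps making progress regardless of the receiver's pace, which is the trade the best-effort model makes: computation is never stalled by communication, at the cost of some messages being lost.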
