Computers used for data analytics are often NUMA systems with multiple sockets per machine, multiple cores per socket, and multiple thread contexts per core. Getting peak performance out of these machines requires placing the correct number of threads in the correct positions on the machine. One particularly interesting aspect of memory and thread placement is the way it affects the movement of data around the machine, and the extra latency this can introduce to reads and writes. In this paper we describe work on modeling the bandwidth requirements of an application on a NUMA compute node based on the placement of its threads. The model is parameterized by sampling performance counters during two application runs with carefully chosen thread placements. Evaluating the model against thousands of measurements shows a median difference between predictions and measured bandwidth of 2.34%. The results of this modeling can be used in a number of ways, ranging from performance debugging during development, where the programmer can be alerted to potentially problematic memory access patterns; to systems such as Pandia, which take an application and predict the performance and system load of a proposed thread count and placement; to libraries of data structures such as Parallel Collections and Smart Arrays that can abstract memory placement and thread placement issues away from the user when parallelizing code.
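To make the idea concrete, the following is a minimal illustrative sketch, not the paper's actual model: it assumes a hypothetical setup in which per-thread memory-traffic rates are estimated from counter readings taken during two calibration runs, and per-node bandwidth demand for a proposed placement is then predicted by summing the rates of the threads assigned to each NUMA node.

```python
# Illustrative sketch only (hypothetical names and numbers, not the
# paper's formulation): predict per-node bandwidth demand on a
# two-node NUMA machine from a proposed thread placement.

def estimate_rates(counters_run1, counters_run2):
    """Average each thread's memory-traffic rate over two
    calibration runs with different thread placements."""
    return [(a + b) / 2 for a, b in zip(counters_run1, counters_run2)]

def predict_node_bandwidth(rates, placement, num_nodes=2):
    """Sum the traffic rates of the threads placed on each node."""
    demand = [0.0] * num_nodes
    for rate, node in zip(rates, placement):
        demand[node] += rate
    return demand

# Hypothetical per-thread counter readings (GB/s) from the two runs.
rates = estimate_rates([2.0, 2.2, 1.8, 2.0], [2.0, 1.8, 2.2, 2.0])
# Proposed placement: threads 0,1 on node 0; threads 2,3 on node 1.
print(predict_node_bandwidth(rates, [0, 0, 1, 1]))  # → [4.0, 4.0]
```

A real model of this kind would also need to account for remote accesses crossing the interconnect and for contention once a node's memory controller saturates; the sketch only captures the aggregation step.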