Sparse matrix computation is crucial in various modern applications, including large-scale graph analytics, deep learning, and recommendation systems. The performance of these algorithms varies greatly depending on the structure of the input matrix, making it difficult to gain a comprehensive understanding of sparse computation and its relationship to inputs, algorithms, and target machine architecture. Despite extensive research on certain sparse algorithms, such as SpMV, the overall family of sparse algorithms has yet to be investigated as a whole. In this paper, we introduce SpChar, a workload characterization methodology for general sparse computation. SpChar employs tree-based models to identify the most relevant hardware and input characteristics, starting from hardware and input-related metrics gathered from Performance Monitoring Counters and matrices. Our analysis enables the creation of a characterization loop that facilitates the optimization of sparse computation by mapping the impact of architectural features to inputs and algorithmic choices. We apply SpChar to more than 600 matrices from the SuiteSparse Matrix collection and three state-of-the-art Arm CPUs to determine the critical hardware and software characteristics that affect sparse computation. In our analysis, we determine that the biggest limiting factors for high-performance sparse computation are (1) the latency of the memory system, (2) the pipeline flush overhead resulting from branch misprediction, and (3) the poor reuse of cached elements. However, the degree to which those impact the performance of a CPU greatly depends on the algorithm and the input data.
翻译:稀疏矩阵计算在各种现代应用中至关重要,包括大规模图形分析、深度学习和推荐系统。这些算法的性能因输入矩阵的结构而异,使得很难全面了解稀疏计算及其与输入、算法和目标机器架构之间的关系。尽管对某些稀疏算法进行了广泛的研究,例如 SpMV,但整个稀疏算法族尚未作为一个整体进行调查。在本文中,我们介绍 SpChar,一种通用稀疏计算工作负载表征方法。SpChar 采用基于树的模型来识别最相关的硬件和输入特征,从性能监测计数器和矩阵收集的硬件和输入相关指标开始。我们的分析使得稀疏计算的优化通过将架构特性映射到输入和算法选择来实现特征循环的创建。我们将 SpChar 应用于来自 SuiteSparse 矩阵收集的 600 多个矩阵和三个最先进的 Arm CPU,以确定影响稀疏计算的关键硬件和软件特征。在我们的分析中,我们确定导致高性能稀疏计算最大性能限制的因素是(1)存储器系统的延迟,(2)由于分支误判引起的流水线清除开销和(3)缓存元素再利用不足。然而,它们对 CPU 性能的影响程度在很大程度上取决于算法和输入数据。