Matrix libraries often focus on achieving high performance for problems considered either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations such as general matrix multiplication (GEMM) that can achieve high performance for both small and large problem sizes. The key is to fuse packing, an operation that copies data into a contiguous layout in memory and is critical for large-matrix performance, with the first computational "pass" over that data. This boosts performance across the problem-size spectrum. As a result, tuning general-purpose libraries becomes simpler, since the technique obviates the need to carefully express and parameterize logic that chooses between a "small matrix" strategy and a "large matrix" strategy. A prototype implementation of the technique, built with the BLAS-like Library Instantiation Software (BLIS) framework, is described, and performance on a range of architectures is reported.
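To make the idea of fusing packing with the first computational pass concrete, the following is a minimal sketch in pure Python. It is not the BLIS prototype: the function name `gemm_fused_pack`, the block sizes `mr` and `nr`, the choice to pack only column panels of B, and the loop order are all assumptions made for illustration. The first row block of A reads each panel of B from its original layout and writes it to a contiguous buffer as it goes; later row blocks read only the packed copy.

```python
def gemm_fused_pack(A, B, mr=2, nr=2):
    """Return A times B for non-empty list-of-lists matrices.

    Hypothetical illustration: each column panel of B is packed into a
    contiguous buffer during the first pass that computes with it,
    rather than in a separate copy step beforehand.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, nr):                # column panel of B
        jb = min(nr, n - j0)
        packed = [0.0] * (k * jb)             # contiguous k-by-jb buffer
        for i0 in range(0, m, mr):            # row block of A
            ib = min(mr, m - i0)
            first_pass = (i0 == 0)
            for p in range(k):
                if first_pass:
                    # Fused step: read B in its original layout and store
                    # it to the packed buffer while it is already in hand.
                    row = B[p][j0:j0 + jb]
                    packed[p * jb:(p + 1) * jb] = row
                else:
                    # Later passes touch only the contiguous packed copy.
                    row = packed[p * jb:(p + 1) * jb]
                for i in range(ib):
                    a = A[i0 + i][p]
                    c_row = C[i0 + i]
                    for j in range(jb):
                        c_row[j0 + j] += a * row[j]
    return C
```

In this sketch, a matrix A with at most `mr` rows completes in the first pass and never rereads the packed buffer, while a taller A reuses it for every later row block. That is one way a single code path could serve both small and large problems without a separate strategy-selection step.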