Collaborative filtering (CF) has proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, no existing work optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks: (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully exploits the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) it tiles the embedding matrix to increase data locality and reduce cache misses (and thus read latency); (2) it optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular for the similarity computation therein, to avoid the memory copies needed for matrix data preparation; and (3) it aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 65.3X speedup over the existing CPU solution, and up to 4.8X speedup and 7.9X cost reduction in the cloud over the existing GPU solution on an NVIDIA V100 GPU.
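To make the three optimizations concrete, below is a minimal C++/OpenMP sketch, not the authors' implementation: all names, the embedding dimension, and the tile size are illustrative assumptions. It shows similarity computed as per-pair vector dot products over cache-resident tiles of the embedding rows, so no dense matrix has to be gathered, and the forward-phase similarities are kept so the backward phase can reuse them.

```cpp
// Minimal sketch (not the authors' code): per-pair vector dot products
// instead of a matrix-matrix multiplication, parallelized over sampled
// (user, item) pairs with OpenMP; each dot product walks the embedding
// rows tile by tile to keep the working set cache-resident.
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t DIM  = 128;  // embedding dimension (assumed)
constexpr std::size_t TILE = 32;   // tile width chosen to fit in cache (assumed)

// Dot product over one embedding pair, accumulated tile by tile.
double dot_tiled(const float* u, const float* v) {
    double acc = 0.0;
    for (std::size_t t = 0; t < DIM; t += TILE)
        for (std::size_t k = t; k < t + TILE; ++k)
            acc += static_cast<double>(u[k]) * v[k];
    return acc;
}

// Forward phase: similarity of each sampled (user, item) pair.
// The results are returned and cached so the backward phase can
// reuse them instead of recomputing the dot products.
std::vector<double> forward_similarities(
        const std::vector<float>& user_emb,   // flattened, row-major
        const std::vector<float>& item_emb,   // flattened, row-major
        const std::vector<std::pair<std::size_t, std::size_t>>& pairs) {
    std::vector<double> sims(pairs.size());
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < pairs.size(); ++i) {
        const float* u = &user_emb[pairs[i].first  * DIM];
        const float* v = &item_emb[pairs[i].second * DIM];
        sims[i] = dot_tiled(u, v);  // no copy into a dense matrix needed
    }
    return sims;  // reused in the backward phase
}
```

Because each thread reads only the two embedding rows of its own pair, this avoids the gather/copy step that a batched matrix multiplication would require, which is the memory-copy bottleneck the abstract refers to.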