Tensor decompositions such as CANDECOMP/PARAFAC (CP) are widely used in applications including chemometrics, signal processing, and machine learning. A widely used method for computing such decompositions is the Alternating Least Squares (ALS) algorithm. When the number of components is small, ALS exhibits low arithmetic intensity regardless of how it is implemented, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. As a result, GPUs can be used efficiently for further speedups; at the same time, the technique remains compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to MATLAB and a mechanism to deal effectively with decompositions that complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
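To make the fusion idea concrete, the following is a minimal NumPy sketch, not the Concurrent ALS library's API. It assumes that fusion amounts to concatenating the factor matrices of several CP models column-wise, so that the most expensive ALS kernel (the matricized-tensor times Khatri-Rao product, MTTKRP) becomes one wide matrix product instead of several narrow ones. The function names `khatri_rao`, `mttkrp_mode0`, and `fused_mttkrp_mode0` are hypothetical and chosen only for illustration.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: row (j*K + k), column r holds B[j, r] * C[k, r]."""
    J, R = B.shape
    K, _ = C.shape
    return np.einsum("jr,kr->jkr", B, C).reshape(J * K, R)

def mttkrp_mode0(X, B, C):
    """Mode-0 MTTKRP as a single matrix product: X_(0) @ khatri_rao(B, C)."""
    I = X.shape[0]
    X0 = X.reshape(I, -1)  # (I, J*K) unfolding, consistent with khatri_rao's row order
    return X0 @ khatri_rao(B, C)

def fused_mttkrp_mode0(X, Bs, Cs):
    """Assumed fusion scheme: stack the factors of all models column-wise,
    run one wide MTTKRP, then split the result back per model."""
    ranks = [B.shape[1] for B in Bs]
    B_all = np.concatenate(Bs, axis=1)   # (J, sum of ranks)
    C_all = np.concatenate(Cs, axis=1)   # (K, sum of ranks)
    M_all = mttkrp_mode0(X, B_all, C_all)
    return np.split(M_all, np.cumsum(ranks)[:-1], axis=1)

# Illustrative data: one tensor, three models with small ranks.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 20, 10))
ranks = [2, 3, 5]
Bs = [rng.standard_normal((20, r)) for r in ranks]
Cs = [rng.standard_normal((10, r)) for r in ranks]

fused = fused_mttkrp_mode0(X, Bs, Cs)
separate = [mttkrp_mode0(X, B, C) for B, C in zip(Bs, Cs)]
```

In this sketch the fused call reads the unfolded tensor once for all models rather than once per model, which is one way the number of floating-point operations per byte moved can grow with the total number of components. The per-model results are the same as those of the separate calls, up to floating-point rounding.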