Tensor decomposition has become an essential tool in many applications across various domains, including machine learning. The sparse Matricized Tensor Times Khatri-Rao Product (MTTKRP) is one of the most computationally expensive kernels in tensor computations. Despite its significant computational parallelism, MTTKRP is a challenging kernel to optimize due to its irregular memory access characteristics. This paper focuses on a multi-faceted memory system that exploits the spatial and temporal locality of the MTTKRP data structures. Furthermore, users can reconfigure our design depending on the behavior of the compute units used in the FPGA accelerator. Our system efficiently accesses all the MTTKRP data structures while reducing the total memory access time, using a distributed cache and Direct Memory Access (DMA) subsystem. Moreover, our work improves the memory access time by 3.5x compared with commercial memory controller IPs, and our system shows 2x and 1.26x speedups compared with cache-only and DMA-only memory systems, respectively.
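To make the irregular access pattern concrete, below is a minimal sketch of a mode-0 sparse MTTKRP over a 3-way tensor stored in COO format. The function name, struct layout, and row-major factor matrices are assumptions for illustration only, not the paper's implementation; the point is that the j/k indices of consecutive nonzeros jump around, producing the scattered reads into the factor matrices that the abstract calls out.

```c
/* Illustrative sketch only: mode-0 sparse MTTKRP on a COO tensor.
 * Layout and names are hypothetical, not the accelerator's actual design. */
#include <stddef.h>

/* One nonzero of a 3-way sparse tensor: X(i, j, k) = val. */
typedef struct {
    size_t i, j, k;
    double val;
} coo_nnz;

/* M(i,:) += X(i,j,k) * (B(j,:) .* C(k,:)) for every nonzero.
 * M is I x R, B is J x R, C is K x R, all row-major.
 * Reads of B and C follow the (j, k) indices of the nonzeros, so the
 * access stream is data-dependent and irregular. */
void coo_mttkrp_mode0(const coo_nnz *nnz, size_t nnz_count,
                      double *M, const double *B, const double *C, size_t R)
{
    for (size_t n = 0; n < nnz_count; ++n) {
        const double v = nnz[n].val;
        double *m_row = M + nnz[n].i * R;
        const double *b_row = B + nnz[n].j * R;
        const double *c_row = C + nnz[n].k * R;
        for (size_t r = 0; r < R; ++r)
            m_row[r] += v * b_row[r] * c_row[r];
    }
}
```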