In deep learning inference, model parameters are pruned and quantized to reduce the model size. To deploy such models on low-cost embedded devices, compression methods and common subexpression (CSE) elimination algorithms are applied to the resulting sparse constant matrices. However, state-of-the-art CSE elimination methods do not scale to large matrices: they take hours to extract CSEs from a $200 \times 200$ matrix, and the matrix multiplication algorithms they produce run slower than conventional matrix multiplication. Moreover, no compression method exists for matrices that utilizes CSEs. As a remedy, this paper proposes a random search-based algorithm that extracts CSEs from the column pairs of a constant matrix; it produces an adder tree for a $1000 \times 1000$ matrix within a minute. To compress the adder tree, this paper presents a compression format that extends Compressed Sparse Row (CSR) to include CSEs. Compression rates of more than $50\%$ over the original CSR format can be achieved, and simulations on a single-core embedded system show that the matrix multiplication execution time can be reduced by $20\%$.
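As background for the CSR extension mentioned above, the following is a minimal, illustrative sketch of the standard CSR format and a CSR matrix-vector product; the CSE-extended format is the paper's contribution and is not reproduced here. All function names are placeholders for illustration.

```python
def to_csr(dense):
    """Convert a dense 2-D list into CSR arrays (values, col_indices, row_ptr)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:                # store only nonzero entries
                values.append(v)
                col_indices.append(j)
        row_ptr.append(len(values))   # row i spans row_ptr[i]..row_ptr[i+1]
    return values, col_indices, row_ptr

def csr_matvec(values, col_indices, row_ptr, x):
    """Multiply a CSR matrix by a vector x."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_indices[k]]
        y.append(acc)
    return y

A = [[5, 0, 0],
     [0, 0, 3],
     [2, 0, 1]]
vals, cols, ptr = to_csr(A)
print(vals, cols, ptr)                          # [5, 3, 2, 1] [0, 2, 0, 2] [0, 1, 2, 4]
print(csr_matvec(vals, cols, ptr, [1, 1, 1]))   # [5, 3, 3]
```

The CSE-extended format described in the paper additionally records shared column-pair subexpressions, so repeated partial sums in the multiplication are computed once and reused.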