This paper introduces two architectures for the inference of convolutional neural networks (CNNs). Both architectures exploit weight sparsity and compression to reduce computational complexity and bandwidth. The first architecture uses multiply-accumulators (MACs) but avoids unnecessary multiplications by skipping zero weights. The second architecture exploits weight sparsity at the level of the weights' bit representation, substituting resource-intensive MACs with much smaller Bit Layer Multiply Accumulators (BLMACs). The use of BLMACs also allows variable-precision weights, represented as variable-size integers or even floating-point values. Some details of an implementation of the second architecture are given. Weight compression with arithmetic coding and its bandwidth implications are also discussed. Finally, implementation results for a pathfinder design in various technologies are presented.
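The following sketch is not the paper's hardware design; it is a minimal C illustration, under assumed 8-bit signed weights, of the two ideas named in the abstract: a conventional MAC loop that skips zero weights, and a BLMAC-style accumulation in which each set bit of a weight contributes a shifted addition instead of a full multiplication. All function and variable names are hypothetical.

#include <stdint.h>
#include <stdio.h>

#define N 8

/* Conventional dot product that exploits weight sparsity by skipping
   the multiply whenever the weight is zero. */
int64_t mac_skip_zero(const int32_t *act, const int8_t *w, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        if (w[i] == 0) continue;              /* skip zero weight */
        acc += (int64_t)act[i] * w[i];
    }
    return acc;
}

/* BLMAC-style accumulation (illustrative only): process one weight bit
   layer at a time; only set bits cost an addition, and a shift of the
   layer sum replaces the multiplier. */
int64_t blmac_style(const int32_t *act, const int8_t *w, int n) {
    int64_t acc = 0;
    for (int bit = 0; bit < 8; bit++) {
        int64_t layer = 0;
        for (int i = 0; i < n; i++) {
            int32_t mag = w[i] < 0 ? -w[i] : w[i];   /* simple sign-magnitude view */
            if ((mag >> bit) & 1)
                layer += (w[i] < 0) ? -(int64_t)act[i] : (int64_t)act[i];
        }
        acc += layer << bit;                  /* shift instead of multiply */
    }
    return acc;
}

int main(void) {
    int32_t act[N] = { 3, -1, 4, 1, 5, -9, 2, 6 };
    int8_t  w[N]   = { 0, 2, 0, -3, 0, 1, 0, -5 };
    printf("MAC (zero-skip): %lld\n", (long long)mac_skip_zero(act, w, N));
    printf("BLMAC-style    : %lld\n", (long long)blmac_style(act, w, N));
    return 0;
}

Both functions return the same dot product (-44 for the sample data); the second version makes explicit why sparse bit representations reduce work, since every zero bit in a weight removes an addition.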