Depthwise convolutions are widely used in lightweight convolutional neural networks (CNNs). Unlike classic convolutions, the performance of depthwise convolutions is mainly bounded by memory access rather than arithmetic operations, so direct algorithms are often more efficient than indirect ones (matrix multiplication-, Winograd-, and FFT-based convolutions), which incur additional memory accesses. However, the existing direct implementations of depthwise convolutions on ARMv8 architectures strike a poor trade-off between the register-level reuse of different tensors, which usually leads to sub-optimal performance. In this paper, we propose new direct implementations of depthwise convolutions by means of implicit padding, register tiling, and related techniques, covering the forward propagation, backward propagation, and weight gradient update procedures. Compared to the existing ones, our new implementations incur much less communication overhead between registers and cache. Experimental results on two ARMv8 CPUs show that our implementations deliver an average speedup of 4.88x and 16.4x over the existing direct implementations in open-source libraries and the matrix multiplication-based implementations in PyTorch, respectively.
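For context, the scalar C sketch below shows what a direct depthwise convolution computes in the forward pass (NCHW layout, stride 1, no padding; the function name, signature, and layout are illustrative assumptions, not the paper's optimized ARMv8 kernels, which add implicit padding, register tiling, and vectorization on top of this loop nest). Because each output element reads from only one input channel, there is no reduction over input channels, which is why the operation is memory-bound as the abstract notes.

```c
#include <stddef.h>

/* A minimal scalar sketch of a direct depthwise convolution forward pass.
 * Assumptions (not from the paper): NCHW layout, stride 1, no padding.
 * Each channel c is filtered independently by its own KH x KW kernel;
 * no accumulation across channels takes place. */
void depthwise_conv2d_fwd(const float *in,  /* C x H  x W   input   */
                          const float *w,   /* C x KH x KW  weights */
                          float *out,       /* C x OH x OW  output  */
                          size_t C, size_t H, size_t W,
                          size_t KH, size_t KW)
{
    size_t OH = H - KH + 1, OW = W - KW + 1;
    for (size_t c = 0; c < C; c++)
        for (size_t oh = 0; oh < OH; oh++)
            for (size_t ow = 0; ow < OW; ow++) {
                float acc = 0.0f;
                for (size_t kh = 0; kh < KH; kh++)
                    for (size_t kw = 0; kw < KW; kw++)
                        acc += in[(c * H + oh + kh) * W + (ow + kw)]
                             * w[(c * KH + kh) * KW + kw];
                out[(c * OH + oh) * OW + ow] = acc;
            }
}
```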