Binary Neural Networks (BNNs) can significantly accelerate the inference time of a neural network by replacing its expensive floating-point arithmetic with bitwise operations. Most existing solutions, however, do not fully optimize the data flow through the BNN layers, and intermediate conversions from 1 to 16/32 bits often further hinder efficiency. We propose a novel training scheme that increases data flow and parallelism in the BNN pipeline; specifically, we introduce a clipping block that reduces the data-width from 32 bits to 8. Furthermore, we shrink the internal accumulator of a binary layer, which is usually kept at 32 bits to prevent data overflow, without losing accuracy. Additionally, we provide an optimization of the Batch Normalization layer that both reduces latency and simplifies deployment. Finally, we present an optimized implementation of the Binary Direct Convolution for ARM instruction sets. Our experiments show a consistent improvement of the inference speed (up to 1.91× and 2.73× compared to two state-of-the-art BNN frameworks) with no drop in accuracy for at least one full-precision model.
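To make the clipping idea concrete, the sketch below shows a scalar XNOR-popcount binary dot product whose 32-bit running sum is clipped to an 8-bit value before being handed to the next layer. This is only an illustrative approximation of the data-width reduction described above, not the paper's implementation; the function and variable names (e.g. `binary_dot_clipped`) are hypothetical, and the real kernels target ARM SIMD instructions rather than scalar C++.

```cpp
// Illustrative sketch only: XNOR-popcount binary dot product with the
// accumulator clipped from 32 bits down to 8, mimicking the clipping block.
// Names are hypothetical; the paper's kernels use ARM intrinsics instead.
#include <cstdint>
#include <bit>        // std::popcount (C++20)
#include <algorithm>  // std::clamp
#include <cstdio>

// Binary dot product of two bit-packed {-1,+1} vectors of `words` 64-bit words.
// XNOR + popcount counts matching bits; 2*matches - 64 converts the count of
// one word into the signed +/-1 dot-product contribution of that word.
int8_t binary_dot_clipped(const uint64_t* a, const uint64_t* b, int words) {
    int32_t acc = 0;  // wide accumulator for the running sum
    for (int i = 0; i < words; ++i) {
        int matches = std::popcount(~(a[i] ^ b[i]));  // XNOR + popcount
        acc += 2 * matches - 64;
    }
    // Clip to 8 bits before passing the result on, reducing the data-width
    // of the values flowing between binary layers.
    return static_cast<int8_t>(std::clamp(acc, -128, 127));
}

int main() {
    uint64_t a[2] = {0xFFFFFFFFFFFFFFFFull, 0x0000000000000000ull};
    uint64_t b[2] = {0xFFFFFFFFFFFFFFFFull, 0xFFFFFFFFFFFFFFFFull};
    std::printf("%d\n", binary_dot_clipped(a, b, 2));  // 64 + (-64) = 0
    return 0;
}
```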