Deep neural networks have demonstrated extraordinary abilities in many computer vision applications. However, complex network architectures challenge efficient real-time deployment and require significant computational resources and energy. These challenges can be overcome through optimizations such as network compression, which can often be realized with little loss of accuracy; in some cases accuracy may even improve. This paper provides a survey of two types of network compression: pruning and quantization. Pruning can be categorized as static if it is performed offline or dynamic if it is performed at run-time. We compare pruning techniques, describe the criteria used to remove redundant computations, and discuss the trade-offs among element-wise, channel-wise, shape-wise, filter-wise, layer-wise, and even network-wise pruning. Quantization reduces computation by lowering the precision of the datatype: weights, biases, and activations are typically quantized to 8-bit integers, although lower bit-width implementations, including binary neural networks, are also discussed. Pruning and quantization can be used independently or in combination. We compare current techniques, analyze their strengths and weaknesses, present compressed-network accuracy results on a number of frameworks, and provide practical guidance for compressing networks.
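To make the element-wise pruning criterion concrete, the following is a minimal, framework-agnostic sketch of static magnitude-based pruning in NumPy. The function name, the sparsity target, and the threshold choice are illustrative assumptions for this survey's discussion, not any single paper's method.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Element-wise static pruning: zero out the smallest-magnitude weights.

    `sparsity` is the fraction of weights to remove, e.g. 0.5 removes half.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    # Keep only weights whose magnitude exceeds the threshold
    # (ties at the threshold are also removed).
    mask = np.abs(weights) > threshold
    return weights * mask

# Illustrative usage: prune 50% of a random 4x4 weight matrix.
w = np.random.randn(4, 4)
w_pruned = magnitude_prune(w, sparsity=0.5)
print(f"sparsity achieved: {np.mean(w_pruned == 0):.2f}")
```

Channel-, filter-, or layer-wise variants follow the same idea but score and remove whole structures (e.g. an output channel's entire weight slice) rather than individual elements, which trades finer-grained sparsity for hardware-friendly regularity.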
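Similarly, the quantization of weights and activations to 8-bit integers described above can be illustrated with a uniform affine scheme. This is a minimal sketch assuming per-tensor min/max calibration; the function names and the reconstruction relation x ≈ scale * (q - zero_point) reflect the standard affine formulation, not a specific framework's API.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine quantization of a float tensor to signed 8-bit integers.

    Returns the int8 tensor plus the (scale, zero_point) needed to dequantize.
    """
    qmin, qmax = -128, 127
    # Per-tensor scale from the observed dynamic range; guard against a
    # constant tensor, which would otherwise give scale = 0.
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximate float tensor: x_hat = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

# Illustrative usage: round-trip a random activation tensor through int8.
x = np.random.randn(1000).astype(np.float32)
q, s, zp = quantize_int8(x)
x_hat = dequantize(q, s, zp)
print(f"max reconstruction error: {np.abs(x - x_hat).max():.4f}")
```

Lower bit-width schemes, down to the binary networks mentioned above, shrink the quantization grid further (e.g. to {-1, +1}), trading larger reconstruction error for much cheaper arithmetic and storage.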