Quantization and pruning are core techniques used to reduce the inference costs of deep neural networks. State-of-the-art quantization techniques are currently applied to both the weights and activations; however, pruning is most often applied only to the weights of the network. In this work, we jointly apply novel uniform quantization and unstructured pruning methods to both the weights and activations of deep neural networks during training. Using our methods, we empirically evaluate the currently accepted prune-then-quantize paradigm across a wide range of computer vision tasks and observe that the two operations do not commute when applied to both the weights and activations of deep neural networks. Informed by these observations, we articulate the non-commutativity hypothesis: for a given deep neural network being trained for a specific task, there exists an exact training schedule in which quantization and pruning can be introduced to optimize network performance. We find that this optimal ordering not only exists, but also varies across discriminative and generative tasks. Using the optimal training schedule within our training framework, we demonstrate increased performance per memory footprint over existing solutions.
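To make the setting concrete, the sketch below illustrates one way uniform fake-quantization and unstructured magnitude pruning could be applied jointly to both the weights and activations of a layer during training, with the ordering of the two operations exposed as a switch. This is a minimal illustrative example, not the paper's implementation: the layer name `QPLinear`, the straight-through estimator, the per-tensor scaling, and the parameters `n_bits`, `sparsity`, and `prune_first` are assumptions made for the sketch.

```python
# Minimal sketch (assumed, not the authors' code) of joint uniform quantization
# and unstructured pruning on weights *and* activations during training.
import torch
import torch.nn as nn
import torch.nn.functional as F


def uniform_quantize(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Symmetric uniform quantization with a straight-through estimator."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Forward pass uses the quantized value; backward pass uses the identity.
    return x + (q - x).detach()


def magnitude_mask(x: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Unstructured pruning mask that zeroes the smallest-magnitude entries."""
    k = int(sparsity * x.numel())
    if k == 0:
        return torch.ones_like(x)
    threshold = x.detach().abs().flatten().kthvalue(k).values
    return (x.detach().abs() > threshold).to(x.dtype)


class QPLinear(nn.Linear):
    """Linear layer that quantizes and prunes both weights and activations."""

    def __init__(self, in_features, out_features, n_bits=8, sparsity=0.5,
                 prune_first=True):
        super().__init__(in_features, out_features)
        self.n_bits, self.sparsity, self.prune_first = n_bits, sparsity, prune_first

    def _compress(self, t: torch.Tensor) -> torch.Tensor:
        # The ordering here is the quantity under study:
        # prune-then-quantize versus quantize-then-prune.
        if self.prune_first:
            t = t * magnitude_mask(t, self.sparsity)
            return uniform_quantize(t, self.n_bits)
        t = uniform_quantize(t, self.n_bits)
        return t * magnitude_mask(t, self.sparsity)

    def forward(self, x):
        return F.linear(self._compress(x), self._compress(self.weight), self.bias)


# Usage: drop-in replacement for nn.Linear during training.
layer = QPLinear(128, 64, n_bits=4, sparsity=0.75, prune_first=True)
out = layer(torch.randn(32, 128))
out.sum().backward()  # gradients flow via the straight-through estimator
```

Because the pruning mask and the quantization grid each depend on the tensor's current values, applying one operation changes the statistics seen by the other, which is why the two orderings can yield different trained networks.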