Post-training quantization (PTQ) is widely regarded as one of the most practical and efficient compression methods, benefiting from data privacy and low computational cost. We argue that the oscillation problem in PTQ methods has been overlooked. In this paper, we take the initiative to explore this problem and present a theoretical proof explaining why it is essential in PTQ. We then solve it by introducing a principled and theoretically grounded framework. In particular, we first formulate the oscillation in PTQ and prove that the problem is caused by the difference in module capacity. To this end, we define the module capacity (ModCap) under both data-dependent and data-free scenarios, where the differentials between adjacent modules are used to measure the degree of oscillation. The problem is then solved by selecting the top-k differentials, for which the corresponding modules are jointly optimized and quantized. Extensive experiments demonstrate that our method successfully reduces the performance drop and generalizes to different neural networks and PTQ methods. For example, with 2/4-bit ResNet-50 quantization, our method surpasses the previous state-of-the-art method by 1.9%. The gain is even more significant for small-model quantization, e.g., surpassing BRECQ by 6.61% on MobileNetV2*0.5.
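To make the selection step concrete, the sketch below illustrates one plausible reading of the procedure: compute a capacity value per module, take differentials between adjacent modules, and pick the top-k pairs for joint optimization. The `module_capacity` proxy (parameter count) and the function names are illustrative assumptions, not the paper's exact ModCap definition.

```python
import torch.nn as nn

def module_capacity(module: nn.Module) -> float:
    """Hypothetical data-free capacity proxy: number of trainable weights.
    The paper's ModCap may use a different, data-dependent measure."""
    return float(sum(p.numel() for p in module.parameters() if p.requires_grad))

def select_joint_modules(modules, k):
    """Rank adjacent-module capacity differentials and return the k pairs
    with the largest gaps, i.e. the modules to optimize and quantize jointly."""
    caps = [module_capacity(m) for m in modules]
    diffs = [abs(caps[i + 1] - caps[i]) for i in range(len(caps) - 1)]
    # Indices of the top-k differentials; pair i covers modules (i, i + 1).
    topk = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)[:k]
    return [(i, i + 1) for i in sorted(topk)]

if __name__ == "__main__":
    # Toy stand-ins for the blocks of a network being quantized.
    blocks = [nn.Conv2d(16, 16, 3), nn.Conv2d(16, 64, 3),
              nn.Conv2d(64, 64, 3), nn.Conv2d(64, 256, 3)]
    print(select_joint_modules(blocks, k=2))
```

In this sketch, a pair with a large capacity differential would be reconstructed as one unit during PTQ, rather than module by module.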