The exponentially large discrete search space in mixed-precision quantization (MPQ) makes it hard to determine the optimal bit-width for each layer. Previous works usually resort to iterative search methods on the training set, which consume hundreds or even thousands of GPU-hours. In this study, we reveal that some unique learnable parameters in quantization, namely the scale factors in the quantizer, can serve as importance indicators of a layer, reflecting the contribution of that layer to the final accuracy at certain bit-widths. These importance indicators naturally capture the numerical transformations during quantization-aware training, and therefore provide precise per-layer quantization sensitivity metrics. However, a deep network often contains hundreds of such indicators, and training them one by one would incur excessive time cost. To overcome this issue, we propose a joint training scheme that obtains all indicators at once, considerably accelerating indicator training by parallelizing the originally sequential training processes. With these learned importance indicators, we formulate the MPQ search problem as a one-time integer linear programming (ILP) problem. This avoids iterative search and significantly reduces search time without restricting the bit-width search space. For example, MPQ search on ResNet18 with our indicators takes only 0.06 seconds. Extensive experiments further show that our approach achieves SOTA accuracy on ImageNet for a wide range of models under various constraints (e.g., BitOps, compression rate).
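To make the one-time ILP formulation concrete, the following is a minimal sketch (not the authors' code) of selecting one bit-width per layer so as to minimize the total sensitivity derived from the learned indicators, subject to a BitOps-style budget. The layer names, sensitivity scores, cost model, and budget below are purely illustrative assumptions; PuLP is used only as an off-the-shelf ILP solver.

```python
# Sketch of the one-time ILP for MPQ bit-width search (illustrative values only).
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

layers = ["conv1", "conv2", "fc"]            # hypothetical layer names
bit_choices = [2, 4, 8]                      # candidate bit-widths per layer
sensitivity = {                              # assumed per-layer, per-bit sensitivity
    "conv1": {2: 0.9, 4: 0.3, 8: 0.1},       # (lower = smaller accuracy impact)
    "conv2": {2: 0.7, 4: 0.2, 8: 0.05},
    "fc":    {2: 0.4, 4: 0.1, 8: 0.02},
}
cost = {l: {b: b * b for b in bit_choices} for l in layers}  # toy BitOps proxy
budget = 100                                  # hypothetical BitOps budget

prob = LpProblem("mpq_search", LpMinimize)
# x[l][b] = 1 iff layer l is assigned bit-width b
x = {l: {b: LpVariable(f"x_{l}_{b}", cat=LpBinary) for b in bit_choices}
     for l in layers}

# Objective: minimize total quantization sensitivity across layers.
prob += lpSum(sensitivity[l][b] * x[l][b] for l in layers for b in bit_choices)
# Each layer receives exactly one bit-width.
for l in layers:
    prob += lpSum(x[l][b] for b in bit_choices) == 1
# Stay within the resource (e.g., BitOps) budget.
prob += lpSum(cost[l][b] * x[l][b] for l in layers for b in bit_choices) <= budget

prob.solve(PULP_CBC_CMD(msg=False))
assignment = {l: next(b for b in bit_choices if x[l][b].value() == 1) for l in layers}
print(assignment)  # selected bit-width per layer under the budget
```

Because the sensitivity scores are precomputed from the learned scale-factor indicators, the solver runs once over a small 0/1 program rather than evaluating candidate configurations iteratively, which is why the reported search time can be a fraction of a second.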