Inference time, model size, and accuracy are three key factors in deep model compression. Most existing work addresses these three factors separately, as it is difficult to optimize them all at the same time. For example, low-bit quantization aims at obtaining a faster model; weight-sharing quantization aims at improving the compression ratio and accuracy; and mixed-precision quantization aims at balancing accuracy and inference time. To simultaneously optimize bit-width, model size, and accuracy, we propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method. We integrate L2 normalization, pruning, and the weight decay term to reduce the weight discrepancy in the gradient estimator during quantization, thus producing highly compressed ternary weights. Our method achieves the highest test accuracy and the highest compression ratio. For example, it produces a 939KB (49$\times$) 2-bit ternary ResNet-18 model with only a 4\% accuracy drop on the ImageNet dataset, and it compresses a 170MB Mask R-CNN model to 5MB (34$\times$) with only a 2.8\% average precision drop. Our method is verified on image classification and object detection/segmentation tasks with different network structures such as ResNet-18, ResNet-50, and MobileNetV2.
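For context, symmetric ternary quantization is commonly written as mapping each full-precision weight $w_i$ to one of three levels $\{-\alpha, 0, +\alpha\}$. The sketch below is the standard formulation with a magnitude threshold $\Delta$ and a shared per-tensor scale $\alpha$; these specific choices are illustrative and are not specified in this abstract:
\[
\hat{w}_i =
\begin{cases}
+\alpha, & w_i > \Delta,\\
0,       & |w_i| \le \Delta,\\
-\alpha, & w_i < -\Delta,
\end{cases}
\qquad
\alpha = \frac{1}{\left|\{i : |w_i| > \Delta\}\right|} \sum_{|w_i| > \Delta} |w_i|.
\]
During training, gradients are typically passed from the quantized weights $\hat{w}$ back to the latent full-precision weights through a straight-through estimator, which is where the weight discrepancy mentioned above arises.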