Inference time, model size, and accuracy are three key factors in deep model compression. Most existing work addresses these three factors separately, since optimizing all of them simultaneously is difficult. For example, low-bit quantization aims at obtaining a faster model; weight-sharing quantization aims at improving the compression ratio and accuracy; and mixed-precision quantization aims at balancing accuracy and inference time. To simultaneously optimize bit-width, model size, and accuracy, we propose pruning ternary quantization (PTQ): a simple, effective, symmetric ternary quantization method. We integrate L2 normalization, pruning, and a weight decay term to reduce the weight discrepancy in the gradient estimator during quantization, thus producing highly compressed ternary weights. Our method achieves both the highest test accuracy and the highest compression ratio. For example, it produces a 939 KB (49$\times$) 2-bit ternary ResNet-18 model with only a 4\% accuracy drop on the ImageNet dataset, and it compresses a 170 MB Mask R-CNN model to 5 MB (34$\times$) with only a 2.8\% drop in average precision. Our method is verified on image classification and object detection/segmentation tasks with different network structures such as ResNet-18, ResNet-50, and MobileNetV2.
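To make the core idea concrete, the sketch below illustrates symmetric ternary quantization combined with magnitude pruning and a straight-through gradient estimator. It is a minimal illustration, not the authors' implementation: the pruning ratio, the mean-magnitude scaling rule, and the plain STE backward pass are assumptions chosen for brevity (the L2 normalization and weight decay terms of PTQ are not shown).

\begin{verbatim}
# Minimal sketch (assumed, not the paper's exact method) of symmetric
# ternary quantization with magnitude pruning and a straight-through
# estimator (STE) in PyTorch.
import torch


class TernaryQuant(torch.autograd.Function):
    """Quantize weights to {-a, 0, +a}; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w, prune_ratio=0.5):
        # Magnitude pruning: zero out the smallest |w| entries
        # (prune_ratio is an illustrative assumption).
        k = int(prune_ratio * w.numel())
        threshold = (w.abs().flatten().kthvalue(k).values
                     if k > 0 else w.new_tensor(0.0))
        mask = (w.abs() > threshold).float()
        # Symmetric scale: mean magnitude of the surviving weights.
        scale = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
        return scale * torch.sign(w) * mask

    @staticmethod
    def backward(ctx, grad_output):
        # STE: gradient w.r.t. the latent full-precision weights is
        # passed through unchanged; no gradient for prune_ratio.
        return grad_output, None


if __name__ == "__main__":
    w = torch.randn(4, 4, requires_grad=True)
    w_t = TernaryQuant.apply(w)     # ternary weights for the forward pass
    loss = (w_t ** 2).sum()         # dummy loss
    loss.backward()                 # gradients flow to the latent weights
    print(w_t.unique())             # at most three distinct values
\end{verbatim}

Because each weight takes one of only three values, it can be stored in 2 bits plus a single per-layer scale, which is where the reported compression ratios come from.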