We propose pruning ternary quantization (PTQ), a simple yet effective symmetric ternary quantization method. The method compresses neural network weights to sparse ternary values in {-1, 0, 1}, substantially reducing computational, storage, and memory footprints. We show that PTQ can convert regular weights to ternary orthonormal bases using only pruning and L2 projection. In addition, we introduce a refined straight-through estimator to finalize and stabilize the quantized weights. Our method achieves up to a 46x compression ratio on the ResNet-18 architecture with an acceptable accuracy of 65.36%, outperforming leading methods. Furthermore, PTQ compresses a ResNet-18 model from 46 MB to 955 KB (~48x) and a ResNet-50 model from 99 MB to 3.3 MB (~30x), while the top-1 accuracy on ImageNet drops only slightly, from 69.7% to 65.3% and from 76.15% to 74.47%, respectively. Our method unifies pruning and quantization and thus provides a range of size-accuracy trade-offs.
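To make the recipe concrete, the following is a minimal PyTorch sketch of symmetric ternary quantization combining magnitude pruning with a straight-through estimator. It is not the authors' implementation: the sparsity fraction, the mean-magnitude scaling factor, and the class name `TernarySTE` are illustrative assumptions standing in for the paper's pruning threshold and L2 projection.

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Ternarize weights in the forward pass; pass gradients straight through in backward."""

    @staticmethod
    def forward(ctx, w, sparsity):
        # Prune: zero out the smallest-magnitude weights (illustrative threshold choice).
        k = max(int(sparsity * w.numel()), 1)
        threshold = w.abs().flatten().kthvalue(k).values
        mask = (w.abs() > threshold).float()
        # Symmetric ternary codes in {-1, 0, +1}, rescaled by the mean magnitude of the
        # surviving weights (a stand-in for the paper's L2 projection back to weight scale).
        alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)
        return alpha * torch.sign(w) * mask

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: gradient flows to the full-precision weights unchanged.
        return grad_out, None

# Usage sketch: quantize a layer's weights during the forward pass of training.
w = torch.randn(64, 128, requires_grad=True)
w_q = TernarySTE.apply(w, 0.7)   # keep roughly 30% of weights, set the rest to 0
loss = w_q.sum()
loss.backward()                  # gradients reach w via the straight-through path
```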