Factorizing a large matrix into smaller matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a central role in this strategy, approximating a learned matrix with far fewer parameters. However, SVD minimizes the squared error of reconstructing the original matrix without gauging the importance of individual parameters, potentially incurring a larger reconstruction error on parameters that affect the task accuracy more. In other words, the optimization objective of SVD is not aligned with the trained model's task accuracy. We analyze this previously unexplored problem, make observations, and address it by introducing Fisher information to weight the importance of parameters according to their influence on the model prediction. This idea leads to our method, Fisher-Weighted SVD (FWSVD). Although the factorized matrices from our approach do not yield smaller reconstruction errors, we find that the resulting task accuracy is much closer to that of the original model. Our analysis of transformer-based language models shows that the weighted SVD largely alleviates the mismatched optimization objectives and maintains model performance at higher compression rates. Our method directly compresses a task-specific model and achieves better performance than other compact-model strategies that require expensive pre-training. Moreover, when compressing an already compact model, our method further reduces parameters by 9% to 30% with an insignificant impact on task accuracy.
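To make the idea concrete, the sketch below shows one way a Fisher-weighted low-rank factorization could be implemented, assuming Fisher information is approximated by accumulated squared gradients and aggregated per matrix row before the SVD; these choices and the helper name `fisher_weighted_svd` are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def fisher_weighted_svd(W, fisher, rank):
    """Rank-constrained factorization of W (d_out x d_in) whose
    reconstruction error is weighted by row-wise Fisher importance.

    W      : weight matrix to compress
    fisher : element-wise Fisher estimates (e.g. squared gradients
             accumulated over training data), same shape as W
    rank   : number of singular components to keep
    """
    # Aggregate element-wise Fisher information into one weight per row and
    # take the square root, so the weighted objective remains a standard
    # least-squares problem solvable by plain SVD.
    row_weights = np.sqrt(fisher.sum(axis=1, keepdims=True))  # (d_out, 1)
    row_weights = np.maximum(row_weights, 1e-8)               # avoid divide-by-zero

    # Factorize the row-scaled matrix diag(w) @ W instead of W itself.
    U, S, Vt = np.linalg.svd(row_weights * W, full_matrices=False)

    # Truncate to the target rank and undo the row scaling on the left
    # factor, so A @ B approximates W rather than the weighted matrix.
    A = (U[:, :rank] * S[:rank]) / row_weights   # (d_out, rank)
    B = Vt[:rank, :]                             # (rank, d_in)
    return A, B
```

In this formulation the rows that carry larger Fisher estimates receive proportionally larger penalties in the reconstruction objective, which is the mechanism by which the factorization trades raw reconstruction error for task accuracy.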