Model compression by way of parameter pruning, quantization, or distillation has recently gained popularity as an approach for reducing the computational requirements of modern deep neural network models for NLP. Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models that is compatible with a wide variety of contemporary off-the-shelf hardware (unlike quantization), and that requires little additional training (unlike distillation). Pruning approaches typically take a large, accurate model as input, then attempt to discover a smaller subnetwork of that model capable of achieving end-task accuracy comparable to the full model. Inspired by previous work suggesting a connection between simpler, more generalizable models and those that lie within flat basins in the loss landscape, we propose to directly optimize for flat minima while performing task-specific pruning, which we hypothesize should lead to simpler parameterizations and thus more compressible models. In experiments combining sharpness-aware minimization with both iterative magnitude pruning and structured pruning approaches, we show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization when fine-tuning BERT models, yielding higher rates of compression with little to no loss in accuracy on the GLUE classification benchmark.
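To make the recipe concrete, the following is a minimal sketch, in PyTorch, of how sharpness-aware minimization (SAM) can be combined with iterative magnitude pruning. It is not the authors' implementation: the `sam_step` and `magnitude_prune` helpers, the `rho=0.05` neighborhood radius, and the global pruning threshold are illustrative assumptions, and a real experiment would apply them to a BERT model fine-tuned on GLUE rather than an arbitrary `nn.Module`.

```python
# Sketch: sharpness-aware minimization (SAM) combined with magnitude pruning.
# Assumptions (not from the paper): PyTorch, rho=0.05, global unstructured
# pruning threshold, and a generic nn.Module in place of BERT.
import torch
import torch.nn as nn


def sam_step(model, loss_fn, inputs, targets, base_opt, rho=0.05):
    """One SAM update: perturb weights toward higher loss within an L2 ball
    of radius rho, then descend using the gradient at the perturbed point."""
    base_opt.zero_grad()
    # First pass: gradient at the current weights.
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    grads = [p.grad.clone() for p in model.parameters()]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    # Ascent step: move each parameter along its gradient direction.
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = rho * g / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    # Second pass: gradient at the perturbed (approximate worst-case) weights.
    base_opt.zero_grad()
    loss_fn(model(inputs), targets).backward()
    # Restore the original weights, then let the base optimizer (e.g. Adam)
    # step using the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()
    return loss.item()


def magnitude_prune(model, sparsity):
    """Zero out the smallest-magnitude weights globally; return the masks."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() > 1:
                masks[name] = (p.abs() > threshold).float()
                p.mul_(masks[name])
    return masks
```

In an iterative-magnitude-pruning schedule, one would alternate SAM fine-tuning epochs with calls to `magnitude_prune` at increasing sparsity levels, reapplying the stored masks after every optimizer step so that pruned weights remain zero; the paper's exact training schedule and its structured-pruning variant are not reproduced in this sketch.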