During typical gradient-based training of deep neural networks, all of the model's parameters are updated at each iteration. Recent work has shown that it is possible to update only a small subset of the model's parameters during training, which can alleviate storage and communication requirements. In this paper, we show that it is possible to induce a fixed sparse mask on the model's parameters that selects a subset to update over many iterations. Our method constructs the mask out of the $k$ parameters with the largest Fisher information, a simple approximation of which parameters are most important for the task at hand. In experiments on parameter-efficient transfer learning and distributed training, we show that our approach matches or exceeds the performance of other methods for training with sparse updates while being more efficient in terms of memory usage and communication costs. We release our code publicly to promote further applications of our approach.
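The sketch below illustrates the general idea in PyTorch under simplifying assumptions: a diagonal empirical Fisher is estimated by accumulating squared gradients over a few batches, the top-$k$ scores define a fixed binary mask, and all other gradients are zeroed before each optimizer step. Names such as `compute_fisher_mask`, `apply_mask_to_grads`, and `mask_frac` are illustrative and not the released code's API.

```python
# Minimal sketch: select a fixed sparse mask from an empirical Fisher
# estimate, then train only the selected parameters.
import torch


def compute_fisher_mask(model, data_loader, loss_fn, mask_frac=0.005, num_batches=8):
    """Return {name: 0/1 tensor} keeping roughly the mask_frac fraction of
    parameters with the largest accumulated squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    for i, (x, y) in enumerate(data_loader):
        if i >= num_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # Squared gradient accumulates a diagonal empirical Fisher estimate.
                fisher[n] += p.grad.detach() ** 2
    # Global threshold so only the top-k scores across all parameters survive.
    all_scores = torch.cat([f.flatten() for f in fisher.values()])
    k = max(1, int(mask_frac * all_scores.numel()))
    threshold = torch.topk(all_scores, k).values.min()
    return {n: (f >= threshold).float() for n, f in fisher.items()}


def apply_mask_to_grads(model, mask):
    """Zero the gradients of all parameters outside the fixed sparse mask."""
    for n, p in model.named_parameters():
        if p.grad is not None and n in mask:
            p.grad.mul_(mask[n])
```

In this setup, `compute_fisher_mask` would be called once before training, and `apply_mask_to_grads(model, mask)` after each `loss.backward()` and before `optimizer.step()`, so the same small subset of parameters is updated at every iteration.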