The vast majority of deep models use multiple gradient signals, typically corresponding to a sum of multiple loss terms, to update a shared set of trainable weights. However, these multiple updates can impede optimal training by pulling the model in conflicting directions. We present Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. GradDrop is implemented as a simple deep layer that can be used in any deep net and synergizes with other gradient balancing approaches. We show that GradDrop outperforms the state-of-the-art multiloss methods within traditional multitask and transfer learning settings, and we discuss how GradDrop reveals links between optimal multiloss training and gradient stochasticity.
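Below is a minimal NumPy sketch of the kind of sign-consistency masking the abstract describes: per-task gradients at a shared activation are kept or dropped depending on a sampled sign, where the sampling probability reflects how well the gradient signs agree. The specific purity formula and all names (graddrop, task_grads) are illustrative assumptions for this sketch, not the paper's released implementation.

```python
import numpy as np

def graddrop(task_grads, rng=None):
    """Combine per-task gradients with a GradDrop-style sign mask.

    task_grads: array of shape (num_tasks, *activation_shape) holding each
        loss's gradient w.r.t. the same activation tensor.
    Returns a single combined gradient of shape activation_shape.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = 1e-12

    # Positive-sign purity in [0, 1], computed elementwise over the activation:
    # close to 1 when all task gradients agree and are positive, close to 0
    # when they agree and are negative, near 0.5 when they conflict.
    total = task_grads.sum(axis=0)
    purity = 0.5 * (1.0 + total / (np.abs(task_grads).sum(axis=0) + eps))

    # For each activation element, sample whether to keep the positive or the
    # negative sign on this step.
    keep_positive = rng.random(purity.shape) < purity

    # Zero out every per-task gradient entry whose sign disagrees with the
    # sampled choice, then sum the survivors across tasks.
    mask = np.where(keep_positive, task_grads > 0, task_grads < 0)
    return (mask * task_grads).sum(axis=0)

# Example: three partially conflicting per-task gradients over a 4-unit activation.
grads = np.array([[ 0.5, -0.2,  0.1,  0.3],
                  [-0.4,  0.6,  0.1, -0.3],
                  [ 0.2, -0.1,  0.1,  0.0]])
print(graddrop(grads, rng=np.random.default_rng(0)))
```

Because the mask is resampled at every step, elements with strong sign agreement are almost always kept while conflicting elements are stochastically resolved toward one sign, which is the behavior the abstract attributes to GradDrop.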