Softmax is widely used in neural networks for multiclass classification, gating structures, and attention mechanisms. The statistical assumption that the input is normally distributed supports the gradient stability of Softmax. However, when Softmax is used in attention mechanisms such as Transformers, the correlation scores between embeddings are often not normally distributed, and the gradient vanishing problem appears; we confirm this point experimentally. In this work, we propose replacing the exponential function with periodic functions, and we investigate several potential periodic alternatives to Softmax from the perspectives of value and gradient. Through experiments on a simple demo model designed with reference to LeViT, our method is shown to alleviate the gradient problem and yield substantial improvements over Softmax and its variants. Furthermore, we analyze the impact of pre-normalization on Softmax and on our methods, both mathematically and experimentally. Finally, we increase the depth of the demo model and demonstrate the applicability of our method in deep architectures.
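To make the core idea concrete, the following is a minimal sketch of a softmax-like normalization in which the exponential is swapped for a periodic function. The specific choice sin(x) + 1 (a non-negative, bounded activation), the function name, and the epsilon term are assumptions made here for illustration only; they are not the exact formulation studied in the paper.

```python
import torch

def periodic_attention_weights(scores: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    """Softmax-like normalization with a periodic function in place of exp.

    sin(x) + 1 is used purely as an example: it is non-negative and bounded,
    so its gradient does not saturate for large-magnitude scores the way
    exp-based Softmax can when inputs deviate strongly from zero.
    """
    # Non-negative periodic activation replacing exp(scores)
    activated = torch.sin(scores) + 1.0
    # Normalize so the weights along the attention dimension sum to 1
    return activated / (activated.sum(dim=dim, keepdim=True) + eps)


# Example usage: drop-in replacement for Softmax over attention scores
scores = torch.randn(2, 4, 8, 8)   # (batch, heads, queries, keys)
weights = periodic_attention_weights(scores)
print(weights.sum(dim=-1))         # each row sums to approximately 1
```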