Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz when the input domain is unbounded, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.
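To make the contrast concrete, the two forms of attention can be sketched as follows; the softmax scaling by \(\sqrt{d}\) and the tied query/key weights \(W^Q = W^K\) are assumptions about the construction, included here only for illustration rather than taken verbatim from the abstract. Standard dot-product self-attention computes, for the \(i\)-th output,
\[
\mathrm{Attn}_{\mathrm{dot}}(X)_i \;=\; \sum_j \mathrm{softmax}_j\!\left(\frac{q_i^\top k_j}{\sqrt{d}}\right) v_j,
\qquad q_i = W^Q x_i,\quad k_j = W^K x_j,\quad v_j = W^V x_j,
\]
whose logits \(q_i^\top k_j\) grow without bound on an unbounded input domain. The L2 variant instead uses squared-distance logits,
\[
\mathrm{Attn}_{L2}(X)_i \;=\; \sum_j \mathrm{softmax}_j\!\left(-\frac{\lVert q_i - k_j\rVert_2^2}{\sqrt{d}}\right) v_j,
\qquad \text{with } W^Q = W^K \text{ assumed tied},
\]
and it is this change of logits that makes a Lipschitz bound possible.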