The attention mechanism has mostly been used as an auxiliary component of RNNs or CNNs. Recently, however, the Transformer (Vaswani et al., 2017) achieved state-of-the-art performance in machine translation with a dramatic reduction in training time by relying solely on attention. Motivated by the Transformer, the Directional Self-Attention Network (Shen et al., 2017), a fully attention-based sentence encoder, was proposed; it performed well on various datasets by exploiting forward and backward directional information within a sentence. However, that work did not consider the distance between words, an important feature for learning the local dependencies that help capture the context of the input text. We propose the Distance-based Self-Attention Network, which accounts for word distance through a simple distance mask, modeling local dependencies without losing the ability to model global dependencies that is inherent to attention. Our model performs well on NLI data, and it sets a new state-of-the-art result on SNLI. Additionally, we show that our model is particularly strong on long sentences or documents.
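To make the idea of a distance mask concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention with an additive distance penalty. It assumes the penalty is simply the negative absolute index difference |i - j| scaled by a hypothetical factor `alpha` and added to the attention logits before the softmax; the exact functional form and scaling in the paper may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distance_masked_attention(Q, K, V, alpha=1.0):
    """Scaled dot-product self-attention with an additive distance penalty.

    Q, K, V: arrays of shape (seq_len, d_model).
    alpha:   hypothetical scaling factor for the distance penalty (assumption).
    """
    seq_len, d_model = Q.shape
    # Standard scaled dot-product attention logits.
    logits = Q @ K.T / np.sqrt(d_model)          # (seq_len, seq_len)
    # Distance mask: penalize attention between distant positions so that
    # nearby words (local dependencies) receive relatively more weight,
    # while distant positions can still be attended to (global dependency).
    idx = np.arange(seq_len)
    dist = np.abs(idx[:, None] - idx[None, :])   # |i - j|
    logits = logits - alpha * dist
    weights = softmax(logits, axis=-1)           # attention weights
    return weights @ V                           # (seq_len, d_model)

# Toy usage: 5 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
out = distance_masked_attention(x, x, x, alpha=0.5)
print(out.shape)  # (5, 8)
```

Because the penalty is added inside the softmax rather than hard-masking distant positions, attention to far-away words is attenuated but not eliminated, which is how local dependency modeling can coexist with attention's global receptive field.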