We focus on the recognition of Dyck-n ($\mathcal{D}_n$) languages with self-attention (SA) networks, which has been deemed a difficult task for these networks. We compare the performance of two variants of SA, one with a starting symbol (SA$^+$) and one without (SA$^-$). Our results show that SA$^+$ is able to generalize to longer sequences and deeper dependencies. For $\mathcal{D}_2$, we find that SA$^-$ completely breaks down on long sequences whereas the accuracy of SA$^+$ is 58.82\%. We find attention maps learned by SA$^+$ to be amenable to interpretation and compatible with a stack-based language recognizer. Surprisingly, the performance of SA networks is on par with LSTMs, which provides evidence of the ability of SA to learn hierarchies without recursion.
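For reference, a stack-based recognizer for $\mathcal{D}_2$ of the kind alluded to above can be sketched in a few lines; the snippet below is an illustrative assumption (not the authors' code), with the function name \texttt{is\_dyck2} and the two-bracket alphabet chosen for exposition.

```python
def is_dyck2(s: str) -> bool:
    """Minimal sketch: check membership in Dyck-2 (two bracket pairs)
    with an explicit stack, e.g. "([])()" is accepted, "([)]" is not."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)                      # push opening bracket
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False                      # unmatched or mismatched closer
        else:
            return False                          # symbol outside the D_2 alphabet
    return not stack                              # every opener must be closed

assert is_dyck2("([])()")
assert not is_dyck2("([)]")
```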