Transformers have become a standard architecture for many NLP problems. This has motivated theoretical analysis of their capabilities as models of language, in order to understand what makes them successful and where their potential weaknesses lie. Recent work has shown that transformers with hard attention are quite limited in capacity, and in fact can be simulated by constant-depth circuits. However, hard attention is a restrictive assumption, which may limit the relevance of these results for practical transformers. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We show that saturated transformers transcend the limitations of hard-attention transformers. Under some minor assumptions, we prove that the number of bits needed to represent a saturated transformer's memory vector is $O(\log n)$, which implies that saturated transformers can be simulated by log-depth circuits. Thus, the jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
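For intuition, the display below sketches one standard way to formalize saturated attention, namely as the limiting behavior of softmax attention when the attention scores are scaled to infinity; the notation ($q$, $K$, $V$, $\mathcal{M}$) is illustrative and not taken verbatim from the text above.
\[
\mathrm{s\text{-}attn}(q, K, V) \;=\; \lim_{c \to \infty} \mathrm{softmax}\!\left(c\, q K^\top\right) V \;=\; \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} v_i,
\qquad
\mathcal{M} = \operatorname*{arg\,max}_{1 \le i \le n} \; q \cdot k_i .
\]
Hard attention corresponds to the special case in which a single maximizing position receives all of the attention mass; saturated attention instead averages uniformly over all tied maximizers, which is what lets it aggregate information from more than one position and, in this sense, generalize hard attention.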