Despite the omnipresence of transformer neural nets in modern NLP, characterizing their computational power remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight into the power of transformers using known results in complexity theory. For example, if $\mathsf L \neq \mathsf P$ (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture's high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.