Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer's expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding converge to precisely the class $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers' expressive power: with polylogarithmic looping, padded transformers converge to the class $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought.
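As a rough, hedged illustration of the two inference-time mechanisms the abstract refers to (and not the construction analyzed in the paper, which uses averaging hard attention and masked pre-norm rather than standard softmax attention), the sketch below pads an input of length $n$ with blank tokens up to polynomial length and then reuses one weight-tied transformer block for roughly $\log^d n$ steps. All names here (`pad_and_loop`, `pad_id`, `poly_degree`) are hypothetical placeholders introduced for this example.

```python
import math
import torch
import torch.nn as nn

def pad_and_loop(token_ids, block, embed, pad_id, d=2, poly_degree=2):
    """Illustrative only: polynomial padding + polylogarithmic looping."""
    batch, n = token_ids.shape
    # Polynomial padding: extend the sequence to n**poly_degree positions with a blank token.
    padded_len = n ** poly_degree
    pad = torch.full((batch, padded_len - n), pad_id, dtype=torch.long)
    x = embed(torch.cat([token_ids, pad], dim=1))
    # Looping: reuse the same weight-tied block ~ (log2 n)^d times -- dynamic depth, no new parameters.
    steps = max(1, math.ceil(math.log2(max(n, 2)) ** d))
    for _ in range(steps):
        x = block(x)
    return x

vocab, d_model, pad_id = 16, 32, 0
embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=64, batch_first=True)
tokens = torch.randint(1, vocab, (1, 8))          # an input of length n = 8
out = pad_and_loop(tokens, block, embed, pad_id)  # shape: (1, 64, 32)
```

Because every loop iteration is an ordinary parallel forward pass over the padded sequence, the extra computation remains parallelizable, in contrast to the token-by-token sequential decoding that chain of thought requires.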