We revisit a basic question in sequence modeling: is explicit self-attention actually necessary for strong performance and reasoning? We argue that standard multi-head attention is best seen as a form of tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. This mechanism is highly expressive but mathematically opaque, because after many layers the model becomes very hard to describe with a small family of explicit invariants. To explore an alternative, we propose an attention-free architecture based on Grassmann flows. Instead of forming an L × L attention matrix, our Causal Grassmann layer (i) linearly reduces token states, (ii) encodes local token pairs as two-dimensional subspaces (points on a Grassmann manifold) via Plücker coordinates, and (iii) fuses these geometric features back into the hidden states through gated mixing. Information therefore propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space. On the WikiText-2 language modeling benchmark, purely Grassmann-based models with 13 to 18 million parameters achieve validation perplexities within about 10 to 15 percent of size-matched Transformers. On the SNLI natural language inference task, a Grassmann-Plücker head on top of DistilBERT slightly outperforms a Transformer head, with best validation and test accuracies of 0.8550 and 0.8538 versus 0.8545 and 0.8511. We analyze the complexity of Grassmann mixing, show linear scaling in sequence length for fixed rank, and argue that such manifold-based designs offer a more structured route toward geometric, invariant-based interpretations of neural reasoning.
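To make the three steps concrete, the following is a minimal sketch (not the authors' implementation) of a single causal Grassmann mixing step in PyTorch. It assumes a hidden width d, a reduced rank r, and one local window of size `window`; the names `CausalGrassmannLayer` and `plucker` are illustrative, and details such as multi-scale windowing, normalization, and the exact form of the gating are guesses based only on the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def plucker(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Plücker coordinates of span{u, v}: all 2x2 minors u_i*v_j - u_j*v_i with i < j."""
    r = u.shape[-1]
    # Antisymmetric wedge matrix u ∧ v; its upper triangle holds the minors.
    wedge = u.unsqueeze(-1) * v.unsqueeze(-2) - v.unsqueeze(-1) * u.unsqueeze(-2)
    idx_i, idx_j = torch.triu_indices(r, r, offset=1)
    p = wedge[..., idx_i, idx_j]  # shape (..., r*(r-1)/2)
    # Normalize so the coordinates depend on the subspace, not on the basis scale.
    return p / (p.norm(dim=-1, keepdim=True) + 1e-6)


class CausalGrassmannLayer(nn.Module):
    """One local window: pair each token with the token `window` steps earlier."""

    def __init__(self, d: int, r: int = 8, window: int = 1):
        super().__init__()
        self.window = window
        self.reduce = nn.Linear(d, r)        # (i) linear reduction of token states
        p_dim = r * (r - 1) // 2
        self.fuse = nn.Linear(p_dim, d)      # lift geometric features back to width d
        self.gate = nn.Linear(d + p_dim, d)  # (iii) gated mixing into the hidden states

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, length, d)
        u = self.reduce(h)                                     # current token, reduced
        v = F.pad(u, (0, 0, self.window, 0))[:, : u.shape[1]]  # past token, causal shift
        p = plucker(u, v)                                      # (ii) subspace features
        g = torch.sigmoid(self.gate(torch.cat([h, p], dim=-1)))
        return h + g * self.fuse(p)                            # residual, gated update


if __name__ == "__main__":
    layer = CausalGrassmannLayer(d=64, r=8, window=1)
    x = torch.randn(2, 16, 64)
    print(layer(x).shape)  # torch.Size([2, 16, 64])
```

Under these assumptions the scaling claim is easy to see: each token contributes r(r-1)/2 Plücker coordinates per window, so one layer costs O(L · r²) for sequence length L and fixed rank r, i.e., linear in L; multi-scale windows would multiply this by the number of window sizes rather than by L.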


