Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, e.g., positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.
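As a concrete illustration of this conjecture, the following minimal sketch (illustrative only, not the probing experiments above; `seq_len` and `d_model` are arbitrary values) shows how the causal mask alone can leak absolute position: with uniform attention over the visible prefix and value vectors that carry no positional signal, position i receives the mean of i + 1 vectors, so simple statistics of the output, such as its norm, vary systematically with i and could be read off by a probe.

```python
import torch

torch.manual_seed(0)
seq_len, d_model = 16, 64

# Content-only input: i.i.d. token representations with no positional signal.
v = torch.randn(seq_len, d_model)

# Causal mask: position i may attend only to tokens 0..i.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
attn = torch.softmax(mask, dim=-1)  # uniform weight 1/(i+1) over each visible prefix

out = attn @ v  # row i is the mean of the first i + 1 value vectors

# The norm of row i shrinks roughly as 1/sqrt(i + 1), so absolute position
# is recoverable from the attention output even without positional encodings.
print(out.norm(dim=-1))
```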