We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bitwise shifts, XORs, rotations, and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, on tasks that lie beyond published classical attacks. In our experiments, we scale moduli up to $2^{22}$, using models with up to $50$ million parameters and datasets of up to $5$ billion tokens. Surprisingly, we find that even when the output is truncated to a single bit, the model can still predict it reliably. When multiple distinct PRNGs are presented together during training, the model learns them jointly, identifying the structure of each permutation. We demonstrate a scaling law in the modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating that curriculum learning is essential in this regime. Finally, we analyze the embedding layers and uncover a novel clustering phenomenon: the model spontaneously groups the integer inputs into clusters that are invariant under bitwise rotations, revealing how representations can transfer from smaller to larger moduli.
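To make the permutation steps concrete, the following is a minimal sketch of one standard PCG variant (PCG-XSH-RR with a 64-bit state and 32-bit output), using the constants of the public reference implementation; it is meant only to illustrate the shift/XOR/rotate/truncate pipeline and is not necessarily the exact variant, state width, or modulus used in our experiments.

```python
# Minimal PCG-XSH-RR sketch (64-bit state, 32-bit output).
# Constants follow the public PCG reference implementation; the variants and
# moduli studied in the paper may differ.

MASK64 = (1 << 64) - 1
MULTIPLIER = 6364136223846793005   # standard LCG multiplier used by PCG
INCREMENT = 1442695040888963407    # any odd increment is valid

def pcg32_step(state: int) -> tuple[int, int]:
    """Advance the underlying LCG state and apply the XSH-RR output permutation."""
    # Underlying LCG: state <- (a * state + c) mod 2^64
    new_state = (state * MULTIPLIER + INCREMENT) & MASK64
    # Output permutation on the *old* state:
    # xorshift-high, then a data-dependent rotation, then truncation to 32 bits.
    xorshifted = (((state >> 18) ^ state) >> 27) & 0xFFFFFFFF
    rot = state >> 59                                   # top 5 bits pick the rotation
    output = ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & 0xFFFFFFFF
    return new_state, output

# Example: emit a short output sequence from an arbitrary seed.
state = 42
for _ in range(4):
    state, out = pcg32_step(state)
    print(out)
```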