Transformer-based large language models are trained to make predictions about the next word by aggregating representations of previous tokens through their self-attention mechanism. In the field of cognitive modeling, such attention patterns have recently been interpreted as embodying the process of cue-based retrieval, in which attention over multiple targets is taken to generate interference and latency during retrieval. Under this framework, this work first defines an entropy-based predictor that quantifies the diffuseness of self-attention, as well as distance-based predictors that capture the incremental change in attention patterns across timesteps. Moreover, following recent studies that question the informativeness of attention weights, we also experiment with alternative methods for incorporating vector norms into attention weights. Regression experiments using predictors calculated from the GPT-2 language model show that these predictors deliver a substantially better fit to held-out self-paced reading and eye-tracking data than a rigorous baseline including GPT-2 surprisal. Additionally, the distance-based predictors generally demonstrate higher predictive power, with effect sizes of up to 6.59 ms per standard deviation on self-paced reading times (compared to 2.82 ms for surprisal) and 1.05 ms per standard deviation on eye-gaze durations (compared to 3.81 ms for surprisal).
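As an illustration of how such predictors could be derived from a pretrained model, the sketch below computes a per-token attention-entropy measure and a simple distance-based measure from GPT-2's self-attention via the Hugging Face transformers library. This is a minimal sketch rather than the paper's released implementation: averaging over all layers and heads, the L1 distance metric, the shared-prefix alignment of consecutive attention rows, and the omission of the vector-norm weighting are all simplifying assumptions made here for illustration.

```python
# Minimal sketch (not the authors' released code): derive a per-token
# attention-entropy predictor and a simple distance-based predictor from
# GPT-2 self-attention. Layer/head averaging, the L1 distance, and the
# shared-prefix alignment are assumptions for illustration only.
import torch
from transformers import GPT2TokenizerFast, GPT2Model

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

text = "The horse raced past the barn fell."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Stack per-layer attentions and drop the batch dimension:
# (num_layers, num_heads, seq_len, seq_len)
attn = torch.stack(outputs.attentions).squeeze(1)
# Average over layers and heads: row i is token i's attention distribution
# over tokens 0..i (causal masking zeroes out later positions).
attn = attn.mean(dim=(0, 1))

# Entropy predictor: diffuseness of each token's attention distribution.
eps = 1e-12
entropy = -(attn * (attn + eps).log()).sum(dim=-1)

# Distance predictor: L1 distance between consecutive attention rows,
# compared over the shared prefix of previously seen tokens (an assumption).
seq_len = attn.size(0)
distance = torch.zeros(seq_len)
for i in range(1, seq_len):
    distance[i] = (attn[i, :i] - attn[i - 1, :i]).abs().sum()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, h, d in zip(tokens, entropy.tolist(), distance.tolist()):
    print(f"{tok!r}\tentropy={h:.3f}\tdistance={d:.3f}")
```

In a regression setting, per-token values like these would be aligned to the words of a reading-time corpus and entered as predictors alongside surprisal and other baseline covariates.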