We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the \emph{tuned lens}, is a refinement of the earlier ``logit lens'' technique, which yielded useful insights but is often brittle. We test our method on various autoregressive language models with up to 20B parameters, showing it to be more predictive, reliable and unbiased than the logit lens. With causal experiments, we show the tuned lens uses similar features to the model itself. We also find the trajectory of latent predictions can be used to detect malicious inputs with high accuracy. All code needed to reproduce our results can be found at https://github.com/AlignmentResearch/tuned-lens.
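The per-block affine probe described above can be sketched minimally as follows. This is a hypothetical illustration, not the released implementation: `TunedLensProbe`, `final_norm`, and `unembed` are assumed names, and the frozen model components are replaced with stand-in modules.

```python
import torch
import torch.nn as nn

class TunedLensProbe(nn.Module):
    """Minimal sketch of one tuned-lens probe (hypothetical names).

    Maps the hidden state at a given block to vocabulary logits by
    applying a learned affine "translator", then the frozen model's
    final layer norm and unembedding.
    """

    def __init__(self, d_model: int, final_norm: nn.Module, unembed: nn.Module):
        super().__init__()
        # Initialize the translator to the identity so training starts
        # from logit-lens behavior.
        self.translator = nn.Linear(d_model, d_model)
        nn.init.eye_(self.translator.weight)
        nn.init.zeros_(self.translator.bias)
        self.final_norm = final_norm  # frozen, taken from the pretrained model
        self.unembed = unembed        # frozen unembedding matrix

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> logits: (batch, seq, vocab)
        return self.unembed(self.final_norm(self.translator(hidden)))

# Usage with stand-in frozen components:
d_model, vocab = 16, 100
probe = TunedLensProbe(d_model, nn.LayerNorm(d_model),
                       nn.Linear(d_model, vocab, bias=False))
logits = probe(torch.randn(2, 5, d_model))
```

Only the translator's parameters would be trained (e.g. against the frozen model's final-layer distribution); the norm and unembedding stay fixed, so the probe decodes intermediate states in the model's own output basis.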