The ability to extrapolate, i.e., to make predictions on sequences longer than those presented as training examples, is a challenging problem for current deep learning models. Recent work shows that this limitation persists in state-of-the-art Transformer-based models. Most solutions to this problem use specific architectures or training methods that do not generalize to other tasks. We demonstrate that large language models can succeed in extrapolation without modifying their architecture or training procedure. Our experimental results show that generating step-by-step rationales and introducing markup tokens are both required for effective extrapolation. First, we induce the language model to produce step-by-step rationales before outputting the answer, which effectively communicates the task to the model. However, as sequences grow longer, we find that current models struggle to keep track of token positions. To address this issue, we interleave the output tokens with markup tokens that act as explicit positional and counting symbols. Our findings show how these two complementary approaches enable remarkable sequence extrapolation and highlight the inability of current architectures to generalize effectively without explicit surface-form guidance. Our code is available at https://github.com/MirelleB/induced-rationales-markup-tokens.
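As a rough illustration of the markup-token idea, the sketch below shows one way output tokens could be interleaved with explicit positional and counting symbols. This is a minimal sketch under assumed conventions, not the paper's exact tokenization: the `<p{i}>` marker format and the `interleave_with_markup` helper are hypothetical names introduced here for clarity.

```python
# Minimal sketch (assumed format, not the authors' implementation):
# interleave each output token with an explicit positional markup token.

def interleave_with_markup(output_tokens):
    """Insert a position/count marker before each output token."""
    annotated = []
    for i, tok in enumerate(output_tokens, start=1):
        annotated.append(f"<p{i}>")  # explicit positional/counting symbol
        annotated.append(tok)
    return annotated

# Example: the target sequence "2 4 8" becomes
# ['<p1>', '2', '<p2>', '4', '<p3>', '8']
print(interleave_with_markup(["2", "4", "8"]))
```

In this kind of scheme, the markers give the model a surface-form cue for how many items it has already emitted, rather than relying on implicit positional information alone.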