We demonstrate that, through appropriate prompting, the GPT-3 family of models can be triggered to perform the iterative behaviours necessary to execute (rather than merely write or recall) programs that involve loops, including several popular algorithms found in computer science curricula and software developer interviews. We trigger execution and description of iterations by regimenting self-attention (IRSA) in one, or a combination, of three ways: 1) using strong repetitive structure in an example of an execution path of a target program for one particular input, 2) prompting with fragments of execution paths, and 3) explicitly forbidding (skipping) self-attention to parts of the generated text. On a dynamic program execution task, IRSA leads to larger accuracy gains than replacing the model with the much more powerful GPT-4. IRSA has promising applications in education, as the prompts and responses resemble student assignments in data structures and algorithms classes. Our findings hold implications for evaluating LLMs, which typically target in-context learning: we show that prompts that may not even cover one full task example can trigger algorithmic behaviour, making it possible to solve problems previously thought hard for LLMs, such as logical puzzles. Consequently, prompt design plays an even more critical role in LLM performance than previously recognized.