Fine-tuning language models on tasks described by instructions has shown promise for zero-shot generalization to unseen tasks. In this paper, we introduce a straightforward yet effective method for enhancing instruction tuning with symbolic tasks. Compared to crowdsourced human tasks or model-generated tasks, symbolic tasks offer a unique advantage: they can be generated automatically at scale, in principle providing an unlimited supply of high-quality training instances. To explore this potential, we carry out an extensive case study on a representative symbolic task, SQL execution. Empirical results on various benchmarks show that incorporating SQL execution yields significant improvements in zero-shot scenarios, particularly in table reasoning. Notably, our 3B model surpasses both the 175B GPT-3 and ChatGPT on zero-shot table reasoning across four benchmarks. Furthermore, experiments on BBH (27 tasks) and MMLU (57 tasks) show that language models can be enhanced through symbolic tasks without compromising their generality. We hope this paper serves as a catalyst, inspiring further efforts to incorporate symbolic tasks into instruction tuning.