Despite recent advances showing that models pre-trained on large-scale source code data acquire appreciable generalization capability, such models still require a sizeable amount of target-task data for fine-tuning. Moreover, the effectiveness of this generalization is largely determined by the size and quality of the fine-tuning data, which is problematic for target tasks with limited or unavailable resources. Cross-task generalization, which aims to improve a model's generalization to previously unseen tasks, is therefore of strong research and application value. In this paper, we propose a large-scale benchmark comprising 216 existing code-related tasks. We annotate each task with corresponding meta information, such as a task description and an instruction, which provide detailed information about the task and a solution guide. This annotation also makes it easy to create a wide variety of ``training/evaluation'' task splits for evaluating the model's various cross-task generalization capabilities. We then conduct preliminary experiments demonstrating that the cross-task generalization of models can be largely improved by in-context learning methods such as few-shot learning and learning from task instructions, which shows the promising prospects of conducting cross-task learning research on our benchmark. We hope that the collected datasets and our benchmark will facilitate future work that is not limited to cross-task generalization.