Many recent approaches to natural language tasks are built on the remarkable abilities of large language models. Large language models can perform in-context learning, where they learn a new task from a few task demonstrations, without any parameter updates. This work examines the implications of in-context learning for the creation of datasets for new natural language tasks. Departing from recent in-context learning methods, we formulate an annotation-efficient, two-step framework: selective annotation that chooses a pool of examples to annotate from unlabeled data in advance, followed by prompt retrieval that retrieves task examples from the annotated pool at test time. Based on this framework, we propose an unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate. Extensive experiments on 10 datasets (covering classification, commonsense reasoning, dialogue, and text/code generation) demonstrate that our selective annotation method improves task performance by a large margin. On average, vote-k achieves a 12.9%/11.4% relative gain under an annotation budget of 18/100, as compared to randomly selecting examples to annotate. Compared to state-of-the-art supervised finetuning approaches, it yields similar performance with 10-100x less annotation cost across 10 tasks. We further analyze the effectiveness of our framework in various scenarios: language models with varying sizes, alternative selective annotation methods, and cases where there is a test data domain shift. We hope that our studies will serve as a basis for data annotations as large language models are increasingly applied to new tasks. Our code is available at https://github.com/HKUNLP/icl-selective-annotation.
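The two-step framework described above can be illustrated with a minimal sketch. This is not the vote-k algorithm: step 1 uses a simple greedy diversity heuristic as a stand-in, and the function names (`select_to_annotate`, `retrieve_prompts`), the toy 2-D embeddings, and the labels are all hypothetical, chosen only to show how selection in advance and retrieval at test time fit together.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_to_annotate(embeddings, budget):
    """Step 1 (stand-in, NOT vote-k): greedily pick diverse examples.

    Starts from index 0, then repeatedly adds the unlabeled example whose
    highest similarity to the already-selected set is lowest.
    """
    selected = [0]
    while len(selected) < min(budget, len(embeddings)):
        candidates = [i for i in range(len(embeddings)) if i not in selected]
        best = min(
            candidates,
            key=lambda i: max(cosine(embeddings[i], embeddings[j]) for j in selected),
        )
        selected.append(best)
    return selected


def retrieve_prompts(test_embedding, annotated_pool, k):
    """Step 2: return the k annotated examples most similar to the test input.

    annotated_pool is a list of (index, embedding, label) tuples.
    """
    ranked = sorted(
        annotated_pool,
        key=lambda example: cosine(test_embedding, example[1]),
        reverse=True,
    )
    return ranked[:k]


# Toy unlabeled data: hypothetical 2-D embeddings.
unlabeled = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]

# Step 1: choose which examples to send to annotators (budget of 3).
chosen = select_to_annotate(unlabeled, budget=3)

# Hypothetical labels returned by annotators for the chosen examples.
labels = {0: "A", 2: "B", 4: "C"}
pool = [(i, unlabeled[i], labels[i]) for i in chosen]

# Step 2: at test time, retrieve in-context examples for a new input.
prompts = retrieve_prompts([0.2, 0.8], pool, k=2)
retrieved_indices = [index for index, _, _ in prompts]
```

In practice the embeddings would come from a sentence encoder and the retrieved examples would be formatted into the language model's prompt; both parts are omitted here.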