Cross-task generalization is a significant outcome that defines mastery in natural language understanding. Humans show a remarkable aptitude for it: given a task definition in the form of textual instructions and a small set of examples, they can solve many different types of tasks. Recent work with pre-trained language models mimics this learning style: users define and exemplify a task for the model to attempt as a series of natural language prompts or instructions. While prompting approaches have led to higher cross-task generalization than traditional supervised learning, analyzing 'bias' in the task instructions given to the model is a difficult problem and has thus remained relatively unexplored. For instance, are we truly modeling a task, or are we modeling a user's instructions? To help investigate this question, we develop LINGO, a novel visual analytics interface that supports an effective, task-driven workflow to (1) help identify bias in natural language task instructions, (2) alter (or create) task instructions to reduce bias, and (3) evaluate pre-trained model performance on debiased task instructions. To robustly evaluate LINGO, we conduct a user study with both novice and expert instruction creators over a dataset of 1,616 linguistic tasks and their natural language instructions, spanning 55 different languages. For both user groups, LINGO promotes the creation of tasks that are more difficult for pre-trained models and that exhibit higher linguistic diversity and lower instruction bias. We additionally discuss how the insights learned from developing and evaluating LINGO can aid the design of future dashboards that aim to minimize the effort of prompt creation across multiple domains.
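To make the three-step workflow concrete, the snippet below is a minimal, hypothetical sketch of step (3): probing a pre-trained model with a natural-language task instruction plus an in-context example, so that outputs under an original and a rewritten (debiased) instruction can be compared. The model name, instruction text, and test sentence are illustrative assumptions and are not taken from the paper or from LINGO itself.

```python
# Minimal sketch, assuming an instruction-following model is available via
# Hugging Face transformers. The model choice and prompt are placeholders.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-small")

# A task defined as a textual instruction plus a small set of examples,
# mirroring the prompt style the abstract describes.
instruction = (
    "Definition: Given a sentence, classify its sentiment as Positive or Negative.\n"
    "Example: 'I loved this movie.' -> Positive\n"
    "Now complete: 'The plot was dull and predictable.' ->"
)

# Re-running this with a reworded (debiased) instruction and comparing the
# outputs hints at whether the model tracks the task or the phrasing.
print(generator(instruction, max_new_tokens=5)[0]["generated_text"])
```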