Building dialogue systems requires a large corpus of annotated dialogues. Such datasets are usually created via crowdsourcing, which is expensive and time-consuming. In this paper, we propose \textsc{Dialogic}, a novel dialogue simulation method based on in-context learning with large language models to automate dataset creation. Seeded with a few annotated dialogues, \textsc{Dialogic} automatically selects in-context examples for demonstration and prompts GPT-3 to generate new dialogues and annotations in a controllable way. Our method can rapidly expand a small set of dialogue data with minimal or zero \textit{human involvement} and \textit{parameter update}, and is thus far more cost-efficient and time-saving than crowdsourcing. Experimental results on the MultiWOZ dataset demonstrate that training a model on the simulated dialogues leads to even better performance than using the same amount of human-generated dialogues in challenging low-resource settings, with as few as 85 dialogues as a seed. When the full training set is available, our method can still serve as an effective data augmentation technique to further improve performance. Human evaluation results show that our simulated dialogues have near-human fluency and annotation accuracy. The code and data are available at \textbf{\url{https://github.com/Leezekun/dialogic}}.
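Below is a minimal sketch of the simulation loop the abstract describes: retrieve a few seed dialogues as demonstrations, then prompt GPT-3 to generate a new annotated dialogue. The sentence-embedding retriever, function names, similarity metric, and prompt format are illustrative assumptions, not the exact \textsc{Dialogic} implementation; see the repository for the full method.
\begin{verbatim}
# Hedged sketch: demonstration selection + GPT-3 generation.
# Assumes a sentence-embedding retriever (sentence-transformers)
# and the legacy OpenAI Completion API; all names are illustrative.
import openai
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_demos(seed_dialogues, goal, k=3):
    """Pick the k seed dialogues whose user goals are most similar
    to the target goal (cosine similarity of embeddings)."""
    goal_emb = encoder.encode(goal, convert_to_tensor=True)
    seed_embs = encoder.encode([d["goal"] for d in seed_dialogues],
                               convert_to_tensor=True)
    scores = util.cos_sim(goal_emb, seed_embs)[0]
    top = scores.topk(min(k, len(seed_dialogues))).indices.tolist()
    return [seed_dialogues[i] for i in top]

def simulate_dialogue(seed_dialogues, goal):
    """Prompt GPT-3 with the retrieved demonstrations to generate
    a new dialogue (with annotations) for the given user goal."""
    demos = select_demos(seed_dialogues, goal)
    prompt = "\n\n".join(d["text"] for d in demos)
    prompt += f"\n\nGoal: {goal}\nDialogue:"
    resp = openai.Completion.create(
        engine="text-davinci-002", prompt=prompt,
        max_tokens=512, temperature=0.7)
    return resp.choices[0].text
\end{verbatim}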