Popular dialog datasets such as MultiWOZ are created by giving crowd workers a natural-language instruction that describes the task to be accomplished. The crowd workers play the roles of a user and an agent and generate dialogs that accomplish tasks such as booking a restaurant table or calling a taxi. In this paper, we present a data creation strategy that uses the pre-trained language model GPT2 to simulate the interaction between crowd workers by creating a user bot and an agent bot. We train the simulators on a small percentage of the actual crowd-generated conversations and their corresponding instructions. We demonstrate that using the simulated data yields significant improvements in low-resource settings on two publicly available datasets: the MultiWOZ dataset and the Persona chat dataset.
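The interaction between the two simulators can be pictured as a simple turn-taking loop. The sketch below is only illustrative and makes several assumptions that the abstract does not state: the names `simulate_dialog`, `toy_user_bot`, `toy_agent_bot`, and `END_TOKEN` are hypothetical, both bots are assumed to receive the instruction and the dialog history, and the rule-based stand-ins take the place of fine-tuned GPT2 models purely so the loop can run without model weights.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]                    # (speaker, utterance)
Bot = Callable[[str, List[Turn]], str]    # (instruction, history) -> next utterance

END_TOKEN = "<end>"  # hypothetical marker a bot emits to close the dialog


def simulate_dialog(instruction: str, user_bot: Bot, agent_bot: Bot,
                    max_turns: int = 20) -> List[Turn]:
    """Alternate user and agent turns until a bot ends the dialog
    or the turn budget is exhausted. The user always speaks first."""
    history: List[Turn] = []
    for i in range(max_turns):
        speaker, bot = ("user", user_bot) if i % 2 == 0 else ("agent", agent_bot)
        utterance = bot(instruction, history)
        if utterance == END_TOKEN:
            break
        history.append((speaker, utterance))
    return history


# Rule-based stand-ins (assumption): in a real setup each of these would
# wrap a language model that generates the next utterance from its context.
def toy_user_bot(instruction: str, history: List[Turn]) -> str:
    user_turns = sum(1 for speaker, _ in history if speaker == "user")
    if user_turns == 0:
        return f"Hi, I need help: {instruction}"
    if user_turns == 1:
        return "That works, thank you."
    return END_TOKEN


def toy_agent_bot(instruction: str, history: List[Turn]) -> str:
    # This toy agent ignores the instruction and only acknowledges the user.
    return "Sure, I can help with that."


if __name__ == "__main__":
    for speaker, utterance in simulate_dialog(
            "book a table for two", toy_user_bot, toy_agent_bot):
        print(f"{speaker}: {utterance}")
```

Each simulated dialog, paired with the instruction that produced it, could then be added to the training data alongside the crowd-generated conversations.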