We demonstrate that large language models can simulate task-oriented dialogues in novel domains, provided only with an API implementation and a list of goals. We show that these simulations can formulate online, automatic metrics that correlate well with human evaluations. Furthermore, by checking whether the User's goals are met, we can use simulation to repeatedly generate training data and improve the quality of the simulations themselves. With no human intervention or domain-specific training data, our simulations bootstrap end-to-end models that achieve a 37\% error reduction in previously unseen domains. By including as few as 32 domain-specific conversations, the bootstrapped models match the performance of a fully-supervised model trained with $10\times$ more data. To our knowledge, this is the first time simulations have been shown to be effective at bootstrapping models without explicitly requiring any domain-specific training data, rule engineering, or humans in the loop.
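The bootstrapping procedure described above can be summarized as a simulate-filter-retrain loop. The following is a minimal sketch under assumed interfaces; the helper names (\texttt{simulate\_dialogue}, \texttt{goals\_met}, \texttt{fine\_tune}) are hypothetical placeholders, not the paper's actual implementation.

\begin{verbatim}
# Hypothetical sketch of the simulation bootstrapping loop.
# simulate_dialogue, goals_met, and fine_tune are assumed helpers,
# not the paper's actual code.
from typing import Callable, List


def bootstrap(simulate_dialogue: Callable, goals_met: Callable,
              fine_tune: Callable, api, goals: List[dict],
              model, rounds: int = 3):
    """Repeatedly simulate dialogues against an API implementation,
    keep only the dialogues whose goals were achieved, and retrain
    the end-to-end model on that filtered data."""
    for _ in range(rounds):
        training_data = []
        for goal in goals:
            dialogue, final_api_state = simulate_dialogue(model, api, goal)
            # Goal completion acts as an automatic quality filter:
            # only successful simulations become training examples.
            if goals_met(final_api_state, goal):
                training_data.append(dialogue)
        # Better models produce better simulations on the next round.
        model = fine_tune(model, training_data)
    return model
\end{verbatim}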