Acquiring training data to improve the robustness of dialog systems can be a painstakingly long process. In this work, we propose a method to reduce the cost and effort of creating new conversational agents by artificially generating more data from existing examples using paraphrase generation. Our proposed approach can kick-start a dialog system with little human effort and bring its performance to a level satisfactory enough to allow actual interactions with real end-users. We experimented with two neural paraphrasing approaches, namely Neural Machine Translation and a Transformer-based seq2seq model. We present the results obtained with two datasets in English and in French: a crowd-sourced public intent classification dataset and our own corporate dialog system dataset. We show that our proposed approach increased the generalization capabilities of the intent classification model on both datasets, reducing the effort required to initialize a new dialog system and helping to deploy this technology at scale within an organization.