Intent classifiers are vital to the successful operation of virtual agent systems. This is especially true of voice-activated systems, where the input can be noisy and user intents are often ambiguous. Before deployment, these classifiers generally lack real-world training data. Active learning is a common approach to labeling large amounts of collected user input, but it requires many hours of manual labeling work. We present the Nearest Neighbors Scores Improvement (NNSI) algorithm for automatic data selection and labeling. NNSI reduces the need for manual labeling by automatically selecting highly ambiguous samples and labeling them with high accuracy. It does this by integrating the classifier's output over a group of semantically similar text samples. The labeled samples can then be added to the training set to improve the accuracy of the classifier. We demonstrate the use of NNSI on two large-scale, real-life voice conversation systems. Our evaluation shows that the method selects and labels useful samples with high accuracy. Adding these new samples to the training data significantly improved the classifiers and reduced error rates by up to 10%.
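The selection-and-labeling idea described above can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so everything here is an assumption made for illustration: the function name `nnsi_sketch`, the use of cosine similarity over precomputed text embeddings, the plain averaging of class scores, and the parameters `k`, `ambiguity_threshold`, and `accept_threshold`. It should not be read as the authors' implementation.

```python
import numpy as np


def nnsi_sketch(embeddings, probs, k=3, ambiguity_threshold=0.6, accept_threshold=0.7):
    """Illustrative only: label ambiguous samples from their neighbors' scores.

    embeddings: (n, d) array of text embeddings (assumed to be precomputed).
    probs:      (n, c) array of classifier class scores for the same samples.
    Returns a dict {sample_index: assigned_class} for samples that were
    ambiguous on their own but confident after pooling with neighbors.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n = embeddings.shape[0]
    k = min(k, n - 1)  # never ask for more neighbors than exist

    # Cosine similarity between all pairs (hypothetical similarity choice).
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbor

    # A sample is treated as ambiguous when its top class score is low.
    ambiguous = np.where(probs.max(axis=1) < ambiguity_threshold)[0]

    labeled = {}
    for i in ambiguous:
        neighbors = np.argsort(sim[i])[::-1][:k]
        # Pool the sample's own scores with those of its k nearest neighbors.
        pooled = (probs[i] + probs[neighbors].sum(axis=0)) / (k + 1)
        if pooled.max() >= accept_threshold:
            labeled[int(i)] = int(pooled.argmax())
    return labeled


# Toy data: sample 0 is ambiguous but sits among confident class-0 samples.
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [0.9, 0.05], [0.0, 1.0], [0.1, 1.0]]
toy_probs = [[0.55, 0.45], [0.95, 0.05], [0.90, 0.10], [0.92, 0.08], [0.10, 0.90], [0.05, 0.95]]
toy_labels = nnsi_sketch(toy_embeddings, toy_probs)
```

In this toy setup the ambiguous sample borrows confidence from its three nearest neighbors and is assigned their class; a sample whose neighbors disagree would stay unlabeled because the pooled score would not reach `accept_threshold`.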