Conversational systems are of primary interest in the AI community. Chatbots are increasingly being deployed to provide round-the-clock support and to increase customer engagement. Many commercial bot-building frameworks follow a standard approach that requires building and training an intent model to recognize user input. Intent models are trained in a supervised setting on a collection of textual utterance and intent label pairs. Gathering a substantial amount of training data with wide coverage of different intents is a bottleneck in the bot-building process. Moreover, labeling hundreds to thousands of conversations with intents is time-consuming and laborious. In this paper, we present an intent discovery framework that generates intent training data from raw conversations in four primary steps: extraction of textual utterances from a conversation using a pre-trained, domain-agnostic dialog act classifier (Data Extraction); automatic clustering of similar user utterances (Clustering); manual annotation of clusters with an intent label (Labeling); and propagation of intent labels to the utterances from the previous step that are not mapped to any cluster (Label Propagation). We introduce ITER-DBSCAN, a novel density-based clustering algorithm for unbalanced data. Subject matter experts (annotators with domain expertise) manually review the clustered user utterances and provide an intent label for each cluster. We conducted user studies to validate the effectiveness of the resulting trained intent model in terms of intent coverage, accuracy, and time saved relative to manual annotation. Although the system was developed for building intent models for conversational systems, the framework can also be used for short-text clustering or as a general labeling framework.
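The abstract does not describe how ITER-DBSCAN works internally, so the following is only an illustrative sketch of one plausible reading of "iterative density-based clustering for unbalanced data", not the authors' algorithm. It assumes that plain DBSCAN is re-run on the points still unclustered, with a looser radius and a smaller minimum cluster size on each pass, so that sparse (rare) intents can still form clusters after dense (frequent) ones are removed. The function names, the `eps_step` and `max_iter` parameters, and the use of Euclidean distance on toy 2-D points (standing in for utterance embeddings) are all assumptions made for illustration.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force neighbourhood query; includes the point itself.
        return [j for j in range(len(points))
                if euclidean(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_samples:  # j is a core point: keep expanding
                queue.extend(nb)
    return labels


def iterative_dbscan(points, eps, min_samples, eps_step=0.5, max_iter=5):
    """Hypothetical iterative wrapper (an assumption, not the paper's method).

    Each pass clusters only the still-unclustered points, then relaxes the
    density requirement by widening eps and lowering min_samples.
    """
    labels = [-1] * len(points)
    next_id = 0
    remaining = list(range(len(points)))
    for _ in range(max_iter):
        if not remaining or min_samples < 2:
            break
        sub = dbscan([points[i] for i in remaining], eps, min_samples)
        for idx, lab in zip(remaining, sub):
            if lab != -1:
                labels[idx] = next_id + lab
        next_id += max(sub, default=-1) + 1
        remaining = [i for i in remaining if labels[i] == -1]
        eps += eps_step
        min_samples -= 1
    return labels


# Toy data: a dense group, a sparse group, and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (10.0, 10.0), (10.6, 10.0), (10.0, 10.6),
          (50.0, 50.0)]
labels = iterative_dbscan(points, eps=0.3, min_samples=4)
```

With these toy settings a single DBSCAN pass treats the sparse group as noise, whereas the relaxed second pass recovers it as its own cluster. Points that remain labeled `-1` after all passes correspond to the unclustered utterances that the Label Propagation step would handle.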