New intent discovery is of great value to natural language processing: it enables a better understanding of user needs and supports friendlier services. However, most existing methods struggle to capture the complicated semantics of discrete text representations when little or no prior knowledge from labeled data is available. To tackle this problem, we propose USNID, a novel framework for unsupervised and semi-supervised new intent discovery, which has three key components. First, it makes full use of unsupervised or semi-supervised data to mine shallow semantic similarity relations and provide well-initialized representations for clustering. Second, it designs a centroid-guided clustering mechanism to address the issue of cluster allocation inconsistency and provide high-quality self-supervised targets for representation learning. Third, it captures high-level semantics in unsupervised or semi-supervised data to discover fine-grained intent-wise clusters by optimizing both cluster-level and instance-level objectives. We also propose an effective method for estimating the number of clusters in open-world scenarios, where the number of new intents is not known beforehand. USNID performs exceptionally well on several intent benchmark datasets, achieving new state-of-the-art results in unsupervised and semi-supervised new intent discovery and demonstrating robust performance with different cluster numbers.
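The cluster allocation inconsistency mentioned above can be illustrated with a small sketch. The abstract does not spell out how the centroid-guided mechanism works, so the code below rests on an assumption: that guidance means starting each clustering round from the centroids of the previous round, so that cluster index j keeps referring to the same group of utterances. The helper names (`farthest_first_init`, `kmeans`, `make_embeddings`) and the toy 2-D "embeddings" are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def farthest_first_init(X, k, start=0):
    """Pick k initial centroids greedily: each new one is the point
    farthest from all centroids chosen so far (deterministic)."""
    chosen = [start]
    min_dist = np.linalg.norm(X - X[start], axis=1)
    while len(chosen) < k:
        nxt = int(min_dist.argmax())
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

def kmeans(X, init_centroids, n_iter=20):
    """Plain Lloyd's k-means from the given centroids; returns (labels, centroids)."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    k = len(centroids)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

def make_embeddings(shift, seed=1):
    """Toy stand-in for sentence embeddings: three well-separated 2-D blobs."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
    return X + shift

# Round t: cluster the current representations.
X_prev = make_embeddings(shift=0.0)
labels_prev, centroids_prev = kmeans(X_prev, farthest_first_init(X_prev, k=3))

# Round t+1: the same utterances, with representations slightly moved by training.
X_curr = make_embeddings(shift=0.2)

# Unguided: a fresh initialization (simulated here by a different starting
# point) finds the same groups but may number them differently.
labels_unguided, _ = kmeans(X_curr, farthest_first_init(X_curr, k=3, start=len(X_curr) - 1))

# Guided (assumed mechanism): start from the previous round's centroids.
labels_guided, _ = kmeans(X_curr, centroids_prev)

print("guided labels unchanged:  ", np.array_equal(labels_prev, labels_guided))
print("unguided labels unchanged:", np.array_equal(labels_prev, labels_unguided))
```

In this toy setting the guided run keeps every pseudo-label stable across rounds, while the unguided run permutes the cluster indices even though the grouping itself is the same. Stable indices matter when the cluster assignments are reused as self-supervised classification targets from one round to the next.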