Topic taxonomies, which represent the latent topic (or category) structure of document collections, provide valuable knowledge about content in many applications such as web search and information filtering. Recently, several unsupervised methods have been developed to automatically construct a topic taxonomy from a text corpus, but it is challenging to generate the desired taxonomy without any prior knowledge. In this paper, we study how to leverage partial (or incomplete) information about the topic structure as guidance for discovering the complete topic taxonomy. We propose a novel framework for topic taxonomy completion, named TaxoCom, which recursively expands the topic taxonomy by discovering novel sub-topic clusters of terms and documents. To effectively identify novel topics within a hierarchical topic structure, TaxoCom devises its embedding and clustering techniques to be closely linked with each other: (i) locally discriminative embedding optimizes the text embedding space to be discriminative among known (i.e., given) sub-topics, and (ii) novelty adaptive clustering assigns each term to either one of the known sub-topics or one of the novel sub-topics. Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates a high-quality topic taxonomy in terms of term coherency and topic coverage but also outperforms all other baselines on a downstream task.
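To make the idea of assigning terms to either known or novel sub-topics concrete, the following is a minimal sketch, not TaxoCom's actual procedure. It assumes that each known sub-topic is summarized by a center vector in the embedding space and that a term is treated as novel when its best cosine similarity to any known center falls below a fixed threshold. The function name `assign_terms`, the `threshold` parameter, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import numpy as np

def assign_terms(term_vecs, center_vecs, threshold=0.5):
    """Return the index of the closest known sub-topic per term, or -1 if novel.

    Assumption (not from the paper): novelty is decided by a fixed
    cosine-similarity threshold against known sub-topic centers.
    """
    # L2-normalize rows so that dot products equal cosine similarities.
    t = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    c = center_vecs / np.linalg.norm(center_vecs, axis=1, keepdims=True)
    sims = t @ c.T                      # shape: (num_terms, num_known_subtopics)
    best = sims.argmax(axis=1)          # closest known sub-topic per term
    best_sim = sims.max(axis=1)         # similarity to that sub-topic
    # Terms that are not close enough to any known sub-topic are marked novel.
    return np.where(best_sim >= threshold, best, -1)

# Toy data: two known sub-topic centers and three term embeddings.
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
terms = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 1.0]])
labels = assign_terms(terms, centers, threshold=0.5)
print(labels)
```

In this toy setting the first two terms fall under the two known sub-topics and the third is flagged as novel. Terms flagged as novel would then be grouped into new sub-topic clusters, and the whole step would be repeated at each node to expand the taxonomy recursively.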