Automatic construction of a taxonomy supports many applications in e-commerce, web search, and question answering. Existing taxonomy expansion or completion methods assume that new concepts have been accurately extracted and their embedding vectors learned from the text corpus. However, one critical and fundamental challenge in fixing the incompleteness of taxonomies is the incompleteness of the extracted concepts, especially for those whose names have multiple words and consequently low frequency in the corpus. To resolve the limitations of extraction-based methods, we propose GenTaxo to enhance taxonomy completion by identifying positions in existing taxonomies that need new concepts and then generating appropriate concept names. Instead of relying on the corpus for concept embeddings, GenTaxo learns the contextual embeddings from their surrounding graph-based and language-based relational information, and leverages the corpus for pre-training a concept name generator. Experimental results demonstrate that GenTaxo improves the completeness of taxonomies over existing methods.