Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, which leads to the storage of redundant and ambiguous facts. Recent research has posed canonicalization of Open KBs as clustering over manually defined feature spaces. Manual feature engineering is expensive and often sub-optimal. To overcome this challenge, we propose Canonicalization using Embeddings and Side Information (CESI), a novel approach that performs canonicalization over learned embeddings of Open KBs. CESI extends recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner. Through extensive experiments on multiple real-world datasets, we demonstrate CESI's effectiveness.
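To make the idea of canonicalization as clustering over embeddings concrete, the following is a minimal sketch, not the CESI implementation. It assumes that each noun phrase already has an embedding vector (the three-dimensional vectors below are invented for illustration, not learned), and it assumes average-linkage agglomerative clustering with a cosine-similarity threshold, a choice the abstract does not specify. The names `cosine`, `canonicalize`, `toy_embeddings`, and the threshold value `0.9` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def canonicalize(embeddings, threshold=0.9):
    """Group phrases whose embeddings are close to each other.

    Assumption: average-linkage agglomerative clustering that stops
    once the most similar pair of clusters falls below `threshold`.
    `embeddings` maps each phrase to its vector.
    """
    # Start with every phrase in its own cluster.
    clusters = [[phrase] for phrase in embeddings]

    def average_similarity(c1, c2):
        sims = [cosine(embeddings[a], embeddings[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = average_similarity(clusters[i], clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        if best[0] < threshold:
            break  # No remaining pair is similar enough to merge.
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid.
    return clusters


# Invented toy vectors standing in for learned NP embeddings.
toy_embeddings = {
    "Barack Obama": [0.90, 0.10, 0.00],
    "Obama": [0.88, 0.12, 0.02],
    "President Obama": [0.91, 0.08, 0.01],
    "New York City": [0.05, 0.95, 0.10],
    "NYC": [0.04, 0.93, 0.12],
}

clusters = canonicalize(toy_embeddings)
for cluster in clusters:
    print(cluster)
```

With these toy vectors, the three Obama mentions end up in one cluster and the two New York City mentions in another; each cluster can then be represented by a single canonical phrase. In a real system the vectors would come from a learned KB embedding model rather than being written by hand.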