Open information extraction (OIE) methods extract large numbers of OIE triples <noun phrase, relation phrase, noun phrase> from unstructured text, and these triples make up large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. OKB canonicalization is the task of clustering synonymous noun phrases and relation phrases into the same group and assigning each group a unique identifier. Two views of knowledge (a fact view based on the fact triple and a context view based on the fact triple's source context) have been found to provide complementary information that is vital to this task. However, existing works have so far leveraged these two views in isolation. In this paper, we propose CMVC, a novel unsupervised framework that leverages the two views jointly to canonicalize OKBs without the need for manually annotated labels. To achieve this goal, we propose a multi-view CH K-Means clustering algorithm that mutually reinforces the clustering of the view-specific embeddings learned from each view by taking their different clustering qualities into account. To further enhance canonicalization performance, we propose a training data optimization strategy that addresses data quantity and data quality in each view and iteratively refines the learned view-specific embeddings. Additionally, we propose a Log-Jump algorithm that predicts the optimal number of clusters in a data-driven way without requiring any labels. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.
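To make the canonicalization task concrete, the following minimal sketch shows only the final step: once synonymous phrases have been grouped, each group receives one identifier and the triples are rewritten with those identifiers. It does not implement CMVC, the multi-view CH K-Means algorithm, or Log-Jump. The example triples, the cluster assignments, the `NP`/`RP` identifier prefixes, and the helper names `build_id_map` and `canonicalize` are all hypothetical and chosen purely for illustration.

```python
# Hypothetical OIE triples: (noun phrase, relation phrase, noun phrase).
triples = [
    ("NYC", "is located in", "USA"),
    ("New York City", "lies in", "United States"),
    ("Barack Obama", "was born in", "Hawaii"),
]

# Hypothetical clusters of synonymous phrases, standing in for the output
# of whatever clustering step is used (assumed here, not computed).
np_clusters = [
    {"NYC", "New York City"},
    {"USA", "United States"},
    {"Barack Obama"},
    {"Hawaii"},
]
rp_clusters = [
    {"is located in", "lies in"},
    {"was born in"},
]


def build_id_map(clusters, prefix):
    """Map every phrase to the identifier of its cluster, e.g. 'NP0'."""
    return {
        phrase: f"{prefix}{index}"
        for index, cluster in enumerate(clusters)
        for phrase in cluster
    }


def canonicalize(triples, np_ids, rp_ids):
    """Rewrite triples with cluster identifiers and drop duplicate facts.

    A dict is used as an ordered set so that first-seen order is kept.
    """
    unique = {}
    for subject, relation, obj in triples:
        key = (np_ids[subject], rp_ids[relation], np_ids[obj])
        unique.setdefault(key, None)
    return list(unique)


np_ids = build_id_map(np_clusters, "NP")
rp_ids = build_id_map(rp_clusters, "RP")
canonical = canonicalize(triples, np_ids, rp_ids)

for fact in canonical:
    print(fact)
```

In this toy input the first two triples state the same fact with different surface forms, so they collapse into a single canonical triple, which is the redundancy that canonicalization is meant to remove.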