Link prediction is a crucial problem on graph-structured data. Following the recent success of graph neural networks (GNNs), a variety of GNN-based models have been proposed to tackle the link prediction task. Specifically, GNNs use the message passing paradigm to compute node representations, which depends on the link connectivity of the graph. In a link prediction task, however, the links in the training set are always present in the graph, while those in the testing set have not yet formed. This creates a discrepancy in the connectivity pattern around the target link and biases the learned representations. The result is a dataset shift problem that degrades model performance. In this paper, we first identify the dataset shift problem in the link prediction task and provide theoretical analyses of how existing link prediction methods are vulnerable to it. We then propose FakeEdge, a model-agnostic technique that addresses the problem by mitigating the graph topological gap between the training and testing sets. Extensive experiments demonstrate the applicability and superiority of FakeEdge on multiple datasets across various domains.
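To make the connectivity discrepancy concrete, the following is a minimal sketch in plain Python and not the paper's implementation. It assumes a toy graph, scalar node features, and a single round of mean aggregation as a stand-in for message passing. The helper names (`build_graph`, `mean_aggregate`, `with_focal_edge`, `without_focal_edge`) are hypothetical. The sketch shows that the representation of an endpoint changes depending on whether the target link is present, and that treating the target link the same way in both phases (always adding it or always removing it) is one assumed way to close the gap.

```python
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[int, Set[int]]


def build_graph(edges: Iterable[Tuple[int, int]]) -> Graph:
    """Build an undirected adjacency map from an edge list."""
    adj: Graph = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def mean_aggregate(adj: Graph, feats: Dict[int, float], node: int) -> float:
    """One round of mean aggregation over a node's neighbors (toy message passing)."""
    neighbors = adj.get(node, set())
    if not neighbors:
        return 0.0
    return sum(feats[n] for n in neighbors) / len(neighbors)


def with_focal_edge(adj: Graph, u: int, v: int) -> Graph:
    """Return a copy of the graph in which the target link (u, v) is present."""
    out = {n: set(nb) for n, nb in adj.items()}
    out.setdefault(u, set()).add(v)
    out.setdefault(v, set()).add(u)
    return out


def without_focal_edge(adj: Graph, u: int, v: int) -> Graph:
    """Return a copy of the graph in which the target link (u, v) is absent."""
    out = {n: set(nb) for n, nb in adj.items()}
    out.setdefault(u, set()).discard(v)
    out.setdefault(v, set()).discard(u)
    return out


# Toy data (assumed for illustration): the target link is (0, 1).
feats = {0: 1.0, 1: 2.0, 2: 4.0, 3: 8.0}
train_graph = build_graph([(0, 1), (0, 2), (1, 3)])  # target link already observed
test_graph = build_graph([(0, 2), (1, 3)])           # target link not yet formed

# Without any correction, node 0 gets different representations in the two phases.
h_train = mean_aggregate(train_graph, feats, 0)  # neighbors {1, 2} -> 3.0
h_test = mean_aggregate(test_graph, feats, 0)    # neighbors {2}    -> 4.0

# Treating the target link identically in both phases removes the discrepancy.
h_train_minus = mean_aggregate(without_focal_edge(train_graph, 0, 1), feats, 0)
h_test_minus = mean_aggregate(without_focal_edge(test_graph, 0, 1), feats, 0)
h_train_plus = mean_aggregate(with_focal_edge(train_graph, 0, 1), feats, 0)
h_test_plus = mean_aggregate(with_focal_edge(test_graph, 0, 1), feats, 0)

print(h_train, h_test)              # differ: the shift described above
print(h_train_minus, h_test_minus)  # equal after always removing the link
print(h_train_plus, h_test_plus)    # equal after always adding the link
```

In this toy setting the same positive pair yields a different aggregated value in training and testing unless the target link is handled consistently, which is the kind of topological gap the abstract refers to.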