Low-dimensional embeddings of nodes in large graphs have proved extremely useful in a variety of prediction tasks, from content recommendation to identifying protein functions. However, most existing approaches require that all nodes in the graph are present during training of the embeddings; these previous approaches are inherently transductive and do not naturally generalize to unseen nodes. Here we present GraphSAGE, a general, inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, we learn a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Our algorithm outperforms strong baselines on three inductive node-classification benchmarks: we classify the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and we show that our algorithm generalizes to completely unseen graphs using a multi-graph dataset of protein-protein interactions.
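To make the sample-and-aggregate idea concrete, the sketch below shows one way an inductive embedding step could look: each node's embedding is computed from its own features and the mean of a fixed-size sample of neighbor features, using learned weight matrices. This is a minimal illustrative sketch, not the authors' implementation; the mean aggregator, the weight names (`W_self`, `W_neigh`), the two-layer depth, and the sample size are all assumptions made for the example.

```python
# Minimal sketch (not the paper's reference code) of generating embeddings by
# sampling and aggregating features from a node's local neighborhood.
# All weight matrices, the mean aggregator, and the sample size are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(adj, node, num_samples):
    """Uniformly sample a fixed-size neighbor set (with replacement if needed)."""
    neigh = adj[node]
    if len(neigh) == 0:
        return [node]  # fall back to a self-loop for isolated nodes
    replace = len(neigh) < num_samples
    return rng.choice(neigh, size=num_samples, replace=replace).tolist()

def sample_and_aggregate(features, adj, W_self, W_neigh, num_samples):
    """One layer: combine a node's features with the mean of sampled neighbor features."""
    out = []
    for v in range(len(adj)):
        neigh_feats = features[sample_neighbors(adj, v, num_samples)]
        h = features[v] @ W_self + neigh_feats.mean(axis=0) @ W_neigh
        out.append(np.maximum(h, 0.0))  # ReLU nonlinearity
    out = np.stack(out)
    # L2-normalize each embedding
    return out / np.maximum(np.linalg.norm(out, axis=1, keepdims=True), 1e-12)

# Toy graph: 4 nodes with 3-dimensional input features (e.g., text attributes).
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
features = rng.normal(size=(4, 3))
W1_self, W1_neigh = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
W2_self, W2_neigh = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

h1 = sample_and_aggregate(features, adj, W1_self, W1_neigh, num_samples=2)
h2 = sample_and_aggregate(h1, adj, W2_self, W2_neigh, num_samples=2)
print(h2.shape)  # (4, 8): because the weights, not per-node embeddings, are learned,
                 # the same function can be applied to nodes unseen during training.
```

Because the trainable parameters are the shared weight matrices rather than a lookup table of per-node embeddings, the same learned function can be applied to nodes or entire graphs that were never seen during training, which is what makes the approach inductive rather than transductive.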